Render action randomly fails when exporting to PDF: ERROR: Couldn't find open server #45

Open
rogerbramon opened this issue Sep 13, 2022 · 37 comments
Assignees
Labels
bug Something isn't working

Comments

@rogerbramon

rogerbramon commented Sep 13, 2022

Occasionally, quarto-actions/render@v2 fails with ERROR: Couldn't find open server. I've only experienced this problem when rendering to PDF; HTML output seems fine.

image

Has anyone experienced it?

Tested on:
Ubuntu 20.04 and 22.04
Quarto: latest version (1.1.189) and a pinned one (1.0.37)

Workflow example:

name: Generate PDF
on:
  pull_request:
    types: [opened, synchronize, reopened, labeled]

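jobs:
  generate-pdf:
    # runner added so the example is complete; the issue was seen on Ubuntu 20.04 and 22.04
    runs-on: ubuntu-22.04
    steps:
      # a checkout step is assumed here so that the .qmd file is available
      - uses: actions/checkout@v3

      - name: Setup Quarto
        uses: quarto-dev/quarto-actions/setup@v2
        with:
          tinytex: true

      - name: Quarto check
        run: |
          quarto check --log-level info

      - name: Render Quarto Project
        uses: quarto-dev/quarto-actions/render@v2
        with:
          to: pdf
          path: input/P123-qmd-file.qmd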

This is the output of all the Quarto actions in the workflow when it fails:

quarto-render-output

@cderv
Collaborator

cderv commented Sep 13, 2022

This error does not ring a bell for me. @cscheid, have you already encountered this error?

@rogerbramon, can you share more about the .qmd content where this error happens? Do you have a link to the project?

@cscheid
Contributor

cscheid commented Sep 13, 2022

This is the first time I've seen this error. Thank you for the report!

@rogerbramon
Author

@rogerbramon, can you share more about the .qmd content where this error happens? Do you have a link to the project?

Content is basically pure Markdown and a mermaid diagram using ```{mermaid} format.
Unfortunately, it's a private repo, but I could try to create a test one and try to reproduce it there.

@cderv
Collaborator

cderv commented Sep 13, 2022

And a mermaid diagram using ```{mermaid} format.

I think this could come from that content. For PDF output, the diagram needs to be rendered to an image using Chrome or Chromium, and it seems this is what causes the issue (with the server not being found).

Google Chrome should be installed on GHA runners, but it seems something is not working as expected.

I could try to create a test one and try to reproduce it there.

If you create a test repo that reproduces this, that will be really useful for investigating.

@rogerbramon
Author

Here is the test repo. I was able to reproduce the problem on the 3rd attempt (same code, just re-running the jobs).

Successful attempt: https://github.com/rogerbramon/test-quarto/actions/runs/3045981181/attempts/2
Failure attempt: https://github.com/rogerbramon/test-quarto/actions/runs/3045981181

@cderv
Collaborator

cderv commented Sep 13, 2022

Thanks a lot!

I was able to reproduce the problem at the 3rd attempt (same code, just re-run jobs).

It is not a good sign that this works on some runs and not on others... It looks like an internal issue in the runners when connecting to headless Chrome. 🤔

@rogerbramon
Author

Could it be that Google Chrome sometimes takes a bit longer to spin up? If I understand correctly, you have a timeout of only 3 seconds on waitForServer.

https://github.com/quarto-dev/quarto-cli/blob/ed57d10eba33c34b5e1df2c22ed372a7e28da5d0/src/core/cri/cri.ts#L80-L90

I tried adding a preceding step that launches headless Chrome, and with it the problem no longer occurs. That's why the timeout is my guess... Any thoughts?

    - name: Check chrome
      run: |
        echo $(which google-chrome)
        $(which google-chrome) --headless --single-process https://www.chromestatus.com

@cscheid
Contributor

cscheid commented Sep 13, 2022

Somewhat disconcertingly, the only hit on Google for that error is another open Quarto issue: quarto-dev/quarto-cli#1822

@cderv
Collaborator

cderv commented Sep 13, 2022

Oh interesting. Thanks @cscheid !

I am still trying to find a clean environment to reproduce the issue, as it is now working in my WSL after I installed the google-chrome deb package. I just don't know what changed. Before that I could reproduce it every time, though...

Here it is also an error on Ubuntu in GHA. Maybe this will help us find the reason, if I manage to debug this in the workflow directly.

@rogerbramon
Author

@cderv, you may use the action-tmate action to get access to the runner system via SSH and debug there.

@rogerbramon
Author

rogerbramon commented Sep 19, 2022

Just to add a bit of info, I've also experienced this problem on macOS a couple of times. But it's much more difficult to reproduce.

Not sure if this adds noise or is a different issue, but besides Couldn't find open server, on macOS I sometimes also get:

  • ERROR: No inspectable targets

It's not easy to reproduce, but the error is more likely when no instance of Google Chrome is open.
HTH

@cderv
Collaborator

cderv commented Sep 19, 2022

So I managed to reproduce quarto-dev/quarto-cli#1822 locally. The issue there is systematic, not occasional, and is probably due to missing system requirements for the Chromium that we install from puppeteer. I believe GitHub Actions has them all, so that is not the issue here, where the failure is only occasional.

Somewhat disconcertingly, the only hit on google for that error is another open issue on quarto

By the way, this is the only hit because this is an error thrown by Quarto itself:
https://github.com/quarto-dev/quarto-cli/blob/ed57d10eba33c34b5e1df2c22ed372a7e28da5d0/src/core/cri/cri.ts#L91

When https://localhost:<port> can't be reached within the timeout limit, we throw an error. This can happen if there is an issue with launching Chrome or with Chrome while it is running. Currently we throw only the bare error, without further details.

I have made a PR in Quarto so that we get more information if the error comes from a failed attempt to run Chrome.
See quarto-dev/quarto-cli#2499

If I get it right, you have a timeout of only 3 seconds on waitForServer.

We could indeed look into the timeout. However, I think we already attempt 60 times with 50 ms between attempts, which is where the 3 seconds come from.

Maybe quarto-dev/quarto-cli#2499 will reveal an issue with running Chrome itself rather than with the timeout. It should be merged soon and available in a pre-release.
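
For readers who want to experiment with the timeout hypothesis outside of Quarto, the general "poll until the server answers or give up" pattern can be sketched in shell. This is only an illustration of the pattern, not Quarto's waitForServer code: the `poll` helper, the curl probe and port 9222 in the usage comment are assumptions made for the example.

```shell
# poll ATTEMPTS DELAY CMD [ARGS...]
# Run CMD until it succeeds, sleeping DELAY seconds between tries.
# Returns 0 on the first success, 1 once ATTEMPTS tries have failed.
poll() {
  poll_attempts=$1
  poll_delay=$2
  shift 2
  poll_try=1
  while [ "$poll_try" -le "$poll_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$poll_delay"
    poll_try=$((poll_try + 1))
  done
  return 1
}

# Hypothetical usage: give a headless Chrome started with
# --remote-debugging-port=9222 up to 30 s (60 tries, 0.5 s apart) to answer.
# poll 60 0.5 curl --silent --fail "http://localhost:9222/json/list"
```

Raising the number of attempts or the delay in such a probe is a cheap way to check whether a slow browser start-up, rather than a failed launch, is behind the error.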

ERROR: No inspectable targets

@rogerbramon this error is also thrown by Quarto, when the targets are somehow not valid for the Chrome remote interface. It means Chrome was launched correctly and Quarto connected correctly to the remote debugging port, but something else is wrong in the interaction with the headless browser. If you have an example to share where this happens, that would help.

@cderv cderv self-assigned this Sep 19, 2022
@rogerbramon
Author

Thanks @cderv for your time. I tried the latest pre-release version (1.2.134) on the testing repo and, unfortunately, I've not been able to get the error because the render step now hangs (3 of 7 attempts). I had to cancel the workflow after 5min.

https://github.com/rogerbramon/test-quarto/actions/runs/3088361081/jobs/4994835499

Regarding No inspectable targets, I have no further info to share. It occurred when running the same render command on macOS.

@cderv
Collaborator

cderv commented Sep 20, 2022

That is bad. It is probably a different issue from quarto-dev/quarto-cli#1822. I'll look into that next.

@rogerbramon
Author

Hi @cderv, not sure if it's related to that but with version 1.3 sometimes it hangs forever.

@cderv
Collaborator

cderv commented May 11, 2023

oh no... sorry about that.

We now print a stack trace by default when there is an error. Do you have more of the error to share, or a link to an action log?
It is possibly related to Chrome and our usage of puppeteer within the GitHub Actions context. A log could probably confirm that.

Is this happening every time?

@rogerbramon
Author

It happens randomly like before, but now it doesn't fail; it keeps running and you need to cancel it. I'm not able to see any log.

Using the same test repo, I just updated to use the latest version. You can see that Attempt #1 and Attempt#3 got stuck and I had to cancel them, and Attempt#2 succeeded.

I enabled debug logging on the latest attempt, but it doesn't give much insight.
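
Since the render step now hangs instead of failing, one way to avoid cancelling runs by hand is to bound each attempt and retry. The following is a minimal sketch, assuming GNU `timeout` is available on the runner; the `run_with_retry` name and the limits in the usage comment are made up for illustration, and it does not address the underlying cause.

```shell
# run_with_retry ATTEMPTS LIMIT CMD [ARGS...]
# Run CMD under GNU `timeout`; if it fails or is stopped after LIMIT
# (for example "10m"), try again, up to ATTEMPTS times in total.
run_with_retry() {
  rwr_attempts=$1
  rwr_limit=$2
  shift 2
  rwr_try=1
  while [ "$rwr_try" -le "$rwr_attempts" ]; do
    if timeout "$rwr_limit" "$@"; then
      return 0
    fi
    echo "attempt $rwr_try of $rwr_attempts failed or timed out" >&2
    rwr_try=$((rwr_try + 1))
  done
  return 1
}

# Hypothetical usage in a `run:` step (the path is the one from the first post):
# run_with_retry 3 10m quarto render input/P123-qmd-file.qmd --to pdf
```

Because the failure is random, a stuck attempt is likely to succeed on a retry, and the job fails within a predictable time if it does not.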

Thanks.

@cderv cderv added the bug Something isn't working label May 12, 2023
@cderv
Collaborator

cderv commented May 12, 2023

That will not be easy to debug. I am surprised we get no log at all, no trace. Thanks for the report again.

I don't think that will change anything, because the cause is probably not the action itself but something on the GHA runners when running quarto render.

Sorry for the inconvenience, I will try to investigate but not sure where to look exactly.

@rogerbramon
Author

Thanks @cderv, what I've noticed is that this issue seems to disappear when adding a step that calls chromium before running quarto:

    - name: Check chromium
      run: |
        echo $(which chromium-browser)
        $(which chromium-browser) --headless https://www.chromestatus.com

With this step, the action ran successfully for 10 times in a row. However, as soon as I removed it, it started freezing again.
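
A slightly more defensive variant of this warm-up can be sketched as a shell snippet that picks whichever browser binary exists and caps the warm-up time, so the warm-up cannot block the job. This is an assumption-laden sketch: the `find_browser` and `warm_up_browser` names, the candidate list, the 30 s limit and the `about:blank` target are all illustrative, and GNU `timeout` is assumed to be installed.

```shell
# Print the path of the first command from the argument list found on PATH.
find_browser() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      command -v "$candidate"
      return 0
    fi
  done
  return 1
}

# Start the browser once, but never let the warm-up itself block the job.
warm_up_browser() {
  browser=$(find_browser google-chrome chromium-browser chromium) || {
    echo "no Chrome/Chromium found, skipping warm-up" >&2
    return 0
  }
  # `timeout` is GNU coreutils; 30 s is an arbitrary, assumed limit.
  timeout 30 "$browser" --headless --disable-gpu --dump-dom about:blank >/dev/null 2>&1 ||
    echo "warm-up did not finish cleanly, continuing anyway" >&2
  return 0
}

# Hypothetical usage in a `run:` step placed before `quarto render`:
# warm_up_browser
```

The warm-up always returns success on purpose: it is a best-effort step, and a failed warm-up should not mask the result of the render that follows.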

HTH

@cderv
Collaborator

cderv commented May 25, 2023

Oh, a really interesting debugging step. 🤔 Can you help me understand your testing further?

a step that calls chromium before running quarto

Does this step only call Chromium and close it? Or do you think it leaves it open for the next step?

I wonder what effect this command could have. I don't know if you have the environment needed, but did you observe the freezing on a non-GHA Ubuntu machine?

I'll read through the code in Quarto with this new information in mind. @cscheid if you have ideas, feel free to chime in.

@cscheid
Contributor

cscheid commented May 25, 2023

@rogerbramon That is super fascinating. I wonder if that check causes some deferred library loading that takes a while, and prevents an eventual race condition. I would be happy keeping the check in our actions. In fact, I wonder if this check would also fix some of the hard-to-track bugs we've been seeing with chromium in Linux on the quarto-cli repo!

@cderv What do you think about simply adding that action to our render step?

@cderv
Collaborator

cderv commented May 25, 2023

Oh, great idea! Not sure why it did not occur to me 😅

I guess it costs nothing to do it, just in case someone needs Chrome with Quarto. Sounds good!

@cscheid
Contributor

cscheid commented May 25, 2023

I guess it cost nothing to do it just in case someone needs chrome with Quarto. Sounds good !

It does cost the time to run it, but that really shouldn't be much. Let's try that!

@cderv
Collaborator

cderv commented May 26, 2023

I have made a new v2 release of the action that includes this fix, for Ubuntu only right now. The macOS and Windows runners need some adjustment.

  • Adapt for MacOs runner
  • Adapt for Windows runner
  • Add to quarto-publish or even setup-quarto instead

@rogerbramon
Author

rogerbramon commented May 26, 2023

Thank you guys for looking into this. I don't have more info at this moment to answer your questions. I'll need to invest more time. So far, I've only experienced this issue on GHA. Locally, I use Mac. I can try to use devcontainers or codespaces to see if this issue can be reproduced.

@cderv What do you think about simply adding that action to our render step?

Would it make sense to add this workaround to the Setup action instead of the render one? I'm saying this because the Publish action also renders by default, and sometimes, depending on the parameters you need, you have to use a shell step to run quarto render directly.

@cderv
Collaborator

cderv commented May 26, 2023

Would make sense to add this workaround on the Setup action instead of the render one? I'm saying this because the Publish action also renders by default and sometimes, depending on the parameters you need, you have to use a shell step to directly run quarto render.

Oh indeed... It would probably make sense to add that to the setup action instead so that it covers render and publish.
Otherwise I can add it to publish also.

Thanks for the feedback !

@rogerbramon
Author

In my case, I don't use the render action because I need to pass extra parameters to the render command, which is not possible using the action, so I would need to add the step manually if it's not included in Setup.

Not sure if anyone is using the render or publish action without the setup, but in my case I find it very useful. I don't want to force anything, just explaining my use case.

@cderv
Collaborator

cderv commented May 26, 2023

I don't use the render action because I need to define extra parameters to the render command which is not possible using the action

Maybe we should allow that too ?

Not use if anyone is using the render or publish action without the setup, but in my case I find it very useful. I don't want to force anything, just explaining my use case.

That makes sense. We could probably expect someone using the publish or render action to have used the setup action in the first place. Maybe we should document this chromium trick also

@rogerbramon
Author

I don't use the render action because I need to define extra parameters to the render command which is not possible using the action

Maybe we should allow that too ?

Would be handy to have a free parameter to add whatever parameters you need.

@rogerbramon
Author

Hey, just to let you know that I've been using this workaround for a while, but unfortunately we still experience the problem randomly.

@cderv
Collaborator

cderv commented Jun 19, 2023

but unfortunately we still experience the problem randomly.

So even for ubuntu this is not enough ?

It seems really related to GHA runners. We are looking at new chrome development that may improve things for Quarto https://developer.chrome.com/blog/chrome-for-testing/

@sebffischer

I just wanted to report that we were experiencing something similar in the mlr3 book, i.e. the action just times out randomly.

When rendering with the --execute-debug flag, it turned out that the postprocessing failed for one of the chapters in our book. While other chapters ended with:

  |............................................| 100%                     	 
                                                                                                               	 
output file: preprocessing.knit.md

[knitr engine]: writing results
[knitr engine]: exiting
[knitr engine]: postprocess
[knitr engine]: writing results
[knitr engine]: exiting

the chapter that uses mermaid ended with

 |                                              	 
  |.............................................| 100%                    	 
                                                                                                              	 
output file: advanced_technical_aspects_of_mlr3.knit.md

[knitr engine]: writing results
[knitr engine]: exiting
Error: The operation was canceled.

One CI run where this happened can be found here.

We are rendering to both html and pdf.
This also happened randomly (around 50% of the runs) so it was not so easy to track this down as we also don't get any log output really.
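
The comparison described above (chapters whose log contains a postprocess line versus the one that stops early) can be automated on a saved log. The following is a rough sketch; the `find_unfinished_chapters` name is made up, and it assumes the log has the same "output file:" and "[knitr engine]: postprocess" lines as the excerpts above.

```shell
# Print every "output file:" entry that is not followed by a
# "[knitr engine]: postprocess" line before the next entry (or end of log).
# Reads the files given as arguments, or standard input if there are none.
find_unfinished_chapters() {
  awk '
    {
      if (match($0, /output file: /)) {
        if (current != "" && !seen) print current
        current = substr($0, RSTART + RLENGTH)
        seen = 0
      }
    }
    /\[knitr engine\]: postprocess/ { seen = 1 }
    END { if (current != "" && !seen) print current }
  ' "$@"
}

# Hypothetical usage on a log saved from the CI run:
# find_unfinished_chapters render.log
```

On a log like the one in this comment, this would list only the chapter whose rendering stopped before postprocessing.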

What we also observed (not with 100% certainty, as the error is stochastic) is that rendering to html and pdf in two separate CI steps (quarto render book/ --cache-refresh --to html and then quarto render book/ --cache-refresh --to pdf) made the error disappear. When we removed the mermaid diagram, the error also seemed to disappear.
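
Rendering one format at a time, as described above, can be wrapped in a small helper so that the same commands run in a fixed order and stop at the first failure. This is a sketch only: the `render_formats` name and the `QUARTO` override variable are assumptions for illustration, while the `render`, `--cache-refresh` and `--to` arguments are the ones quoted above.

```shell
# render_formats DIR FORMAT [FORMAT...]
# Render DIR once per format, one after the other, stopping at the first failure.
# QUARTO can be set to another command (for example `echo`) for a dry run.
render_formats() {
  rf_dir=$1
  shift
  for rf_format in "$@"; do
    "${QUARTO:-quarto}" render "$rf_dir" --cache-refresh --to "$rf_format" || return 1
  done
}

# Hypothetical usage, equivalent to the two separate commands above:
# render_formats book/ html pdf
```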

We also included the installation of chromium in our CI:

      - name: Install headless chromium
        run: quarto tools install chromium

@cderv
Collaborator

cderv commented Nov 7, 2023

Thanks a lot for the detailed explanation.

Error: The operation was canceled.

There was a hang in the CI and it was cancelled because The job running on runner GitHub Actions 7 has exceeded the maximum execution time of 360 minutes.

I am not surprised that the chapter with mermaid is the issue. We think this issue is related to using Chrome on GHA runners, but we really don't know what is happening or how to solve it.

It seems starting Chromium beforehand helped for a while (#45 (comment)), but the error still appears.

Maybe the version installed by quarto tools install chromium is not good enough for this usage on CI. I know about a specific action to install Chrome: https://github.com/browser-actions/setup-chrome

It could be worth a try, to see if you still encounter the issue.

@sebffischer

Thanks for the quick response! For now I will just render the mermaid diagram once and include it as a figure, until it is clear what the bug is and how it can be solved.

What was also surprising, in retrospect, is that the --execute-debug flag gives information about the knitr engine.
When running quarto help, this flag is explained as

 --execute-debug                     - Show debug output for Jupyter kernel.  

In our particular case the additional information about the knitr engine allowed us to track down the bug.

@cderv
Collaborator

cderv commented Nov 7, 2023

is why the --execute-debug flag gives information about the knitr engine?

This is mainly a documentation issue. This flag sets a debug variable to TRUE, which also applies to knitr rendering.
Added a while back: quarto-dev/quarto-cli@9c86e98

@sebffischer

Thanks! I have created an issue about this: quarto-dev/quarto-cli#7502

sebffischer added a commit to mlr-org/mlr3book that referenced this issue Nov 7, 2023
For a while, we have suffered from random timeouts when
rendering the book. The log-output from GHA during the
"Render book" stage just ended with:

|.............................................| 100%
output file: advanced_technical_aspects_of_mlr3.knit.md

Error: The operation was canceled.

after hitting the maximum runtime of 6 hours.
(the success rate was around 50/50).

When rendering with the --execute-debug flag
more log-output was given.
The log output at the end of rendering the technical chapter (pdf)
was:

 |
  |.............................................| 100%

output file: advanced_technical_aspects_of_mlr3.knit.md

[knitr engine]: writing results
[knitr engine]: exiting
Error: The operation was canceled.

for the other chapters when rendering to pdf, the output was

  |............................................| 100%

output file: preprocessing.knit.md

[knitr engine]: writing results
[knitr engine]: exiting
[knitr engine]: postprocess
[knitr engine]: writing results
[knitr engine]: exiting

--> something with the postprocessing went wrong and the bug was
identified to be in the technical chapter

The problem was **NOT** the large-scale benchmarking chapter

quarto-dev/quarto-actions#45
sebffischer added a commit to mlr-org/mlr3book that referenced this issue Nov 7, 2023
quarto-dev/quarto-actions#45 (comment)
cderv added a commit that referenced this issue Aug 29, 2024
we need to deal with initial issue considering new chrome update
#45
cderv added a commit to quarto-dev/quarto-cli that referenced this issue Aug 29, 2024
@cderv
Collaborator

cderv commented Aug 29, 2024

FWIW, a Chrome update is causing some problems, and the previous "fix" discussed above (#45 (comment)) is creating issues (it is hanging the workflow).

It seems using headless Chrome in CI is not that easy. I will probably revert the change that added the above line by default in the render action. It is currently causing hangs in all actions using render.

coatless added a commit to coatless-textbooks/c4ds that referenced this issue Oct 4, 2024