Browser randomly fails to start in docker container #3

Open
MMaster opened this issue Jul 20, 2024 · 19 comments
Labels
bug Something isn't working

Comments

@MMaster

MMaster commented Jul 20, 2024

When running the Docker container, the browser quite often fails to start and throws the following exception:

[INFO] launching the python script
[INFO] launching browser.
Traceback (most recent call last):
  File "/usr/app/src/index.py", line 46, in <module>
    loop().run_until_complete(main())
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/usr/app/src/index.py", line 10, in main
    browser = await start(headless=False)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/nodriver/core/util.py", line 74, in start
    return await Browser.create(config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/nodriver/core/browser.py", line 87, in create
    await instance.start()
  File "/usr/local/lib/python3.12/site-packages/nodriver/core/browser.py", line 343, in start
    raise Exception(
Exception:
                ---------------------
                Failed to connect to browser
                ---------------------
                One of the causes could be when you are running as root.
                In that case you need to pass no_sandbox=True

Exception ignored in atexit callback: <function deconstruct_browser at 0x7fbf3f225580>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/nodriver/core/util.py", line 124, in deconstruct_browser
    _.stop()
  File "/usr/local/lib/python3.12/site-packages/nodriver/core/browser.py", line 545, in stop
    asyncio.get_event_loop().create_task(self.connection.aclose())
                                         ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'aclose'

Catching the exception and trying again after 3 seconds until it succeeds seems to fix this issue.

@unixfox
Member

unixfox commented Jul 20, 2024

How much RAM does your system have?

Could you try to add --shm-size=2G in the docker run command?

@MMaster
Author

MMaster commented Jul 20, 2024

Currently the VM has 8GB RAM and half of it is free. I sure can increase the shared memory size for that, but I'm not really sure how that should help. I tried it and even the very first run failed.

Across 4 separate runs, the browser starts successfully once and then fails to start 3 times (it behaved the same way with a headless browser, before X was part of the Docker image).

I've worked around the issue by wrapping the start call in a try/except and retrying at most 5 times before giving up. That works, so memory doesn't seem to be the issue here.
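
The retry workaround described in this comment could be sketched roughly as follows. This is a minimal sketch, not the project's actual code: `start_with_retries` is a hypothetical helper name, the 5 attempts and 3-second delay are the values mentioned in this thread, and it catches the bare `Exception` because that is what the traceback shows nodriver raising.

```python
import asyncio


async def start_with_retries(start, attempts=5, delay=3.0):
    """Call the async callable `start` until it succeeds or attempts run out.

    `start` is any zero-argument callable returning an awaitable, for example
    `lambda: nodriver.start(headless=False)` (assumed usage, not run here).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return await start()
        except Exception as exc:  # the traceback shows a plain Exception
            last_error = exc
            print(f"[WARN] browser start failed ({attempt}/{attempts}): {exc!r}")
            if attempt < attempts:
                await asyncio.sleep(delay)
    raise RuntimeError(
        f"browser failed to start after {attempts} attempts"
    ) from last_error
```

Inside `main()` this would be used as `browser = await start_with_retries(lambda: nodriver.start(headless=False))`. Passing a callable rather than a coroutine object matters, because a coroutine can only be awaited once.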

@unixfox
Member

unixfox commented Jul 20, 2024

Ok. That's strange because I can't replicate the issue. I launched the script 5 times in a row and never had the issue:

Are you sure you are running the latest version of the script?

image

@unixfox
Member

unixfox commented Jul 20, 2024

@unixfox unixfox added the bug Something isn't working label Jul 20, 2024
@MMaster
Author

MMaster commented Jul 20, 2024

Yeah I am on latest version.

The related issues that you marked are actually 2 separate issues.
The first one happens even if the browser starts successfully and returns the tokens. I suspect it is because the async event loop is not stopped properly and sys.exit is called instead.
The second one may be related, but the solutions there don't really apply: I don't have any zombie processes, the Docker container stops after the run with no leftovers, and corrupted user data can't be the cause since it's an immutable Docker image.
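
If the first issue really is caused by calling `sys.exit` while the event loop is still running (a suspicion raised here, not something confirmed in this thread), one way to avoid it is to have the coroutine return an exit code and exit only after `asyncio.run` has shut the loop down. A minimal sketch, where `fetch_token`, `main` and `run` are hypothetical stand-ins for the real script:

```python
import asyncio
import sys


async def fetch_token():
    # Placeholder for the real browser work (hypothetical).
    await asyncio.sleep(0)
    return "token"


async def main() -> int:
    """Return an exit code instead of calling sys.exit() inside the loop."""
    try:
        token = await fetch_token()
    except Exception as exc:
        print(f"[ERROR] {exc!r}", file=sys.stderr)
        return 1
    print(f"[INFO] got token: {token}")
    return 0


def run() -> int:
    # asyncio.run() cancels leftover tasks and closes the loop before returning.
    return asyncio.run(main())

# In a script, the process would then exit with: sys.exit(run())
```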

Anyway, I noticed that even on Stack Overflow some people hit this issue randomly with nodriver, with no reliable solution.

@unixfox
Member

unixfox commented Jul 21, 2024

Ok I was able to reproduce the issue on a VM with just 2 cores and 1GB of RAM.

@MMaster
Author

MMaster commented Jul 21, 2024

> Ok I was able to reproduce the issue on a VM with just 2 cores and 1GB of RAM.

FYI: yesterday it stopped happening completely on the original VM, but it then happened on a dedicated machine with 128 GB RAM and 10 cores / 20 threads. It's completely random for me.

@unixfox
Member

unixfox commented Jul 21, 2024

OK, I have narrowed down the issue: nodriver doesn't wait long enough before it gives up trying to connect to the Chromium instance. See https://github.com/ultrafunkamsterdam/nodriver/blob/main/nodriver/core/browser.py#L340-L346

I have pushed a dirty patch in the Dockerfile for waiting more time: 0551c92#diff-dd2c0eb6ea5cfc6c4bd4eac30934e2d5746747af48fef6da689e85b752f39557R22. I validated that it works fine on my VM with 2vCPU and 1GB of RAM.
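
The general idea of waiting longer for the browser can be illustrated by polling a TCP port until a deadline. This is only a sketch of the technique under stated assumptions: it is neither nodriver's code nor the Dockerfile patch, `wait_for_port` is a hypothetical helper, and the host and port of the browser's debugging endpoint would have to be known by the caller.

```python
import asyncio
import time


async def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll host:port until a TCP connection succeeds or `timeout` elapses.

    Returns True once the port accepts a connection, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        try:
            _, writer = await asyncio.wait_for(
                asyncio.open_connection(host, port), timeout=remaining
            )
        except (OSError, asyncio.TimeoutError):
            # Not listening yet (or too slow): pause briefly, then retry.
            await asyncio.sleep(min(interval, max(deadline - time.monotonic(), 0)))
            continue
        writer.close()  # we only needed to know the port is open
        return True
```

For example, `await wait_for_port("127.0.0.1", 9222, timeout=60)`, where 9222 is only an illustrative port. A deadline-based loop like this gives a slow machine more time without delaying fast ones, unlike a fixed sleep.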

But I'm waiting for an official implementation, for which I have created a PR:

@markus583

Hi, even with the dirty patch and also when increasing the sleep time further, I still get the same error.
VM resources are not the issue, it has much more than 2vCPU and 1GB of RAM.

Has anyone encountered this issue in other settings, or does anyone have a fix for it? Thanks!

@hashamyounis9

This comment was marked as off-topic.

@markus583

This comment was marked as off-topic.

@bitnol

bitnol commented Dec 7, 2024

> Ok I have narrowed down the issue, nodriver doesn't wait enough time before giving up trying to connect chromium instance: https://github.com/ultrafunkamsterdam/nodriver/blob/main/nodriver/core/browser.py#L340-L346
>
> I have pushed a dirty patch in the Dockerfile for waiting more time: 0551c92#diff-dd2c0eb6ea5cfc6c4bd4eac30934e2d5746747af48fef6da689e85b752f39557R22. I validated that it works fine on my VM with 2vCPU and 1GB of RAM.
>
> But I'm waiting for an official implementation, for which I have created a PR:

I tested the longer wait time, but it made no difference.
However, when I changed headless=False to headless=True in extractor.py L104, it started working.

@unixfox
Member

unixfox commented Dec 7, 2024

> > Ok I have narrowed down the issue, nodriver doesn't wait enough time before giving up trying to connect chromium instance: https://github.com/ultrafunkamsterdam/nodriver/blob/main/nodriver/core/browser.py#L340-L346
> >
> > I have pushed a dirty patch in the Dockerfile for waiting more time: 0551c92#diff-dd2c0eb6ea5cfc6c4bd4eac30934e2d5746747af48fef6da689e85b752f39557R22. I validated that it works fine on my VM with 2vCPU and 1GB of RAM.
> >
> > But I'm waiting for an official implementation, for which I have created a PR:
>
> I have tested the change of wait time but there was no change.
> But when I changed the headless=False to True in extractor.py L104, it started working.

The token should be invalid if headless is true. Did it work for you?

@achiragaming

The same thing happened on my VPS too:

[INFO] internally launching GUI (X11 environment)
[INFO] starting Xvfb
[INFO] launching chromium instance
2024/12/07 18:54:45.655 [extractor] [INFO] update started
Traceback (most recent call last):
  File "/app/potoken-generator.py", line 4, in <module>
    potoken_generator.main.main()
  File "/app/potoken_generator/main.py", line 98, in main
    loop.run_until_complete(main_task)
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/app/potoken_generator/main.py", line 35, in run
    token = await potoken_extractor.run_once()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/potoken_generator/extractor.py", line 46, in run_once
    await self._update()
  File "/app/potoken_generator/extractor.py", line 91, in _update
    await asyncio.wait_for(self._perform_update(), timeout=600)
  File "/usr/local/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/app/potoken_generator/extractor.py", line 104, in _perform_update
    browser = await nodriver.start(headless=False,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/nodriver/core/util.py", line 74, in start
    return await Browser.create(config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/nodriver/core/browser.py", line 87, in create
    await instance.start()
  File "/usr/local/lib/python3.12/site-packages/nodriver/core/browser.py", line 343, in start
    raise Exception(
Exception: 
                ---------------------
                Failed to connect to browser
                ---------------------
                One of the causes could be when you are running as root.
                In that case you need to pass no_sandbox=True 
                
Exception ignored in atexit callback: <function deconstruct_browser at 0x7f9180998540>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/nodriver/core/util.py", line 124, in deconstruct_browser
    _.stop()
  File "/usr/local/lib/python3.12/site-packages/nodriver/core/browser.py", line 545, in stop
    asyncio.get_event_loop().create_task(self.connection.aclose())
                                         ^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'aclose'

@Svenito

Svenito commented Dec 18, 2024

Just wanted to chip in here that I was having the same issue. I found this thread and the --shm-size=2G didn't resolve anything, but --shm-size=3G made it work reliably.
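
To confirm that a `--shm-size` value actually took effect inside the container, the size of the shared-memory mount can be checked from Python. A minimal sketch, assuming shared memory is mounted at `/dev/shm` (the usual location in Linux containers); `shm_size_gib` and `shm_is_at_least` are hypothetical helper names:

```python
import shutil


def shm_size_gib(path="/dev/shm"):
    """Return the total size of the filesystem mounted at `path`, in GiB."""
    return shutil.disk_usage(path).total / 1024**3


def shm_is_at_least(min_gib, path="/dev/shm"):
    """True if the mount at `path` is at least `min_gib` GiB in size."""
    return shm_size_gib(path) >= min_gib
```

Calling `shm_is_at_least(3)` before launching the browser would flag a container that is still running with Docker's 64 MB default.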

@CelluloidRacer2

> --shm-size=2G didn't resolve anything, but --shm-size=3G made it work reliably.

Trying --shm-size=3G through --shm-size=8G didn't work for me. I'm guessing it has to do with load times and the above-mentioned race condition.

Do you know what CPU you have on that host? I'm running on bare metal with an older Xeon E5-1650v1.

@Svenito

Svenito commented Dec 26, 2024

The box is running an Intel(R) Core(TM) i3-4130 CPU @ 3.40GHz. It's in a homelab running Linux.

@CelluloidRacer2

Definitely was not a CPU issue

My issue was absolutely not related to the code. I ended up discovering a network issue that caused some downloads to fail, which I assume made Chromium's first launch fail and consequently produced the error seen here.

Note to self: don't forget to set TCP-MSS/MTU on Wireguard's loopback adapters if passing outbound network traffic across a VPN on the network side (and not on network clients)

@JC9mm

This comment was marked as resolved.

9 participants