Add ability to check async workers status #32

Open
fr33l opened this issue Apr 20, 2021 · 5 comments
Comments

@fr33l
Member

fr33l commented Apr 20, 2021

As of now, this lib has a major flaw which negates its purpose.

If a service's async worker dies, we don't detect it and keep responding with 200 on the status endpoint.

So we can't rely on those status endpoints in our monitoring, which makes it hard to understand why we have issues like these:

https://github.com/yola/production/issues/9075
https://github.com/yola/production/issues/9076
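
A minimal sketch of the kind of check this issue asks for, assuming the service exposes its Celery application as celery_app and the status endpoint can be wired to it (the names and the broker URL are illustrative, not this lib's actual API):

from celery import Celery

celery_app = Celery(broker='redis://localhost:6379/0')  # illustrative broker URL


def async_workers_alive(timeout=1.0):
    """Return True if at least one Celery worker answers a broadcast ping.

    control.ping() collects replies for `timeout` seconds; a dead or hung
    worker simply never replies, and an unreachable broker raises.
    """
    try:
        replies = celery_app.control.ping(timeout=timeout)
    except Exception:
        return False  # broker unreachable counts as unhealthy too
    return bool(replies)

The status endpoint could then return 503 instead of 200 whenever async_workers_alive() is False.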

@a-milogradov

I'm still under the impression that Celery should raise an error when a worker is lost, like this.

Not sure what the exact situation was on Sunday; I haven't looked into it yet.

@Toshakins
Contributor

@a-milogradov We had similar issues previously; for example, the ecwidservice local worker went silent due to an incorrect parameter order in the configuration.

@Toshakins added this to the Sprint 65 milestone Apr 20, 2021
@a-milogradov

a-milogradov commented Apr 20, 2021

I believe it died because of this:
[2021-04-18 10:07:04,831: CRITICAL/MainProcess] Unrecoverable error: RecursionError('maximum recursion depth exceeded',)
Traceback (most recent call last):
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/kombu/transport/virtual/base.py", line 918, in create_channel
    return self._avail_channels.pop()
IndexError: pop from empty list

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 685, in check_health
    if nativestr(self.read_response()) != 'PONG':
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 739, in read_response
    response = self._parser.read_response()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 484, in read_response
    raise response
redis.exceptions.BusyLoadingError: Redis is loading the dataset in memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 685, in check_health
    if nativestr(self.read_response()) != 'PONG':
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 739, in read_response
    response = self._parser.read_response()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 484, in read_response
    raise response
redis.exceptions.BusyLoadingError: Redis is loading the dataset in memory

During handling of the above exception, another exception occurred:

................ deep deep same exception chain (redis.exceptions.BusyLoadingError: Redis is loading the dataset in memory) ...................................


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/celery/worker/worker.py", line 203, in start
    self.blueprint.start(self)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/celery/bootsteps.py", line 116, in start
    step.start(parent)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/celery/bootsteps.py", line 365, in start
    return self.obj.start()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 311, in start
    blueprint.start(self)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/celery/bootsteps.py", line 116, in start
    step.start(parent)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/celery/worker/consumer/connection.py", line 21, in start
    c.connection = c.connect()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 398, in connect
    conn = self.connection_for_read(heartbeat=self.amqheartbeat)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 405, in connection_for_read
    self.app.connection_for_read(heartbeat=heartbeat))
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 432, in ensure_connected
    callback=maybe_shutdown,
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/kombu/connection.py", line 383, in ensure_connection
    self._ensure_connection(*args, **kwargs)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/kombu/connection.py", line 439, in _ensure_connection
    callback, timeout=timeout
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/kombu/utils/functional.py", line 325, in retry_over_time
    return fun(*args, **kwargs)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/kombu/connection.py", line 866, in _connection_factory
    self._connection = self._establish_connection()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/kombu/connection.py", line 801, in _establish_connection

................... big recursive stack ...........................

  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 701, in send_packed_command
    self.check_health()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 690, in check_health
    self.send_command('PING', check_health=False)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 726, in send_command
    check_health=kwargs.get('check_health', True))
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 698, in send_packed_command
    self.connect()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 567, in connect
    self.on_connect()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 663, in on_connect
    self.send_command('SELECT', self.db)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 726, in send_command
    check_health=kwargs.get('check_health', True))
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 701, in send_packed_command
    self.check_health()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 690, in check_health
    self.send_command('PING', check_health=False)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 726, in send_command
    check_health=kwargs.get('check_health', True))
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 698, in send_packed_command
    self.connect()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 567, in connect
    self.on_connect()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 663, in on_connect
    self.send_command('SELECT', self.db)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 726, in send_command
    check_health=kwargs.get('check_health', True))
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 701, in send_packed_command
    self.check_health()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 690, in check_health
    self.send_command('PING', check_health=False)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 726, in send_command
    check_health=kwargs.get('check_health', True))
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 698, in send_packed_command
    self.connect()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 585, in _connect
    socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 747, in getaddrinfo
    addrlist.append((_intenum_converter(af, AddressFamily),
  File "/usr/lib/python3.6/socket.py", line 103, in _intenum_converter
    return enum_klass(value)
RecursionError: maximum recursion depth exceeded

But before that, the connection was lost; maybe Redis was restarted:

[2021-04-18 10:06:58,588: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 311, in start
    blueprint.start(self)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/celery/bootsteps.py", line 116, in start
    step.start(parent)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 592, in start
    c.loop(*c.loop_args())
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/celery/worker/loops.py", line 81, in asynloop
    next(loop)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 361, in create_loop
    cb(*cbargs)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/kombu/transport/redis.py", line 1087, in on_readable
    self.cycle.on_readable(fileno)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/kombu/transport/redis.py", line 358, in on_readable
    chan.handlers[type]()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/kombu/transport/redis.py", line 692, in _receive
    ret.append(self._receive_one(c))
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/kombu/transport/redis.py", line 702, in _receive_one
    response = c.parse_response()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/client.py", line 3505, in parse_response
    response = self._execute(conn, conn.read_response)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/client.py", line 3479, in _execute
    return command(*args, **kwargs)
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 739, in read_response
    response = self._parser.read_response()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 470, in read_response
    self.read_from_socket()
  File "/srv/ws-template-importer/live/virtualenv/lib/python3.6/site-packages/redis/connection.py", line 429, in read_from_socket
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.

upd: reduced the verbosity of this comment

@fr33l
Member Author

fr33l commented Apr 20, 2021

Yes, it did die because of that.

But this ticket's aim is to improve detection, not to prevent the cause.
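
One possible detection approach, sketched here as an assumption rather than an agreed design (the task name, file path, schedule, and threshold are all made up): have a worker-executed task refresh a heartbeat marker, and let the status check verify that the marker is fresh.

import os
import time
from pathlib import Path

from celery import Celery

celery_app = Celery(broker='redis://localhost:6379/0')   # illustrative
HEARTBEAT_FILE = Path('/tmp/async-worker-heartbeat')     # illustrative path
MAX_AGE_SECONDS = 120                                    # illustrative threshold


@celery_app.task
def touch_heartbeat():
    # Scheduled via celery beat (e.g. every 30 seconds); only a live worker
    # that can still consume from the broker keeps refreshing the file.
    HEARTBEAT_FILE.touch()


def async_worker_healthy():
    try:
        age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
    except OSError:
        return False  # file was never written: the worker never came up
    return age < MAX_AGE_SECONDS

Unlike a broadcast ping, this can also catch a worker whose pool is stuck, since a stuck pool stops executing the scheduled task even if the main process still answers pings.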

@a-milogradov

A Sentry event is a part of the detection system.

I think it's important to understand why it died silently, without a Sentry report.
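
For context, a generic sketch of how Celery workers are usually hooked up to Sentry with sentry_sdk; whether this service actually uses sentry_sdk (rather than the older raven client), and where it calls init, are assumptions here:

import sentry_sdk
from sentry_sdk.integrations.celery import CeleryIntegration

# Typically called once at worker startup, e.g. in the module that creates
# the Celery app. sentry_sdk's default logging integration turns log records
# of level ERROR and above into events, so a CRITICAL "Unrecoverable error"
# from the MainProcess would normally be reported if the SDK is initialised
# in that process.
sentry_sdk.init(
    dsn='https://examplePublicKey@o0.ingest.sentry.io/0',  # placeholder DSN
    integrations=[CeleryIntegration()],
)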

@aleksuk removed this from the Sprint 65 milestone Apr 29, 2021