jestream pull_subscribe() not reconnecting after server reboot #504

martinsson · 2023-10-09T08:52:08Z

What version were you using?

v2.9.6

What environment was the server running in?

docker

Is this defect reproducible?

When the nats servers becomes unavailable for a short while any pull_subscription is lost (i.e. stops receiving messages), whereas push subscribers works as expected.

Reproduce

Run the following script. Reboot the nats server. Emit messages on the queue testnats.reconnec. Nothing is received

import asyncio

import nats

SUBJECT = "testnats.reconnect"
CLUSTER_ADDRESS = "nats://localhost:4222"


async def run(loop):
    async def disconnected_cb():
        print("Disconnected!")

    async def reconnected_cb():
        print("Reconnected!")

    async def error_cb(e):
        print("Error:", e)

    nc = await nats.connect(
        servers=[CLUSTER_ADDRESS],
        error_cb=error_cb,
        disconnected_cb=disconnected_cb,
        reconnected_cb=reconnected_cb,
        max_reconnect_attempts=-1,

    )

    js = nc.jetstream()

    async def message_handler(msg):
        subject = msg.subject
        reply = msg.reply
        data = msg.data.decode()
        print(
            "Received a message on '{subject} {reply}': {data}".format(
                subject=subject, reply=reply, data=data
            )
        )
        await msg.ack()


    # await js.subscribe(SUBJECT, cb=message_handler)
    subscription = await js.pull_subscribe(SUBJECT, "durable_name", "TEST_STREAM")
    while True:
        try:
            messages = await subscription.fetch(1, 10000000)
            for message in messages:
                await message_handler(message)
        except Exception as e:
            print(e)


if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    loop.create_task(run(loop))
    loop.run_forever()

It works as expected if I uncomment the pull_subscribe logic and replace it with the commented out js.subscribe() line

Given the capability you are leveraging, describe your expectation?

Pull subscribers should reconnect after reboot

Given the expectation, what is the defect you are observing?

pull subscribers are not resubscribing

The text was updated successfully, but these errors were encountered:

charbonnierg · 2023-10-09T14:26:45Z

I tried to create a new test under tests/test_js.py:PullSubscribeTest and ran it:

    @async_long_test
    async def test_pull_subscribe_reconnect(self):
        srv = self.server_pool[0]
        nc = NATS()
        await nc.connect()
        js = nc.jetstream()

        await js.add_stream(name="TEST1", subjects=["foo.1", "bar"])

        for i in range(10):
            ack = await js.publish("foo.1", f"Hello from NATS!".encode())
            assert ack.stream == "TEST1"
            assert ack.seq == i + 1

        consumer = await js.pull_subscribe("foo.1", "dur")
        for i in range(10):
            # Test fails if we uncomment the following lines
            # srv.stop()
            # srv.start()
            msg, *rest = await consumer.fetch(1, timeout=10)
            assert msg.data == b"Hello from NATS!"
            assert msg.metadata.stream == "TEST1"
            await msg.ack()

This test passes, but if I uncomment the srv.stop() and srv.start() lines the test does not pass.

@wallyqs does this test describe correctly the auto-reconnect feature that we should expect from nats-py ?

Edit

The following test passes successfully (the call to .fetch_one() is wrapped within a while True loop with a lower timeout):

    @async_long_test
    async def test_pull_subscribe_reconnect(self):
        srv = self.server_pool[0]
        nc = NATS()
        await nc.connect(
            max_reconnect_attempts=-1,
            allow_reconnect=True,
            reconnect_time_wait=1,
        )
        js = nc.jetstream()

        await js.add_stream(name="TEST1", subjects=["foo.1", "bar"])

        for i in range(3):
            ack = await js.publish("foo.1", "Hello from NATS!".encode())
            assert ack.stream == "TEST1"
            assert ack.seq == i + 1
        consumer = await js.pull_subscribe("foo.1", "dur")
        for i in range(3):
            srv.stop()
            srv.start()
            while True:
                try:
                    msg, *rest = await consumer.fetch(1, timeout=0.5)
                except TimeoutError:
                    continue
                break
            assert msg.data == b"Hello from NATS!"
            assert msg.metadata.stream == "TEST1"
            await msg.ack()

It seems that using a big or infinite timeout leads to the error in first test, because the fetch request is never sent to the client, and will never succeed (regardless of timeout value).

For now, it seems preferable to use a lower timeout, and retry the fetch operation.

I'd say that the following lines are "responsible" for this behavior (nats/js/client.py:L828):

            await self._nc.publish(
                self._nms,
                json.dumps(next_req).encode(),
                self._deliver,
            )
            # Wait for the response or raise timeout.
            msg = await self._sub.next_msg(timeout)

This publish operation never fails, whether a message is effectively sent to NATS or not, so we end up waiting for the whole timeout (at least that seem to be the case for the code sample provided in this issue).

I'm wondering whether it is normal that the NATS client does not send the publish once it is reconnected though, naively I would think that the message to be published stays in the pending queue, and on reconnect the message is sent, so NATS will publish the message and the coroutine waiting on the subscription queue (e.g., PullSubscription._sub.next_msg) will still receive the message before the timeout, but that does not seem to be the case ? Maybe that's because flusher loop does not know that server is disconnected, and hits a BrokenPipe error while writing to transport so this message is lost 🤨?

It may be worth knowing that setting the connect option pending_size=0 will prevent the call to publish from succeeding and will make it so that an the error nats.errors.OutboundBufferLimitError: nats: outbound buffer limit exceeded is raised on L828. This way, client will never wait if message is not published.

This last test illustrates very well what is happening (run with pytest tests/test_js.py -k test_pull_subscribe_reconnect -s to see captured stderr in console):

    @async_long_test
    async def test_pull_subscribe_reconnect(self):
        srv = self.server_pool[0]
        nc = NATS()
        await nc.connect(
            max_reconnect_attempts=-1,
            allow_reconnect=True,
            reconnect_time_wait=1,
            pending_size=0,
        )
        js = nc.jetstream()

        await js.add_stream(name="TEST1", subjects=["foo.1", "bar"])

        for i in range(3):
            ack = await js.publish("foo.1", "Hello from NATS!".encode())
            assert ack.stream == "TEST1"
            assert ack.seq == i + 1
        consumer = await js.pull_subscribe("foo.1", "dur")
        for i in range(3):
            srv.stop()
            srv.start()
            await asyncio.sleep(0)
            while True:
                try:
                    msg, *rest = await consumer.fetch(1, timeout=0.5)
                except TimeoutError:
                    if nc.is_connected:
                        print("TIMEOUT - Retrying immediately", file=sys.stderr)
                    else:
                        print(
                            "TIMEOUT - Retrying in 250ms (currently disconnected)",
                            file=sys.stderr,
                        )
                        await asyncio.sleep(0.250)
                except OutboundBufferLimitError:
                    print("NOT CONNECTED - Retrying in 250ms", file=sys.stderr)
                    await asyncio.sleep(0.250)
                else:
                    print("OK - Got message", file=sys.stderr)
                    break
            assert msg.data == b"Hello from NATS!"
            assert msg.metadata.stream == "TEST1"
            await msg.ack()

Captured Output:

[DEBUG] Server listening on port 4222 started.
[DEBUG] Server listening on 4222 will stop.
[DEBUG] Server listening on 4222 was stopped.
[DEBUG] Server listening on port 4222 started.
TIMEOUT - Retrying in 250ms (currently disconnected)
NOT CONNECTED - Retrying in 250ms
NOT CONNECTED - Retrying in 250ms
OK - Got message
[DEBUG] Server listening on 4222 will stop.
[DEBUG] Server listening on 4222 was stopped.
[DEBUG] Server listening on port 4222 started.
TIMEOUT - Retrying in 250ms (currently disconnected)
NOT CONNECTED - Retrying in 250ms
OK - Got message
[DEBUG] Server listening on 4222 will stop.
[DEBUG] Server listening on 4222 was stopped.
[DEBUG] Server listening on port 4222 started.
TIMEOUT - Retrying in 250ms (currently disconnected)
NOT CONNECTED - Retrying in 250ms
NOT CONNECTED - Retrying in 250ms
OK - Got message
[DEBUG] Server listening on 4222 will stop.
[DEBUG] Server listening on 4222 was stopped.

Second edit

I'm pretty sure that the request for next message is not published due to an error encountered when flushing (broken pipe). When an error occurs within the flusher task, pending buffer is lost. This has a much broader impact that just pull subscriptions, but if we change the _flusher method to something like that:

              try:
                  if self._pending_data_size > 0:
-                     self.transport.writlines(self._pending[:])
+                    pending = self._pending[:]
                      self._pending = []
                      self._pending_data_size = 0
+                    self._transport.writelines(pending)
-                     await self._transport.drain()
+                    try:
+                        await self._transport.drain()
+                    except BaseException:
+                        if pending:
+                            self._pending = pending + self._pending
+                            self._pending_data_size += len(pending)
+                        raise
            except OSError as e:
                if pending:
                    self._pending_data_size += len(pending)
                    self._pending = pending + self._pending

then messages are indeed published on reconnect and test passes , however, the asyncio.StreamWriter documentation (the transport used in the code samples in the issue) states that:

This is a flow control method that interacts with the underlying IO write buffer. When the size of the buffer reaches the high watermark, drain() blocks until the size of the buffer is drained down to the low watermark and writing can be resumed. When there is nothing to wait for, the drain() returns immediately.

I don't really understand what this means 😅 but I take it as "caller don't know how much of pending data has been written" so I guess the code above may send messages twice if an error occured after some data has been effectively written but before drain completes ? In the end, it seems that users should always consider the possibility that publish is never received by the server.

Regarding the specific case of PullSubscription.fetch(), it seems that it's up to users to make sure that they are using small timeouts with retries instead of None because it will always be possible that message is considered as published by client, but never received by the server.

On nats-py side, it may be worth changing the PullSubscription.fetch() signature to not accept None as a timeout value, because it will wait forever if request failed to be published ?

tkeller-moxe · 2023-12-11T14:58:24Z

I can second this bug, it hit is un production when we last did an update.
We managed to catch it quick and reboot but I would expect nats-py to properly handle server reboots!

Some notes:
Latest Nats-py version
Nats server 2.9.24.
On my end this is possible effecting pull subscribe and Request Reply/standard subscriptions.

itssimon · 2024-07-18T10:30:25Z

I've also been hitting this in production and had to restart services with pull subscriptions so they received messages again.

@wallyqs Would you mind taking a look at this please? This bug has a pretty severe impact on production systems relying on JetStream.

tkeller-moxe · 2024-07-18T14:23:30Z

I was actually doing some thinking on this yesterday,
I'm wondering if the Subscriptions are just not getting re-instantiated and are being disposed of after a disconnect happens for to long. We ran a server upgrade last week and got bit by this again as well.

yehudaNahon · 2024-07-23T13:38:57Z

hi,
encountered it in production as well
would really appreciate if you would give it a look 🙏

theobouwman · 2024-10-11T21:32:26Z

We are facing similar issues. Any updates on this?

(airtai/faststream#1581 (comment))

Jarema · 2024-10-14T08:44:46Z

Generally its not advised to set very high timeouts.

Pull Requests are not persisted throughout server reboots, and clients need to send new ones.
Regarding the publish - that would not solve the problem -as during restarts messages can be lost over the wire (plus Pull Requests are lost anyway)

The new JS API provides out of the box solution for this challenge, by providing kind of a subscription that handles timeouts, reconnects, etc.

martinsson added the defect Suspected defect such as a bug or regression label Oct 9, 2023

ovnicraft mentioned this issue Dec 26, 2023

ping/pong is not working #522

Open

Jarema assigned caspervonb Oct 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

jestream pull_subscribe() not reconnecting after server reboot #504

jestream pull_subscribe() not reconnecting after server reboot #504

martinsson commented Oct 9, 2023

charbonnierg commented Oct 9, 2023 •

edited

Loading

tkeller-moxe commented Dec 11, 2023 •

edited

Loading

itssimon commented Jul 18, 2024

tkeller-moxe commented Jul 18, 2024

yehudaNahon commented Jul 23, 2024

theobouwman commented Oct 11, 2024

Jarema commented Oct 14, 2024

jestream pull_subscribe() not reconnecting after server reboot #504

jestream pull_subscribe() not reconnecting after server reboot #504

Comments

martinsson commented Oct 9, 2023

What version were you using?

What environment was the server running in?

Is this defect reproducible?

Reproduce

Given the capability you are leveraging, describe your expectation?

Given the expectation, what is the defect you are observing?

charbonnierg commented Oct 9, 2023 • edited Loading

Edit

Second edit

tkeller-moxe commented Dec 11, 2023 • edited Loading

itssimon commented Jul 18, 2024

tkeller-moxe commented Jul 18, 2024

yehudaNahon commented Jul 23, 2024

theobouwman commented Oct 11, 2024

Jarema commented Oct 14, 2024

charbonnierg commented Oct 9, 2023 •

edited

Loading

tkeller-moxe commented Dec 11, 2023 •

edited

Loading