Telnet console death issue #2344
I'm really starting to question whether this is the reason the new console breaks. I will say that once we are in this loop, nothing ever deletes the connection that is in a failed state.
OK, I think I have this dialed in now. So many more debugs.. This shows a bad connection isn't being dealt with correctly:

2024-02-02 23:09:13 DEBUG telnet_server.py:284 Start async _get_reader

As you can see, self._reader_process has a StreamReader that had an exception thrown with a dead socket, but it hasn't been removed. I'm going to add a wait_for on a drain call (sketched below). If it fails after, say, 30 seconds, then we haven't been able to get data out to a low-water mark on the socket and it's most likely dead. Then yank it from the connection dictionary and, in theory, this problem goes away. Crossing fingers...
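A minimal sketch of that idea, assuming a hypothetical connections dict mapping each connection key to its StreamWriter (the real telnet_server.py structures may differ):

```python
import asyncio

DRAIN_TIMEOUT = 30  # seconds; the value floated in the comment above

async def send_to_client(connections: dict, key, data: bytes) -> None:
    """Write to one telnet client; drop the connection if it cannot drain."""
    writer = connections[key]
    writer.write(data)
    try:
        # On a dead peer, drain() can block until the kernel gives up;
        # bound it so we notice within DRAIN_TIMEOUT instead.
        await asyncio.wait_for(writer.drain(), timeout=DRAIN_TIMEOUT)
    except (asyncio.TimeoutError, ConnectionError):
        # The socket never drained to its low-water mark: treat it as dead,
        # close it, and yank it from the connection dictionary so the
        # failed StreamReader stops being handed out.
        writer.close()
        del connections[key]
```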
Fix located here.
Thanks, I am going to check everything soon (need to catch up after holidays) 👍
Awesome, thanks for the fix, I hope we have finally squashed that annoying bug 👍
It's.. uh.. not. @grossmj Can you re-open this while I yell into the void in an overly dramatic manner?
Here is the exception. I don't know how long it was hung on this call, but you can see it does time out after a while. My guess is 2 hours. Oh, I also feel like maybe the keepalives aren't always being set for some reason, but I have no evidence. Knee-jerk reaction is asyncio.wait_for() again; there is a keepalive sketch after the traceback. I'll put some time into it this weekend.

2024-03-01 02:36:54 ERROR base_events.py:1758 Task exception was never retrieved
future: <Task finished name='Task-1745' coro=<AsyncioTelnetServer.run() done, defined at /usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/utils/asyncio/telnet_server.py:188> exception=TimeoutError(110, 'Connection timed out') created at /usr/lib/python3.10/asyncio/streams.py:242>
source_traceback: Object created at (most recent call last):
File "/usr/bin/gns3server", line 8, in <module>
sys.exit(main())
File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/main.py", line 87, in main
run()
File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/run.py", line 267, in run
server.run()
File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/web/web_server.py", line 352, in run
self._loop.run_forever()
File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/usr/lib/python3.10/asyncio/base_events.py", line 1901, in _run_once
handle._run()
File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/usr/lib/python3.10/asyncio/streams.py", line 242, in connection_made
self._task = self._loop.create_task(res)
Traceback (most recent call last):
File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/utils/asyncio/telnet_server.py", line 207, in run
await self._process(network_reader, network_writer, connection)
File "/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server/utils/asyncio/telnet_server.py", line 269, in _process
data = coro.result()
File "/usr/lib/python3.10/asyncio/streams.py", line 669, in read
await self._wait_for_data('read')
File "/usr/lib/python3.10/asyncio/streams.py", line 501, in _wait_for_data
await self._waiter
File "/usr/lib/python3.10/asyncio/selector_events.py", line 862, in _read_ready__data_received
data = self._sock.recv(self.max_size)
TimeoutError: [Errno 110] Connection timed out
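For reference, the roughly 2-hour figure lines up with the Linux default TCP keepalive idle time (net.ipv4.tcp_keepalive_time = 7200 seconds). Here is a minimal, hedged sketch of forcing keepalives on per connection so a silently dead peer is detected in about a minute instead of hours; the function name and timing values are mine, not from the server code, and the tuning constants are Linux-specific:

```python
import asyncio
import socket

def enable_keepalive(writer: asyncio.StreamWriter,
                     idle: int = 60, interval: int = 10, count: int = 5) -> None:
    """Turn on TCP keepalives so the kernel notices a silently dead peer
    after roughly idle + interval * count seconds instead of hours."""
    sock = writer.get_extra_info("socket")
    if sock is None:
        return
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning knobs below are Linux-specific; guard them for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```

A bare asyncio.wait_for() around the read is trickier here, because an idle console legitimately sends nothing for long stretches; keepalives probe the transport without needing any application data.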
@grossmj
Since there isn't really an open issue to track the original death of the telnet console, I'm opening this. I've narrowed down where the issue is; I haven't figured out what to do about it just yet.
Here is a link to the telnet_server.py with debugs showing the issue.
telnet_server.py
Here are the debug logs from when the death loop happens.
So.. the interesting thing here is that _get_reader is always returning None. This then causes an infinite loop in _process.
As you can see from the debugs above, _get_reader is making it all the way to that final if statement, which doesn't match and then returns None. I've looked at this a few times and I don't understand the intention of that compare.
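A simplified reconstruction of the loop shape being described; this is illustrative only, not the actual GNS3 code:

```python
async def process_shape(get_reader, network_reader):
    """Illustrative only: if get_reader() keeps returning None for a failed
    connection and nothing ever removes that connection, this loop never
    awaits anything for it and just spins, wedging the console."""
    while True:
        reader = get_reader(network_reader)  # always None for the dead connection
        if reader is None:
            continue  # the failed StreamReader is never purged -> infinite loop
        data = await reader.read(1024)
        if not data:
            break
```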
BTW, ignore the "__process reader_read is still 2nd None" log line. I didn't have that debug in the right place and didn't feel like redoing everything.
Some other interesting things to point out: self._reader_process has a 0-byte buffer, but network_reader has a buffer of 65550 bytes. I have to poke around in pdb to see what I can find, but if I recall from looking at this, the socket behind self._reader_process is technically closed, so it's impossible to write data to it. I think this is why the reader's buffer isn't getting drained. I'm not sure if scheduling a future on a dead socket is what is causing this. One way to detect that state is sketched below.
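For what it's worth, the standard asyncio.StreamReader API can detect that state before another read is scheduled; the helper name and the purge policy are my assumption, not the server's code:

```python
import asyncio

def reader_is_dead(reader: asyncio.StreamReader) -> bool:
    """True if this StreamReader can never produce more data.

    exception() is set once the transport died with an error (such as the
    ETIMEDOUT above); at_eof() means the buffer is drained and EOF was fed,
    so no further read() will ever return data.
    """
    return reader.exception() is not None or reader.at_eof()
```

A connection whose reader satisfies this check should be purged from the connection dictionary instead of being read from again.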
The way I'm personally triggering this is to set up a firewall rule and then delete a telnet console session while I'm running find / on 2 telnet consoles. The telnet consoles will wedge within 10 seconds or so.
I still plan on finishing up the proxy replacement, but it's not ready. I got to the point where I can start and stop a QEMU VM and the console works, but if I delete the devices I get exceptions. It also doesn't support all the options the original telnet_server supports (such as passing in a reader/writer).