-
Notifications
You must be signed in to change notification settings - Fork 263
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-write Telnet server using telnetlib3 #2289
Comments
Hi, I'm working on this and will feedback when I have something ready to review, I may need some steers and pointers on review as this is my first contribution to an open-source project. |
oooh exciting! This and not having a html5 viewer are my biggest complaints with GNS3. |
Yes this is awesome. Implementing this might also fix that old nasty bug which freezes the console connections for Qemu nodes which use Telnet. #1619 (comment). (For now we have an workaround to reset the connection and gain access to the console again without reboot). Would be nice if this feature might land into the current 3.0 version and not wait for later releases. |
Making progress folks.. FYI - I've also created a unit test for telnet_server.py, currently it's only testing basic functionality, multiple connections, telnet negotiation and graceful shutdown. Is there anything else that would be useful to test for? |
So there is a long standing bug that really there are only a few theories about. Basically a device with a telnet console will stop working. The device will still function, however the telnet console will no longer return data. You will see the telnet connect from the client to GNS3 server is able to be established, but it seems like whatever gets the data from GNS3 to, for example QEMU is no longer sending said data. closing and opening a new telnet will not fix the issue. The theory is a stateful firewall is messing with things, but as to where or why it perm upsets the console flow is unknown. Restart QEMU and the console will work again. Can you explain the process of how data is getting from say the client (like someone double clicking on a VM) and how that gets into QEMU and how the new python backend gets the data there? Does dynamips or ubridge involved? I just upgraded my setup to 2.2.45. If you need help testing please let me know. I haven't tried to recreate this issue in a little while but I could do it with previous version (2.2.33.1-aah-custom). Here is a rough write up of my setup. GNS3 client (win) -> Fortinet firewall -> GNS3 server (Ubuntu 22.04). My Ubuntu does have active nft rules, but they seem to be the ones that come baked in with libvirt to provide nat connectivity outbound. That being said i'm well versed in tcpdump and whatnot. Let me know how I can help! |
I'm all setup to see if I can replicate this. I should know something in the morning. |
ok i forgot to nohup my tcpdump process but I have re-created the issue and did additional digging and made an interesting discovery. So basically there are 2 tcp flows. I did packet captures on these and I can see whats happening. Basically the GNS3 process is eating the reply data from QEMU. For example I hit "L" on the telnet console this morning and this is the packet flow I see. Windows (telnet) -> GNS3 (sends L packet) I would be expecting the GNS3 process to echo the L back to the Windows client, but that isn't happening. I'm not sure if that means the GNS3 reader socket isn't working to GNS3 or the writer back to windows but that is one of the issues that needs to be addressed. @grossmj is any of what I said news to you by chance? That is the main issue that hopefully a new telnetlib would address. |
Interesting debug and tshoot of the flow @spikefishjohn. Is "GNS3" the VM, or is this also on bare metal? Do you have a flow of when it is working and the "L" gets echoed from GNS3 to telnet on Windows? |
GNS3 is Bare metal When its working yes the L gets echoed back to the telnet client on windows. tcpdump shows the following When it works GNS3 process wise, its the main GNS3 process. I'll take some packet capture when the wife isn't looking. |
Anything in the gns3server logs when it fails? Other than that, I leave it in @grossmj 's very capable hands. |
I just barfed log.debug statements all over telnetlib last night. Nothing standing out so far. I did notice that when this starts happening the rx/tx queues in netstat -anp output will start going up. I thought I had everything setup for replication but this morning it didn't happen. @grossmj Passing thought, what happens if there is console output from qemu but nothing attached to GNS3 (like a connected telnet client)? I'm assuming the telnetlib would need to throw away whats in the buffer at some point since there is no where to send the data. |
Thanks folks this is useful, I'll keep this in mind whilst refactoring and test writing. it'll be interesting to see how the conditions from a gns3 process and packet flow level are impacted by the change to telnetlib3..hopefully for the better! @spikefishjohn - thanks for the support, I will reach out to you when ready to test a little more vigorously as your experience will be invaluable! |
Sounds good! FYI I littered the existing telnet module with log.debug statements... aaaand now I can't replicate the issue, which is super strange. I'm going to try leaving up the GNS3 gui for the day and see if that is something tied into replicating the issue. If it doesn't come back i'll re-install the original file and try again. I really hope I can pin down how to replicate this so we know how to check if your code fixes the issue or not. |
ok got it to replicate again. I think you need to leave a telnet window open to trigger the issue. I've added more debugs to print data. Hopefully I have something useful today. |
Ok i have a better replication path to this an a crazy about of debugs have uncovered a few things.
I know there was something done to enable tcp keep alives but i'm not sure its working, but again, why doesn't the VNC console have this problem. My assumption is its not using tcp keep alives. I'll take a look at how the VNC module handles this as it doesn't seem to have this issue. |
More info.. keepalive are being turned on but the timers aren't being changed which leaves keepalives being sent at the same time as tcp dead timers on linux which seems to be 2 hours. root@compute01:~# ss -o state all | grep 7036
tcp LISTEN 0 100 0.0.0.0:7036 0.0.0.0:*
tcp ESTAB 0 0 192.168.1.100:7036 192.168.1.189:51756 timer:(keepalive,117min,0)
root@compute01:~# Changing the socket code to this writer_sock = network_writer.get_extra_info("socket")
writer_sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
writer_sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
writer_sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
writer_sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
writer_sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 4) produces this keepalive config. Send a keep alive every 60 seconds, 0 shows the current keep count. If 4 are missed linux kills the tcp session. root@compute01:/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server# ss -o state all | egrep 7036
tcp LISTEN 0 100 0.0.0.0:7036 0.0.0.0:*
tcp ESTAB 0 0 192.168.1.100:7036 192.168.1.189:51970 timer:(keepalive,44sec,0)
root@compute01:/usr/share/gns3/gns3-server/lib/python3.10/site-packages/gns3server# I've verified with tcpdump that its now sending keep alives every 60 seconds. If a keep alive is missed it will send keep alives every 10 seconds. If 4 are missing the socket is deleted.
Now, that will keep a firewall happy by preventing the tcp sesssion from idling out...however.. Everything still breaks if the tcp session dies. For example if I kill that tcp session in the firewall the consoles will still break. |
@KCarmichael Any updates? I took a swing at this and have something working if you want to compare notes / whatnot. I noticed that doing something that creates a lot of console output (like find /) spikes the cpu and thus the GNS3 process, but from looking at the original code that seems to happen there as well. |
Thanks for checking in, and my apologies for the lack of comms, so I've refactored the telnet server to operate with telnetlib3. Today, I focused on conducting further tests and adjustments to ensure that it handles session negotiations effectively, as there were some odd behaviours during the refactoring. However, since my last change, the results are promising, though I think that additional testing would be desirable. Regarding the session loss and stalling issues, I didn't encounter any in today's tests, but I'm cautiously optimistic. I hope the latest updates have ironed these out, though I'm keeping my fingers crossed until we have more conclusive results. I'm planning to push these changes to a branch in my repository in the morning. You'll be able to pull from it for your own testing once it's up. Your feedback and insights will be invaluable, especially considering the work you've done on your end. About the CPU spikes you mentioned, that's an important observation. I think once we're confident about the telnetlib3 integration, conducting performance and stress tests would be a great next step. It would help us understand its impact on functionality, like maintaining sessions under load. Thanks for your engagement and contributions to this project. Looking forward to collaborating further on this. Best regards, |
OK, my first contribution and pull request submitted! > #2342 Hopefully, this is a good place to start with plenty of opportunities to refine further, I'm eager to receive feedback on this initial pull request. If everything looks good with the implementation of telnetlib3, I'm ready to dive into addressing other bugs and issue requests. Please let me know your thoughts and any suggestions for improvement. Console Connectivity Bug (Do we need to separate this out into a new issue? - as this was raised for enhancement/refactoring with telnetlib3) @spikefishjohn - I'll start looking at testing and debugging now, first by reviewing all of your findings from above with fresh eyes, if you don't mind, could you give me an idea of the sort of topology/setup you are running? for instance, GNS Usage: In any of your packet captures where you lose 'echo replay' after an idle period, are you able to see any NVT commands/flags being sent to or from the server or nodes? I'm wondering if it's potentially related to my observation about the NVT command handling behaviour, this is present in the original code and sometimes in my own refactored version, but essentially despite having the right code to handle NVT commands, the server will from time to time not handle them, more often I see it with DO commands. My suspicion, although I'm yet to properly troubleshoot is that recently woke idling devices may attempt to renegotiate sessions with the telnet server, for instance, the client sends IAC DO ECHO to request the server enable character echoing, but these negotiations are not being acknowledged, instead error handled but no break, resulting in what looks like a loss of session. At the moment it is just a theory, I haven't done any specific testing for this exact scenario as of yet, but it was behaviour I noted while rewriting and testing the old and refactored telnet_server. I'm also interested to see if we can capture the initial negotiation from the live environment too, so I can start to build an end-to-end picture of what is happening. Also, apologies as more questions may follow as I switch to troubleshooting mode! Thanks, |
Sure, we could move this else where if you want. I'm personally running gns3 client -> local server. Basically My windows machine is connecting to a Linux box running the gns3server process on ubuntu. There is a firewall between my windows box and the Linux box. My firewall is triggering the bug with the telnet console that causes it to break. Once the bug triggers, all current and future console sessions are now dead. The only fix is to stop and start the gns3 node or reset the telnet console through the gui (forgot to test that). I think what your saying is you think there is a problem is the negotiation phase of the telnet session. I can't say for sure if there is a long standing issue with how the session negotiates, but my assumption is if there was a problem here, the serial console would always not work. I can open and close console sessions without issue. The bug is triggered by an improperly closed tcp session that exists between the client and the gns3 server process. My guess is if you could get the tcp session between GNS3 and qemu to timeout it would trigger on that side as well, but generally this happens on the local host so chances of that happening are super slim. The band-aid I added with patch-2 (cough sorry, patch-3 cough cough) was to lower the tcp keep alive timers. This forces the GNS3 server to send data to the client even if there is nothing to send, more often. This will keep a firewalls connection table fresh. Should a firewall delete the tcp session then the GNS3 console bug will trigger. I believe the root of this console breaking issue is a dead tcp session not being closed properly in GNS3 and not being deleted from the pool of hosts to broadcast data to afterwards. |
I'm seeing an issue with this patch. I have a python program written, compatible with python 3.4 since thats what ships on my vyos VM, to print special characters to the screen. import os
import subprocess
import binascii
import signal
def get_current_tty():
"""Returns the current TTY device."""
try:
return subprocess.check_output("tty", shell=True).decode().strip()
except subprocess.CalledProcessError:
raise RuntimeError("Unable to determine the current TTY")
def read_from_tty(device_file):
"""Reads one byte at a time from a TTY device and prints it in hex."""
with open(device_file, 'rb', buffering=0) as tty: # Set buffering to 0
while True:
byte = tty.read(1) # Read one byte
if byte:
# Convert the byte to hex using binascii
hex_byte = binascii.hexlify(byte)
print("{} ".format(hex_byte.decode()), end="", flush=True)
if __name__ == "__main__":
signal.signal(signal.SIGINT, signal.SIG_IGN)
TTY_DEVICE = get_current_tty()
print("Current TTY device: {}".format(TTY_DEVICE))
read_from_tty(TTY_DEVICE) This will print any special characters sent telnet session to the screen. vyos@vyos-prod-ashburn:~$ python3 print-tty-chars.py
Current TTY device: /dev/ttyS0
^C^@^@ If i go back to the orignal telnet_server.py I get this. vyos@vyos-prod-ashburn:~$ python3 print-tty-chars.py
Current TTY device: /dev/ttyS0 |
Here is my monster. https://github.com/SpikefishSolutions/SFTelnetProxyMuxer Currently i'm just using it by starting a qemu process and then attaching. High level features. 4MB read buffer. This makes the console much faster.. for a little while... then it get faster again. I don't get it either. Completely replaced old telnet library with telnetlib3. Clients are added to a set() as they connect. If there is a socket error they are removed from the set. A broadcast method is created. Any data sent to broadcast is spammed to all clients. Heartbeat support for clients! All client read operations (looking for a client to send some data) are done via a shielded wait_for. This allows us to wait 2 second, then thrown an exception. Timeout could be the socket died, but it could also mean the client just didn't send anyway. Shield() prevents the socket from closing. We then send a NOP to the clients. Any client not responding to the NOP are closed() and remove from client list. Server has a problem doing this.. i think.. its late i'm honestly not sure now. Will have to test again. What was happening was server would see the NOP packet then send it to the clients, which would get super mad for some reason then close. I'l have to test again. If the remote server goes down (like qemu crashes) the console will stay up and notify the clients that the back end just died. This may or may not be useful. I kept the keepalive code in there however it may not be needed if I can get server Heartbeats working. If no clients are connected data is thrown away. This way we can keep the output buffer of qemu's console empty. === It really bothers me that doing a "find /" on a VM (like vyos) to produce a lot of console output, will make the cpu spike to %100. Can easily hit ctrl-c to stop and cpu goes back to normal. Will need to profile this to see what i'm doing wrong. Holy exception's batman! No ascii banana in comments. My first asyncio program, buyer beware. Currently using main with ports for my vyos replication. IPs and Ports are hard coded in main currently. Have not looked into how to hook this into GNS3 yet. |
BTW I almost delete this today before stuffing into git. I might have barfed if that happened. Night! |
PR submitted - This is only for a code review. I don't expect this to be merged. No issues in my POC, could use much testing. Oh btw.. didn't relize until today telnetlib3 2.0.x requires python 3.7 or higher. :/. |
I canceled my original PR since I noticed it had debug statements from telnet_server.py which wasn't meant to be included. |
I found the same thing, I've updated my PR to see where that leaves integration. telnetlib3 did at some point support python 3.6, but support was dropped to focus on 3.9 onwards, a concern I have is how long will 3.7 be supported.. I'll see if there is a roadmap etc that gives any insight. Suppose this can be the risk to abstraction using libraries. |
The telnet client (not the server) in telnetlib3 also doesn't support windows do to use of python termio. That it me means this should be a 3.0 only thing. |
No comment to this? If its not clear, your patch is making it so special characters are being sent over the telnet session when you open the console. The python program was just to demonstrate the issue. Dang it.. ok last edit (I lied i edited this like 3 more times, now i'm really done). I'm also not sure why you removed the socket calls. |
Apologies @spikefishjohn, when you say 'your patch' are you talking about my fork:branch of gns3-server? If so my bad, I didn't realise you was using it ? |
Apologies, talking about PR2342. To really spell it out, I installed PR2342 and found it introduces a new issue where 3 special characters are sent to the GNS3 node every time a new telnet session is opened. These special characters can for example interrupt the boot process of a GNS3 node if for example they're sitting at a prompt that says press any key to interrupt (like grub boot loader). The python code here is designed to demonstrate the issue. I say I created it, but really I asked gipty. Amazing how well that works if you know exactly what your trying to do. Anyway, hopefully that is more clear if it wasn't before. |
@grossmj Just to make sure you've seen this since I've been a little spammy on this thread. telnetlib3 only supports Linux. I don't know all the places GNS3 works but if the server can run on windows then my assumption is this should only be a 3.0 thing. |
Oh good catch, I wasn't aware of that. Then indeed it should only be a 3.0 thing. |
@spikefishjohn I couldn't find anything in the doc about the telnetlib3 server can only run on Linux, please can you tell me where you found this info? Thanks 🙏 |
I made a note on my PR2356 telnetlib3 has this issue. Maybe its just the shell that only supports linux. Im not using client_shell on my PR so maybe this isn't that big of a deal? a link to the client_shell.py using termios (only supported on linux) |
@spikefishjohn good spot, thanks for the detail on the 3 special characters, will take a look and see what going on. Update: I'm yet to replicate the issue, are you only observing it on vyos appliances? |
@KCarmichael did you run the python code? The characters aren't visible otherwise. I noticed it on vyos, which is just linux with the login console attached to the serial port. Any linux os with serial console should be able to replicate it. Also backing out the change stops the issue from happening. knee jerk reaction is double telnet negotiation, but its just a theory. |
Howdy @grossmj / @KCarmichael! This (to me) is a list of issues with the PR2342. Just throwing it out there.
|
Agreed, I didn't see any super(). Looks like you are just using the telnet commands from telnetlib3 without actually have the lib handle anything.
I was wondering the same :) Thanks. |
Apologies, I've been tied up, thanks for the feedback both, I did have super initialized in another code base, I didn't catch it wasnt in this ver! (Doh!) I removed sockets as I wanted to understand how the rewrite would perform without any additional patches, I'd assumed maybe incorrectly that telnetlib3 would handle connections more robustly, perhaps change previously observed behaviours. The intention was to start with a stripped back implementation, abstract as much functionality to telnetlib3 as possible to start and then patch what new bugs are found, for instance re-adding socket calls. I'll take a look this weekend, re-add the super initialization. Bare with me, like I say, first time contributing to open source! Any further tips, advice or guidance is always appreciated! Kieran |
No worries, everyone starts somewhere. This is a dialog and we're just asking questions and look, if you don't understand something you can say that, its perfectly fine. I remember when @grossmj added those socket commands and I was like, I don't get this at all. I thought the sockets objects weren't being used but what I didn't understand was the socket object was a reference to the socket from the telnet object's socket. He didn't have to create a ref to the socket but it made it look cleaner by doing so. I think we're both on a python journey here and I'm by no means and expert, which is why I was asking. This isn't a day job and don't take it like you have to work on this this weekend. Just calling that out. |
@grossmj Question, what is the goal of migrating to telnetlib3? Like is there a feature or something your looking for with telnetlib3 or was it mostly that telnet console bug? |
Yes mostly because of the Telnet console bug and also to have a clearer code / separation in preparation of SSH console support. |
Apologies chaps, will be back on this soon. Worklife performance review season so time limited. |
No worries, we have similar issues ;) |
No description provided.
The text was updated successfully, but these errors were encountered: