Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ReceivedPong doesn't consider late/not in order pong messages #131

Open
aienabled opened this issue Dec 20, 2019 · 8 comments
Open

ReceivedPong doesn't consider late/not in order pong messages #131

aienabled opened this issue Dec 20, 2019 · 8 comments

Comments

@aienabled
Copy link
Contributor

aienabled commented Dec 20, 2019

Hello!

While investigating random timeout connection issues (on perfect connections) I've noticed that ReceivedPong could handle only the latest ping request (m_sentPingNumber), which is certainly not perfect in some cases

internal void ReceivedPong(double now, int pongNumber, float remoteSendTime)

I've also noticed that NetConnection.ResetTimeout() call in the NetReliableSenderChannel to not working as expected...the client is dropping connection by timeout (set as 8 seconds) in some rare but 100% reproducible (for some clients) cases. I will keep investigating this issue.

Regards!

@kenfalco
Copy link

kenfalco commented Apr 11, 2020

Hi, we are also investigating the connection timeout.
Looking at the code, I noticed that the deadline is extended only when it receives the PONG from its PING (client).
In my opinion it should also be extended in the following cases:

  1. I get a PING from the server
  2. I receive any packet from the server

What do you think?

P.S aienabled yes we have extended the deadline also within this if
if ((byte) pongNumber! = (byte) m_sentPingNumber)

@aienabled
Copy link
Contributor Author

aienabled commented Apr 11, 2020

@kenfalco

  1. sounds good.
  2. with this change I think we might get some corner cases when the bad connection barely can manage the traffic but doesn't get a timeout. BTW, for reliable data sender it seems to be already implemented here (it's resetting the connection timeout when getting ACK under certain circumstances).

BTW, since reporting the issue I have not tried any changes in the ReceivedPong method. I'm considering trying them now but there is one observation that makes me confused. We have few users with ideal connections (ping around 30 ms and no packet loss, as indicated by WinMTR tool) which are getting disconnects consistently—often to all the servers we have around the world! So it doesn't seem to be relevant to the ping result handling as ping packets are sent relatively rarely and should be arriving in the correct order due to a low ping and no packet loss.

@kenfalco
Copy link

kenfalco commented Apr 13, 2020

Hi, I checked and I have an old version of Lidgren (2014) and I didn't have this line of code.

m_connection.ResetTimeout (now);

I have analyzed the differences between my code and new Lidgren and there is this thing that I don't like:

lock (m_connections) { foreach (NetConnection conn in m_connections) { conn.Heartbeat(now, m_frameCounter); if (conn.m_status == NetConnectionStatus.Disconnected) { // // remove connection // m_connections.Remove(conn); m_connectionLookup.Remove(conn.RemoteEndPoint); break; // can't continue iteration here } } }

Why break if you remove a connection?
For example, if you find a connection halfway through each time, it may take a long time to process some connections and maybe can time out.

This is the code in my version:

lock (m_connections) { for (int i = m_connections.Count - 1; i >= 0; i--) { var conn = m_connections[i]; conn.Heartbeat(now, m_frameCounter); if (conn.m_status == NetConnectionStatus.Disconnected) { // // remove connection // m_connections.RemoveAt(i); m_connectionLookup.Remove(conn.RemoteEndPoint); } } }

It seems more correct to me.

P.S We are investigating why with 11,000+ users at the same time we have group disconnections of about 1000-2000 users.
we don't know if can be a hardware or software problem.

Online

@aienabled
Copy link
Contributor Author

aienabled commented Apr 13, 2020

@kenfalco I've checked the master branch in this repository and there is no break in the connections heartbeat foreach loop https://github.com/lidgren/lidgren-network-gen3/blob/master/Lidgren.Network/NetPeer.Internal.cs#L374

I see, so your issue is different. Not some random user disconnecting but massive disconnects.
I think this chart alone (online stats) is not enough to make an educated guess on what could be a culprit. The connections drop could be related to:

  1. performance issue in your application which is causing socket buffers overflow or processing thread freeze
  2. GC sudden spikes taking several seconds to complete
  3. ISP or routing issues—router failure, etc. It could be also related to anti-DDoS feature in the datacenter.

It's best to gather more metrics in your application (+GC events), OS, and hardware (CPU/RAM/network).

@kenfalco
Copy link

Yes you are right I was looking at another version of Lidgren.

We have analyzed lidgren's logs and the only message that often appears in the most busy hours is:

Socket exception: System.Net.Sockets.SocketException (0x80004005): The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress
at System.Net.Sockets.Socket.ReceiveFrom (Byte [] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, EndPoint & remoteEP)
at Lidgren.Network.NetPeer.Heartbeat () in D: \ Workswc \ CardGames \ trunk \ Online Payment Server \ Lidgren.Network \ NetPeer.Internal.cs: line 432

As soon as possible we will update Lidgren to the latest version.

@aienabled
Copy link
Contributor Author

aienabled commented Apr 14, 2020

@kenfalco I've found a mention that it's related to the packet TTL expiration (even in the case of UDP packets):

For a datagram socket, this error indicates that the time to live has expired.

You can try increasing packet TTL—for example m_socket.Ttl = 255; put after this line:

m_socket.Blocking = false;

will increase it to max possible.
I would also recommend temporarily relocating to a new hardware in another datacenter just to ensure the issue is not related to the hardware/network failure.

@kenfalco
Copy link

Hi, thanks for the advice i will try it.
P.S
I tried to download the latest version of Lidgren but it doesn't compile on IOS.
First he gave this error:

#114

And I fixed it as they say at the end of the post.

Then he started giving this other ...

NullReferenceException: Object reference not set to an instance of an object.
at Lidgren.Network.NetPeer.InitializeNetwork () [0x00000] in <00000000000000000000000000000000>: 0
at Lidgren.Network.NetPeer.Start () [0x00000] in <00000000000000000000000000000000>: 0
at Menu.Update () [0x00000] in <00000000000000000000000000000000>: 0

But is Unity + IOS no longer supported?

@aienabled
Copy link
Contributor Author

@kenfalco I cannot provide a response as I'm not the library developer and not developing any applications for iOS. My contributor bage is only because I've submitted a pull requested and it was accepted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants