Karl Berg, on 03 June 2013 - 08:51 PM, said:
Hey guys, it's worth at least a brief explanation about what's been going on.
A change in the last patch, which was intended to address huge packet send delays induced by small amounts of packet loss, ended up sending very large numbers of small packets by mistake, and due to other bugs this was not detected until too late.
The huge latency under minimal packet loss had two causes. First, we found that CryNetwork implements a form of flow control designed to detect and correct for bad network conditions. This flow control would rapidly throttle the size of sent packets down to ridiculously low values, 13-byte payloads in fact, and would take several tens of seconds to recover. Second, we determined that a core loop designed to flush outstanding messages to the socket was not iterating correctly, so the system sent only a single packet per game update loop. The teeny packet sizes, combined with the single packet per frame, caused traffic to get extremely backlogged and induced huge delays in the network layer.
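To make the backlog concrete, here is a rough simulation. It is not CryNetwork code: the function name, the 100 bytes of messages per frame, and the 1200-byte healthy payload are all assumptions picked for illustration. Only the 13-byte payload and the one-packet-per-frame limit come from the description above.

```python
def backlog_after(frames, produced_per_frame, payload_bytes, packets_per_frame=1):
    """Bytes still waiting to be sent after the given number of game frames."""
    backlog = 0
    for _ in range(frames):
        backlog += produced_per_frame  # the game queues new messages (assumed constant rate)
        sent = min(backlog, payload_bytes * packets_per_frame)
        backlog -= sent  # at most this much can leave per frame
    return backlog

# Assumed workload: 100 bytes of messages per frame over 300 frames.
print(backlog_after(300, 100, 13))    # throttled 13-byte payload, one packet per frame: 26100
print(backlog_after(300, 100, 1200))  # assumed healthy payload size: 0
```

With the throttled payload the queue grows by 87 bytes every frame and never drains, which is the kind of ever-increasing delay described above.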
We 'corrected' the send loop by fixing their main packet loop to at least iterate until all pending messages had been sent. Unfortunately, the send queue could block messages, which made the loop iterate too many times and transmit lots of small micro packets.
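One way this kind of failure can happen is sketched below. This is a hypothetical model, not the actual engine loop: the `(size, blocked)` message tuples, both function names, and the idea that a blocked message ends the current pass are assumptions. The second function shows one possible alternative, which is to gather everything sendable first and only then cut it into packets.

```python
from collections import deque

def flush_naive(messages, max_payload):
    """Loop until the queue is empty; a blocked message ends the current pass early."""
    pending = deque(messages)
    deferred = []  # blocked messages are kept for a later frame
    packets = []   # payload size of each packet that was sent
    while pending:
        packet_size = 0
        while pending:
            size, blocked = pending[0]
            if blocked:
                deferred.append(pending.popleft())
                break  # whatever was gathered so far goes out as its own packet
            if packet_size and packet_size + size > max_payload:
                break  # packet is full
            packet_size += size
            pending.popleft()
        if packet_size:
            packets.append(packet_size)
    return packets, deferred

def flush_coalesced(messages, max_payload):
    """Set blocked messages aside first, then pack the rest as tightly as possible."""
    deferred = [m for m in messages if m[1]]
    packets = []
    packet_size = 0
    for size, blocked in messages:
        if blocked:
            continue
        if packet_size and packet_size + size > max_payload:
            packets.append(packet_size)
            packet_size = 0
        packet_size += size
    if packet_size:
        packets.append(packet_size)
    return packets, deferred

# Four small sendable messages interleaved with four blocked ones (made-up workload).
queue = [(13, False), (13, True)] * 4
print(flush_naive(queue, 1200)[0])      # [13, 13, 13, 13]
print(flush_coalesced(queue, 1200)[0])  # [52]
```

The coalesced version assumes it is safe to send messages around a blocked one, which may not hold for a real ordered channel.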
Now, normally we would have caught this, since we monitor network traffic very closely. Unfortunately, it turns out that while the CryNetwork traffic metrics correctly track received data usage, the send-side data usage counter is bugged and was not accounting for the overhead these tiny packets were causing.
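A quick sketch of why a counter that only sums payload bytes hides this problem. It assumes the traffic is UDP over IPv4, where each packet carries a 20-byte IPv4 header (without options) plus an 8-byte UDP header; the function names and the 1300-byte workload are illustrative only.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
UDP_HEADER_BYTES = 8

def payload_bytes(packet_payloads):
    """What a counter sees if it only adds up payload sizes."""
    return sum(packet_payloads)

def wire_bytes(packet_payloads, overhead=IPV4_HEADER_BYTES + UDP_HEADER_BYTES):
    """Payload plus per-packet header overhead (assumes UDP over IPv4)."""
    return sum(size + overhead for size in packet_payloads)

tiny = [13] * 100  # 1300 bytes sent as one hundred 13-byte packets
single = [1300]    # the same 1300 bytes sent as one packet

print(payload_bytes(tiny), wire_bytes(tiny))      # 1300 4100
print(payload_bytes(single), wire_bytes(single))  # 1300 1328
```

Both cases report the same 1300 payload bytes, but the tiny packets put roughly three times as much data on the wire.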
We have corrected both our buggy fix to the send loop and the network data usage counters for the next patch, coming out tomorrow. This should fix the increased latency some users are experiencing, as well as the erroneous DoS detection that this bug caused.
This leads me to the disconnect issue.
While it is extremely unfortunate that the previous bug made its way to production, it did allow us to isolate and address a major bug that was causing the disconnect-to-mechlab issues. It turns out that this whole time, at the very lowest layer, CryNetwork has been using only a single byte for packet sequence IDs. This is an extremely small size, providing only 256 possible sequence values, and we've determined that if a large change in connection latency causes these sequence numbers to overflow, the engine detects this as a 'malformed packet' error and forces a disconnect.
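The ambiguity of a one-byte sequence ID can be shown with a generic wraparound comparison. This is a sketch of standard modular sequence arithmetic, not the engine's actual check; `seq_delta` and its half-window rule are assumptions for illustration.

```python
def seq_delta(newer, older, bits):
    """Signed distance from older to newer in a sequence space of 2**bits values."""
    modulus = 1 << bits
    half = modulus >> 1
    diff = (newer - older) % modulus
    # Distances of half the space or more are read as "behind" rather than "ahead".
    return diff - modulus if diff >= half else diff

# The receiver last saw sequence 0, and the sender is really 300 packets further on.
print(seq_delta(300 % 256, 0, 8))     # 44: one byte cannot tell 300 apart from 44
print(seq_delta(300 % 65536, 0, 16))  # 300: two bytes still see the true gap

# A true gap of 200 packets looks like a packet from the past with one byte.
print(seq_delta(200, 0, 8))           # -56
```

Once the real gap exceeds what the ID can express, the receiver has no reliable way to place the packet, which is the sort of situation that can end up flagged as malformed.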
The large number of small packets introduced with the last patch caused the network layer to burn through these sequence numbers at a much higher rate, hence the increased number of disconnects.
We have now doubled this sequence number size from 1 byte to 2, taking it from 256 possible sequence values to 65,536. This increases the engine's tolerance for delayed packets from a second or two at most to something far more sane, closer to 4 or 5 minutes in fact.
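The arithmetic behind those figures, as a back-of-envelope sketch. The rate of 240 packets per second is purely an assumption chosen so the numbers land near the ones quoted above; the real rate is not stated here. Whatever the rate, going from 8 to 16 bits makes the window 256 times longer.

```python
def wrap_seconds(bits, packets_per_second):
    """Seconds until a sequence ID of the given width has used every value once."""
    return (2 ** bits) / packets_per_second

ASSUMED_RATE = 240  # packets per second, illustrative only

one_byte = wrap_seconds(8, ASSUMED_RATE)    # about 1.07 seconds
two_bytes = wrap_seconds(16, ASSUMED_RATE)  # about 273 seconds

print(round(one_byte, 2), "seconds")
print(round(two_bytes / 60, 2), "minutes")
print(two_bytes / one_byte)  # 256 times longer, independent of the rate
```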
This change is also coming out next patch.
I greatly apologize for all the grief these bugs have caused. It's been a pretty rocky set of patches up to now as we've tried to iron out all the issues we've been having with this engine. In addition to the fixes listed above, we've added a whole new set of test conditions to the QA test plan, designed to catch and prevent similar issues. Already these new tests have caught some issues on login that users with very poor connections may have been experiencing. We have done our best to address these login issues with tomorrow's patch as well.
Love these in-depth posts, Karl. It's great from a comp-sci perspective to see what's going on, and why my ping has been spiky as of late.
Also -- it's interesting to see how much work has to be done on the CryEngine by you guys.