King Arthur IV, on 03 June 2013 - 09:13 PM, said:
Let me break this down, for n00bs to understand.
Karl Berg, on 03 June 2013 - 08:51 PM, said:
Hey guys, it's worth at least a brief explanation about what's been going on.
It demands its own Command Chair post, because it sounds epic.
Quote
A change last patch, which was intended to address huge packet send delays induced by small amounts of packet loss, ended up causing very large numbers of small packets to be sent by mistake, and due to other bugs this was not detected until too late.
The reason for the huge latency with minimal packet loss is that CryNetwork implements a form of flow control designed to detect and correct for bad network conditions. This flow control would rapidly throttle back the size of sent packets to ridiculously low values, 13-byte payloads in fact, and would take several tens of seconds to recover. We also determined that a core loop that was designed to flush outstanding messages to the socket was not iterating correctly, causing the system to send only a single packet per game update loop. The teeny packet sizes, combined with the single packet per frame, caused traffic to get extremely backlogged and would induce huge delays into the network layer.
Packet loss is bad (the packet never reaches its destination, or it gets mangled in transit, which is all bad), so what the flow control does in response is cut the packets down into smaller chunks. Think of it like eating food... some people can swallow big chunks of food like a vacuum, others use knives and cut it into smaller pieces so it fits their mouths.
The problem is that the changes made to address packet loss interacted badly with the rest of the network code, which made pings on certain connections swing wildly. You don't want to keep sending tiny packets, it's bad for the network (think 56k if you will). You want to send the biggest packet as OFTEN as possible and as RELIABLY as possible. That's part of writing good netcode that works with its environment.
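To put rough numbers on why that backs up so badly, here is a minimal Python sketch. Only the 13-byte payload and the one-packet-per-update limit come from Karl's post; the 30 updates per second and the 2000 bytes per second of outgoing game data are made-up numbers for illustration, and the function names are hypothetical.

```python
def drain_rate(payload_bytes, packets_per_frame, frames_per_second):
    """Bytes per second the sender can actually push out."""
    return payload_bytes * packets_per_frame * frames_per_second


def backlog_after(seconds, produced_per_second, payload_bytes,
                  packets_per_frame, frames_per_second):
    """Bytes still waiting in the queue after `seconds` of steady traffic."""
    rate = drain_rate(payload_bytes, packets_per_frame, frames_per_second)
    return max(0, (produced_per_second - rate) * seconds)


# Assumed: 30 game updates per second, 2000 bytes/s of data to send.
print(drain_rate(13, 1, 30))               # throttled send capacity in bytes/s
print(backlog_after(10, 2000, 13, 1, 30))  # queued bytes after 10 seconds
```

With those assumed numbers the throttled connection can only move a few hundred bytes per second, so the queue keeps growing until the flow control recovers.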
Quote
We 'corrected' the send loop by fixing their main packet loop to at least iterate until all pending messages had been sent. Unfortunately the send queue could block messages, causing the loop to iterate too many times, causing small micro packets to get transmitted.
Now normally we would have caught this, since we monitor network traffic very closely. It turns out, unfortunately, that while the CryNetwork traffic metrics correctly monitor received traffic data usage, the send traffic data usage is bugged and was not accounting for the overhead these tiny packets were causing.
We have corrected both our buggy fix to the send loop, and the network data usage counters for this next patch coming out tomorrow. This should fix the increased latency some users are experiencing, as well as the erroneous DoS detection that this bug caused.
Simply put, they had to spend more time making sure their network monitoring tools keep up with the changes in their code.
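For anyone wondering why tiny packets are so expensive and why a send counter that ignores overhead misses it, here is a rough Python sketch. This is NOT CryNetwork's code: `flush`, `wire_bytes`, and the 1200-byte payload cap are hypothetical, and the 28-byte overhead assumes plain UDP over IPv4 (20-byte IP header plus 8-byte UDP header, no options).

```python
from collections import deque

HEADER_BYTES = 28  # assumed: IPv4 header (20) + UDP header (8), no IP options


def flush(queue, max_payload):
    """Drain the whole queue, packing messages into as few packets as possible.

    Assumes every individual message fits within max_payload.
    """
    packets = []
    current = bytearray()
    while queue:
        msg = queue.popleft()
        # Start a new packet only when the next message would not fit.
        if current and len(current) + len(msg) > max_payload:
            packets.append(bytes(current))
            current = bytearray()
        current += msg
    if current:
        packets.append(bytes(current))
    return packets


def wire_bytes(packets):
    """Bytes on the wire, counting per-packet header overhead."""
    return sum(len(p) + HEADER_BYTES for p in packets)


messages = [b"x" * 13 for _ in range(10)]
packed = flush(deque(messages), max_payload=1200)
print(len(packed), wire_bytes(packed))  # packets sent and wire bytes, coalesced
print(wire_bytes(messages))             # wire bytes if each message goes alone
```

A counter that only adds up payload sizes reports the same 130 bytes in both cases, which is roughly the kind of blind spot described in the quote.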
Quote
This leads me to the disconnect issue.
While it is extremely unfortunate that the previous bug made its way to production, it turns out that this allowed us to isolate and address this major bug which was causing disconnect to mechlab issues. It turns out that this whole time, at the very lowest layer, CryNetwork has been using only a single byte for packet sequence IDs. This is an extremely small size, providing only 256 possible sequence values; and we've determined that if there is a large change in connection latency causing these sequence numbers to overflow, the engine detects this as a 'malformed packet' error and forces a disconnect.
The large number of small packets introduced with last patch caused the network layer to burn through these sequence numbers at a much higher rate, hence the increased number of disconnects.
We have now doubled this sequence number size from 1 byte to 2, or from 256 possible sequence values to 65536. This increases the engine's tolerance for delayed packets from a second or two at most to something far more sane, closer to 4 or 5 minutes in fact.
Ah... built-in limitations are bad. I assume the compression made this numbering a lot more sensitive.
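The "second or two" versus "4 or 5 minutes" numbers are easy to sanity check. A minimal Python sketch, assuming a packet rate of 256 packets per second (my number, picked only because it lands close to Karl's figures; the real rate isn't stated). The `seq_newer` helper shows the usual wraparound-aware comparison used for small sequence counters; I'm not claiming that is how CryNetwork compares them.

```python
def wrap_seconds(seq_bytes, packets_per_second):
    """Seconds until a sequence counter of seq_bytes bytes wraps around."""
    return (2 ** (8 * seq_bytes)) / packets_per_second


def seq_newer(a, b, seq_bytes):
    """True if sequence number a is newer than b, allowing for wraparound.

    Only meaningful while the two are less than half the number space apart.
    """
    modulus = 1 << (8 * seq_bytes)
    diff = (a - b) % modulus
    return 0 < diff < modulus // 2


# Assumed rate: 256 packets per second.
print(wrap_seconds(1, 256))       # 1-byte counter: seconds until wrap
print(wrap_seconds(2, 256) / 60)  # 2-byte counter: minutes until wrap
print(seq_newer(2, 250, 1))       # 2 comes after 250 once the counter wraps
```

The catch with a 1-byte counter is that half the number space is only 128 packets, so a packet delayed by more than that looks like it came from the wrong side of the wrap.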
Quote
This change is also coming out next patch.
I greatly apologize for all the grief these bugs have caused. It's been a pretty rocky set of patches up to now as we've tried to iron out all the issues we've been having with this engine. In addition to the fixes listed above, we've added a whole new set of test conditions to the QA test plan designed to catch and prevent similar issues. Already these new tests have caught some issues on login that users with very poor connections may have been experiencing. We have done our best to address these login issues with tomorrow's patch as well.
I was suffering, so I'm thrilled that this is being resolved.
Edited by Deathlike, 03 June 2013 - 09:27 PM.