Spoiler
Karl Berg, on 14 May 2013 - 09:56 AM, said:
Hey everyone, With the HUD bug hotfix coming out soon, I’ve been asked to provide a write up about what we found, and explain a bit about why it took so long to resolve these issues. I was brought on to help resolve these issues relatively late in the process, once we suspected that the problem did not reside within Piranha’s internal code, but was in-fact a pre-existing issue either in the game engine we were using, or in some other external system used by the game engine in turn. As you all are probably aware, MWO is built on CryEngine. CryEngine uses the very popular Scaleform flash library for all of its UI, and additionally has replaced the default windows implementation of the standard template library with a third-party implementation called STLPort. Our first challenge in solving this issue was simply finding a reliable process for reproducing and detecting HUD corruption issues to begin with. The issue appeared to be affecting random bits of the HUD UI with extremely random behaviour. The CSR’s, QA, and community testers spent a lot of time scraping through community posts trying to collect common themes and patterns, and testing what they found to see if there was any discernible correlation that might help us out. By the time this process was completed, the QA team had run hundreds, if not thousands, of games in the testlab under varying circumstances, visually inspecting each screen for any signs of HUD corruption and collecting any common factors, such as Windows versions, common hardware configurations, GPU driver versions, etc. Several leads were investigated during this process, including 64-bit issues, alt-tabbing, and low memory conditions, but unfortunately none of these were capable of providing us with a 100% reproduction rate. When I was brought on, our senior graphics and UI engineers had determined that there was a major issue somewhere in the CryEngine to Scaleform layer that was basically causing C++ calls made by the game to invoke incorrect flash functions in Scaleform. This was after significant effort was spent tracking down and investigating other suspected causes, such as random heap memory corruption, or threaded race conditions. We first spent a considerable amount of time attempting to reproduce the issue with a debug build and breakpoint a failure case, in the hopes that the error was obvious, but also to get an idea where the issue truly resided so we knew where to focus our efforts. Eventually we did manage to reproduce one of these issues with a debug build, and what this effectively confirmed for me was that there was almost certainly an issue somewhere in CryEngine itself. Starting from here, it became important to understand as much about the interactions between CryEngine and Scaleform as possible. What followed was about two days of line by line debugging and auditing code in order to figure out what the intended behaviour of these systems were. I’ve got several pages of notes from this process, outlining general flows and several class structures. Once I had a good idea of how this system was intended to work, I spent some time building up a set of validation and verification routines to detect when these subsystems entered an invalid state. I was specifically interested in a convoluted set of lookup structures which mapped string handles to Scaleform object instances. The end result was an extremely slow build, but which would crash and generate stack call data whenever it detected that the CryEngine scaleform layer had entered an invalid state. This immediately produced results for us. CryEngine has built a global string table subsystem, which is used to accelerate specific string operations as well as save memory, similar to the string tables kept by some other languages. Additionally, users can create handles to this global string table in the form of CryName instances. Internally, we found that this global string table is built upon a case-insensitive string hashing mechanism, which it used to determine whether or not an entry exists within the global table or not. This case insensitive compare is most likely a result of the case insensitivity of the windows filesystem, and the probable roots of this CryName system in resource handling. In the case of Scaleform it was actually an error. This meant, if the UI team had ever added two flash elements to a single player which differed only in case, one of those elements would have been essentially inaccessible to us. Unfortunately, it turned out that the case-insensitivity issue was not related to the current HUD corruption issues on production, so for now I revised the verification routines to similarly ignore case, and we continued with our debugging. Again, we reproduced a real instance of the HUD issue with the revised verification code, and this time we managed to isolate a single function call as having put the system into a bad state. This was enormous progress, as we could now focus on this single specific function. We audited the function line by line, and unfortunately determined that while messy and initially confusing, it was actually safe, and should not have resulted in the behaviour we were observing. It’s important enough to note that internally this issue had an extremely low reproduction rate, and that this function was invoked hundreds of times every time a client loaded into a game. For this reason it was impossible to know ahead of time whether or not a given single call would produce the behaviour resulting in corruption. Instead, I went back and added extremely heavy logging to this specific function in order to determine the flow sequence that was resulting in the HUD corruption. Once again we reproduced the issue, and we were left with baffling results. The logs indicated that the STL library itself was in error; but this was nearly inconceivable. The data structure from in question was an std::map, which was implemented as a red-black tree in STLPort. This is an extremely efficient self-balancing tree structure with excellent performance characteristics. STL data structures for user defined types take a user defined sorting criteria. This sorting criteria must guarantee a property known as strict-weak ordering. I had already examined this sort operator for correctness, which was one of the CryName structures in question, and determined that the comparison operator definitely obeyed strict-weak ordering. In fact, since the global name table guaranteed one instance for any given string in memory at the time, CryName used absolute memory addresses as their sorting criteria. Now, this sounds terribly unsafe, and I spent a good while verifying that a CryName would never be deleted from the global table and re-added at a different memory location while being stored as a key value in one of these mapping structures. CryEngine guarantees this by keeping reference counts for every CryName. However, what we DID find was extremely surprising. STLPort has a whole category of data structure assessors which take member function template parameters, rather than template parameters defined by the data structure types themselves. This appears to actually violate the STL spec, as the function signature for a map access is definitely defined in terms of the key type for the data structure itself. The problem however manifested as follows: CryScaleform layer attempted to lookup a structure which did not yet exist in their cache layer, this triggers a load and bind operation. This lookup operation involved passing a string key value as a const char * to the map operator[] function; however the operator[] function was templated in such a way that the key value provided actually expanded to a const char * at compile time rather than a CryName instance. What followed was a terrible failure case, where a CryName instance would be constructed, inserted into the global name table, and compared against a given node within the map, and then destroyed and completely removed from the global name table; only to happen all over again at the next level down in the red-black tree. This definitely threw away all concept of strict-weak ordering. Now, if the global name table happened to allocate the comparison CryName object at differing memory locations, what would result was essentially a random walk through the tree structure, and the final map node returned to us was completely unpredictable. The only reason any of this system tended to ever work was because of the tendency of the memory allocation subsystem to reuse the same memory address when reallocating one of these global name table entries. As soon as the memory subsystem deviated from this behaviour, the function in question would fall apart and generate completely unpredictable results. This was the root cause of our HUD woes, and once found was relatively easy for us to address. Unfortunately this leads us to very serious bug number two. Once the mapping structures were repaired and this subsystem was put back into a good state, QA located another bug; this time relating to a race condition between precached texture loads. This manifested only as missing minimap textures. Jin, our senior graphics guy, very quickly located this issue and identified it as an incorrectly scoped lock during the precache phase in CryEngine. With these two issues finally resolved, the QA reproduction rates finally dropped to 0%. Hopefully this gives you some sense of how deep these issues ended up taking us, and why it ended up taking so much time. tl;dr: The HUD bugs were a result of undefined behaviour in CryEngines scaleform wrapper library, which took considerable effort for us to track down and fix.
This thread will serve as the official and dedicated feedback channel. Please keep your feedback, responses, comments and concerns limited to this thread so that we may ensure no valuable opinions are lost.