Torchlight 3 Lead Programmer Explains Launch Problems With Zombie Metaphors And Rubber Ducks
Torchlight 3 lead programmer Somberg has given an in-depth account of the problems that plagued the Torchlight 3 launch weekend and what the team did to fix them in the latest State of the Game update on Steam. When I say in-depth, I mean a lengthy, blow-by-blow technical account of each and every problem, explained with zombie metaphors and a parable about rubber ducks.
Apparently, the disconnection issues and problems with traveling to another zone that plagued the game on launch were caused by “zombie processes” that inadvertently took up server space. Somberg explains:
“[. . .] We inspected the processes on the servers that were running and saw some defunct processes - colloquially called ‘zombie processes’, which gave rise to the nickname ‘zombie zones’. A ‘defunct’ process in Linux is a process which has been killed, but is still lingering. It will still hold on to some minor resources such as file handles and such, but it is officially dead - but not gone. Thus: zombies.”
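For readers who have never seen one, here is a minimal sketch of how a defunct process comes about on Linux. It is not the Torchlight 3 server code, which the update does not show; the function names `process_state` and `demo_zombie` are made up for illustration, and the sketch assumes a Linux machine with `/proc` mounted and the default SIGCHLD handling. Strictly speaking, a process becomes a zombie when it has exited (whether killed or not) and its parent has not yet collected its exit status.

```python
import os
import time
from typing import Optional, Tuple


def process_state(pid: int) -> Optional[str]:
    """Return the one-letter state from /proc/<pid>/stat, or None if the pid is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state letter is the first field after the parenthesised command name.
    return stat.rsplit(")", 1)[1].split()[0]


def demo_zombie() -> Tuple[Optional[str], Optional[str]]:
    """Create a short-lived zombie and report its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running any cleanup handlers.
        os._exit(0)

    # Parent: deliberately do NOT wait yet. Poll until the kernel marks the
    # exited child as a zombie ("Z"), giving up after five seconds.
    deadline = time.monotonic() + 5.0
    state = process_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = process_state(pid)
    before = state

    # Collecting the exit status ("reaping") removes the zombie entry.
    os.waitpid(pid, 0)
    after = process_state(pid)
    return before, after


if __name__ == "__main__":
    before, after = demo_zombie()
    print("state before reaping:", before)
    print("state after reaping:", after)
```

While the parent holds off on calling `os.waitpid`, tools such as `ps` list the child as `<defunct>`. Once the parent reaps it, the entry disappears. A parent that never reaps its children leaves such entries lingering, which is the general situation the quote describes.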
After identifying the cause of the issues, the developers set out to “kill” all the zombies and figure out why they were appearing in the first place. According to Somberg, killing the zombies was a simple matter of rebooting the servers, but fixing the issue itself wasn’t so easy. “It was like playing Whack-a-Mole - as soon as we had cleaned up one set of servers, another one would get infested with zombies again.”
“At this point, it was back to the drawing board. We knew generally where the problem had to lie, but did not have a specific root cause. So, we busted out our rubber ducks,” continues Somberg. “What do rubber ducks have to do with it? The parable goes that a junior engineer walks into a senior engineer’s office and asks for help with a problem. The senior engineer stops the junior engineer short, hands over a rubber duck and says ‘Tell it to the rubber duck, then tell it to me,’ then turns around and goes back to work. The junior engineer thinks this is weird, but shrugs and starts to tell the rubber duck about the problem. ‘Whenever I call this function, it crashes. But it’s always on the last element of the list for some reason. Oh! It’s because that element is one past the end, so I can’t dereference it!’ The junior engineer rushes off to fix the code.”
“The moral is that often times you just need to describe a problem to somebody else - anybody else - in order to realize what the problem is. ‘Rubber duck debugging’ is the process of just having somebody else describe to you how the system works and what the problem is.”
Somberg goes on to give a lengthy technical explanation of how the team identified and fixed the underlying issue, which boils down to server request timeouts caused by a problematic pattern used when loading new zones. “[. . .] We’re blocking up ZC’s ability to monitor zones, we’re blocking up the zone server’s ability to send messages, and we’re leaving these zombie zones around which occupy capacity that should be going to actual players,” he says. “All of these problems stacked up at the same time due to this one pattern that we were using, which caused all sorts of badness.” A fix was promptly rolled out late Sunday night.
You’ll find the full State of the Game update on the Torchlight 3 Steam News Hub.