Debugging a race condition in a Node.js test

We will demonstrate debugging an intermittent race condition in a Node.js test, with little prior knowledge of Node.js internals, with Pernosco.

The first step is to capture a recording of the bug. We took advantage of rr's chaos mode to inject additional non-determinism into Node.js and shortened our reproduction time. With traditional debugging approaches more non-determinism often makes a system much more challenging to debug. But with advanced tooling like Pernosco additional randomness is no challenge at all. We captured a recording in just a handful of test runs and uploaded it to Pernosco. Once that finished processing Pernosco emailed us a link to our debugging session.

When we open the session we immediately see that Node.js crashed with a SIGSEGV. Automated heuristics have determined that this is the most "interesting" thing that happened in the program, and set the initial focus to the crash. It is not immediately obvious why we crashed here, but you may notice that handler_ is neither a parameter nor a local variable in this function, suggesting that it is a member variable of this class. (Ending member variable names with '_' is also a common practice in C++ codebases).

Moving over to the stack, we expand the parameters to TcpHolder::WriteRaw (the current function) and we can see that this function was called with a NULL this pointer. That explains the proximate cause of this crash: the program attempts to fetch the handler_ member variable from an invalid object. Before moving on, we open the notebook view and jot down what we have discovered so far.

The next question to answer is why this is NULL.

With traditional debugging approaches this could be rather difficult. If the cause is not obvious from the stack (and it is not here, try it with the link above) you would need to identify the possible reasons that this pointer could be NULL, attempt to rule those causes in or out based on information available in a debugger at the time of the crash, or potentially rerun the program with additional logging or other instrumentation to observe the state at an earlier point in time. This requires a pre-existing familiarity with or time spent learning how the code in question works. And because this is an intermittent failure, there is also the additional burden of managing to reproduce the same failure again.

With Pernosco, you just click on the incorrect value.

With full program history Pernosco can immediately explain where this value came from. The value is tracked backwards through events such as memcpys or moves in and out of registers until we reach a point where the value "originated" from, and each step in this process is displayed in the "Dataflow" panel that opens automatically. There is no need to read or understand the code and think about what might have happened, we can simply ask the debugger what did happen.

At a glance we can see that std::unique_ptr::reset was called. If we click on that item in the explanation we'll immediately jump to the reset call. Using the stack view to go up one frame, we can see that the OnEof method is calling reset on tcp_ because the tcp socket here has closed already.

The answer to our question of "why is this NULL?" is because this tcp connection has already closed. We make a note of this in the notebook as well.

The final question we need to answer is why the program is attempting to send a message on a closed tcp connection.

To do that, we return to the original crash site. Entries in the notebook double as bookmarks, so clicking on one will take us to the moment in the program where we made that notebook entry. From there, we can examine the program stack. The most recent several frames appear to be the mechanics of sending a message and writing it to a tcp socket, but a little further up there's a more promising function called Dispatch.

We can see that this is a member function of a RequestToServer object. If we go one more frame further up the stack we can see that this RequestToServer instance is currently being dequeued from some sort of message queue. A natural question to ask is when did this message was added to the message queue. If we scroll around a little bit in the source we'll query find a Post method which adds messages to the queue.

By clicking on a line in this function, we can see when it is executed. A condition expression matching the message argument to the value we see on the stack when the message is dequeued narrows our search down to a single execution, and clicking on it will take us to the moment it was executed.

A glance up the stack will lead us to an executionContextDestroyed method that sends this message. We add a final notebook entry at this point.

In just a few short minutes we discover the complete sequence of events in this race condition. A message is queued to report that the execution context has been destroyed, but before it can be sent, the tcp connection the program wants to send it on is torn down. Then, when the program finally does get around to sending the message, it crashes because the tcp connection is gone.