rr Trace Portability: TZCNT

Posted by khuey on 21 October 2021

When we introduced trace portability to rr our primary goal was to enable use cases like Pernosco, where traces would be recorded on one machine and replayed on another for advanced analysis. At the time it was an open question exactly how portable these traces would be, but for the most part the x86 architecture is well behaved enough that things just work. As Pernosco usage has grown we have ingested traces from a larger collection of systems and discovered a few subtle differences. We have previously written about emulating RSQRTTSS and other "approximate" floating point instructions which produce different results on Intel and AMD processors. The most recent issue we've come across is related to the TZCNT instruction.

Backwards compatible?

The TZCNT (Count the Number of Trailing Zero Bits) instruction is often described as doing "basically the same thing as" or as being "an extension of" the BSF (Bit Scan Forward) instruction. There are two differences between them:

BSF is undefined for an input of 0 (in practice it leaves the output register unmodified), whereas TZCNT is defined to return the operand size (in bits).
BSF sets the Zero Flag (ZF) if the input is 0, and all other flags are undefined, whereas TZCNT sets the ZF if the output is 0, sets the Carry Flag (CF) if the input is 0, and leaves all other flags undefined.

So TZCNT and BSF only produce exactly identical results for inputs that are non-zero and do not have the least significant bit set. Because of the difference in flag behavior, TZCNT is not strictly an extension for BSF, but actually changes previous behavior that was well-defined for BSF.

Why does it matter?

That's all well and good, but why do we care whether or not TZCNT is an "extension" of BSF? We care because the opcode for TZCNT is an extension of the opcode for BSF.

The BSF instruction that's been around since the 386 was released in 1985 has the opcode 0F BC. The TZCNT instruction, introduced much more recently in Haswell, has the opcode F3 0F BC. These two opcodes differ only in the leading F3 byte. F3 is the REP prefix that's normally used with string instructions like MOVS or STOS to repeat them. While the Intel manuals say that using these prefixes with other instructions is reserved, in practice CPUs will generally discard the prefix and proceed normally with the opcode. And that's exactly what happens when a TZCNT instruction is executed on a CPU that does not support TZCNT: it is interpreted as REP BSF, the REP prefix is discarded, and it executes as a BSF.

If a program contains a TZCNT but was executed on a machine without TZCNT, and then the trace is replayed on a machine with TZCNT, the behavior of the program can change silently. Processing traces recorded on pre-Haswell machines in Pernosco is vulnerable to this problem.

The fix

While it may be tempting to say that you should simply not execute a TZCNT opcode on a CPU that does not support it, "your program should not have bugs in it" is not a tenable position for a debugger to take. And because TZCNT won't generate a SIGILL on an older CPU, it's especially likely to be missing a CPUID check.

Since Pernosco uses binary instrumentation to extract program state, it's straightforward for us to simply detect the case where a trace was executed on a machine without TZCNT and rewrite any TZCNT opcodes we encounter to be BSFs. We do something very similar with LZCNT, which has the same relationship with the BSR opcode. This enables traces recorded on e.g. Ivy Bridge to be replayed and processed by Pernosco.