
rdtsc x86 instruction to detect virtual machines


A new version of pafish has recently been released. It comes with a set of detections that are completely new for the project (read: not new techniques), based on CPU information. To get this information, the code makes use of the rdtsc and cpuid x86 instructions.

Here we are going to look at the rdtsc instruction technique and how it is used to detect VMs.

What is rdtsc?

Wikipedia's description is pretty straightforward [1]:

The Time Stamp Counter (TSC) is a 64-bit register present on all x86 processors since the Pentium. It counts the number of cycles since reset. The instruction RDTSC returns the TSC in EDX:EAX. In x86-64 mode, RDTSC also clears the higher 32 bits of RAX and RDX. Its opcode is 0F 31.

So it is a counter that is incremented on each CPU cycle.

Well, it actually depends on the processor.

Initially, this value counted the actual internal processor clock cycles. It was meant for developers to measure how many cycles a routine takes to complete, which made it good for measuring performance.

In the latest Intel processor families, this counter increases at a constant rate, determined at boot by the maximum frequency the processor can run at. Maximum does not mean current, as power-saving features can dynamically change the processor's speed. This means it is no longer good for measuring performance, because the processor frequency can change at runtime and ruin the metric. On the other hand, it can now be used to measure time.

This is explained much better in reference [2].
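
To make the EDX:EAX detail concrete, here is a minimal way to read the counter from C with GCC-style inline assembly. This helper is my own illustration, not code from pafish or vmfun.c:

#include <stdio.h>
#include <stdint.h>

/* rdtsc puts the low 32 bits of the TSC in EAX and the high 32 bits in EDX,
 * so we combine them into a single 64-bit value. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    /* On recent Intel CPUs the counter increases at a constant rate, so a
     * delta between two readings can be turned into elapsed time if you
     * know the nominal TSC frequency. */
    printf("TSC: %llu\n", (unsigned long long)read_tsc());
    return 0;
}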

So, how is this used to detect VMs?

In a physical (host) system, subtracting the values returned by two consecutive rdtsc instructions results in a very small number of cycles.

On the other hand, doing the same in a virtualized (guest) system, the difference can be much bigger. This is caused by the overhead of actually running inside the virtual machine.

I wrote a small program to verify this behaviour; it does the subtraction ten times, with a sleep in between. You can get the source from here.
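
As a rough idea of what the program does, here is my own minimal sketch using GCC's __rdtsc() intrinsic; it is not the exact vmfun.c source, and the real sleep interval and output details may differ:

/* Sketch: two back-to-back TSC reads, print the difference, sleep,
 * repeat ten times, then print the average. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
    uint64_t total = 0;
    for (int i = 0; i < 10; i++) {
        uint64_t t1 = __rdtsc();
        uint64_t t2 = __rdtsc();
        printf("(%llu - %llu) rdtsc difference: %llu\n",
               (unsigned long long)t1, (unsigned long long)t2,
               (unsigned long long)(t2 - t1));
        total += t2 - t1;
        sleep(1);   /* sleep between samples, as described above */
    }
    printf("difference average is: %llu\n", (unsigned long long)(total / 10));
    return 0;
}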

This is similar to what pafish does. The output on a physical machine looks like this:

$ gcc -O2 vmfun.c && ./a.out
(81889337556698 - 81889337556746) rdtsc difference: 48
(81891335245484 - 81891335245508) rdtsc difference: 24
(81893332927964 - 81893332927988) rdtsc difference: 24
(81895330659684 - 81895330659708) rdtsc difference: 24
(81897326984696 - 81897326984720) rdtsc difference: 24
(81899324782460 - 81899324782520) rdtsc difference: 60
(81901322471630 - 81901322471690) rdtsc difference: 60
(81903320069632 - 81903320069656) rdtsc difference: 24
(81905317727808 - 81905317727832) rdtsc difference: 24
(81907314531066 - 81907314531078) rdtsc difference: 12
difference average is: 32

Try to compile and run this code with different compiler optimizations if you want to have some fun ;)
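
For completeness, turning the measurement into a verdict is just a comparison of the average against some cutoff. The following is a hypothetical sketch with an arbitrary threshold of 1000 cycles; it is not pafish's actual code or threshold:

#include <stdio.h>
#include <stdint.h>

/* Arbitrary illustrative cutoff; the real threshold used by a detector
 * like pafish may be different. */
#define VM_CYCLE_THRESHOLD 1000ULL

static int looks_like_vm(uint64_t avg_diff)
{
    return avg_diff > VM_CYCLE_THRESHOLD;
}

int main(void)
{
    /* The physical-machine run above averaged 32 cycles. */
    printf("avg 32   -> %s\n", looks_like_vm(32)   ? "VM suspected" : "looks physical");
    printf("avg 5000 -> %s\n", looks_like_vm(5000) ? "VM suspected" : "looks physical");
    return 0;
}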

This is the theory, but in practice it depends on the virtualization product, its configuration, and the number of cores assigned to the guest system.

For instance, VMware virtualizes the TSC by default. This can be disabled, but it is not recommended; the TSC virtualization can also be tweaked in the configuration. There is much more information about this in references [3] and [4].

There is also a substantial difference when the VM has two or more cores assigned. With one core, the differences are not that big and get close to a physical processor, although occasional peaks can happen. With two or more cores, the differences are much bigger and consistently so.

I suspect the second behaviour is caused by CPU ready time, which is explained in references [5] and [6].

Have a look at the following example in VirtualBox:

[Figure: one core assigned, note the peaks]

[Figure: two cores assigned, the differences are large and consistent]

So we can conclude two things.

The first one is that this method is not always reliable, as it is heavily dependent on the processor and the virtualization product.

The second one is that if I were running a sandbox cluster, I would try to assign only one core to each guest machine, not only because it would make this method a bit less reliable, but also for performance.

Our fabulous sandbox uses an emulator instead of a VM; should I care about this?

Well, generally speaking, you should not care about this specific method then. Emulators replicate the whole machine's hardware, including the CPU at the lowest level (binary translation), so the emulator has its own TSC implementation, and the cycle count for a routine should be similar to that of a physical CPU.

We can verify this by running our test program in QEMU:

[Figure: QEMU is nice]

I hope you enjoyed the post and this new pafish release. Thanks to the mlw.re members for helping me with the tests :)

Check out the references for more information on this topic and a general understanding of how VMs / emulators work!

References:

[1] https://en.wikipedia.org/wiki/Time_Stamp_Counter

[2] https://randomascii.wordpress.com/2011/07/29/rdtsc-in-the-age-of-sandybridge/

[3] http://www.vmware.com/files/pdf/VMware_paravirtualization.pdf

[4] http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf

[5] https://virtualblocks.wordpress.com/2010/06/22/cpu-ready-over-built-vm-or-over-utilized-host/

[6] http://www.spug.co.uk/?p=294

