Interpreters and virtual machines

1 Introduction

An interpreter is a program that executes another program, represented as some kind of data-structure. Common program representations include:

raw text (source code),
trees (AST of the program),
linear sequences of instructions.

Interpreters enable the execution of a program without requiring its compilation to native code. They simplify the implementation of programming languages and — on modern hardware — are efficient enough for many tasks.

2 Basic interpreters

2.1 Text-based interpreters

Text-based interpreters directly interpret the textual source of the program. They are very seldom used, except for trivial languages where every expression is evaluated at most once — i.e. languages without loops or functions.

A plausible example of such an interpreter is a calculator program, which evaluates arithmetic expressions as it parses them.

2.2 Tree-based interpreters

Tree-based interpreters walk over the abstract syntax tree of the program to interpret it. Their advantage compared to text-based interpreters is that parsing — and name/type analysis, if applicable — is done only once.

A plausible example of such an interpreter is a graphing program, which has to repeatedly evaluate a mathematical function submitted by the user in order to plot it. Also, all the interpreters included in the L₃ compiler are tree-based.

3 Virtual machines

Virtual machines resemble real processors, but are implemented in software. Like real processors, they accept as input a program composed of a sequence of instructions.

Virtual machines often provide more than the simple interpretation of programs: they also abstract the underlying system by managing memory, threads, and sometimes I/O.

Perhaps surprisingly, virtual machines are a very old concept, dating back to ~1950. They have been — and still are — used in the implementation of many important languages, like SmallTalk, Lisp, Forth, Pascal, and more recently Java and C#.

3.1 Virtual vs real machines

Since the compiler has to generate code for some machine, why prefer a virtual over a real one? There are several reasons:

the compiled VM code can be run on all architectures on which the virtual machine runs, without having to be recompiled,
the compiler is usually simpler if it targets a virtual machine, as virtual machines tend to be higher-level than real processors,
a virtual machine is easier to monitor and profile, which makes debugging easier.

The only drawback of virtual machines compared to real ones is that they tend to be slower. This is due to the overhead associated with interpretation: fetching and decoding instructions, executing them, etc. Moreover, the high number of indirect jumps in interpreters causes pipeline stalls in modern processors.

To a (sometimes large) degree, this is mitigated by the tendency of modern VMs to compile the program being executed, and to perform optimizations based on its behavior.

3.2 Stack and register-based VMs

There are two kinds of virtual machines:

stack-based VMs, which use a stack to store intermediate results, variables, etc.
register-based VMs, which use a limited set of registers for that purpose, like real CPUs.

For a compiler writer, it is usually easier to target a stack-based VM than a register-based VM, as the complex task of register allocation can be avoided. Therefore, most widely-used virtual machines today are stack-based (e.g. the JVM, .NET’s CLR, etc.) but a few recent ones are register-based (e.g. Lua 5.0).

3.3 Virtual machine implementation

Virtual machines take as input a program represented as a sequence of instructions. Each instruction is identified by its opcode (operation code), a simple number. Often, opcodes occupy one byte, hence the name byte code. Some instructions have additional arguments, which appear after the opcode in the instruction stream.

Virtual machines interpret the program they receive as input in much the same way as a real processor:

the next instruction to execute is fetched from memory and decoded,
the operands are fetched, the result computed, and the state updated,
the process is repeated.

Many VMs today are written in C or C++, because these languages are at the right abstraction level for the task, fast and relatively portable. Moreover, as we will see below, GCC (the Gnu C compiler) has an extension that makes it possible to use labels as normal values. This extension can be used to write very efficient VMs, and for that reason, several of them are written specifically for GCC or compatible compilers (e.g. clang).

In C, a basic, stack-based virtual machine could look as follows:

typedef enum { add, /* … */ } instruction_t;

void interpret() {
  static instruction_t program[] = { add /* … */ };
  instruction_t* pc = program;
  int* sp = …; /* stack pointer */
  for (;;) {
    switch (*pc++) {
    case add:
      sp[1] += sp[0];
      sp++;
      break;
      /* … other instructions */
    }
  }
}

It is however possible to improve this basic design to make it faster, using several techniques that are described below, namely:

threaded code,
top of stack caching,
super-instructions.

3.4 Threaded code

In a switch-based interpreter like the one presented above, each instruction requires two jumps:

one indirect jump to the branch of the switch handling the current instruction,
one direct jump from there back to the main loop.

It would be better to avoid the second one, by jumping directly to the code handling the next instruction.

This is the idea of threaded code, illustrated graphically in the image below. It compares how a simple program would be executed in a switch-based interpreter like the one above, with how it would be executed in a threaded interpreter.

There are two main variants of threaded code:

indirect threading, where instructions index an array containing pointers to the code handling them,
direct threading, where instructions are pointers to the code handling them.

Direct threading could potentially be faster than indirect threding — because of the lack of indirection — but on modern 64-bit architectures, representing each opcode by a 64-bit pointer is expensive.

To implement any of these variants of threaded code, it must be possible to manipulate code pointers. In ANSI C, the only way to do this is through function pointers. However, GCC offers another solution, as it allows the manipulation of labels as values. The two solutions are described below.

3.4.1 Threaded code in ANSI C

Implementing direct threading in ANSI C is easy, but inefficient. The idea is to define one function per VM instruction. The program can then simply be represented as an array of function pointers. Some code is inserted at the end of every function to call the function handling the next VM instruction.

typedef void (*instruction_t)();
static instruction_t* pc;
static int* sp = …;

static void add() {
  sp[1] += sp[0];
  ++sp;
  (*++pc)(); /* handle next instruction */
}

/* … other instructions */

static instruction_t program[] = { add, /* … */ };

void interpret() {
  sp = …;
  pc = program;
  (*pc)(); /* handle first instruction */
}

This implementation of direct threading in ANSI C has a major problem: it leads to stack overflow very quickly, unless the compiler implements tail call elimination.

Indeed, in the interpreter above, the function call appearing at the end of add — and all other functions implementing VM instructions — is a tail call and should be optimized. While recent versions of GCC and clang do full tail call elimination, not all compilers do. With such compilers, the only option is to eliminate tail calls by hand, e.g. using trampolines.

3.4.2 Threaded code with GCC

GCC and other compilers (mostly clang) offer an extension that is very useful to implement direct threading: labels can be treated as values, and a goto can jump to a computed label.

With this extension, the program can be represented as an array of labels, and jumping to the next instruction is achieved by a goto to the label currently referred to by the program counter.

void interpret() {
  void* program[] = { &&l_add, /* … */ };

  int* sp = …;
  void** pc = program;
  goto **pc; /* jump to first instruction */

 l_add:
   sp[1] += sp[0];
   ++sp;
   goto **(++pc); /* jump to next instruction */

 /* … other instructions */
}

3.4.3 Comparison

To compare the performance of switch-based and threaded interpreters, four versions of a tiny interpreter were written, and the time needed to interpret a simple loop was measured. The results, normalized to the time taken by the switch-based interpreter, are presented below.

Given the tiny size of the interpreter and the simplicity of the program it interprets, these results should be interpreted with caution. They do show, however, that threaded code only pays off when GCC's labels as values extension can be used.

3.5 Top-of-stack caching

In a stack-based VM, the stack is typically represented as an array in memory. Since almost all instructions access the stack, it can be interesting to store some of its topmost elements in registers. However, keeping a fixed number of stack elements in registers is usually a bad idea, as the following example illustrates:

Since caching a fixed number of stack elements in registers seems like a bad idea, it is natural to try to cache a variable number of them. For example, here is what happens when caching at most one stack element in a register:

Caching a variable number of stack elements in registers complicates the implementation of instructions: there must be one implementation of each VM instruction per cache state — defined as the number of stack elements currently cached in registers.

For example, when caching at most one stack element, the add instruction needs the following two implementations:

State 0: no elements in reg.	State 1: top-of-stack in reg.
`add_0:`	`add_1:`
`tos = sp[0] + sp[1];`	`tos += sp[0];`
`sp += 2;`	`sp += 1;`
// go to state 1	// stay in state 1

The image below compares the time taken by two interpreters to execute a program to compute a large sum of integers. One interpreter uses top-of-stack caching (with at most one element cached), the other does not.

3.6 Super-instructions

Since instruction dispatch is expensive in a VM, one way to reduce its cost is simply to dispatch less. This can be done by grouping several instructions that often appear in sequence into a super-instruction.

For example, if the mul instruction is often followed by the add instruction, the two can be combined in a single madd (multiply and add) super-instruction.

Profiling is typically used to determine which sequences should be transformed into super-instructions, and the instruction set of the VM is then modified accordingly.

It is also possible to generate super-instructions at run time, to adapt them to the program being run. This is the idea behind dynamic super-instructions. This technique can be pushed to its limits, by generating one super-instruction for every basic block of the program! This effectively transform all basic blocks into single (super-)instructions.

4 L₃VM

L₃VM is the virtual machine developed for this course. Its main characteristics are:

it is a 32-bit VM, in the sense that the basic unit of storage is a 32-bit word — therefore, (untagged) integers, pointers and instructions are all 32 bits,
it is register-based, although its notion of registers is not standard,
it is relatively simple, with only 32 instructions.

4.1 Memory

L₃VM has a 32-bit address space — even when running on a 64-bit machine — which is used to store both code and data. Code is stored starting at address 0 and the rest of the available memory is used for the heap.

The L₃VM address space is virtual, in that it is not the same as the one of the host architecture.

4.2 Registers

Strictly speaking, L₃VM has only four registers:

PC is the program counter, containing the address of the instruction being executed,
Ib, Lb and Ob are the input, local and output base registers (respectively), each of which contains either 0 or the address of a heap-allocated block.

The slots of the blocks pointed by Ib, Lb and Ob are reachable through pseudo-registers. For example, the pseudo-register O3 designates the slot at index 3 of the block pointed by Ob.

There are 32 input and output pseudo-registers (I0 to I31 and O0 to O31), and 160 local pseudo-registers (L0 to L159), as well as 32 constant pseudo-registers containing the constants 0 to 31. (In the following, we use the term register to designate a pseudo-registers, and base register to designate a base register).

A function gets its arguments through its input registers, stores its local variables in its local registers, and uses its output registers to pass arguments to the functions it calls.

The CALL instruction takes care of saving the caller’s context, composed of its base registers (Ib, Lb and Ob) as well as its return address. They are saved in the callee’s first four input registers (I0 to I4), and can be seen as implicit arguments passed to the callee.

Symmetrically, the RET instruction takes care of restoring the caller’s context.

The following image — available as an animation in the lecture slides, which is probably much easier to interpret — shows how registers are handled during a non tail call:

The following image — also available as an animation in the lecture slides — shows how registers are handled during a tail call:

Finally, the following image — also available as an animation in the lecture slides — shows how registers are handled during a function return:

4.3 Instruction set

The 32 instructions of the L₃VM are presented below. Notice that most of them correspond directly to what we called the low-level primitives of the CPS/L₃ language.

A first group of instructions is the arithmetic ones, which perform standard arithmetic and bitwise operations on 32 bit values:

Control instructions either perform jumps within functions, or calls across functions:

Block instructions manipulate tagged blocks:

I/O instructions perform input/output of individual bytes:

4.4 Example

As an example of L₃VM code, here is the hand-coded version of the factorial function:

fact:
	RALO 0, 5
	JNE C0, I4, else
	RET C1
else:
	SUB O4, I4, C1
	CALL_ND fact
	MUL I4, I4, O0
	RET I4