The L3 project

1. Introduction

L₃ is a programming language designed for this course. It belongs to the Lisp family of languages, and its main characteristics are:

it is “dynamically typed”,
it is mostly functional, in that :
- functions are first-class values and can be nested,
- there are few side-effects (exceptions: mutable blocks and I/O),
it automatically manages (frees) memory.

L₃ tries to strike a balance between being simple enough that the compiler can be understood by students over the course of one semester, and yet powerful enough that non trivial programs can be written in it.

The implementation of L₃ consists in two main components:

a compiler, written in Scala, that generates an assembly file for the virtual machine,
a virtual machine, written in C, that can execute the code generated by the compiler.

An overview of the compiler appears at the end of this document, while the virtual machine will be described in a later one.

2. The L₃ language

The following example gives a taste of the L₃ language by defining a function called pow that raises an integer to a given power.

The various elements of L₃ are described below.

2.1. Values

L₃ has four types of atomic values—unit, booleans, characters and integers—and one type of composite value—tagged blocks. As described later, tagged blocks are used to encode all forms of structured values: strings, arrays, pairs, functions, etc.

2.1.1. Atomic values

L₃ integers are represented using 31 (!) bits, in two’s complement, and therefore range from –2³⁰ to 2³⁰ – 1.

L₃ characters are represented by their Unicode code point, i.e. a 21 bit positive integer in one of the following ranges:

from 0000₁₆ to D7FF₁₆, or
from E000₁₆ to 10FFFF₁₆.

Characters and integers are different types, and conversion between the two must be done using primitives described later.

2.1.2. Tagged blocks

Tagged blocks are composed of a tag (a small integer) and zero or more slots containing arbitrary L₃ values. They are therefore similar to arrays or tuples in other language, but with an attached tag whose goal is to identify what the block contains. For example, strings are represented by blocks tagged with a tag that is reserved for their exclusive use.

It is important to understand that tagged blocks are a low-level data structure. They are meant to be used to implement higher-level data structures like arrays, lists, etc. As a rule, they are therefore used directly only in libraries, and not in application code.

Block tags are 8 bit integers, and therefore range from 0 to 255. Tags do not appear as numbers in L₃, but as symbols. The compiler is responsible for assignining different numerical tags to the symbolic tags used by a program, or fail if there are more than 256 of them.

2.1.3. Syntax for values

Syntax exists for all the four kinds of atomic values, as well as for strings:

2.2. Definitions

L₃ offers two kinds of top-level definitions, both of which allow the definition of a name that is visible in the rest of the program. The first one (def) is non-recursive and can be used to name the value of an arbitrary expression, while the second one (defrec) is recursive and can only be used to name a function.

The rationale behind this distinction is that arbitrary recursive values are rarely useful, but challenging to compile efficiently. Allowing only functions to be recursive offers a good compromise between the power given to the programmer and the performance of the compiled program.

L₃ also offers local definitions, which are only visible in a delimited block of code. Like top-level definitions, local definitions exist in two variants: an arbitrary, non-recursive one (let), and a recursive one limited to functions (letrec).

2.3. Functions and primitives

Like all functional languages, L₃ offers a concise syntax to define anonymous functions with the fun keyword.

Function application is even more concise, as no keyword is associated with it: the function and its arguments are simply written in parentheses.

Apart from functions, L₃ also offers primitives, which can be viewed as a set of limited, predefined functions. They are limited in the sense that, unlike functions, they cannot be used as values. The only operation that is available on them is application, which is distinguished from function application by the use of the @ keyword.

2.3.1. Arity-based name lookup

A special name lookup rule is used when analyzing a function application in which the function is a simple name:

(n e1 e2 … ek)

In such a case, the name n@k (i.e. the name itself, followed by @, followed by the arity in base 10) is first looked up, and used instead of n instead if it exists. Otherwise, name analysis proceeds as usual.

This is called arity-based name lookup and although this is not overloading per se, it allows a kind of function overloading based on arity. For example, it is possible to define several functions to create lists of different lengths:

(def list-make@1 (fun (e1) …))
(def list-make@2 (fun (e1 e2) …))

and so on for list-make@3, list-make@4, etc. With these definitions, the following two function applications are valid:

(list-make 1)                           ; (invokes list-make@1)
(list-make 1 (+ 2 3))                   ; (invokes list-make@2)

However, the following one is not valid, unless a definition for the bare name list-make also appears in scope:

(map list-make l)

The reason is that here, list-make does not appear in the function position of a function application, and therefore arity-based name lookup is not done.

2.4. Conditional expressions

L₃ offers two kinds of conditional expressions: a simple if similar to the one found in other functional programming languages, and cond. The latter is used to express what is usually expressed using chains of if … else if in other programming languages.

Despite the fact that L₃ has boolean values (#t and #f, as described above), it is important to notice that conditional expressions consider any value different from #f as being true.

L₃ also offers short-cutting logical operations, which are a form of conditional expressions.

Notice that both and and or produce the value of the last expression they evaluated, which can be very useful in practice. For example, assume that the variable user-name contains either a string or the value #f to indicate that the user is anonymous. Then, the following expression will either produce the user name or the string <anonymous> if the user is anonymous:

(or user-name "<anonymous>")

This is very similar to the use of values of type Option in Scala (also called Maybe or Optional in other languages), and the or expression in L₃ can play the role of the getOrElse method in Scala.

2.5. Loops and blocks

Like other functional languages, L₃ does not offer any form of imperative loops. Therefore, programmers have to resort to recursion to loop, and the rec construct is meant to simplify the definition of recursive functions modeling loops.

2.6. Primitives

The various primitives offered by L₃ are described below. Almost all of them only accept a subset of all possible values as argument. For example, the primitive for addition only accepts integers as arguments.

It is nevertheless possible to apply those primitives to values that they do not accept, in which case the behavior of the program is undefined. In other words, the validity of primitive arguments is not checked at run time, for simplicity and efficiency reasons.

2.6.1. Polymorphic primitives

L₃ offers two polymorphic primitives, that can be applied to any value.

The first one is the equality test, =, that checks whether its two arguments are equal.

The second one is the identity primitive, id, which simply returns its (single) argument ummodified. This primitive is useless for user programs, and therefore only available to the compiler, which uses it in some phases.

2.6.2. Type-testing primitives

For each of its five types of values, L₃ offers a type-testing primitive that accepts any value and tests whether it has the corresponding type. Since these primitives return a boolean value, their name ends with a question mark (?). They are: unit?, bool?, char?, int? and block?.

2.6.3. Integer primitives

L₃ offers primitives for the four basic arithmetic operations on integers, namely + (addition), - (subtraction), * (multiplication), / (division) and % (remainder). Notice that the notion of integer division (and remainder) used in L₃ is truncated division, like in C and Java. Neither the / nor the % primitive accept 0 as second argument.

Additionally, like many other languages, L₃ offers primitives to manipulate integers as bit vectors, of size 31. These are shift-left, shift-right, and, or and xor.

Finally, L₃ offers two comparison primitives, namely < (less than) and <= (less than or equal). In the interest of simplicity, no primitives are provided for greater than and greater than or equal, as these tests can easily (and cheaply) be implemented using the existing primitives.

2.6.4. Character primitives

L₃ offers only two primtives on characters, char->int and int->char, that convert them to and from the corresponding character. The integer corresponding to a character is its Unicode code point.

2.6.5. I/O primitives

L₃ offers two primitives to interact with the outside world by writing or reading bytes, usually to/from the console.

A single byte can be read using the byte-read primitive, which either returns that byte as an integer between 0 and 255 (included), or -1 to signal the end of input.

A single byte can be written using the byte-write primitive, which only accepts integers between 0 and 255 (included).

2.6.6. Block primitives

A tagged block can be allocated using the block-alloc primitive, which is special in that the (symbolic) tag of the block to allocate is specified as part of the name of the primitive. For example, block-alloc-foo will allocate a block tagged with some unknown tag, chosen by the compiler to correspond to the symbol foo. The reason why the tag is part of the primitive name is that the virtual machine can only allocate blocks with a constant tag.

Once a block has been allocated, its tag can be obtained using the block-tag primitive, and its size with block-size. The integer corresponding to a tag can be obtained using the #_ notation. For example #_foo is the integer between 0 and 255 (included) that the compiler chose to correspond to the symbol foo.

Finally, two primitives exist to read and write block slots:

(@block-get b n) returns the value of slot at index n of block b, and
(@block-set! b n v) sets the value of slot at index n of block b to value v.

Slot indices are 0-based.

2.7. EBNF grammar

For reference, L₃’s EBNF grammar is given below.

3. L₃ syntactic sugar

L₃ has a substantial amount of syntactic sugar: constructs that can be syntactically translated to other existing constructs. Syntactic sugar does not offer additional expressive power to the programmer, but some syntactical convenience.

For example, L₃ allows if expressions without an else branch, which is implicitly taken to be the unit value #u:

(if e1 e2)                              ; ⇔ (if e1 e2 #u)

Syntactic sugar is typically removed very early in the compilation process — e.g. during parsing — to simplify the language that the compiler has to handle. This process is known as desugaring.

Desugaring is specified below as a function denoted by ⟦·⟧ taking an L₃ term and producing a desugared CL₃ term (CL₃ is Core L₃, the desugared version of L₃). To clarify the presentation, L₃ terms appear in orange, CL₃ terms in green, and meta-terms in black.

As an example, the image below shows how the following program:

(@byte-write (if #t 79 75))
(@byte-write (if #f 79 75))

is desugared using the rules above.

4. The L₃ compiler

The L₃ compiler consists of a sequence of phases that gradually transform the program being compiled. The most important phases are shown in the figure below, and are split as usual into front-end and back-end phases.

The L₃ compiler manipulates a total of four languages:

L₃ is the source language that is parsed, but never exists as a tree — it is desugared to CL₃ immediately,
CL₃ — a.k.a. Core L₃ — is the desugared version of L₃,
CPS is the main intermediate language, on which optimizations are performed,
ASM is the assembly language of the target (virtual) machine.

The compiler contains interpreters for the last three languages, which is useful to check that a program behaves in the same way as it is undergoes transformation. These interpreters also serve as semantics for their language.