The L3 project
1 Introduction
L3 is a programming language designed for this course. It belongs to the Lisp family of languages, and its main characteristics are:
- it is "dynamically typed",
- it is mostly functional, in that :
- functions are first-class values and can be nested,
- there are few side-effects (exceptions: mutable blocks and I/O),
- it automatically manages (frees) memory.
L3 tries to strike a balance between being simple enough that the compiler can be understood by students over the course of one semester, and yet powerful enough that non trivial programs can be written in it.
The implementation of L3 consists in two main components:
- a compiler, written in Scala, that generates an assembly file for the virtual machine,
- a virtual machine, written in C, that can execute the code generated by the compiler.
An overview of the compiler appears at the end of this document, while the virtual machine will be described in a later one.
2 The L3 language
The following example gives a taste of the L3 language by defining a function called pow
that raises an integer to a given power.
The various elements of L3 are described below.
2.1 Values
L3 has four types of atomic values—unit, booleans, characters and integers—and one type of composite value—tagged blocks. As described later, tagged blocks are used to encode all forms of structured values: strings, arrays, pairs, functions, etc.
2.1.1 Atomic values
L3 integers are represented using 31 (!) bits, in two's complement, and therefore range from –230 to 230 – 1.
L3 characters are represented by their Unicode code point, i.e. a 21 bit positive integer in one of the following ranges:
- from 000016 to D7FF16, or
- from E00016 to 10FFFF16.
Characters and integers are different types, and conversion between the two must be done using primitives described later.
2.1.2 Tagged blocks
Tagged blocks are composed of a tag (a small integer) and zero or more slots containing arbitrary L3 values. They are therefore similar to arrays or tuples in other language, but with an attached tag whose goal is to identify what the block contains. For example, strings are represented by blocks tagged with a tag that is reserved for their exclusive use.
It is important to understand that tagged blocks are a low-level data structure. They are meant to be used to implement higher-level data structures like arrays, lists, etc. As a rule, they are therefore used directly only in libraries, and not in application code.
Block tags are 8 bit integers, and therefore range from 0 to 255. Tags above 200 are reserved for use by the compiler, while the remaining ones are available for general use. For example, the L3 standard library uses blocks with such tags to represent arrays, lists, pairs, etc.
2.1.3 Syntax for values
Syntax exists for all the four kinds of atomic values, as well as for strings:
2.2 Definitions
L3 offers two kinds of top-level definitions, both of which allow the definition of a name that is visible in the rest of the program. The first one (def
) is non-recursive and can be used to name the value of an arbitrary expression, while the second one (defrec
) is recursive and can only be used to name a function.
The rationale behind this distinction is that arbitrary recursive values are rarely useful, but challenging to compile efficiently. Allowing only functions to be recursive offers a good compromise between the power given to the programmer and the performance of the compiled program.
L3 also offers local definitions, which are only visible in a delimited block of code. Like top-level definitions, local definitions exist in two variants: an arbitrary, non-recursive one (let
), and a recursive one limited to functions (letrec
).
2.3 Functions and primitives
Like all functional languages, L3 offers a concise syntax to define anonymous functions with the fun
keyword.
Function application is even more concise, as no keyword is associated with it: the function and its arguments are simply written in parentheses.
Apart from functions, L3 also offers primitives, which can be viewed as a set of limited, predefined functions. They are limited in the sense that, unlike functions, they cannot be used as values. The only operation that is available on them is application, which is distinguished from function application by the use of the @
keyword.
2.3.1 Arity-based name lookup
A special name lookup rule is used when analyzing a function application in which the function is a simple name:
(n e1 e2 … ek)
In such a case, the name n@k
(i.e. the name itself, followed by @
, followed by the arity in base 10) is first looked up, and used instead of n instead if it exists. Otherwise, name analysis proceeds as usual.
This is called arity-based name lookup and although this is not overloading per se, it allows a kind of function overloading based on arity. For example, it is possible to define several functions to create lists of different lengths:
(def list-make@1 (fun (e1) …)) (def list-make@2 (fun (e1 e2) …))
and so on for list-make@3
, list-make@4
, etc. With these definitions, the following two function applications are valid:
(list-make 1) ; (invokes list-make@1) (list-make 1 (+ 2 3)) ; (invokes list-make@2)
However, the following one is not valid, unless a definition for the bare name list-make
also appears in scope:
(map list-make l)
The reason is that here, list-make
does not appear in the function position of a function application, and therefore arity-based name lookup is not done.
2.4 Conditional expressions
L3 offers two kinds of conditional expressions: a simple if
similar to the one found in other functional programming languages, and cond
. The latter is used to express what is usually expressed using chains of if
… else if
in other programming languages.
Despite the fact that L3 has boolean values (#t
and #f
, as described above), it is important to notice that conditional expressions consider any value different from #f
as being true.
L3 also offers short-cutting logical operations, which are a form of conditional expressions.
Notice that both and
and or
produce the value of the last expression they evaluated, which can be very useful in practice. For example, assume that the variable user-name
contains either a string or the value #f
to indicate that the user is anonymous. Then, the following expression will either produce the user name or the string <anonymous>
if the user is anonymous:
(or user-name "<anonymous>")
This is very similar to the use of values of type Option
in Scala (also called Maybe
or Optional
in other languages), and the or
expression in L3 can play the role of the getOrElse
method in Scala.
2.5 Loops and blocks
Like other functional languages, L3 does not offer any form of imperative loops. Therefore, programmers have to resort to recursion to loop, and the rec
construct is meant to simplify the definition of recursive functions modeling loops.
2.6 Primitives
The various primitives offered by L3 are described below. Almost all of them only accept a subset of all possible values as argument. For example, the primitive for addition only accepts integers as arguments.
It is nevertheless possible to apply those primitives to values that they do not accept, in which case the behavior of the program is undefined. In other words, the validity of primitive arguments is not checked at run time, for simplicity and efficiency reasons.
2.6.1 Polymorphic primitives
L3 offers two polymorphic primitives, that can be applied to any value.
The first one is the equality test, =
, that checks whether its two arguments are equal.
The second one is the identity primitive, id
, which simply returns its (single) argument ummodified. This primitive is useless for user programs, and therefore only available to the compiler, which uses it in some phases.
2.6.2 Type-testing primitives
For each of its five types of values, L3 offers a type-testing primitive that accepts any value and tests whether it has the corresponding type. Since these primitives return a boolean value, their name ends with a question mark (?
). They are: unit?
, bool?
, char?
, int?
and block?
.
2.6.3 Integer primitives
L3 offers primitives for the four basic arithmetic operations on integers, namely +
(addition), -
(subtraction), *
(multiplication), /
(division) and %
(remainder). Notice that the notion of integer division (and remainder) used in L3 is truncated division, like in C and Java. Neither the /
nor the %
primitive accept 0 as second argument.
Additionally, like many other languages, L3 offers primitives to manipulate integers as bit vectors, of size 31. These are shift-left
, shift-right
, and
, or
and xor
.
Finally, L3 offers two comparison primitives, namely <
(less than) and <=
(less than or equal). In the interest of simplicity, no primitives are provided for greater than and greater than or equal, as these tests can easily (and cheaply) be implemented using the existing primitives.
2.6.4 Character primitives
L3 offers only two primtives on characters, char->int
and int->char
, that convert them to and from the corresponding character. The integer corresponding to a character is its Unicode code point.
2.6.5 I/O primitives
L3 offers two primitives to interact with the outside world by writing or reading bytes, usually to/from the console.
A single byte can be read using the byte-read
primitive, which either returns that byte as an integer between 0 and 255 (included), or -1 to signal the end of input.
A single byte can be written using the byte-write
primitive, which only accepts integers between 0 and 255 (included).
2.6.6 Block primitives
A tagged block can be allocated using one of the 201 different block allocation primitives offered by L3: block-alloc-0
takes a size as argument and return a block of that size, tagged with 0; block-alloc-1
works in the same fashion, but returns a block tagged with 1; and so on until block-alloc-200
. The reason why the tag is part of the primitive name is that the virtual machine can only allocate blocks with a constant tag.
Once a block has been allocated, its tag can be obtained using the block-tag
primitive, and its size with block-size
.
Finally, two primitives exist to read and write block slots:
(@block-get b n)
returns the value of slot at indexn
of blockb
, and(@block-set! b n v)
sets the value of slot at indexn
of blockb
to valuev
.
Slot indices are 0-based.
2.7 EBNF grammar
For reference, L3's EBNF grammar is given below.
3 L3 syntactic sugar
L3 has a substantial amount of syntactic sugar: constructs that can be syntactically translated to other existing constructs. Syntactic sugar does not offer additional expressive power to the programmer, but some syntactical convenience.
For example, L3 allows if
expressions without an else
branch, which is implicitly taken to be the unit value #u
:
(if e1 e2) ; ⇔ (if e1 e2 #u)
Syntactic sugar is typically removed very early in the compilation process — e.g. during parsing — to simplify the language that the compiler has to handle. This process is known as desugaring.
Desugaring is specified below as a function denoted by ⟦·⟧ taking an L3 term and producing a desugared CL3 term (CL3 is Core L3, the desugared version of L3). To clarify the presentation, L3 terms appear in orange, CL3 terms in green, and meta-terms in black.
As an example, the image below shows how the following program:
(@byte-write (if #t 79 75)) (@byte-write (if #f 79 75))
is desugared using the rules above.
4 The L3 compiler
The L3 compiler consists of a sequence of phases that gradually transform the program being compiled. The most important phases are shown in the figure below, and are split as usual into front-end and back-end phases.
The L3 compiler manipulates a total of four languages:
- L3 is the source language that is parsed, but never exists as a tree — it is desugared to CL3 immediately,
- CL3 — a.k.a. Core L3 — is the desugared version of L3,
- CPS is the main intermediate language, on which optimizations are performed,
- ASM is the assembly language of the target (virtual) machine.
The compiler contains interpreters for the last three languages, which is useful to check that a program behaves in the same way as it is undergoes transformation. These interpreters also serve as semantics for their language.