On a CLR backend for Moscow ML
4-week project report
Niels Jørgen Kokholm
IT-University of Copenhagen
May 31, 2002
Contents
2 Background, problem statement and methods
2.2 The current Moscow ML compiler
3.1 Overall structure of code generation and code
3.3 Verifiable vs. unverifiable code
3.4 Simple internal data types
3.9 Global values and accessing values in other
units
3.10 Some proposals for optimisations
3.11 Towards a port of Moscow ML to CLR
4 User’s guide to the pilot compiler
4.1 To install the pilot, compile and run a SML
program
4.2 To modify, recompile and test the compiler and library
5.1 Structure of pilot compiler
5.3 Examples of compilation of expressions and code
generation
5.5 Exception handling example
6.3 Testing on other CLR platforms
9 List of code listing appendices
This report was written in May 2002 in connection with a 4-week project at the IT University of Copenhagen under supervision of Andrzej Waskowski and Peter Sestoft. Thanks to both for initial input, discussions, advice and evil test cases. Thanks to the developers of Moscow ML for making it open source, thereby enabling and encouraging projects like this.
In the
project I have made a preliminary investigation of making the Moscow ML system
use the IL intermediate code of the Common Language Runtime instead of the
current CaML Light intermediate code. The project has produced a pilot compiler
as a Standard ML program that can compile a significant subset of Standard ML
including a small subset of the Standard ML standard library to correct and
somewhat efficient Common Language Runtime programs.
The project
was done in continuation of the course “Programming languages” [PFO] at the IT University. The report
assumes knowledge of functional programming, in particular of Standard ML, of
compiler construction at the level of that course and some knowledge of C#.
Moscow ML [MosML] is a compiler and an
interactive system for the strict functional language Standard ML (SML)[1].
The current, version 2.0, Moscow ML compiler generates intermediate code in the
form of CaML Light bytecode that can be interpreted by a CaML Light virtual
machine (runtime program). The interactive system works by compiling individual
top level declarations on the fly to byte code and then loading the result into
the virtual machine.
The Common
Language Runtime (CLR) ([ECMA]) is one of the two main components of the
Microsoft .NET Framework [MSDN],
the other one being the .Net Framework class libraries. The CLR is (the specification of) a virtual machine
that runs programs in the form of Common Intermediate Language (CIL) bytecode[2].
I may sometimes misuse the term CIL to denote the textual (assembler) form when
no mistake should be possible. The CLR is in many respects quite similar to the
Java Virtual Machine (JVM) ([Java]).
There are
two interesting levels of CIL the CLR can execute: verifiable vs. valid but
unverifiable[3]. Verifiable
code is CIL code the CLR can verify beforehand is safe for the CLR to execute
in the sense that it will not corrupt the runtime itself or other running
applications. Valid but unverifiable code may be executed by the CLR if
permitted by the configured security policy. The possibility of executing
unverifiable code is there to enable high-performance implementations of
certain programming languages/applications.
It is the
stated policy of Microsoft to make the .NET framework, in particular the CLR,
the preferred platform for application execution on hardware platforms from
hand-held devices to large-scale servers. Other projects (Mono [Mono] and DotGNU Portable.NET [DotGNU]) are developing open source
CLR implementations to support various purposes on GNU/Linux platforms. It is
therefore to be expected that implementations of the CLR will in the future be
very widely deployed.
The purpose
of this project is to reach a documented opinion on whether and how an
efficient backend for Moscow ML targeting the CLR can be constructed. The
documentation of the opinion will be the descriptions in this report together
with a tested pilot implementation of a sufficient part of the central features
of Moscow ML on CLR.
There are
several perspectives of such an endeavour:
1.
The
perspective of promoting the safety of strongly typed languages in the server
application area.
2.
The
perspective of assessing the portability of the design of Moscow ML with
respect to the backend virtual machine or assessing the flexibility of the CLR
as an execution platform for languages outside the realm of object
oriented/imperative languages.
3.
The
personal perspective of learning about virtual machine code generation in a
non-toy-compiler setting, learning about the CLR and learning more about
functional programming.
Since one
might imagine porting the CaML Light runtime from C to C# or even simulating
one of the current hardware/software platforms of Moscow ML in C# to get the
compiler and compiled programs to run on top of CLR, targeting the CLR is only
really interesting if efficiency is part of the requirements. Therefore head-on
comparison of benchmark program with the current Moscow ML is important.
On the
other hand, fine-tuning of implementation detail performance, or changing the
way Moscow ML system works in areas not directly related to a specific backend
is not relevant.
Following some more detailed priorities:
To support
compiling enough interesting test cases, the pilot implementation must
implement integers, integer arithmetic, strings and simple string operations.
Tuples, lists, algebraic data types, references, functions (including mutually
recursive ones) with proper tail recursion, conditional expressions (including
cases) and exception handling are also needed.
Design of
good, efficient internal data types for the above types.
Special
consideration must be made of the expected problem areas closures, exception
handling and garbage collection.
The
implementation should be covered reasonably by test cases although a
comprehensive regression test suite for the pilot compiler is not required.
Consider
use of verifiable or unverifiable code.
The part of
the standard library preloaded by default. Both as test cases and to get some
experience with the amount of work needed and whether new tricks typically need
to be invented or adding support for the libraries is of a more routine
character.
Bootstrapping
the compiler or implementing an interactive system is beyond what should be
expected during the project, but consideration should be given not to make
design decisions that would make bootstrapping or interactive mode overly
difficult.
Cooperation
with other languages on the .NET platform should be taken into account in order
not to make it too hard to call other languages or be called from other
languages. Designing libraries for exporting SML values as CLR objects,
implementing the .NET class library as a SML library are well beyond the scope
of this project though.
I don’t
expect floating-point types or unsigned or alternative precision integer types
to present problems that would not be exposed by integer values. Therefore
implementation of these types together with their arithmetic and bitwise
operations will have low prioritity.
Other
features of low priority: The quote/antiquote facilities because they are not
used much. The parts of the standard lib not loaded with the Misc unit; the top
level compilation mode and testing the compilation of code using anything but
the simplest parts of the module system (i.e. no functors in test cases).
Character set questions.
The Moscow
ML compiler is constructed in the traditional way with stages for lexing and
parsing that build an abstract syntax tree, front-end stages for type checking
and tree restructuring that transforms the abstract syntax tree to an
intermediate form and back-end stages for compiling the intermediate “language”
to byte code. The output of the compilation of an SML input file will be an
“.ui” file describing the external interface of the compilation unit and an
”.uo” file containing the byte code to
be executed by the runtime to evaluate the expressions in the source file[4].
The
following diagram zooms in on the back-end code generating stages:

A few of
the SML files making up the stages or defining a format are shown together with
some central functions.
In the main
flow, each top-level declaration of the source SML program is translated in the
front end to a Lambda expression, which is sent to the backend. In the backend
UNdeBruijn splits a Lambda expression containing one or more function
definitions into the list of (Lambda expression) function bodies and a modified
expression with the function definitions substituted by closure-creating Lambda
primitives and function applications substituted with closure call primitives.
The resulting Lambda expressions are then compiled to (symbolic)
macroinstructions by compileExp and the result translated to real byte code in
the Emitxxx modules. In certain situations the Sigmtch module may generate
extra Lambda expressions, e.g. when a signature exports a prim_val from a
structure. The compileExp function is recursively defined by a large case
statement for the many kinds of Lambda expressions. The box with Compiler.sml
is shown because that is the place from where the various stages are
controlled.
The
internal Lambda Expression code is in principle untyped. Values can live in an
evaluation stack or be stored in a global store or in the free variable
environment of a closure. Values can be primitive (int, string, etc.), closures
or are built from “blocks” consisting of an 8bit tag and a sequence of values.
Some block examples: an SML tuple is represented as a block with tag 0 and the
tuple elements as the elements of the block sequence. A list is represented as
a block with tag 1 and a two-element sequence of the head and tail of the list
(recursively). A reference is represented as a block with tag 250 and the
one-element sequence of references value.
Due to the
short time span of the project and a low initial level of knowledge of the CLR,
of CIL assembler and of Moscow ML compiler internals, the strategy has been to
study these areas simultaneously with an incremental construction of the pilot
compiler together with design of data types and other design questions.
I have used
the working hypothesis that the CLR is sufficiently similar to CaML light that
we can preserve the structure of the compiler including the backend down to the
structure of the prime compilation function compileExp. We could hope to
preserve a major part of the macroinstruction set (Instruct.sml) and just
change the byte code emitting stage (Emit*.sml). In practise the strategy has
been to start from the source code of Moscow ML and do:
1.
Build
new files for Back.sml, Emitxxx.sml and Instruct.sml. To avoid confusion they
will be called Cil_back.sml, Cil_ephr.sml and Cil_inst.sml. Change the
references in the other source files of the compiler.
2.
Construct
a skeleton of Cil_back.sml by copying from the old, but with an empty
compileExp.
3.
Make a
first guess at the implementation of integer and string internal types.
4.
Study
the input to compileExp for the simplest possible test cases using the printing
facilities of Pr_lam.sml.
5.
Try to
copy additional exercised cases of the old compileExp to Cil_back.sml while
adding new macroinstruction names used to Cil_inst.sml and CIL-emitting code to
Cil_ephr.sml. Rename macroinstructions from Kxxxx to Cxxxx to avoid confusion.
a.
Test
if the test cases are compiled correctly (if no new macroinstructions had to be
invented and the representation of internal types is correct, that would be
likely)
b.
If
needed change the logic of the code in the case in Cil_back, perhaps inventing
new macroinstructions.
c.
If
relevant, write mock-up code in C#, compile and disassemble it to find out how
to express a certain functionality in CIL assembler.
6.
Repeat
from point 3 with less simple test cases eventually exercising additional
internal data types.
Other
notable points:
1.
During
the early phases, adapt the build system (make based) of the Moscow ML compiler
sources to the current situation.
2.
Incrementally
build a set of test cases that may be run and controlled automatically to
continuously check the quality of the pilot compiler and document the
implemented features.
3.
If
possible, write longer code sequences as methods in C# instead of CIL assembler
and call the compiled methods from CIL.
4.
I have
used some of the chapters from [Gough] for general information on the CLR and
in particular used code samples there for integer tests and integer arithmetic.
5.
I have
used the introductory chapters from [Lidin] and a lot of browsing of the
electronic version for more deep information on CIL.
6.
The
.NET SDK from Microsoft delivered electronic documentation for .NET in general,
for C# and for the .NET standard library.
7.
Peter
Berthelsen, describes a Java byte code backend for Moscow ML in [PMB], which has a lot of information on the internals of Moscow ML, in
particular the Lambda code.
8.
Morten
Syslvest Olsen describes the CLR and differences between code generation for CLR and JVM in [MSO].
In this
section I describe and discuss design questions for the pilot compiler. More
detailed technical descriptions of interesting details are deferred to section 5.
At least
three strategies can be devised for generating (binary) CIL code: (1)
generating byte codes directly, (2) generating (textual) CIL assembler and
running the ilasm assembler on it and (3) using the run-time code generation
facilities of the .NET framework. Strategy (2) is much easier to implement than
(1) and it is much easier to manipulate and debug the generated code. Strategy
(3) is not relevant before the compiler itself is running on CLR, it is a
strategy for the future. Thus (1) was chosen.
There are
three standard formats of CIL code files: “exe”, “dll” and “netmodule”. The
first two are “assemblies” in their own right i.e. they may be loaded
dynamically into a running CLR program. The difference being that the former
may be started as an application. The third format is used for compiled code
that are going to be linked together with other “netmodules” to form a combined
assembly. While “netmodules” may be interesting for building libraries and
completely self-contained executables, a CLR based interactive Moscow ML system
must be able to dynamically load (newly) compiled SML units. For simplicity, I
have chosen to let the pilot compiler always generate an “exe” file assembly,
with the entry point being a method that will evaluate the expressions in the
compiled program (after loading dependencies).
All
generated code is put inside a namespace “MosML”. Such is natural for runtime
code and library code with any language. In our case the class and member names
in the compiled code will be out of the control of the user and bear no
resemblance to the names in the source file. Therefore it is natural to “hide”
them in a “MosML” namespace to avoid clashes with code written in other
languages. The pilot compiler even puts all the code from, say, “Unit1.sml” in
namespace “MosML.Unit1” to avoid clashes between different compiled units.
The
following table gives an overview of the contents of a compiled unit. A C#
program that mimics the way a compiled unit works can be found in the compiler
subdirectory of the pilot compiler distribution under the name UnitTmpl.cs.
|
Type (class) |
Use |
|
main |
Container for globals, exports and dependencies tables (described in section 3.9) and methods implementing the evaluation of the unit |
|
clo |
Container for methods implementing closure bodies |
Overview of a compiled SML file
In order to
let several individually compiled SML units work together, they need common
definitions of the basic value types (classes). Therefore these definitions
have been put in a separate assembly, “MosMLRT.dll” (RT for RunTime), written
as a C# program “MosMLRT.cs”.
This
runtime assembly also contains some methods called by the CIL code for
primitives in the compiler’s intermediate Lambda code and some methods to be
referenced with the “prim_val” construct in an SML library source file.
The runtime
assembly furthermore contains the list of loaded SML units together with
support methods for loading unit dependencies, an implementation of the
“General” unit and some exception handling support methods. The value types implements ToString methods
to support easy “ugly-printing” of values from test cases until access to
pretty-printing becomes available some time in the future.
|
Type (class) |
Use |
|
Value |
Abstract super class of all |
|
MLInt : Value |
Representation of int (and char) |
|
MLString : Value (MLStringBuilder) |
Representation of string |
|
BLOCK : Value |
Representation of block |
|
MLArgList, MLArgs, MLFuntion |
Closure construction |
|
CLOSURE : Value |
Closure representation |
|
MLException : System.Exception |
Exception handling |
|
UnitRef |
Data for a loaded unit |
|
Runtime |
Container for list of loaded units |
|
General : UnitRef |
Initialization of General unit |
|
Stdlib |
Methods implementing primitives |
Overview of the MosMLRT assembly
Types and
members etc. defined in the runtime are referenced in the CIL code of compiled
SML files by using full names like “class [MosMLRT]MosML.MLInt“, the part in
square brackets denoting a reference to an external assembly. The CLR resolves
such external references at load time by looking in the “global assembly
cache”, in the application directory or following directions in various
configuration files. All of these ways of locating the runtime assembly were
tried and the most flexible one during compiler development were to let the
build system install new versions of the runtime assembly whenever a new one
was built. To be able to install the runtime in the global assembly cache it
must have a version id and be signed digitally. The CIL generated by the
compiler must include the same version id and a compatible public key
fingerprint for this to work, so these items must be held in sync between the
runtime C# source and the compiler SML source.
The way to
find the runtime assembly in a future production compiler is undecided, but one
would probably wish to install the runtime (and the standard library) in the
global assembly cache for performance reasons anyway.
At the
project start there was a working hypothesis that one would have to rely on
unverifiable code using e.g. unmanaged function pointers to get an efficient
implementation of closures. It became clear during early experiments with code
generation that the batch verifier from the .NET SDK was a great help in early
discovery and location of bugs if one required generated code to be accepted by
the verifier without warnings. I decided to try to stay with verifiable code as
long as possible and only start experimenting with unverifiable code when I had
more experience with CIL.
It turned
out to be possible to implement decently efficient closures with verifiable
code and trying to improve performance with unverifiable code was never attempted.
A case for
insisting on verifiability can be made in the following way. If the point of
implementing Standard ML on CLR is to enable high reliability of large-scale
server applications through type safety in the source language, it would seem
strange that users must enable unsafe applications in the server to run these
“ultra-safe” applications. Indeed, verification may help prevent a compiler bug
crash the server and not just the application.
We will represent SML values by objects of suitable CLR classes. One could imagine getting higher efficiency with value classes (structs), but value classes are not heap-allocated, which lead me to ignore this possibility in the pilot compiler.
At many points in the code generation we need to be able to handle the internal representation of a SML value which can be of any type, so we need a common super class, MosML.Value or just Value. The only feature we need in Value in addition to being a System.Object is an equality method (denoted eqvals). One should not override the standard Equals method – we need that to test for object identity. Besides, eqvals will return the internal ML values for true or false, not the CLR booleans.
public
abstract class Value {
public
static BLOCK unit = new BLOCK(null, 0, 1);
public
static BLOCK mlfalse = new BLOCK(null, 0, 2);
public
static BLOCK mltrue = new BLOCK(null, 1, 2);
public
static BLOCK[] falsetrue = new BLOCK[2] {mlfalse, mltrue};
public
abstract BLOCK eqvals(Value o);
}
Now we can represent int and string by simply adding an int32 field respectively a string field to Value and implementing eqvals in the obvious way. That gives classes MLInt and MLString:
public class MLInt : Value {
public int
val;
public
MLInt(int i) {
val = i;
}
public
override BLOCK eqvals(Value o) {
if (val
== ((MLInt)o).val) return Value.mltrue;
else
return Value.mlfalse;
}
}
public class
MLString : Value {
public
string val;
public
MLString(string i) {
val = i;
}
public
override BLOCK eqvals(Value o) {
if
(val.Equals(((MLString)o).val)) return Value.mltrue;
else
return Value.mlfalse;
}
}
(I have removed the uninteresting ToString method from these and subsequent C# code snippets)
The code for eqvals just casts the “o“ argument to the type of “this” without hesitation because SML type safety prevents calling eqvals with data that would result in a casting error.
The block internal value used for lists etc. needs a tag and a sequence of values, so we let the class BLOCK extend Value with a byte field named tag and a Value[] to hold the sequence. There is currently also a byte field denoted span, which are only really needed at compile time, but is of help in the current temporary support for printing values despite a small performance penalty. Then eqvals checks for equal tag values and then test for pair wise equality of the elements in the sequences using eqvals (recursively). Equality testing is different for tag=250, which are used for references and where we check for equality as CLR objects of the elements of the sequences (there can be only one in each in this case).
public class
BLOCK : Value {
public byte
tag;
public byte
span;
public Value[]
val;
public
BLOCK(Value[] v, byte t, byte l) {
tag = t;
span = l;
val = v;
//could be null?
}
public
override BLOCK eqvals(Value o) {
BLOCK ob = (BLOCK)o;
if (tag != ob.tag) return Value.mlfalse;
Value[]
oval = ob.val;
if (tag
== 250) //reference
return
val[0] == oval[0] ? Value.mltrue : Value.mlfalse;
int len =
(val == null ) ? 0 : val.Length;
int olen
= (oval == null ) ? 0 : oval.Length;
if (len
!= olen) return Value.mlfalse;
for
(int i=0; i<len; i++)
if (val[i].eqvals(oval[i]).tag == 0)
return Value.mlfalse;
return
Value.mltrue;
}
}
It is important during implementation of internal value representation to remember that (most) values in SML are immutable so the corresponding internal representations should be handled so too, and that we usually cannot reuse the value objects since the corresponding SML value might be a constituent of some list, tuple etc..
Adapting
the most basic part of the standard library (Array, Int, Misc, Strbase,
Stringcvt, Vector, Char, List, Option, String, Substring) lead to the use of
MLInt also for the internal representation of char, but to add an
MLStringBuilder class to support manipulation of the contents of strings. The
adaptation of Vector and Array lead to the introduction of MLVector. This way
the SML code implementing the library modules did not have to be changed, but
methods was added to the MosML.Stdlib class in the runtime assembly to support
the prim_val declarations in the library implementation. Some of the methods in
the MosML.Stdlib class need to be over-loaded because some input Value may be
either an MLString or MLStringBuilder or either an MLVector or a BLOCK with tag
250 containing an MLVector (i.e. a ref to a vector – an array). The overloading
is done by explicit tests now, but it might be cleaner and more efficient to
rely on CLR overloading.
The
adaptation of these parts of the standard library has not been thoroughly
tested.
In the Java
byte code backend of [PMB], java.lang.Object is used where I use Value. The
standard java.lang.Integer boxed int is used for int and corresponding to the
BLOCK class above, the a Constructor class is provided, but tuples and
references are represented by special classes where the tag has been removed.
The standard equals method is overridden instead of adding a separate one as done here. The question of
representation of strings are discussed, and a solution with java.lang.String
was chosen.
As explained in section 2.2, the Lambda expression intermediate code operates
with a (virtual) stack containing e.g. let bound values, which can be addressed
with a “Lvar of int” Lambda expression. The CLR stack cannot be used in this
way: it is really just syntactic sugar for method call notation – one cannot
address into the stack. In the CLR one must declare and use local variables to
hold the let bound values instead of putting them on the stack. Fortunately,
the infrastructure in the original compileExp for keeping track of at which
stack index a given value can be found adapts directly to this situation where
the values are found in local variables. The compilation of (Lvar i) is
actually a little simpler now.
The
internal representation of a closure must encapsulate both the code needed to
evaluate the closure body and the free variables bound at the time of closure
definition. That is for the original closures constructed by the UNdeBruijn
function in Cil_back.sml. During evaluation, closures may be applied with fewer
arguments than the closure body expects. The result will be a new closure
value, where the arguments supplied so far are saved along with the free
variable values and the same body for later evaluation when all arguments have
been supplied.
The code
for evaluating the body must necessarily reside inside a method body in the
CIL. We will use one CIL method pr closure body and put all such methods in a
single class which will have the full name “MosML.<unit name>.clo”.
Another approach would be to put all closure body code within a large switch
statement in a single, common method for a whole compilation unit. That
approach would enable implementing tail calls inside the unit by jumps, but the
CLR has support for tail calls of methods so we try to have one method per
closure body.
The image
one should have of a closure value internal type is a class with fields for the
list of bound free variables, the arguments supplied so far and a function pointer
to the method implementing the body. Unfortunately, using pure method pointers
make the CIL code unverifiable and will prevent the CLR from doing proper tail
calls. But the CLR has a kind of function pointer called a delegate that fits
nicely with this idea, can be used in verifiable code and can be invoked as a
proper tail call. A delegate is a pointer to a method, with a particular “this”
object bound opaquely inside the delegate. A delegate must be of a specific
delegate type declared by the signature of arguments and return value. I let
the class MosML.<unit name>.clo (containing the closure body
methods) have fields for a free variable list and a list of arguments supplied
so far. Then one can create a delegate encapsulating both the function pointer
and specific lists by creating a new object of class “clo”, initialising the
lists and create a delegate based on the appropriate method and this object.
The CLOSURE
class for representing closure values then just extends Value with a field of
the appropriate delegate type, and for technical reasons an int field holding
the missing number of arguments. The ML equality method throws an exception
directly since it should be impossible to compare closures in Standard ML.
/* MLArgList
is a poor man's linked list of values. Used for closure
* arguments
and lists of free variables. The values are limked in
* backwards
order, which means we can add a list of further
* arguments
by a trivial join without changing the original
* (shared)
argument list. Thus the name "prev" for the field
*
traditionally named "next".
*
* argument
list (v1,v2,v3) is represented as
* ( arg: v3 )
*
( prev: -----> ( arg: v2 )
* ( prev: ------> ( arg: v1 )
* ( prev: . )
*/
public class
MLArgList {
public
Value arg; //Last arg of the ones
supplied so far
public
MLArgList prev; //ref to previous arguments
public
MLArgList(Value v) {
arg = v;
}
public
MLArgList(Value v, MLArgList mla) {
arg
= v;
prev = mla;
}
}
public class
MLArgs {
public
MLArgList args_sofar;
public
MLArgList free;
//There is
a slight performance benefit in this order of f and a
public
MLArgs(MLArgList f, MLArgList a) {
args_sofar = a;
free = f;
}
}
public
delegate Value MLFunction(int missing, MLArgList tail, MLArgList args);
public class
CLOSURE : Value {
public
MLFunction func;
public int
missing;
public
CLOSURE(MLFunction f, int a) {
func = f;
missing =
a;
}
public
CLOSURE(int a) {
missing =
a;
}
public
override BLOCK eqvals(Value o) {
//Type
safety in SML should prevent ever to call eqvals on a CLOSURE
throw new
System.Exception("Calling eqvals on a CLOSURE object");
}
}
The main
difficulty is to handle the fact that closures may be called in both under- and
over saturated settings – with fewer than expected respectively too many
arguments. In the former case the arguments should just be stowed away for
later use. In the latter case, the closure body should be supplied with the
just right number and called. The return value must be a closure, which must be
applied on the remaining arguments and so on until all arguments have been
used. One can handle these matters either at the place where a closure is
applied or inside the closure in a prologue before the proper compiled body
code.
In the
pilot compiler, the case of too few parameters is handled within the closure
method prologue, because this is where the free variables and earlier supplied
arguments are available. Thus the prologue will in the under saturated case add
the new arguments to the ones supplied earlier, then construct a new delegate
pointing to the method itself but with a new “this” object and finally return a
CLOSURE based on this new delegate. The case of too many parameters is handled
at the site where the closure is applied by only supplying the right number of
arguments to the delegate invocation and repeating until all arguments has been
used. It would seem cleaner to put the “too many” handling in the closure
prologue too; one could save some code duplication that way.
The free
variable list is created at the closure creation site (created by the splitting
in UNdeBruijn) and not manipulated further until it is unpacked when the real
body is evaluated.
In the
implementation of the ideas above it is important to realize, that the partial
argument lists may be shared among several closures so one may not modify them
arbitrarily.
A
walk-through of the implementation details can be found in section 5.4.
Generating
CIL instructions for tail calls when appropriate is done by matching for a ret
instruction on the code continuation in the Cil_ephr.emit_phr case for the
Ccall macroinstruction.
A very
similar, but perhaps simpler and more efficient approach comes to mind but has
not been tested. Each closure body would define a class extending a common
abstract closure class with three fields for (1) the free variable list, (2)
the arguments supplied so far and (3) the number of arguments taken by the
body, and an abstract “invoke” method to be implemented by the compiled closure
body in the instantiated subclasses. One would call a closure by adding the new
arguments to the arguments supplied so far, and if the count of arguments
reached the count taken by the body, the invoke method would be called. One
would have to take care with over saturated calls. The logic seems to be almost
the same as the delegate approach described above, but it seems possible to
save some method calls.
[PMB] uses a design, where all the closure bodies are
compiled into cases in a switch common to all closures in a unit. The CLR
approach with delegates would not be possible in JVM.
There are
two kinds of exceptions to be handled: exceptions generated by the CLR, e.g.
because of an IO error, stack overflow etc. and exceptions raised by the SML
raise function. The former kind must be handled with CIL try/catch
blocks and we will use this native exception approach for the SML exceptions by
raising exceptions with throw and handling them with try/catch blocks. This
choice seems the only reasonable one.
To do so we
subclass the System.Exception class as MosML.MLException by adding a field to
contain the internal representation of the ML exception value (a BLOCK;
MosML.MLException cannot be a Value itself). MosML.MLException has a static
method, “filter”, for transforming a System.Exception to the appropriate
MLException value, if possible.
Now a “exp handle match” SML expression should be compiled to
something like “try{comp(exp)}catch(Exceptionclass
ex){comp(match)}”. A
design question now is which exception class to catch? I decided to just catch
any System.Exception no matter the contents of “match”. This might seem wasteful since we
do not utilize the CLR’s exception filtering capabilities, but I assume that
exceptions that would be caught and rethrown several times with the chosen approach
would usually make it all the way to the top level and abort the program, so
the performance hit is not so important. Hopefully, evaluating expressions in a
try block will not incur much overhead if no exception is thrown. If one wanted
to utilize the CLR’s probably more efficient filtering, one would have to
inspect the match Lambda expression, which is a non-trivial matter because it
has been translated to a chain of if-then-else expressions somewhere in the
font end of the compiler.
The
compiled match expression expects to see a SML exception value on the top of
the evaluation stack. Therefore we insert a call to MosML.MLException.filter to
transform the System.Exception the CLR has left on the stack to a SML value and
code to rethrow if this was a kind of exception with no remapping to SML
exceptions.
The main
difficulty with exception handling implementation is the fact that the CLR
requires that its stack is empty on entry to a try block and empty on exit from
a try or catch block. The empty on exit conditions can be handled easily (given
that empty on entry has been handled) by saving the value generated by the exp
or match expression in a local value and restoring it right after leaving the
block. Note that one can only leave a try or catch block with the special leave
instruction that takes a label to branch to.
It doesn’t
seem to be a good idea to reorganize the compileExp function to generate CIL
code that stores every computed intermediate value in local variables and only
populates the stack right before a method call or other instruction that takes
values from the stack.
Another
approach would be to put “comp(exp)” inside a separate CIL method, which would
then need to be supplied with the values of “exp’s” free variables as
arguments. Such an approach is taken in [PMB]. With few free variables and nested deeply in an
expression, this might be more efficient than the save/restore design, but it
seems more complicated to implement and I do not expect many handle statements
to be buried deep in expressions or late in long argument lists for functions.
Thus we
have to save the stack in (local) variables before entering the try block and
restore them at the point where the leave instruction returns control. To
implement that I had to extend the compileExp function with an extra argument,
a list of the types of values on the CLR stack the start of the evaluation of
the exp argument to compileExp and modify all the cases. There is a code example in section 5.5.
Note: it
would be more in line with the .NET guidelines to let MLException subclass
System.ApplicationException, but in the development phase it is practical to
stick with System.Exception because then a stack trace is included in the
default ToString() method.
There are
three important aspects of the implementation of integer arithmetic:
The
operations must be checked i.e. raise an exception on overflow and not just
wrap around. The CIL has checked variants of the relevant arithmetic instructions
so we just use those variants and make sure the thrown CLR exceptions will be
transformed to the right ML kind in the call of the MosML.MLException.filter
method at the start of a catch block.
The
operations must produce correct results, which is an issue for the operation
related to division, where rounding must be in the right direction. The
existing “div” and “rem” CIL instructions corresponds to rounding towards zero
and thus to the Int.quot and Int.rem functions. The SML div and rem operators
have no direct equivalences as CIL instructions, but [Gough] explains how to
implement then in a dozen llines of CIL assembler. I have adapted his code.
One should
if possible avoid the expensive boxing and unboxing between MLInt objects and
the ints inside when computing integer expressions. The pilot compiler applies
the idea of “unboxing early” by inserting explicit unboxing CIL instructions in
the instruction stream and trying to propagate these backwards instead of
hiding the unboxing inside the arithmetic macro instruction. In pseudo code, if
we are compiling “e1 + e2” we would get something like
“comp(e1),iunbox,comp(e2),iunbox,add,ibox”. If the e’s are integer constants or
arithmetic expressions themselves, we may eliminate ibox,iunbox pairs. It is
needed that the compileExp function is written in continuation passing style,
but it was for exactly such kinds of optimisations. The concrete implementation
of this case is shown in section 5.3.2.
Global
values, i.e. values declared at top level are stored in a static Value array
“globals” in the “main” class in the namespace of the unit under compilation
using the index in the Lambda get_global/set_global expressions as index. The
map from name to index into the ”globals” array is maintained as a hash table
“exports” (a System.Collections.Hashtable object) for the benefit of external
access. This mimics what happens in Moscow ML.
Referencing
values in other units are oversimplified done as follows: For an SML expression
like “Int.maxInt”, which is translated to a Lambda expression
“Lprim(Pget_global(Int.maxInt/0)”, the compiler will generate code that finds
and dynamically loads and evaluates the assembly, Int.exe, (created by compiling
the Int.sml unit source code), then finds the value as
Int.main.globals[Int.exports[“maxInt”]].
This is not
the correct semantics though. First of all we should only load and init each
unit once, so a hash table “MosML.Runtime.loaded_units” declared in the runtime
assembly MosMLRT.cs caches references to the “globals” and “exports” objects of
already loaded units.
Secondly,
the order of events will not be correct. Before starting the evaluation of a
compiled unit, all units on which it depends must be loaded and evaluated
(recursively). Therefore the following happens. During compilation, code to
create and initialise a static array containing the names of direct
dependencies is emitted. When a compiled unit is started as an application, and
before running the evaluation code, the list of dependencies is examined. Every
dependency assembly is found and loaded dynamically, without running any
evaluation code; the dependencies of the newly loaded units are loaded in the
same way. When all dependencies have been loaded, they are sorted topologically
according to dependency and their evaluation code is run in a compatible order.
Then we can run the evaluation code of the originally started unit. The
supervision of dependency loading is done in MosML.Runtime.load_dependencies()
and the individual units are loaded in the constructor of MosML.UnitRef, both
classes defined in the MosMLRT runtime assembly.
The
topological sorting is done with an adaptation of the standard DFS based
algorithm, see chapter 22 of [CLRS].
In [PMB], exported values are implemented as fields with the
fieldname being a slightly mangled version of the export name to avoid certain
name clashes. Thus an explicit table lookup to find a value in another unit is
not needed, the JVM takes care of that.
Mimicking
the approach for loading dependencies, one may implement a C# program that can
call a function defined in a compiled SML unit. The pair T_smlcall.cs/L_ext.sml
in the compiler/ciltest regression test directory (cf. section 4) of the pilot compiler distribution does so. This is
not a satisfactory solution for end-programmers accessing SML “objects” from
other languages, because all value marshalling has to be done by hand on the C#
side, and the type safety of SML is lost.
The pilot
compiler also has a feature for use of objects in assemblies written in C#, by
putting a full assembly reference in a prim_val declaration. An example can be
found in the regression test directory as the pair T_ccall.sml/L_ccall.cs. This
feature is for the reasons above expert-only.
More
deboxing using current approach, using the code continuation may be possible by
being more systematic.
The local
variables allocated for values that would be stored and accessed in the stack
in Moscow ML are all declared to be of class Value, but for a let bound int
value one would have further opportunity to debox by declaring the variable to
be of type int32. One would have to consider reuse of local variable slots
(virtual stack indexes) though to implement that.
One could
augment the code continuation with a description of the list of types of values
the code continuation expects to be on the stack. That information would then
be used instead of matching on the code continuation instructions to debox. I
believe the approach would be doable and more general that the current one, by
being more easily able to “look through” e.g. labels and branching
instructions.
Benchmarks
show that the BLOCK implementation is a performance bottlenecks compared to
Moscow ML (cf. section 6.2), both in areas of creation and garbage collection.
One might
attempt to define one BLOCK class for each tag value and eliminate the tag
field, and even specialize the number of values in the sequence in the BLOCK
class (for small numbers) in order to eliminate the array operations. Both tag
value and sequence length is known at compile time. One would have to benchmark
to see if anything could be gained that way.
One
opportunity for optimising closure calls are saturated self-tail-calls within a
closure body. Such calls are a typical pattern in loop-type recursive utility
functions. One should be able to avoid the call and just update the arg values
and branch to the start of the closure body code. The pilot compiler detects
saturated self-calls, and inserts comments in the CIL assembler to mark them
but does not do any optimisations.
One might
consider special casing the closure type according to the number of arguments
taken by the closure body with the saved arguments as individual fields to
avoid packing and unpacking.
One might
attempt to do a limited type inference of closures at compile time by examining
the body to see if one can determine that the return value will always be a
specific run-time type (int, string or block) and similarly for the arguments
to determine a call signature, e.g. (int, int)àint. I do not propose to do comprehensive type
inference, just some quick and dirty local examination. I think a simple, low
complexity, recursive examination of closure bodies could catch a significant
number of such restricted signatures. One might then be able to easily debox
the return value, depending on the precise closure implementation. Deboxing the
arguments would be harder and not possible unless one uses some special casing
of closures that do not store the arguments in lists.
In cases, where all the possible uses of a
closure can be determined, one might also use the ways a closure is used to
restrict types as in “let fun id x = x in id y +1 end”. I am not sure how
complex such inference would be.
In [PMB]
closure specialisation for direct calls of manifest functions was done using
some runtime type inference and an
extension of the information maintained during the lifting of closure bodies in
UNdeBruijn.
One might
use varargs instead of MLArgLists for packaging a variable number of arguments
in a list in the calling conventions used. The JIT will surely be able to
generate more efficient packing/unpacking of a varargs list than the explicit
linked lists. This idea looks most promising in a special-casing scenario where
MLArgLists are used less.
[Gough] has a discussion at the end of chapter 9 one should
have in mind when considering optimisations at the IL level. The question is to
what extent does classical optimisations on the IL level just change the work
that the JIT has to do or even make it more difficult? No comprehensive answer
to this question is available.
One
particular instance of this question of potential relevance for the current
pilot concerns its use of local scratch variables. These are variables that are
currently used for a short stretch of instructions inside macro instructions
e.g. for compensating the lack of a
swap instruction when the arguments to a object constructor has been created in
the wrong order on the stack (the creation order being given by SML semantics).
Currently every method generated by the compiler has a fixed selection of such
scratch variables of various types – in numbers that cover all situations.
Using such method global local variables indicates to the JIT that the
corresponding slot may be reused. One could also declare new local scratch
variables each place a new independent one is needed signalling the lifetime of
the variables to the JIT. I do not see which alternative is best. Possibly the JIT
will be able to infer the necessary information from the usage pattern anyway.
Some benchmarking of local variable strategies is described in [MSO], where the conclusion is that the JIT seems to be
clever enough that the strategy used in th CIL code is not so important.
The experience from working a little with a small subset of the standard
library implementation was that a straightforward port to CLR was quite routine
work. But a short look at the internals of some of the IO library modules shows
that those will need some more real design work to be ported because the IO
system of CaML Light does not resemble the IO system of CLR very much. Blindly
implementing the primitives of the former on CLR would not be a good idea.
Testing design decisions will be easier having an interactive system
running on CLR so one should try to get there quickly. Since every new
top-level declaration evaluated by the interactive system must be compiled and
then loaded and executed, the interactive system must have substituted CLR
runtime code generation (RTCG) for the current IL assembler loop to get
acceptable response times.
A quick way to support RTCG for a compiler running on CLR would be to
make a C# method in MosMLRT.Stdlib for each Cil_inst macroinstruction by
transforming the corresponding case in Cil_ephr to a sequence of RTCG method
calls. The method would be available for a new Cil_ephr via a zillion prim_val
declarations.
One should be able to bootstrap the compiler on CLR before attempting to
compile the interactive system.
The following strategy for extending the current work to a CLR back end
for Moscow ML can thus be proposed:
Extend the standard library support to the parts needed for
bootstrapping. IO seems to be the most likely part to be non-routine.
Bootstrap the compiler.
Develop a RTCG library module to replace the [Gough][Gough]current CIL assembler emitter.
Implement any missing standard library support needed for the
interactive system.
Implement the interactive system.
Start experiments to improve performance.
Implement the rest of the standard library.
Make sure ML system will run on other important CLR platforms.
Consider how to make the new Moscow ML compliant with the Common
Language Specification.
To try the
pilot compiler you will need to run one of the Microsoft Windows platforms,
where the Microsoft .NET SDK is supported and have that SDK installed. I assume
below that the C# compiler, csc, the CIL assembler, ilasm, the global assembly
cache utility, gacutil and the batch CIL verifier, peverify is in your path.
Get the
pilot compiler from http://www.it-c.dk/~kokholm/mos.ml/pilot-31-05-2002.zip
and unzip with your preferred unzip utility.
To prepare
for using the runtime assembly, install it in the global assembly cache by
cd’ing to the “compiler” subdirectory op the top directory of the unpacked
distribution and running “gacutil /i MosMLRT.dll”. Alternatively, copy the
MosMLRT.dll assembly to any directory from which you want to run compiled programs.
The following script show how you could compile and run the T_handle test case starting in the top level directory of the distribution:
C:\home\kokholm\dotml>cd compiler\ciltest
C:\home\kokholm\dotml\compiler\ciltest>..\mosmlcmp -P none T_handle.sml
File "T_handle.sml", line 20-23, characters
8-133:
!
......."a" => (5555555 * 5555555; "bignum")
! |
"b" => raise Fail "Fejl"
! |
"c" => raise Subscript
! |
"d" => "noerror".
! Warning: pattern matching is not exhaustive
C:\home\kokholm\dotml\compiler\ciltest>ilasm /quiet T_handle.il
Assembling 'T_handle.il' , no listing file, to EXE
--> 'T_handle.EXE'
Source file is ANSI
Operation completed successfully
C:\home\kokholm\dotml\compiler\ciltest>T_handle
67
308608025
67
17
(#a=a# (#b=b# (#c=c# (#noerror# (#e<>a,b,c#
.)))))
1010101
105
105
39
(otte# 321 ulla3)
! Uncaught exception:
! eFoo hejsa
If you do
not include “–P none” in the call of the compiler it
will probably exit because it cannot find Misc.ui to preload Misc. If you
include the flag –lam_debug, the compiler will show the Lambda
expressions as they are compiled. You can access the partial standard library
during compilation by adding a flag “-stdlib <path to standard library>” to the command line; in the situation of the
example above “-stdlib c:\home\kokholm\dotml\mosmllib”. Then you do not need “‑P none”, but you will have to define an environment
variable DOTMLLIB with the value “.:<path to standard library>” (without the quotes, the separator
is a colon) in order to execute the compiled code.
If you are
using a cygwin shell to run the above you may want to use the run_mosmlc script
in the compiler directory to run both stages and the verifier in one command.
The
compiler has been compiled with filename handling for unix systems. I have only
tested with filenames starting with an uppercase letter and all other letters
lower case and I recommend following that convention.
The
compiler and library are built using make. You will need Moscow ML to compile
the compiler. There are make files in the compiler directory for cygwin make
(Makefile) and Microsoft nmake (Makefile.w32). If you make clean, you will need
perl to recreate a couple of files. There is only a cygwin makefile in the
mosmllib directory.
The
regression tests are in the ciltest subdirectories of the compiler and mosmllib
directories. The regression tests are performed by perl scripts by the name
regress, that must have access to utilities like diff to run.
The source
of the pilot compiler has been created by initially making a copy of Moscow ML
source code, deleting some parts that were not needed. Thus the directory
structure matches the Moscow ML structure closely.
The pilot
compiler has three phases:
1.
mosmlcmp.exe,
compiled with Moscow ML from the modified compiler sources compiles the input
file to a CIL assembler file with extension il.
2.
The
ilasm assembler program (from .NET SDK) compiles the il file to an EXE file.
3.
The
exe file is verified with the peverify program (also from .NET SDK)
The finished executable must have acces to the MosMLRT.dll runtime assembly in the same directory or installed in the global assembly cache. Any SML unit dependency must be found in the path given by the DOTMLLIB environment variable.
Changed/new
SML files overview with comments on differences with the Moscow ML version:
|
File name |
Comment |
|
Cil_back.sml Cil_back.sig |
Replacement for Back.sml, reusing much of that. Mostly changes in
compileExp, but preserveing the overall structure of that function. |
|
Cil_ephr.sml |
Replacement for Emit_ephr.sml and Emitcode.sml. Defines start_emit_phrase and end_emit_phrase functions that writes
the header and trailer of the CIL assembler file. Defines emit_phrase that translates the Cil_inst macro instruction
stream to CIL assembler instructions. |
|
Cil_inst.sml |
Replacement for Instruct.sml, Defines the symbolic CIL macro instruction set used in the code stream
generated by Cil_back.compileLambda . |
|
Cil_glob.sml |
New. A few values shared among the other Cil_*.sml files |
|
Compiler.sml |
Changed reference from Back to Cil_back and byte code file extension
from uo to il. Added a flag to signal the start of the end_emit phase to
Cil_back; the flag needed to compile the phrases created by Sigmtch
correctly. |
|
Sigmtch.sml |
Changed references from Back and Emit_phr to Cil_back and Cil_ephr. |
|
Lambda.sml |
Changed reference from Instruct to Cil_inst. |
|
Mainc.sml |
Added a command line flag to display Lambda code during compilation. |
|
Prim.sml |
Added an int to the Pclosure constructor: the number of arguments
taken by the closure body, to communicate that information from the
UNdeBruijn splitting to the compilation of the closure creation expression
compilation in compexp. |
|
Pr_lam.sml |
Change of Lambda expression printing due to the change in Prim.sml |
|
Types.sml |
Commented out a “raise Toplevel” to enable compilation of a test case,
with Value polymorphism at top level (compiler/pstest/Inductive.sml). |
I will give
three examples from Cil_back and Cil_ephr of how Lambda expressions are
compiled as cases in the compexp local function in compileExp and further
emitted to CIL assembler. The compexp function has been called with 5
arguments: compexp cilstk sz dp exp C:
|
cilstk |
The list of value types on the stack when evaluation of exp starts |
|
sz dp |
2 ints describing the state of the Lambda expression virtual stack |
|
exp |
The expression to be compiled |
|
C |
The code continuation – a list of CIL macro instructions |
The compexp
is a large case statement on the value of exp.
Accessing
the i’th element of the sequence of a BLOCK, e.g. the i’th field of a tuple, is
compiled in Cil_back.sml by the case
| Lprim (Pfield i, [e1]) =>
(noteDepth (cilstk,1);
compexp cilstk sz dp e1 (Cgetfield
i :: C))
The
noteDepth utility function registers that the CLR stack (due to the value of
e1) will be 1 element larger at some point of the evaluation than at the start,
but not more at positions in the code visible at this point. The evaluation of
e1 may internally demand even more stack size, but that will be noted in the
compilation of e1. This stack size
information is used for computing a value for the mandatory .maxstack
CIL directive for every method.
The case
then puts a Cgetfield macroinstruction in front of the continuation and then
compiles e1 with the new continuation and with the same stack type description,
cilstk, as the whole exp.
The
Cgetfield macroinstruction assumes the value of e1 is on the top of the CLR
stack. The value will be a BLOCK by SML type safety, but the evaluation of e1
may have left something the verifier can only infer will be of type Value so
the emitted instruction stream starts with a castclass instruction to BLOCK.
The next instruction loads (pushes) the Value[] val field of the BLOCK on top
of stack to the stack, removing the BLOCK. Then the array index is loaded on
stack and that element of the val array is loaded, removing the array reference
and index. The net effect is to pop an element from the stack and push a Value
object. From Cil_ephr.sml:
| Cgetfield i :: C =>
(output
(os,"castclass\t[MosMLRT]MosML.BLOCK\n\
\ldfld\tclass
[MosMLRT]MosML.Value[] [MosMLRT]MosML.BLOCK::val\n\
\ldc.i4\t"^myitos(i)^"\n\
\ldelem.ref\n");
emit C)
Note the
verbosity with the types and field refs spelled out in all details including
having to spell out the fact that [MosMLRT]MosML.Value denotes a class. This is
by demand of the CIL assembler and makes the intermediate il files even huger.
The myitos function just translates an SML int to a C#-string. Note that this
code will actually expand the stack on element more than noted above, but the
expansion will be confined to the interior of this instruction sequence and we
can take care of all such local expansions by adding a constant (currently 3)
to the maxstack value computed by the noteDepth calls. The .maxstack CIL
directive only has to give an upper bound on the stack size.
The case
for adding the values of two expressions, (Cil_back.sml:)
| Lprim (Psmladdint, [e1,e2]) =>
(noteDepth
(cilstk,2);
compIntArithmetic cilstk Cadd
sz dp e1 e2 C)
Simply calls
compIntArithmetic defined elsewhere by
and compIntArithmetic cilstk inst sz dp e1
e2 C =
let val C2 = case C of
Ccvt (cilValue, cilInt) :: C1 => C1
| _ => Cnewint :: C
in
compexp
cilstk sz dp e1
(Ccvt (cilValue, cilInt) ::
compexp (cilInt::cilstk) sz dp e2
(Ccvt (cilValue, cilInt) :: inst :: C2))
end
Which first checks if the code continuation starts with an unboxing operation, Ccvt. If so, the unboxing operation is removed, else a boxing operation is inserted. Now we have a new continuation C2. The two expressions are compiled with code continuations starting with unboxing operations, e2 with a cilInt type (from e1) added to the stack description. The code in Cil_ephr for outputting CIL code for the Ccvt and Cadd symbolic instructions are quite simple:
| Cadd :: C =>
(output
(os,"add.ovf\n");
emit C)
| Ccvt (cilValue, cilInt) :: C
=>
(output
(os,"castclass\t[MosMLRT]MosML.MLInt\n\
\ldfld\tint32 [MosMLRT]MosML.MLInt::val\n") ;
emit C)
The add.ovf
is a checked integer addition. The unboxing must use a castclass instruction to
keep the CIL verifier happy and then just extracts the val field of the MLInt
object.
A closure
application, applying the (closure) value result of evaluating the expression f
to the values given by evaluating the list of expressions args is compiled by
the following case in Cil_back (changed slightly by removing a few things
irrelevant for this discussion and let binding some inermediate results):
| Lapply (f,args) =>
let val nargs = List.length args
val C1 = Ccall nargs :: C
val newcilstk = cilValueList nargs cilstk
val C2 =
compexp newcilstk sz dp f C1
in (noteDepth (cilstk,nargs+1);
compExpList cilstk sz dp args C2
end
The new
code continuation C1 just is C with a Ccall macroinstruction prepended. This is
the most complex macroinstruction. It packages the argument values left on the
stack by the code compiled from args in an MLArgList, gets the delegate from
the CLOSURE left on the stack by the code compiled from f and then invokes the delegate. But over
saturation of the call and the potential for issuing a tail call instruction is
also handled by the Ccall emitted code.
Then f is
compiled with continuation C1 with a stack description with nargs cilValue
types added to the original (by the cilValuelist function). Finally the
arguments are compiled one by one by the compExpList utility function defined
elsewhere. The noteDepth function registers that the stack depth will be up to
nargs+1 places larger than the initial stack at the boundaries visible in this
code.
I will not
show the CIL assembler code produced by Ccall here but refer to the source code
or the inermediate il files generated by the compiler.
The details
of how closures work is exposed at three types of sites: where the closure is
created, where it is called and inside the compiled body code is run. I will
explain the function of the CIL assembler by a mock-up written in C#. A full
mock-up example is in UnitTmpl.cs in the compiler directory of the source.
The initial
construction of a closure (the compilation of a Lprim(Pclosure…)) expression
does this: Build the list of free variables and save them with an empty list of
supplied arguments in an object that will become the private object of a
delegate pointing to the compiled closure body method:
//bundle the free variables to be
stored in the closure in a list
//together
with an empty list of arguments supplied so far:
MlargList free = new
MLArgList(free_1);
free = new MLArgList(free_2, free);
//…
free = new MLArgList(free_nfree,
free);
clo initialargs = new clo(free,
null);
CLOSURE f = new CLOSURE(new
MLFunction(initialargs.cl_TMPL), missing);
The call site
will in principle just do
//package arguments supplied here
in args list of length numargs
MLArgList tail = new MLArgList(arg_1);
MLArglist args = tail
args = new MLArgList(arg_2, args);
//…
args = new MLArgList(arg_numargs,
args);
//
//and call delegate. “tail” is the last element of
args
f.func(f.missing-numargs, tail,
args);
The
prologue of a closure method shows how under-saturation is handled by
concatenating the lists of supplied variables to create a new base object for a
delegate pointing to the same closure body method.
class clo : MLArgs {
public Value cl_TMPL(int missing,MLArgList
tail,MLArgList args){
tail.prev = args_sofar; //Concatenate
the lists! tail.prev used to be null
if (missing > 0) {
//build new closure and return it
clo newargs = new clo(free, args);
return new CLOSURE(new
MLFunction(newargs.cl_TMPL), missing);
}
else {
//unpack the free vars and args to
the local variables
// expected by the compiled closure body
Value free_nfree = free.arg;
Free= free.prev;
//…
Value free_1 = free.arg
//
Value arg_nargs = args.arg;
args= args.prev;
//…
Value arg_1 = args.arg
//execute the compiled closure body
and return its value
//
}
}
The call
site will actually handle the possibility of an over-saturated call. Instead of just invoking f.func as shown
above it executes CIL instructions equivalent to a call
call_closure(f, f.missing-numargs,
tail, args)
where
call_closure is
public static Value
call_closure(CLOSURE f, int numargs, MLArgList tail , MLArgList args) {
if (f.missing >= numargs)
//”Normal, not oversaturated”
//Tail call,
return f.func(f.missing-numargs, tail, args);
//The over saturated case:
//supply as many args as possible to the closure
//call and repeat with the closure returned and
//the rest of the arguments :
int missing = f.missing - numargs;
Value result = f;
while (missing<0) {
CLOSURE closure_to_call = (CLOSURE)result;
int surplus = -missing-closure_to_call.missing;
MLArgList newtail = args;
for (int i=1; i<surplus;i++)
newtail = newtail.prev;
if (surplus == 0)
//Tail call
return closure_to_call.func(0, tail, newtail.prev);
result = closure_to_call.func(0, tail, newtail.prev);
missing = -surplus;
tail = newtail;
}
}
The
call_closure code has to be inlined as assembler in practise to insert tail
instructions where necessary.
The
following CIL example explains the code emitted from compiling the expression
a+(b*c handle _ => 13), assuming a,b,c are let bound as
variables number 1 to 3:
First a is pushed and unboxed
ldloc.s V_1
castclass
[MosMLRT]MosML.MLInt
ldfld int32 [MosMLRT]MosML.MLInt::val
A local variable to save the value on stack is
declared and the value saved
{.locals( int32 HANDLE_1_0)
stloc.s
HANDLE_1_0
Stack is empty, we can enter .try block,
compute b*c, store result in local variable SCRATCH_VALUE, and branch to label DOTML_1:
.try {
ldloc.s V_2
castclass
[MosMLRT]MosML.MLInt
ldfld
int32 [MosMLRT]MosML.MLInt::val
ldloc.s V_3
castclass
[MosMLRT]MosML.MLInt
ldfld
int32 [MosMLRT]MosML.MLInt::val
mul.ovf
newobj instance void
[MosMLRT]MosML.MLInt::.ctor(int32)
stloc.s
SCRATCH_VALUE
leave DOTML_1}
Handle any exception, rethrow if uncatchable in
ML, else put 13 in SCRATCH_VALUE:
catch
[mscorlib]System.Exception {
call class
[MosMLRT]MosML.BLOCK [MosMLRT]MosML.MLException::'filter'(class
[mscorlib]System.Exception e)
dup
brtrue
DOTMLLOC_1
rethrow
DOTMLLOC_1:
stloc.s V_4
ldc.i4 13
newobj instance void
[MosMLRT]MosML.MLInt::.ctor(int32)
stloc.s
SCRATCH_VALUE
leave DOTML_1}
Restore the saved stack value:
DOTML_1:
ldloc.s HANDLE_1_0
}
Push the value of the handled expression and
perform the addition.
ldloc.s SCRATCH_VALUE
castclass
[MosMLRT]MosML.MLInt
ldfld int32 [MosMLRT]MosML.MLInt::val
add.ovf
newobj
instance void [MosMLRT]MosML.MLInt::.ctor(int32)
During the
development of the pilot compiler, test cases played an all-important role.
First test input files was created for discovering what Lambda expressions were
generated by specific SML expressions. Later, when the compiler started
producing executable code, the correct output from the test input files were
saved and a regession test script was written that would compile and run all
test input files and compare the output with the expected. The tests are
divided into files according to their main area of interest, but common types
like lists are used and therefore tested everywhere.
The
regression script was run frequently, normally after every small adjustments of
the compiler, to discover immediately if a change to support some additional
feature or repair a bug had adverse effects on functionality already
implemented. Whenever a test for a new feature provoked a bug in some area
already considered finished, a dedicated test case for that bug was devised
(and the problem resolved).
In addition
to covering old bugs, the selection of expressions in test cases has been
selected in an attempt to be complete with respect to the tested SML features,
but has been somewhat limited by my SML experience. Fortunately the supervisors
supplied extra, nasty test cases or ideas to stretch the limits of the
compiler. I tend to believe the high bug hit rate of their ideas were due to
their cleverness and not due to flakiness of the compiler.
The way the
compileExp function works together with optimisation by using code continuation
matching, the effects of UNdeBruijn lifting and optimisations of case/switch
expressions based on the distribution of cases is sufficiently complex that it
can be hard to predict which case and code path will be emitted and exercised
by a particular input program. I considered adding macroinstructions inserting
trace instructions in the generated code everywhere in compileExp to be able to
check the internal completeness of the tests, but decided that it would be bad
at that stage to completely clutter the compileExp code that way.
In the
cases in compileExp potentially exercised by the compilation of conditional
expressions (if-then-else, andalso, orelse, case/match) I have inserted
comments in the emitted code so I could at least be sure that the test cases
would cover the compileExp with respect to code generation.
The
following table summarizes some benchmarks run to compare the performance of
code compiled with the pilot compiler with code compiled by Moscow ML. The
tests were run on a 466 MHz Pentium Celeron pc with 256 MB RAM and running
Windows XP. All CLR assemblies were built with no debugging symbols – running
the C# compiler and CIL assembler with the /debug flag typically imposed a 2.5
times performance penalty on the resulting assembly.
The first
column indicates where the benchmark SML source program is located. The second
column my guess at what feature is particularly benchmarked in that case.
Garbage collection will be exercised in all cases, except the one marked nogc.
The next two columns are running times in seconds (typical result of 5 tries)
and finally the ratio of run times, where relevant. Times were measured by
running the programs in a Cygwin bash shell using the standard time command.

The Bench
test is shown here because it was the first program used for comparing the
performance of an early version of the pilot compiler to Moscow ML and was used
continuously to check that changes in the design did have some positive effect
or not a great negative effect on performance.
The Arith
test spends most of its time performing integer arithmetic. Due to the deboxing
optimisation for pure arithmetic expressions in the pilot, performance is
respectable for the pilot. The slower performance of Moscow ML is because
integers are represented as 31 bits in a 32 bit field, where the last bit is a
flag bit that has to be masked (shifted away) during the actual
computation.
The Queens and Tak(euchi) tests closure performance but also garbage collection and Takt in particular tuples. These benchmarks gives the impression that closure performance is not so bad, possibly just half the speed of Moscow ML, but an indication that BLOCK handling might be less efficient. To factor out closures, the While* tests uses the while primitive to implement loops. The While and While2 tests build let bound tuples respectively short lists and forgets them immediately, while the While3 test build one large list and forgets it in the end. The indications are that building BLOCKs are perhaps 5 times as slow with the pilot, and garbage collecting a lot of them even slower. There are few possibilities for the user to influence the way garbage collection is performed in CLR. One may force a garbage collection at a particular time, which shouldn’t help here. One may change the gcConcurrent flag from true to false, which can be done at application granularity level in an application config file. I have tried that but saw no significant change in the results.
The Handle* cases tries to gauge the overhead for having try block, where the exception is not thrown. The Handle1 repeatedly computes 3*(4*(5*(4+(6 handle _ => 7)))) in a while loop, the Handle0 computes the same expression with 6 substituted for ( 6 handle _=>7). In the pilot, the compiler will generate code to save 4 ints in local variables and restore them again in each loop. The execution times indicates that the overhead of using handle statements for this kind of test are somewhat higher relatively in the pilot than in Moscow ML, but not terribly high in itself.
Finally,
the OpenMisc test just does “open Misc”. The times for the pilot starts at over
3 seconds and after some iterations stabilize around 1.2 seconds. The test
shows that one will have to have the standard library installed in the global
assembly cache to obtain good start up performance for realistic programs – the
other benchmarks load no libraries.
Comparing
the sizes of .EXE files in the pilot version of part of the standard library to
the corresponding .uo files in Moscow ML indicates a factor around 5. A lot of
this will be inline closure-call handling code. A single test indicated that
the pilot output files could be much more compressible than the Moscow ML ones.
Both greater size and compressibility seems irrelevant for files on local disc.
I haven’t
measured the performance of the compilation process itself, but there is really
no comparison at this stage: Moscow ML is lightning fast, while the pilot
compiler seems to take seconds on even the smallest files. This difference is
natural since the pilot compiler generates a huge CIL assembler file and then
starts ilasm to assemble it and finally runs the verifier. A future production
compiler using RTCG instead of textual CIL assembler should be expected to be
nearly the same speed as (current) Moscow ML – depending on the efficiency of
the RTCG library.
I have
tried the pilot with Mono on Windows XP and Microsoft SSCLI on FreeBSD. I have
not experimented with DotGNU, and according to the status description on the
website as of end of May 2002, tail calls and delegates are not implemented in
DotGNU at this point in time, so it would have failed any test involving
functions.
Tried
version: binary kit for MS Windows downloaded 23-04-2002.
1.
There
is no CIL assembler for (this version of) mono, so I settled for trying to run
the already compiled test cases on the mono runtime (the jit version).
2.
The
reflection and some other needed standard libraries were missing, so loading of
other SML units could not be done.
3.
This
version of mono doesn’t support the “tail.” instruction prefix, so I had to
remove those prefixes from the intermediate il files. After that, all compiler
tests succeeded, except for the test of correct tail recursion.
4.
A few
attempts at performance testing were not very successful: I couldn’t run the
tests that use (tail) recursion for looping since the programs would die in a
stack overflow. Running the While2 test didn’t succeed because the process grew
out of virtual memory on my machine (around 700 MB) – running on MS CLR the
process grows to around 5 MB in size. This seems to indicate a problem with
garbage collection on the tested version of mono. The While3 test was more
successful and ran at twice the speed of the MS CLR!
5.
Under
some circumstances exceptions that should have been caught by the try/catch in
the Main method were not caught. In particular, the stack overflow because of
the missing “tail.” instruction prefix resulted in mono aborting without any
message.
6.
A
single small syntax quirk was discovered: MS CLR accepts array creation (with
equal semantics) with CIL assembler either “newarr Value” or “newarr class
Value”, but mono will not run code compiled from the latter aborting with an
“implement me error message”.
I tested
the compiler with Microsoft’s SSCLI, a “shared-source” implementation of the
CLR on FreeBSD 4.5 [MSDN], using the “free” (as opposed to checked) build. The
version used was downloaded late march 2002.
I could not
build the runtime assembly because of problems with the strong-name
implementation, so I used a copy of the one compiled on the .NET SDK. The
compiler test cases then ran correctly under the “clix” runtime system except
for some of the expressions in T_fun. The problems with T_fun were not resolved
due to time restraints.
A design
and pilot implementation of a CLR backend for the compiler in Moscow ML has
been developed.
Tests
indicates that the design is sufficiently comprehensive and efficient to be the
basis of a usable SML compiler targeting CLR that will be able to bootstrap
itself and compile an interactive SML system on CLR.
Some areas
for improving the current design have been indicated.
The
following strategy for extending the current work to a CLR back end for Moscow
ML can be proposed:
Extend the
standard library support to the parts needed for bootstrapping. IO seems to be
the most likely part to be non-routine.
Bootstrap
the compiler.
Develop a
RTCG library module to replace the current CIL assembler emitter.
Implement
any missing standard library support needed for the interactive system.
Implement
the interactive system.
Start
experiments to improve performance.
Implement
the rest of the standard library.
Make sure
ML system will run on other important CLR platforms.
Consider
how to make the new Moscow ML compliant with the Common Language Specification.
|
Cormen, Leiserson, Rivest and Shamir: “Introduction to algorithms”,
2.ed. MIT press 2001 |
|
|
John Gough, “Compiling for the .NET Common Language Runtime (CLR)”,
Prentice Hall 2002 |
|
|
http://java.sun.com/ |
|
|
[Lidin] |
Serge Lidin, “Inside Microsoft .NET IL assembler”, Microsoft press
2002 |
|
http://www.go-mono.org/ |
|
|
Morten Sylvest Olsen, “A pizza compiler for .NET”, Master’s Thesis DTU
2002 |
|
|
Peter M. Berthelsen, “Compiling SML to Java bytecode”, Master’s
Thesis, DTU 1998 |
1.
Cil_back.sml
2.
Cil_ephr.sml
3.
Cil_inst.sml
4.
Cil_glob.sml
5.
MosMLRT.cs
6.
UnitTmpl.cs
7.
C#
test cases
8.
SML
test cases
[1] See the Moscow ML web site for
references to material on Standard ML.
[2] The original acronym was MSIL,
newer Microsoft documentation may use IL. I follow [Gough] in using CIL.
[3] Invalid CIL cannot be executed and
so is less interesting.
[4] The byte code file will be missing
if one is only compiling a SML signature. I will ignore this case in the
following.