On a CLR backend for Moscow ML

4-week project report

Niels Jørgen Kokholm

IT-University of Copenhagen

May 31, 2002


Contents

1    Preface. 3

2    Background, problem statement and methods. 3

2.1     Problem statement 3

2.2     The current Moscow ML compiler 5

2.3     Strategy. 6

3    Problem analysis. 7

3.1     Overall structure of code generation and code. 7

3.2     The runtime assembly. 7

3.3     Verifiable vs. unverifiable code. 8

3.4     Simple internal data types. 9

3.5     Stack discipline. 10

3.6     Closures. 10

3.7     Exception handling. 12

3.8     Integer arithmetic. 13

3.9     Global values and accessing values in other units. 14

3.10       Some proposals for optimisations. 15

3.11       Towards a port of Moscow ML to CLR.. 16

4    User’s guide to the pilot compiler 17

4.1     To install the pilot, compile and run a SML program.. 17

4.2     To modify, recompile and test the compiler and library. 18

5    Technical details. 18

5.1     Structure of pilot compiler 18

5.2     Source changes overview.. 19

5.3     Examples of compilation of expressions and code generation. 19

5.4     Closure details. 21

5.5     Exception handling example. 22

6    Testing. 23

6.1     Regression tests. 23

6.2     Benchmarks. 24

6.3     Testing on other CLR platforms. 25

7    Conclusion. 26

8    References. 27

9    List of code listing appendices. 27

 

 

 


1         Preface

This report was written in May 2002 in connection with a 4-week project at the IT University of Copenhagen under supervision of Andrzej Waskowski and Peter Sestoft. Thanks to both for initial input, discussions, advice and evil test cases. Thanks to the developers of Moscow ML for making it open source, thereby enabling and encouraging projects like this.

In the project I have made a preliminary investigation of making the Moscow ML system use the IL intermediate code of the Common Language Runtime instead of the current CaML Light intermediate code. The project has produced a pilot compiler as a Standard ML program that can compile a significant subset of Standard ML including a small subset of the Standard ML standard library to correct and somewhat efficient Common Language Runtime programs.

The project was done in continuation of the course “Programming languages” [PFO] at the IT University. The report assumes knowledge of functional programming, in particular of Standard ML, of compiler construction at the level of that course and some knowledge of C#.

2         Background, problem statement and methods

Moscow ML [MosML] is a compiler and an interactive system for the strict functional language Standard ML (SML)[1]. The current, version 2.0, Moscow ML compiler generates intermediate code in the form of CaML Light bytecode that can be interpreted by a CaML Light virtual machine (runtime program). The interactive system works by compiling individual top level declarations on the fly to byte code and then loading the result into the virtual machine.

The Common Language Runtime (CLR) ([ECMA]) is one of the two main components of the Microsoft .NET Framework [MSDN], the other one being the .Net Framework class libraries.  The CLR is (the specification of) a virtual machine that runs programs in the form of Common Intermediate Language (CIL) bytecode[2]. I may sometimes misuse the term CIL to denote the textual (assembler) form when no mistake should be possible. The CLR is in many respects quite similar to the Java Virtual Machine (JVM) ([Java]).

There are two interesting levels of CIL the CLR can execute: verifiable vs. valid but unverifiable[3]. Verifiable code is CIL code the CLR can verify beforehand is safe for the CLR to execute in the sense that it will not corrupt the runtime itself or other running applications. Valid but unverifiable code may be executed by the CLR if permitted by the configured security policy. The possibility of executing unverifiable code is there to enable high-performance implementations of certain programming languages/applications.

It is the stated policy of Microsoft to make the .NET framework, in particular the CLR, the preferred platform for application execution on hardware platforms from hand-held devices to large-scale servers. Other projects (Mono [Mono] and DotGNU Portable.NET [DotGNU]) are developing open source CLR implementations to support various purposes on GNU/Linux platforms. It is therefore to be expected that implementations of the CLR will in the future be very widely deployed.

2.1      Problem statement

The purpose of this project is to reach a documented opinion on whether and how an efficient backend for Moscow ML targeting the CLR can be constructed. The documentation of the opinion will be the descriptions in this report together with a tested pilot implementation of a sufficient part of the central features of Moscow ML on CLR.

There are several perspectives of such an endeavour:

1.      The perspective of promoting the safety of strongly typed languages in the server application area.

2.      The perspective of assessing the portability of the design of Moscow ML with respect to the backend virtual machine or assessing the flexibility of the CLR as an execution platform for languages outside the realm of object oriented/imperative languages.

3.      The personal perspective of learning about virtual machine code generation in a non-toy-compiler setting, learning about the CLR and learning more about functional programming.

Since one might imagine porting the CaML Light runtime from C to C# or even simulating one of the current hardware/software platforms of Moscow ML in C# to get the compiler and compiled programs to run on top of CLR, targeting the CLR is only really interesting if efficiency is part of the requirements. Therefore head-on comparison of benchmark program with the current Moscow ML is important.

On the other hand, fine-tuning of implementation detail performance, or changing the way Moscow ML system works in areas not directly related to a specific backend is not relevant.

 Following some more detailed priorities:

2.1.1      High priority

To support compiling enough interesting test cases, the pilot implementation must implement integers, integer arithmetic, strings and simple string operations. Tuples, lists, algebraic data types, references, functions (including mutually recursive ones) with proper tail recursion, conditional expressions (including cases) and exception handling are also needed.

Design of good, efficient internal data types for the above types.

Special consideration must be made of the expected problem areas closures, exception handling and garbage collection.

The implementation should be covered reasonably by test cases although a comprehensive regression test suite for the pilot compiler is not required.

Consider use of verifiable or unverifiable code.

2.1.2      Medium priority

The part of the standard library preloaded by default. Both as test cases and to get some experience with the amount of work needed and whether new tricks typically need to be invented or adding support for the libraries is of a more routine character.

Bootstrapping the compiler or implementing an interactive system is beyond what should be expected during the project, but consideration should be given not to make design decisions that would make bootstrapping or interactive mode overly difficult.

Cooperation with other languages on the .NET platform should be taken into account in order not to make it too hard to call other languages or be called from other languages. Designing libraries for exporting SML values as CLR objects, implementing the .NET class library as a SML library are well beyond the scope of this project though.

2.1.3      Low priority

I don’t expect floating-point types or unsigned or alternative precision integer types to present problems that would not be exposed by integer values. Therefore implementation of these types together with their arithmetic and bitwise operations will have low prioritity.

Other features of low priority: The quote/antiquote facilities because they are not used much. The parts of the standard lib not loaded with the Misc unit; the top level compilation mode and testing the compilation of code using anything but the simplest parts of the module system (i.e. no functors in test cases). Character set questions.

2.2      The current Moscow ML compiler

The Moscow ML compiler is constructed in the traditional way with stages for lexing and parsing that build an abstract syntax tree, front-end stages for type checking and tree restructuring that transforms the abstract syntax tree to an intermediate form and back-end stages for compiling the intermediate “language” to byte code. The output of the compilation of an SML input file will be an “.ui” file describing the external interface of the compilation unit and an ”.uo”  file containing the byte code to be executed by the runtime to evaluate the expressions in the source file[4].

The following diagram zooms in on the back-end code generating stages:

 

 

A few of the SML files making up the stages or defining a format are shown together with some central functions.

In the main flow, each top-level declaration of the source SML program is translated in the front end to a Lambda expression, which is sent to the backend. In the backend UNdeBruijn splits a Lambda expression containing one or more function definitions into the list of (Lambda expression) function bodies and a modified expression with the function definitions substituted by closure-creating Lambda primitives and function applications substituted with closure call primitives. The resulting Lambda expressions are then compiled to (symbolic) macroinstructions by compileExp and the result translated to real byte code in the Emitxxx modules. In certain situations the Sigmtch module may generate extra Lambda expressions, e.g. when a signature exports a prim_val from a structure. The compileExp function is recursively defined by a large case statement for the many kinds of Lambda expressions. The box with Compiler.sml is shown because that is the place from where the various stages are controlled.

The internal Lambda Expression code is in principle untyped. Values can live in an evaluation stack or be stored in a global store or in the free variable environment of a closure. Values can be primitive (int, string, etc.), closures or are built from “blocks” consisting of an 8bit tag and a sequence of values. Some block examples: an SML tuple is represented as a block with tag 0 and the tuple elements as the elements of the block sequence. A list is represented as a block with tag 1 and a two-element sequence of the head and tail of the list (recursively). A reference is represented as a block with tag 250 and the one-element sequence of references value.

2.3      Strategy

Due to the short time span of the project and a low initial level of knowledge of the CLR, of CIL assembler and of Moscow ML compiler internals, the strategy has been to study these areas simultaneously with an incremental construction of the pilot compiler together with design of data types and other design questions.

I have used the working hypothesis that the CLR is sufficiently similar to CaML light that we can preserve the structure of the compiler including the backend down to the structure of the prime compilation function compileExp. We could hope to preserve a major part of the macroinstruction set (Instruct.sml) and just change the byte code emitting stage (Emit*.sml). In practise the strategy has been to start from the source code of Moscow ML and do:

1.      Build new files for Back.sml, Emitxxx.sml and Instruct.sml. To avoid confusion they will be called Cil_back.sml, Cil_ephr.sml and Cil_inst.sml. Change the references in the other source files of the compiler.

2.      Construct a skeleton of Cil_back.sml by copying from the old, but with an empty compileExp.

3.      Make a first guess at the implementation of integer and string internal types.

4.      Study the input to compileExp for the simplest possible test cases using the printing facilities of Pr_lam.sml. 

5.      Try to copy additional exercised cases of the old compileExp to Cil_back.sml while adding new macroinstruction names used to Cil_inst.sml and CIL-emitting code to Cil_ephr.sml. Rename macroinstructions from Kxxxx to Cxxxx to avoid confusion.

a.       Test if the test cases are compiled correctly (if no new macroinstructions had to be invented and the representation of internal types is correct, that would be likely)

b.      If needed change the logic of the code in the case in Cil_back, perhaps inventing new macroinstructions.

c.       If relevant, write mock-up code in C#, compile and disassemble it to find out how to express a certain functionality in CIL assembler.

6.      Repeat from point 3 with less simple test cases eventually exercising additional internal data types.

 

Other notable points:

1.      During the early phases, adapt the build system (make based) of the Moscow ML compiler sources to the current situation.

2.      Incrementally build a set of test cases that may be run and controlled automatically to continuously check the quality of the pilot compiler and document the implemented features.

3.      If possible, write longer code sequences as methods in C# instead of CIL assembler and call the compiled methods from CIL.

4.      I have used some of the chapters from [Gough] for general information on the CLR and in particular used code samples there for integer tests and integer arithmetic.

5.      I have used the introductory chapters from [Lidin] and a lot of browsing of the electronic version for more deep information on CIL.

6.      The .NET SDK from Microsoft delivered electronic documentation for .NET in general, for C# and for the .NET standard library.

7.      Peter Berthelsen, describes a Java byte code backend for Moscow ML in [PMB], which has a lot of information on the internals of Moscow ML, in particular the Lambda code.

8.      Morten Syslvest Olsen describes the CLR and differences between code generation for CLR and JVM in [MSO].

3         Problem analysis

In this section I describe and discuss design questions for the pilot compiler. More detailed technical descriptions of interesting details are deferred to section 5.

3.1      Overall structure of code generation and code

At least three strategies can be devised for generating (binary) CIL code: (1) generating byte codes directly, (2) generating (textual) CIL assembler and running the ilasm assembler on it and (3) using the run-time code generation facilities of the .NET framework. Strategy (2) is much easier to implement than (1) and it is much easier to manipulate and debug the generated code. Strategy (3) is not relevant before the compiler itself is running on CLR, it is a strategy for the future. Thus (1) was chosen.

There are three standard formats of CIL code files: “exe”, “dll” and “netmodule”. The first two are “assemblies” in their own right i.e. they may be loaded dynamically into a running CLR program. The difference being that the former may be started as an application. The third format is used for compiled code that are going to be linked together with other “netmodules” to form a combined assembly. While “netmodules” may be interesting for building libraries and completely self-contained executables, a CLR based interactive Moscow ML system must be able to dynamically load (newly) compiled SML units. For simplicity, I have chosen to let the pilot compiler always generate an “exe” file assembly, with the entry point being a method that will evaluate the expressions in the compiled program (after loading dependencies).

All generated code is put inside a namespace “MosML”. Such is natural for runtime code and library code with any language. In our case the class and member names in the compiled code will be out of the control of the user and bear no resemblance to the names in the source file. Therefore it is natural to “hide” them in a “MosML” namespace to avoid clashes with code written in other languages. The pilot compiler even puts all the code from, say, “Unit1.sml” in namespace “MosML.Unit1” to avoid clashes between different compiled units.

The following table gives an overview of the contents of a compiled unit. A C# program that mimics the way a compiled unit works can be found in the compiler subdirectory of the pilot compiler distribution under the name UnitTmpl.cs.

 

Type (class)

Use

main

Container for

globals, exports and dependencies tables (described in section 3.9) and

methods implementing the evaluation of the unit

clo

Container for methods implementing closure bodies

Overview of a compiled SML file

 

3.2      The runtime assembly

In order to let several individually compiled SML units work together, they need common definitions of the basic value types (classes). Therefore these definitions have been put in a separate assembly, “MosMLRT.dll” (RT for RunTime), written as a C# program “MosMLRT.cs”.

This runtime assembly also contains some methods called by the CIL code for primitives in the compiler’s intermediate Lambda code and some methods to be referenced with the “prim_val” construct in an SML library source file.

The runtime assembly furthermore contains the list of loaded SML units together with support methods for loading unit dependencies, an implementation of the “General” unit and some exception handling support methods.  The value types implements ToString methods to support easy “ugly-printing” of values from test cases until access to pretty-printing becomes available some time in the future.

 

Type (class)

Use

Value

Abstract super class of all

MLInt : Value

Representation of int (and char)

MLString : Value (MLStringBuilder)

Representation of string

BLOCK : Value

Representation of block

MLArgList, MLArgs, MLFuntion

Closure construction

CLOSURE : Value

Closure representation

MLException : System.Exception

Exception handling

UnitRef

Data for a loaded unit

Runtime

Container for list of loaded units

General : UnitRef

Initialization of General unit

Stdlib

Methods implementing primitives

Overview of the MosMLRT assembly

 

Types and members etc. defined in the runtime are referenced in the CIL code of compiled SML files by using full names like “class [MosMLRT]MosML.MLInt“, the part in square brackets denoting a reference to an external assembly. The CLR resolves such external references at load time by looking in the “global assembly cache”, in the application directory or following directions in various configuration files. All of these ways of locating the runtime assembly were tried and the most flexible one during compiler development were to let the build system install new versions of the runtime assembly whenever a new one was built. To be able to install the runtime in the global assembly cache it must have a version id and be signed digitally. The CIL generated by the compiler must include the same version id and a compatible public key fingerprint for this to work, so these items must be held in sync between the runtime C# source and the compiler SML source.

The way to find the runtime assembly in a future production compiler is undecided, but one would probably wish to install the runtime (and the standard library) in the global assembly cache for performance reasons anyway.

 

3.3      Verifiable vs. unverifiable code

At the project start there was a working hypothesis that one would have to rely on unverifiable code using e.g. unmanaged function pointers to get an efficient implementation of closures. It became clear during early experiments with code generation that the batch verifier from the .NET SDK was a great help in early discovery and location of bugs if one required generated code to be accepted by the verifier without warnings. I decided to try to stay with verifiable code as long as possible and only start experimenting with unverifiable code when I had more experience with CIL.

It turned out to be possible to implement decently efficient closures with verifiable code and trying to improve performance with unverifiable code was never attempted.

A case for insisting on verifiability can be made in the following way. If the point of implementing Standard ML on CLR is to enable high reliability of large-scale server applications through type safety in the source language, it would seem strange that users must enable unsafe applications in the server to run these “ultra-safe” applications. Indeed, verification may help prevent a compiler bug crash the server and not just the application.

3.4      Simple internal data types

We will represent SML values by objects of suitable CLR classes. One could imagine getting higher efficiency with value classes (structs), but value classes are not heap-allocated, which lead me to ignore this possibility in the pilot compiler.

At many points in the code generation we need to be able to handle the internal representation of a SML value which can be of any type, so we need a common super class, MosML.Value or just Value. The only feature we need in Value in addition to being a System.Object is an equality method (denoted eqvals). One should not override the standard Equals method – we need that to test for object identity. Besides, eqvals will return the internal ML values for true or false, not the CLR booleans.

  public abstract class Value {

    public static BLOCK unit = new BLOCK(null, 0, 1);

    public static BLOCK mlfalse = new BLOCK(null, 0, 2);

    public static BLOCK mltrue = new BLOCK(null, 1, 2);

    public static BLOCK[] falsetrue = new BLOCK[2] {mlfalse, mltrue};

    public abstract BLOCK eqvals(Value o);

  }

Now we can represent int and string by simply adding an int32 field respectively a string field to Value and implementing eqvals in the obvious way. That gives classes MLInt and MLString:

public class MLInt : Value {

    public int val;

    public MLInt(int i) {

      val = i;

    }

    public override BLOCK eqvals(Value o) {

      if (val == ((MLInt)o).val) return Value.mltrue;

      else return Value.mlfalse;

    }

  }

 

  public class MLString : Value {

    public string val;

    public MLString(string i) {     

      val = i;

    }

    public override BLOCK eqvals(Value o) {

      if (val.Equals(((MLString)o).val)) return Value.mltrue;

      else return Value.mlfalse;   

    } 

  }

(I have removed the uninteresting ToString method from these and subsequent C# code snippets)

The code for eqvals just casts the “o“ argument to the type of “this” without hesitation because SML type safety prevents calling eqvals with data that would result in a casting error.

The block internal value used for lists etc. needs a tag and a sequence of values, so we let the class BLOCK extend Value with a byte field named tag and a Value[] to hold the sequence. There is currently also a byte field denoted span, which are only really needed at compile time, but is of help in the current temporary support for printing values despite a small performance penalty. Then eqvals checks for equal tag values and then test for pair wise equality of the elements in the sequences using eqvals (recursively). Equality testing is different for tag=250, which are used for references and where we check for equality as CLR objects of the elements of the sequences (there can be only one in each in this case).

  public class BLOCK : Value {

    public byte tag;   

    public byte span;

    public Value[] val;

    public BLOCK(Value[] v, byte t, byte l) {

      tag = t;

      span = l;

      val = v; //could be null?

    }

    public override BLOCK eqvals(Value o) {

      BLOCK ob = (BLOCK)o;

      if (tag != ob.tag) return Value.mlfalse;

      Value[] oval = ob.val;

      if (tag == 250) //reference

           return val[0] == oval[0] ? Value.mltrue : Value.mlfalse;

      int len = (val == null ) ? 0 : val.Length;

      int olen = (oval == null ) ? 0 : oval.Length;

      if (len != olen) return Value.mlfalse;

      for (int i=0; i<len; i++)

           if (val[i].eqvals(oval[i]).tag == 0)

             return Value.mlfalse;

      return Value.mltrue;

    }

  }

 It is important during implementation of internal value representation to remember that (most) values in SML are immutable so the corresponding internal representations should be handled so too, and that we usually cannot reuse the value objects since the corresponding SML value might be a constituent of some list, tuple etc..

Adapting the most basic part of the standard library (Array, Int, Misc, Strbase, Stringcvt, Vector, Char, List, Option, String, Substring) lead to the use of MLInt also for the internal representation of char, but to add an MLStringBuilder class to support manipulation of the contents of strings. The adaptation of Vector and Array lead to the introduction of MLVector. This way the SML code implementing the library modules did not have to be changed, but methods was added to the MosML.Stdlib class in the runtime assembly to support the prim_val declarations in the library implementation. Some of the methods in the MosML.Stdlib class need to be over-loaded because some input Value may be either an MLString or MLStringBuilder or either an MLVector or a BLOCK with tag 250 containing an MLVector (i.e. a ref to a vector – an array). The overloading is done by explicit tests now, but it might be cleaner and more efficient to rely on CLR overloading.

The adaptation of these parts of the standard library has not been thoroughly tested.

In the Java byte code backend of [PMB], java.lang.Object is used where I use Value. The standard java.lang.Integer boxed int is used for int and corresponding to the BLOCK class above, the a Constructor class is provided, but tuples and references are represented by special classes where the tag has been removed. The standard equals method is overridden instead of  adding a separate one as done here. The question of representation of strings are discussed, and a solution with java.lang.String was chosen.

3.5      Stack discipline

As explained in section 2.2, the Lambda expression intermediate code operates with a (virtual) stack containing e.g. let bound values, which can be addressed with a “Lvar of int” Lambda expression. The CLR stack cannot be used in this way: it is really just syntactic sugar for method call notation – one cannot address into the stack. In the CLR one must declare and use local variables to hold the let bound values instead of putting them on the stack. Fortunately, the infrastructure in the original compileExp for keeping track of at which stack index a given value can be found adapts directly to this situation where the values are found in local variables. The compilation of (Lvar i) is actually a little simpler now.

3.6      Closures

The internal representation of a closure must encapsulate both the code needed to evaluate the closure body and the free variables bound at the time of closure definition. That is for the original closures constructed by the UNdeBruijn function in Cil_back.sml. During evaluation, closures may be applied with fewer arguments than the closure body expects. The result will be a new closure value, where the arguments supplied so far are saved along with the free variable values and the same body for later evaluation when all arguments have been supplied.

The code for evaluating the body must necessarily reside inside a method body in the CIL. We will use one CIL method pr closure body and put all such methods in a single class which will have the full name “MosML.<unit name>.clo”. Another approach would be to put all closure body code within a large switch statement in a single, common method for a whole compilation unit. That approach would enable implementing tail calls inside the unit by jumps, but the CLR has support for tail calls of methods so we try to have one method per closure body.

The image one should have of a closure value internal type is a class with fields for the list of bound free variables, the arguments supplied so far and a function pointer to the method implementing the body. Unfortunately, using pure method pointers make the CIL code unverifiable and will prevent the CLR from doing proper tail calls. But the CLR has a kind of function pointer called a delegate that fits nicely with this idea, can be used in verifiable code and can be invoked as a proper tail call. A delegate is a pointer to a method, with a particular “this” object bound opaquely inside the delegate. A delegate must be of a specific delegate type declared by the signature of arguments and return value. I let the class MosML.<unit name>.clo (containing the closure body methods) have fields for a free variable list and a list of arguments supplied so far. Then one can create a delegate encapsulating both the function pointer and specific lists by creating a new object of class “clo”, initialising the lists and create a delegate based on the appropriate method and this object.

The CLOSURE class for representing closure values then just extends Value with a field of the appropriate delegate type, and for technical reasons an int field holding the missing number of arguments. The ML equality method throws an exception directly since it should be impossible to compare closures in Standard ML.

  /* MLArgList is a poor man's linked list of values. Used for closure

   * arguments and lists of free variables. The values are limked in

   * backwards order, which means we can add a list of further

   * arguments by a trivial join without changing the original

   * (shared) argument list. Thus the name "prev" for the field

   * traditionally named "next".

   *

   * argument list (v1,v2,v3) is represented as

   *    ( arg: v3 )

   *    ( prev: -----> ( arg: v2 )

   *                   ( prev: ------> ( arg: v1 )

   *                                   ( prev: . )

   */

  public class MLArgList {

    public Value arg;      //Last arg of the ones supplied so far

    public MLArgList prev; //ref to previous arguments

    public MLArgList(Value v) {

      arg = v;

    }

    public MLArgList(Value v, MLArgList mla) {

      arg = v;

      prev = mla;

    }

  }

  public class MLArgs {

    public MLArgList args_sofar;

    public MLArgList free;

    //There is a slight performance benefit in this order of f and a

    public MLArgs(MLArgList f, MLArgList a) {

      args_sofar = a;

      free = f;

    }

  }

  public delegate Value MLFunction(int missing, MLArgList tail, MLArgList args);

  public class CLOSURE : Value {

    public MLFunction func;

    public int missing;

    public CLOSURE(MLFunction f, int a) {

      func = f;

      missing = a;

    }

    public CLOSURE(int a) {

      missing = a;

    }

    public override BLOCK eqvals(Value o) {

      //Type safety in SML should prevent ever to call eqvals on a CLOSURE

      throw new System.Exception("Calling eqvals on a CLOSURE object");

    }

  }

The main difficulty is to handle the fact that closures may be called in both under- and over saturated settings – with fewer than expected respectively too many arguments. In the former case the arguments should just be stowed away for later use. In the latter case, the closure body should be supplied with the just right number and called. The return value must be a closure, which must be applied on the remaining arguments and so on until all arguments have been used. One can handle these matters either at the place where a closure is applied or inside the closure in a prologue before the proper compiled body code.

In the pilot compiler, the case of too few parameters is handled within the closure method prologue, because this is where the free variables and earlier supplied arguments are available. Thus the prologue will in the under saturated case add the new arguments to the ones supplied earlier, then construct a new delegate pointing to the method itself but with a new “this” object and finally return a CLOSURE based on this new delegate. The case of too many parameters is handled at the site where the closure is applied by only supplying the right number of arguments to the delegate invocation and repeating until all arguments has been used. It would seem cleaner to put the “too many” handling in the closure prologue too; one could save some code duplication that way.

The free variable list is created at the closure creation site (created by the splitting in UNdeBruijn) and not manipulated further until it is unpacked when the real body is evaluated.

In the implementation of the ideas above it is important to realize, that the partial argument lists may be shared among several closures so one may not modify them arbitrarily.

A walk-through of the implementation details can be found in section 5.4.

Generating CIL instructions for tail calls when appropriate is done by matching for a ret instruction on the code continuation in the Cil_ephr.emit_phr case for the Ccall macroinstruction.

A very similar, but perhaps simpler and more efficient approach comes to mind but has not been tested. Each closure body would define a class extending a common abstract closure class with three fields for (1) the free variable list, (2) the arguments supplied so far and (3) the number of arguments taken by the body, and an abstract “invoke” method to be implemented by the compiled closure body in the instantiated subclasses. One would call a closure by adding the new arguments to the arguments supplied so far, and if the count of arguments reached the count taken by the body, the invoke method would be called. One would have to take care with over saturated calls. The logic seems to be almost the same as the delegate approach described above, but it seems possible to save some method calls.

[PMB] uses a design, where all the closure bodies are compiled into cases in a switch common to all closures in a unit. The CLR approach with delegates would not be possible in JVM.

3.7      Exception handling

There are two kinds of exceptions to be handled: exceptions generated by the CLR, e.g. because of an IO error, stack overflow etc. and exceptions raised by the SML raise function. The former kind must be handled with CIL try/catch blocks and we will use this native exception approach for the SML exceptions by raising exceptions with throw and handling them with try/catch blocks. This choice seems the only reasonable one.

To do so we subclass the System.Exception class as MosML.MLException by adding a field to contain the internal representation of the ML exception value (a BLOCK; MosML.MLException cannot be a Value itself). MosML.MLException has a static method, “filter”, for transforming a System.Exception to the appropriate MLException value, if possible.

Now a “exp handle match” SML expression should be compiled to something like “try{comp(exp)}catch(Exceptionclass ex){comp(match)}”. A design question now is which exception class to catch? I decided to just catch any System.Exception no matter the contents of “match”. This might seem wasteful since we do not utilize the CLR’s exception filtering capabilities, but I assume that exceptions that would be caught and rethrown several times with the chosen approach would usually make it all the way to the top level and abort the program, so the performance hit is not so important. Hopefully, evaluating expressions in a try block will not incur much overhead if no exception is thrown. If one wanted to utilize the CLR’s probably more efficient filtering, one would have to inspect the match Lambda expression, which is a non-trivial matter because it has been translated to a chain of if-then-else expressions somewhere in the font end of the compiler.

The compiled match expression expects to see a SML exception value on the top of the evaluation stack. Therefore we insert a call to MosML.MLException.filter to transform the System.Exception the CLR has left on the stack to a SML value and code to rethrow if this was a kind of exception with no remapping to SML exceptions.

The main difficulty with exception handling implementation is the fact that the CLR requires that its stack is empty on entry to a try block and empty on exit from a try or catch block. The empty on exit conditions can be handled easily (given that empty on entry has been handled) by saving the value generated by the exp or match expression in a local value and restoring it right after leaving the block. Note that one can only leave a try or catch block with the special leave instruction that takes a label to branch to.

It doesn’t seem to be a good idea to reorganize the compileExp function to generate CIL code that stores every computed intermediate value in local variables and only populates the stack right before a method call or other instruction that takes values from the stack.

Another approach would be to put “comp(exp)” inside a separate CIL method, which would then need to be supplied with the values of “exp’s” free variables as arguments. Such an approach is taken in [PMB]. With few free variables and nested deeply in an expression, this might be more efficient than the save/restore design, but it seems more complicated to implement and I do not expect many handle statements to be buried deep in expressions or late in long argument lists for functions.

Thus we have to save the stack in (local) variables before entering the try block and restore them at the point where the leave instruction returns control. To implement that I had to extend the compileExp function with an extra argument, a list of the types of values on the CLR stack the start of the evaluation of the exp argument to compileExp and modify all the cases.  There is a code example in section 5.5.

Note: it would be more in line with the .NET guidelines to let MLException subclass System.ApplicationException, but in the development phase it is practical to stick with System.Exception because then a stack trace is included in the default ToString() method.

3.8      Integer arithmetic

There are three important aspects of the implementation of integer arithmetic:

The operations must be checked i.e. raise an exception on overflow and not just wrap around. The CIL has checked variants of the relevant arithmetic instructions so we just use those variants and make sure the thrown CLR exceptions will be transformed to the right ML kind in the call of the MosML.MLException.filter method at the start of a catch block.

The operations must produce correct results, which is an issue for the operation related to division, where rounding must be in the right direction. The existing “div” and “rem” CIL instructions corresponds to rounding towards zero and thus to the Int.quot and Int.rem functions. The SML div and rem operators have no direct equivalences as CIL instructions, but [Gough] explains how to implement then in a dozen llines of CIL assembler. I have adapted his code.

One should if possible avoid the expensive boxing and unboxing between MLInt objects and the ints inside when computing integer expressions. The pilot compiler applies the idea of “unboxing early” by inserting explicit unboxing CIL instructions in the instruction stream and trying to propagate these backwards instead of hiding the unboxing inside the arithmetic macro instruction. In pseudo code, if we are compiling “e1 + e2” we would get something like “comp(e1),iunbox,comp(e2),iunbox,add,ibox”. If the e’s are integer constants or arithmetic expressions themselves, we may eliminate ibox,iunbox pairs. It is needed that the compileExp function is written in continuation passing style, but it was for exactly such kinds of optimisations. The concrete implementation of this case is shown in section 5.3.2.

3.9      Global values and accessing values in other units

Global values, i.e. values declared at top level are stored in a static Value array “globals” in the “main” class in the namespace of the unit under compilation using the index in the Lambda get_global/set_global expressions as index. The map from name to index into the ”globals” array is maintained as a hash table “exports” (a System.Collections.Hashtable object) for the benefit of external access. This mimics what happens in Moscow ML.

Referencing values in other units are oversimplified done as follows: For an SML expression like “Int.maxInt”, which is translated to a Lambda expression “Lprim(Pget_global(Int.maxInt/0)”, the compiler will generate code that finds and dynamically loads and evaluates the assembly, Int.exe, (created by compiling the Int.sml unit source code), then finds the value as Int.main.globals[Int.exports[“maxInt”]].  

This is not the correct semantics though. First of all we should only load and init each unit once, so a hash table “MosML.Runtime.loaded_units” declared in the runtime assembly MosMLRT.cs caches references to the “globals” and “exports” objects of already loaded units.

Secondly, the order of events will not be correct. Before starting the evaluation of a compiled unit, all units on which it depends must be loaded and evaluated (recursively). Therefore the following happens. During compilation, code to create and initialise a static array containing the names of direct dependencies is emitted. When a compiled unit is started as an application, and before running the evaluation code, the list of dependencies is examined. Every dependency assembly is found and loaded dynamically, without running any evaluation code; the dependencies of the newly loaded units are loaded in the same way. When all dependencies have been loaded, they are sorted topologically according to dependency and their evaluation code is run in a compatible order. Then we can run the evaluation code of the originally started unit. The supervision of dependency loading is done in MosML.Runtime.load_dependencies() and the individual units are loaded in the constructor of MosML.UnitRef, both classes defined in the MosMLRT runtime assembly.

The topological sorting is done with an adaptation of the standard DFS based algorithm, see chapter 22 of [CLRS].

In [PMB], exported values are implemented as fields with the fieldname being a slightly mangled version of the export name to avoid certain name clashes. Thus an explicit table lookup to find a value in another unit is not needed, the JVM takes care of that.

Mimicking the approach for loading dependencies, one may implement a C# program that can call a function defined in a compiled SML unit. The pair T_smlcall.cs/L_ext.sml in the compiler/ciltest regression test directory (cf. section 4) of the pilot compiler distribution does so. This is not a satisfactory solution for end-programmers accessing SML “objects” from other languages, because all value marshalling has to be done by hand on the C# side, and the type safety of SML is lost.

The pilot compiler also has a feature for use of objects in assemblies written in C#, by putting a full assembly reference in a prim_val declaration. An example can be found in the regression test directory as the pair T_ccall.sml/L_ccall.cs. This feature is for the reasons above expert-only.

3.10Some proposals for optimisations

3.10.1 Further boxing/unboxing elimination

More deboxing using current approach, using the code continuation may be possible by being more systematic.

The local variables allocated for values that would be stored and accessed in the stack in Moscow ML are all declared to be of class Value, but for a let bound int value one would have further opportunity to debox by declaring the variable to be of type int32. One would have to consider reuse of local variable slots (virtual stack indexes) though to implement that.

One could augment the code continuation with a description of the list of types of values the code continuation expects to be on the stack. That information would then be used instead of matching on the code continuation instructions to debox. I believe the approach would be doable and more general that the current one, by being more easily able to “look through” e.g. labels and branching instructions.

3.10.2 BLOCK implementation

Benchmarks show that the BLOCK implementation is a performance bottlenecks compared to Moscow ML (cf. section 6.2), both in areas of creation and garbage collection.

One might attempt to define one BLOCK class for each tag value and eliminate the tag field, and even specialize the number of values in the sequence in the BLOCK class (for small numbers) in order to eliminate the array operations. Both tag value and sequence length is known at compile time. One would have to benchmark to see if anything could be gained that way. 

3.10.3 Closures

One opportunity for optimising closure calls are saturated self-tail-calls within a closure body. Such calls are a typical pattern in loop-type recursive utility functions. One should be able to avoid the call and just update the arg values and branch to the start of the closure body code. The pilot compiler detects saturated self-calls, and inserts comments in the CIL assembler to mark them but does not do any optimisations.

One might consider special casing the closure type according to the number of arguments taken by the closure body with the saved arguments as individual fields to avoid packing and unpacking.

One might attempt to do a limited type inference of closures at compile time by examining the body to see if one can determine that the return value will always be a specific run-time type (int, string or block) and similarly for the arguments to determine a call signature, e.g. (int, int)àint. I do not propose to do comprehensive type inference, just some quick and dirty local examination. I think a simple, low complexity, recursive examination of closure bodies could catch a significant number of such restricted signatures. One might then be able to easily debox the return value, depending on the precise closure implementation. Deboxing the arguments would be harder and not possible unless one uses some special casing of closures that do not store the arguments in lists.

 In cases, where all the possible uses of a closure can be determined, one might also use the ways a closure is used to restrict types as in “let fun id x = x in id y +1 end”. I am not sure how complex such inference would be.

In [PMB] closure specialisation for direct calls of manifest functions was done using some runtime  type inference and an extension of the information maintained during the lifting of closure bodies in UNdeBruijn.

One might use varargs instead of MLArgLists for packaging a variable number of arguments in a list in the calling conventions used. The JIT will surely be able to generate more efficient packing/unpacking of a varargs list than the explicit linked lists. This idea looks most promising in a special-casing scenario where MLArgLists are used less.

3.10.4 The effects of just in time compilation in CLR

[Gough] has a discussion at the end of chapter 9 one should have in mind when considering optimisations at the IL level. The question is to what extent does classical optimisations on the IL level just change the work that the JIT has to do or even make it more difficult? No comprehensive answer to this question is available.

One particular instance of this question of potential relevance for the current pilot concerns its use of local scratch variables. These are variables that are currently used for a short stretch of instructions inside macro instructions e.g. for compensating the lack of  a swap instruction when the arguments to a object constructor has been created in the wrong order on the stack (the creation order being given by SML semantics). Currently every method generated by the compiler has a fixed selection of such scratch variables of various types – in numbers that cover all situations. Using such method global local variables indicates to the JIT that the corresponding slot may be reused. One could also declare new local scratch variables each place a new independent one is needed signalling the lifetime of the variables to the JIT. I do not see which alternative is best. Possibly the JIT will be able to infer the necessary information from the usage pattern anyway. Some benchmarking of local variable strategies is described in [MSO], where the conclusion is that the JIT seems to be clever enough that the strategy used in th CIL code is not so important.

3.11Towards a port of Moscow ML to CLR

All cases of the old compileExp has been carried over to the new one following the strategy of section 2.3 i.e. as the result of a test case. This means that one should consider the pilot compiler quite complete. A lot of bugs and misunderstanding must be expected to be present, though.

The experience from working a little with a small subset of the standard library implementation was that a straightforward port to CLR was quite routine work. But a short look at the internals of some of the IO library modules shows that those will need some more real design work to be ported because the IO system of CaML Light does not resemble the IO system of CLR very much. Blindly implementing the primitives of the former on CLR would not be a good idea.

Testing design decisions will be easier having an interactive system running on CLR so one should try to get there quickly. Since every new top-level declaration evaluated by the interactive system must be compiled and then loaded and executed, the interactive system must have substituted CLR runtime code generation (RTCG) for the current IL assembler loop to get acceptable response times.

A quick way to support RTCG for a compiler running on CLR would be to make a C# method in MosMLRT.Stdlib for each Cil_inst macroinstruction by transforming the corresponding case in Cil_ephr to a sequence of RTCG method calls. The method would be available for a new Cil_ephr via a zillion prim_val declarations.

One should be able to bootstrap the compiler on CLR before attempting to compile the interactive system.

 

The following strategy for extending the current work to a CLR back end for Moscow ML can thus be proposed:

Extend the standard library support to the parts needed for bootstrapping. IO seems to be the most likely part to be non-routine.

Bootstrap the compiler.

Develop a RTCG library module to replace the [Gough][Gough]current CIL assembler emitter.

Implement any missing standard library support needed for the interactive system.

Implement the interactive system.

Start experiments to improve performance.

Implement the rest of the standard library.

Make sure ML system will run on other important CLR platforms.

Consider how to make the new Moscow ML compliant with the Common Language Specification.

 

4         User’s guide to the pilot compiler

4.1      To install the pilot, compile and run a SML program

To try the pilot compiler you will need to run one of the Microsoft Windows platforms, where the Microsoft .NET SDK is supported and have that SDK installed. I assume below that the C# compiler, csc, the CIL assembler, ilasm, the global assembly cache utility, gacutil and the batch CIL verifier, peverify is in your path.

Get the pilot compiler from http://www.it-c.dk/~kokholm/mos.ml/pilot-31-05-2002.zip and unzip with your preferred unzip utility.

To prepare for using the runtime assembly, install it in the global assembly cache by cd’ing to the “compiler” subdirectory op the top directory of the unpacked distribution and running “gacutil /i MosMLRT.dll”. Alternatively, copy the MosMLRT.dll assembly to any directory from which you want to run compiled programs.

The following script show how you could compile and run the T_handle test case starting in the top level directory of the distribution:

 

C:\home\kokholm\dotml>cd compiler\ciltest

C:\home\kokholm\dotml\compiler\ciltest>..\mosmlcmp -P none T_handle.sml

File "T_handle.sml", line 20-23, characters 8-133:

!       ......."a" => (5555555 * 5555555; "bignum")

!            | "b" => raise Fail "Fejl"

!            | "c" => raise Subscript

!            | "d" => "noerror".

! Warning: pattern matching is not exhaustive

 

C:\home\kokholm\dotml\compiler\ciltest>ilasm /quiet T_handle.il

 

Assembling 'T_handle.il' , no listing file, to EXE --> 'T_handle.EXE'

Source file is ANSI

 

Operation completed successfully

C:\home\kokholm\dotml\compiler\ciltest>T_handle

67

308608025

67

17

(#a=a# (#b=b# (#c=c# (#noerror# (#e<>a,b,c# .)))))

1010101

105

105

39

(otte# 321 ulla3)

! Uncaught exception:

! eFoo hejsa

 

If you do not include “–P none” in the call of the compiler it will probably exit because it cannot find Misc.ui to preload Misc. If you include the flag –lam_debug, the compiler will show the Lambda expressions as they are compiled. You can access the partial standard library during compilation by adding a flag “-stdlib <path to standard library>” to the command line; in the situation of the example above “-stdlib c:\home\kokholm\dotml\mosmllib”. Then you do not need “‑P none”, but you will have to define an environment variable DOTMLLIB with the value “.:<path to standard library>” (without the quotes, the separator is a colon) in order to execute the compiled code.

If you are using a cygwin shell to run the above you may want to use the run_mosmlc script in the compiler directory to run both stages and the verifier in one command.

The compiler has been compiled with filename handling for unix systems. I have only tested with filenames starting with an uppercase letter and all other letters lower case and I recommend following that convention.

4.2      To modify, recompile and test the compiler and library

The compiler and library are built using make. You will need Moscow ML to compile the compiler. There are make files in the compiler directory for cygwin make (Makefile) and Microsoft nmake (Makefile.w32). If you make clean, you will need perl to recreate a couple of files. There is only a cygwin makefile in the mosmllib directory.

The regression tests are in the ciltest subdirectories of the compiler and mosmllib directories. The regression tests are performed by perl scripts by the name regress, that must have access to utilities like diff to run.

5         Technical details

5.1      Structure of pilot compiler

The source of the pilot compiler has been created by initially making a copy of Moscow ML source code, deleting some parts that were not needed. Thus the directory structure matches the Moscow ML structure closely.

The pilot compiler has three phases:

1.      mosmlcmp.exe, compiled with Moscow ML from the modified compiler sources compiles the input file to a CIL assembler file with extension il.

2.      The ilasm assembler program (from .NET SDK) compiles the il file to an EXE file.

3.      The exe file is verified with the peverify program (also from .NET SDK)

The finished executable must have acces to the MosMLRT.dll runtime assembly in the same directory or installed in the global assembly cache.  Any SML unit dependency must be found in the path given by the DOTMLLIB environment variable.

5.2      Source changes overview

Changed/new SML files overview with comments on differences with the Moscow ML version:

File name

Comment

Cil_back.sml

Cil_back.sig

Replacement for Back.sml, reusing much of that. Mostly changes in compileExp, but preserveing the overall structure of that function.

Cil_ephr.sml

Replacement for Emit_ephr.sml and Emitcode.sml.

Defines start_emit_phrase and end_emit_phrase functions that writes the header and trailer of the CIL assembler file.

Defines emit_phrase that translates the Cil_inst macro instruction stream to CIL assembler instructions.

Cil_inst.sml

Replacement for Instruct.sml,

Defines the symbolic CIL macro instruction set used in the code stream generated by Cil_back.compileLambda .

Cil_glob.sml

New. A few values shared among the other Cil_*.sml files

Compiler.sml

Changed reference from Back to Cil_back and byte code file extension from uo to il. Added a flag to signal the start of the end_emit phase to Cil_back; the flag needed to compile the phrases created by Sigmtch correctly.

Sigmtch.sml

Changed references from Back and Emit_phr to Cil_back and Cil_ephr.

Lambda.sml

Changed reference from Instruct to Cil_inst.

Mainc.sml

Added a command line flag to display Lambda code during compilation.

Prim.sml

Added an int to the Pclosure constructor: the number of arguments taken by the closure body, to communicate that information from the UNdeBruijn splitting to the compilation of the closure creation expression compilation in compexp.

Pr_lam.sml

Change of Lambda expression printing due to the change in Prim.sml

Types.sml

Commented out a “raise Toplevel” to enable compilation of a test case, with

Value polymorphism at top level (compiler/pstest/Inductive.sml).

5.3      Examples of compilation of expressions and code generation

I will give three examples from Cil_back and Cil_ephr of how Lambda expressions are compiled as cases in the compexp local function in compileExp and further emitted to CIL assembler. The compexp function has been called with 5 arguments: compexp cilstk sz dp exp C:

 

cilstk

The list of value types on the stack when evaluation of exp starts

sz dp

2 ints describing the state of the Lambda expression virtual stack

exp

The expression to be compiled

C

The code continuation – a list of CIL macro instructions

 

The compexp is a large case statement on the value of exp.

5.3.1      Accessing field of BLOCK

Accessing the i’th element of the sequence of a BLOCK, e.g. the i’th field of a tuple, is compiled in Cil_back.sml by the case

   | Lprim (Pfield i, [e1]) =>

           (noteDepth (cilstk,1);

            compexp cilstk sz dp e1 (Cgetfield i :: C))

The noteDepth utility function registers that the CLR stack (due to the value of e1) will be 1 element larger at some point of the evaluation than at the start, but not more at positions in the code visible at this point. The evaluation of e1 may internally demand even more stack size, but that will be noted in the compilation of e1. This stack size  information is used for computing a value for the mandatory .maxstack CIL directive for every method.

The case then puts a Cgetfield macroinstruction in front of the continuation and then compiles e1 with the new continuation and with the same stack type description, cilstk, as the whole exp.

The Cgetfield macroinstruction assumes the value of e1 is on the top of the CLR stack. The value will be a BLOCK by SML type safety, but the evaluation of e1 may have left something the verifier can only infer will be of type Value so the emitted instruction stream starts with a castclass instruction to BLOCK. The next instruction loads (pushes) the Value[] val field of the BLOCK on top of stack to the stack, removing the BLOCK. Then the array index is loaded on stack and that element of the val array is loaded, removing the array reference and index. The net effect is to pop an element from the stack and push a Value object. From Cil_ephr.sml:

      | Cgetfield i :: C =>

        (output (os,"castclass\t[MosMLRT]MosML.BLOCK\n\

                    \ldfld\tclass [MosMLRT]MosML.Value[] [MosMLRT]MosML.BLOCK::val\n\

                  \ldc.i4\t"^myitos(i)^"\n\

                  \ldelem.ref\n");

         emit C)

Note the verbosity with the types and field refs spelled out in all details including having to spell out the fact that [MosMLRT]MosML.Value denotes a class. This is by demand of the CIL assembler and makes the intermediate il files even huger. The myitos function just translates an SML int to a C#-string. Note that this code will actually expand the stack on element more than noted above, but the expansion will be confined to the interior of this instruction sequence and we can take care of all such local expansions by adding a constant (currently 3) to the maxstack value computed by the noteDepth calls. The .maxstack CIL directive only has to give an upper bound on the stack size.

5.3.2      Integer addition

The case for adding the values of two expressions, (Cil_back.sml:)

      | Lprim (Psmladdint, [e1,e2]) =>

           (noteDepth (cilstk,2);

            compIntArithmetic cilstk Cadd  sz dp e1 e2 C)

Simply calls compIntArithmetic defined elsewhere by

   and compIntArithmetic cilstk inst sz dp e1 e2 C =

    let val C2 = case C of

                Ccvt (cilValue, cilInt) :: C1 => C1

              | _ => Cnewint :: C

               in

                        compexp cilstk sz dp e1

                        (Ccvt (cilValue, cilInt) ::

                         compexp (cilInt::cilstk) sz dp e2

                         (Ccvt (cilValue, cilInt) :: inst :: C2))

               end

Which first checks if the code continuation starts with an unboxing operation, Ccvt. If so, the unboxing operation is removed, else a boxing operation is inserted. Now we have a new continuation C2. The two expressions are compiled with code continuations starting with unboxing operations, e2 with a cilInt type (from e1) added to the stack description. The code in Cil_ephr for outputting CIL code for the Ccvt and Cadd symbolic instructions are quite simple:

                 | Cadd :: C =>

                        (output (os,"add.ovf\n");

                         emit C)

                 | Ccvt (cilValue, cilInt)  :: C =>

                        (output (os,"castclass\t[MosMLRT]MosML.MLInt\n\

                                    \ldfld\tint32 [MosMLRT]MosML.MLInt::val\n") ;

                         emit C)

The add.ovf is a checked integer addition. The unboxing must use a castclass instruction to keep the CIL verifier happy and then just extracts the val field of the MLInt object.

5.3.3      Closure application

A closure application, applying the (closure) value result of evaluating the expression f to the values given by evaluating the list of expressions args is compiled by the following case in Cil_back (changed slightly by removing a few things irrelevant for this discussion and let binding some inermediate results):

     | Lapply (f,args) =>

let val nargs = List.length args

               val C1 = Ccall nargs :: C

               val newcilstk = cilValueList nargs cilstk

               val C2 = compexp newcilstk sz dp f C1

           in (noteDepth (cilstk,nargs+1);

               compExpList cilstk sz dp args C2

           end

The new code continuation C1 just is C with a Ccall macroinstruction prepended. This is the most complex macroinstruction. It packages the argument values left on the stack by the code compiled from args in an MLArgList, gets the delegate from the CLOSURE left on the stack by the code compiled from f  and then invokes the delegate. But over saturation of the call and the potential for issuing a tail call instruction is also handled by the Ccall emitted code.

Then f is compiled with continuation C1 with a stack description with nargs cilValue types added to the original (by the cilValuelist function). Finally the arguments are compiled one by one by the compExpList utility function defined elsewhere. The noteDepth function registers that the stack depth will be up to nargs+1 places larger than the initial stack at the boundaries visible in this code.

I will not show the CIL assembler code produced by Ccall here but refer to the source code or the inermediate il files generated by the compiler.

5.4      Closure details

The details of how closures work is exposed at three types of sites: where the closure is created, where it is called and inside the compiled body code is run. I will explain the function of the CIL assembler by a mock-up written in C#. A full mock-up example is in UnitTmpl.cs in the compiler directory of the source.

 

The initial construction of a closure (the compilation of a Lprim(Pclosure…)) expression does this: Build the list of free variables and save them with an empty list of supplied arguments in an object that will become the private object of a delegate pointing to the compiled closure body method:

           //bundle the free variables to be stored in the closure in a list

           //together with an empty list of arguments supplied so far:

           MlargList free = new MLArgList(free_1);

           free = new MLArgList(free_2, free);

            //…

           free = new MLArgList(free_nfree, free);

           clo initialargs = new clo(free, null);

           CLOSURE f = new CLOSURE(new MLFunction(initialargs.cl_TMPL), missing);

The call site will in principle just do   

             //package arguments supplied here in args list of length numargs

            MLArgList tail = new MLArgList(arg_1);

            MLArglist args = tail

            args = new MLArgList(arg_2, args);

             //…

           args = new MLArgList(arg_numargs, args);

           //

//and call delegate. “tail” is the last element of args

             f.func(f.missing-numargs, tail, args);

The prologue of a closure method shows how under-saturation is handled by concatenating the lists of supplied variables to create a new base object for a delegate pointing to the same closure body method.

class clo : MLArgs {

   public Value cl_TMPL(int missing,MLArgList tail,MLArgList args){

      tail.prev = args_sofar; //Concatenate the lists! tail.prev used to be null

      if (missing > 0) {

           //build new closure and return it

           clo newargs = new clo(free, args);

           return new CLOSURE(new MLFunction(newargs.cl_TMPL), missing);

      }

      else {

         //unpack the free vars and args to the local variables

         //               expected by the compiled closure body

         Value free_nfree = free.arg;

         Free= free.prev;

         //…

         Value free_1 = free.arg   

         //

         Value arg_nargs = args.arg;

         args= args.prev;

         //…

         Value arg_1 = args.arg    

         //execute the compiled closure body and return its value

         //

      }

}

The call site will actually handle the possibility of an over-saturated call.  Instead of just invoking f.func as shown above it executes CIL instructions equivalent to a call

           call_closure(f, f.missing-numargs, tail, args)

where call_closure is

       public static Value call_closure(CLOSURE f, int numargs, MLArgList tail , MLArgList args) {

           if (f.missing >= numargs) //”Normal, not oversaturated”

             //Tail call,

             return f.func(f.missing-numargs, tail, args);

           //The over saturated case:

//supply as many args as possible to the closure

//call and repeat with the closure returned and

//the rest of the arguments          :

           int missing = f.missing - numargs;

           Value result = f;

           while (missing<0) {

             CLOSURE closure_to_call = (CLOSURE)result;

             int surplus = -missing-closure_to_call.missing;

             MLArgList newtail = args;

             for (int i=1; i<surplus;i++)

               newtail = newtail.prev;

             if (surplus == 0)

               //Tail call

               return closure_to_call.func(0, tail, newtail.prev);

             result = closure_to_call.func(0, tail, newtail.prev);

             missing = -surplus;

             tail = newtail;

           }

       }

The call_closure code has to be inlined as assembler in practise to insert tail instructions where necessary.

5.5      Exception handling example

The following CIL example explains the code emitted from compiling the expression a+(b*c handle _ => 13), assuming a,b,c are let bound as variables number 1 to 3:

First a is pushed and unboxed

ldloc.s V_1

castclass       [MosMLRT]MosML.MLInt

ldfld   int32 [MosMLRT]MosML.MLInt::val

 

A local variable to save the value on stack is declared and the value saved

{.locals( int32 HANDLE_1_0)

 stloc.s HANDLE_1_0

 

Stack is empty, we can enter .try block, compute b*c, store result in local variable SCRATCH_VALUE, and branch to label DOTML_1:

 .try {

  ldloc.s V_2

  castclass       [MosMLRT]MosML.MLInt

  ldfld   int32 [MosMLRT]MosML.MLInt::val

  ldloc.s V_3

  castclass       [MosMLRT]MosML.MLInt

  ldfld   int32 [MosMLRT]MosML.MLInt::val

  mul.ovf

  newobj  instance void [MosMLRT]MosML.MLInt::.ctor(int32)

  stloc.s SCRATCH_VALUE

  leave   DOTML_1}

 

Handle any exception, rethrow if uncatchable in ML, else put 13 in SCRATCH_VALUE:

 catch [mscorlib]System.Exception {

  call class [MosMLRT]MosML.BLOCK [MosMLRT]MosML.MLException::'filter'(class

                                                [mscorlib]System.Exception e)

  dup

  brtrue DOTMLLOC_1

  rethrow

  DOTMLLOC_1:

  stloc.s V_4

  ldc.i4  13

  newobj  instance void [MosMLRT]MosML.MLInt::.ctor(int32)

  stloc.s SCRATCH_VALUE

  leave   DOTML_1}

 

Restore the saved stack value:

 DOTML_1:

 ldloc.s HANDLE_1_0

}

 

Push the value of the handled expression and perform the addition.

ldloc.s SCRATCH_VALUE

castclass       [MosMLRT]MosML.MLInt

ldfld   int32 [MosMLRT]MosML.MLInt::val

add.ovf

newobj  instance void [MosMLRT]MosML.MLInt::.ctor(int32)

6         Testing

6.1      Regression tests

During the development of the pilot compiler, test cases played an all-important role. First test input files was created for discovering what Lambda expressions were generated by specific SML expressions. Later, when the compiler started producing executable code, the correct output from the test input files were saved and a regession test script was written that would compile and run all test input files and compare the output with the expected. The tests are divided into files according to their main area of interest, but common types like lists are used and therefore tested everywhere.

The regression script was run frequently, normally after every small adjustments of the compiler, to discover immediately if a change to support some additional feature or repair a bug had adverse effects on functionality already implemented. Whenever a test for a new feature provoked a bug in some area already considered finished, a dedicated test case for that bug was devised (and the problem resolved).

In addition to covering old bugs, the selection of expressions in test cases has been selected in an attempt to be complete with respect to the tested SML features, but has been somewhat limited by my SML experience. Fortunately the supervisors supplied extra, nasty test cases or ideas to stretch the limits of the compiler. I tend to believe the high bug hit rate of their ideas were due to their cleverness and not due to flakiness of the compiler.

The way the compileExp function works together with optimisation by using code continuation matching, the effects of UNdeBruijn lifting and optimisations of case/switch expressions based on the distribution of cases is sufficiently complex that it can be hard to predict which case and code path will be emitted and exercised by a particular input program. I considered adding macroinstructions inserting trace instructions in the generated code everywhere in compileExp to be able to check the internal completeness of the tests, but decided that it would be bad at that stage to completely clutter the compileExp code that way.

In the cases in compileExp potentially exercised by the compilation of conditional expressions (if-then-else, andalso, orelse, case/match) I have inserted comments in the emitted code so I could at least be sure that the test cases would cover the compileExp with respect to code generation.

6.2      Benchmarks

The following table summarizes some benchmarks run to compare the performance of code compiled with the pilot compiler with code compiled by Moscow ML. The tests were run on a 466 MHz Pentium Celeron pc with 256 MB RAM and running Windows XP. All CLR assemblies were built with no debugging symbols – running the C# compiler and CIL assembler with the /debug flag typically imposed a 2.5 times performance penalty on the resulting assembly.

The first column indicates where the benchmark SML source program is located. The second column my guess at what feature is particularly benchmarked in that case. Garbage collection will be exercised in all cases, except the one marked nogc. The next two columns are running times in seconds (typical result of 5 tries) and finally the ratio of run times, where relevant. Times were measured by running the programs in a Cygwin bash shell using the standard time command.

 

 

The Bench test is shown here because it was the first program used for comparing the performance of an early version of the pilot compiler to Moscow ML and was used continuously to check that changes in the design did have some positive effect or not a great negative effect on performance.

The Arith test spends most of its time performing integer arithmetic. Due to the deboxing optimisation for pure arithmetic expressions in the pilot, performance is respectable for the pilot. The slower performance of Moscow ML is because integers are represented as 31 bits in a 32 bit field, where the last bit is a flag bit that has to be masked (shifted away) during the actual computation. 

The Queens and Tak(euchi) tests closure performance but also garbage collection and Takt in particular tuples. These benchmarks gives the impression that closure performance is not so bad, possibly just half the speed of Moscow ML, but an indication that BLOCK handling might be less efficient. To factor out closures, the While* tests uses the while primitive to implement loops. The While and While2 tests build let bound tuples respectively short lists and forgets them immediately, while the While3 test build one large list and forgets it in the end. The indications are that building BLOCKs are perhaps 5 times as slow with the pilot, and garbage collecting a lot of them even slower. There are few possibilities for the user to influence the way garbage collection is performed in CLR. One may force a garbage collection at a particular time, which shouldn’t help here. One may change the gcConcurrent flag from true to false, which can be done at application granularity level in an application config file. I have tried that but saw no significant change in the results.

The Handle* cases tries to gauge the overhead for having try block, where the exception is not thrown. The Handle1 repeatedly computes 3*(4*(5*(4+(6 handle _ => 7)))) in a while loop, the Handle0 computes the same expression with 6 substituted for ( 6 handle _=>7). In the pilot, the compiler will generate code to save 4 ints in local variables and restore them again in each loop. The execution times indicates that the overhead of using handle statements for this kind of test are somewhat higher relatively in the pilot than in Moscow ML, but not terribly high in itself.

Finally, the OpenMisc test just does “open Misc”. The times for the pilot starts at over 3 seconds and after some iterations stabilize around 1.2 seconds. The test shows that one will have to have the standard library installed in the global assembly cache to obtain good start up performance for realistic programs – the other benchmarks load no libraries.

Comparing the sizes of .EXE files in the pilot version of part of the standard library to the corresponding .uo files in Moscow ML indicates a factor around 5. A lot of this will be inline closure-call handling code. A single test indicated that the pilot output files could be much more compressible than the Moscow ML ones. Both greater size and compressibility seems irrelevant for files on local disc.

I haven’t measured the performance of the compilation process itself, but there is really no comparison at this stage: Moscow ML is lightning fast, while the pilot compiler seems to take seconds on even the smallest files. This difference is natural since the pilot compiler generates a huge CIL assembler file and then starts ilasm to assemble it and finally runs the verifier. A future production compiler using RTCG instead of textual CIL assembler should be expected to be nearly the same speed as (current) Moscow ML – depending on the efficiency of the RTCG library.

6.3      Testing on other CLR platforms

I have tried the pilot with Mono on Windows XP and Microsoft SSCLI on FreeBSD. I have not experimented with DotGNU, and according to the status description on the website as of end of May 2002, tail calls and delegates are not implemented in DotGNU at this point in time, so it would have failed any test involving functions.

6.3.1      Mono:

Tried version: binary kit for MS Windows downloaded 23-04-2002.

1.      There is no CIL assembler for (this version of) mono, so I settled for trying to run the already compiled test cases on the mono runtime (the jit version).

2.      The reflection and some other needed standard libraries were missing, so loading of other SML units could not be done.

3.      This version of mono doesn’t support the “tail.” instruction prefix, so I had to remove those prefixes from the intermediate il files. After that, all compiler tests succeeded, except for the test of correct tail recursion.

4.      A few attempts at performance testing were not very successful: I couldn’t run the tests that use (tail) recursion for looping since the programs would die in a stack overflow. Running the While2 test didn’t succeed because the process grew out of virtual memory on my machine (around 700 MB) – running on MS CLR the process grows to around 5 MB in size. This seems to indicate a problem with garbage collection on the tested version of mono. The While3 test was more successful and ran at twice the speed of the MS CLR!

5.      Under some circumstances exceptions that should have been caught by the try/catch in the Main method were not caught. In particular, the stack overflow because of the missing “tail.” instruction prefix resulted in mono aborting without any message.

6.      A single small syntax quirk was discovered: MS CLR accepts array creation (with equal semantics) with CIL assembler either “newarr Value” or “newarr class Value”, but mono will not run code compiled from the latter aborting with an “implement me error message”.

6.3.2      SSCLI

I tested the compiler with Microsoft’s SSCLI, a “shared-source” implementation of the CLR on FreeBSD 4.5 [MSDN], using the “free” (as opposed to checked) build. The version used was downloaded late march 2002.

I could not build the runtime assembly because of problems with the strong-name implementation, so I used a copy of the one compiled on the .NET SDK. The compiler test cases then ran correctly under the “clix” runtime system except for some of the expressions in T_fun. The problems with T_fun were not resolved due to time restraints.

7         Conclusion

A design and pilot implementation of a CLR backend for the compiler in Moscow ML has been developed.

Tests indicates that the design is sufficiently comprehensive and efficient to be the basis of a usable SML compiler targeting CLR that will be able to bootstrap itself and compile an interactive SML system on CLR.

Some areas for improving the current design have been indicated.

 

The following strategy for extending the current work to a CLR back end for Moscow ML can be proposed:

Extend the standard library support to the parts needed for bootstrapping. IO seems to be the most likely part to be non-routine.

Bootstrap the compiler.

Develop a RTCG library module to replace the current CIL assembler emitter.

Implement any missing standard library support needed for the interactive system.

Implement the interactive system.

Start experiments to improve performance.

Implement the rest of the standard library.

Make sure ML system will run on other important CLR platforms.

Consider how to make the new Moscow ML compliant with the Common Language Specification.


 

8         References

[CLRS]

Cormen, Leiserson, Rivest and Shamir: “Introduction to algorithms”, 2.ed. MIT press 2001

[DotGNU] 

http://www.southern-storm.com.au/portable_net.html

[ECMA]

http://www.ecma.ch/ecma1/STAND/ecma-335.htm

[Gough]

John Gough, “Compiling for the .NET Common Language Runtime (CLR)”, Prentice Hall 2002

[Java]

http://java.sun.com/

[Lidin]

Serge Lidin, “Inside Microsoft .NET IL assembler”, Microsoft press 2002

[Mono]

http://www.go-mono.org/

[MosML]   

http://www.dina.kvl.dk/~sestoft/mosml.html

[MSDN]

http://msdn.microsoft.com/net

[MSO]

Morten Sylvest Olsen, “A pizza compiler for .NET”, Master’s Thesis DTU 2002

[PFO]

http://www.it-c.dk/courses/PFOO/F2002/index.html

[PMB]

Peter M. Berthelsen, “Compiling SML to Java bytecode”, Master’s Thesis, DTU 1998

 

9         List of code listing appendices

1.      Cil_back.sml

2.      Cil_ephr.sml

3.      Cil_inst.sml

4.      Cil_glob.sml

5.      MosMLRT.cs

6.      UnitTmpl.cs

7.      C# test cases

8.      SML test cases



[1] See the Moscow ML web site for references to material on Standard ML.

[2] The original acronym was MSIL, newer Microsoft documentation may use IL. I follow [Gough] in using CIL.

[3] Invalid CIL cannot be executed and so is less interesting.

[4] The byte code file will be missing if one is only compiling a SML signature. I will ignore this case in the following.