Note 3, Programming Language Concepts (sestoft@dina.kvl.dk) 2002-02-20 *
----------------------------------------------------------------------

Until now, we have written programs in abstract syntax, which is
convenient when handling programs as data. However, programs are
usually written in concrete syntax, as sequences of characters in a
text file. So how do we get from concrete syntax to abstract syntax?

First of all, we must give a concrete syntax describing the structure
of well-formed programs.

We use regular expressions to describe local structure, that is, small
things such as names, constants, and operators.

We use context free grammars to describe global structure, that is,
statements, the proper nesting of parentheses within parentheses, and
(in Java) of methods within classes, etc.

Local structure is often called lexical structure, and global
structure is called syntactic or grammatical structure.


Recommended preparatory reading
-------------------------------

Read parts of Torben Mogensen: Basics of Compiler Design (DIKU,
University of Copenhagen, June 2001):

 * Sections 2.1 to 2.5 about regular expressions and non-deterministic
   finite automata. A lexer generator such as mosmllex turns a
   regular expression into a non-deterministic finite automaton, then
   creates a deterministic finite automaton from that.

 * Sections 3.1 to 3.5 about context-free grammars and syntax
   analysis.

 * Section 3.15 about using LR parser generators. An LR parser
   generator such as mosmlyac turns a context-free grammar into an LR
   parser. This may make little sense until we have discussed a
   concrete example application of an LR parser generator in the
   lecture.


Scanners, lexers, parsers, scanner generators, and parser generators
--------------------------------------------------------------------

A scanner (or lexer) is a program that reads characters from a text
file and assembles them into a stream of lexical tokens (or lexemes).
A scanner usually ignores the amount of whitespace (blanks " ",
newlines "\n", carriage returns "\r", tabulation characters "\t", and
page breaks "\f") between non-blank symbols.

A parser is a program that accepts a stream of lexical tokens from a
scanner, and builds an abstract syntax tree representing that stream
of tokens.

A scanner generator is a program that converts a collection of regular
expressions into a scanner (which recognizes tokens described by the
regular expressions).

A parser generator is a program that converts a parser specification
(a decorated context free grammar) into a parser. The parser,
together with a suitable scanner, recognizes program texts derivable
from the grammar. The decorations on the grammar say how a text
derivable from a given production should be represented as an abstract
syntax tree.

We shall use the scanner generator mosmllex and the parser generator
mosmlyac. The classical scanner and parser generators for C are
called lex and yacc (Bell Labs, 1975). The modern powerful GNU
versions are called flex and bison; they are part of all Linux
distributions. There are also free lexer and parser generators for
Java, for instance JLex and JavaCup (available from Princeton
University), or JavaCC (lexer and parser generator in one, available
from Webgain). For C#, there is a free parser generator CoCo/R from
the University of Linz, but that is for LL-grammars, not LR-grammars
like the tools above. A set of C# compiler tools is available from
University of Paisley, http://cis.paisley.ac.uk/crow-ci0/.

The parsers we are considering here are called bottom-up parsers, or
LR parsers, and they are characterized by reading characters from the
Left and making derivations from the Right-most nonterminal.
Hand-written parsers (including those built using so-called parser
combinators in functional languages) are usually top-down parsers, or
LL parsers, which read characters from the Left and make derivations
from the Left-most nonterminal.
For an introductory presentation of hand-made top-down parsers in
Java, see Grammars and Parsing with Java
(http://www.dina.kvl.dk/~sestoft/programmering/parsernotes.pdf).


Regular expressions in lexer specifications
-------------------------------------------

A regular expression r in a mosmllex lexer specification is

   a character `a` or `\n` or ...
   a character range [`a`-`f`]
   a character set [`0`-`9` `a`-`f`]
   a concatenation r1 r2 of regular expressions
   a closure r* of a regular expression (zero or more iterations)
   a positive closure r+ of a regular expression (one or more iterations)
   a choice r1|r2 between regular expressions

The tokens or lexemes of our example expression language may look like
this:

   The keywords:
      LET     let
      IN      in
      END     end

   The special symbols:
      PLUS    +
      TIMES   *
      MINUS   -
      EQ      =
      LPAR    (
      RPAR    )

   An integer constant INT is a non-empty sequence of the digits 0-9:
      [`0`-`9`]+

   A variable NAME begins with a lowercase (a-z) or uppercase (A-Z)
   letter, continues with zero or more letters or digits (and is not
   a keyword):
      [`a`-`z``A`-`Z`][`a`-`z``A`-`Z``0`-`9`]*


Context free grammars in parser specifications
----------------------------------------------

A context free grammar consists of

 * terminal symbols (identifiers x, integer constants 12, string
   constants "foo", special symbols + and * etc, keywords let, in, ...)
 * nonterminal symbols A (denoting grammar classes)
 * a start symbol S
 * rules (or productions) of the form
      A ::= tnseq
   where tnseq is a sequence of terminal or nonterminal symbols

The grammar for our expression language may be given as follows:

   Expr ::= NAME
         |  INT
         |  - INT
         |  ( Expr )
         |  let NAME = Expr in Expr end
         |  Expr * Expr
         |  Expr + Expr
         |  Expr - Expr

Usually one specifies that there must be no input left over after
parsing, by requiring that the well-formed expression is followed by
end-of-file:

   Main ::= Expr EOF

Hence we have two nonterminals (Main and Expr), of which Main is the
start symbol.
There are nine productions (eight for Expr and one for Main), and the
terminal symbols are the tokens of the lexer specification.

The grammar given above is ambiguous: a string such as 1 + 2 * 3 can
be derived in two ways:

   Expr
   -> Expr * Expr
   -> Expr + Expr * Expr
   -> 1 + 2 * 3

and

   Expr
   -> Expr + Expr
   -> Expr + Expr * Expr
   -> 1 + 2 * 3

where the former derivation corresponds to (1 + 2) * 3 and the latter
corresponds to 1 + (2 * 3). The latter is the one we want:
multiplication (*) should bind more strongly than addition (+) and
subtraction (-). With most parser generators, one can specify that
some operators should bind more strongly than others.

Also, the string 1 - 2 + 3 could be derived in two ways:

   Expr -> Expr - Expr -> Expr - Expr + Expr

and

   Expr -> Expr + Expr -> Expr - Expr + Expr

where the former derivation corresponds to 1 - (2 + 3) and the latter
corresponds to (1 - 2) + 3. Again, the latter is the one we want:
these particular arithmetic operators of the same precedence (binding
strength) should be evaluated from left to right. This is indicated
in the parser specification in file Exprpar.grm by the %left
declaration of the symbols PLUS and MINUS (and TIMES).


SML: Structures and Moscow ML compilation units
-----------------------------------------------

So far we have been working inside the Moscow ML interactive top-level
(mosml), entering type and function declarations, and evaluating
expressions.
Now we need more modularity in our programs, so we shall declare the
expression language abstract syntax inside a separate file called
Absyn.sml:

------------------------------------------------------------
(* Abstract syntax for the simple expression language *)

datatype expr =
    CstI of int
  | Var of string
  | Let of string * expr * expr
  | Prim of string * expr list
------------------------------------------------------------

This file may be compiled in isolation using the Moscow ML batch
compiler, like this:

   mosmlc -c Absyn.sml

This creates two files: a bytecode file Absyn.uo and an interface file
Absyn.ui (containing type information only). The joint contents of
these files correspond to that of a Java .class file.

Now the types (and functions and variables, if there were any)
declared in Absyn may be used in the mosml interactive top-level by
evaluating:

   load "Absyn";

Then you can refer to the type Absyn.expr, the constructors
Absyn.CstI, Absyn.Var, etc., using the structure name Absyn as
qualifier. Or you can

   load "Absyn";
   open Absyn;

and then you can refer, without the qualifier, to expr, CstI, Var,
etc.

Evaluating load "Absyn" is necessary only in interactive mosml
sessions. The batch compiler mosmlc, by contrast, automatically
accesses any needed structures, and it is unnecessary and illegal to
use load "Absyn" or similar in structures compiled with mosmlc.


Lexer and parser specifications for the expression language
-----------------------------------------------------------

A complete parser specification for the simple expression language is
found in file expr/Exprpar.grm. Running

   mosmlyac -v Exprpar.grm

generates a parser as an SML program in file Exprpar.sml, and its
signature in file Exprpar.sig (corresponding to a Java interface).
These files must be compiled using

   mosmlc -c -liberal Exprpar.sig Exprpar.sml

A complete lexer specification for the simple expression language is
found in file expr/Exprlex.lex.
Running

   mosmllex Exprlex.lex

generates a lexer as an SML program in file Exprlex.sml. This file
must be compiled using

   mosmlc -c Exprlex.sml

Since the parser specification defines the token datatype, which is
used by the lexer, the parser must be generated and compiled before
the lexer is compiled. Thus run mosmlyac and compile the parser
before you compile the lexer.

In summary, to generate the lexer and parser, and compile them
together with the abstract syntax, do the following (in the directory
where you have put the files Exprpar.grm etc):

   mosmlc -c Absyn.sml
   mosmlyac -v Exprpar.grm
   mosmlc -c -liberal Exprpar.sig Exprpar.sml
   mosmllex Exprlex.lex
   mosmlc -c Exprlex.sml
   mosml -P full parse.sml

The file parse.sml defines a function parse : string -> expr that
combines the generated lexer function Exprlex.Token and the generated
parser function Exprpar.Main:

   fun parse str =
       let val lexbuf = Lexing.createLexerString str
           val expr = Exprpar.Main Exprlex.Token lexbuf
       in
           Parsing.clearParser();
           expr
       end
       handle exn => (Parsing.clearParser(); raise exn);

The function creates a lexer buffer from the string, and then calls
the parser's Main entry function Exprpar.Main to parse an expr from
the lexbuf, using the lexer's tokenizer Exprlex.Token to read from the
lexbuf. If the parsing succeeds, it clears the parser state and
returns the expr; if parsing fails with an exception, it clears the
parser state and re-raises the exception.


The Exprpar.output file generated by mosmlyac
---------------------------------------------

Calling mosmlyac with option -v causes it to produce a file
Exprpar.output which describes the parser generated from the parser
specification. The parser is a stack (or pushdown) automaton, which
is a finite automaton equipped with a stack of the states and grammar
symbols seen so far. The Exprpar.output file has two parts.
The first part of Exprpar.output is a listing of the parser
specification, in which the grammar rules have been numbered for
reference (and some administrative rules, here 0 and 10, have been
added):

------------------------------------------------------------
   0  $accept : %entry% $end

   1  Main : Expr EOF

   2  Expr : NAME
   3       | CSTINT
   4       | MINUS CSTINT
   5       | LPAR Expr RPAR
   6       | LET NAME EQ Expr IN Expr END
   7       | Expr TIMES Expr
   8       | Expr PLUS Expr
   9       | Expr MINUS Expr

  10  %entry% : '\001' Main
------------------------------------------------------------

The remainder of Exprpar.output is a description of the states of the
finite stack automaton. The automaton states are numbered (0 to 25 in
this case) but their numbers have nothing to do with the grammar rule
numbers above.

For each numbered automaton state, two pieces of information are
given: the corresponding parsing state (as a set of so-called
LR(0)-items), and the transition relation.

For an example, consider state 4. In state 4 we are trying to parse
an Expr, we have seen the keyword `let', and we now expect to parse
the remainder of the let-expression. This is shown by the dot (.),
which describes the current position of the parser inside a phrase.
The transition relation says that if the remaining input does begin
with a NAME, we should read (shift) it and go to state 10; otherwise
there is an error:

------------------------------------------------------------
state 4
        Expr : LET . NAME EQ Expr IN Expr END  (6)

        NAME  shift 10
        .  error
------------------------------------------------------------

For another example, consider state 5. According to the parsing
state, we are trying to parse an Expr, we have seen a left
parenthesis, and we now expect to parse an Expr and then a right
parenthesis. According to the transition relation, the input must
begin with an integer constant, or the keyword `let', or a left
parenthesis, or a minus sign, or a name.
If we see one of these symbols, we read it and go to state 3, 4, 5, 6,
or 7, respectively. When (later) we have completed parsing the Expr,
we go to state 11:

------------------------------------------------------------
state 5
        Expr : LPAR . Expr RPAR  (5)

        CSTINT  shift 3
        LET  shift 4
        LPAR  shift 5
        MINUS  shift 6
        NAME  shift 7
        .  error

        Expr  goto 11
------------------------------------------------------------

For yet another example, consider state 20. Here we have seen Expr
PLUS Expr and expect to see one of the operators TIMES, PLUS or MINUS,
or one of the keywords `in' or `end', or a right parenthesis, or end
of file. If we see the TIMES operator, we read it and go to state 16,
thus having read

   Expr PLUS Expr TIMES

and expecting to read Expr, after which we will have

   Expr PLUS Expr TIMES Expr

which will then be reduced to

   Expr PLUS Expr

The upshot of this is that TIMES binds more strongly than PLUS (as we
are used to). If we see any of the other expected symbols, we reduce
by grammar rule 8, thus getting an Expr:

------------------------------------------------------------
state 20
        Expr : Expr . TIMES Expr  (7)
        Expr : Expr . PLUS Expr  (8)
        Expr : Expr PLUS Expr .  (8)
        Expr : Expr . MINUS Expr  (9)

        TIMES  shift 16
        END  reduce 8
        EOF  reduce 8
        IN  reduce 8
        MINUS  reduce 8
        PLUS  reduce 8
        RPAR  reduce 8
------------------------------------------------------------

The nonterminal symbols $accept and %entry% and the terminal symbols
'\001' and $end used in a few of the states are auxiliary symbols
introduced by the parser generator to properly handle the start and
end of the parse.

Studying the Exprpar.output file is especially useful if there are
shift/reduce or reduce/reduce conflicts in the generated parser. Such
conflicts arise because the grammar is ambiguous: some string may be
derived in more than one way.
For instance, if we remove the precedence and associativity
declarations (%left) from the tokens PLUS, MINUS and TIMES in the
parser specification Exprpar.grm, then there will be shift/reduce
conflicts in the parser. A typical conflict message has this form:

------------------------------------------------------------
20: shift/reduce conflict (shift 14, reduce 8) on MINUS
20: shift/reduce conflict (shift 15, reduce 8) on PLUS
20: shift/reduce conflict (shift 16, reduce 8) on TIMES
state 20
        Expr : Expr . TIMES Expr  (7)
        Expr : Expr . PLUS Expr  (8)
        Expr : Expr PLUS Expr .  (8)
        Expr : Expr . MINUS Expr  (9)

        MINUS  shift 14
        PLUS  shift 15
        TIMES  shift 16
        END  reduce 8
        EOF  reduce 8
        IN  reduce 8
        RPAR  reduce 8
------------------------------------------------------------

The four lines after `state 20' describe a parser state in which the
parser has recognized Expr PLUS Expr (which can be reduced to Expr),
or is about to read a TIMES or PLUS or MINUS token while recognizing
an Expr followed by that operator and another Expr.

The first line of the conflict message says that when the next token
is MINUS, for instance the MINUS found while parsing this input:

   11 + 22 - 33

then it is unclear whether we should read that token (shift and go to
state 14), or reduce Expr PLUS Expr to Expr (by grammar rule 8) before
proceeding. The former choice would make PLUS right associative, as
in (11 + (22 - 33)), and the latter would make it left associative, as
in ((11 + 22) - 33).

The state transition table line MINUS shift 14 shows that the parser
generator decided, in the absence of other information, to shift (and
go to state 14) when the next symbol is MINUS.

By declaring

   %left MINUS PLUS

in the parser specification we tell the parser generator to reduce
instead, making PLUS and MINUS left associative, and this one problem
goes away. This also solves the problem reported for PLUS in the
second line of the message.

The problem with TIMES is similar, but the desired solution is
different.
The third line of the conflict message says that when the next token
is TIMES, for instance the TIMES found while parsing this input:

   11 + 22 * 33

then it is unclear whether we should read that token (shift and go to
state 16), or reduce Expr PLUS Expr to Expr (by grammar rule 8) before
proceeding. The former choice would make TIMES bind more strongly
than PLUS, as in (11 + (22 * 33)), and the latter would make TIMES and
PLUS bind equally strongly and left associative, as in
((11 + 22) * 33).

The former choice is the one we want, so in Exprpar.grm we should
declare

   %left MINUS PLUS          /* lowest precedence  */
   %left TIMES               /* highest precedence */

Doing so makes all conflicts go away.

The parser generated by e.g. mosmlyac looks like a finite automaton
generated from a regular expression. However, instead of just a
single current state, it has a stack containing states and grammar
symbols. If there is an automaton state such as #20 on top of the
stack, then that state's transition table and the next input symbol
determine the action. The action may be shift or reduce.

For an example of a shift action, assume that the state is #20 and the
next input symbol is *, that is, TIMES. Then the action is shift #16,
which means that * is removed from the input and pushed on the stack
together with state #16.

For an example of a reduce action, assume that the state is #20 and
the next input symbol is EOF. Then the action is reduce 8, which
means that the stack is reduced by using rule number 8:

   Expr ::= Expr PLUS Expr

in reverse. That is, the grammar symbols Expr PLUS Expr and the
corresponding states are removed from the stack, and Expr is pushed
instead. After a reduce, the state below the Expr (for instance,
state #1) is inspected for a suitable goto rule (for instance, Expr
goto 9), and the new state #9 is pushed.
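
The shift and reduce actions just described can be modelled in a few
lines of SML. This is only a sketch to make the stack manipulation
concrete; it is not how mosmlyac represents its parse stack. The
names sym, item, stack, shift, gotoExpr and reduce8 are hypothetical,
and the goto table is limited to the two entries for Expr that appear
in the states discussed in this note (from #1 to #9, from #15 to #20).

```sml
(* Hypothetical model of the LR parse stack; not mosmlyac's own code. *)

datatype sym = T of string          (* terminal symbol, e.g. T "+"       *)
             | N of string          (* nonterminal symbol, e.g. N "Expr" *)

(* The stack alternates states and grammar symbols.  The top of the
   stack is the head of the list; the bottom element is state #0. *)
datatype item = State of int | Sym of sym

type stack = item list

(* shift: push the input token together with the new state *)
fun shift (tok : string) (newState : int) (stk : stack) : stack =
    State newState :: Sym (T tok) :: stk

(* Fragment of the goto table for the nonterminal Expr *)
fun gotoExpr 1  = 9
  | gotoExpr 15 = 20
  | gotoExpr s  = raise Fail ("no goto on Expr from state " ^ Int.toString s)

(* reduce by rule 8, Expr ::= Expr PLUS Expr: remove the three grammar
   symbols and their states, push Expr, then push the state found by
   looking up Expr in the goto table of the state now exposed *)
fun reduce8 (State _ :: Sym (N "Expr") ::
             State _ :: Sym (T "+") ::
             State _ :: Sym (N "Expr") ::
             (rest as (State below :: _))) : stack =
      State (gotoExpr below) :: Sym (N "Expr") :: rest
  | reduce8 _ = raise Fail "reduce 8 does not apply"

(* The stack  #0 \001 #1 Expr #9 + #15 Expr #20  written top first *)
val stk0 = [State 20, Sym (N "Expr"), State 15, Sym (T "+"),
            State 9, Sym (N "Expr"), State 1, Sym (T "\\001"), State 0]

(* Next input EOF: reduce 8, then goto #9 from the exposed state #1 *)
val stk1 = reduce8 stk0

(* Next input TIMES: shift and go to state #16 *)
val stk2 = shift "*" 16 stk0
```

Here stk1 is the stack #0 \001 #1 Expr #9, and stk2 is stk0 with * and
state #16 pushed on top.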
For a complete parsing example, consider the parser states traversed
during parsing of the string "x + 52 * wk EOF" which will be decorated
with start and end symbols as follows: \001 x + 52 * wk EOF $end.

      Input  Parse stack (top on right)                        Action
--------------------------------------------------------------------------
x+52*wk EOF  #0                                                shift #1
x+52*wk EOF  #0 \001 #1                                        shift #7
 +52*wk EOF  #0 \001 #1 x #7                                   reduce 2
 +52*wk EOF  #0 \001 #1 Expr                                   goto #9
 +52*wk EOF  #0 \001 #1 Expr #9                                shift #15
  52*wk EOF  #0 \001 #1 Expr #9 + #15                          shift #3
    *wk EOF  #0 \001 #1 Expr #9 + #15 52 #3                    reduce 3
    *wk EOF  #0 \001 #1 Expr #9 + #15 Expr                     goto #20
    *wk EOF  #0 \001 #1 Expr #9 + #15 Expr #20                 shift #16
     wk EOF  #0 \001 #1 Expr #9 + #15 Expr #20 * #16           shift #7
        EOF  #0 \001 #1 Expr #9 + #15 Expr #20 * #16 wk #7     reduce 2
        EOF  #0 \001 #1 Expr #9 + #15 Expr #20 * #16 Expr      goto #21
        EOF  #0 \001 #1 Expr #9 + #15 Expr #20 * #16 Expr #21  reduce 7
        EOF  #0 \001 #1 Expr #9 + #15 Expr                     goto #20
        EOF  #0 \001 #1 Expr #9 + #15 Expr #20                 reduce 8
        EOF  #0 \001 #1 Expr                                   goto #9
        EOF  #0 \001 #1 Expr #9                                shift #13
             #0 \001 #1 Expr #9 EOF #13                        reduce 1
             #0 \001 #1 Main                                   goto #8
             #0 \001 #1 Main #8                                reduce 10
             #0 %entry%                                        goto #2
             #0 %entry% #2                                     accept

Notation: #0, #1, ..., #25 are parser automaton states, and 0, 1, ...,
10 are grammar rule numbers.


Better reporting of lexer and parser errors
-------------------------------------------

The function parse: string -> expr previously defined gives very
little useful information in case the string to be parsed is
ill-formed. Using the Location structure from the Moscow ML Library
we can provide much better error messages.

We define an error reporting function parses : string -> expr to parse
from a string, and another one parsef: string -> expr to read and
parse a text from a file (whose name is given by that function's
argument).
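
To give an idea of what better error reporting involves, here is a
minimal self-contained sketch that turns a character offset into a
line and column and formats a message. It deliberately does not use
the Location structure, and it is not the code of parses or parsef;
the names lineCol and errorMessage are hypothetical. In a real parser
the offset would presumably come from the lexer buffer, for instance
the start position of the offending lexeme.

```sml
(* Hypothetical helpers; not part of the Moscow ML Library. *)

(* lineCol s pos: the 1-based line and column of the character at
   0-based offset pos in the string s *)
fun lineCol (s : string) (pos : int) : int * int =
    let fun loop i line col =
            if i >= pos orelse i >= size s then (line, col)
            else if String.sub (s, i) = #"\n"
                 then loop (i + 1) (line + 1) 1    (* start of a new line *)
                 else loop (i + 1) line (col + 1)
    in loop 0 1 1 end

(* errorMessage s pos msg: prefix msg with the position of offset pos *)
fun errorMessage (s : string) (pos : int) (msg : string) : string =
    let val (line, col) = lineCol s pos
    in "Line " ^ Int.toString line ^ ", column " ^ Int.toString col
       ^ ": " ^ msg
    end

(* Offset 4 is the character d, on line 2 in column 2 *)
val example = errorMessage "ab\ncd" 4 "Syntax error."
```

A handler around the call to Exprpar.Main could use such a function to
report where parsing stopped before re-raising the exception.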


Lexer and parser specifications for uSQL, simple SQL SELECT statements
----------------------------------------------------------------------

The language uSQL (micro-SQL) is a very simple subset of SQL SELECT
statements without WHERE, GROUP BY, ORDER BY etc. It permits SELECTs
on (qualified) column names, and the use of aggregate functions. For
instance:

   SELECT name, zip FROM Person
   SELECT COUNT(*) FROM Person
   SELECT * FROM Person, Zip
   SELECT Person.name, Zip.code FROM Person, Zip

The uSQL language is described in the following files:

   usql/grammar.txt    an informal description of the grammar
   usql/Absyn.sml      abstract syntax
   usql/Sqllex.lex     lexer specification
   usql/Sqlpar.grm     parser specification
   usql/parse.sml      declaration of a uSQL parser


Lexer and parser specifications for uJava, a subset of Java
-----------------------------------------------------------

The language uJava (micro-Java) is a subset of Java, and provides a
much more realistic example of lexer and parser specifications than
the expression language studied so far. If you wish, you can study
the lexer and parser specifications now, or postpone that until we
need them in a later seminar.

The uJava language is described in the following files (see seminar
8):

   oo/grammar.txt      an informal description of the grammar
   oo/Absyn.sml        abstract syntax
   oo/Oolex.lex        lexer specification
   oo/Oopar.grm        parser specification
   oo/parse.sml        declaration of a uJava parser
   oo/ex1.oo           an example uJava program
   oo/ex2.oo           another example uJava program
   oo/Util.java        utilities to run uJava programs as Java

Compile as directed in grammar.txt, then start

   mosml -P full parse.sml

and evaluate

   parsef "ex1.oo";

What is missing, of course, is an interpreter (a semantics) for the
abstract syntax of uJava. We shall return to that later.


History and literature
----------------------

Regular expressions were introduced by the mathematician Stephen Cole
Kleene in 1956.

Michael O.
Rabin and Dana Scott in 1959 gave the first algorithms for
constructing a deterministic finite automaton (DFA) from a
nondeterministic finite automaton (NFA), and for minimization of DFAs
(IBM Journal of Research and Development 3 (1959) 114-125).

Formal grammars were developed within linguistics by Noam Chomsky
around 1956, and were first used in computer science by John Backus
and Peter Naur in 1960 to describe the Algol programming language.
Their variant of the notation was subsequently called Backus-Naur Form
or BNF.

Chomsky originally devised four grammar classes, each class more
general than those below it:

   Chomsky class          Example rules     Comment
   -------------------------------------------------------------------
   0: Unrestricted        a B b ::= c       General rewrite system
   1: Context-sensitive   a B b ::= a c b   Non-abbreviating rewrite
   2: Context-free        B ::= a B b
      Some interesting subclasses of context-free grammars:
         LR(1)                              general bottom-up parsers, Earley
         LALR(1)                            bottom-up, Yacc, mosmlyac
         LL(1)                              top-down, recursive descent
   3: Regular             B ::= a | a B     Finite automata
   -------------------------------------------------------------------

The unrestricted grammars cannot be parsed in general; they are of
theoretical interest but of little practical use in computing. All
context-sensitive grammars can be parsed, but may take an excessive
amount of time and space, and so are of little practical use. The
context-free grammars are very useful in computing, in particular the
subclasses LL(1), LALR(1), and LR(1) mentioned above. Earley gave an
O(n^3) algorithm for parsing general context-free grammars in 1969.
The regular grammars describe the same languages as regular
expressions; parsing according to a regular grammar can be done in
linear time using a constant amount of memory.

Donald E. Knuth described the LR subclass of context-free grammars and
how to parse them in 1965 (On the Translation of Languages from Left
to Right, Information and Control 8 (1965) 607-639).
The influential Yacc LALR parser generator was created by S. C.
Johnson at Bell Labs in 1975.

There is a huge literature about regular expressions, automata,
grammar classes, formal languages, the associated computation models,
practical lexing and parsing, etc. Two classical textbooks are:

   Alfred V. Aho, John E. Hopcroft, Jeffrey D. Ullman: The Design and
   Analysis of Computer Algorithms, Addison-Wesley 1974.

   John E. Hopcroft, Jeffrey D. Ullman: Introduction to Automata
   Theory, Languages, and Computation, Addison-Wesley 1979.

A classical compiler textbook with good coverage of lexing and parsing
is:

   A.V. Aho, R. Sethi, J.D. Ullman: Compilers, Principles, Techniques,
   and Tools, Addison-Wesley 1986.

Parser combinators for recursive descent (LL) parsing with
backtracking are popular in the functional programming community.

   Graham Hutton: Higher-order functions for parsing, Journal of
   Functional Programming 2 (July 1992) 323-343. Get it from
   http://www.cs.nott.ac.uk/~gmh/parsing.pdf

   L.C. Paulson: ML for the Working Programmer, Cambridge University
   Press 1996.

Parser combinators were introduced in a remarkably early book on
functional programming techniques:

   W. H. Burge: Recursive Programming Techniques, Addison-Wesley 1975.

There is a parser combinator library in mosml/examples/parsercomb in
the Moscow ML distribution.