Computer Science 3 - 2000

Programming Language Translation

Practical for Week 22, beginning 25 September 2000 - Solutions

You can obtain machine readable versions of these grammars from the solution kit PRAC22A.ZIP if you would like to experiment with them further.

Task 2

Extending the calculator grammar can be done in two ways. Here is one of them, which corresponds to the approach taken in languages like Pascal, which do not allow two signs to appear together:

     COMPILER Calc  $XCN
     /* Simple four function calculator - extended
        P.D. Terry, Rhodes University, 2000 */

     CHARACTERS
       digit      = "0123456789" .
       hexdigit   = digit + "ABCDEF" .

     IGNORE CHR(1) .. CHR(31)

     TOKENS
       decNumber  = digit { digit } .
       hexNumber  = "$" hexdigit { hexdigit } .

     PRODUCTIONS
       Calc       = { Expression "=" } EOF .
       Expression = ["+" | "-" ] Term { "+" Term  | "-" Term } .
       Term       = Factor { "*" Factor | "/" Factor } .
       Factor     = Primary [ "!" ] .
       Primary    = decNumber | hexNumber | "(" Expression ")" .
     END Calc.

Another approach, similar to that taken in C++ is as follows:

     PRODUCTIONS
       Calc       = { Expression "=" } EOF .
       Expression = Term { "+" Term  | "-" Term } .
       Term       = Factor { "*" Factor | "/" Factor } .
       Factor     = ( "+" | "-" ) Factor | Primary [ "!" ] .
       Primary    = decNumber | hexNumber | "(" Expression ")" .
     END Calc.

This allows for expressions like 3 + - 7 or even 3 * -4 or even 3 / + - 4!. Because of the way the grammar is written, the last of these is equivalent to 3 / ( + ( - (4!))). It is clearer like this than if one tries to simplify the definition of Factor still further to

       Factor     = ( "+" | "-" ) Primary [ "!" ] .

in which the interpretation of -4! would be (-4)! and not -(4!) as it should be.

Task 3

The task of writing an EBNF description of BNF results in a grammar as shown below. There are a few tricks to be learned from this one. Firstly, productions are separated one from the next by the end of line, not by a period. This means we cannot IGNORE the line break characters. This has to be done in a way that depends on your operating system, in general. In practice we might define the eol "character set" as the singleton CHR(10) and then define the EOLN "token" as a single character token, as in the code below, which will work on Unix (where line breaks are demarcated by LF (CHR(10)) or WinTel where they are demarcated by CR + LF (CHR(13) + CHR(10)). Secondly, we wish spaces to become significant characters within the nonterminals that are demarcated by < > brackets. In Cocol there is an implicit IGNORE CHR(32) - but this relates to ignoring spaces between tokens, as is common in almost all programming languages. The only way we can make spaces significant within a token definition is to define the singleton character set CHR(32), as Coco also forbids one from writing a string into a Cocol definition with an embedded spaces, as exemplified by "this has some spaces". Thirdly, we do not want to define the terminal token to include spaces within it.

The simplest grammar now reads like this, but this allows one to have multiple and potentially misleading epsilons in a term, as in

<A> ::= a e e e e e b

     COMPILER BNF $XCN
     /* Grammar to describe BNF productions
        P.D. Terry, Rhodes University, 2000 */

     CHARACTERS
       letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
       alpha  = letter + "0123456789_" .
       eol    = CHR(10) .
       space  = CHR(32) .

     IGNORE CHR(1) .. CHR(11) + CHR(13)

     COMMENTS FROM "(*" TO "*)"  NESTED
     COMMENTS FROM "/*" TO "*/"  NESTED

     TOKENS
       EOL         = eol .
       nonterminal = "<" letter { alpha | space } ">" .
       terminal    =  letter { alpha } .

     PRODUCTIONS
        BNF        = { Production } EOF .
        Production = SYNC nonterminal WEAK "::=" Expression SYNC EOL .
        Expression = Term { WEAK "|" Term } .
        Term       = Factor { Factor } .
        Factor     = nonterminal | terminal | "(" Expression ")" | "e" .
     END BNF.

A slightly better solution is to have the PRODUCTIONS section reading:

     PRODUCTIONS
        BNF        = { Production } EOF .
        Production = SYNC nonterminal WEAK "::=" Expression SYNC EOL .
        Expression = Term { WEAK "|" Term } .
        Term       = Factor { Factor } | "e" .
        Factor     = nonterminal | terminal | "(" Expression ")" .
     END BNF.

although that still allows one to have multiple and potentially misleading

| e | e | e

options in an expression. If one want to restrict the right hand side to contain at most one e, and still to have an LL(1) grammar, one is forced to demand that the e appear first, as in the grammar below:

     PRODUCTIONS
        BNF        = { Production } EOF .
        Production = SYNC nonterminal WEAK "::=" Expression SYNC EOL .
        Expression = [ "e" "|" ] Term { WEAK "|" Term } .
        Term       = Factor { Factor } .
        Factor     = nonterminal | terminal | "(" Expression ")" .
     END BNF.

Task 4

The Boolean grammar is very similar to the arithmetic one, as it happens! To get the operator precedences correct, while keeping the grammar LL(1) we need to introduce four levels of non-terminal:

     COMPILER Bool $XCN
     /* Boolean expression parser
        P.D. Terry, Rhodes University, 2000 */

     IGNORE CHR(1) .. CHR(31)

     COMMENTS FROM "(*" TO "*)"  NESTED
     COMMENTS FROM "/*" TO "*/"  NESTED

     CHARACTERS
       letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .

     TOKENS
       variable   = letter .
     PRODUCTIONS
       Bool       = { Expression "=" } .
       Expression = Term { Or Term } .
       Term       = Factor { [ And ] Factor } .
       Factor     = "NOT" Factor | Primary { "'" } .
       Primary    = True | False | variable | "(" Expression ")" .
       True       = "TRUE" | "1" .
       False      = "FALSE" | "0" .
       And        = "AND" | "&" | "." .
       Or         = "OR" | "|" | "+" .
     END Bool.

Many people wrote the constants into the TOKENS section as follows:

     TOKENS
       variable   = letter .
       True       = "TRUE" | "1" .
       False      = "FALSE" | "0" .
       And        = "AND" | "&" | "." .
       Or         = "OR" | "|" | "+" .

but I suggest that you write fixed token definitions straight into the PRODUCTIONS section where they appear, and retain the TOKENS section for the definition of things like identifiers, numbers and strings that are all similar in form but different in spelling.

Task 5

The description of a book index has always produced some innovative and imaginative solutions whenever I have used it as an example. There is no single correct answer - looking at the example given usually leads to students discovering a set of productions to which I can respond "but what if you had an entry in the index reading like this" and finding another plausible one. Here is one suggested solution, in which I have played tricks with the selection of character sets:

     COMPILER Index $CNX
     /* Grammar describing index in a book
        P.D. Terry, Rhodes University, 2000 */

     CHARACTERS
     /* Notice the careful and unusual choice of character sets */
       digit      = "0123456789" .
       startword  = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz('" + '"' .
       inword     = startword + digit + "-+)" .
       eol        = CHR(13) .

     IGNORE CHR(1) .. CHR(12)

     TOKENS
     /* Notice the careful and unusual definition for identifier */
       word       = startword { inword } [ "," ] .
       number     = digit { digit } .
       EOL        = eol .

     PRODUCTIONS
       Index      = { Entry } EOF .
       Entry      = Key References SYNC EOL .
       Key        = word { word } .
       References = DirectRefs | CrossRef .
       DirectRefs = PageRefs { "," ( PageRefs | Appendix ) } .
       PageRefs   =  number [ "-" number ] .
       Appendix   = "Appendix" number .
       CrossRef   = "--" "see" Key .
     END Index .

Task 6

The Clang extensions saw many students falling into the traps of not drawing a distinction between the various types of inline opcodes, forgetting that if structures were introduced one would have to extend the definition of a designator to be able to "select" the components, and (less frequently) producing restricted definitions of the structure form of type. A solution which allows one to have structures within structures etc is as follows:

     COMPILER Clang $XCN  /* generate compiler, C++ */
     /* Simple grammar for Clang level 1.1
        P.D. Terry, Rhodes University, 2000 */

     IGNORE CASE
     IGNORE CHR(9) .. CHR(13)
     COMMENTS FROM "(*" TO "*)"

     CHARACTERS
       cr         = CHR(13) .
       lf         = CHR(10) .
       letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
       digit      = "0123456789" .
       instring   = ANY - "'" - cr - lf .

     TOKENS
       identifier = letter { letter | digit } .
       number     = digit { digit } .
       string     = "'" (instring | "''") { instring | "''" } "'" .

     PRODUCTIONS
       Clang             = "PROGRAM" identifier ";" Block "." .
       Block             = { ConstDeclarations | VarDeclarations } CompoundStatement .
       ConstDeclarations = "CONST" OneConst { OneConst } .
       OneConst          = identifier "=" number ";" .

     /* We extend the possibilities for VarDeclarations in a way
        that allows us to have structures within structures and so on.  */

       VarDeclarations   = TypeSpecifier OneVar { "," OneVar } ";" .
       TypeSpecifier     = "INT" | "BOOL" | "CHAR" | Structure .
       Structure         = "STRUCT" "{" VarDeclarations { VarDeclarations } "}" .
       OneVar            = identifier [ UpperBound ] .
       UpperBound        = "[" number "]" .
       CompoundStatement = "BEGIN" Statement { ";" Statement } "END" .

     /* We add another option into Statement - note that Statement is nullable.
        It is quite permissible to have empty statements in most languages */

       Statement         = [   CompoundStatement | Assignment     | InlineStatement
                             | IfStatement       | WhileStatement
                             | ReadStatement     | WriteStatement ] .
       Assignment        = Variable ":=" Expression .
       Variable          = Designator .

     /* We extend the possibilities for Designator to select fields and elements */

       Designator        = identifier [ "[" Expression "]" ] { "." identifier [ "[" Expression "]" ] } .
       IfStatement       = "IF" Condition "THEN" Statement .
       WhileStatement    = "WHILE" Condition "DO" Statement .
       Condition         = Expression RelOp Expression .
       ReadStatement     = "READ" "(" Variable { "," Variable } ")" .
       WriteStatement    = "WRITE" [ "(" WriteElement { "," WriteElement }  ")" ] .
       WriteElement      = string | Expression .
       Expression        = ( "+" Term | "-" Term | Term ) { AddOp Term } .
       Term              = Factor { MulOp Factor } .
       Factor            = Designator | number | "(" Expression ")" .
       AddOp             = "+" | "-" .
       MulOp             = "*" | "/" .
       RelOp             = "=" | "<>" | "<" | "<=" | ">" | ">=" .

       InlineStatement   = "INLINE" { Code } "END" .

     /* In defining the InlineStatement we need to distinguish between the various
        classes of opcode - there are three of these.  A production like
               Code = Identifier [ [ "-"] number | string ] .
        is correct in the sense that it would parse all possible inline statements,
        but in most simple assemblers the mnemonics are best thought of as key words
        (Some macro-assemblers allow programmers to define their own opcodes
        in the form of identifiers, however */

       Code              = OneWord | TwoWord | PrintString .
       OneWord           =   "NEG" | "ADD" | "SUB" | "MUL" | "DVD" | "ODD"
                           | "EQL" | "NEQ" | "LSS" | "GEQ" | "GTR" | "LEQ" | "STK"
                           | "STO" | "HLT" | "INN" | "PRN" | "INC" | "PRC" | "NLN"
                           | "VAL" | "NOP" .
       TwoWord           = ( "LIT" | "ADR" | "PSH" | "POP" | "DSP" | "BRN" | "BZE" )
                           [ "+" | "-" ] number .
       PrintString       =  "PRS" string .
     END Clang.