Computer Science 3 - 2002

Programming Language Translation


Practical for Week 9, beginning 15 April 2002 - Solutions

You can obtain machine readable versions of these grammars from the solution kit PRAC09A.ZIP if you would like to experiment with them further.


Task 2

Extending the calculator grammar can be done in two ways. Here is one of them, which corresponds to the approach taken in languages like Pascal, which do not allow two signs to appear together:

     COMPILER Calc  $XCN
     /* Simple four function calculator - extended
        P.D. Terry, Rhodes University, 2002 */

     CHARACTERS
       digit      = "0123456789" .
       hexdigit   = digit + "ABCDEF" .

     IGNORE CHR(1) .. CHR(31)

     TOKENS
       decNumber  = digit { digit } .
       hexNumber  = "$" hexdigit { hexdigit } .

     PRODUCTIONS
       Calc       = { Expression "=" } EOF .
       Expression = ["+" | "-" ] Term { "+" Term  | "-" Term } .
       Term       = Factor { "*" Factor | "/" Factor } .
       Factor     = Primary [ "!" ] .
       Primary    = decNumber | hexNumber | "(" Expression ")" .
     END Calc.

Another approach, similar to that taken in C++, is as follows:

     PRODUCTIONS
       Calc       = { Expression "=" } EOF .
       Expression = Term { "+" Term  | "-" Term } .
       Term       = Factor { "*" Factor | "/" Factor } .
       Factor     = ( "+" | "-" ) Factor | Primary [ "!" ] .
       Primary    = decNumber | hexNumber | "(" Expression ")" .
     END Calc.

This allows for expressions like 3 + - 7 or even 3 * -4 or even 3 / + - 4!. Because of the way the grammar is written, the last of these is equivalent to 3 / ( + ( - (4!))). It is clearer like this than if one tries to simplify the definition of Factor still further to

       Factor     = ( "+" | "-" ) Primary [ "!" ] .

in which the interpretation of -4! would be (-4)! and not -(4!) as it should be.


Task 3

The task of writing an EBNF description of BNF results in a grammar as shown below. There are a few tricks to be learned from this one. Firstly, productions are separated one from the next by the end of line, not by a period. This means we cannot IGNORE the line break characters. This has to be done in a way that depends on your operating system, in general. In practice we might define the eol "character set" as the singleton CHR(10) and then define the EOLN "token" as a single character token, as in the code below, which will work on Unix (where line breaks are demarcated by LF (CHR(10)) or WinTel where they are demarcated by CR + LF (CHR(13) + CHR(10)). Secondly, we wish spaces to become significant characters within the nonterminals that are demarcated by < > brackets. Thirdly, we do not want to define the terminal token to include spaces within it, as we need to be able to distingush each terminal from the next if and when they are separated by spaces. In Cocol there is an implicit IGNORE CHR(32) - but this relates to ignoring spaces between tokens, as is common in almost all programming languages. The only way we can make spaces significant within a token definition is to define the singleton character set CHR(32), as Coco also forbids one from writing a string into a Cocol definition with an embedded spaces, as exemplified by "this has some spaces". Lastly, BNF notation still allows for the use of (round) parentheses; most people left these out of their solutions.

The simplest grammar now reads like this, but this allows one to have multiple and potentially misleading epsilons in a term, as in

<A> ::= a e e e e e b

     COMPILER BNF $XCN
     /* Grammar to describe BNF productions
        P.D. Terry, Rhodes University, 2002 */

     CHARACTERS
       letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
       alpha  = letter + "0123456789_" .
       eol    = CHR(10) .
       space  = CHR(32) .

     IGNORE CHR(1) .. CHR(9) + CHR(11) + CHR(13)

     COMMENTS FROM "(*" TO "*)"  NESTED
     COMMENTS FROM "/*" TO "*/"  NESTED

     TOKENS
       EOL         = eol .
       nonterminal = "<" letter { alpha | space } ">" .
       terminal    =  letter { alpha } .

     PRODUCTIONS
        BNF        = { Production } EOF .
        Production = SYNC nonterminal WEAK "::=" Expression SYNC EOL .
        Expression = Term { WEAK "|" Term } .
        Term       = Factor { Factor } .
        Factor     = nonterminal | terminal | "(" Expression ")" | "e" .
     END BNF.

A slightly better solution is to have the PRODUCTIONS section reading:

     PRODUCTIONS
        BNF        = { Production } EOF .
        Production = SYNC nonterminal WEAK "::=" Expression SYNC EOL .
        Expression = Term { WEAK "|" Term } .
        Term       = Factor { Factor } | "e" .
        Factor     = nonterminal | terminal | "(" Expression ")" .
     END BNF.

although that still allows one to have multiple and potentially misleading

| e | e | e

options in an expression. If one want to restrict the right hand side to contain at most one e, and still to have an LL(1) grammar, one is forced to demand that the e appear first, as in the grammar below:

     PRODUCTIONS
        BNF        = { Production } EOF .
        Production = SYNC nonterminal WEAK "::=" Expression SYNC EOL .
        Expression = [ "e" "|" ] Term { WEAK "|" Term } .
        Term       = Factor { Factor } .
        Factor     = nonterminal | terminal | "(" Expression ")" .
     END BNF.


Task 4

The description of a book index has always produced some innovative and imaginative solutions whenever I have used it as an example. There is no single correct answer - looking at the example given usually leads to students discovering a set of productions to which I can respond "but what if you had an entry in the index reading like this" and finding another plausible one. Here is one suggested solution, in which I have played tricks with the selection of character sets:

     COMPILER Index $CNX
     /* Grammar describing index in a book
        P.D. Terry, Rhodes University, 2002 */

     CHARACTERS
     /* Notice the careful and unusual choice of character sets */
       digit      = "0123456789" .
       startword  = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz('" + '"' .
       inword     = startword + digit + "-+)" .
       eol        = CHR(10) .

     IGNORE CHR(1) .. CHR(9) + CHR(11) .. CHR(31)

     TOKENS
     /* Notice the careful and unusual definition for identifier */
       word       = startword { inword } .
       number     = digit { digit } .
       EOL        = eol .

     PRODUCTIONS
       Index      = { Entry } EOF .
       Entry      = Key References SYNC EOL .
       Key        = word { "," word | word } .
       References = DirectRefs | CrossRef .
       DirectRefs = PageRefs { "," PageRefs  } .
       PageRefs   = number [ "-" number ] | Appendix .
       Appendix   = "Appendix" number .
       CrossRef   = "--" "see" Key .
     END Index .


Task 5

The Parva extensions saw many students falling into the traps of not drawing a distinction between the various types of inline opcodes, forgetting that if structures were introduced one would have to extend the definition of a designator to be able to "select" the components, and (less frequently) producing restricted definitions of the structure form of type. A solution which allows one to have structures within structures etc is as follows:

     COMPILER Parva1  $XCN
     /* Parva1 level 0 grammar - no methods, parameters
        P.D. Terry, Rhodes University, 2002 */

     CHARACTERS
       cr         = CHR(13) .
       lf         = CHR(10) .
       back       = CHR(92) .
       letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
       digit      = "0123456789" .
       instring   = ANY - '"' - "'" - cr - lf - CHR(0) .

       cntl       = CHR(0) .. CHR(31).
       noQuote1   = ANY - '"' - cntl - back .
       noQuote2   = ANY - "'" - cntl - back .
       graphic1   = ANY - cntl .

     IGNORE CHR(9) .. CHR(13)
     COMMENTS FROM "//" TO lf
     COMMENTS FROM "/*" TO "*/"

     TOKENS
       identifier = letter { letter | digit | "_" } .
       number     = digit { digit } .

     /* to describe strings and chars with escape sequences we can try */

       string       = '"' { noQuote1 | back graphic1 } '"' .
       character    = "'" ( noQuote2 | back graphic1 ) "'" .

     /* this one also works

       string     = '"' { graphic | back graphic | back "'" | back '"' | back back } '"' .
       character  = "'" ( graphic | back graphic | back "'" | back '"' | back back ) "'" .

     */

     PRODUCTIONS
       Parva1            = "void" "main" "(" ")" Block .
       Block            = "{" { Declaration | Statement } "}" .
       Declaration      = ConstDeclaration | VarDeclarations .

     /* We add other options into Statement */

       Statement        = SYNC (
                              Block | AssignOrIncDec | InlineStatement
                            | WhileStatement  | IfStatement | DoStatement
                            | ReadStatement | WriteStatement
                            | ReturnStatement | EmptyStatement
                          ) .

       ConstDeclaration =  "const" identifier "=" number WEAK ";" .

     /* We extend the possibilities for VarDeclarations in a way
        that allows us to have structures within structures and so on.  */

       VarDeclarations  = TypeSpecifier OneVar { "," OneVar } WEAK ";" .
       TypeSpecifier    = "int" | "boolean" | "char" | Structure .
       Structure        = "struct" "{" VarDeclarations { VarDeclarations } "}" .
       OneVar           = identifier [ ArraySize ] .
       ArraySize        = "[" number "]" .

       AssignOrIncDec   = Variable ( "=" Expression | "++" | "--" ) WEAK ";" .
       Variable         = Designator .

     /* We extend the possibilities for Designator to select fields and elements */

       Designator       = identifier [ "[" Expression "]" ] { "." identifier [ "[" Expression "]" ] } .

     /* Adding the "else" part is easy */

       IfStatement      = "if" "(" Condition ")" Block [ "else" Block ] .
       WhileStatement   = "while" "(" Condition ")" Block .

     /* Adding the do-while statement is easy - note the trailing semicolon, which many forgot */

       DoStatement      = "do" Block "while" "(" Condition ")" WEAK ";" .

     /* Allowing prompts in read statements is easy - many submitted solutions were restrictive */

       ReadStatement    = "read" "(" ReadElement { "," ReadElement } ")" WEAK ";" .
       ReadElement      = string | Variable .

       WriteStatement   = "print" "(" WriteElement { "," WriteElement } ")" WEAK ";" .
       WriteElement     = string | Expression .

       ReturnStatement  = "return" WEAK ";" .
       EmptyStatement   = WEAK ";" .

/*    We need an extended grammar for Expressions to be able to handle Boolean
      Expressions with the operators at the correct precedence.  The model here
      is based on the precedence levels found in C++.  Note the true/false constants  */

       Condition        = Expression .
       Expression       = AndExp { "||" AndExp } .
       AndExp           = OrExp { "&&" OrExp } .
       OrExp            = AddExp [ RelOp AddExp ] .
       AddExp           = MultExp { AddOp MultExp } .
       MultExp          = UnaryExp { MulOp UnaryExp } .
       UnaryExp         = Factor | "+" UnaryExp | "-" UnaryExp | "!" UnaryExp .
       Factor           =   Designator | number | character | "true" | "false"
                          | "(" Expression ")" .

       MulOp            = "*" | "/" | "%"  .
       AddOp            = "+" | "-" .
       RelOp            = "==" | "!=" | "<" | "<=" | ">" | ">=" .

       InlineStatement  = "inline" { Code } "end" WEAK ";" .

     /* In defining the InlineStatement we need to distinguish between the various
        classes of opcode - there are three of these.  A production like
               Code = Identifier [ [ "-" ] number | string ] .
        is correct in the sense that it would parse all possible inline statements,
        but in most simple assemblers the mnemonics are best thought of as key words
        (Some macro-assemblers allow programmers to define their own opcodes
        in the form of identifiers, however) */

       Code             = OneWord | TwoWord | PrintString .
       OneWord          =   "NEG" | "ADD" | "SUB" | "MUL" | "DVD" | "ODD"
                          | "EQL" | "NEQ" | "LSS" | "GEQ" | "GTR" | "LEQ" | "STK"
                          | "STO" | "HLT" | "INN" | "PRN" | "INC" | "PRC" | "NLN"
                          | "VAL" | "NOP" .
       TwoWord          = ( "LIT" | "ADR" | "PSH" | "POP" | "DSP" | "BRN" | "BZE" )
                          [ "+" | "-" ] number .
       PrintString      =  "PRS" string .

     END Parva1.


Home  © P.D. Terry