Practical 21 solutions

Computer Science 3 - 2007

Programming Language Translation

Practical for Week 21, beginning 17 September 2007 - Solutions

You can obtain machine readable versions of these grammars from the solution kit PRAC21A.ZIP or PRAC21AC.ZIP if you would like to experiment with them further.

Several groups clearly had paid little attention to this practical. The IS project has a lot to answer for! Please study these solutions carefully if you did not complete this practical.

Task 2

Extending the calculator grammar can be done in several ways. Here is a simple one of them, which corresponds to the approach taken in languages like Pascal, which do not allow two signs to appear together:

    COMPILER Calc1 $CN
    /* Simple four function calculator - extended
       P.D. Terry, Rhodes University, 2007 */

    CHARACTERS
      digit      = "0123456789" .
      hexdigit   = digit + "ABCDEF" .

    TOKENS
      decNumber  = digit { digit } .
      hexNumber  = "$" hexdigit { hexdigit } .

    IGNORE CHR(0) .. CHR(31)

    PRODUCTIONS
      Calc1      = { Expression "=" } EOF .
      Expression = ["+" | "-" ] Term { "+" Term  | "-" Term } .
      Term       = Factor { "*" Factor | "/" Factor } .
      Factor     = decNumber | hexNumber | [ "sqrt" | "abs" ] "(" Expression ")" .
    END Calc1.

Another approach, similar to that taken in C++, is as follows:

    PRODUCTIONS
      Calc2      = { Expression "=" } EOF .
      Expression = Term { "+" Term  | "-" Term } .
      Term       = Factor { "*" Factor | "/" Factor } .
      Factor     = ( "+" | "-" ) Factor | Primary .
      Primary    = decNumber | hexNumber | [ "sqrt" | "abs" ] "(" Expression ")" .
    END Calc2.

This allows for expressions like 3 + - 7 or even 3 * -4 or even 3 / + - 4. Because of the way the grammar is written, the last of these is equivalent to 3 / ( + ( - (4))).

Here are some other suggestions. What, if any, differences are there between these and the other solutions presented so far?

    PRODUCTIONS
      Calc3      = { Expression "=" } EOF .
      Expression = Term { "+" Term  | "-" Term } .
      Term       = Factor { "*" Factor | "/" Factor } .
      Factor     = ( "+" | "-" ) Factor | Primary | "sqrt" "(" Expression ")" | "abs" "(" Expression ")" .
      Primary    = ( decNumber | hexNumber | "(" Expression ")" ) .
    END Calc3.

    PRODUCTIONS
      Calc4      = { Expression "=" } EOF .
      Expression = ["+" | "-" ] Term { "+" Term  | "-" Term } .
      Term       = Factor { "*" Factor | "/" Factor } .
      Factor     = Primary | "sqrt" "(" Expression ")" | "abs" "(" Expression ")" .
      Primary    = decNumber | hexNumber | "(" Expression ")" .
    END Calc4.

Several people suggested productions like this

      Factor     = ( "+" | "-" ) Factor | Primary | "sqrt(" Expression ")" ) .

A terminal like "sqrt(" is restrictive. It is usually better to allow white space to appears between method names and parameter lists if the user prefers thiis style.

Task 3

The task of writing an EBNF description of BNF results in a grammar as shown below. There are a few tricks to be learned from this one. Firstly, productions are separated one from the next by the end of line, not by a period. This means we cannot IGNORE the line break characters. This has to be done in a way that depends on your operating system, in general. In practice we might define the eol "character set" as the singleton CHR(10) and then define the EOLN "token" as a single character token, as in the code below, which will work on Unix (where line breaks are demarcated by LF (CHR(10)) or WinTel where they are demarcated by CR + LF (CHR(13) + CHR(10)). Secondly, we wish spaces to become significant characters within the nonterminals that are demarcated by < > brackets. Thirdly, we do not want to define the terminal token to include spaces within it, as we need to be able to distingush each terminal from the next if and when they are separated by spaces. In Cocol there is an implicit IGNORE CHR(32) - but this relates to ignoring spaces between tokens, as is common in almost all programming languages. The only way we can make spaces significant within a token definition is to define the singleton character set CHR(32), as Coco also forbids one from writing a string into a Cocol definition with an embedded spaces, as exemplified by "this has some spaces". Lastly, BNF notation still allows for the use of (round) parentheses; most people left these out of their solutions.

Incidentally, spaces are very rarely significant in computer languages - the definition of nonterminal here is one of the very few exceptions one can think of.

A simple definition of the possible tokens looks like this, which is what many people attempted:

    CHARACTERS
      letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
      alpha  = letter + "0123456789_" .
      eol    = CHR(10) .
      space  = CHR(32) .

    TOKENS
      EOL         = eol .
      nonterminal = "<" ( letter | space ) { alpha | space } ">" .
      terminal    =  letter { alpha } .

However, this is not really adequate. We should be able to allow almost anything as a "terminal". But a definition like this is doomed to failure:

    CHARACTERS
      letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
      alpha  = letter + "0123456789_" .
      inTerm = ANY - CHR(0) .. CHR(32) .
      eol    = CHR(10) .
      space  = CHR(32) .

    TOKENS
      EOL         = eol .
      nonterminal = "<" ( letter | space ) { alpha | space } ">" .
      terminal    = inTerm { inTerm } .

because then one cannot distinguish terminals from non-terminals (why not?). We might try

    CHARACTERS
      control     = CHR(0) .. CHR(31) .
      letter      = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
      alpha       = letter + "0123456789_" .
      graphic     = ANY - control - " " .
      startTerm   = graphic - "<" .
      eol         = CHR(10) .
      space       = CHR(32) .

    TOKENS
      EOL         = eol .
      nonterminal = "<" letter { alpha | space } ">" .
      terminal    = startTerm { graphic } .

but this is also inadequate, as a sequence like (oneterm|twoTerm) with no helpful spaces will all be scanned as a single terminal, and furthermore there is no way to represent the metasymbol < as a terminal in a set of productions (or any of the other metasymbols, for that matter). Perhaps the merits of the Wirth/Cocol notation are now becoming more apparent! One could, of course, try to insist that users insert space around all terminals, but to be more helpful it may be best to exclude all the meta-characters from starting a terminal, and then to insist that if one wants them as terminals one should use a 'string' notation after all. So a better grammar reads like this, although this allows one to have multiple and potentially misleading epsilons in a term, as in

<A> ::= a eps eps eps eps b

    COMPILER BNF1 $CN
    /* Grammar to describe BNF productions
       P.D. Terry, Rhodes University, 2007 */

    CHARACTERS
      control     = CHR(0) .. CHR(31) .
      letter      = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
      alpha       = letter + "0123456789_" .
      graphic     = ANY - control - " " .
      noQuote     = graphic - "'" .
      startTerm   = graphic - "()<|:'" .
      eol         = CHR(10) .
      space       = CHR(32) .

    TOKENS
      EOL         = eol .
      nonterminal = "<" letter { alpha | space } ">" .
      terminal    = startTerm { graphic } | "'" noQuote { noQuote | "''" } "'" .

    COMMENTS FROM "(*" TO "*)"  NESTED
    COMMENTS FROM "/*" TO "*/"  NESTED

    IGNORE CHR(0) .. CHR(9) + CHR(11) .. CHR(31)

    PRODUCTIONS
       BNF1       = { Production } EOF .
       Production = nonterminal "::=" Expression SYNC EOL .
       Expression = Term { "|" Term } .
       Term       = Factor { Factor } .
       Factor     = nonterminal | terminal | "(" Expression ")" | "eps" .
    END BNF1.

A slightly better solution is to have the PRODUCTIONS section reading:

     PRODUCTIONS
        BNF2       = { Production } EOF .
        Production = nonterminal "::=" Expression EOL .
        Expression = Term { "|" Term } .
        Term       = Factor { Factor } | "eps" .
        Factor     = nonterminal | terminal | "(" Expression ")" .
     END BNF2.

although that still allows one to have multiple and potentially misleading

| eps | eps | eps

options in an Expression. If one want to restrict the right hand side to contain at most one eps, and still to have an LL(1) grammar, one is forced to demand that the eps appear first, as in the grammar below:

     PRODUCTIONS
        BNF3       = { Production } EOF .
        Production = nonterminal "::=" Expression EOL .
        Expression = [ "eps" "|" ] Term { "|" Term } .
        Term       = Factor { Factor } .
        Factor     = nonterminal | terminal | "(" Expression ")" .
     END BNF.

One might be tempted to use a very simple factorization:

        Production = nonterminal "::=" Expression EOL .
        Expression = Factor { [ "|" ] Factor } .
        Factor     = nonterminal | terminal .

While this "works", it does not impose the correct precedence on the (explicit) | selector, which is required to be weaker than the (implicit) concatenator operator. In EBNF and BNF a production like

A = B C | D E

must be interpreted as meaning

A = ( B C ) | ( D E )

and not as

A = B ( C | D ) E

Task 4

The Parva extensions produced some interesting submissions. Many of them (understandably!) were too restrictive in certain respects, while others were too permissive. Here is a suggested solution:

    COMPILER Parva $CN
    /* Parva level 1.5 grammar - Prac 21 extensions
       P.D. Terry, Rhodes University, 2007
       Grammar only */

    CHARACTERS
      lf         = CHR(10) .
      backslash  = CHR(92) .
      control    = CHR(0) .. CHR(31) .
      letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
      digit      = "0123456789" .
      stringCh   = ANY - '"' - control - backslash .
      charCh     = ANY - "'" - control - backslash .
      printable  = ANY - control .

    TOKENS

/* Insisting that identifiers cannot end with an underscore is quite easy */

      identifier = letter { letter | digit | "_" { "_" } ( letter | digit ) } .

/* Most submissions suggested a simpler, more restrictive solution still
      identifier = letter { letter | digit | "_" ( letter | digit ) } .
*/

      number     = digit { digit } .
      stringLit  = '"' { stringCh | backslash printable } '"' .
      charLit    = "'" ( charCh   | backslash printable ) "'" .

    COMMENTS FROM "//" TO lf
    COMMENTS FROM "/*" TO "*/"

    IGNORE CHR(9) .. CHR(13)

    PRODUCTIONS
      Parva             = "void" identifier "(" ")" Block .

/* We have introduced the StatementSequence as it comes in quite useful in several places.
   Notice that a statement sequence can be empty.
   This has ramification in the switch statement latr in the grammar */

      Block             = "{" StatementSequence "}" .
      StatementSequence = { Statement } .

/* Staement has to change to incorporate the new possibilities */

      Statement         =   Block | ConstDeclarations | VarDeclarations
                          | Assignment | IncOrDecStatement
                          | IfStatement | WhileStatement | DoWhileStatement | ForStatement
                          | SwitchStatement
                          | ReturnStatement | HaltStatement | BreakStatement | ContinueStatement
                          | ReadStatement | WriteStatement
                          | ";" .

/* Declarations remain the same as before */

      ConstDeclarations = "const" OneConst { "," OneConst } ";" .
      OneConst          = identifier "=" Constant .
      Constant          = number | charLit | "true" | "false" | "null" .
      VarDeclarations   = Type OneVar { "," OneVar } ";" .
      OneVar            = identifier [ "=" Expression ] .

/* To deal with statements like i = 6; and i++; we need to manipulate the production
   for Assignment if we want to preserve an LL(1) grammar */

      Assignment        = Designator ( "=" Expression | "++" | "--" ) ";" .

/* Prefix increment/decrement statements like ++i; or ++list[7]; cause no LL(1) problems.
   Note that it is necessary to add a concluding semicolon */

      IncOrDecStatement = ( "++" | "--" ) Designator ";" .

/* In all these it is useful to maintain generality by using Designator, not identifier */

      Designator        = identifier [ "[" Expression "]" ] .

/* The if-then-else construction is most easily described as follows. Although
   this is not LL(1), this works admirably - it is simply the well-known dangling
   else ambiguity, which the parser resolves by associating elsif and else clauses
   with the most recent if */

      IfStatement       = "if" "(" Condition ")" Statement [ "else" Statement ] .

/* The Pascal-like "for" statement is almost trivial. But note that for generality one should
   use Designator instead of identifier, and Expression instead of Constant */

      ForStatement      = "for" Designator "=" Expression "to" Expression [ "by" Expression ] Statement .

/* The C-like "do-while" statement is also straightforward
   Note that it is necessary to add a concluding semicolon */

      DoWhileStatement  = "do"  Statement "while" "(" Condition ")" ";" .

/* The switch statement has to be handled carefully.  The labelled "case" arms are
   optional, but the "default" option can only appear once.  The selection can best
   be done by a general Expression rather than an identifier, and the case "labels"
   can be constants of any type that would match the type of the selector expression.
   In the grammar below the default option can only occur as the last of the "arms".
   This is not the case in Java and C#, although traditionally programs are coded in
   that way. */

      SwitchStatement   = "switch" "(" Expression ")" "{"
                            { OneCase }
                            [ "default" ":" StatementSequence ]
                          "}" .
      OneCase           = "case" Constant ":" StatementSequence .

/* The case arms usually have to contain a break statement, which is syntactically
   simply another form of statement.  There is actually a context-sensitive feature
   embedded in this - break statements cannot really be placed "anywhere", but we
   reserve further discussion for a later occasion.  Similarly the "continue"
   statement is syntactically simple, but actually context-sensitive */

      BreakStatement    = "break" ";" .
      ContinueStatement = "continue" ";" .

/* Most of the rest of the grammar remains unchanged: */
      WhileStatement    = "while" "(" Condition ")" Statement .
      ReturnStatement   = "return" ";" .
      HaltStatement     = "halt" ";" .
      ReadStatement     = "read" "(" ReadElement { "," ReadElement } ")" ";" .
      ReadElement       = stringLit | Designator .
      WriteStatement    = "write" "(" WriteElement { "," WriteElement } ")" ";" .
      WriteElement      = stringLit | Expression .
      Condition         = Expression .
      Expression        = AddExp [ RelOp AddExp ] .
      AddExp            = [ "+" | "-" ] Term { AddOp Term } .
      Term              = Factor { MulOp Factor } .
      Factor            =   Designator | Constant
                          | "new" BasicType "[" Expression "]"
                          | "!" Factor | "(" Expression ")" .
      Type              = BasicType [ "[]" ] .
      BasicType         = "int" | "bool" .
      AddOp             = "+" | "-" | "||" .

/* The % operator has the same precedence as other multiplicative operators */

      MulOp             = "*" | "/" | "%" | "&&" .
      RelOp             = "==" | "!=" | "<" | "<=" | ">" | ">=" .
    END Parva.

Task 5

The description of a book index has always produced some innovative and imaginative solutions whenever I have used it as an example. There is no single correct answer - looking at the example given usually leads to students discovering a set of productions to which I can respond "but what if you had an entry in the index reading like this" and finding another plausible one. Here is one suggested solution, in which I have played tricks with the selection of character sets. An important idea is to be able to decompose the solution to reflect the important ideas that the entries in an index have two main components - a "subject" and a "list of references".

Many submissions tried to define a "space token". This is futile; spaces between tokens are always ignored by scanners produced by Coco. One can define tokens that contain spaces, but this is only of much use in tokens like strings or the non-terminals of Task 3.

     COMPILER Index $CNX
     /* Grammar describing index in a book
        P.D. Terry, Rhodes University, 2007 */

     CHARACTERS
     /* Notice the careful and unusual choice of character sets */
       digit      = "0123456789" .
       startword  = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz('" + '"' .
       inword     = startword + digit + "-+)" .
       eol        = CHR(10) .

     IGNORE CHR(0) .. CHR(9) + CHR(11) .. CHR(31)

     TOKENS
     /* Notice the careful and unusual definition for identifier */
       word       = startword { inword } .
       number     = digit { digit } .
       EOL        = eol .

     PRODUCTIONS
       Index      = { Entry } EOF .
       Entry      = Key References EOL .
       Key        = word { "," word | word } .
       References = DirectRefs | CrossRef .
       DirectRefs = PageRefs { "," PageRefs  } .
       PageRefs   = number [ "-" number ] | Appendix .
       Appendix   = "Appendix" number .
       CrossRef   = "--" "see" Key .
     END Index .

I received an intriguing submission on the following lines (modified slightly from the original). Compare it with the one above and try to work out whether this is better than the one above, or whether it is incorrect.

     COMPILER Index1 $CNX
     /* Grammar describing index in a book
        Based on code by Wright, Nottingham, Ngwaile and Charlton
        This version by P.D. Terry, Rhodes University, 2007 */

     CHARACTERS
       control    = CHR(0) .. CHR(31) .
       digit      = "0123456789" .
       inword     = ANY - control - digit - ", " .
       lf         = CHR(10) .
       cr         = CHR(13) .
     TOKENS
       word       = inword { inword | digit } .
       number     = digit { digit } .
       EOL        = cr [ lf ] | lf .

     IGNORE control - cr - lf

     PRODUCTIONS
       Index1     = { Entry } EOF .
       Entry      = Key References EOL .
       Key        = Words { "," Words } .
       Words      = word { word } .
       References = DirectRefs | CrossRef .
       DirectRefs = PageRefs { "," PageRefs  } .
       PageRefs   = number [ "-" number ] | Appendix .
       Appendix   = "Appendix" number .
       CrossRef   = "--" "see" Key .
     END Index1 .