Prac 21 Solution

Computer Science 3 - 2004

Programming Language Translation

Practical for Week 21, beginning 15 September 2004 - Solutions

You can obtain machine readable versions of these grammars from the solution kit PRAC21A.ZIP or PRAC21AC.ZIP if you would like to experiment with them further.

Task 2

Extending the calculator grammar can be done in several ways. Here is a simple one, which corresponds to the approach taken in languages like Pascal, which do not allow two signs to appear together:

    COMPILER Calc1 $CN
    /* Simple four function calculator - extended
       P.D. Terry, Rhodes University, 2004 */

    CHARACTERS
      digit      = "0123456789" .
      hexdigit   = digit + "ABCDEF" .

    TOKENS
      decNumber  = digit { digit } .
      hexNumber  = "$" hexdigit { hexdigit } .

    IGNORE CHR(0) .. CHR(31)

    PRODUCTIONS
      Calc1      = { Expression "=" } EOF .
      Expression = ["+" | "-" ] Term { "+" Term  | "-" Term } .
      Term       = Factor { "*" Factor | "/" Factor } .
      Factor     = Primary { "!" } .
      Primary    = decNumber | hexNumber | [ "sqrt" ] "(" Expression ")" .
    END Calc1.

Another approach, similar to that taken in C++, is as follows:

    PRODUCTIONS
      Calc2      = { Expression "=" } EOF .
      Expression = Term { "+" Term  | "-" Term } .
      Term       = Factor { "*" Factor | "/" Factor } .
      Factor     = ( "+" | "-" ) Factor | Primary { "!" } .
      Primary    = decNumber | hexNumber | [ "sqrt" ] "(" Expression ")" .
    END Calc2.

This allows for expressions like 3 + - 7 or even 3 * -4 or even 3 / + - 4!. Because of the way the grammar is written, the last of these is equivalent to 3 / ( + ( - (4!))). It is clearer like this than if one tries to simplify the definition of Factor still further to

       Factor     = ( "+" | "-" ) Primary { "!" } .

in which the interpretation of -4! would be (-4)! and not -(4!) as it should be.
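
The difference between the two definitions of Factor can be checked with a small recursive descent sketch. It is written in Python purely for illustration and is not part of the solution kit; the names Parser, show and nested_sign are hypothetical, hex numbers and sqrt are left out, and the unsigned alternative is retained in both variants (an assumption) so that ordinary numbers still parse. Each routine returns a fully parenthesized string showing how the operands were grouped.

```python
import re


def tokenize(text):
    # Decimal numbers and single-character operators; white space is skipped.
    return re.findall(r"\d+|[-+*/!()]", text)


class Parser:
    def __init__(self, text, nested_sign=True):
        self.toks = tokenize(text)
        self.pos = 0
        self.nested_sign = nested_sign  # True: Calc2's Factor; False: the simplified Factor

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def take(self):
        tok = self.toks[self.pos]
        self.pos += 1
        return tok

    def expression(self):
        # Expression = Term { "+" Term | "-" Term } .
        left = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            left = f"({left} {op} {self.term()})"
        return left

    def term(self):
        # Term = Factor { "*" Factor | "/" Factor } .
        left = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            left = f"({left} {op} {self.factor()})"
        return left

    def factor(self):
        if self.peek() in ("+", "-"):
            sign = self.take()
            if self.nested_sign:
                # Factor = ( "+" | "-" ) Factor | Primary { "!" } .
                return f"({sign}{self.factor()})"
            # Factor = ( "+" | "-" ) Primary { "!" } .
            return self.postfix(f"({sign}{self.primary()})")
        return self.postfix(self.primary())

    def postfix(self, operand):
        # Apply any trailing "!" operators to the operand built so far.
        while self.peek() == "!":
            self.take()
            operand = f"({operand}!)"
        return operand

    def primary(self):
        tok = self.take()
        if tok == "(":
            inner = self.expression()
            self.take()  # the closing ")"
            return inner
        return tok


def show(text, nested_sign=True):
    return Parser(text, nested_sign).expression()
```

With these definitions show("-4!") gives "(-(4!))", while show("-4!", nested_sign=False) gives "((-4)!)", which is the unwanted grouping described above.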

Here are some other suggestions. What, if any, differences are there between these and the other solutions presented so far?

    PRODUCTIONS
      Calc4      = { Expression "=" } EOF .
      Expression = Term { "+" Term  | "-" Term } .
      Term       = Factor { "*" Factor | "/" Factor } .
      Factor     = ( "+" | "-" ) Factor | Primary | "sqrt" "(" Expression ")" .
      Primary    = ( decNumber | hexNumber | "(" Expression ")" ) { "!" } .
    END Calc4.

    PRODUCTIONS
      Calc5      = { Expression "=" } EOF .
      Expression = ["+" | "-" ] Term { "+" Term  | "-" Term } .
      Term       = Factor { "*" Factor | "/" Factor } .
      Factor     = Primary { "!" } | "sqrt" "(" Expression ")" .
      Primary    = decNumber | hexNumber | "(" Expression ")" .
    END Calc5.

Several people suggested productions like this

      Factor     = ( "+" | "-" ) Factor | Primary | "sqrt(" Expression ")" .

A terminal like "sqrt(" is restrictive. It is usually better to allow white space to appear between method names and parameter lists if the user prefers this style.
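
The restriction is easy to see with a toy scanner. The following is a minimal Python sketch, not the scanner that Coco generates; the function scan and its keywords parameter are hypothetical names introduced for this illustration. Keywords are tried first, longest first, and white space between tokens is skipped.

```python
import re


def scan(text, keywords):
    """Split text into tokens, trying the given keywords first (longest first)."""
    alternatives = [re.escape(k) for k in sorted(keywords, key=len, reverse=True)]
    # After the keywords: numbers, runs of letters, then any other non-blank character.
    pattern = re.compile("|".join(alternatives + [r"\d+", r"[A-Za-z]+", r"\S"]))
    return pattern.findall(text)
```

If "sqrt(" is a single terminal, scan("sqrt (16)", ["sqrt("]) never yields that terminal, because the space breaks it up. If "sqrt" and "(" are separate terminals, scan("sqrt(16)", ["sqrt", "("]) and scan("sqrt (16)", ["sqrt", "("]) yield the same token list.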

Task 3

The task of writing an EBNF description of BNF results in a grammar as shown below. There are a few tricks to be learned from this one.

Firstly, productions are separated one from the next by the end of line, not by a period. This means we cannot IGNORE the line break characters, and in general this has to be handled in a way that depends on your operating system. In practice we might define the eol "character set" as the singleton CHR(10) and then define the EOL "token" as a single character token, as in the code below. This will work on Unix, where line breaks are demarcated by LF (CHR(10)), and on WinTel, where they are demarcated by CR + LF (CHR(13) + CHR(10)), since the CR falls among the ignored control characters.

Secondly, we wish spaces to become significant characters within the nonterminals that are demarcated by < > brackets.

Thirdly, we do not want to define the terminal token to include spaces within it, as we need to be able to distinguish each terminal from the next if and when they are separated by spaces. In Cocol there is an implicit IGNORE CHR(32), but this relates to ignoring spaces between tokens, as is common in almost all programming languages. The only way we can make spaces significant within a token definition is to define the singleton character set CHR(32), as Coco also forbids one from writing a string with embedded spaces, as exemplified by "this has some spaces", into a Cocol definition.

Lastly, BNF notation still allows for the use of (round) parentheses; most people left these out of their solutions.

Incidentally, spaces are very rarely significant in computer languages - the definition of nonterminal here is one of the very few exceptions one can think of.
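
The treatment of spaces and line breaks described above can be imitated with regular expressions. This is only a Python sketch of the token definitions used in the grammars that follow, not what Coco itself produces. The names TOKEN and scan are hypothetical, and "eps" is simply scanned as a terminal here rather than being treated as a keyword.

```python
import re

TOKEN = re.compile(r"""
    (?P<nonterminal> <[A-Za-z ][A-Za-z0-9_ ]*> )   # spaces are significant inside the brackets
  | (?P<terminal>    [A-Za-z][A-Za-z0-9_]* )       # no spaces allowed within a terminal
  | (?P<EOL>         \n )                          # LF is a token in its own right
  | (?P<symbol>      ::= | [|()] )
  | (?P<skip>        [\x00-\x09\x0b-\x20] )        # ignored: control characters (including CR) and blanks
""", re.VERBOSE)


def scan(text):
    """Return (kind, text) pairs for every significant token in text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Scanning the line "<noun phrase> ::= the <noun>" terminated by CR + LF yields a nonterminal whose embedded space is retained, the "::=" symbol, one terminal, another nonterminal and finally a single EOL token; the CR is discarded.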

The simplest grammar now reads like this, but this allows one to have multiple and potentially misleading epsilons in a term, as in

<A> ::= a eps eps eps eps b

    COMPILER BNF1 $CN
    /* Grammar to describe BNF productions
       P.D. Terry, Rhodes University, 2004 */

    CHARACTERS
      letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
      alpha  = letter + "0123456789_" .
      eol    = CHR(10) .
      space  = CHR(32) .

    TOKENS
      EOL         = eol .
      nonterminal = "<" ( letter | space ) { alpha | space } ">" .
      terminal    =  letter { alpha } .

    COMMENTS FROM "(*" TO "*)"  NESTED
    COMMENTS FROM "/*" TO "*/"  NESTED

    IGNORE CHR(0) .. CHR(9) + CHR(11) .. CHR(31)

    PRODUCTIONS
       BNF1       = { Production } EOF .
       Production = nonterminal "::=" Expression EOL .
       Expression = Term { "|" Term } .
       Term       = Factor { Factor } .
       Factor     = nonterminal | terminal | "(" Expression ")" | "eps" .
    END BNF1.

A slightly better solution is to have the PRODUCTIONS section reading:

     PRODUCTIONS
        BNF2       = { Production } EOF .
        Production = nonterminal "::=" Expression EOL .
        Expression = Term { "|" Term } .
        Term       = Factor { Factor } | "eps" .
        Factor     = nonterminal | terminal | "(" Expression ")" .
     END BNF2.

although that still allows one to have multiple and potentially misleading

| eps | eps | eps

options in an Expression. If one wants to restrict the right hand side to contain at most one eps, and still to have an LL(1) grammar, one is forced to demand that the eps appear first, as in the grammar below:

     PRODUCTIONS
        BNF3       = { Production } EOF .
        Production = nonterminal "::=" Expression EOL .
        Expression = [ "eps" "|" ] Term { "|" Term } .
        Term       = Factor { Factor } .
        Factor     = nonterminal | terminal | "(" Expression ")" .
     END BNF3.

Several submissions attempted to use a very simple factorization:

        Production = nonterminal "::=" Expression EOL .
        Expression = Factor { [ "|" ] Factor } .
        Factor     = nonterminal | terminal .

While this "works", it does not impose the correct precedence on the (explicit) | selector, which is required to be weaker than the (implicit) concatenation operator. In EBNF and BNF a production like

A = B C | D E

must be interpreted as meaning

A = ( B C ) | ( D E )

and not as

A = B ( C | D ) E
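
The two-level Expression/Term structure is what delivers this grouping. The Python sketch below is an illustration only: the function names are hypothetical, parentheses and eps are not handled, and the input is assumed to be a list of symbols already split on white space. It mirrors Expression = Term { "|" Term } and Term = Factor { Factor }, returning a list of alternatives, each of which is a list of concatenated factors.

```python
def parse_term(symbols, pos):
    """Term = Factor { Factor } .  Returns one concatenation and the new position."""
    factors = []
    while pos < len(symbols) and symbols[pos] != "|":
        factors.append(symbols[pos])
        pos += 1
    if not factors:
        raise SyntaxError("a Term needs at least one Factor")
    return factors, pos


def parse_expression(symbols):
    """Expression = Term { "|" Term } .  Returns the list of alternatives."""
    term, pos = parse_term(symbols, 0)
    alternatives = [term]
    while pos < len(symbols) and symbols[pos] == "|":
        term, pos = parse_term(symbols, pos + 1)
        alternatives.append(term)
    return alternatives
```

For the right hand side B C | D E, parse_expression("B C | D E".split()) returns [["B", "C"], ["D", "E"]], that is ( B C ) | ( D E ). A parser built from the single-level Factor { [ "|" ] Factor } has no point at which such a grouping could be recorded.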

Task 4

The Parva extensions produced some interesting submissions. Many of them (understandably!) were too restrictive in certain respects, while others were too permissive. Here is a suggested solution:

    COMPILER Parva $CN
    /* Parva level 1.5 grammar - Prac 21 extensions
       P.D. Terry, Rhodes University, 2004
       Grammar only */

    CHARACTERS
      lf         = CHR(10) .
      backslash  = CHR(92) .
      control    = CHR(0) .. CHR(31) .
      letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
      digit      = "0123456789" .
      stringCh   = ANY - '"' - control - backslash .
      charCh     = ANY - "'" - control - backslash .
      printable  = ANY - control .

    TOKENS

/* Insisting that identifiers cannot end with an underscore is quite easy */

      identifier = letter { letter | digit | "_" ( letter | digit ) } .

      number     = digit { digit } .
      stringLit  = '"' { stringCh | backslash printable } '"' .
      charLit    = "'" ( charCh   | backslash printable ) "'" .

    COMMENTS FROM "//" TO lf
    COMMENTS FROM "/*" TO "*/"

    IGNORE CHR(9) .. CHR(13)

    PRODUCTIONS
      Parva             = "void" identifier "(" ")" Block .

/* We have introduced the StatementSequence as it comes in quite useful in several places */

      Block             = "{" StatementSequence "}" .
      StatementSequence = { Statement } .

/* Labels in front of statements can be introduced in several ways, while
   the options in Statement are easily extended to handle the new forms */

      Statement         = { Label }
                          (   Block
                            | ConstDeclarations | VarDeclarations
                            | Assignment        | IncOrDecStatement
                            | IfStatement       | WhileStatement
                            | ReturnStatement   | HaltStatement
                            | ReadStatement     | WriteStatement
                            | RepeatStatement   | GoToStatement
                            | SwitchStatement   | BreakStatement
                            | ";"
                          ) .
      Label             = number .

/* Declarations remain the same as before */

      ConstDeclarations = "const" OneConst { "," OneConst } ";" .
      OneConst          = identifier "=" Constant .
      Constant          = number | charLit | "true" | "false" | "null" .
      VarDeclarations   = Type OneVar { "," OneVar } ";" .
      OneVar            = identifier [ "=" Expression ] .

/* To deal with statements like i = 6; and i++; we need to manipulate the production
   for Assignment if we want to preserve an LL(1) grammar */

      Assignment        = Designator ( "=" Expression | "++" | "--" ) ";" .

/* Prefix increment/decrement statements like ++i; or ++list[7]; cause no LL(1) problems */

      IncOrDecStatement = ( "++" | "--" ) Designator ";" .

/* In all these it is useful to maintain generality by using Designator, not identifier */

      Designator        = identifier [ "[" Expression "]" ] .

/* The if-then-elsif-else construction is most easily described as follows. Although
   this is not LL(1), this works admirably - it is simply the well-known dangling
   else ambiguity, which the parser resolves by associating elsif and else clauses
   with the most recent if */

      IfStatement       = "if" "(" Condition ")" Statement
                          { "elsif" "(" Condition ")" Statement }
                          [ "else" Statement ] .

/* The Pascal-like "repeat" statement is almost trivial.  Note that we can make use of
   StatementSequence rather than Statement between "repeat" and "until" */

      RepeatStatement   = "repeat" StatementSequence "until" "(" Condition ")" ";" .

/* GoTo statements are trivially easy */

      GoToStatement     = "goto" number ";" .

/* The switch statement has to be handled carefully.  The labelled "case" arms are
   optional but the "default" option can only appear once.  The selection can best
   be done by a general Expression rather than an identifier, and the case "labels"
   can be constants of any type that would match the type of the selector expression */

      SwitchStatement   = "switch" "(" Expression ")" "{"
                            { OneCase }
                            [ "default" ":" StatementSequence ]
                          "}" .
      OneCase           = "case" Constant ":" StatementSequence .

/* The case arms usually have to contain a break statement, which is syntactically
   simply another form of statement.  There is actually a context-sensitive feature
   embedded in this - break statements cannot really be placed "anywhere", but we
   reserve further discussion for a later occasion. */

      BreakStatement    = "break" ";" .

/* Most of the rest of the grammar remains unchanged: */

      WhileStatement    = "while" "(" Condition ")" Statement .
      ReturnStatement   = "return" ";" .
      HaltStatement     = "halt" ";" .
      ReadStatement     = "read" "(" ReadElement { "," ReadElement } ")" ";" .
      ReadElement       = stringLit | Designator .
      WriteStatement    = "write" "(" WriteElement { "," WriteElement } ")" ";" .
      WriteElement      = stringLit | Expression .
      Condition         = Expression .
      Expression        = AddExp [ RelOp AddExp ] .
      AddExp            = [ "+" | "-" ] Term { AddOp Term } .
      Term              = Factor { MulOp Factor } .
      Factor            =   Designator | Constant
                          | "new" BasicType "[" Expression "]"
                          | "!" Factor | "(" Expression ")" .
      Type              = BasicType [ "[]" ] .
      BasicType         = "int" | "bool" .
      AddOp             = "+" | "-" | "||" .

/* The % operator has the same precedence as other multiplicative operators */

      MulOp             = "*" | "/" | "%" | "&&" .
      RelOp             = "==" | "!=" | "<" | "<=" | ">" | ">=" .
    END Parva.
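
The effect of the identifier token definition in this grammar can be explored with an equivalent regular expression. This is a Python sketch for experimentation only; IDENTIFIER and is_identifier are hypothetical names.

```python
import re

# identifier = letter { letter | digit | "_" ( letter | digit ) } .
IDENTIFIER = re.compile(r"[A-Za-z](?:[A-Za-z0-9]|_[A-Za-z0-9])*")


def is_identifier(text):
    """True if the whole of text matches the identifier token definition."""
    return IDENTIFIER.fullmatch(text) is not None
```

Here is_identifier("max_value") is True and is_identifier("total_") is False, as intended. Note that is_identifier("a__b") is also False: because every underscore must be followed immediately by a letter or digit, this definition rules out consecutive underscores as well as trailing ones.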

Task 5

The description of a book index has always produced some innovative and imaginative solutions whenever I have used it as an example. There is no single correct answer - looking at the example given usually leads to students discovering a set of productions to which I can respond "but what if you had an entry in the index reading like this" and finding another plausible one. Here is one suggested solution, in which I have played tricks with the selection of character sets. An important idea is to decompose the solution to reflect the fact that each entry in an index has two main components - a "subject" and a "list of references".

     COMPILER Index $CNX
     /* Grammar describing index in a book
        P.D. Terry, Rhodes University, 2004 */

     CHARACTERS
     /* Notice the careful and unusual choice of character sets */
       digit      = "0123456789" .
       startword  = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz('" + '"' .
       inword     = startword + digit + "-+)" .
       eol        = CHR(10) .

     IGNORE CHR(0) .. CHR(9) + CHR(11) .. CHR(31)

     TOKENS
     /* Notice the careful and unusual definition for word */
       word       = startword { inword } .
       number     = digit { digit } .
       EOL        = eol .

     PRODUCTIONS
       Index      = { Entry } EOF .
       Entry      = Key References EOL .
       Key        = word { "," word | word } .
       References = DirectRefs | CrossRef .
       DirectRefs = PageRefs { "," PageRefs  } .
       PageRefs   = number [ "-" number ] | Appendix .
       Appendix   = "Appendix" number .
       CrossRef   = "--" "see" Key .
     END Index .
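
To try sample entries against this grammar without running Coco, one can hand-code a recognizer along the same lines. The Python sketch below is an approximation under stated assumptions: the names TOKEN, KEYWORDS, scan and is_entry are hypothetical, it checks one entry (one line ending in LF) at a time, and it treats "Appendix" and "see" as keywords taken out of the word token, which is how I understand Coco to handle literals that match a token definition.

```python
import re

TOKEN = re.compile(r"""
    (?P<number> [0-9]+ )
  | (?P<word>   [A-Za-z('"][A-Za-z('"0-9+)-]* )   # startword { inword }
  | (?P<EOL>    \n )
  | (?P<punct>  -- | - | , )
  | (?P<skip>   [\x00-\x09\x0b-\x20] )            # ignored characters and blanks
""", re.VERBOSE)

KEYWORDS = {"Appendix", "see"}


def scan(line):
    """Return the list of token kinds found in line."""
    kinds, pos = [], 0
    while pos < len(line):
        m = TOKEN.match(line, pos)
        if m is None:
            raise ValueError(f"unexpected character {line[pos]!r} at position {pos}")
        kind, text = m.lastgroup, m.group()
        if kind == "punct" or (kind == "word" and text in KEYWORDS):
            kind = text  # literals stand for themselves
        if kind != "skip":
            kinds.append(kind)
        pos = m.end()
    return kinds


def is_entry(line):
    """Entry = Key References EOL ."""
    toks = scan(line) + ["<end>"]  # sentinel, so toks[pos] is always valid
    pos = 0

    def accept(kind):
        nonlocal pos
        if toks[pos] == kind:
            pos += 1
            return True
        return False

    def key():
        # Key = word { "," word | word } .
        if not accept("word"):
            return False
        while True:
            if accept(","):
                if not accept("word"):
                    return False
            elif not accept("word"):
                return True

    def page_refs():
        # PageRefs = number [ "-" number ] | "Appendix" number .
        if accept("Appendix"):
            return accept("number")
        if not accept("number"):
            return False
        return accept("number") if accept("-") else True

    def references():
        # References = DirectRefs | CrossRef .
        if accept("--"):
            return accept("see") and key()
        if not page_refs():
            return False
        while accept(","):
            if not page_refs():
                return False
        return True

    return key() and references() and accept("EOL") and toks[pos] == "<end>"
```

Entries such as "abstract syntax tree 25, 30-32, Appendix 2" and "AST -- see abstract syntax tree" (each followed by a line break) are accepted, while "parsing, 12" is rejected because a comma within a Key must be followed by another word.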