Tutorial 8

Computer Science 301 - Translators

Tutorial 8 - Practice in attribute grammars

1. Consider the by now rather hackneyed grammar for expressions which has been written out again below. How would you add attributes so that a parser, after recognising an Expression, would be able to determine whether that Expression was a "constant expression" (meaning that in principle it could be evaluated at compile time) or was a "variable expression" (meaning that it could not be evaluated until run time)? Assume that the Designator production can be attributed so that it returns the result of searching the symbol table to determine whether the associated identifier denotes a variable or a constant.

  Expression
  =
  (   "+" Term
    | "-" Term
    | Term
  ) { AddOp
      Term
     } .

  Term
  =  Factor
     { MulOp
       Factor
     } .

  Factor
  =
      Designator
    | number
    | "(" Expression
      ")" .

2. Consider the familiar grammar below for describing a set of EBNF productions. How could you add attributes to this grammar so that it could determine which non-terminal had the longest right hand side (in terms of the number of terminals and non terminals in any one of its alternatives)?

  COMPILER EBNF $CN

  CHARACTERS
    letter   = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
    digit    = "0123456789" .
    noquote1 = ANY - "'" .
    noquote2 = ANY - '"' .

  TOKENS
    nonterminal = letter {letter | lowline } .
    terminal    = "'" noquote1 {noquote1} "'" | '"' noquote2 {noquote2} '"' .

  IGNORE  CHR(9) .. CHR(13)

  PRODUCTIONS
     EBNF
     =
       { Production
       }
       EOF .

     Production
     =
       SYNC nonterminal
       WEAK "="
       Expression
       SYNC "." .

     Expression
     =
     Term
     { WEAK "|" Term
     } .

     Term
     =
     [ Factor
       { Factor
       }
     ] .

     Factor
     =
         nonterminal
       | terminal
       | "[" Expression "]"
       | "(" Expression ")"
       | "{" Expression "}" .
  END EBNF.

3. Consider the familiar grammar below for describing a set of EBNF productions. How could you add attributes to this grammar so that it could check a set of productions to determine whether any non-terminals had not been defined, and whether any non-terminals could not possibly be reachable?

  COMPILER EBNF $CN

  CHARACTERS
    letter   = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
    digit    = "0123456789" .
    noquote1 = ANY - "'" .
    noquote2 = ANY - '"' .

  TOKENS
    nonterminal = letter {letter | lowline } .
    terminal    = "'" noquote1 {noquote1} "'" | '"' noquote2 {noquote2} '"' .

  IGNORE  CHR(9) .. CHR(13)

  PRODUCTIONS
     EBNF
     =
       { Production
       }
       EOF
     .

     Production
     = SYNC nonterminal
       WEAK "="
       Expression
       SYNC "." .

     Expression
     = Term
       { WEAK "|" Term
       } .

     Term
     = [ Factor
         { Factor
         }
       ] .

     Factor
     =
         nonterminal
       | terminal
       | "[" Expression "]"
       | "(" Expression ")"
       | "{" Expression "}" .
  END EBNF.

Checking on Tokens

In this weeks prac kit you will find the following grammar in the file TokenTests.atg

This grammar does nothing more than specify an application that will read a file of potential tokens and report on how repeated calls to the scanner would be able to report on what the tokens were. This is very similar to one that was displayed on the screen in the lecture a few days ago.

  using Library;

  COMPILER TokenTests $CN
  /* Test scanner construction and token definitions - C# version
     The generated program will read a file of words, numbers, strings etc
     and report on what characters have been scanned to give a token,
     and what that token is (as a magic number).  Useful for experimenting
     when faced with the problem of defining awkward tokens!

     P.D. Terry, Rhodes University, 2015 */

  /* Some code to be added to the parser class */

    static void Display(Token token) {
    // Simply reflect the fields of token to the standard outpul
      IO.Write("Line ");
      IO.Write(token.line, 4);
      IO.Write(" Column");
      IO.Write(token.col, 4);
      IO.Write(":  Kind");
      IO.Write(token.kind, 3);
      IO.WriteLine("  Val |" + token.val.Trim() + "|");
    } // Display

  CHARACTERS  /* You may like to introduce others */

    sp         = CHR(32) .
    backslash  = CHR(92) .
    control    = CHR(0) .. CHR(31) .
    letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
    digit      = "0123456789" .
    stringCh   = ANY - '"' - control - backslash .
    charCh     = ANY - "'" - control - backslash .
    printable  = ANY - control .

  TOKENS      /* You may like to introduce others */

    ident      = letter { letter | digit } .
    integer    = digit { digit } .
    double     = digit { digit } "." digit { digit }  .
    string     = '"' { stringCh | backslash printable } '"' .
    char       = "'" ( charCh   | backslash printable ) "'" .

  IGNORE control

  PRODUCTIONS

     TokenTests
     = { (    ANY
           |  "."
           |  ".."
           |  "ANY"
         )                     (. Display(token); .)
       } EOF                   (. Display(token); .)
       .

  END TokenTests.

Use Coco/R to generate and then compile source for a complete token tester. You do this most simply by

cmake TokenTests

A command like

TokenTests tokens.txt

will run the program TokenTests and try to parse the file tokens.txt, displaying information about the tokens it might or might not recognize. Study the output carefully and convince yourself that the scanner is doing a good job. Unfortunately it displays the "kind" of token as a magic number, but with a little effort you should be able to work out what is what - if you look at the generated file Scanner.cs it might become clearer still.

If you give the command in the form

TokenTests tokens.txt -L

it will send an error listing to the file listing.txt, which might be more convenient. Try this out. (Actually this application will accept anything in the way of input, so you cannot really generate much in the way of syntax errors. But if you are perverse you might think of a way to do so.)

Some other tokens

This little application might be useful if you need to describe awkward tokens. For example:

The code as given might be having trouble recognizing 68. as an integer number (68) followed by a period. We discussed the fix in class, you should remember. Try it out.
Extend the definition of the double token to allow for numbers like 12.34, 12.34E+5, 1.2E4, and 1234.0E-10. Try out your application on a text file that contains tokens that it should recognize, and also incorrect ones like 12.34F+3 (use your imagination)
Extend the definition of ident to allow you to recognize identifiers that can contain underscores and digits internally, as in This_Is_2_Much or Whoops____I_did_it_again, but cannot end with an underscore.
Extend the number recognizers so that a number cannot begin with 0 (unless it is the number 0.xxx (double) or 0 (integer).

A point that I have not stressed in class (mea culpa as the Romans would have said, or "my bad" as you lot say) is that the TOKENS grammar does not have to be LL(1). The tokens are, in fact, specified by what amount to Regular Expressions, and they are handled, not by a recursive descent algorithm, but using the sort of automata theory you may remember from Computer Science 202.