Test for week

Computer Science 301 - 2012 - Test on Week 21 - Solutions

This should take no longer than 25 minutes and is simply intended to probe whether you have understood the material in the prac questions. YOU MAY NOT USE YOUR TEXTBOOK, NOTES OR A COMPUTER. Answer on the questions sheet (3 sides).

This test was very badly done in general (with some notable exceptions). Please study the answers carefully.

1. You were expecting this one, I bet! What are your name (surname) and student number? [1]

Pat (Terry) 63T0844

2. What do you understand by the term ambiguous grammar? [2]

There is a very simple, concise answer to this, so come to appreciate it!

There must be at least one string in the language defined by this grammar which can be parsed in at least two
ways.

That is all you really need to say.

There may be many such strings in general (see the example in grammar G9 in section 6.4), but the presence
of just one such string is enough to cause ambiguity in the grammar as a whole (which may be one reason
why it is hard to establish whether an arbitrary, large, grammar is ambiguous; finding such a string may be a
non-trivial exercise).

There are all sorts of "near misses" that one sees in answers. One submission said something like "A
grammar is ambiguous if two or more valid parse trees may be formed from that grammar". Well, most
grammars permit many parse trees! The crucial "missing" idea in that submission is that this must happen
for a particular special sentence or two.

3. Part of the grammar for the extended Parva language you met in the practical might have been expressed as follows:

   PRODUCTIONS
     Parva            = "void" "main" "(" ")" Block .
     Block            = "{" { Statement } "}" .
     Statement        = (   Block | WhileStatement | IfStatement | RepeatStatement
                          | ReadStatement | WriteStatement | Assignment
                          | ReturnStatement | EmptyStatement
                          | ConstDeclaration | VarDeclarations
                        ).
     IfStatement      = "if" "(" Condition ")" Statement [ "else" Statement ] .
     WhileStatement   = "while" "(" Condition ")" Statement .
     RepeatStatement  = "repeat" { Statement } "until" "(" Condition ")" ";" .

What (if anything) would be the effect of replacing the productions as follows? [4]

     Parva            = "void" "main" "(" ")" Block .
     Block            = "{" { Declaration | Statement } "}" .
     Declaration      = ConstDeclaration | VarDeclarations .
     Statement        = (   WhileStatement | IfStatement | RepeatStatement
                          | ReadStatement | WriteStatement | Assignment
                          | ReturnStatement | EmptyStatement
                        ).
     IfStatement      = "if" "(" Condition ")" Block [ "else" Block ] .
     WhileStatement   = "while" "(" Condition ")" Block .
     RepeatStatement  = "repeat" Block "until" "(" Condition ")" ";" .

Very few submissions got this right, so it is useful to comment on some of the misconceptions. The main point that was quite often missed is that the second set of productions forces you to have braces around the subsidiary statement (or statements) in if, while and repeat statements. You have to write, for example

if (a == b) { c = d; } else { c = f; }

rather than the simpler, C-like code

if (a == b) c = d; else c = f;

which is allowed by the first set of productions. The first set does not forbid writing the braces, of course, as Statement is still able to derive Block. However, the point of all this is that by requiring the braces one eliminates the "dangling else" ambiguity and non-LL(1) problem from which the first set of productions suffers. If you don't believe me, try submitting the grammars to Coco/R. Another point of interest is that Statement can no longer derive Block, so that blocks cannot be nested - rather silly code like

{ a = b; c = d; { e = f; } g = h; }

would be allowed by the first set of productions, but not by the second set.

Wrong answers received might have included:

(a) "The second set requires you to declare variables before you use them". Alas, no. Both sets are completely equivalent in that respect, and allow declarations and statements to be freely mingled.

In general, productions like

     A = B | C .
     B = "x" | "y" .
     C = "p" | "q" .

and

     A = "x" | "y" | "p" | "q" .
     A = "y" | "p" | "x" | "q" .

or indeed, any other permutations of these, do not force any ordering at all.

(b) The sets of productions were only claimed to be "part of the grammar". Unfortunately, I left off the line defining Parva itself from the first set by accident. This obviously confused people, so I apologize. This really had nothing to do with the differences I hoped you would spot.

if (a == b) int c;

with only a declaration instead of a statement". Well, yes, but the same apparently useless/meaningless code is allowed by the second set as

if (a == b) { int c; }

Few languages will prevent the dedicated idiots among us from writing meaningless or misleading code, though I suggest C++ is worse in this respect than some other languages I could mention. What do you make of the statements in this gem, which compiles quite happily and illustrates several potentially hard to find bugs?

   int silly() { return (int) silly + silly(); }

   void main () {
     int a, b;
     if (a);           while (silly);         while (silly());
     while (a) {}      if (a) 3;              if (a) a+b;
     if (a) int x; else int y;                42;      -6;
     silly;            silly();               a+++--b;
     ++a + + + - - --b;
   }

4. Show how one might specify a numeric literal in Cocol, allowing for the possibility of numbers of any of the following forms [4]

      123   1.23   1.   .23   1.2E12  1.2E-12   1E6   1E-6   .234E6   .234E-6

Many students had not appreciated two key ideas:

(a) We are looking for a TOKENS definition of a numeric literal, not a PRODUCTIONS definition. A numeric literal is a single token formed of contiguous characters, with no spaces permitted within it (the sequences of tokens permitted and defined by PRODUCTIONS definitions can have spaces between them).

(b) One cannot define one token in terms of others. This point was made in lectures, but probably not emphasized enough.

All of this means that we have to find one (big) regular expression-like way of specifying the pattern of a numeric literal, and, since there are many permissible forms of this, the definition is, indeed, quite complicated. Here are two ways of doing it:

    CHARACTERS
      digit  = "0123456789" .

    TOKENS
      number =   "." digit { digit } [ ( "e" | "E" ) [ "+" | "-" ] digit { digit } ]
               | digit { digit } [ "." { digit } ] [ ( "e" | "E" ) ["+" | "-"] digit { digit } ] .

or

    TOKENS
      number =   ( "." digit { digit } | digit { digit } [ "." { digit } ] )
                    [ ( "e" | "E" ) ["+" | "-"] digit { digit } ] .

It takes some skill (and experience) to make sure that one gets all the options correct, while ruling out a definition that allows totally erroneous numbers like 1.2.3.45, for example. Here is a definition that looks seductively correct:

    TOKENS
      number = [ digit { digit } ] [ "." { digit } ] [ ( "e" | "E" ) ["+" | "-"] digit { digit } ] .

but this allows a "number" to be completely "empty" - devoid of any characters at all.

5. Develop a Cocol grammar for describing a list of Boolean expressions, like those you should remember from Boolean Algebra courses you may have taken. In this context, allow your Boolean expressions to be terminated with an = operator, and to include parentheses, single-letter variables, the constants FALSE and TRUE (sometimes written as 0 and 1), and the operators AND (written either as AND or "&&" or as a "dot", or simply omitted), OR (written either as OR or as "||" or as a "plus") and NOT, written either as a prefix NOT or as a suffix apostrophe. So some examples of Boolean expressions might be

         a AND B OR (C OR NOT D) =     a . b + (c + d') =     a b + (c + d') =
         NOT (a OR b) AND TRUE =       (a + b)' . 1 =         (a + b)' AND 1 =
         b AND NOT C AND D =           b . c' . d =           b c' d =

Note that there is a precedence ordering among Boolean operators - parentheses take precedence over NOT, which takes precedence over AND, which takes precedence over OR. [8]

This was very disappointing. I had hoped that after you had seen the "calculator" task and the "Parva" task in the prac that you would have realized that what was required was a rather straightforward variation on the former, with the operators NOT AND and OR taking the place of leading minus, * and + respectively. But no, the answers received ranged from good ones from students who had clearly done the practical to complete rubbish, presumably from people who had not. Ah well.

Here is a complete solution - actually this goes a bit further to allow NOT to be represented by ! as well as by NOT. Note that the NOT/! operator appears as a prefix (before the operand) while the ' operator appears as a suffix (after the operand) so they are not interchangeable, as some of you thought.

    COMPILER Bool $CN

    CHARACTERS
      letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .

    TOKENS
      variable   = letter .

    IGNORE CHR(0) .. CHR(31)

    PRODUCTIONS
      Bool       = { Expression "=" } EOF .
      Expression = Term { Or Term } .
      Term       = Factor { [ And ] Factor } .
      Factor     = Not Factor | Primary { "'" } .
      Primary    = True | False | variable | "(" Expression ")" .
      True       = "TRUE"   |  "1" .
      False      = "FALSE"  |  "0" .
      And        = "AND"  |  "&&"  |  "." .
      Or         = "OR"   |  "||"  |  "+" .
      Not        = "NOT"  | "!" .
    END Bool.

6. In the practical you were invited to develop a grammar that would describe a list of academic staff, with provision made for titles, initials, names and degrees. A set of productions for this might have been the following

    PRODUCTIONS
      Staff        = { Person } EOF .
      Person       = [ Title ] FullName { "," Degree } SYNC "."  .
      FullName     = NameLast { NameLast } .
      NameLast     = initial { initial } name | name .
      Title        = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" .
      Degree       =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
                     | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" .
    END Staff.

(a) Give a precise, mathematical, definition of the concept "A is a nullable non-terminal". [2]

As in the case of question 2, there is a very simple concise answer. Mathematically the definition of "A is nullable" can be expressed:

A Þ^* e

In English: It is possible to produce (derive) an empty string by applying appropriate productions to A and to the sentential forms derivable from A.

Of course one may be able to produce (derive) non-empty strings as well. In general this will be the case, as a non-terminal that can derive only the empty string is not very useful.

(b) Identify - giving reasons - the nullable non-terminals in the above set of productions (name and initial are, of course, terminals). [4]

The non-terminals are Staff, Person, Fullname, NameLast, Title and Degree. Title and Degree are clearly not nullable. Don't confuse this with the fact that it is possible to have a valid name without a Title or Degree - once you derive from either Title or Degree you must produce a single terminal! NameLast must derive at least one terminal - name - and similarly FullName must derive at least one name. Person must derive at least a name and a period. That leaves only Staff. In one sense Staff is nullable - because EOF is not really a token, just a convenient way of detecting the end of the input. Scanners will, of course, be primed to return a pseudo-terminal if they run out of input, but a completely empty list would be acceptable to this grammar (everybody gone on holiday?).

The answers received showed up a misconception that many of you have. Consider the production

          A = [ B ] { C } .
          B = x | y .
          C = p | q .

Here A is certainly nullable. Its right side has a sequence of two "nullable" components (or "deletable structures" as Coco calls them). But B and C themselves are not nullable. Compare this with the equivalent grammar

              A = B  C  .
              B = [ x | y ] .
              C = { p | q } .

Here A is nullable. Its right side is a sequence with two components B and C. Both of these are themselves now nullable. The same would hold for yet another equivalent grammar:

              A = B  C  .
              B =   x | y
                  | /* eps*/ .
              C =   ( p | q ) C
                  | /* eps*/ .

(c) The PRODUCTIONS sections above make use of the EBNF "meta brackets" { } and [ ] in the production rules. How would you rewrite the PRODUCTIONS section without making any use of parentheses ( ) or of either of the {} or [] form of meta brackets? [5]

This was not always well done, yet the transformations required are always easy. If you have a right hand side like

P = Something { Option } SomethingElse .

introduce a new nonterminal RepeatedOptions in place of { Option }

and then rewrite things as

P = Something RepeatedOptions SomethingElse .
RepeatedOptions = Option RepeatedOptions | e .

Similarly, and even more simply, if you have something like

Q = Startoff [ OneOption ] Finish .

introduce a new nonterminal Possible in place of [ OneOption ]

and then rewrite things as

Q = Startoff Possible Finish .
Possible = OneOption | e .

With that idea, the PRODUCTIONS section given above might be rewritten mechanically as

    PRODUCTIONS
      Staff          = Persons EOF .
      Persons        = Person Persons | .
      Person         = OptionalTitle FullName Degrees SYNC "."  .

      OptionalTitle  = Title | .
      Title          = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" .

      FullName       = NameLast OtherLastNames .
      OtherLastNames = NameLast OtherLastNames | .
      NameLast       = initial MoreInitials name | name .
      MoreInitials   = initial MoreInitials | .

      Degrees        = "," Degree Degrees | .
      Degree         =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
                       | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" .
    END Staff.

But this could also be refactored rather more radically and expressed

    PRODUCTIONS
      Staff6         = Persons EOF .
      Persons        = Person Persons | .
      Person         = OptionalTitle FullName Degrees SYNC "."  .

      OptionalTitle  = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" | .

      FullName       = NameLast OtherLastNames .
      OtherLastNames = NameLast OtherLastNames | .
      NameLast       = Initials name .
      Initials       = initial Initials | .

      Degrees        = "," Degree Degrees | .
      Degree         =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
                       | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" .
    END Staff6.

or even as follows (this is still an LL(1) grammar).

    PRODUCTIONS
      Staff7         = Person Staff7 | EOF .
      Person         = OptionalTitle FullName Degrees SYNC "."  .

      OptionalTitle  = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" | .

      FullName       = NameLast OtherNames .
      OtherNames     = FullName | .
      NameLast       = Initials name .
      Initials       = initial Initials | .

      Degrees        = "," Degree Degrees | .
      Degree         =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
                       | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" .
    END Staff7.

(These grammars are included in the solution kits for Practical 21, available on the web page.)