Computer Science 301 - 2015 - Test on week 4 - Solutions

Please answer all questions in the spaces provided (three sides). Text books may not be used. Max 25

This test is intended to check your ability to wite simple Coco/R specifications and your understanding of LL(1) rules. It should only take you about 45 minutes at the most. Textbooks may not be used.

1. (a) Oh dear. I need reminding yet again. Please give your firstname / surname / student number? [1]

Pat/Terry/63T0884

(b) Is your answer to (a) an acceptable example of a Regular Expression? If not, why not? [1]

Yes of course it is! All the characters stand for themselves. If you like, it defines a language with only one sentence - my name is all it can match. I am unique and don't need any meta symbols in that regular expression if it is to describe only me.

2. The following grammar attempts to describe expressions incorporating the familiar operators with their correct precedence and associativity.

   COMPILER Expression
   IGNORE CHR(0) .. CHR(31)
   PRODUCTIONS
     Expression = Term    { ( "+" | "-" ) Term  } .    // additive
     Term       = Factor  { ( "*" | "/" ) Factor } .   // multiplicative
     Factor     = Primary [ "^" Expression ] .         // exponentiation
     Primary    = "a" | "b" | "c" .
  END Expression.

(a) Convince me that this is an ambiguous grammar [3].

To do this you have only to find one example of an expression that can be parsed in more than one way - to justify the example it would be as well to show a pair of parse trees, not just make a wild claim!

Consider the expression a^b*c. This can, sadly, be parsed in two ways, one with the implicit meaning of a^(b*c) and the other with the meaning of (a^b)*c. The parse trees would look like this (a few intermediate nodes have been omitted to save space)

           Expression                                        Expression
               │                                                 │
             Term                                              Term
               │                                                 │
               │                                       ┌─────────┼────────┐
               │                                       │         │        │
            Factor                                  Factor      "*"     Factor
               │                                       │                  │
        ┌──────┼─────────┐                             ├──┬──────┐        │
        │      │         │                             │  │      │        │
     Primary  "^"    Expression                   Primary ^  Expression   │
        │                │                             │         │        │
        │               Term                           │       Term       │
        │         ┌──────┼─────┐                       │         │        │
        │         │      │     │                       │         │        │
        │       Factor  "*"  Factor                    │       Factor     │
        │         │            │                       │         │        │
        │         │            │                       │         │        │
        a         b            c                       a         b        c

(b) Is it an LL(1) grammar? Why? [1]

It cannot be an LL(1) grammar if it is ambiguous. An LL(1) grammar has the property that at each stage in the analysis of a sentence a parser can uniquely choose from any alternatives it encounters in expanding/contracting the sentential form. LL(1) implies non ambiguous. Non-ambiguous does not imply LL(1), however.

Many students confuse ambiguty with conformity (or lack thereof) to the LL(1) criteria. But a grammar can be non-LL(1) and still be non-ambiguous - for example, the trivial (complete) grammar

A = 'x' 'y' | 'x' 'z' .

defines a language with only two sentences - xy and xz. It does so quite unambiguously, but it is not LL(1). An LL(1) equivalent is easily found, of course:

A = 'x' ( 'y' | 'z' ) .

As a more realistic example of a non-ambiguous grammar that is non-LL(1) one has to look no further than the left-recursive grammar for some simple expressions:

   COMPILER Expression
   IGNORE CHR(0) .. CHR(31)
   PRODUCTIONS
     Expression = Term  | Expression "-" Term .      // additive
     Term       = Factor | Term "/" ) Factor .       // multiplicative
     Factor     = "a" | "b" | "c" .
  END Expression.

Just for completeness, let us see which of the two LL(1) "rules" are broken by that ambiguous grammar. If we rewrite the grammar to eliminate the metabrackets we get

     Expression = Term TailExp .
     TailExp    = AddOp Term TailExp | ε .
     Term       = Factor TailTerm .
     TailTerm   = MulOp Factor TailTerm | ε .
     Factor     = Primary TailFactor .
     TailFactor = "^" Expression | ε .
     Primary    = "a" | "b" | "c" .
     AddOp      = "+" | "-" .
     MulOp      = "*" | "/" .

The nullable nonterminals here are TailExp, TailTerm and TailFactor.

   FIRST(TailExp)    = { "+" , "-" }
   FIRST(TailTerm)   = { "*" , "/" }
   FIRST(TailFactor) = { "^" }

The FOLLOW sets are a little harder to see because to get to closure one has to chase through quite a few other productions:

   FOLLOW(TailExp)    = FOLLOW(Expression)
   FOLLOW(TailTerm)   = FOLLOW(Term) = FIRST(TailExp) U FOLLOW(Expression)
   FOLLOW(TailFactor) = FOLLOW(Factor) = FIRST(TailTerm) U FOLLOW(Term)

You are invited to track these through in detail; the outcome is that they are all the same:

   FOLLOW(TailExp)    = { "*" , "/" , "+" , "-" , EOF }
   FOLLOW(TailTerm)   = { "*" , "/" , "+" , "-" , EOF }
   FOLLOW(TailFactor) = { "*" , "/" , "+" , "-" , EOF }

and so Rule 2 is broken for TailExp and for TailTerm.

(c) Can you find a suitable grammar that is unambiguous? (Give the new productions) [3]

Finding an LL(1) (hence unambigous grammar), with the correct precedence and associativity would have been incredibly easy had you just studied section 6.2 of the text, or done the prac properly, where the same problem was posed.

   COMPILER Expression $CNF
   IGNORE CHR(0) .. CHR(31)
   PRODUCTIONS
     Expression = Term    { ( "+" | "-" ) Term  } .
     Term       = Factor  { ( "*" | "/*" ) Factor } .
     Factor     = Primary [ "^" Factor ] .
     Primary    = "a" | "b" | "c" .
  END Expression.

Why does this grammar now satisfy the LL(1) constraints? Rewritten it becomes

     Expression = Term TailExp .
     TailExp    = AddOp Term TailExp | ε .
     Term       = Factor TailTerm .
     TailTerm   = MulOp Factor TailTerm | ε .
     Factor     = Primary TailFactor .
     TailFactor = "^" Factor | ε .
     Primary    = "a" | "b" | "c" .
     AddOp      = "+" | "-" .
     MulOp      = "*" | "/" .

The nullable nonterminals here are still TailExp, TailTerm and TailFactor.

   FIRST(TailExp)    = { "+" , "-" }
   FIRST(TailTerm)   = { "*" , "/" }
   FIRST(TailFactor) = { "^" }

The FOLLOW sets are a little harder: to get to closure one has to chase through quite a few other productions:

   FOLLOW(TailExp)    = FOLLOW(Expression)
   FOLLOW(TailTerm)   = FOLLOW(Term) = FIRST(TailExp) U FOLLOW(Expression)
   FOLLOW(TailFactor) = FOLLOW(Factor) = FIRST(TailTerm) U FOLLOW(Term)

You are invited to track these through in detail; the outcome is:

   FOLLOW(TailExp)    = { EOF }
   FOLLOW(TailTerm)   = { "+" , "-" , EOF }
   FOLLOW(TailFactor) = { "*" , "/" , "+" , "-" , EOF }

and so Rule 2 is no longer broken.

There were various other suggestions made, such as

     Factor     = Primary [ "^" Term ] .
     Factor     = Primary [ "^" "(" Expression ")" ] .
     Factor     = Primary { "^" Term } .
     Factor     = [ Primary "^" ] Expression .

but these are unnecessarily restrictive (first suggestion) or non-equivalent (second suggestion; parentheses were not catered for in the first grammar and introducing them is "cheating"). The third suggestion gets the associativity incorrect.

Confession: The grammar is "incomplete" - there are some strings that you might have thought should be derivable but which are not. An example would correspond to the expression abc. To handle this properly one would probably need to introduce parentheses into the alphabet - as one always does in algebra anyway. So I should probably have posed the problem with the last production reading

Primary = "a" | "b" | "c" | "(" Expression ")" .

I suspect there was at least one student who, in searching for a rogue expression, fell foul of this omission. My apologies!

3. You are referred to the Parva grammar on a separate page.

Suppose that next year you are chosen to act as a tutor for this course and that a student comes to you proudly showing a revised grammar which contains a new statement type, among other changes:

   PRODUCTIONS
     Parva            = "void" "main" "(" ")" Block .
     Block            = "{" { Declaration | Statement } "}" .
     Declaration      = ConstDeclaration | VarDeclarations .
     Statement        =   Block | Assignment | ";"
                        | IfStatement | WhileStatement | DoWhileStatement
                        | ReturnStatement | HaltStatement
                        | ReadStatement | WriteStatement .
     DoWhileStatement = "do" { Statement } "while" "(" Condition ")" ";" .

What advice would you give this student as to whether this was correct, better/worse than the original, or had unexpected features that they might not have thought of? [5]

The big problem - which few of you spotted - is that this grammar is now non-LL(1), although this might not immediately be obvious. The production for DoWhileStatement has a nullable component of { Statement } for which the FIRST set contains the while token, as does the FOLLOW set. This also means that code like the following might not be parsed in the way the first layout suggests

           do
             read(a);
             while (a < 0); {
               write("please give a positive response ");
               read("try again", a);
             }
             total = total + a;
           while (total < 100);

                                           // this shows the problem!
           do read(a); while (a < 0);      // do-while loop - complete
           {                               // unprotected block statement
               write("please give a positive response ");
               read("try again", a);
           }
           total = total + a;
           while (total < 100);            // while loop - complete

But this should amuse you - leave out the semicolon after (a < 0) and it changes completely:

           do                              // start outer do-while
             read(a);
             while (a < 0) {               // inner while loop
               write("please give a positive response ");
               read("try again", a);
             }                             // end while
             total = total + a;
           while (total < 100);            // end outer do-while

This production is a "bad" breach of LL(1) - compare this with the dangling else problem, which as we have tried to point out, is actually quite innocuous. (Someone once suggested calling this the "dangling while", which was interesting.) But does it make the grammar ambiguous - think carefully about that one?

Moral of the story. Follow Niklaus Wirth if you design a language - use distinctive key bracketing words like repeat-until and while-do rather than trying a silly, confusing word that has two roles in life...

The rest of the changes were far less important - almost an attempt to trick you. Typically folk responded:

(a) "You can only put executable statements, not declarations, into the body of a DoWhile loop.

Well, yes, I'll buy that. You certainly cannot write

          do
            int i;
          while (total < 100);

which you can do in C++ or Java if you like that sort of nonsense. Nor can you write the more seductively looking code

          do
            int i = 6;
            total = total + i;
          while (total < 100);

either. But you cannot do that in C++ or Java either - where the grammar is defined as

DoWhileStatement = "do" Statement "while" "(" Condition ")" ";" .

which is the way to fix the LL(1) problem, requiring that most DoWhile statements have a Block as the loop body. Of course this does not, of itself, forbid nonsense!

          do {
            int i;
            int total;
          } while (true);

(b) "The changes require you to declare variables before you use them".

Alas, no. Both sets are equivalent in that respect, and allow declarations and other statements to be freely mingled (except in the incorrectly specified DoWhile loop!). More generally (but using a set of silly productions to make the point), productions like

          A = B | C .
          B = "x" | "y" .
          C = "p" | "q" .

and

          A = "x" | "y" | "p" | "q" .
          A = "y" | "p" | "x" | "q" .

or indeed, any other permutations of these, do not force any ordering at all.

(c) "The new set allow you to write meaningless code like

if (a == b) int c;

with only a declaration instead of a statement". Well, yes, much as in (a) - but the same apparently useless / meaningless code is allowed by either set as

if (a == b) { int c; }

Few languages will prevent the dedicated idiots among us from writing meaningless or misleading code, although I suggest C++ is worse in this respect than some other languages I could mention. What do you make of the statements in this gem, which compiles quite happily and illustrates several potentially hard to find bugs?

        void main () {
          int a, b;
          42;
          if (a);                   while (silly);         while (silly());
          while (a) {}              if (a) 3;              if (a) a+b;
          if (a) int x; else int y;
          silly;                    silly();               a+++--b;
          ++a + + + - - --b;        -6;
        }

(d) "The second set does not allow you to write multifunction programs, only a "main" function".

Well, the original set only allowed a single function too. Admittedly it did not have to be called main -and it is not a good idea to use "main" as a keyword. One has to find some other special meaning to be attached to the "main" identifier (watch this space!) In programming language definitions you might find "main" referred to as a "standard" or "predefined" identifier, and you might be interested to learn that in Pascal the words INTEGER, REAL, BOOLEAN were "standard identifiers", not "key words" like "if" and "while". All rather confusing to the newcomer.

4. Suppose you are asked to extend the description of Parva to allow a programmer to incorporate "inline" PVM assembler statements, as exemplified by

            pvm {
               PRNS  "Assigning -9 to variable 2"
               LDA   2
               LDC   -9
               STO
            }

Here is a new statement definition that attempts to allow this (you would obviously also need a trivial change to the production for Statement to be able to make this into a "useful production").

      InlineStatement
      = "pvm" "{"
          { identifier [ "+" | "-" ] [ number | stringLit ] }
        "}" .

Does this production really describe the new statement type properly? If not, suggest how it should be improved by giving a better production or productions. [6]

This was surprisingly badly done. Just how far had you got with the last practical exercise, which really posed much the same problem? I had hoped that everyone would have seen that this was far too "permissive" - not all identifiers are valid opcode mnemonics, and not all opcodes can take arbitrary parameters, nor is it legitimate to write statements like

LDC   +
PRNS  - "how long is a piece of string?"

What I had hoped to get was something on the lines of the following:

        InlineStatement = "pvm" "{" { Code } "}" .
        Code            = OneWord | TwoWord | PrintString .

        OneWord         =   "NOP" | "NEG"  | "ADD"  | "SUB"  | "MUL"  | "DIV" | "REM"
                          | "AND" | "OR"   | "NOT"  | "ANEW" | "LDXA" | "LDV" | "HALT"
                          | "CEQ" | "CNE"  | "CLT"  | "CLE"  | "CGT"  | "CGE" | "STK"
                          | "STO" | "INPI" | "INPB" | "PRNI" | "PRNB" | "PRNL" .

        TwoWord         = ( "LDC" | "LDA"  | "DSP"  | "BRN"  | "BZE" ) [ "+" | "-" ] number .

        PrintString     = "PRNS" stringLit .

Some years ago when I set this exercise, one perceptive person pointed out, the ability to have BRN and BZE statements would be very nice, but the system would be almost totally unusable, since computing the target addresses would be a nightmare. We would need to have some sort of label system, as defined perhaps by

        InlineStatement = "pvm" "{" { Code } "}" .
        Code            = [ Label ] ( OneWord | TwoWord | Branch | PrintString ) .

        OneWord         =   "NOP" | "NEG"  | "ADD"  | "SUB"  | "MUL"  | "DIV" | "REM"
                          | "AND" | "OR"   | "NOT"  | "ANEW" | "LDXA" | "LDV" | "HALT"
                          | "CEQ" | "CNE"  | "CLT"  | "CLE"  | "CGT"  | "CGE" | "STK"
                          | "STO" | "INPI" | "INPB" | "PRNI" | "PRNB" | "PRNL" .

        TwoWord         = ( "LDC" | "LDA"  | "DSP" ) [ "+" | "-" ] number .

        Branch          = ( "BRN"  | "BZE" ) ( Label | number )

        PrintString     = "PRNS" stringLit .

        Label           = identifier .

I had not thought of this complication when I set the problem but he or she was quite corect. As I have said before, every year at least one student comes up with something that I have not thought of, and for which I am grateful. That is really why I go to all my lectures - I learn so much in them (why not try that for yourselves?).

5. As is very common, the production above makes use of the EBNF "meta brackets" { } and [ ]. How would you rewrite your (improved) production or productions describing the InlineStatement without making use either of these forms of meta brackets? [4]

This was badly done by most, but the transformations required are always easy. If you have a right hand side like

P = Something { Option } SomethingElse

introduce a new non terminal RepeatedOptions

RepeatedOptions = { Option } .

and then rewrite the rules for P as

P = Something RepeatedOptions SomethingElse
RepeatedOptions = Option RepeatedOptions | ε .

Similarly, and even more simply, if you have something like

Q = Startoff [ OneOption ] Finish

introduce a new non terminal Possible

Possible = [ OneOption ] .

and then rewrite the rules for Q as

Q = Startoff Possible Finish
Possible = OneOption | ε .

Of course in some cases thhis messes up operator associativity, although this would not be an issue in this example.

With that idea, the PRODUCTIONS extract given above (without the label refinement) might become:

         InlineStatement = "pvm" "{" Codes "}" .
         Codes           = Code Codes | ε .
         Code            = OneWord | TwoWord | PrintString .
         OneWord         =   "NOP" | "NEG"  | "ADD"  | "SUB"  | "MUL"  | "DIV" | "REM"
                           | "AND" | "OR"   | "NOT"  | "ANEW" | "LDXA" | "LDV" | "HALT"
                           | "CEQ" | "CNE"  | "CLT"  | "CLE"  | "CGT"  | "CGE" | "STK"
                           | "STO" | "INPI" | "INPB" | "PRNI" | "PRNB" | "PRNL" .
         TwoWord         = ( "LDC" | "LDA"  | "DSP"  | "BRN"  | "BZE" ) Sign  number .
         Sign            = "+" | "-" | ε .
         PrintString     = "PRS" stringLit .

6. A daring extension to Parva would be to allow the use of the float type, in addition to int and bool. This would require one to be able to describe literal constants like 12.34 or 1.2E4 or 12E-5. Where and how would you have to change the Cocol specification of Parva to be able to recognise statements, declarations and expressions that incorporate these extensions? [6]

Firstly, one has to modify the production for BasicType. One might try the following, which leaves the production for Constant unchanged.

BasicType = "int" | "bool" | "float" .
Constant = number | charLit | "true" | "false" | "null" .

Then one has to find a way to extend the token definition for number to be able to express float literals as well as int literals. The problem here is to make sure that all the valid combinations are allowed, unambiguously, insisting that at least one digit appear, and with no possibility of obtaining several periods or Es in one number. For the examples given, the following would suffice

number = digit { digit } [ "." { digit } ] [ "E" [ "+" | "-" ] digit { digit } ] .

A little reflection will suggest that this is sub-optimal. One will not be able to distinguish integral numbers from floating numbers "after" the event. It would be better to have lexically distinct tokens. So how about

intNumber = digit { digit } .
floatNumber = digit { digit } [ "." { digit } ] [ "E" [ "+" | "-" ] digit { digit } ] .

along with a changed production

Constant = intNumber | floatNumber | charLit | "true" | "false" | "null" .

However, this won't work. Both intNumber and floatNumber can match a sequence like 1234.

Fixing the problem by imposing a requirement that a floatNumber either starts or terminates with a special character like F is a possibility, but a rather messy one. Admittedly this is done for hexadecimal and binary numbers, but they pop up rather infrequently in source code other than at an assembler level.

The trick is to realise that a floatNumber must have either (or both) a period or exponent. Exploiting this is easy, if tedious, and one ends up with a rather complicated pair of definitions

       TOKENS
         intNumber = digit { digit } .
         floatNumber = digit { digit }
                        (  "."  { digit } [ "E" [ "+" | "-" ] digit { digit } ]
                          | "E" [ "+" | "-" ] digit { digit }
                        ) .

Attempts like the following are wrong (you should puzzle out why this is so)

number = digit { digit | "." } "E" { "+" | "-" | digit } .
number = digit { digit | "." | "E" | "+" | "-" } .

Some languages allow you to have float numbers like 123. or 12.E-4 or .456 or .35E+1 - does the above definition allow these? If not, how would you write a token definition to handle all those possibilities as well?

Several submissions tried definitions like

intnumber = digit { digit } .
floatnum = intnumber [ "." intnumber [ "E" [ "-" ] intnumber ] ] .

but as I have tried to explain on numerous occasions, in Cocol specifications you cannot write one token definition in terms of another token definition. The TOKENS section is really restricted to the use of "regular expressions" -albeit in a slightly different notation from the one suggested in setion 5.3 of the text book.


  COMPILER Parva $CN

  CHARACTERS
    lf         = CHR(10) .
    backslash  = CHR(92) .
    control    = CHR(0) .. CHR(31) .
    letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
    digit      = "0123456789" .
    stringCh   = ANY - '"' - control - backslash .
    charCh     = ANY - "'" - control - backslash .
    printable  = ANY - control .

  TOKENS
    identifier = letter { letter | digit | "_" } .
    number     = digit { digit } .
    stringLit  = '"' { stringCh | backslash printable } '"' .
    charLit    = "'" ( charCh   | backslash printable ) "'" .

  COMMENTS FROM "//" TO lf
  COMMENTS FROM "/*" TO "*/"

  IGNORE CHR(9) .. CHR(13)

  PRODUCTIONS
    Parva             = "void" identifier "(" ")" Block .
    Block             = "{" { Statement } "}" .
    Statement         =   Block | ConstDeclarations | VarDeclarations | Assignment | ";"
                        | IfStatement | WhileStatement | ReturnStatement | HaltStatement
                        | ReadStatement | WriteStatement .
    ConstDeclarations = "const" OneConst { "," OneConst } ";" .
    OneConst          = identifier "=" Constant .
    Constant          = number | charLit | "true" | "false" | "null" .
    VarDeclarations   = Type OneVar { "," OneVar } ";" .
    Type              = BasicType [ "[]" ] .
    BasicType         = "int" | "bool" .
    OneVar            = identifier [ "=" Expression ] .
    Assignment        = Designator "=" Expression ";" .
    Designator        = identifier [ "[" Expression "]" ] .
    IfStatement       = "if" "(" Condition ")" Statement .
    WhileStatement    = "while" "(" Condition ")" Statement .
    ReturnStatement   = "return" ";" .
    HaltStatement     = "halt" ";" .
    ReadStatement     = "read" "(" ReadElement { "," ReadElement } ")" ";" .
    ReadElement       = stringLit | Designator .
    WriteStatement    = "write" "(" WriteElement { "," WriteElement } ")" ";" .
    WriteElement      = stringLit | Expression .
    Condition         = Expression .
    Expression        = AddExp [ RelOp AddExp ] .
    AddExp            = [ "+" | "-" ] Term { AddOp Term } .
    Term              = Factor { MulOp Factor } .
    Factor            =   Designator | Constant
                        | "new" BasicType "[" Expression "]"
                        | "!" Factor | "(" Expression ")" .
    AddOp             = "+" | "-" | "||" .
    MulOp             = "*" | "/" | "%" | "&&" .
    RelOp             = "==" | "!=" | "<" | "<=" | ">" | ">=" .
  END Parva.


Home  © P.D. Terry