Week 22 test

Computer Science 301 - Test on week 22

Please answer all questions in the spaces provided (three sides). Text books may not be used. Max 25

This test is intended to check your ability to wite simple Coco/R specifications and your understanding of LL(1) rules. It should only take you about 25 minutes at the most. Textbooks may not be used.

1. Oh dear. I need reminding yet again. What are your name (surname) and student number? [1]

Pat Terry 63T0884

2. Consider the following simple Cocol grammar that attempts to describe the Roman Number representations of the numbers 1 through 9, that is I, II, III, IV, V, VI, VII, VIII, IX.

       COMPILER Roman
         PRODUCTIONS
           Roman       = OneToThree | FourOrNine | FiveToEight .
           OneToThree  = "I" [ "I" ] [ "I" ] .
           FourOrNine  = "I" ( "V" | "X" ) .
           FiveToEight = "V" [ OneToThree ].
         END Roman.

This grammar is correct, but suffers from several flaws. Explain these, and try to come up with a better one? (Keep it as simple as possible; no need to add comments and ignorable sections; concentrate on the PRODUCTIONS). [3]

Clearly many students are still confusing ambiguty with conformity (or lack thereof) to the LL(1) criteria. A grammar can be non-LL(1) and yet be non-ambiguous - for example, the trivial (complete) grammar

A = 'x' 'y' | 'x' 'z' .

defines a language with only two sentences - xy and xz. It does so quite unambiguously, but it is not LL(1). An LL(1) equivalent is easily found, of course:

A = 'x' ( 'y' | 'z' ) .

If a grammar is ambiguous it cannot be LL(1). The grammar above for Roman numbers is already non- LL(1) by virtue of the fact that FIRST(OneToThree) and FIRST(FourOrNine) are both the singleton set { "I" } (Rule 1 is broken for the non-terminal Roman.) But it is not this which makes it ambiguous. A string like II could be derived from OneToThree by selecting the first and second Is, or by selecting the first and third Is - it is this which makes it ambiguous. In this case the ambiguity is matched by the failure to meet Rule 2 - in the production

OneToThree = "I" [ "I" ] [ "I" ] .

there are two "nullable" components. For the first of these the FIRST set is { "I" } and "I" is in the FOLLOW set for this component by virtue of its being in the second [ "I" ] component. Probably most ambiguous grammars, if you try to apply the LL(1) rules, will break Rule 2, but I am not sure that this will always be true.
There are numerous ways of correcting the grammar. The brute force one would be to say

Roman = "I" | "II" | "III" | "IV" | "V" | "VI" | "VII" | "VIII" | "IX" | "X" .

and if you suggested that you got credit! Entering more into the spirit of things might lead to productions like

PRODUCTIONS Roman = OneToThreeOrFourOrNine | FiveToEight . ZeroToTwo = [ "I" [ "I" ]] . OneToThreeOrFourOrNine = "I" [ ZeroToTwo | "V" | "X" ] . FiveToEight = "V" [ "I" ZeroToTwo ] . END Roman.

3. You are referred to the Parva grammar on a separate page.

Suppose that next year you are chosen to act as a tutor for this course and that a student comes to you proudly showing a revised grammar which contains a new statement type, among other changes:

   PRODUCTIONS
     Parva            = "void" "main" "(" ")" Block .
     Block            = "{" { Declaration | Statement } "}" .
     Declaration      = ConstDeclaration | VarDeclarations .
     Statement        =   Block | Assignment | ";"
                        | IfStatement | WhileStatement | DoWhileStatement
                        | ReturnStatement | HaltStatement
                        | ReadStatement | WriteStatement .
     DoWhileStatement = "do" { Statement } "while" "(" Condition ")" ";" .

What advice would you give this student as to whether this was correct, better/worse than the original, or had unexpected features that they might not have thought of? [5]

The big problem is that this grammar is now non-LL(1), although this might not immediately be obvious. The production for DoWhileStatement has a nullable component of { Statement } for which the FIRST set contains the while token, as does the FOLLOW set. This means that code like the following might not be parsed in the way the layout suggests

           do
             read(a);
             while (a < 0) {
               write("please give a positive response ");
               read("try again", a);
             }
             total = total + a;
           while (total < 100);

This is a "bad" breach of LL(1) - compare this with the dangling else problem, which as we have tried to point out, is actually quite innocuous.
The rest of the changes were far less important - almost an attempt to trick you. Typically folk responded:
(a) "You can only put executable statements, not declarations, into the body of a DoWhile loop.
Well, yes, sort of. You certainly cannot write

do int i; while (total < 100);

which you can do in C++ or Java if you like that sort of nonsense. More sensibly, you cannot write

do int i = 6; total = total + i; while (total < 100);

either, I admit. But you cannot do that in C++ or Java either - where the grammar is defined as

DoWhileStatement = "do" Statement "while" "(" Condition ")" ";" .

which of course fixes the LL(1) problem, while requiring that most DoWhile statements have a Block as the loop body:

do { int i = 6; total = total + i; } while (total < 100);

(c) "The changes require you to declare variables before you use them".
Alas, no. Both sets are equivalent in that respect, and allow declarations and other statements to be freely mingled (except in the incorrectly specified DoWhile loop!). More generally, productions like

A = B | C . B = "x" | "y" . C = "p" | "q" .

and

A = "x" | "y" | "p" | "q" . A = "y" | "p" | "x" | "q" .

or indeed, any other permutations of these, do not force any ordering at all.
(b) "The second set allow you to write meaningless code like

if (a == b) int c;

with only a declaration instead of a statement". Well, yes, but the same apparently useless/meaningless code is allowed by the first set as

if (a == b) { int c; }

Few languages will prevent the dedicated idiots among us from writing meaningless or misleading code, although I suggest C++ is worse in this respect than some other languages I could mention. What do you make of the statements in this gem, which compiles quite happily and illustrates several potentially hard to find bugs?

void main () { int a, b; if (a); while (silly); while (silly()); while (a) {} if (a) 3; if (a) a+b; if (a) int x; else int y; 42; -6; silly; silly(); a+++--b; ++a + + + - - --b; }

4. Suppose we were asked to extend the description of Parva to allow a programmer to incorporate "inline" PVM assembler statements, as exemplified by

            pvm {
               PRNS  "Assigning -9 to variable 2"
               LDA   2
               LDC   -9
               STO
            }

Here is a new statement definition that attempts to allow this (we would obviously need a simple change to the production for Statement to be able to add this as a real possibility).

      InlineStatement
      = "pvm" "{"
          { identifier [ "+" | "-" ] [ number | stringLit ] }
        "}" .

Does it really describe the new statement type properly? If not, suggest how it should be improved. [6]

This was surprisingly badly done. I had hoped that everyone would have seen that this was far too "permissive" - not all identifiers are valid opcode mnemonics, and not all opcodes can take arbitrary parameters, nor is it legitimate to write statements like

LDC +
PRNS - "how long is a piece of string?"

What I had hoped to get was something on the lines of the following:

InlineStatement = "pvm" "{" { Code } "}" . Code = OneWord | TwoWord | PrintString . OneWord = "NOP" | "NEG" | "ADD" | "SUB" | "MUL" | "DIV" | "REM" | "AND" | "OR" | "NOT" | "ANEW" | "LDXA" | "LDV" | "HALT" | "CEQ" | "CNE" | "CLT" | "CLE" | "CGT" | "CGE" | "STK" | "STO" | "INPI" | "INPB" | "PRNI" | "PRNB" | "PRNL" . TwoWord = ( "LDC" | "LDA" | "DSP" | "BRN" | "BZE" ) [ "+" | "-" ] number . PrintString = "PRNS" stringLit .

As one perceptive person pointed out, the ability to have BRN and BZE statements would be very nice, but the system would be almost totally unusable, since computing the target addresses would be a nightmare. We would need to have some sort of label system, as defined perhaps by

InlineStatement = "pvm" "{" { Code } "}" . Code = [ Label ] ( OneWord | TwoWord | PrintString ) . OneWord = "NOP" | "NEG" | "ADD" | "SUB" | "MUL" | "DIV" | "REM" | "AND" | "OR" | "NOT" | "ANEW" | "LDXA" | "LDV" | "HALT" | "CEQ" | "CNE" | "CLT" | "CLE" | "CGT" | "CGE" | "STK" | "STO" | "INPI" | "INPB" | "PRNI" | "PRNB" | "PRNL" . TwoWord = ( "LDC" | "LDA" | "DSP" | "BRN" | "BZE" ) ( Label | [ "+" | "-" ] number ) . PrintString = "PRNS" stringLit . Label = identifier .

I had not thought of this complication when I set the problem! As I have said before, every year at least one student comes up with something that I have not thought of, and for which I am grateful. That is really why I go to all my lectures - I learn so much in them (why not try that for yourselves?).

5. As is very common, the production above makes use of the EBNF "meta brackets" { } and [ ]. How would you rewrite your (improved) production or productions describing the InlineStatement without making use either of these forms of meta brackets? [4]

This was not too badly done, and the transformations required are always easy. If you have a right hand side like

P = Something { Option } SomethingElse

introduce a new non terminal RepeatedOptions

RepeatedOptions = { Option } .

and then rewrite the rules for P as

P = Something RepeatedOptions SomethingElse
RepeatedOptions = Option RepeatedOptions | e .

Similarly, and even more simply, if you have something like

Q = Startoff [ OneOption ] Finish

introduce a new non terminal Possible

Possible = [ OneOption ] .

and then rewrite the rules for Q as

Q = Startoff Possible Finish
Possible = OneOption | e .

With that idea, the PRODUCTIONS extract given above might become:

InlineStatement = "pvm" "{" Codes "}" . Codes = Code Codes | e . Code = OneWord | TwoWord | PrintString . OneWord = "NOP" | "NEG" | "ADD" | "SUB" | "MUL" | "DIV" | "REM" | "AND" | "OR" | "NOT" | "ANEW" | "LDXA" | "LDV" | "HALT" | "CEQ" | "CNE" | "CLT" | "CLE" | "CGT" | "CGE" | "STK" | "STO" | "INPI" | "INPB" | "PRNI" | "PRNB" | "PRNL" . TwoWord = ( "LDC" | "LDA" | "DSP" | "BRN" | "BZE" ) Sign number . Sign = "+" | "-" | e . PrintString = "PRS" stringLit .

6. A daring extension to Parva would be to allow the use of the float type, in addition to int and bool. This would require one to be able to describe literal constants like 12.34 or 1.2E4 or 12E-5. Where and how would you have to change the Cocol specification of Parva to be able to recognise statements, declarations and expressions that incorporate these extensions? (Hint: there is a simple solution!) [6]

Firstly one has to modify the production for BasicType

BasicType = "int" | "bool" | "float" .

Then one has to find a way to extend the token definition for number to be able to express float literals as well as int literals. The problem here is to make sure that all the valid combinations are allowed, unambiguously, insisting that at least one digit appear, and with no possibility of obtaining several periods or Es in one number. For the examples given, the following would suffice

number = digit { digit } [ "." { digit } ] [ "E" [ "+" | "-" ] digit { digit } ] .

Attempts like the following are wrong (you should puzzle out why this is so)

number = digit { digit | "." } "E" { "+" | "-" | digit } .
number = digit { digit | "." | "E" | "+" | "-" } .

Some languages allow you to have float numbers like 123. or 12.E-4 or .456 or .35E+1 -does the above definition allow these? If not, how would you write a token definition to handle all those possibilities as well?
Several submissions tried definitions like

intnumber = digit { digit } .
floatnum = intnumber [ "." intnumber [ "E" [ "-" ] intnumber ] ] .

but as I have tried to explain on numerous occasions, you cannot write one token definition in terms of another token definition. The TOKENS section is really restricted to the use of "regular expressions" = albeit in a slightly different notation from the one suggested in setion 5.3 of the text book.

  COMPILER Parva $CN

  CHARACTERS
    lf         = CHR(10) .
    backslash  = CHR(92) .
    control    = CHR(0) .. CHR(31) .
    letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
    digit      = "0123456789" .
    stringCh   = ANY - '"' - control - backslash .
    charCh     = ANY - "'" - control - backslash .
    printable  = ANY - control .

  TOKENS
    identifier = letter { letter | digit | "_" } .
    number     = digit { digit } .
    stringLit  = '"' { stringCh | backslash printable } '"' .
    charLit    = "'" ( charCh   | backslash printable ) "'" .

  COMMENTS FROM "//" TO lf
  COMMENTS FROM "/*" TO "*/"

  IGNORE CHR(9) .. CHR(13)

  PRODUCTIONS
    Parva             = "void" identifier "(" ")" Block .
    Block             = "{" { Statement } "}" .
    Statement         =   Block | ConstDeclarations | VarDeclarations | Assignment | ";"
                        | IfStatement | WhileStatement | ReturnStatement | HaltStatement
                        | ReadStatement | WriteStatement .
    ConstDeclarations = "const" OneConst { "," OneConst } ";" .
    OneConst          = identifier "=" Constant .
    Constant          = number | charLit | "true" | "false" | "null" .
    VarDeclarations   = Type OneVar { "," OneVar } ";" .
    Type              = BasicType [ "[]" ] .
    BasicType         = "int" | "bool" .
    OneVar            = identifier [ "=" Expression ] .
    Assignment        = Designator "=" Expression ";" .
    Designator        = identifier [ "[" Expression "]" ] .
    IfStatement       = "if" "(" Condition ")" Statement .
    WhileStatement    = "while" "(" Condition ")" Statement .
    ReturnStatement   = "return" ";" .
    HaltStatement     = "halt" ";" .
    ReadStatement     = "read" "(" ReadElement { "," ReadElement } ")" ";" .
    ReadElement       = stringLit | Designator .
    WriteStatement    = "write" "(" WriteElement { "," WriteElement } ")" ";" .
    WriteElement      = stringLit | Expression .
    Condition         = Expression .
    Expression        = AddExp [ RelOp AddExp ] .
    AddExp            = [ "+" | "-" ] Term { AddOp Term } .
    Term              = Factor { MulOp Factor } .
    Factor            =   Designator | Constant
                        | "new" BasicType "[" Expression "]"
                        | "!" Factor | "(" Expression ")" .
    AddOp             = "+" | "-" | "||" .
    MulOp             = "*" | "/" | "&&" .
    RelOp             = "==" | "!=" | "<" | "<=" | ">" | ">=" .
  END Parva.



Home  © P.D. Terry