Computer Science 3 - 2004

Programming Language Translation


Practical for Week 22, beginning 22 September 2004 - Solutions

You can obtain machine readable versions of these grammars from the solution kit PRAC22A.ZIP or PRAC22AC.ZIP if you would like to experiment with them further.


Task 1 - Time to get some culture - go to the theatre

Most submissions seemed to have some idea of what was wrong, often not properly developed.

      Play   = Act Act Act "interval" Act Act .
      Act    = Scene { Scene } .
      Scene  = { "speech" } "entry" { Action } .
      Action = "speech" | "entry" | "exit" | "death" | "gesticulation" .

To check Rule 1 we look at the productions for each non-terminal that yields alternatives on its right hand side. There is seemingly only one non-terminal in this category here, namely Action. So we have to look at

      FIRST("speech")  Ç  FIRST("entry")
      FIRST("speech")  Ç  FIRST("exit")
      FIRST("speech")  Ç  FIRST("death")
      FIRST("speech")  Ç  FIRST("gesticulation")

etc. We do not have to look at FIRST(Action) Ç anything!

Clearly Rule 1 is satisfied for any pairwise combination taken from the options in the productions for Action.

One has to be careful, very careful, about Rule 2. Rule 2 applies when there are e productions and alternatives. It does not apply to the productions for Action, and so some people will be lulled into thinking that (since this is the only place where alternative right hand sides are explicitly given), that Rule 2 is never broken. Alas, not so. When you use EBNF and use the useful [ option ] and { repetition } constructs, there are implicit e's that creep in. If we eliminate the metabrackets we get something like

      Play        = Act Act Act "interval" Act Act .
      Act         = Scene RestOfScene .
      Scene       = Prologue "entry" Actions .
      RestOfScene = Scene RestOfScene | e .
      Prologue    = "speech" Prologue | e .
      Actions     = Action Actions | e .
      Action      = "speech" | "entry" | "exit" | "death" | "gesticulation" .

in which the following now have to be checked for violations of Rule 2

            Productions                               Check

      RestOfScene = Scene RestOfScene | e .    (FIRST(RestOfScene) Ç FOLLOW(RestOfScene) )
      Prologue = "speech" Prologue | e .       (FIRST(Prologue) Ç FOLLOW(Prologue) )
      Actions = Action Actions | e .           (FIRST(Actions) Ç FOLLOW(Actions)  )

Now:

    First(RestOfScene) = First(Scene) = FIRST(Prologue) È { "entry" } = { "speech", "entry" } }
    Follow(RestOfScene) = FOLLOW(Act) = FIRST(Act) È {"interval" }
                                      = FIRST(Scene) È { "interval" } = { "speech", "entry", "interval" }

so Rule 2 is broken in this case. For the second production in this category

    FIRST(Prologue) = { "speech" }
    FOLLOW(Prologue) = { "entry" }

so Rule 2 is not broken that time, but for the third of these new productions

    FIRST(Actions) = FIRST(Action)  = { "speech", "entry", "exit", "death", "gesticulation" }
    FOLLOW(Actions) = FOLLOW(Scene) = FIRST(RestOfScene) È FOLLOW(Act)
                                    = { "speech", "entry" } È {"speech", "entry" "interval" }
                                    = {"speech", "entry" "interval" }

and Rule 2 is broken again. With more experience it is possible to check Rule 2 more quickly.

      Play   = Act Act Act "interval" Act Act .
      Act    = Scene { Scene } .
      Scene  = { "speech" } "entry" { Action } .
      Action = "speech" | "entry" | "exit" | "death" | "gesticulation" .

and in particular the production

Act = Scene { Scene } .

In this case Rule 2 really governs whether a parser will be able to determine whether it has to go around the loop corresponding to { Scene } for another iteration. It can only do this if FIRST(Scene) has nothing in common with what we could loosely call FOLLOW(loop), that is, FOLLOW( {Scene} ). From the third production we should quickly see that FIRST( {Scene} ) = { "speech" , "entry" }, and from the second production we should quickly see that FOLLOW( {Scene} ) = FOLLOW(Act) = { "interval" } È FIRST(Act), a set which clearly includes the elements { "speech" , "entry" } again.

Even more experience or common sense might bring you to the same conclusion just from "first principles". If there is no recognizable "end of scene" marker, as a member of the audience one cannot readily tell, when a new Scene starts, whether it is the start of a new Act or not. For that matter, one cannot even tell when a new speech starts, whether there has been a scene change or not! The "language" can be made LL(1) if the theatre raises and lowers the curtain between scenes, or actually changes the furniture on the stage, although of course it is no longer strictly equivalent to the original one.

      Play   = Act Act Act "interval" Act Act .
      Act    = "dimlights" Scene { Scene } "raiselights" .
      Scene  = "raisecurtain" { "speech" } "entry" { Action } "lowercurtain" .
      Action = "speech" | "entry" | "exit" | "death" | "gesticulation" .

Similarly, there is a "quick" way to see that the production

Scene = { "speech" } "entry" { Action } .

gets you into trouble. The first loop (corresponding to { "speech" } ) gives no trouble - one more time round the loop if we hear another "speech", very different from the follower, "entry" (you hear one, you see the other, in real life). The second loop is more of a problem. One more time round the loop if we detect any element of the set of { "speech", "entry", "exit", "death", "gesticulation" }? Well, that would be nice, but after the loop is all over we must start the next Scene, and that will also start with one of { "speech", "entry" }.


Task 2 - Palindromes again

Palindromes are character strings that read the same from either end. You were invited to explore various ways of finding grammars that describe palindromes made only of the letters a and b:

     (1)        Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" .
     (2)        Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" .
     (3)        Palindrome = "a" [ Palindrome ] "a" | "b" [ Palindrome ] "b" .
     (4)        Palindrome = [ "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" ] .

This is one of those awful problems that looks deceptively simple, and indeed is deceptive. We need to be able to cater for palindromes of odd or even length, and we need to be able to cater for palindromes of finite length, so that the "repetition" that one immediately thinks of has to be able to terminate.

Here they are, coded into Cocol, with the pragmas that alow for easy testing:

   COMPILER Palindrome $TF /* does not terminate */
   PRODUCTIONS
     Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" .
   END Palindrome.

   COMPILER Palindrome $TF /* only allows odd length palindromes */
   PRODUCTIONS
     Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" .
   END Palindrome.

   COMPILER Palindrome $TF /* only allows even length palindromes */
   PRODUCTIONS
     Palindrome = "a" [ Palindrome ] "a" | "b" [ Palindrome ] "b" .
   END Palindrome.

   COMPILER Palindrome $TF /* allows any length palindromes */
   PRODUCTIONS
     Palindrome = [ "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" ] .
   END Palindrome.

Of those grammars, the first is LL(1), but is useless (it is not "reduced" in the sense of the definitions on page 81). The second one is obviously non-LL(1) as the terminals "a" and "b" can start more than one alternative. The third one is less obviously non-LL(1). If you rewrite it

   COMPILER Palindrome /* only allows even length palindromes */
   PRODUCTIONS
     Palindrome = "a" Extra "a" | "b" Extra "b" .
     Extra      = Palindrome | e .
   END Palindrome.

and note that Extra is nullable, then FIRST(Extra) = { "a", "b" } and FOLLOW(Extra) = { "a", "b" }.

The fourth one describes both odd and even length palindromes, but is non-LL(1). Palindrome is nullable, and both FIRST(Palindrome) and FOLLOW(Palindrome) = { "a", "b" }. And, as most were quick to notice, it breaks Rule 1 immediately as well. Another suggestion is to try:

   COMPILER Palindrome /* allows any length palindromes */
   PRODUCTIONS
     Palindrome =  "a"  [ Palindrome  "a"] | "b"  [ Palindrome  "b" ] .
   END Palindrome.

but, ingenious as this appears, it does not work either. Rewritten it would become

   COMPILER Palindrome /* allows any length palindromes */
   PRODUCTIONS
     Palindrome =  "a"  PalA | "b" PalB .
     PalA       = Palindrome  "a" | .
     PalB       = Palindrome  "b" | .
   END Palindrome.

PalA and PalB are both nullable, and FIRST(PalA) = { "a" , "b" } while FOLLOW(PalA) = FOLLOW(Palindrome) = { "a", "b" } as well.

In fact, when you think about it, you simply will not be able to find an LL(1) grammar for this language. (That is fine; grammars don't have to be LL(1) to be valid grammars. They just have to be LL(1) or very close to LL(1) to be able to write recursive descent parsers.) Here's how to think about it. Suppose I asked you to hold your breath for as long as you could, and also to nod your head when you were half way through. I don't believe you could do it - you don't know before you begin exactly how long you will be holding your breath. Similarly, if I told you to get into my car and drive it till the tank was empty and to hoot the hooter when you were half way to running out you could not do it. Or if I told you to walk into a forest with your partner and kiss him/her when you were in the dead centre of the forest, you would not know when the magic moment had arrived.

LL(1) parsers have to be able to decide just by looking at one token exactly what to do next - if they have to guess when they are are half-way through parsing some structure they will not be able to do so. One would have to stop applying the options like Palindrome = "a" Palindrome "a" at the point where one had generated or analyzed half the palindrome, and if there is no distinctive character in the middle one would not expect the parser to be able to do so.

If course, if one changes the problem ever so slightly in that way one can find an LL(1) grammar. Suppose we want a grammar for palindromes that have matching a and b characters on either end and a distinctive c or pair of c characters in the centre:

   COMPILER Palindrome /* allows any length palindromes, but c must be in the middle */
   PRODUCTIONS
     Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" | "c" [ "c" ] .
   END Palindrome.


Task 4 - Pause for thought

Which of the following statements are true? Justify your answer.

(a) An LL(1) grammar cannot be ambiguous.
(b) A non-LL(1) grammar must be ambiguous.
(c) An ambiguous language cannot be described by an LL(1) grammar.
(d) It is possible to find an LL(1) grammar to describe any non-ambiguous language.

To answer this sort of question you must be able to argue convincingly, and most people did not do that at all!

(a) is TRUE. An LL(1) grammar cannot be ambiguous. If a language can be described by an LL(1) grammar it will always be able to find a single valid parse tree for any valid sentence, and no parse tree for an invalid sentence. The rules imply that no indecision can exist at any stage - either you can find a unique way to continue the implicit derivation from the goal symbol, or you have to conclude that the sentence is malformed.

But you cannot immediately conclude any of the "opposite" statements, other than (c) which is TRUE. If you really want to define an ambiguous language (and you may have perfectly good/nefarious reasons for doing so - stand-up comedians do it all the time) you will not be able to describe it by an LL(1) grammar, which has the property that it can only be used for deterministic parsing.

In particular (b) is FALSE. We can "justify" this by giving just a single counter example to the claim that it might be true. We have seen several such grammars. The palindrome grammars above are like this - even though they are not LL(1) for the reasons given, they are quite unambiguous - you would only be able to parse any palindrome in one way. Other examples are to be found in the left-recursive grammars for expressions discussed in the text at the start of chapter 6.

Similarly (d) is FALSE. Once again the palindrome example suffices - this language is simple, unambiguous, but we can easily argue that it is impossible to find an LL(1) grammar to describe it.


Task 5 - All very logical

The Boolean grammar is very similar to the arithmetic one, of course, so it was distressing that so many people had it rather badly wrong. To get the operator precedences correct, while keeping the grammar LL(1), we need to introduce four levels of non-terminal:

     COMPILER Bool $CN
     /* Boolean expression parser
        P.D. Terry, Rhodes University, 2004 */

     CHARACTERS
       letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
     TOKENS
       variable   = letter .
     COMMENTS FROM "(*" TO "*)"  NESTED
     COMMENTS FROM "/*" TO "*/"  NESTED
     IGNORE CHR(0) .. CHR(31)

     PRODUCTIONS
       Bool       = { Expression "=" } .
       Expression = Term { Or Term } .
       Term       = Factor { [ And ] Factor } .
       Factor     = "NOT" Factor | Primary { "'" } .
       Primary    = True | False | variable | "(" Expression ")" .
       True       = "TRUE" | "1" .
       False      = "FALSE" | "0" .
       And        = "AND" | "&" | "." .
       Or         = "OR" | "|" | "+" .
     END Bool.

Many people wrote the constants into the TOKENS section as follows:

     TOKENS
       variable   = letter .
       True       = "TRUE" | "1" .
       False      = "FALSE" | "0" .
       And        = "AND" | "&" | "." .
       Or         = "OR" | "|" | "+" .

but it is usually adequate to introduce fixed token definitions straight into the PRODUCTIONS section where they appear, and retain the TOKENS section for the definition of things like identifiers, numbers and strings that are all similar in form but different in spelling.


Task 6 - How are things stacking up?

The grammar for the assembler language was well done by some, and rather inadequately done by others. In particular, few submissions had handled the PRNS opcode correctly - this one takes a "string" as an argument, and the best way to define a string is in the TOKENS section (and, in fact, a way of doing this had been given to you in the Parva grammar for Task 7).

It is best to split the opcodes into 4 groups - those that take no argument, those that take only a numerical argument, those that take a numerical argument or a label, and PRNS.

There are two complications that were not handled all that well - one should treat the end-of-line as significant, and one should be able to handle blank lines. Here is one solution, which fudges it somewhat, by not attaching an eol to each statement but almost regarding an eol as a statement in its own right. This also means that one can have labels on a line by themselves:

     COMPILER Assem $NC
     /* Simple assembler for the PVM
        P.D. Terry, Rhodes University, 2004 */

     CHARACTERS
       lf       = CHR(10) .
       control  = CHR(0) .. CHR(31) .
       letter   = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
       digit    = "0123456789" .
       stringCh = ANY - '"' - control .
     TOKENS
       identifier = letter { letter | digit } .
       number     = digit { digit } .
       stringLit  = '"' { stringCh } '"' .
       eol        = lf .
     COMMENTS FROM ";" TO lf
     IGNORE CHR(9) + CHR(11) .. CHR(13)

     PRODUCTIONS
       Assem        = [ "ASSEM" eol ] [ "BEGIN" eol ] { Statement | eol } [ "END" "." eol ] .
       Statement    = MemAddress ( OneWord | TwoWord | WriteString | Label | Branch ) .
       MemAddress   = [ "{" number "}" | number ] .
       OneWord      =   "ADD"  | "AND"  | "ANEW" | "CEQ"  | "CGE"  | "CGT"  | "CLE"
                      | "CLT"  | "CNE"  | "DIV"  | "HALT" | "INPB" | "INPI" | "LDV"
                      | "LDXA" | "MUL"  | "NEG"  | "NOP"  | "NOT"  | "OR"   | "PRNB"
                      | "PRNI" | "PRNL" | "REM"  | "STO"  | "SUB" .
       TwoWord      = ( "DSP" | "LDC" | "LDA" | "LDL" | "STL" ) SignedNumber .
       SignedNumber = [ "+" | "-" ] number .
       WriteString  = "PRNS" stringLit .
       Label        = identifier .
       Branch       = ( "BRN" | "BZE" ) ( number | identifier ) .
     END Assem.

Here is another method that may appeal more, where the eol is explicitly attached to a statement:

       Assem1       = [ "ASSEM" eol ] [ "BEGIN" eol ] { Statement } [ "END" "." eol ] .
       Statement    = [ MemAddress ] [ Label ] [ OneWord | TwoWord | WriteString | Branch ] eol .
       MemAddress   = "{" number "}" | number .

Some submissions attempted to distinguish between the .COD input format and the .PVM input format. The solutions above do not do this, and are quite permissive. Here is one way of making the distinction:

       Assem2       = CODFormat | PVMFormat .
       CODFormat    = "ASSEM" eol "BEGIN" eol PVMFormat "END" "." eol .
       PVMFormat    = { Statement } .
       Statement    = [ MemAddress ] [ Label ] [ OneWord | TwoWord | WriteString | Branch ] eol .
       MemAddress   = "{" number "}" | number .

but even this may strike you as too permissive; perhaps one should try something more like:

       Assem3       = CODFormat | PVMFormat .
       CODFormat    = "ASSEM" eol "BEGIN" eol { CODStatement } "END" "." eol .
       PVMFormat    = { PVMStatement } .
       CODStatement = [ "{" number "}" ] Statement .
       PVMStatement = [ number ] Statement .
       Statement    = [ Label ] [ OneWord | TwoWord | WriteString | Branch ] eol .


Task 7 - Parva expressions are not like those in C# and Java

What is needed here is simply to replace

    Expression  = AddExp [ RelOp AddExp ] .
    AddExp      = [ "+" | "-" ] Term { AddOp Term } .
    Term        = Factor { MulOp Factor } .
    Factor      =   Designator | Constant
                  | "new" BasicType "[" Expression "]"
                  | "!" Factor | "(" Expression ")" .
    BasicType   = "int" | "bool" .
    Constant    = number | charLit | "true" | "false" | "null" .
    AddOp       = "+" | "-" | "||" .
    MulOp       = "*" | "/" | "%" | "&&" .
    RelOp       = "==" | "!=" | "<" | "<=" | ">" | ">=" .

by

    Expression  = AndExp { "||" AndExp } .
    AndExp      = EqlExp { "&&" EqlExp } .
    EqlExp      = RelExp { EqlOp RelExp } .
    RelExp      = AddExp [ RelOp AddExp ] .
    AddExp      = MulExp { AddOp MulExp } .
    MulExp      = Factor { MulOp Factor } .
    Factor      = Primary | "+" Factor | "-" Factor | "!" Factor .
    Primary     =   Designator | Constant
                  | "new" BasicType "[" Expression "]"
                  | "(" Expression ")" .
    BasicType   = "int" | "bool" .
    Constant    = number | charLit | "true" | "false" | "null" .
    AddOp       = "+" | "-" .
    MulOp       = "*" | "/" | "%" .
    EqlOp       = "==" | "!=" .
    RelOp       = ">" | ">=" | "<" | "<=" .

Most submissions did not get this completely correct. Notice where the "prefix" + and - operators are introduced into the hierarchy (in Factor, where they are made right-associative), and that the productions for Expression, AndExp, EqlExp, AddExp and MulExp are of the general form

A = B { op B } .

and not

A = B [ op B ] .

which would be very limiting (not that you would have noticed on cheap and dirty testing!). However, note that an exception to this is found in

RelExp = AddExp [ RelOp AddExp ] .

and you might like to puzzle out why this should be the case.

Pascal and its derivatives make do with fewer levels of precedence. So far as I can see, the only real advantage of having more levels of precedence is that some expressions can be written with fewer parentheses - for example

      (a < b) or (c > d)     (* Pascal *)
      a < b || c > d         /* C */

However, remembering many levels of precedence is quite difficult, and when in doubt one should insert extra parentheses anyway.


Home  © P.D. Terry