Computer Science 3 - 2001

Programming Language Translation

Practical for Week 22, beginning 22 September 2003 - Solutions

This tutorial/practical was very badly done. Many people could "guess" the answers, but could not or did not justify their conclusions. If set in exams, in these sorts of questions it is important to do so.

As usual, you can find a "solution kit" as PRAC22A.ZIP if you wish to experiment further.

Task 1

Test the following grammar (by hand) for being LL(1)

   COMPILER Silly $CNF
   IGNORE CHR(1) .. CHR(31)
   PRODUCTIONS
     Silly = A ( "b" | "d" ) | ( "a" | "d" ) A .
     A     = "b" A "d" | { "c" } .
   END Silly.

The best way to solve these problems may be to eliminate the meta-symbols. For example:

   COMPILER Sillier $CNF
   IGNORE CHR(1) .. CHR(31)
   PRODUCTIONS
     Sillier = AB | CA .              /* 1 */
     AB      = A B .                  /* 2 */
     CA      = C A .                  /* 3 */
     A       = "b" A "d" | D .        /* 4 */
     B       = "b" | "d" .            /* 5 */
     C       = "a" | "d" .            /* 6 */
     D       = "c" D | .              /* 7 */
   END Sillier.

Of these productions 1, 4, 5, 6 and 7 offer alternatives and so should be checked for transgressions of "Rule 1". 4 and 7 show that the non-terminals A and D are nullable, and so these non-terminals should be checked for transgressions of "Rule 2".

Do not simply form the sets FIRST(A), FIRST(B), FIRST(C) and FIRST(D), as some people insisted on doing. You are explicitly warned about this in the text (page 97). If you have not read the text then please do so!

Productions 4 - 7 come through the Rule 1 test with flying colours:

     FIRST(A1) = { "b" }     FIRST(A2) = { "c" }.   Intersection is empty.
     FIRST(B1) = { "b" }     FIRST(B2) = { "d" }.   Intersection is empty.
     FIRST(C1) = { "a" }     FIRST(C2) = { "d" }.   Intersection is empty.
     FIRST(D1) = { "c" }     FIRST(D2) = Empty  .   Intersection is empty.

Production 1 needs a bit more thought:

     FIRST(Sillier1) = FIRST(AB) = FIRST(A) U FIRST(B) = { "b", "c" } U { "b" , "d" }
                     = { "b", "c", "d" }

     FIRST(Sillier2) = FIRST(CA) = FIRST(C) = { "a", "d" }

so Rule 1 is broken as these two sets have "d" in common.

For the nullable non-terminals

     FIRST(A) = { "b", "c" }    FOLLOW(A) = FIRST(B) U { "d" } = { "b", "d" }

so Rule 2 is broken for B, as FIRST(A) and FOLLOW(A) have "b" in common.

     FIRST(D) = { "c" }         FOLLOW(D) = FOLLOW(A) = { "b", "d" }

so Rule 2 is satisfied for D.

Task 2

Consider the following grammar expressed in EBNF for describing the progress of a typical university course:

   COMPILER Course $CNF
   IGNORE CHR(1) .. CHR(31)
   PRODUCTIONS
     Course       = Introduction Section { Section } Conclusion .
     Introduction = "lecture" [ "handout" ] .
     Section      = { "lecture" | "practical" "test" | "tutorial" | "handout" } "test" .
     Conclusion   = [ "panic" ] "examination" .
   END Course.

Is this an LL(1) grammar? If not, where does it break the rules? Can you find an equivalent LL(1) grammar?

Well, most people see this straightaway. The [ "handout" ] component in the production for Introduction is "nullable", and is followed by Section, whose FIRST set contains "handout". So we can anticipate a transgression of Rule 2.

If you want the full gory detail:

      Course          = Introduction Section MoreSections Conclusion .
      MoreSections    = Section MoreSections | e .
      Introduction    = "lecture" OptionalHandout .
      OptionalHandout = "handout" | e  .
      Section         = Components "test" .
      Components      = Component Components | e .
      Component       = "lecture" | "practical" "test" | "tutorial" | "handout" .
      Conclusion      = OptionalPanic "Examination"
      OptionalPanic   = "Panic" | e .

OptionalHandout is nullable and

    FIRST(OptionalHandout) = { "handout" }
    FOLLOW(OptionalHandout) = { "handout", "lecture", "practical", "tutorial", "test" }

In other respects the grammar is properly LL(1). Work through the details as an exercise.

Although we have a non-LL(1) grammar, a simple parser would work. The situation is completely analogous to the "dangling else" problem - if a handout follows the opening lecture it is bound to Introduction, rather than forming part of a subsequent Section.

One can easily find an equivalent LL(1) grammar - simply change the production for Introduction to

     Introduction = "lecture" .

although this means we can no longer regard the introductory handout as semantically special, which we might later want to do.

Be careful about guessing like this. Sometimes it can get you into trouble.

Task 3

The following grammar attempts to describe expressions incorporating the familiar operators with their correct precedence and associativity.

   COMPILER Expression $CNF
   IGNORE CHR(1) .. CHR(31)
   PRODUCTIONS
     Expression = Term    { ( "+" | "-" ) Term  } .
     Term       = Factor  { ( "*" | "/*" | "%" ) Factor } .
     Factor     = Primary [ "^" Expression ] .
     Primary    = "a" | "b" | "c" .
  END Expression.

Is this an ambiguous grammar? (Hint: try to find an expression that can be parsed in more than one way).

Again, many people "guessed" the right answer. To justify the claim that it is ambiguous it would be as well to show a pair of parse trees, not just make a wild claim! Considering the expression a^b*c. This can indeed be parsed in two ways, one with the implicit meaning of a^(b*c) and the other with the meaning of (a^b)*c. The parse trees would look like this (a few intermediate nodes have been omitted to save space)

           Expression                                        Expression
               |                                                 |
             Term                                              Term
               |                                                 |
               |                                       .---------+--------.
               |                                       |         |        |
            Factor                                  Factor      "*"     Factor
               |                                       |                  |
        .------+---------.                             |---------.        |
        |      |         |                             |  |      |        |
     Primary  "^"    Expression                   Primary ^  Expression   |
        |                |                             |         |        |
        |               Term                           |       Term       |
        |         .------+-----.                       |         |        |
        |         |      |     |                       |         |        |
     Number     Factor  "*"  Factor                Number    Factor       |
        |         |            |                       |         |        |
        |         |            |                       |         |        |
        a         b            c                       a         b        c

Is it an LL(1) grammar? If not, why not, and can you find a suitable grammar that is LL(1)?

If we rewrite the first grammar to eliminate the metabrackets we get

     Expression = Term TailExp .
     TailExp    = AddOp Term TailExp | e .
     Term       = Factor TailTerm .
     TailTerm   = MulOp Factor TailTerm | e .
     Factor     = Primary TailFactor .
     TailFactor = "^" Expression | e .
     Primary    = "a" | "b" | "c" .
     AddOp      = "+" | "-" .
     MulOp      = "*" | "/" .

The nullable nonterminals here are TailExp, TailTerm and TailFactor.

   FIRST(TailExp)    = { "+" , "-" }
   FIRST(TailTerm)   = { "*" , "/" }
   FIRST(TailFactor) = { "^" }

The FOLLOW sets are a little harder to see because to get to closure one has to chase through quite a few other productions:

   FOLLOW(TailExp)    = FOLLOW(Expression)
   FOLLOW(TailTerm)   = FOLLOW(Term) = FIRST(TailExp) U FOLLOW(Expression)
   FOLLOW(TailFactor) = FOLLOW(Factor) = FIRST(TailTerm) U FOLLOW(Term)

You are invited to track these through in detail; the outcome is that they are all the same:

   FOLLOW(TailExp)    = { "*" , "/" , "+" , "-" , EOF }
   FOLLOW(TailTerm)   = { "*" , "/" , "+" , "-" , EOF }
   FOLLOW(TailFactor) = { "*" , "/" , "+" , "-" , EOF }

and so Rule 2 is broken for TailExp and for TailTerm.

Finding an LL(1), unambigous grammar, with the correct precedence and associativity is not too difficult. In fact it would have been incredibly easy had you just read the text, page 81, where the solution is effectively given to you.

Follow up question: Why don't you try reading the text book occasionally?

   COMPILER Expression $CNF
   IGNORE CHR(1) .. CHR(31)
   PRODUCTIONS
     Expression = Term    { ( "+" | "-" ) Term  } .
     Term       = Factor  { ( "*" | "/*" | "%" ) Factor } .
     Factor     = Primary [ "^" Factor ] .
     Primary    = "a" | "b" | "c" .
  END Expression.

Why does our grammar now satisfy the LL(1) constraints? Rewritten it becomes

     Expression = Term TailExp .
     TailExp    = AddOp Term TailExp | e .
     Term       = Factor TailTerm .
     TailTerm   = MulOp Factor TailTerm | e .
     Factor     = Primary TailFactor .
     TailFactor = "^" Factor | e .
     Primary    = "a" | "b" | "c" .
     AddOp      = "+" | "-" .
     MulOp      = "*" | "/" .

The nullable nonterminals here are still TailExp, TailTerm and TailFactor.

   FIRST(TailExp)    = { "+" , "-" }
   FIRST(TailTerm)   = { "*" , "/" }
   FIRST(TailFactor) = { "^" }

The FOLLOW sets are a little harder to see because to get to closure one has to chase through quite a few other productions:

   FOLLOW(TailExp)    = FOLLOW(Expression)
   FOLLOW(TailTerm)   = FOLLOW(Term) = FIRST(TailExp) U FOLLOW(Expression)
   FOLLOW(TailFactor) = FOLLOW(Factor) = FIRST(TailTerm) U FOLLOW(Term)

You are invited to track these through in detail; the outcome is:

   FOLLOW(TailExp)    = { EOF }
   FOLLOW(TailTerm)   = { "+" , "-" , EOF }
   FOLLOW(TailFactor) = { "*" , "/" , "+" , "-" , EOF }

and so Rule 2 is no longer broken.

There were various other suggestions made, such as

     Factor     = Primary [ "^" Term ] .
     Factor     = Primary [ "^" "(" Expression ")" ] .
     Factor     = Primary { "^" Term } .

but these are unnecessarily restrictive (first suggestion) or non-equivalent (second suggestion; parentheses were not catered for in the first grammar and introducing them is "cheating"). The third suggestion gets the associativity incorrect.

Task 4

Palindromes are character strings that read the same from either end. You were invited to\ explore various ways of finding grammars that describe palindromes made only of the letters a and b:

     (1)        Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" .
     (2)        Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" .
     (3)        Palindrome = "a" [ Palindrome ] "a" | "b" [ Palindrome ] "b" .
     (4)        Palindrome = [ "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" ] .

Which grammars achieve their aim? If they do not, explain why not. Which of them are LL(1)? Can you find other (perhaps better) grammars that describe palindromes and which are LL(1)?

This is one of those awful problems that looks deceptively simple, and indeed is deceptive. We need to be able to cater for palindromes of odd or even length, and we need to be able to cater for palindromes of finite length, so that the "repetition" that one immediately thinks of has to be able to terminate.

Here are some that don't work:

   COMPILER Palindrome /* does not terminate */
     PRODUCTIONS
       Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" .
   END Palindrome.

   COMPILER Palindrome /* only allows odd length palindromes */
     PRODUCTIONS
       Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" .
   END Palindrome.

   COMPILER Palindrome /* only allows even length palindromes */
     PRODUCTIONS
       Palindrome = "a" [ Palindrome ] "a" | "b" [ Palindrome ] "b" .
   END Palindrome.

Of those grammars, the first is LL(1), but is useless (it is not "reduced" in the sense of the definitions on page 81). The second one is obviously non-LL(1) as the terminals "a" and "b" can start more than one alternative. The third one is less obviously non-LL(1). If you rewrite it

   COMPILER Palindrome /* only allows even length palindromes */
     PRODUCTIONS
       Palindrome = "a" Extra "a" | "b" Extra "b" .
       Extra      = Palindrome | e .
   END Palindrome.

and note that Extra is nullable, then FIRST(Extra) = { "a", "b" } and FOLLOW(Extra) = { "a", "b" }.

Here is another attempt

   COMPILER Palindrome /* allows any length palindromes */
     PRODUCTIONS
       Palindrome = [ "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" ] .
   END Palindrome.

This describes both odd and even length palindromes, but is non-LL(1). Palindrome is nullable, and both FIRST(Palindrome) and FOLLOW(Palindrome) = { "a", "b" }. And, as most were quick to notice, it breaks Rule 1 immediately as well.

Other suggestions were:

   COMPILER Palindrome /* allows any length palindromes */
     PRODUCTIONS
       Palindrome =  "a"  [ Palindrome  "a"] | "b"  [ Palindrome  "b" ] .
   END Palindrome.

but, ingenious as this appears, it does not work either. Rewritten it would become

   COMPILER Palindrome /* allows any length palindromes */
     PRODUCTIONS
       Palindrome =  "a"  PalA | "b" PalB .
       PalA       = Palindrome  "a" | .
       PalB       = Palindrome  "b" | .
   END Palindrome.

PalA and PalB are both nullable, and FIRST(PalA) = { "a" , "b" } while FOLLOW(PalA) = FOLLOW(Palindrome) = { "a", "b" } as well.

In fact, when you think about it, you simply will not be able to find an LL(1) grammar for this language. (That is fine; grammars don't have to be LL(1) to be valid grammars. They just have to be LL(1) or very close to LL(1) to be able to write recursive descent parsers.) Here's how to think about it. Suppose I asked you to hold your breath for as long as you could, and also to nod your head when you were half way through. I don't believe you could do it - you don't know before you begin exactly how long you will be holding your breath. Similarly, if I told you to get into my car and drive it till the tank was empty and to hoot the hooter when you were half way to running out you could not do it. Or if I told you to walk into a forest with your partner and kiss him/her when you were in the dead centre of the forest, you would not know when the magic moment had arrived.

LL(1) parsers have to be able to decide just by looking at one token exactly what to do next - if they have to guess when they are are half-way through parsing some structure they will not be able to do so. One would have to stop applying the options like Palindrome = "a" Palindrome "a" at the point where one had generated or analyzed half the palindrome, and if there is no distinctive character in the middle one would not expect the parser to be able to do so.

If course, if one changes the problem ever so slightly in that way one can find an LL(1) grammar. Suppose we want a grammar for palindromes that have matching a and b characters on either end and a distinctive c or pair of c characters in the centre:

   COMPILER Palindrome /* allows any length palindromes, but c must be in the middle */
     PRODUCTIONS
       Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" | "c" [ "c" ] .
   END Palindrome.

Task 5

Which of the following statements are true? Justify your answer.

(a) An LL(1) grammar cannot be ambiguous.
(b) A non-LL(1) grammar must be ambiguous.
(c) An ambiguous language cannot be described by an LL(1) grammar.
(d) It is possible to find an LL(1) grammar to describe any non-ambiguous language.

To answer this sort of question you must be able to argue convincingly, and most people did not do that at all!

(a) is TRUE. An LL(1) grammar cannot be ambiguous. If a language can be described by an LL(1) grammar it will always be able to find a single valid parse tree for any valid sentence, and no parse tree for an invalid sentence. The rules imply that no indecision can exist at any stage - either you can find a unique way to continue the implicit derivation from the goal symbol, or you have to conclude that the sentence is malformed.

But you cannot immediately conclude any of the "opposite" statements, other than (c) which is TRUE. If you really want to define an ambiguous language (and you may have perfectly good/nefarious reasons for doing so - humourists do it all the time) you will not be able to describe it by an LL(1) grammar, which has the property that it can only be used for deterministic parsing.

In particular (b) is FALSE. We can "justify" this by giving just a single counter example to the claim that it might be true. We have seen several such grammars. The palindrome grammars above are like this - even though they are non LL(1) for the reasons given, they are quite unambiguous - you would only be able to parse any palindrome in one way! Similarly, a bad train grammar discussed in class and simplified to

   Train = "loco" "coal" { "coal" | "fuel" } "coal" "guard" "." .

is non-LL(1), but it is not ambiguous - you could only parse the train

    loco   coal  coal  coal  guard  .

in one way. This is a particularly simple grammar and it is hopefully easy to see that any valid train defined by it could only be parsed in one way.

Similarly (d) is TRUE. Once again the palindrome example suffices - this language is simple, unambiguous, but we can easily argue that it is impossible to find an LL(1) grammar to describe it.