Tutorial for week

Computer Science 301 - 2014

Second Tutorial for week 6 - LL(1) grammar checks again

Stop watching videos and playing computer games - listen to steam radio

As an example of checking the LL(1) conditions for a grammar expressed in EBNF, let us consider how we might describe the daily activities of SAFM - "your news and information leader" (more realistically described as a non- stop party political broadcast). Hour by hour Tim, Tsepiso, Isaac, Nancy, Ashraf, Elvis, Karabo and their comrades present what are generously called shows. These include music and advertisements (since people pay no licence fees, the station has to raise revenue through advertising). What passes as "information" is provided by news bulletins, weather reports and talk shows. News bulletins consist of a stream of stories, while talk shows consist of interchanges between the host and listeners who are keen enough to phone in - with the person hosting the exchange getting the first and last word on the subject, of course.

Items like "advert", "rain" and "music" are really in the category of the lexical terminals which a scanner (in the person of someone listening to the radio) would recognize as key symbols while parsing a broadcast. So one attempt to describe a day's activities might be on the lines of

     Radio        =   { TalkShow | NewsBulletin | "music" | "advert" } EOF .
     NewsBulletin =   "advert" NewsItem { NewsItem } [ Weather ] Filler .
     NewsItem     =   "ANC" [ "cosatu" ] | "strike" | [ "cosatu" ] "ANC" | "zuma" | "corruption" | "murder" .
     TalkShow     =   "host" { "listener" "host" } [ Filler ] .
     Filler       =   "music" | "advert"  .
     Weather      =   { "snow" | "rain" | "cloudy" | "windy" } .

Analyze this grammar in detail. One way is to begin by recasting it in a form that introduces further non- terminals that allow for the elimination of the [ ] and { } meta-brackets. If it proves out to be non-LL(1), try to find an essentially equivalent description that is LL(1), or argue why this should be impossible.

This is pretty obviously non-LL(1) and most people can see one of the obvious reasons.

Here is a detailed analysis:

If we rewrite the grammar without meta-brackets we get something like:

    Radio        = Programmes EOF .
    Programmes   = Programme Programmes | e .
    Programme    = TalkShow | NewsBulletin | "song" | "advert" .
    NewsBulletin = "advert" NewsItems OptWeather Filler .
    NewsItems    = NewsItem NewsItems | e .
    OptWeather   = Weather | e .
    NewsItem     =   "ANC" OptCosatu | "strike" | OptCosatu "ANC" | "zuma"
                   | "corruption" | "malema" "song" | Accident .
    OptCosatu    = "cosatu" | e .
    TalkShow     = "host" Exchanges OptFiller .
    Exchanges    = Exchange Exchanges | e .
    OptFiller    = Filler | e .
    Exchange     = "listener" "host" .
    Filler       = "song" | "advert" .
    Accident     = "collision" "claims" "another" number "lives" .
    Weather      = OneWeather Weather | e .
    OneWeather   = "snow" | "rain" | "cloudy" | "windy" .

Rule 1 is broken in only two places:

(a) One of the options for Programme, that is NewsBulletin, starts with "advert", which is an option in its own right for Programme.

(b) Since OptCosatu is nullable, two of the options in NewsItem can both start with "ANC"

To check Rule 2 we need to look at the nullable non-terminals, which are

Programmes, NewsItems, OptWeather, Weather, OptCosatu, Exchanges, OptFiller

You should check through in detail, but the two that cause problems are OptCosatu and OptFiller:

    FIRST(OptCosatu)  = { cosatu }
    FOLLOW(OptCosatu) = { music, advert, ANC, strike, zuma, corruption, malema, cosatu,
                          collision, snow, rain, cloudy, windy }

    FIRST(OptFiller)  = { music, advert }
    FOLLOW(OptFiller) = { music, advert, host }

As mentioned in class, with a bit of practice one can see many of these problems just by looking at the EBNF version of the productions. A look at the three productions

    Radio        = { TalkShow | NewsBulletin | "song" | "advert" } EOF .
    NewsItem     =    "ANC" [ "cosatu" ] | "strike" | [ "cosatu" ] "ANC" | "zuma"
                   | "corruption" | "malema" "song" | Accident .
    NewsBulletin = "advert" NewsItem { NewsItem } [ Weather ] Filler .

quickly shows up the Rule 1 problems. The others may not be so obvious, although the second of these last three productions shows that [ "cosatu" ] is nullable, and one quickly discovers that Rule 2 is broken. Similarly, a look at

    TalkShow = "host" { "listener" "host" } [ Filler ] .

shows that [ Filler ] is nullable, and this combined with the production for Radio shows that Rule 2 must be broken for this nullable component as well.

How, if at all, can one find a better grammar? Several submissions simply gave up at this point, but their authors often gave very spurious (that is, indefensible) reasons. Just because a grammar is not LL(1) is no guarantee that you will not be able to find an equivalent LL(1) grammar - but by the same token, it is possible to find many non- LL(1) grammars that can be converted to LL(1) grammars quite easily (go and read section 7.3 of the textbook again), although in some cases this is not possible (read the discussion about the "dangling else" again).

Some of the problems in the grammar are easily overcome. If we change the production for TalkShow to

    TalkShow = "host" { "listener" "host" } .

we remove one of the Rule 2 problems, and we do not lose anything at all - if a Filler had been present it would have shown up as part of the next item.

We can remove the Rule 1 problem involving "advert" by a simple trick which is useful on many occasions - delay the factorization of the alternatives that cause problems

    Radio      = { TalkShow | "song" | "advert" [ RestOfNews ] } EOF .
    RestofNews = NewsItem { NewsItem } [ Weather ] Filler .

Conceptually a NewsBulletin in the old form - starting with an "advert" - is still required.

How do we get the problematic "cosatu" to go away? A careful look at the production for NewsItem reveals that the grammar is actually ambiguous - it appears we are allowed to parse a substring "ANC" "cosatu" "ANC" in two ways (either as ("ANC" "cosatu") ( e "ANC" ) or as ("ANC" e ) ( "cosatu" "ANC"). Now if we really want to find a grammar that is ambiguous (bearing in mind the rude remarks made in lectures that politicians love ambiguity, we might just want to do so!) we shall not be able to do this in an LL(1) compliant way. Trying to get behind the intent of the problem might suggest that we want a grammar in which it is possible to have news items on "ANC" without mentioning "cosatu" in the same breath, but if we ever mention "cosatu" it must either have been just after mentioning "ANC", or (if we had not done so), we have no option but to mention "ANC" straight afterwards.

One way to do this is to write the production:

    NewsItem  =   "ANC" [ "cosatu" ] | "strike" | "cosatu" "ANC"
                | "zuma" | "corruption" | "malema" "song" | Accident .

This gets rid of the Rule 1 violation, but not the Rule 2 violation. However, this one might not be serious - it is, in fact, exactly the same situation as the "dangling else" and a practical parser would resolve it in the same way that we discuss in lectures - that is, a news item on "cosatu" immediately following "ANC" would be bound to that story on "ANC", and not to one still to follow. So, presented with

"ANC" "cosatu" "ANC" "zuma"

would be handled as syntactically equivalent to

("ANC" "cosatu") ("ANC") ("zuma")

and not as

("ANC") ("cosatu" "ANC") "zuma"

Of course, in a sense, this behaviour of the parser might not be what was really wanted - this is a situation where the semantics have become "context sensitive". In the case of the "dangling else", the recursive descent parser works very well, and if one wants the opposite sort of effect, one can resort to code in { braces }.

Palindromes

Palindromes are character strings that read the same from either end, like "Hannah" or my brother's favourite line when he did the CTM ads: "Bob Bob". The following represent various attempts to find grammars that describe palindromes made only of the letters a and b:

     (1)        Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" .
     (2)        Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" .
     (3)        Palindrome = "a" [ Palindrome ] "a" | "b" [ Palindrome ] "b" .
     (4)        Palindrome = [ "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" ] .

Which grammars achieve their aim? If they do not, explain why not. Which of them are LL(1)? Can you find other (perhaps better) grammars that describe palindromes and which are LL(1)?

This is one of those awful problems that looks deceptively simple, and indeed is deceptive. We need to be able to cater for palindromes of odd or even length, and we need to be able to cater for palindromes of finite length, so that the "repetition" that one immediately thinks of has to be able to terminate.

Here are some that don't work:

   COMPILER Palindrome /* does not terminate */
   PRODUCTIONS
     Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" .
   END Palindrome.


   COMPILER Palindrome /* only allows odd length palindromes */
   PRODUCTIONS
     Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" .
   END Palindrome.

   COMPILER Palindrome /* only allows even length palindromes */
   PRODUCTIONS
     Palindrome = "a" [ Palindrome ] "a" | "b" [ Palindrome ] "b" .
   END Palindrome.

Of those grammars, the first seems to obey the LL(1) rules, but it is useless (it is not "reduced" in the sense of the definitions on page 68). The second one is obviously non-LL(1) as the terminals "a" and "b" can start more than one alternative. The third one is less obviously non-LL(1). If you rewrite it

   COMPILER Palindrome /* only allows even length palindromes */
   PRODUCTIONS
     Palindrome = "a" Extra "a" | "b" Extra "b" .
     Extra      = Palindrome | e .
   END Palindrome.

and note that Extra is nullable, then FIRST(Extra) = { "a", "b" } and FOLLOW(Extra) = { "a", "b" }.

Here is another attempt

   COMPILER Palindrome /* allows any length palindromes */
   PRODUCTIONS
     Palindrome = [ "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" ] .
   END Palindrome.

This describes both odd and even length palindromes, but is non-LL(1). Palindrome is nullable, and both FIRST(Palindrome) and FOLLOW(Palindrome) = { "a", "b" }. And, as most were quick to notice, it breaks Rule 1 immediately as well.

Other suggestions were:

   COMPILER Palindrome /* allows any length palindromes */
   PRODUCTIONS
     Palindrome =  "a"  Palindrome  "a" | "b"  Palindrome  "b" | [ "a" | "b" ] .
   END Palindrome.

which also breaks both rules fairly obviously, and

   COMPILER Palindrome /* allows any length palindromes */
   PRODUCTIONS
     Palindrome =  "a"  [ Palindrome  "a"] | "b"  [ Palindrome  "b" ] .
   END Palindrome.

but, ingenious as this appears, it does not work either. Rewritten it would become

   COMPILER Palindrome /* allows any length palindromes */
   PRODUCTIONS
     Palindrome =  "a"  PalA | "b" PalB .
     PalA       = Palindrome  "a" | .
     PalB       = Palindrome  "b" | .
   END Palindrome.

PalA and PalB are both nullable, and FIRST(PalA) = { "a" , "b" } while FOLLOW(PalA) = FOLLOW(Palindrome) = { "a", "b" } as well.

In fact, when you think about it, you simply will not be able to find an LL(1) grammar for this language. (That is fine; grammars don't have to be LL(1) to be valid grammars. They just have to be LL(1) or very close to LL(1) to be able to write recursive descent parsers.) Here's how to think about it. Suppose I asked you to hold your breath for as long as you could, and also to nod your head when you were half way through. I don't believe you could do it - you don't know before you begin exactly how long you will be holding your breath. Similarly, if I told you to get into my car and drive it till the tank was empty but to hoot the hooter when you were half way to running out you could not do it. Or if I told you to walk into a forest with your partner and kiss him/her when you were in the dead centre of the forest, you would not know when the magic moment had arrived.

LL(1) parsers have to be able to decide just by looking at one token exactly what to do next - if they have to guess when they are are half-way through parsing some structure they will not be able to do so. One would have to stop applying the options like Palindrome = "a" Palindrome "a" at the point where one had generated or analyzed half the palindrome, and if there is no distinctive character in the middle one could not expect the parser to be able to do so.

If course, if one changes the problem ever so slightly in that way one can find an LL(1) grammar. Suppose we want a grammar for palindromes that have matching a and b characters on either end and a distinctive c or pair of c characters in the centre:

   COMPILER Palindrome /* allows any length palindromes, but c must be in the middle */
   PRODUCTIONS
     Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" | "c" [ "c" ] .
   END Palindrome.

Reverse Polish Notation

You are now familiar with RPN or "Reverse Polish Notation" as a notation that can describe expressions without the need for parentheses. The notation eliminates parentheses by using "postfix" operators after the operands. To evaluate such expressions one uses a stack architecture, such as forms the basis of the PVM machine studied in the course. Examples of RPN expressions are:

3 4 + - equivalent to 3 + 4
3 4 5 + * - equivalent to 3 * (4 + 5)

In many cases an operator is taken to be "binary" - applied to the two preceding operands - but the notation is sometimes extended to incorporate "unary" operators - applied to one preceding operand:

4 sqrt - equivalent to sqrt(4)
5 - - equivalent to -5

Here are two attempts to write grammars describing an RPN expression:

          (G1)     RPN       =     RPN RPN binOp
                                 | RPN unaryOp
                                 | number .
                   binOp     =   "+" | "-" | "*" | "/" .
                   unaryOp   =   "-" | "sqrt" .

and

          (G2)     RPN       =   number REST .
                   REST      =   [ number REST binOp REST | unaryOp ].
                   binOp     =   "+" | "-" | "*" | "/" .
                   unaryOp   =   "-" | "sqrt" .

Are these grammars equivalent? Is either (or both) ambiguous? Do either or both conform to the LL(1) conditions? If not, explain clearly where the rules are broken, and come up with an LL(1) grammar that describes RPN notation, or else explain why it might be necessary to modify the language itself to overcome any problems you have uncovered.

The grammars are not equivalent. To show this we need only find one string that one will accept but the other will not, as many people realised. An example of such a string would be 45 sqrt sqrt, which can only be recognized by G1 in one way, but not at all by G2. Using the expression

15 6 - -

as an example, and by drawing appropriate parse trees, we can demonstrate that both of the grammars are ambiguous.

Using grammar G1:

                       RPN                                  RPN
              .--------------------.                    .------------.
              |         |          |                    |            |
             RPN       RPN       binOp                 RPN        unaryOp
              |      .------.      |              .-----------.      |
              |      |      |      |              |     |     |      |
            number  RPN  unaryOp   |             RPN   RPN  binOp    |
              |      |      |      |              |     |     |      |
              |    number   |      |            number number |      |
              |      |      |      |              |     |     |      |

             15      6      -      -              15    6     -      -

         (implied value is 21)                 (implied value is -9)


Using grammar G2:


                 RPN                                      RPN
         .------------------.                     .-------------------.
         |                  |                     |                   |
       number              REST                 number               REST
         |         .---------------------.        |           .-----------------.
         |         |      |        |     |        |           |     |      |    |
         |       number  REST    binOp  REST      |         number REST  binOP REST
         |         |      |        |     |        |           |     |      |    |
         |         |    unaryOp    |   (eps)      |           |   (eps)    | unaryOp
         |         |      |        |              |           |            |    |

         15        6      -        -              15          6            -    -

To analyse each of these grammars to check whether they conform to the LL(1) conditions:

G1 is left recursive and thus is immediately ruled out as a possible LL(1) grammar. There are two alternatives for the right side of the production for RPN that both start with RPN, so Rule 1 is broken, regardless of the exact nature of FIRST(RPN) - which in any case is number, and all three alternatives start with this

For G2, REST is nullable. First(REST) is { number, sqrt, - } while Follow(REST) is { +, -, / , * } so Rule 2 is broken.

To try to overcome the problem we seem at the outset to try to to find a different token to represent the operation of unary negation, as several people may have realised. If, for example, we use negate for the "unary minus", then G1 becomes non-ambiguous (but still not LL(1)) but G2, while now being LL(1), will still not recognize 4 sqrt sqrt. Like the palindrome example, this one is frustrating - it looks so easy. Again we emphasize that a valid grammar does not have to be LL(1) - it just helps when building recursive descent parsers.

One of the groups in 2012 suggested a better grammar might be as follows. What do you think?

    PRODUCTIONS
      RPN7    =   RPN7 ( unaryOp | RPN7 binOp )  |  number .
      binOp   = "+" | "-" | "*" | "/" .
      unaryOp = "-" | "sqrt" .
    END RPN7.

Meet the family in another format

Consider the following grammar:

   COMPILER Home
   IGNORE CHR(0) .. CHR(31)
   PRODUCTIONS
     Home      = Family { Pets } [ Vehicle ] "house" .
     Pets      = "dog" [ "cat" ] | "cat" .
     Vehicle   = ( "scooter" | "bicycle" ) "fourbyfour" .
     Family    = Parents { Children } .
     Parents   = [ "Dad" ] [ "Mom" ] | "Mom" "Dad" .
     Child     =   "Helen" | "Margaret" | "Alice" | "Robyn" | "Cathy"
                 | "Janet" | "Anne" | "Ntombizodwa" | "Ntombizanele" .
   END Home.

Analyse this grammar in detail. Is it LL(1) compliant? If not, why not?

The first point to be made is that this is not a reduced grammar. The non-terminal Child is unreachable, and there is no way that the non-terminal Children can be derived to anything, let alone to terminals. Presumably what was meant was

   COMPILER Home
   IGNORE CHR(0) .. CHR(31)
   PRODUCTIONS
     Home      = Family { Pets } [ Vehicle ] "house" .
     Pets      = "dog" [ "cat" ] | "cat" .
     Vehicle   = ( "scooter" | "bicycle" ) "fourbyfour" .
     Family    = Parents { Child } .
     Parents   = [ "Dad" ] [ "Mom" ] | "Mom" "Dad" .
     Child     =   "Helen" | "Margaret" | "Alice" | "Robyn" | "Cathy"
                 | "Janet" | "Anne" | "Ntombizodwa" | "Ntombizanele" .
   END Home.

If we introduce extra non-terminals to eliminate the [ ] and { } metabrackets we might get:

   COMPILER Home
   IGNORE CHR(0) .. CHR(31)
   PRODUCTIONS
     Home        = Family AllPets Vehicle "house" .
     AllPets     = Pets AllPets | .
     Pets        = "dog" OptionalCat | "cat" .
     OptionalCat = "cat" | .
     Vehicle     = TwoWheeled "fourbyfour" | .
     TwoWheeled  = "scooter" | "bicycle" .
     Family      = Parents Children .
     Children    = Child Children | .
     Parents     = OptionalDad OptionalMom | "Mom" "Dad".
     OptionalDad = "Dad" | .
     OptionalMom = "Mom" | .
     Child       =   "Helen" | "Margaret" | "Alice" | "Robyn" | "Cathy"
                   | "Janet" | "Anne" | "Ntombizodwa" | "Ntombizanele" .
   END Home.

It should be pretty apparent that the productions for Home and Family cause no problems (no alternatives appear in their right hand sides), nor do the productions for Pets, TwoWheeled and Child (they are not nullable, and the alternatives begin with clearly distinct terminals).

The production for Parents needs closer scrutiny.

     FIRST(Parents_1) = FIRST(OptionalDad) U FIRST(OptionalMom) = { "Dad", "Mom" }
     (because OptionalDad is nullable)

     FIRST(Parents_2) = { "Mom" }

so Rule 1 is broken, and the grammar is not LL(1) compliant.

We can check Rule 2, as there are several productions that have alternatives, one of which is nullable. These are the productions for AllPets, OptionalCat, Vehicle, Children, Parents, OptionalDad and OptionalMom (yes, Sylvia, real grammars often have lots of exciting complications).

This means that we must look at

      FIRST(AllPets) and FOLLOW(AllPets)
      FIRST(OptionalCat) and FOLLOW(OptionalCat)
      FIRST(Vehicle) and FOLLOW(Vehicle) etc.

The results follow

      FIRST(AllPets)  = { "dog", "cat" }
      FOLLOW(AllPets) = { "house", "scooter", "bike" }

      FIRST(OptionalCat) = { "cat" }
      FOLLOW(OptionalCat) = { "dog", "cat", "house", "scooter", "bike" }

(so Rule 2 is broken here, perhaps surprisingly)

    FIRST(Vehicle)      = { "scooter", "bicycle" }
    FOLLOW(Vehicle)     = { "house" }

    FIRST(Children)     = { "Helen", "Margaret", "Alice" .... "Ntombizanele" }
    FOLLOW(Children)    = { "dog", "cat", "house", "scooter", "bike" }

    FIRST(Parents)      = { "Mom", "Dad" }
    FOLLOW(Parents)     = { "Helen", "Margaret", "Alice" .... "Ntombizanele",
                            "dog", "cat", "house", "scooter", "bike" }

    FIRST(OptionalDad)  = { "Dad" }
    FOLLOW(OptionalDad) = { "Mom, "Helen", "Margaret", "Alice" .... "Ntombizanele",
                            "dog", "cat", "house", "scooter", "bike" }

    FIRST(OptionalMom)  = { "Mom" }
    FOLLOW(OptionalMom) = { "Helen", "Margaret", "Alice" .... "Ntombizanele",
                            "dog", "cat", "house", "scooter", "bike" }

We can get an LL(1) description of the family as follows:

     Home5     = Family { Pets } [ Vehicle ] "house" .
     Pets      = "dog" | "cat" .
     Vehicle   = ( "scooter" | "bicycle" ) "fourbyfour" .
     Family    = Parents { Child } .
     Parents   = [ "Dad" [ "Mom" ] | "Mom" [ "Dad" ] ] .
     Child     =   "Helen" | "Margaret" | "Alice" | "Robyn" | "Cathy"
                 | "Janet" | "Anne" | "Ntombizodwa" | "Ntombizanele" .