Practical 22 solutions

Computer Science 3 - 2006

Programming Language Translation

Practical for Week 22, beginning 25 September 2006 - Solutions

This tutorial/practical was not always well done. Many people could "guess" the answers, but could not or did not justify their conclusions. If set in exams, in these sorts of questions it is important to do so.

As usual, you can find a "solution kit" as PRAC22A.ZIP or PRAC22AC.ZIP if you wish to experiment further.

Task 1 Steam Radio

The original grammar had productions

    Radio        = { TalkShow | NewsBulletin | "music" | "advert" } EOF .
    NewsBulletin = "advert" NewsItem { NewsItem } [ Weather ] Filler .
    NewsItem     = "zuma" [ "shaik" ] | "strike" | [ "shaik" ] "zuma" | "mbeki" | "randFalls" | "accident" .
    TalkShow     = "host" { "listener" "host" } [ Filler ] .
    Filler       = "music" | "advert"  .
    Weather      = { "snow" | "rain" | "cloudy" | "windy" } .

This is pretty obviously non-LL(1) and most people could see one of the obvious reasons, and looked no further. But here is what I hoped you might have done.

If we rewrite the grammar without meta-brackets we get something like:

    SAFM         = Programmes EOF .
    Programmes   = Programme Programmes | e .
    Programme    = TalkShow | NewsBulletin | "music" | "advert" .
    NewsBulletin = "advert" NewsItems OptWeather Filler .
    NewsItems    = NewsItem NewsItems | e .
    OptWeather   = Weather | e .
    NewsItem     = "zuma" OptShaik | "strike" | OptShaik "zuma" | "mbeki" | "randFalls" | "accident" .
    OptShaik     = "shaik" | e .
    TalkShow     = "host" Exchanges OptFiller .
    Exchanges    = Exchange Exchanges | e .
    OptFiller    = Filler | e .
    Exchange     = "listener" "host" .
    Filler       = "music" | "advert"  .
    Weather      = OneWeather Weather | e .
    OneWeather   = "snow" | "rain" | "cloudy" | "windy" .

Rule 1 is broken in only two places:

(a) One of the options for Programme, that is NewsBulletin, starts with "advert", which is an option in its own right for Programme.

(b) Since OptShaik is nullable, two of the options in NewsItem can both start with "zuma"

To check Rule 2 we need to look at the nullable non-terminals, which are

Programmes, NewsItems, OptWeather, Weather, OptShaik, Exchanges, OptFiller

You should check through in detail, but the two that cause problems are OptShaik and OptFiller:

    FIRST(OptShaik)  = { shaik }
    FOLLOW(OptShaik) = { music, advert, zuma, strike, mbeki, randFalls, accident, shaik,
                         snow, rain, cloudy, windy }

    FIRST(OptFiller) = { music, advert }
    FOLLOW(OptFiller) = {music, advert, host }

As mentioned in class, with a bit of practice one can see many of these problems just by looking at the EBNF version of the productions. A look at the three productions

    Radio        = { TalkShow | NewsBulletin | "music" | "advert" } EOF .
    NewsItem     = "zuma" [ "shaik" ] | "strike" | [ "shaik" ] "zuma" | "mbeki" | "randFalls" | "accident" .
    NewsBulletin = "advert" NewsItem { NewsItem } [ Weather ] Filler .

quickly shows up the Rule 1 problems. The others may not be so obvious, although the second of these shows that "shaik" is nullable, and that Rule 2 is broken follows quite quickly. Similarly, a look at

    TalkShow = "host" { "listener" "host" } [ Filler ] .

shows that [ Filler ] is nullable, and this combined with the production for Radio shows that Rule 2 must be broken for this nullable component as well.

How, if at all, can one find a better grammar? Several submissions simply gave up at this point, but their authors often gave very spurious (that is, indefensible) reasons. Just because a grammar is not LL(1) is no guarantee that you cannot find an equivalent LL(1) grammar - but by the same token, it is possible to find non-LL(1) grammars that can be converted to LL(1) grammars quite easily (go and read section 7.3 of the textbook again), although in some cases this is not possible (read the discussion about the "dangling else" again).

Some of the problems in the grammar are easily overcome. If we change the production for TalkShow to

    TalkShow = "host" { "listener" "host" } .

we remove one of the Rule 2 problems, and we do not lose anything at all - if a Filler had been present it would have shown up as part of the next item.

We can remove the Rule 1 problem involving "advert" by a simple trick which is useful on many occasions - delay the factorization of the alternatives that cause problems

    Radio      = { TalkShow | "music" | "advert" [ RestOfNews ] } EOF .
    RestofNews = NewsItem { NewsItem } [ Weather ] Filler .

Conceptually a NewsBulletin in the old form - starting with an "advert" - is still required.

How do we get the problem of "shaik" to go away? A careful look at the production for NewsItem reveals that the grammar is actually ambiguous - it appears we are allowed to parse a substring "zuma" "shaik" "zuma" in two ways (either as ("zuma" "shaik") ( e "zuma" ) or as ("zuma" e ) ( "shaik" "zuma"). Now if we really want to find a grammar that is ambiguous (bearing in mind the rude remarks made in lectures that politicians love ambiguity, we might just want to do so!) we shall not be able to do this in an LL(1) compliant way. Trying to get behind the intent of the problem might suggest that we want a grammar in which it is possible to have news items on "zuma" without mentioning "shaik" in the same breath, but if we ever mention "shaik" it must either have been just after mentioning "zuma", or (if we had not done so), we have no option but to mention "zuma" straight afterwards.

One way to do this is to write the production:

    NewsItem  = "zuma" [ "shaik" ] | "strike" | "shaik" "zuma" | "mbeki" | "randFalls" | "accident" .

This gets rid of the Rule 1 violation, but not the Rule 2 violation. However, this one would not be serious - it is, in fact, exactly the same situation as the "dangling else" and a practical parser would resolve it in the same way that we discuss in lectures - that is, a news item on "shaik" immediately following "zuma" would be bound to that story on "zuma", and not to one still to follow. So, presented with

"zuma" "shaik" "zuma" "mbeki"

would be handled as syntactically equivalent to

("zuma" "shaik") ("zuma") ("mbeki")

and not as

("zuma") ("shaik" "zuma") "mbeki"

Of course, in a sense, these are not necessarily semantically equivalent - but this is a situation where the semantics become "context sensitive"!

Task 2 - Palindromes

Palindromes are character strings that read the same from either end. You were invited to explore various ways of finding grammars that describe palindromes made only of the letters a and b:

     (1)        Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" .
     (2)        Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" .
     (3)        Palindrome = "a" [ Palindrome ] "a" | "b" [ Palindrome ] "b" .
     (4)        Palindrome = [ "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" ] .

Which grammars achieve their aim? If they do not, explain why not. Which of them are LL(1)? Can you find other (perhaps better) grammars that describe palindromes and which are LL(1)?

This is one of those awful problems that looks deceptively simple, and indeed is deceptive. We need to be able to cater for palindromes of odd or even length, and we need to be able to cater for palindromes of finite length, so that the "repetition" that one immediately thinks of has to be able to terminate.

Here are some that don't work:

   COMPILER Palindrome /* does not terminate */
   PRODUCTIONS
     Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" .
   END Palindrome.

   COMPILER Palindrome /* only allows odd length palindromes */
   PRODUCTIONS
     Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" .
   END Palindrome.

   COMPILER Palindrome /* only allows even length palindromes */
   PRODUCTIONS
     Palindrome = "a" [ Palindrome ] "a" | "b" [ Palindrome ] "b" .
   END Palindrome.

Of those grammars, the first seems to obey the LL(1) rules, but it is useless (it is not "reduced" in the sense of the definitions on page 129). The second one is obviously non-LL(1) as the terminals "a" and "b" can start more than one alternative. The third one is less obviously non-LL(1). If you rewrite it

   COMPILER Palindrome /* only allows even length palindromes */
   PRODUCTIONS
     Palindrome = "a" Extra "a" | "b" Extra "b" .
     Extra      = Palindrome | e .
   END Palindrome.

and note that Extra is nullable, then FIRST(Extra) = { "a", "b" } and FOLLOW(Extra) = { "a", "b" }.

Here is another attempt

   COMPILER Palindrome /* allows any length palindromes */
   PRODUCTIONS
     Palindrome = [ "a"  Palindrome  "a" | "b"  Palindrome  "b" | "a" | "b" ] .
   END Palindrome.

This describes both odd and even length palindromes, but is non-LL(1). Palindrome is nullable, and both FIRST(Palindrome) and FOLLOW(Palindrome) = { "a", "b" }. And, as most were quick to notice, it breaks Rule 1 immediately as well.

Other suggestions were:

   COMPILER Palindrome /* allows any length palindromes */
   PRODUCTIONS
     Palindrome =  "a"  [ Palindrome  "a"] | "b"  [ Palindrome  "b" ] .
   END Palindrome.

but, ingenious as this appears, it does not work either. Rewritten it would become

   COMPILER Palindrome /* allows any length palindromes */
   PRODUCTIONS
     Palindrome =  "a"  PalA | "b" PalB .
     PalA       = Palindrome  "a" | .
     PalB       = Palindrome  "b" | .
   END Palindrome.

PalA and PalB are both nullable, and FIRST(PalA) = { "a" , "b" } while FOLLOW(PalA) = FOLLOW(Palindrome) = { "a", "b" } as well.

In fact, when you think about it, you simply will not be able to find an LL(1) grammar for this language. (That is fine; grammars don't have to be LL(1) to be valid grammars. They just have to be LL(1) or very close to LL(1) to be able to write recursive descent parsers.) Here's how to think about it. Suppose I asked you to hold your breath for as long as you could, and also to nod your head when you were half way through. I don't believe you could do it - you don't know before you begin exactly how long you will be holding your breath. Similarly, if I told you to get into my car and drive it till the tank was empty but to hoot the hooter when you were half way to running out you could not do it. Or if I told you to walk into a forest with your partner and kiss him/her when you were in the dead centre of the forest, you would not know when the magic moment had arrived.

LL(1) parsers have to be able to decide just by looking at one token exactly what to do next - if they have to guess when they are are half-way through parsing some structure they will not be able to do so. One would have to stop applying the options like Palindrome = "a" Palindrome "a" at the point where one had generated or analyzed half the palindrome, and if there is no distinctive character in the middle one would not expect the parser to be able to do so.

If course, if one changes the problem ever so slightly in that way one can find an LL(1) grammar. Suppose we want a grammar for palindromes that have matching a and b characters on either end and a distinctive c or pair of c characters in the centre:

   COMPILER Palindrome /* allows any length palindromes, but c must be in the middle */
   PRODUCTIONS
     Palindrome = "a"  Palindrome  "a" | "b"  Palindrome  "b" | "c" [ "c" ] .
   END Palindrome.

Task 3 - Pause for thought

Which of the following statements are true? Justify your answer.

(a) An LL(1) grammar cannot be ambiguous.
(b) A non-LL(1) grammar must be ambiguous.
(c) An ambiguous language cannot be described by an LL(1) grammar.
(d) It is possible to find an LL(1) grammar to describe any non-ambiguous language.

To answer this sort of question you must be able to argue convincingly, and most people did not do that at all!

(a) is TRUE. An LL(1) grammar cannot be ambiguous. If a language can be described by an LL(1) grammar it will always be able to find a single valid parse tree for any valid sentence, and no parse tree for an invalid sentence. The rules imply that no indecision can exist at any stage - either you can find a unique way to continue the implicit derivation from the goal symbol, or you have to conclude that the sentence is malformed.

But you cannot immediately conclude any of the "opposite" statements, other than (c) which is TRUE. If you really want to define an ambiguous language (and you may have perfectly good/nefarious reasons for doing so - stand-up comedians do it all the time) you will not be able to describe it by an LL(1) grammar, which has the property that it can only be used for deterministic parsing.

In particular (b) is FALSE. We can "justify" this by giving just a single counter example to the claim that it might be true. We have seen several such grammars. The palindrome grammars above are like this - even though they are non LL(1) for the reasons given, they are quite unambiguous - you would only be able to parse any palindrome in one way! (Many people seemed not to realise this - they were incorrectly concluding that non- LL(1) inevitably implied ambiguity.(

Similarly (d) is FALSE. Once again the palindrome example suffices - this language is simple, unambiguous, but we can easily argue that it is impossible to find an LL(1) grammar to describe it.

Task 5 - All very logical!

This involved developing a grammar to describe a Boolean "calculator", which might accept a sequence of expressions like

      a AND B OR (C OR NOT D) =     a . b + (c + d') =     a b + (c + d') =
      NOT (a OR b) AND TRUE =       (a + b)' . 1 =         (a + b)' AND 1 =
      b AND NOT C AND D =           b . c' . d =           b c' d =

Most submissions got this correct, or very nearly so. Here is my suggestion:

  COMPILER Bool $CN
  /* Boolean expression parser
     P.D. Terry, Rhodes University, 2006 */

  CHARACTERS
    letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
  TOKENS
    variable   = letter .

  COMMENTS FROM "(*" TO "*)"  NESTED
  COMMENTS FROM "/*" TO "*/"  NESTED

  IGNORE CHR(0) .. CHR(31)

  PRODUCTIONS
    Bool       = { Expression "=" } .
    Expression = Term { Or Term } .
    Term       = Factor { [ And ] Factor } .
    Factor     = "NOT" Factor | Primary { "'" } .
    Primary    = True | False | variable | "(" Expression ")" .
    True       = "TRUE" | "1" .
    False      = "FALSE" | "0" .
    And        = "AND" | "&" | "." .
    Or         = "OR" | "|" | "+" .
  END Bool.

Points that were missed were (a) the AND operator was "optional" and (b) the leading unary NOT operator or the postfix ' operator can be applied repeatedly. Expressions like

     NOT NOT a
     a AND b''

are quite legal. Finally (c) although at this stage it may not be obvious, the grammar above is better (from the point of view of associativity and evaluation) than one with a simpler production

    Factor     = { "NOT" } ( True | False | variable | "(" Expression ")" ) { "'" } .

although this latter one is equivalent syntactically.

You were asked to comment on the trees that might be produced from an application of de Morgan's laws, though not many people commented on this. They are, of course, different, but that is not serious. Trees for a + b * c and c * b + a resulting from applying an arithmetic grammar would also be different, though one would readily recognize that the underlying semantic meanings would be identical.

Task 6 - Deja vu all over again - the joys of CSC 201 assembler

You were invited to develop a grammar for programs written in the CSC 201 toy assembler language, such as exemplified by

          BEG             ; Count the bits in a number
          CLA             ; CPU.A := 0
          STA     BITS    ; BITS := 0
          INI             ; Read(CPU.A)
  LOOP                    ; REPEAT
          SHR             ;  CPU.A := CPU.A DIV 2
          BCC     EVEN    ;  IF CPU.A MOD 2 # 0 THEN
          PSH             ;    save CPU.A on stack
          LDA     BITS
          INC
          STA     BITS    ;    BITS := BITS + 1
          POP             ;    restore CPU.A
  EVEN    BNZ     LOOP    ; UNTIL CPU.A = 0
          LDA     BITS    ;
          OTI             ; Write(BITS)
          HLT             ; terminate execution
  BITS    DS      1       ; BYTE BITS
          END

There were a whole lot of points missed, understandably, since (a) you had forgotten and/or (b) might never have explored the vagaries of this language fully.

One important feature, which most groups realized, was that one should split the opcodes into two groups - those that require an Address field, and those that do not. All the more surprising then, that in the test for this week, so few people did this for the other assembler language!

Relatively few understood how the address field could be defined. As is typical of many assembler systems, the expressions here can mix numbers, labels, strings and the special term denoted by * (few people had got close to understanding this last one).

Finally, in assembler languages the rule is almost invariably at most "one statement per line". But lines can be devoid of opcodes, and have only labels or comments on them. So we need to be careful to capture this idea too.

Here is a possible grammar. This makes use of the Coco/R directive SYNC that we have not discussed yet - it signals a convenient synchronization point. If no EOL is found where one should occur, the scanner simply consumes tokens until the EOL is found - this is a simple and very effective error recovery technique for "one statement per line" systems.

  COMPILER ASM $NC

  CHARACTERS
    lf         = CHR(10) .
    cr         = CHR(13) .
    control    = CHR(0) .. CHR(31) .
    letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
    digit      = "0123456789" .
    binDigit   = "01" .
    hexDigit   = digit + "ABCDEFabcdef" .
    charCh     = ANY - "'" - control .
    printable  = ANY - control .

  TOKENS
    number     = digit { digit } | binDigit { binDigit } "%" | digit { hexDigit } "H" .
    identifier = letter { letter | digit } .
    string     = "'" charCh { charCh } "'" .
    EOL        = cr lf | lf .
    comment    = ";" { printable } .

  PRODUCTIONS
    ASM               = StatementSequence .
    StatementSequence = { Statement [ comment ] SYNC EOL } .
    Statement         = [ Label ] [ OneByteOp | TwoByteOp Address ] .
    OneByteOp         =    "ASR"  | "BEG"  | "CLA"  | "CLC"  | "CLV"  | "CLX"  | "CMA"
                         | "CMC"  | "DEC"  | "DEX"  | "END"  | "HLT"  | "INA"  | "INB"
                         | "INC"  | "INH"  | "INI"  | "INX"  | "NOP"  | "OTA"  | "OTB"
                         | "OTC"  | "OTH"  | "OTI"  | "POP"  | "PSH"  | "RET"  | "SHL"
                         | "SHR"  | "TAX" .
    TwoByteOp         =    "ACI"  | "ACX"  | "ADC"  | "ADD"  | "ADI"  | "ADX"  | "ANA"
                         | "ANI"  | "ANX"  | "BCC"  | "BCS"  | "BGE"  | "BGT"  | "BLE"
                         | "BLT"  | "BNG"  | "BNZ"  | "BPZ"  | "BRN"  | "BVC"  | "BVS"
                         | "BZE"  | "CMP"  | "CPI"  | "CPX"  | "DC"   | "DS"   | "EQU"
                         | "JGE"  | "JGT"  | "JLE"  | "JLT"  | "JSR"  | "LDA"  | "LDI"
                         | "LDX"  | "LSI"  | "LSP"  | "ORA"  | "ORG"  | "ORI"  | "ORX"
                         | "SBC"  | "SBI"  | "SBX"  | "SCI"  | "SCX"  | "STA"  | "STX"
                         | "SUB"  .
    Address           = Term { '+' Term | '-' Term } .
    Term              = Label | number | string | '*' .
    Label             = identifier .
  END ASM.

As an alternative, one might have defined

  COMMENTS FROM ";" TO lf

  PRODUCTIONS
    ...
    StatementSequence = { Statement SYNC EOL } .

The grammar is actually rather too accommodating, as some people realised. Strictly we should not allow BEG and END to appear more than once each, in very definite places. Directives like EQU must have labels - a statement sequence like

         MAX  EQU  12
              EQU  13
         MIN  EQU  14

must be meaningless. And strings of length greater than 1 are really only meaningful in DC directives.

Enforcing all these constraints is tricky, and in the case of the EQU situation, I think impossible if one wants to keep an LL(1) grammar. There is another way around that problem that we can explore in the next practical. The BEG/END restriction is possible syntactically, though its a bit messy, since we must allow for leading blank lines before BEG and for trailing blank lines after END. If we retain "comment" as a token we would get

  PRODUCTIONS
    ASM               = Begin StatementSequence End.
    StatementSequence = { Statement [ comment ] SYNC EOL } .
    Statement         = [ Label ] [ OneByteOp | TwoByteOp Address ] .
    OneByteOp         = /* as before, but omit BEG and END */ .
    TwoByteOp         = /* as before */
    Begin             = { [ comment ] EOL } "BEG" [ comment ] EOL .
    End               = { "END" [ comment ] EOL { [ comment ] EOL } .

At first I thought one would not be able to use the COMMENT directive in Coco/R to handle this sort of language. Surely if one defines

COMMENTS FROM ";" TO lf

then the scanner should have consumed the LF as part of the comment, and so looking for another one as an EOL token immediately afterwards suggests that one would get into trouble. As it happens (and as several people discovered) it does work - to try to handle the different ways in which operating systems record line breaks (CR on an Apple, LF on Unix and CR/LF on Windoze) Coco/R has a special way of handling the LF behind your back.

The grammar above has used a token definition for number

  TOKENS
    number = digit { digit } | binDigit { binDigit } "%" | digit { hexDigit } "H" .

This is quite adequate here, but it would be preferable to define three different tokens if one was interested in determining the underlying values - as one would be in a real assembler that generated code.

  TOKENS
    decnumber = digit { digit } .
    binNumber = binDigit { binDigit } "%" .
    hexNumber = digit { hexDigit } "H" .

Task 7 - Parva expressions are not like those in C# and Java

The grammar for expressions in Parva employs very few levels of operator precedence, corresponding exactly to the levels found in Pascal and its derivatives. You were asked to modify the Parva grammar from last week's practical so that it recognizes expressions whose precedence rules are equivalent to those found in C# or Java.

I think most people got at least part of the way there. The correct answer would be as below.

A few points were missed by many people. Note that the rule for RelExp allows only one further component - this prevents expressions from being formed like

a < b < c > d

which could have no meaaning. We divide the "equality" operators from the "relational" operators because expressions like

a == b == true != false

could have meaning. Finally, note the way in which the unary operators +, -and ! enter the precedence hierarchy. This grammar allows expressions to be formed like

a + - + b

(whatever turns you on!) which is forbidden in Pascal.

  COMPILER Parva $CN
     .....
    Condition         = Expression .
    Expression        = AndExp { "||" AndExp } .
    AndExp            = EqlExp { "&&" EqlExp } .
    EqlExp            = RelExp { EqlOp RelExp } .
    RelExp            = AddExp [ RelOp AddExp | "in" ExpList ] .
    AddExp            = MulExp { AddOp MulExp } .
    MulExp            = Factor { MulOp Factor } .
    Factor            = Primary | "+" Factor | "-" Factor | "!" Factor .
    Primary           =   Designator | Constant
                        | "new" BasicType "[" Expression "]"
                        | "(" Expression ")" .
    AddOp             = "+" | "-" .
    MulOp             = "*" | "/" | "%" .
    EqlOp             = "==" | "!=" .
    RelOp             = ">" | ">=" | "<" | "<=" .
  END Parva.

You were asked to consider why languages derived from C have so many levels of precedence and the rules they have for associativity, and what the advantages and disadvantages might be over languages like Pascal that get away with fewer.

I guess the designers of C thought that this would be a good idea because it allows you to write statements like

if ( a > b and c < d) DoSomething();

which in a Pascal-oriented system would have to be expressed with more parentheses:

if ( (a > b) and (c < d) ) DoSomething();

(Of course, one sensibly inserts extra parentheses into any expression if one is not sure of the precedence rules!)

There is another point that one can make, however. Now that you have seen how recursive descent parsers work, you should be able to appreciate that if a C-like compiler for Parva has to recognize the simple expression "a" it must go through a sequence of eight function calls - first to Expression(), then to AndExp(), then to OrExp(), then to EqlExp() ... until it gets to Primary() . In the original grammar it would have to make only four function calls. Function calls take space, and they take time. In the 70's, when memory was expensive and tight and computers were slow, it was a distinct advantage to have small, tight parsers. Niklaus Wirth is a role model for folk like myself of the "keep it as simple as you can" persuasion - I am sure that he would have made a very definite engineering decision.