Computer Science 301 - 2001

Test 2 on Translators - 20 September 2001

Please answer all questions 0 ... 4, in the spaces provided (five sides). Text books may not be used.

0. What is your surname (+ initials please) [ 1 mark ]

1. In the land of Phool the official language is Phoolish. A Phoolish professor noticed that many of his students had not mastered the syntax of Phoolish well. Tired of correcting the many syntactical mistakes, he decided to challenge the students and asked them to write a program that could check the syntactical correctness of any sentence they wrote. Similar to the nature of Phools, the syntax of Phoolish is also pleasantly simple. Here are the rules:

0. The characters in the language are the 26 letters of the standard western alphabet and the language distinguishes between lower and upper case letters.
1. Every lower case letter is a simple correct sentence.
2. If σ is a correct sentence, then so is a sentence consisting of an upper case vowel followed by σ, that is Aσ Eσ Iσ Oσ or Us
3. If σ and τ are correct sentences, then so is a sentence consisting of an upper case consonant followed by στ, for example Bστ Cστ Dστ ... Zστ.
4. Rules 0 to 3 are the only rules that determine the syntactical correctness of a sentence.

Complete the following Cocol description of this language: [ 6 marks ]

COMPILER Phoolish IGNORE CHR(1) .. CHR(31) CHARACTERS /* if necessary or useful */ TOKENS /* if necessary or useful */ PRODUCTIONS Phoolish = END Phoolish.

This was unbelievably badly done, especially as it was a question you were supposed to have seen before (and would have done had you come to the tutorial sessions). The two points that many students missed were (a) one cannot use character set names within the PRODUCTIONS section directly (b) the recursion has to be done in terms of the Phoolish non-terminal itself.

        COMPILER Phoolish
        IGNORE CHR(1) .. CHR(31)
        CHARACTERS
          letter    = "abcdefghijklmnopqrstuvwxyz" .
          consonant = "BCDFGHJKLMNPQRSTVWXYZ" .
          vowel     = "AEIOU" .
        TOKENS
          Lletter    = letter .
          Uconsonant = consonant .
          Uvowel     = vowel .
        PRODUCTIONS
          Phoolish = Lletter | Uvowel Phoolish | Uconsonant Phoolish Phoolish .
        END Phoolish.

2. (a) Explain what is meant by the FIRST and FOLLOW sets as applied to grammar analysis, and give a concise (mathematical) definition of set membership for each of these. [ 4 marks ]

Predictably, this question was very badly done. I never understand why the most important theoretical part of this course seems to be ignored every year by so many students! Be careful to appreciate that we usually talk about FIRST(string) but FOLLOW(NonTerminal).

The terminal start symbols of a general string ξ are the elements of the set of all terminals with which ξ or a string derived from ξ can start, that is

a ∈ FIRST(ξ) if ξ ⇒ ^* aζ, (a ∈ T ⁺ , ζ ∈ (N ∪ T ) ^* )

with the ad hoc rule that FIRST(ε) = Φ. Note that ε is not a member of the terminal vocabulary T, and that it is important to distinguish between FIRST(ξ) and FIRST(NonTerminal). The string ξ might consist of a single non-terminal A, but in general it might be a concatenation of several symbols.

The terminal successors of a non-terminal A are the elements of the set of all terminals that can follow A in any sentential form, that is

a ∈ FOLLOW(A) if S ⇒ ^* ξAaζ, (A ∈ N , a ∈ T ⁺ , x, z ∈ (N ∪ T ) ^* )

(b) State the rules that a grammar must obey for it to be LL(1). Do not write vague English; describe the rules in concise mathematics. [6]

Rule 1

When the productions for any non-terminal A admit alternatives

A → ξ₁ | ξ₂ | . . . ξ_n

but where ξ_k ≠> ε for any k, the sets of initial terminal symbols of all sentences that can be generated from each of the ξ_k's must be disjoint, that is

FIRST(ξ_j) ∩ FIRST(ξ_k) = Φ for all j ≠ k

Rule 2

When the productions for a non-terminal A admit alternatives

A → ξ₁ | ξ₂ | . . . ξ_n

and in particular where ξ_k ⇒ ε for some k, the sets of initial terminal symbols of all sentences that can be generated from each of the ξ_j for j ≠ k must be disjoint from the set FOLLOW(A) of symbols that may follow any sequence generated from A, that is

FIRST(ξ_j) ∩ FOLLOW(A) = Φ, j ≠ k

or, rather more loosely,

FIRST(A) ∩ FOLLOW(A) = F

3. The following is a Cocol description of a Boolean expression

       COMPILER Bool

       IGNORE CHR(1) .. CHR(31)

       PRODUCTIONS
         Bool       = Term { "OR" Term } .
         Term       = Factor { [ "AND" ] Factor } .
         Factor     = "NOT" Factor | Primary { "'" } .
         Primary    = "TRUE" | "FALSE" | "(" Bool ")" .
       END Bool.

(I apologise - the question paper had a typographical error in the production for Primary. However, this did not really affect the question).

(a) Rewrite the productions to give an equivalent grammar that does not use the EBNF meta-brackets { ... } or [ ... ] [ 4 marks ]

There are various ways in which this can be done, but the safest way to proceed, which is what many students did, is to use the way suggested in section 9.2.3 of the book and come up with:

        Bool         = Term MoreTerms .
        MoreTerms    = "OR" Term MoreTerms | ε .
        Term         = Factor MoreFactors .
        MoreFactors  = OptionalAND Factor MoreFactors | ε  .
        OptionalAND  = "AND" | ε .
        Factor       = "NOT" Factor | Primary Quotes .
        Quotes       = "'" Quotes | ε .
        Primary      = "TRUE" | "FALSE" | "(" Bool ")" .

(b) Analyse the grammar in the form produced in (a) to determine whether it satisfies to the LL(1) constraints for recursive descent parsing. [ 6 marks ]

Here the only productions that might cause complications for Rule 1 are those for Factor and for Primary.

        FIRST(Factor₁) = { "NOT" }
        FIRST(Factor₂) = { "TRUE", "FALSE", "(" }

which have nothing in common.

        FIRST(Primary₁) = { "TRUE" }
        FIRST(Primary₂) = { "FALSE" }
        FIRST(Primary₃) = { "(" }

and no pair of these has anything in common. There are several nullable productions - those for

        MoreTerms    = "OR" Term MoreTerms | ε .
        MoreFactors  = OptionalAND Factor MoreFactors | ε  .
        OptionalAND  = "AND" | ε .
        Quotes       = "'" Quotes | ε .

which are the only ones we have to consider in checking Rule 2. But when we look at the appropriate FIRST and FOLLOW sets we find the following (not everyone seems to know how to do these!)

        FIRST(MoreTerms)  = { "OR" }
        FOLLOW(MoreTerms) = { ")" }

        FIRST(MoreFactors)  = { "AND" }
        FOLLOW(MoreFactors) = { "OR", ")" }

        FIRST(OptionalAND)  = { "AND" }
        FOLLOW(OptionalAND) = { "NOT", "TRUE", "FALSE", "(" }

        FIRST(Quotes)  = { "'" }
        FOLLOW(Quotes) = { "OR", "AND", "NOT", "TRUE", "FALSE", "(", ")" }

Alternatively, as some people did, one could transform the grammar on the lines of several other expression grammars that we have seen, namely to use left recursive productions:

        Bool        = Term | Bool "OR" Term .
        Term        = Factor | Term OptionalAND Factor .
        OptionalAND = "AND" | ε  .
        Factor      = "NOT" Factor | Primary Quotes .
        Quotes      = "'" Quotes | ε  .
        Primary     = "TRUE" | "FALSE" | "(" Bool ")" .

Left recursive grammars cannot be LL(1), as we have discusssed on several occasions. Finally, there were some solutions submitted that were ambiguous, on the lines of

        Bool        = Term | Term "OR" Term .
        Term        = Factor | Factor OptionalAND Factor .
        OptionalAND = "AND" | ε  .
        Factor      = "NOT" Factor | Primary Quotes .
        Quotes      = "'" Quotes | ε  .
        Primary     = "TRUE" | "FALSE" | "(" Bool ")" .

which allow the correct syntax, but you should try to avoid those - see section 8.4 in the text.

(c) Assuming that you have suitable Accept and GetSym routines similar to those used in an earlier practical (you need not develop these at all), and that GetSym can decode the globally accessible variable Sym as one of the tokens

          { EOFSym, unknownSym, trueSym, falseSym, notSym, andSym, orSym, quoteSym,
            lparenSym, rparenSym }

develop a recursive descent parser for the language that matches the grammar as given originally. [ 8 marks ]

What was expected here was code on the lines of that shown below:

      void Bool (void);    // prototypes
      void Term (void);
      void Factor (void);
      void Primary (void);

      void Bool (void) {
      // Bool = Term { "OR" Term } .
        Term();
        while (Sym == orSym) {
          GetSym();
          Term();
        }
      }

      void Term (void) {
      // Term = Factor { [ "AND" ] Factor } .
        Factor();
        while (Sym == andSym ||
               Sym == trueSym || Sym == falseSym || Sym == notSym || Sym == lparenSym ) {
          if (Sym == andSym) GetSym();
          Factor();
        }
      }

      void Factor (void) {
      // Factor = "NOT" Factor | Primary { "'" } .
        if (Sym == notSym) {
          GetSym();
          Factor();
        }
        else {
          Primary();
          while (Sym == quoteSym) GetSym();
        }
      }

      void Primary (void) {
      // Primary = "TRUE" | "FALSE" | "(" Bool ")" .
        switch (Sym) {
          case trueSym : /* drop through to next case */
          case falseSym :
            GetSym(); break;
          case lparenSym;
            GetSym(); Bool(); accept(rparenSym, ") expected"); break;
          default :
            abort("unexpected symbol"); break;
        }
      }

      void main (void) {
        GetChar(); GetSym();
        Bool();
      }

The messiest bit of this is the function for parsing a Term(). Note that if one were to define the enumeration in a clever way:

          { EOFSym, unknownSym, orSym, quoteSym, rparenSym,
            andSym, lparenSym, trueSym, falseSym, notSym }

then one could simplify Term as follows:

      void Term (void) {
      // Term = Factor { [ "AND" ] Factor } .
        Factor();
        while (Sym >= andSym) {
          if (Sym == andSym) GetSym();   Factor();
        }
      }

One sometimes sees this sort of trick with enumerations (another example is to be found in the book in section 14.6.1 (which we did not cover in class).

4. The grammar below is for integer or Boolean expressions that involve constant factors only. Show how it could be "attributed" for use with Coco/R so that the parsing process could check that an expression was not only syntactically correct but also return a value of an enumeration denoting whether the expression was of integer, Boolean or mismatched type: [ 10 marks ]

        enum types { badType, intType, boolType };

The constraints on operators appearing in expressions may be summarized:

Arithmetic operators + - * / and relational operators == != < <= > >= may only be applied to integer operands.
Boolean operators ! && and || may only be applied to Boolean operands.
Arithmetic operators produce integer results.
Boolean and relational operators produce Boolean results.

Goal = Expression . Expression = SimpleExp [ RelOp SimpleExp ] . SimpleExp = Term { ( "+" | "-" ) Term | "||" Term } . Term = Factor { ( "*" | "/" ) Factor | "&&" Factor } . Factor = number | "true" | "false" | "(" Expression ")" | "!" Factor . RelOp = "==" | "!=" | "<" | "<=" | ">" | ">=" .

This problem was not particularly well done, except perhaps by people who had bothered to study material presented in Tutorial 20. Note that one cannot use a simple Boolean parameter (as some people tried, presumably having in mind the similar but simpler problem studied in Tutorial 19 where the objective was to see whether an expression was a "constant expression" or not). Another point that many people missed was that if the Expression parser finds a Relop between two SimpleExps, then it has to check that these SimpleExps have returned integer types, but pass an indication of a Boolean type back to its own caller. Finally, there were many attempts that did not use local variables correctly or at all. In view of the recursive nature of the grammar, the use of reference parameters and local variables is mandatory!

    Goal
    =                             (. types e; .)
    Expression<e> .

    Expression<types &e>
    =                             (. types s; .)
      SimpleExp<e>
      [ RelOp SimpleExp<s>        (. if (e != intType || s != intType) e = badType;
                                     else e = boolType; .)
      ] .

    SimpleExp<types &s>
    =                             (. types t; .)
      Term<s>
      {   ( "+" | "-" ) Term<t>   (. if (s != intType || t != intType) e = badType; .)
        | "||" Term<t>            (. if (s != boolType || t != boolType) e = badType; .)
      } .

    Term<types &t>
    =                             (. types f; .)
      Factor<t>
      {   ( "*" | "/" ) Factor<f> (. if (t != intType || f != intType) t = badType; .)
        | "&&" Factor<f>          (. if (t != boolType || f != boolType) t = badType; .)
      } .

    Factor<types &f>
    =   number                    (. f = intType; .)
      | "true"                    (. f = boolType; .)
      | "false"                   (. f = boolType; .)
      | "(" Expression<f> ")"
      | "!" Factor<f>             (. if (f != boolType) f = badType; .)
      .

    RelOp
    = "==" | "!=" | "<" | "<=" | ">" | ">=" .