Computer Science 3 - 2002 - Test on Prac 10

1. No funny stuff this time- this is a serious question - What is your name (surname)? [1]

Pat (Terry)

2. Here is part of an attempt to write a scanner for the system that you developed last week. The author has made an attempt to handle comments, realising that a comment is not a token. But the code is inadequate. Why? [3]

   void GetSym (void) {  // Scans for next sym
     while (ch > 0 && ch <= ' ') GetChar();
     switch (ch) {
       case '\0' : sym = EOFSym; break;
       case '\\' : do GetChar(); while (ch != '\\');
                   GetChar(); GetSym(); break;
       case ';'  : sym = SemicolonSym; GetChar(); break;
       case '|'  : sym = BarSym; GetChar(); break;
       .....
       default   : sym = UnknownSym; GetChar(); break;
     }
   }

The inadequacy arises because no attempt has been made to guard against encountering a comment that begins, but does not close, before end-of-file is reached. We might prefer code like this

        case '\\' : do GetChar(); while (ch != '\\' && ch != '\0');
                    if (ch == '\0') Abort("unclosed comment");
                    GetChar(); GetSym();
                    break;

Some strange misconceptions showed up in the submissions. Firstly, the sequence '\\' in C++ code does not represent a string of two \ characters, it represents the \ character - this has to be done using an "escaped character sequence" idea. Secondly, do not confuse the do-while loop with the other while loop, as some people did. Thirdly, note that a comment is not a token - the scanner must not return a CommentSym, for example.

3. For that particular application, was it necessary or advisable to introduce the concept of the UnknownSym? If so, why so, and if not, why not? [2]

Well, I was looking for the insight that I hoped you would have had - in this particular application there were NO unknown tokens - every character that was not a metacharacter was actually an atom! Sadly not everyone seemed to know that. One insightful answer suggested that UnknownSym might be useful for situations where the EscapedSym was not correctly defined. Well done - I had not thought of that myself, and it's always nice when I learn a new trick from students! In my own solution to the prac I made the scanner Abort the whole program if a bad escape sequence is detected, but that was pretty harsh.

In general the idea of having an UnknownSym (or noSym as Coco calls it) is a good one - if the scanner does not recognise a valid token it then returns UnknownSym, and leaves the parser to sort out the mess (just as having the concept of EOFSym is a good idea - if the scanner can't find a token at all then it returns a pseudo-token EOFSym, and leaves the parser to sort out the mess).

4. Suppose we wanted to write a scanner like the one above that would be able to recognize only two valid tokens - floating point numbers and integer numbers. These are described by

       float   = digit { digit } '.' { digit } .
       integer = digit { digit } .

The routine must assign either floatSym or intSym to Sym before leaving the routine (or assign the value of unknownSym if it comes across an invalid token). Complete the skeleton code below to show how this might be done. [5]

      void GetSym (void) {  // Scans for next sym
        while (ch > 0 && ch <= ' ') GetChar();
        if (isdigit(ch) {

        } else Sym = UnknownSym;
      }

This was very badly answered - most solutions submitted had no loops for consuming digits. The best way to do this may be as follows:

   void GetSym (void) {
   // Scans for next sym
     while (ch > 0 && ch <= ' ') GetChar();
     if (isdigit(ch)) {           // must be okay
       Sym = intSym;              // initial assumption
       do
         GetChar();
       while (isdigit(ch));       // scan over the first few digits
       if (ch == '.') {           // okay, we must have a floating number
         Sym = floatSym;          // so change our assumption
         do                       // now scan past the digits after the point
           GetChar();
         while (isdigit(ch)
       }
     } else Sym = UnknownSym;
   }

Another way might appear to be like this, which some people tried

   void GetSym (void) {
   // Scans for next sym
     while (ch > 0 && ch <= ' ') GetChar();
     if (isdigit(ch)) {
       Sym = intSym;                               // initial assumption
       while (isdigit(ch) || ch = '.' ) {          // scan over the rest
         GetChar(); if (ch == '.') Sym = floatSym; // change our assumption
       }
     } else Sym = UnknownSym;
   }

but this is subtly WRONG - it would accept something like 45.67.89 as a valid floating number (can you see why? If not, better come to ask me!)

5. Suppose we have a grammar with the PRODUCTIONS

     PRODUCTIONS
        A = [ B ] { C } .
        B = number | identifier .
        C = '(' string ')' .

Assuming that you have an accept routine like the one you used last week, and a scanner that can recognise tokens that might be described by the enumeration

enum SYMBOLS { numSym, identSym, stringSym, LParenSym, RParenSym, badSym };

how would you complete the parser routines below? [9]

    void A (void) {
    //  A = [ B ] { C } .
    }

    void B (void) {
    //  B = number | identifier .
    }

    void C (void) {
    //  C = '(' string ')' .
    }

This was not always well done, and it is very easy. An important idea that many people have missed is that parser routines are "driven" by checking for FIRST tokens, not by negative tests for FOLLOW tokens. So we need something like this:

    void A (void) {
    //  A = [ B ] { C } .
      if (Sym == numSym || Sym == ident) B();
      while (Sym == LParenSym)  C();
    }

    void B (void) {
    //  B = number | identifier .
      if (Sym == numSym || Sym == ident) GetSym;
      else error("invalid B");
    }

    void C (void) {
    //  C = '(' string ')' .
      accept(LParenSym, "( expected");
      accept(stringSym, "string expected");
      accept(RParenSym, ") expected");
    }

Each routine should be written in its own "context". The routine for A should not worry about what B has to do, other than to check whether B should be called. Coding like this, which several people did, "works", but is illogical - it works only if there are no other productions in the system that depend on A, B or C.

    void A (void) {
    //  A = [ B ] { C } .
      if (Sym == numSym || Sym == ident) GetSym();
      if (Sym == LParenSym) { GetSym();  C(); }
    }

    void B (void) {
    //  B = number | identifier .
      // nothing, already handled in A
    }

    void C (void) {
    //  C = '(' string ')' .
    // already accepted ( in A, so start from string
      accept(stringSym, "string expected");
      accept(RParenSym, ") expected");
    }