Computer Science 3 - 2001 - Test on Prac 18

THURSDAY TEST

2. Suppose we wanted to write a scanner to match the specification expressed
   in Cocol as follows (only two kinds of tokens are valid!)

       CHARACTERS
         letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
         digit  = "0123456789" .
         eol    = CHR(13) .

       IGNORE CHR(1) .. CHR(32)

       COMMENTS FROM "//" TO eol

       TOKENS
         begin      = "begin" .
         identifier = "a" { letter | digit } .

   The routine must assign either beginSym or identSym to Sym before leaving
   the routine (or assign the value of unknownSym if it comes across an
   invalid token).  Complete the skeleton code below to show how this might be
   done:

Very few people saw that they were expected to be able to skip comments!

A full solution is a bit messy, unless you introduce the idea of storing
letters in an array. What I was looking for was the insight that you had to
scan the remainder of the word, and, sadly, not everyone saw this. Here is an
array based solution:

    void GetSym (void) {
    // Scans for next sym
      while (ch > 0 && ch <= ' ') GetChar();
      sym = unknownSym;                 // not recognised yet
      switch (ch) {
      case '/' :                        // possible comment
        GetChar();
        if (ch == '/') {                // definite comment
          do GetChar(); while (ch != '\n');
          GetChar(); GetSym();          // now look for the symbol again
        }
        break;
      case 'a' :
        sym = identSym;
        do                              // scan remainder of ident
          GetChar();
        while (isalpha(ch));
        break;
      case 'b' :
        char spelling[100];
        int i = 0;
        do                              // build up the word
          { spelling[i++] = ch; GetChar(); }
        while (isalpha(ch));
        spelling[i] = '\0';             // null terminator
        if (strcmp(spelling, "begin") == 0) sym = beginSym;
      default  : break; // already handled - sym = unknownSym;
      }
    }

3. Suppose we have a grammar with the PRODUCTIONS

     PRODUCTIONS
        A = '(' { '+' number } [ B ] ')' .
        B = '-' number | "." .

Assuming that you have an accept routine like the one you used last week,
and a scanner that can recognise tokens that might be described by the
enumeration

enum SYMBOLS { EOFSym, plusSym, minusSym, numberSym, lparenSym, rparenSym,
               periodSym, unknownSym};

how would you complete the parser routines below?

    void A (void) {
    //  A = '(' { '+' number } [ B ] ')' .
    }

    void B (void) {
    //  B = '-' number | "." .
    }

This was rather badly done, and it is very easy. An important idea that many
people have missed is that parser routines are "driven" by checking for
FIRST tokens, not by negative tests for FOLLOW tokens. So we need something
like this:

    void A (void) {
    //  A = '(' { '+' number } [ B ] ')' .
      accept(lparenSym, "( expected");
      while (Sym == plusSym) { GetSym(); accept(numberSym, "number expected"); }
      if (sym == minusSym || sym == periodSym) B();
      accept(rparenSym, ") expected");
    }

    void B (void) {
    //  B = '-' number | "." .
      switch (sym) {
        case minusSym:
          GetSym(); accept(numberSym, "- expected"); break;
        case periodSym:
          GetSym(); break;
        default :
          error("unexpected symbol"); break;
      }
    }

alternatively:

    void B (void) {
    //  B = '-' number | "." .
      if (sym == minusSym) {
        GetSym(); accept(numberSym, "- expected");
      } else accept(periodSym, "- expected");
    }

Each routine should be written in its own "context". The routine for A
should not worry about what B has to do, other than to check whether B
should be called.  Coding A like this, which some people did, is back to
front:

    void A (void) {
    //  A = '(' { '+' number } [ B ] ')' .
      accept(lparenSym, "( expected");
      if (Sym != rparenSym) {
        if (Sym = minusSym) B();
        while (Sym == plusSym) { GetSym(); accept(numberSym, "number expected"); }
      else error(") expected");
    }

Finally, there were several submissions that tested for EOFSym in various
places.  While I see what you may have been thinking (since the associated
prac had used EOFSym), in this grammar there is no mention of EOFSym, so don't
be tempted to introduce it into the productions!

FRIDAY TEST

2. Suppose we wanted to write a scanner to match the specification expressed
   in Cocol as follows (only two kinds of tokens are valid!)

       CHARACTERS
         digit = "0123456789" .
         eol   = CHR(13) .

       IGNORE CHR(1) .. CHR(32)

       COMMENTS FROM ";" TO eol

       TOKENS
         float   = digit { digit } '.' { digit } .
         integer = digit { digit } .

   The routine must assign either floatSym or intSym to Sym before leaving the
   routine (or assign the value of unknownSym if it comes across an invalid
   token).  Complete the skeleton code below to show how this might be done:

Very few people saw that they were expected to be able to skip comments!

The best way to do this is probably as follows:

   void GetSym (void) {
   // Scans for next sym
     while (ch > 0 && ch <= ' ') GetChar();
     if (ch == ';') {                  // comment
       do GetChar(); while (ch != '\n');
       GetChar(); GetSym();            // now look for the symbol again
       return;
     }
     if (isdigit(ch)) {                // must be okay
       Sym = intSym;                   // initial assumption
       while (isdigit(ch)) GetChar();  // scan over the first few digits
       if (ch == '.') {                // okay, we must have a floating number
         Sym = floatSym;               // so change our assumption
         do                            // now scan past the digits after the point
           GetChar();
         while (isdigit(ch))
       }
     } else Sym = unknownSym;
   }

Another way might appear to be like this, which some people tried

   void GetSym (void) {
   // Scans for next sym
     while (ch > 0 && ch <= ' ') GetChar();
     sym = unknownSym;                 // not recognised yet
     if (ch == ';') {                  // comment
       do GetChar(); while (ch != '\n');
       GetChar(); GetSym();            // now look for the symbol again
       return;
     }
     if (isdigit(ch)) {
       Sym = intSym;                               // initial assumption
       while (isdigit(ch) || ch = '.' ) {          // scan over the rest
         GetChar(); if (ch == '.') Sym = floatSym; // change our assumption
       }
     } else Sym = UnknownSym;
   }

but this is subtly WRONG - it would accept something like 45.67.89 as a
valid floating number (can you see why? If not, better come to ask me!)

3. Suppose we have a grammar with the PRODUCTIONS

     PRODUCTIONS
        A = [ B  { C } ] .
        B = number | identifier .
        C = '(' string B ')' .

Assuming that you have an accept routine like the one you used last week,
and a scanner that can recognise tokens that might be described by the
enumeration

enum SYMBOLS { EOFSym, numSym, identSym, stringSym, lparenSym, rparenSym,
               unknownSym };

how would you complete the parser routines below?

    void A (void) {
    //  A = [ B  { C } ] .
    }

    void B (void) {
    //  B = number | identifier .
    }

    void C (void) {
    //  C = '(' string B ')' .
    }

This was very badly done, and it is very easy.  An important idea that many
people have missed is that parser routines are "driven" by checking for
FIRST tokens, not by negative tests for FOLLOW tokens. So we need something
like this:

    void A (void) {
    //  A = [ B { C } ] .
      if (Sym == numSym || Sym == identSym) {
        B();
        while (Sym == lparenSym)  C();
      }
    }

    void B (void) {
    //  B = number | identifier .
      if (Sym == numSym || Sym == identSym) GetSym();
      else error("invalid B");
    }

    void C (void) {
    //  C = '(' string B ')' .
      accept(lparenSym, "( expected");
      accept(stringSym, "string expected");
      B();
      accept(rparenSym, ") expected");
    }

Each routine should be written in its own "context". The routine for A
should not worry about what B has to do, other than to check whether B
should be called. Coding like this, which several people did, "works", but
is illogical - it works only if there are no other productions in the
system that depend on A, B or C.

    void A (void) {
    //  A = [ B { C } ] .
      if (Sym == numSym || Sym == identSym) {
        GetSym();
        while (Sym == lparenSym) { GetSym();  C(); }
      }
    }

    void B (void) {
    //  B = number | identifier .
      // nothing, already handled in A
    }

    void C (void) {
    //  C = '(' string ')' .
    // already accepted ( in A, so start from string
      accept(stringSym, "string expected");
      B();
      accept(rparenSym, ") expected");
    }

Finally, there were several submissions that tested for EOFSym in various
places.  While I see what you may have been thinking (since the associated
prac had used EOFSym), in this grammar there is no mention of EOFSym, so don't
be tempted to introduce it into the productions!