Computer Science 3 - 2000 - Test on Prac 23

1. What is your name?

2. Here is part of an attempt to write a scanner for the system that you developed last week. The author has made an attempt to handle comments, realising that comments might be started and never completed. But the code is inadequate. Why?

    void GetSym (void) { // Scans for next sym
      while (ch > 0 && ch <= ' ') GetChar();
      switch (ch) {
        case '\0' : sym = EOFSym; break;
        case '\\' : do GetChar(); while (ch != '\\' && ch != '\0');
                    if (ch == nul) Abort("unclosed comment");
                    break;
        case ';'  : sym = SemicolonSym; GetChar(); break;
        case '|'  : sym = BarSym; GetChar(); break;
        .....
        case '\'' : sym = EscapedSym;
                    char quote = ch; GetChar(); GetChar();
                    if (ch != quote) Abort("Mismatched quotes");
                    GetChar();
                    break;
      }
    }

The inadequacy arises because, if a comment is encountered, the scanner (correctly) ignores it, but then returns without fulfilling the contract to decode a token (a comment is not a token). We need code like this

        case '\\' : do GetChar(); while (ch != '\\' && ch != '\0');
                    if (ch == nul) Abort("unclosed comment");
                    GetChar(); GetSym(); break;

3. The code also handles the EscapedSym token by accepting an escaped token of the form "x" or 'x' (that is, a character, surrounded by quotes). Is it really necessary to define escaped tokens in this way? If not, can you suggest an alternative that would still allow us to distinguish meta symbols like BarSym from the representation of | as a character that we might want our regular expression to match?

This was almost a trick question. Although we must have some mechanism for "escaping", we don't need to be restricted to the "?" or '?' convention for escaping a meta-character like ?. We could use another way of escaping - for example using a %?% notation with % as the escape character, or even simply %?.

4. Suppose we wanted to write a scanner like the one above that would be able to recognize only two valid tokens - a range of identifiers that must start with the letter 'a' and contain only letters, and the keyword "begin". Complete the skeleton code below to show how this might be done:

    void GetSym (void) { // Scans for next sym
      while (ch > 0 && ch <= ' ') GetChar();
      switch (ch) {
        case 'a'  : sym = identSym;
        default   : sym = badSym;
      }
    }

A full solution is a bit messy, unless you introduce the idea of storing letters in an array. What I was looking for was the insight that you had to scan the remainder of the word, and, sadly, not everyone saw this. Here is an array based solution:

    void GetSym (void) {
    // Scans for next sym
      char spelling[100];
      while (ch > 0 && ch <= ' ') GetChar();
      switch (ch) {
        case 'a'  :
          sym = identSym;
          do                            // scan remainder of ident
            GetChar();
          while (isalpha(ch));
        case 'b'  :
          int i = 0; spelling[0] = ch;
          do                            // build up the word
            { GetChar(); spelling[++i] = ch; )
          while (isalpha(ch));
          spelling[i-1] = '\0';         // null terminator
          if (strcmp(spelling, "begin") == 0) sym = beginSym; else sym = badSym;
        default   : sym = badSym;
      }
    }

5. Suppose we have a grammar with the PRODUCTIONS

     PRODUCTIONS
        A = '(' { '+' number } [ B ] ')' .
        B = '-' number .

Assuming that you have an accept routine like the one you used last week, and a scanner that can recognise tokens that might be described by the enumeration

enum SYMBOLS { plusSym, minusSym, numberSym, LParenSym, RParenSym, badSym };

how would you complete the parser routines below?

    void A (void) {
    //  A = '(' { '+' number } [ B ] ')' .
    }

    void B (void) {
    //  B = '-' number .
    }

This was very badly done, and it is very easy. An important idea that many people have missed is that parser routines are "driven" by checking for FIRST tokens, not by negative tests for FOLLOW tokens. So we need something like this:

    void A (void) {
    //  A = '(' { '+' number } [ B ] ')' .
      accept(LParenSym, "( expected");
      while (Sym == plusSym) { GetSym; accept(numberSym, "number expected"); }
      if (sym == minusSym) B();
      accept(RParenSym, ") expected");
    }

    void B (void) {
    //  B = '-' number .
      accept(minusSym, "- expected");
      accept(numberSym, "- expected");
    }

Each routine should be written in its own "context". The routine for A should not worry about what B has to do, other than to check whether B should be called. Coding A like this, which several people did, is back to front:

    void A (void) {
    //  A = '(' { '+' number } [ B ] ')' .
      accept(LParenSym, "( expected");
      if (Sym != RParenSym) {
        if (Sym = minusSym) B();
        while (Sym == plusSym) { GetSym; accept(numberSym, "number expected"); }
      else error(") expected");
    }


Computer Science 3 - 2000 - Test on Prac 23

1. What is your name?

2. Here is part of an attempt to write a scanner for the system that you developed last week. The author has made an attempt to handle comments, realising that a comment is not a token. But the code is inadequate. Why?

   void GetSym (void) {  // Scans for next sym
     while (ch > 0 && ch <= ' ') GetChar();
     switch (ch) {
       case '\0' : sym = EOFSym; break;
       case '\\' : do GetChar(); while (ch != '\\');
                   GetChar(); GetSym(); break;
       case ';'  : sym = SemicolonSym; GetChar(); break;
       case '|'  : sym = BarSym; GetChar(); break;
       .....
       default   : sym = UnknownSym; GetChar(); break;
     }
   }

The inadequacy arises because no attempt has been made to guard against encountering a comment that begins, but does not close before end-of-file is reached. We need code like this

        case '\\' : do GetChar(); while (ch != '\\' && ch != '\0');
                    if (ch == nul) Abort("unclosed comment");
                    GetChar(); GetSym();
                    break;

3. For that particular application, is it necessary or advisable to introduce the concept of the UnknownSym? If so, why so, and if not, why not?

Well, I was looking for the insight that I hoped you would have had - in this particular application there are NO unknown tokens - every character that is not a metacharacter is actually an atom! Sadly not everyone seemed to know that.

In general the idea of having an UnknownSym (or noSym as Coco calls it) is a good one - if the scanner does not recognise a valid token it then returns UnknownSym, and leaves the parser to sort out the mess (just as having the concept of EOFSym is a good idea - if the scanner can't find a token at all then it returns a pseudo- token EOFSym, and leaves the parser to sort out the mess).

4. Suppose we wanted to write a scanner like the one above that would be able to recognize only two valid tokens - floating point numbers and integer numbers. These are described by

       float   = digit { digit } '.' { digit } .
       integer = digit { digit }

The routine must assign either floatSym or intSym to Sym before leaving the routine (or assign the value of UnknownSym if it comes across an invalid token). Complete the skeleton code below to show how this might be done:

      void GetSym (void) {  // Scans for next sym
        while (ch > 0 && ch <= ' ') GetChar();
        if (isdigit(ch) {

        } else Sym = UnknownSym;
      }

The best way to do this is probably as follows:

   void GetSym (void) {
   // Scans for next sym
     while (ch > 0 && ch <= ' ') GetChar();
     if (isdigit(ch)) {                // must be okay
       Sym = intSym;                   // initial assumption
       while (isdigit(ch)) GetChar();  // scan over the first few digits
       if (ch == '.') {                // okay, we must have a floating number
         Sym = floatSym;               // so change our assumption
         do                            // now scan past the digits after the point
           GetChar();
         while (isdigit(ch)
       }
     } else Sym = UnknownSym;
   }

Another way might appear to be like this, which some people tried

   void GetSym (void) {
   // Scans for next sym
     while (ch > 0 && ch <= ' ') GetChar();
     if (isdigit(ch)) {
       Sym = intSym;                               // initial assumption
       while (isdigit(ch) || ch = '.' ) {          // scan over the rest
         GetChar(); if (ch == '.') Sym = floatSym; // change our assumption
       }
     } else Sym = UnknownSym;
   }

but this is subtly WRONG - it would accept something like 45.67.89 as a valid floating number (can you see why? If not, better come to ask me!)

5. Suppose we have a grammar with the PRODUCTIONS

     PRODUCTIONS
        A = [ B ] { C } .
        B = number | identifier .
        C = '(' string ')' .

Assuming that you have an accept routine like the one you used last week, and a scanner that can recognise tokens that might be described by the enumeration

enum SYMBOLS { numSym, identSym, stringSym, LParenSym, RParenSym, badSym };

how would you complete the parser routines below?

    void A (void) {
    //  A = [ B ] { C } .
    }

    void B (void) {
    //  B = number | identifier .
    }

    void C (void) {
    //  C = '(' string ')' .
    }

This was very badly done, and it is very easy. An important idea that many people have missed is that parser routines are "driven" by checking for FIRST tokens, not by negative tests for FOLLOW tokens. So we need something like this:

    void A (void) {
    //  A = [ B ] { C } .
      if (Sym == numSym || Sym == ident) B();
      while (Sym == LParenSym)  C();
    }

    void B (void) {
    //  B = number | identifier .
      if (Sym == numSym || Sym == ident) GetSym;
      else error("invalid B");
    }

    void C (void) {
    //  C = '(' string ')' .
      accept(LParenSym, "( expected");
      accept(stringSym, "string expected");
      accept(RParenSym, ") expected");
    }

Each routine should be written in its own "context". The routine for A should not worry about what B has to do, other than to check whether B should be called. Coding like this, which several people did, "works", but is illogical - it works only if there are no other productions in the system that depend on A, B or C.

    void A (void) {
    //  A = [ B ] { C } .
      if (Sym == numSym || Sym == ident) GetSym();
      if (Sym == LParenSym) { GetSym();  C(); }
    }

    void B (void) {
    //  B = number | identifier .
      // nothing, already handled in A
    }

    void C (void) {
    //  C = '(' string ')' .
    // already accepted ( in A, so start from string
      accept(stringSym, "string expected");
      accept(RParenSym, ") expected");
    }