Test for week 23

Computer Science 301 - 2010 - Test on Week 23 - Solutions

Please answer all questions in the spaces provided (four sides). Max 30

This test aims to check your ability to write simple scanners and parsers. TEXTBOOKS and NOTES MAY NOT BE USED.

1. You can get this 100% right! What are your name (surname) and student number? [1]

Pat (Terry) 63T0844

2. Suppose we have a Cocol grammar

     COMPILER Declarations
     /* Describe a subset of the forms that declarations can assume in a simple C-like language */

     CHARACTERS
       digit  = "0123456789" .
       letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .

     TOKENS
       ident  = letter { { "-" } (letter | digit) } .

     COMMENTS FROM "{" TO "}" NESTED

     IGNORE CHR(1) .. CHR(31) // CHR(0) will be used to denote EOF

     PRODUCTIONS
       Declarations = { DecList } EOF .
       DecList      = Type OneDecl { "," OneDecl } ";"  .
       Type         = "int" | "void" | "bool" | "char" .
       OneDecl      = "*" OneDecl | Direct .
       Direct       = ( ident | "(" OneDecl ")" ) [ Params ] .
       Params       = "(" [ OneParam { "," OneParam } ] ")" .
       OneParam     = Type [ OneDecl ] .
     END Declarations.

Assuming that you have accept and abort routines like those you used in this last week, and a scanner getSym() that can recognise tokens that might be described by the enumeration

EOFSym, noSym, identSym, intSym, voidSym, boolSym, charSym, commaSym, lparenSym, rparenSym,
semicolonSym, starSym;

how would you complete the parser routines below?

Most people were fairly close to having correct solutions. Those that were not were very depressing, and, frankly, I wondered whether the students concerned had contributed to the prac submissions at all.

Several submissions added EOFSym into FirstDeclarations. EOFSym is not a "real" terminal. It is injected into the grammar above as a convenient way of making sure that an error message will result if the input ends prematurely.

    static IntSet
      FirstDeclarations = new IntSet(intSym, voidSym, boolSym, charSym),
      FirstDeclList     = new IntSet(intSym, voidSym, boolSym, charSym),
      FirstType         = new IntSet(intSym, voidSym, boolSym, charSym),
      FirstOneDecl      = new IntSet(starSym, identSym, lparenSym),
      FirstDirect       = new IntSet(identSym, lparenSym),
      FirstParams       = new IntSet(lparenSym),
      FirstOneParam     = new IntSet(intSym, voidSym, boolSym, charSym);

Many people referred to FirstDeclarations instead of FirstDeclList in the next method. Well, yes, they are the same in this case, but they might have been different. Take care to get the code matched properly to the produccctions.

    static void Declarations () {
    // Declarations = { DecList } EOF .
      while (FirstDeclList.contains(sym.kind)) DecList();
      accept(EOFSym, "EOF expected");
    }

    static void DecList () {
    // DecList = Type OneDecl { "," OneDecl } ";"  .
      Type();
      OneDecl();
      while (sym.kind == commaSym) {
        getSym(); OneDecl();
      }
      accept(semicolonSym, "; expected");
    }

    static void Type () {
    // Type = "int" | "void" | "bool" | "char" .
      if Firsttype.contains(sym.kind) getSym();
      else abort("invalid type");
    }

Here's an interesting variation. Suppose we were to overload the accept method as sollows

    static void accept(int wantedSym, String errorMessage) {
    // Checks that lookahead token is wantedSym
      if (sym.kind == wantedSym) getSym(); else abort(errorMessage);
    } // accept (simple)

    static void accept(IntSet allowable, String errorMessage) {
    // Checks that lookahead token is a member of a set of allowed tokens
      if (allowable.contains(sym.kind)) getSym(); else abort(errorMessage);
    } // accept (set)

then we would be able to write the Type parser as:

    static void Type () {
    // Type = "int" | "void" | "bool" | "char" .
      accept(FirstType, "invalid type");
    }

A variation which might be useful in other applications as well. Moving right along -

    static void OneDecl () {
    // OneDecl = "*" OneDecl | Direct .
      if (sym.kind == starSym) {
        getSym(); oneDecl();
      }
      else if (FirstDirect.contains(sym.kind)) Direct();
      else abort("invalid start to OneDecl");
    }

Several submissions simplified this to the following, leaving out the "abort" call:

    static void OneDecl () {
    // OneDecl = "*" OneDecl | Direct .
      if (sym.kind == starSym) {
        getSym(); oneDecl();
      }
      else Direct();
    }

This is living dangerously! If a production (or part of a production) incorporates a list of alternatives, you should always cater for the possibility that the text to be parsed matches none of the possibilities, and be prepared to flag an error in this case.

Incidentally, a remarkable number of submissions misinterpreted the production

    OneDecl = "*" OneDecl | Direct

treating it as

    OneDecl = "*"   (  OneDecl | Direct )

A "safe" way to code Direct would be

    static void Direct () {
    // Direct = ( ident | "(" OneDecl ")" ) [ Params ] .
      if (sym.kind == identSym()) getSym;
      else if (sym.kind == lparenSym) {
        getSym;
        OneDecl();
        accept(rparenSym, ") expected");
      }
      else abort("invalid start to Direct");
      if (sym.kind == lparenSym) Params();
    }

although the following is also "safe" (why - can you figure it out), and is a bit shorter. The point is that one should not assume that when Direct is called there will be an identSym or lparenSym as the current token.

    static void Direct () {
    // Direct = ( ident | "(" OneDecl ")" ) [ Params ] .
      if (sym.kind == identSym()) getSym;
      else {
        accept(lparenSym, "( expected");
        OneDecl();
        accept(rparenSym, ") expected");
      }
      if (sym.kind == lparenSym) Params();
    }

    static void Params () {
    // Params = "(" [ OneParam { "," OneParam } ] ")" .
      accept(lparenSym, "( expected");  // not a good idea to simplify to getSym() - why?
      if (FirstOneParam.contains(sym.kind)) {
        OneParam();
        while (sym.kind == commaSym) {
          getSym(); OneParam();
        }
      }
      accept(rparenSym, ") expected");
    }

    static void OneParam () {
    // OneParam  = Type [ OneDecl ] .
      Type();
      if (FirstOneDecl.contains(sym.kind)) OneDecl();
    }

3. A student has been trying to develop a scanner for the system. The part of it that deals with recognizing identifiers and ignoring comments has led her to develop the following code. Sadly, it is not quite correct. Suggest the changes needed to correct it.

This was very badly done! Most people seemed not to have noticed that the scanner was required to discard (possibly) nested comments, and that an underscore in an identifier had to be followed by a letter or digit. And don't let the scanner call abort. The scanner should simply return noSym as the value of sym.kind and let the parser react accordingly.

The skeleton code was as follows.

      static void getSym() {                         // Scans for next sym
        StringBuilder symLex = new StringBuilder();
        int symKind = noSym;                         // assume the worst!
        while (ch > EOF && ch <= ' ') getChar();     // whitespace
        if (ch == '{' )  {                           // deal with comments
          do {
            getChar();
          } while (ch != '}');
        }
        if (Character.isLetter(ch)) {                // recognize identifiers
          do {
            symLex.append(ch); getChar();
          } while (Character.isLetterOrDigit(ch));
          symKind = literalKind(symLex);
        }

This made no attempt to handle nested comments, to continue to look for a token after dealing with a comment, to guard against unclosed comments, and so on. It also made no attempt to handle identifiers with underscores that have to be followed by a letter or digit. A Java solution follows (C# code is effectively identical, of course):

      static int literalKind(StringBuilder lex) {    // handle keywords
        String s = lex.toString();
        if (s.equals("bool"))  return boolSym;
        if (s.equals("char"))  return charSym;
        if (s.equals("int"))   return intSym;
        if (s.equals("void"))  return voidSym;
        return identSym;
      }

      static void getSym() {                         // Scans for next sym
        StringBuilder symLex = new StringBuilder();
        int symKind = noSym;                         // assume the worst!
        while (ch > EOF && ch <= ' ') getChar();     // whitespace
        if (ch == '{' ) {                            // deal with comments
          int nesting = 1;                           // level of nesting
          do {
            getChar();
            if (ch == EOF) {                         // unclosed comment
              sym = new Token(EOFSym, "EOF");        // give up in disgust
              return;
            }
            if (ch == '{') nesting++;                // increase nesting
            if (ch == '}') nesting--;                // decrease nesting
          } while (nesting != 0);                    // when the loop terminates our
                                                     // job is not yet done!
          getChar(); getSym();                       // recursive call - still need to find a token
          return;
        }
        if (Character.isLetter(ch)) {                // recognize identifiers
          do {
            symLex.append(ch); getChar();
            if (ch = '_') {                          // must be followed letter | digit
              while (ch == '_') {                    // work through the underscores
                symLex.append(ch);
                getCh();
              }                                      // and now check the following character
              if (!Character.isLetterOrDigit(ch)) {  // bad identifier - return noSym
                sym = new Token(noSym, symLex.toString());
                return;
              }
            }
          } while (Character.isLetterOrDigit(ch));
          symKind = literalKind(symLex);
        }

        else {
          // Please don't concern yourself with the rest of this method
        }
        sym = new Token(symKind, symLex.toString());
      }