Test 23 solutions

Computer Science 301 - 2006 - Test on Week 23

Please answer all questions in the spaces provided (two sides). Max 25

This test is intended to check your ability to write simple scanners and parsers. Textbooks may not be used.

1. You can get this 100% right! What are your name (surname) and student number? [1]

Pat Terry 63T0884

2. Suppose we have a grammar which included the PRODUCTIONS

     PRODUCTIONS
        A = [ B ] '(' number { '+' number | B } ')' .
        B = '-' number '.' | '.' .

Assuming that you have accept and abort routines like those you used last week, and a scanner getSym() that can recognise tokens that might be described by the enumeration

EOFSym, noSym, plusSym, minusSym, numSym, lparenSym, rparenSym, periodSym;

how would you complete the parser routines below? Hint: the grammar might have other productions, some of which might have right sides that include the non-terminals A and B. [10]

I was hoping for solutions on the line of the following:

    static void A () {
    //  A = [ B ] '(' number { '+' number | B } ')' .

      if (sym.kind == minusSym || sym.kind == periodSym) B();

      accept(lparenSym, "( expected");
      accept(numSym, "number expected");
      while (sym.kind == plusSym || sym.kind == minusSym || sym.kind == periodSym) {
         if (sym.kind == plusSym) {
           getSym(); accept(numSym, "number expected")
         }
         else B();
      }
      accept(rparenSym, ")" expected);
    }

    static void B () {
    //  B = '-' number '.' | '.' .
      if (sym.kind == minusSym) {
        getSym();
        accept(numSym, "number expected");
        accept(periodSym, ". expected";
      } else accept(periodSym, ". expected";
    }

The method for B could also have been coded with a switch statement. If you do this, be careful to include the error case:

static void B () { // B = '-' number '.' | '.' . switch (sym.kind) case minusSym : { getSym(); accept(numSym, "number expected"); accept(periodSym, ". expected"; break; } case periodSym : getSym(); break; default: abort("invalid start to B"); break; } }

Some submissions modified the grammar to come up with:

static void B () { // B = '-' number '.' | '.' . // changed to B = [ '-' number ] '.' . if (sym.kind == minusSym) { getSym(); accept(numSym, "number expected"); } accept(periodSym, ". expected"; }

The grammars are syntactically equivalent, but one would not be able to do this transformation with impunity if different "actions" were to be associated with the different periods.
Several submissions came up with the following for A

static void A () { // A = [ B ] '(' number { '+' number | B } ')' . if (sym.kind == minusSym || sym.kind == periodSym) B(); accept(lparenSym, "( expected"); accept(numSym, "number expected"); while (sym.kind != rparenSym) { if (sym.kind == plusSym) { getSym(); accept(numSym, "number expected") else B(); } accept(rparenSym, ")" expected); }

that is, drove the repetition { '+' number | B } "from the right" instead of "from the left". I don't know that this is a good idea at all. If the closing parenthesis were omitted the loop would be in danger of not terminating nicely.

3. Suppose we wanted to write a scanner to match the specification of numbers used in the assembler grammar recently, which can be expressed in Cocol as follows (only three kinds of tokens are valid!) [16]

       CHARACTERS
         digit      = "0123456789" .
         binDigit   = "01" .
         hexDigit   = digit + "ABCDEF" .
         eol        = CHR(10) .

       COMMENTS FROM "{" TO "}"

       IGNORE CHR(1) .. CHR(32)  /* character handler returns CHR(0) as EOF */

       TOKENS
         decNumber  = digit { digit } .            /* digits only */
         binNumber  = binDigit { binDigit } "%" .  /* trailing % */
         hexNumber  = digit { hexDigit } "H" .     /* trailing H */

The routine must assign either decSym or binSym or hexSym to symKind before leaving the routine (or assign the value of noSym if it comes across an invalid token, or eofSym at the end of file). Complete the skeleton code below to show how this might be done:

This was deliberately a tough question, and sadly not many people saw the problem -which is that tokens like 01103% (with a non binary digit in what appears to be a binary sequence) or 1A3 (an hex number without a trailing H) must give rise to an error. One way to handle this is suggested below.

Secondly, the comment handling left a lot to be desired, and in many cases I wondered whether their authors could possibly have understood the practical in this respect. If a comment is detected one has firstly to ignore it, and then to press on to find a symbol - and this might only happen after one has found another comment, in a sequence like { comment } { comment(after more whitespace) } 1234.

One way of handling comments is to have a recursive call, but you must be careful to ensure that when the "inner" call returns, control flows correctly.

      static void getSym() {                               // Scans for next sym
        StringBuffer symLex = new StringBuffer();
        int symKind = noSym;                               // default is noSym
        while (ch > EOF && ch <= ' ') getChar();

        if (ch == '{') {                                   // comment
          do getChar(); while (ch != '}' && ch != EOF);
          if (ch == '}') {
            getChar(); getSym(); return;                   // note break out!
          } else output.write("unclosed comment");
        }
        else {
          if (Character.isDigit(ch)) {                     // hopefully a number
            boolean seenHex = false, seenDec = false;      // it might be binary
            do {
              seenDec = seenDec || ch > '1' && ch <= '9';  // no, it might be decimal
              seenHex = seenHex || ch >= 'A' && ch <= 'F'; // no, it might be hexadecimal
              symLex.append(ch); getChar();
            } while (Character.isHexDigit(ch));            // now have seen all the digits
            if (ch == '%') {                               // check that we had not seen bad digits
              symLex.append(ch); getChar();
              if (!seenHex && !seenDec)                    // only seen 0s and 1s
                symKind = binNumber;                       // else it must be noSym (eg 121% or 1E1%)
            }
            else if (ch == 'H') {
              symLex.append(ch); getChar();
              symKind = hexNumber;                         // it must have been hexadecimal
            }
            else if (!seenHex) symKind = decNumber;        // else it must be noSym (eg 1A34)
          }
          else                                             // unknown symbol or EOF
            if (ch == EOF) symKind = EOFSym;
          sym = new Token(symKind, symLex.toString());
        }
      }

If you don't like the recursive solution you might like this one instead, which also shows another way of discrimination between numbers (some submissions tried something like this):

static void getSym() { // Scans for next sym StringBuffer symLex = new StringBuffer(); int symKind = noSym; // default is noSym while (ch > EOF && ch <= ' ') getChar(); // leading whitespace while (ch == '{') { // there may be several comments in a row do getChar(); while (ch != '}' && ch != EOF); if (ch == EOF) { reportError("unclosed comment"); break; // out of the outer loop } while (ch > EOF && ch <= ' ') getChar(); // skip trailing whitespace } if (Character.isDigit(ch)) { // hopefully a number int status = bin; while (isBinDigit(ch)) { // leading binary digits symLex.append(ch); getChar(); } while (Character.isDigit(ch)) { // decimal or binary digits status = dec; symLex.append(ch); getChar(); } while(isHexDigit(ch)) { // hex, decimal or binary status = hex; symLex.append(ch); getChar(); } switch (ch) { // character following digits case 'H' : symLex.append(ch); getChar(); symKind = hexNumber; break; case '%' : symLex.append(ch); getChar(); if (status == bin) symKind = binNumber; // else it is in error break; default : if (status != hex) symKind = decNumber; // else it is in error break; } } else // unknown symbol or EOF if (ch == EOF) symKind = EOFSym; sym = new Token(symKind, symLex.toString()); }

Home © P.D. Terry