Practical for week

Computer Science 301 - 2014

Test on Prac 5 - Week beginning 5 May 2014 - Solutions

Please answer all questions in the space provided or on the back sides of the two pages. Max 25

This test is intended to check your ability to write simple scanners and parsers. It should only take you about 30 minutes at the most. Textbooks and computers and notes may not be used.

1. Your spot came up - lucky you! What are your name (surname) and student number? [1]

Pat (Terry) 63T0844

2. Suppose we have a grammar with the PRODUCTIONS (possibly among others which might make reference to them):

     PRODUCTIONS
        A = { C [ B ] } [ B ] .
        B = number | identifier .
        C = '(' string ')' .

Assuming that you have an accept routine like the one you used last week, and a scanner that can recognise tokens that might be described by the enumeration below, complete the parser routines [6]

numSym, identSym, stringSym, lParenSym, rParenSym, EOFSym, noSym

    static void A () {
    //  A = { C [ B ] } [ B ] .

    }

    static void B () {
    //  B = number | identifier .

    }

    static void C () {
    //  C = '(' string ')' .
    }

This is not an LL(1) grammar. Does this totally invalidate the parser? Explain why or why not - perhaps give an input string that might cause trouble. [3]

This was not always well done, and it is very easy. Indeed, some submissions were so badly wrong that I doubt whether the people concerned had done anything in the practical at all. An important idea that some people have missed is that parser routines are best "driven" by checking for FIRST tokens, not by negative tests for FOLLOW tokens. So we need something like this:

    void A() {
    //  A = { C [ B ] } [ B ] .
      while (sym.kind == lParenSym) {
        C();
        if (sym.kind == numSym || sym.kind == identSym) B();
      }
      if (sym.kind == numSym || sym.kind == identSym) B();
    }

    void B() {  // one version
    //  B = number | identifier .
      if (sym.kind == numSym || sym.kind == identSym) getSym;
      else reportError("invalid start to B");
    }

    void B() {  // an alternative version
    //  B = number | identifier .
      switch (sym.kind) {
        case numSym:
        case identSym : getSym(); break;
        default: abort("invalid start to B"); break;
      }
    }

    void B() {  // another alternative version
    //  B = number | identifier .
      if (sym.kind == numSym) getSym();
      else accept(identSym, "invalid start to B");
    }

    void C() {
    //  C = '(' string ')' .
     accept(lParenSym, "( expected");
     accept(stringSym, "string expected");
     accept(rParenSym, ") expected");
    }

Alternatively we might try something like this. Note that the accept() method is overloaded - one version has a simple int parameter, the other has an IntSet parameter. This not mentioned in the book, but it was included in the skeleton file you used in the practical.

static Intset FirstB = new IntSet(numSym, identSym); void A() { // A = { C [ B ] } [ B ] . while (sym.kind == lParenSym) { C(); if (FirstB.contains(sym.kind)) B(); } if (FirstB.contains(sym.kind)) B(); } void B() { // B = number | identifier . accept (FirstB, "invalid start to B"); }

Normally each routine should be written in its own "context", that is, be self-contained. The routine for A should not worry about what B has to do, other than to check whether B should be called. Coding like that below, which a few people tried, "works", but is illogical - it works only if there are no other productions in the system that depend on A, B or C (as you hade been warned in the question might be the case).

void A() { // A = { C [ B ] } [ B ] . while (sym.kind == lParenSym) { getSym(); C(); if (sym.kind == numSym || sym.kind == identSym) getSym(); } if (sym.kind == numSym || sym.kind == identSym) getSym(); } void B() { // B = number | identifier . // nothing, already handled in A } void C() { // C = '(' string ')' . // already accepted ( in A, so start from string accept(stringSym, "string expected"); accept(RParenSym, ") expected"); }

This is not an LL(1) grammar, but it will not really invalidate the parser. It will, however, bind the first of any sequences derived from B to the { C [ B ] } construct.

So C B will be treated and parsed as C B eps and not as C eps B

A bit like a "dangling else", in a sense. Of course, in an application where any actions to be associated with B are different for each of the two Bs there might well be trouble, as would happen with

     PRODUCTIONS
        A                            (. String str; .)
        = { C                        (. action after parsing C .)
          [ B<out str>               (. str = str.toUpperCase(); .)
          ] }
          [ B<out str>               (. str = str.toLowerCase(); .)
          ] .
        B<out String str>
         = ( number | identifier )   (. str = token.val; .)
        C = '(' string ')' .

In general, try to avoid grammars with LL(1) problems if you are Coco/R.

2. Suppose we wanted to write a scanner to match a specification expressed in Cocol as follows. Only three kinds of tokens are valid - read the specification carefully.

       CHARACTERS
         digit = "0123456789" .
         eol   = CHR(10) .

       COMMENTS FROM ";" TO eol

       IGNORE CHR(1) .. CHR(32)  /* character handler returns CHR(0) as EOF */

       TOKENS
         float   =   digit { digit } '.' { digit }
                   | "." digit { digit } .
         integer = digit { digit } .
         dotdot  = ".." .

The routine must assign either floatSym, intSym or dotdotSym to sym.kind before leaving the routine (or assign the value of noSym if it comes across an invalid token, or eofSym at the end of file). Complete the skeleton code below to show how this might be done. [15]

      static void getSym() {  // Scans for next sym
        StringBuilder symLex = new StringBuilder();
        int symKind = noSym;
        while (ch > EOF && ch <= ' ') getChar();

In general this was badly answered (with some outstanding exceptions) . Some solutions submitted had not even thought to incorporate loops for reading the digits, and some had even confused the roles of scanner and parser. Most submissions did not deal with comments at all, which was alarming, as I know that this had caused some difficulty for some people in the practical.

One way to write the scanner would be:

      void getSym() {                          // Scans for next sym
        while (ch > 0 && ch <= ' ') getChar();
        StringBuilder symLex = new StringBuilder();
        int symKind = noSym;
        while (ch > EOF && ch <= ' ') getChar();

        // Firstly, check to see if we found a comment (or, even a sequence of comments with
        // nothing of much interest between them).

        if (ch == ';') {
          do getChar(); while (ch != '\n' && ch != EOF);

          // A comment is NOT a token, so if we skip over a comment we must still continue
          // to search for the symbol that getSym() is supposed to be returning.  One way
          // to do this is to make a recursive call to getSym() - having first ensured that
          // we are truly over the comment
          // If we get to EOF the returned token will then be eofSym; maybe issue a warning?

          getChar(); getSym();                 // get the hell out of here once we have the token
          return;                              // that follows the comment, (or comments in succession)
                                               // (will be EOFSym if unterminated -
        } // comment processor

        // If we didnt find a comment, we now proceed to recognise a proper token.

        if (Character.isDigit(ch) {            // We must have found an integer or float number
          symKind = intSym;                    // Make an initial assumption
          do {
            symLex.append(ch); getChar();      // Note that we add the digits to the buffer
          } while (Character.isDigit(ch));     // Scan over the only digits, or ones before the period.
          if (ch == '.') {                     // Aha! - we must have a floating number in that case,
            symKind = floatSym;                // so change our assumption
            do {                               // and incorporate the digits after the point.
              symLex.append(ch); getChar();
            } while (Character.isDigit(ch);
          }
        } // finished  dddd.dd  or  dddd.   processing

        else                                   // the else is crucial - do you see why?

        if (ch == '.') {                       // we must have found dotdot or a float number,
                                               // like   .ddd    or a noSym, like   .ball
          symLex.append(ch); getChar();        // add to the buffer
          if (ch == '.') {                     // Aha! - we found dotdot
            symLex.append(ch); getChar();
            symKind = dotdotSym;
          }
          else                                 // But in this case we found  .ddd or some .bad sequence
            if (Character.isDigit(ch) {        // Aha! - we must have found a float number
              symKind = floatSym;
              do {
                symLex.append(ch); getChar();  // keep adding digits to the buffer
              } while (Character.isDigit(ch)); // and looking for a non-digit to stop the loop
            }
        } // finished  .dddd or  ..  or .bad processing

        else                                   // another crucial else
          if (ch == EOF) {
            symLex.append("EOF");              // no need to getChar() this time
            sym.kind = eofSym;
          }

        else {
          symLex.append(ch); getChar();        // symKind is noSym anyway
        }
        sym = new Token(symKind, symLex.toString());
      }

Another way might appear to be like this, which some people tried

... if (Character.isDigit(ch) { // must have found a number symKind = intSym; // initial assumption while (isdigit(ch) || ch == '.' ) { // scan over the rest symLex.append(ch); getChar(); if (ch == '.') symKind = floatSym; // change our assumption } } else ...

but this is subtly WRONG - it would accept something like 45.67.89 as a valid floating number (can you see why? If not, better come to ask me!)