Test on week 23

Computer Science 3 - 2008 - Test on Prac 23 - Solutions

1. No funny stuff this time- this is a serious question - What is your name (surname)? [1]

Pat (Terry) 63T0844

2. Here is part of an attempt to write a scanner for the system that you developed last week. The author has made an attempt to handle comments, realising that a comment is not a token. The code is inadequate. Briefly explain why this should be so and how it sould be fixed. [4]

   void getSym() {  // Scans for next sym
     while (ch > 0 && ch <= ' ') GetChar();
     StringBuilder symLex = new StringBuilder();
     int symKind;
     symLex.append(ch);
     switch (ch) {
       case '\0' : symKind = EOFSym; symLex = new StringBuilder("EOF"); break;
       case '\\' : do getChar(); while (ch != '\\');
                   getChar(); getSym(); break;
       case ';'  : symKind = SemicolonSym; getChar(); break;
       // ..... more like that
       default   : symKind = noSym; getChar(); break;
     }
     sym = new Token(symKind, symLex.toString());
   }

The inadequacy arises because no attempt has been made to guard against encountering a comment that begins, but does not close, before end-of-file is reached. Furthermore, the break; after the getChar(); getSym(); sequence must be replaced by a return;, so that the recursion will "unwind" properly. This leads to code like this:

        case '\\' : do getChar(); while (ch != '\\' && ch != '\0');
                    if (ch == '\0') abort("unclosed comment");
                    getChar(); getSym(); return;

Some strange misconceptions showed up in the submissions. Firstly, the sequence '\\' in Java code does not represent a string of two \ characters, it represents the \ character - this has to be done using an "escaped character sequence" idea. Secondly, do not confuse the do-while loop with the other while loop, as some people did. Thirdly, note that a comment is not a token - the scanner must not return a CommentSym, for example. Lastly, note that a recursive solution like this works very well - it handles perceived problems like "What about situations where one comment follows another immediately, as in \ comment \ \ comment \ or even \comment\\comment\".

3. Suppose we wanted to write a scanner (similar to but simpler than) the one above that would be able to recognize only two valid tokens - floating point numbers and integer numbers. Assume that these are described by

       float   = digit { digit } '.' { digit } .
       integer = digit { digit }

The routine must assign either floatSym or intSym to sym.kind before leaving the routine (or assign the value of noSym if it comes across an invalid token). Complete the skeleton code below to show how this might be done. [8]

      void getSym() {  // Scans for next sym
        while (ch > 0 && ch <= ' ') getChar();
        StringBuilder symLex = new StringBuilder();
        int symKind;
        if (Character.isDigit(ch) {  // must have found a number

        } else { // we found something unrecognizable
          symLex.append(ch); symKind = noSym; getChar();
        }
        sym = new Token(symKind, symLex.toString());
      }

This was very badly answered - some solutions submitted had not even though of incorporating loops for reading the digits, and some had clearly totally confused the roles of scanner and parser. The best way to do this may be as follows:

      void getSym() {  // Scans for next sym
        while (ch > 0 && ch <= ' ') getChar();
        StringBuilder symLex = new StringBuilder();
        int symKind;
        if (Character.isDigit(ch) {        // must have found a number
          symKind = intSym;                // initial assumption
          do {
            symLex.append(ch); getChar();
          } while (Character.isDigit(ch)); // scan over the first few digits
          if (ch == '.') {                 // okay, we must have a floating number
            symKind = floatSym;            // so change our assumption
            do {                           // now scan past the digits after the point
              symLex.append(ch); getChar();
            } while (Character.isDigit(ch);
          }
        } else {                           // we found something unrecognizable
          symLex.append(ch); symKind = noSym; getChar();
        }
        sym = new Token(symKind, symLex.toString());
      }

Another way might appear to be like this, which some people tried

void getSym() { // Scans for next sym while (ch > 0 && ch <= ' ') getChar(); StringBuilder symLex = new StringBuilder(); int symKind; if (Character.isDigit(ch) { // must have found a number symKind = intSym; // initial assumption while (isdigit(ch) || ch == '.' ) { // scan over the rest symLex.append(ch); getChar(); if (ch == '.') symKind = floatSym; // change our assumption } } else { // we found something unrecognizable symLex.append(ch); symKind = noSym; getChar(); } sym = new Token(symKind, symLex.toString()); }

but this is subtly WRONG - it would accept something like 45.67.89 as a valid floating number (can you see why? If not, better come to ask me!)

4. For the Regular Expression application, is it necessary or advisable to introduce the noSym concept? If so, why so, and if not, why not? [3]

Well, I was looking for the insight that I hoped you would have had - in this particular application there were really NO unknown tokens - every character that was not a metacharacter was actually an atom! Sadly not everyone seemed to know that. However, noSym would be useful for situations where the EscapedSym was not correctly defined.

In other applications - that is, in general - the idea of having an noSym is a good one - if the scanner does not recognise a valid token it then returns noSym, and leaves the parser to sort out the mess (just as having the concept of EOFSym is a good idea - if the scanner can't find a token at all then it returns a pseudo-token EOFSym, and leaves the parser to sort out the mess).

5. Suppose we have a grammar with the PRODUCTIONS

     PRODUCTIONS
        A = { C [ B] } [ B ] .
        B = number | identifier .
        C = '(' string ')' .

Assuming that you have an accept routine like the one you used last week, and a scanner that can recognise tokens that might be described by the enumeration

numSym, identSym, stringSym, LParenSym, RParenSym, EOFSym, noSym

how would you complete the parser routines below? [6] This is not an LL(1) grammar. Does this totally invalidate the parser? Explain why or why not. [3]

    void A() {
    //  A = { C [ B ] } [ B ] .
    }

    void B() {
    //  B = number | identifier .
    }

    void C() {
    //  C = '(' string ')' .
    }

This was not always well done, and it is very easy. Indeed, some submissions were so badly wrong that I doubt whether the people concerned had done anything in the practical at all. An important idea that some people have missed is that parser routines are best "driven" by checking for FIRST tokens, not by negative tests for FOLLOW tokens. So we need something like this:

    void A() {
    //  A = { C [ B] } [ B ] .
      while (sym.kind == LParenSym) {
        C();
        if (sym.kind == numSym || sym.kind == identSym) B();
      }
      if (sym.kind == numSym || sym.kind == identSym) B();
    }

    void B() {
    //  B = number | identifier .
      if (sym.kind == numSym || sym.kind == identSym) getSym;
      else reportError("invalid B");
    }

    void C() {
    //  C = '(' string ')' .
      accept(LParenSym, "( expected");
      accept(stringSym, "string expected");
      accept(RParenSym, ") expected");
    }

Normally each routine should be written in its own "context", that is, be self-contained. The routine for A should not worry about what B has to do, other than to check whether B should be called. Coding like that below, which a few people did, "works", but is illogical - it works only if there are no other productions in the system that depend on A, B or C.

void A() { // A = { C [ B ] } [ B ] . while (sym.kind == LParenSym) { getSym(); C(); if (sym.kind == numSym || sym.kind == identSym) getSym(); } if (sym.kind == numSym || sym.kind == identSym) getSym(); } void B() { // B = number | identifier . // nothing, already handled in A } void C() { // C = '(' string ')' . // already accepted ( in A, so start from string accept(stringSym, "string expected"); accept(RParenSym, ") expected"); }

This is not an LL(1) grammar, but it will not really invalidate the parser. It will, however, bind the first B to the { C [ B ] } construct.

So C B will be parsed as C B eps and not as C eps B

A bit like a "dangling else", in a sense. Of course, in an application where any actions to be associated with B were different for each B there would be trouble.