Please answer all questions in the spaces provided (four sides). Max 30
This test aims to check your ability to write simple scanners and parsers. TEXTBOOKS and NOTES MAY NOT BE USED.
1. You can get this 100% right! What are your name (surname) and student number? [1]
Pat (Terry) 63T0844
2. Suppose we have a Cocol grammar
COMPILER Declarations /* Describe a subset of the forms that declarations can assume in a simple C-like language */ CHARACTERS digit = "0123456789" . letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" . TOKENS ident = letter { { "-" } (letter | digit) } . COMMENTS FROM "{" TO "}" NESTED IGNORE CHR(1) .. CHR(31) // CHR(0) will be used to denote EOF PRODUCTIONS Declarations = { DecList } EOF . DecList = Type OneDecl { "," OneDecl } ";" . Type = "int" | "void" | "bool" | "char" . OneDecl = "*" OneDecl | Direct . Direct = ( ident | "(" OneDecl ")" ) [ Params ] . Params = "(" [ OneParam { "," OneParam } ] ")" . OneParam = Type [ OneDecl ] . END Declarations.
Assuming that you have accept and abort routines like those you used in this last week, and a scanner getSym() that can recognise tokens that might be described by the enumeration
EOFSym, noSym, identSym, intSym, voidSym, boolSym, charSym, commaSym, lparenSym, rparenSym,
semicolonSym, starSym;
how would you complete the parser routines below?
Most people were fairly close to having correct solutions. Those that were not were very depressing, and, frankly, I wondered whether the students concerned had contributed to the prac submissions at all.
Several submissions added EOFSym into FirstDeclarations. EOFSym is not a "real" terminal. It is injected into the grammar above as a convenient way of making sure that an error message will result if the input ends prematurely.
static IntSet FirstDeclarations = new IntSet(intSym, voidSym, boolSym, charSym), FirstDeclList = new IntSet(intSym, voidSym, boolSym, charSym), FirstType = new IntSet(intSym, voidSym, boolSym, charSym), FirstOneDecl = new IntSet(starSym, identSym, lparenSym), FirstDirect = new IntSet(identSym, lparenSym), FirstParams = new IntSet(lparenSym), FirstOneParam = new IntSet(intSym, voidSym, boolSym, charSym);
Many people referred to FirstDeclarations instead of FirstDeclList in the next method. Well, yes, they are the same in this case, but they might have been different. Take care to get the code matched properly to the produccctions.
static void Declarations () { // Declarations = { DecList } EOF . while (FirstDeclList.contains(sym.kind)) DecList(); accept(EOFSym, "EOF expected"); } static void DecList () { // DecList = Type OneDecl { "," OneDecl } ";" . Type(); OneDecl(); while (sym.kind == commaSym) { getSym(); OneDecl(); } accept(semicolonSym, "; expected"); } static void Type () { // Type = "int" | "void" | "bool" | "char" . if Firsttype.contains(sym.kind) getSym(); else abort("invalid type"); }
Here's an interesting variation. Suppose we were to overload the accept method as sollows
static void accept(int wantedSym, String errorMessage) { // Checks that lookahead token is wantedSym if (sym.kind == wantedSym) getSym(); else abort(errorMessage); } // accept (simple) static void accept(IntSet allowable, String errorMessage) { // Checks that lookahead token is a member of a set of allowed tokens if (allowable.contains(sym.kind)) getSym(); else abort(errorMessage); } // accept (set)
then we would be able to write the Type parser as:
static void Type () { // Type = "int" | "void" | "bool" | "char" . accept(FirstType, "invalid type"); }
A variation which might be useful in other applications as well. Moving right along -
static void OneDecl () { // OneDecl = "*" OneDecl | Direct . if (sym.kind == starSym) { getSym(); oneDecl(); } else if (FirstDirect.contains(sym.kind)) Direct(); else abort("invalid start to OneDecl"); }
Several submissions simplified this to the following, leaving out the "abort" call:
static void OneDecl () { // OneDecl = "*" OneDecl | Direct . if (sym.kind == starSym) { getSym(); oneDecl(); } else Direct(); }
This is living dangerously! If a production (or part of a production) incorporates a list of alternatives, you should always cater for the possibility that the text to be parsed matches none of the possibilities, and be prepared to flag an error in this case.
Incidentally, a remarkable number of submissions misinterpreted the production
OneDecl = "*" OneDecl | Direct
treating it as
OneDecl = "*" ( OneDecl | Direct )
A "safe" way to code Direct would be
static void Direct () { // Direct = ( ident | "(" OneDecl ")" ) [ Params ] . if (sym.kind == identSym()) getSym; else if (sym.kind == lparenSym) { getSym; OneDecl(); accept(rparenSym, ") expected"); } else abort("invalid start to Direct"); if (sym.kind == lparenSym) Params(); }
although the following is also "safe" (why - can you figure it out), and is a bit shorter. The point is that one should not assume that when Direct is called there will be an identSym or lparenSym as the current token.
static void Direct () { // Direct = ( ident | "(" OneDecl ")" ) [ Params ] . if (sym.kind == identSym()) getSym; else { accept(lparenSym, "( expected"); OneDecl(); accept(rparenSym, ") expected"); } if (sym.kind == lparenSym) Params(); } static void Params () { // Params = "(" [ OneParam { "," OneParam } ] ")" . accept(lparenSym, "( expected"); // not a good idea to simplify to getSym() - why? if (FirstOneParam.contains(sym.kind)) { OneParam(); while (sym.kind == commaSym) { getSym(); OneParam(); } } accept(rparenSym, ") expected"); } static void OneParam () { // OneParam = Type [ OneDecl ] . Type(); if (FirstOneDecl.contains(sym.kind)) OneDecl(); }
3. A student has been trying to develop a scanner for the system. The part of it that deals with recognizing identifiers and ignoring comments has led her to develop the following code. Sadly, it is not quite correct. Suggest the changes needed to correct it.
This was very badly done! Most people seemed not to have noticed that the scanner was required to discard (possibly) nested comments, and that an underscore in an identifier had to be followed by a letter or digit. And don't let the scanner call abort. The scanner should simply return noSym as the value of sym.kind and let the parser react accordingly.
The skeleton code was as follows.
static void getSym() { // Scans for next sym StringBuilder symLex = new StringBuilder(); int symKind = noSym; // assume the worst! while (ch > EOF && ch <= ' ') getChar(); // whitespace if (ch == '{' ) { // deal with comments do { getChar(); } while (ch != '}'); } if (Character.isLetter(ch)) { // recognize identifiers do { symLex.append(ch); getChar(); } while (Character.isLetterOrDigit(ch)); symKind = literalKind(symLex); }
This made no attempt to handle nested comments, to continue to look for a token after dealing with a comment, to guard against unclosed comments, and so on. It also made no attempt to handle identifiers with underscores that have to be followed by a letter or digit. A Java solution follows (C# code is effectively identical, of course):
static int literalKind(StringBuilder lex) { // handle keywords String s = lex.toString(); if (s.equals("bool")) return boolSym; if (s.equals("char")) return charSym; if (s.equals("int")) return intSym; if (s.equals("void")) return voidSym; return identSym; } static void getSym() { // Scans for next sym StringBuilder symLex = new StringBuilder(); int symKind = noSym; // assume the worst! while (ch > EOF && ch <= ' ') getChar(); // whitespace if (ch == '{' ) { // deal with comments int nesting = 1; // level of nesting do { getChar(); if (ch == EOF) { // unclosed comment sym = new Token(EOFSym, "EOF"); // give up in disgust return; } if (ch == '{') nesting++; // increase nesting if (ch == '}') nesting--; // decrease nesting } while (nesting != 0); // when the loop terminates our // job is not yet done! getChar(); getSym(); // recursive call - still need to find a token return; } if (Character.isLetter(ch)) { // recognize identifiers do { symLex.append(ch); getChar(); if (ch = '_') { // must be followed letter | digit while (ch == '_') { // work through the underscores symLex.append(ch); getCh(); } // and now check the following character if (!Character.isLetterOrDigit(ch)) { // bad identifier - return noSym sym = new Token(noSym, symLex.toString()); return; } } } while (Character.isLetterOrDigit(ch)); symKind = literalKind(symLex); } else { // Please don't concern yourself with the rest of this method } sym = new Token(symKind, symLex.toString()); }