Practical 10

Computer Science 3 - 2002

Programming Language Translation

Practical for Week 10, beginning 22 April 2002

This prac is due for submission by lunch time on your next practical day, correctly packaged in a transparent folder as usual. Pracs should please be deposited in the hand-in box outside the lab. Only one set of listings is needed for each group, but please enclose as many copies of the cover sheet as are needed, one for each member of the group. These will be returned to you in due course.

Objectives

The objective of this practical is to allow you to develop a recursive descent parser "from scratch" for a simple application.

To hand in:

This week you are required to hand in, besides the cover sheet, a listing of the final version of your source program and some listings of input and output files, and a diskette with the source file, the executable, and files of test data.

Task 1 - a trivial task

Unpack the prac kit PRAC10.ZIP. In it you will find the files misc.h and set.h, the skeleton of a system adapted for intermediate testing of a scanner, and two simple data files - but you really need to learn to develop your own test data.

Task 2 - a chance to do some real development!

In the text, on pages 50-52, you will find a description of regular expressions. If we incorporate various of the extensions mentioned on page 52 we get to the point where we might allow regular expressions like

a* s d | g | ( i o o ) ; \ a very simple one \
a.s.d? \ d is optional \ | 2 | [ a-s BC A-S 0-9]* (e | f+) ;

where ; is being used as a meta character to separate regular expressions in these examples - it is not normally special in regular expression theory. We can describe these interesting animals using Cocol something as follows:

      COMPILER RE
      /* Regular expression grammar
         P.D. Terry, Rhodes University, 2002 */

      CHARACTERS
        control  = CHR(1) .. CHR(31) .
        noquote1 = ANY - control - "'" - CHR(0) .
        noquote2 = ANY - control - '"' - CHR(0) .
        meta     = "()*|.;[]-?+" .
        simple   = ANY - control - "'" - '"' - meta - CHR(0) .

      IGNORE  control
      COMMENTS FROM "\" TO "\"

      TOKENS
        atomic  = simple .
        escaped = "'" noquote1 "'" | '"' noquote2 '"' .

      PRODUCTIONS
        RE         = { Expression ";" } EOF .
        Expression = Term { "|" Term } .
        Term       = Factor { [ "." ] Factor } .
        Factor     = Element [ "*" | "?" | "+" ] .
        Element    = Atom | Range | "(" Expression ")" .
        Range      = "[" OneRange { OneRange } "]" .
        OneRange   = Atom [ "-" Atom ] .
        Atom       = atomic | escaped .
      END RE.

Tempting as it might be simply to use Coco/R to produce a program that will recognize regular expressions, this week we should like you to produce such a recognizer more directly, by developing a program in the spirit of those you will find in the textbook on pages 131 and 133.

WARNING. This exercise really requires you to do some real thinking and planning. Please do not just sit at a computer and hack away as most of you are wont to do. Sit in your groups and discuss your ideas with one another and with the demonstrators. If you don't do this you will probably find that the whole exercise turns into a nightmare, and I don't want that to happen.

To do this proceed as follows:

Develop a main function that will

check that the program has been given a parameter (the file name of a data file)
from this file name derive an output file name with a different extension
open these two files, checking that no mishaps occur (this has all been done for you)
initialise the "character handler"
initialise the "scanner"
call the "parser" by calling the routine that is to parse the goal symbol
close the output file and report that the system parsed correctly.

But you might try this in easier stages! So for Task 2, study the skeleton program from the kit as shown below, Firstly, develop the character handler as a very simple routine, say GetChar, that will simply read the next character ch from the input file each time it is called - but be careful to have it behave sensibly if there is no "next" character (ie if you have already reached the end of file). I suggest that you write the routine to return a known (NUL) character if it tries to read past the end of file.

Task 3

Next, develop the scanner as a routine GetSym, whose goal in life is to recognize tokens. Tokens for this application could be defined by an enumeration

  enum SYMBOLS { EOFSym, SemicolonSym, BarSym, LParenSym, LBrackSym, RParenSym,
                 RBrackSym, ConcatSym, OptionSym, RepeatSym, PlusSym, HyphenSym,
                 AtomSym, EscapedSym};

The scanner can (indeed, must) be developed on the pretext that an initial character CH has been read. When called, it must (if necessary) read past any comments and/or spaces in the input file until it comes to a character that can form part (or all) of a token. It must then read as many characters as are required to identify a token, and assign the corresponding value from the enumeration to a variable called, say, Sym - and then read the next character CH (remember that the parsers we are discussing always look one position ahead in the source).

Test the scanner with the skeleton program which should be able to scan the data file and simply tell you what tokens it can find, using the simple loop in the main function as supplied.

  // This will later be a Regular Expression Parser
  // For the moment it is a scanner testbed program

  // Usage: RegExp DataFile [ ListFile ]
  // Parse sequence of regular expressions in DataFile

  // PUT YOUR NAMES (SURNAMES!) HERE

  #include "misc.h"
  #include "set.h"

  // You may need other global variables, but try to keep these to a minimum

  // +++++++++++++++++++++++++ Character handler +++++++++++++++

  char outName[256];              // String for results file name
  FILE *inFile, *outFile;         // Data and results

  // Declaring ch as a global variable is done for expediency - global variables
  // are not always a good thing

  char ch;                        // Lookahead character for scanner

  void GetChar (void) {
  // Obtains next character ch from inFile, or CHR(0) if EOF reached

  // Fill in this simple code for yourselves

  }

  // +++++++++++++++++++++++++ Scanner +++++++++++++++++++++++++

  // Define token enumeration

  enum SYMBOLS { EOFSym, SemicolonSym, BarSym, LParenSym, LBrackSym, RParenSym,
                 RBrackSym, ConcatSym, OptionSym, RepeatSym, PlusSym, HyphenSym,
                 AtomSym, EscapedSym};

  // Declaring sym as a global variable is done for expediency - global variables
  // are not always a good thing

  SYMBOLS sym;                    // Lookahead token for parser

  // The <EscapedSym> in the next declaration ensures that the sets will all be
  // able to contain elements in the range EOFSym .. EscapedSym

  typedef Set<EscapedSym> SymSet; // Set type for FIRST sets

  // Here are some examples of constructed sets - you will need others

  SymSet Parentheses(LParenSym, RParenSym);
  SymSet Brackets(LBrackSym, RBrackSym);

  void GetSym (void) {
  // Scans for next SYMBOL value from inFile, store it in global Sym

  // Fill in this code!

  }

  // +++++++++++++++++++++++++ Parser ++++++++++++++++++++++++++

  // To be supplied later

  // +++++++++++++++++++++++++ Driver ++++++++++++++++++++++++++

  void main (int argc, char *argv[])  {

    // Check command line parameters, open files

    if (argc < 2) {
      fprintf(stderr, "Usage: RegExp DataFile [Results]"); exit(1);
    }
    if ((inFile = fopen(argv[1], "r")) == NULL) {
      fprintf(stderr, "Could not open %s", argv[1]); exit(1);
    }
    if (argc == 3) strcpy(outName, argv[2]);
    else appendextension(argv[1], ".lst", outName);
    if ((outFile = fopen(outName, "w")) == NULL) {
      fprintf(stderr, "Could not open %s", outName); exit(1);
    }

    // Initialise character handler
    // you may have to add a bit to this ....

     GetChar();

     // Test scanner in simple loop

     do {
       GetSym();

       // Write sym in some convenient form for testing purposes

     } while (sym != EOFSym);

    // If we get here everything must have been okay
    fprintf(stderr, "Parsed correctly");

    // Close output file correctly
    fclose(outFile);
  }

Task 4

Now develop the parser as a set of routines, one for each of the non-terminals suggested in the grammar above. These functions should, where necessary, simply call on the GetSym scanner routine to deliver the next token from the input. As discussed in chapter 10, the system hinges on the premise that each time a parsing routine is called (including the initial call to the goal routine) there will already be a token waiting in the variable sym, and whenever a parsing routine returns, it will have obtained the follower token in readiness for the caller to continue parsing (see discussion on page 130). It is to make communication between all these routines easy that we declare the lookahead character ch and the lookahead token sym to be global variables.

Of course, anyone can write a recognizer for input that is correct. The clever bit is to be able to spot incorrect input, and to react by reporting an appropriate error message. For the purposes of this exercise it will be sufficient first to develop a simple routine on the lines of the accept routine that you will see on page 133, that simply issues a stern error message, closes the output file, and then abandons parsing altogether. In task 5 we shall attempt to be kinder to our users!

Something to think about: If you have been following the lectures, you will know that associated with each nonterminal A is a set FIRST(A) of the terminals that can appear first in any string derived from A. Alarums and excursions (as they say in the classics). So that's why we have been learning about the SET template class - we can profitably use it here to help write nicely structured parsers that can check whether the lookahead token sym is a member of an appropriate set.

(Aside: In Pascal and Modula-2, as you may remember from an earlier practical, one does not need a special template class, sets are part of the language. There are probably many reasons why, in his genius, Wirth invented Pascal and Modula-2 to have native set types. Their usefulness for writing parsers and compilers is, I suspect, one of them.)

Task 5

Extend your parsing routines to incorporate the error recovery strategy discussed in section 10.3.