Practical 18

Computer Science 3 - 2001

Programming Language Translation

Practical for Week 18, beginning 20 August 2001

Hand in this prac sheet before lunch time on your next practical day, correctly packaged in a transparent folder with your solutions and the "cover sheet". Please do NOT come to a practical and spend the first hour printing or completing solutions from the previous week's exercises. Since the practical will have been done on a group basis, please hand in one copy of the cover sheet for each member of the group. These will be returned to you in due course, signed by the marker, and you will be asked to sign to acknowledge that you have received your own copy.

Objectives

The objective of this practical is to allow you to develop a recursive descent parser "from scratch" for a simple application.

To hand in

Please hand in a listing of the final version of your source program, and some listings of input and output files, and a diskette with the source file, the executable, and files of test data.

Task 1 - a trivial task

Unpack the prac kit PRAC18.ZIP. In it you will find the files misc.h and set.h, the skeleton of a system adapted for intermediate testing of a scanner, and two simple data files - but you really need to learn to develop your own test data.

Task 2 - a chance to do some real development!

WARNING. This exercise really requires you to do some real thinking and planning. Please do not just sit at a computer and hack away as most of you are wont to do. Sit in your groups and discuss your ideas with one another and with the demonstrators. If you don't do this you will probably find that the whole exercise turns into a nightmare, and I don't want that to happen.

It is generally acknowledged, even by experts, that the syntax of declarations in C and C++ can be quite difficult to understand. This is especially true for programmers who have learned Pascal or Modula-2 before turning to a study of C or C++. Simple declarations like

    int x, list[100];

present few difficulties (x is a scalar integer, list is an array of 100 integers). However, in developing more abstruse examples like

    char **a;      // a is a pointer to a pointer to a character
    int *b[10];    // b is an array of 10 pointers to single integers
    int (*c)[10];  // c is a pointer to an array of 10 integers
    bool *d();     // d is a function returning a pointer to a bool
    char (*e)();   // e is a pointer to a function returning a character

it is easy to confuse the placement of the various brackets, parentheses and asterisks, perhaps even writing syntactically correct declarations that do not mean what the author intended. By the time one is into writing (or reading) declarations like

    bool (*(*f())[])();
    int  (*(*g[50])())[15];

there may be little consolation to be gained from learning that C was designed so that the syntax of declarations (defining occurrences) should mirror the syntax for access to the corresponding quantities in expressions (applied occurrences).

Incidentally, an excellent discussion of how humans can unravel such declarations is to be found in the excellent book by my friend Kim King: "C Programming - a modern approach" (Norton, 1996, pages 408 - 415) a copy of which can be made available on request. .

A grammar that describes many of the forms that such declarations can take might be expressed in Cocol as follows:

   COMPILER Cdecls
   /* Describe a subset of the forms that C declarations can assume */

   CHARACTERS
     digit  = "0123456789" .
     letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_" .

   IGNORE CHR(1) .. CHR(31)

   TOKENS
     number = digit { digit } .
     ident  = letter { letter | digit } .

   PRODUCTIONS
     Cdecls   = { DecList } EOF .
     DecList  = Type OneDecl { "," OneDecl } ";"  .
     Type     = "int" | "void" | "bool" | "char" .
     OneDecl  = "*" OneDecl | Direct .
     Direct   = ( ident | "(" OneDecl ")" ) [ Suffix ] .
     Suffix   = Array { Array } | Params .
     Params   = "(" [ OneParam { "," OneParam } ] ")" .
     OneParam = Type [ OneDecl ] .
     Array    = "[" [ number ] "]" .
   END Cdecls.

Tempting as it might be simply to use Coco/R to produce a program that will recognize declarations, this week we should like you to produce such a recognizer more directly, by developing a program in the spirit of those you will find in the textbook on pages 131 and 133.

To do this proceed as follows:

Develop a main function that will

check that the program has been given a parameter (the file name of a data file)
from this file name derive an output file name with a different extension
open these two files, checking that no mishaps occur (that is straightforward, or should be if you did Prac 15 properly!)
initialise the "character handler"
initialise the "scanner"
call the "parser" by calling the routine that is to parse the goal symbol
close the output file and report that the system parsed correctly.

But you might try this in easier stages! So for Task 2, study the skeleton program from the kit as shown below, Firstly add in the file handling stuff, and develop the character handler as a very simple routine, say GetChar, that will simply read the next character ch from the input file each time it is called - but be careful to have it behave sensibly if there is no "next" character (ie if you have already reached the end of file). I suggest that you write the routine to return a known (NUL) character if it tries to read past the end of file.

Task 3

Next, develop the scanner as a routine GetSym, whose goal in life is to recognize tokens. Tokens for this application could be defined by an enumeration

  enum SYMBOLS { EOFSym, unknownSym, intSym, charSym, boolSym, voidSym,
                 numSym, identSym, lparenSym, rparenSym, lbrackSym, rbrackSym,
                 pointerSym, commaSym, semicolonSym};

The scanner can (indeed, must) be developed on the pretext that an initial character ch has been read. When called, it must (if necessary) read past any "white space" in the input file until it comes to a character that can form part (or all) of a token. It must then read as many characters as are required to identify a token, and assign the corresponding value from the enumeration to a variable called, say, sym - and then read the next character ch (remember that the parsers we are discussing always look one position ahead in the source).

Test the scanner with the skeleton program which should be able to scan the data file and simply tell you what tokens it can find, using the simple loop in the main function as supplied.

    // Parse sequence of C declarations
    // Usage: CDecl DataFile [ ListFile ]
    // YOUR NAME HERE

    #include "misc.h"
    #include "set.h"

    // You may need other global variables, but try to keep these to a minimum

    char outName[256];              // string for results file name
    FILE *inFile, *outFile;         // Data and results

    // Declaring ch as a global variable is done for expediency - global variables
    // are not always a good thing

    char ch;                        // Lookahead character for scanner

    void GetChar (void) {
    // Obtains next character ch from inFile, or CHR(0) if EOF reached

    }

    // +++++++++++++++++++++++++ Scanner +++++++++++++++++++++++++

    // Define token enumeration

    enum SYMBOLS { EOFSym, unknownSym, intSym, charSym, boolSym, voidSym,
                   numSym, identSym, lparenSym, rparenSym, lbrackSym, rbrackSym,
                   pointerSym, commaSym, semicolonSym};

    // Declaring sym as a global variable is done for expediency - global variables
    // are not always a good thing

    SYMBOLS sym;                    // Lookahead token for parser

    // The <semicolonSym> in the next declaration ensures that the sets will all be
    // able to contain elements in the range EOFSym .. semicolonSym

    typedef Set<semicolonSym> SymSet; // Set type for FIRST sets

    void GetSym (void) {
    // Scans for next SYMBOL value from inFile, store it in global sym

    }

    // +++++++++++++++++++++++++ Parser ++++++++++++++++++++++++++

    // To be supplied later

    // +++++++++++++++++++++++++ Driver ++++++++++++++++++++++++++


    void main (int argc, char *argv[])  {

      // Check command line parameters, open files

      // Initialise character handler
      // you may have to add a bit to this ....

       GetChar();

       // Test scanner in simple loop

       do {
         GetSym();
         // Write sym in some convenient form for testing purposes
       } while (sym != EOFSym);

       // Close files
    }

Task 4

Now develop the parser as a set of routines, one for each of the non-terminals suggested in the grammar above. These functions should, where necessary, simply call on the GetSym scanner routine to deliver the next token from the input. As discussed in chapter 10, the system hinges on the premise that each time a parsing routine is called (including the initial call to the goal routine) there will already be a token waiting in the variable sym, and whenever a parsing routine returns, it will have obtained the follower token in readiness for the caller to continue parsing (see discussion on page 130). It is to make communication between all these routines easy that we declare the lookahead character ch and the lookahead token sym to be global variables.

Of course, anyone can write a recognizer for input that is correct. The clever bit is to be able to spot incorrect input, and to react by reporting an appropriate error message. For the purposes of this exercise it will be sufficient first to develop a simple routine on the lines of the accept routine that you will see on page 133, that simply issues a stern error message, closes the output file, and then abandons parsing altogether. In task 5 we shall attempt to be kinder to our users!

Something to think about: If you have been following the lectures, you will know that associated with each nonterminal A is a set FIRST(A) of the terminals that can appear first in any string derived from A. Alarums and excursions (as they say in the classics). So that's why we have been learning about the SET template class - we can profitably use it here to help write nicely structured parsers that can check whether the lookahead token sym is a member of an appropriate set.

(Aside: In Pascal and Modula-2, as you may remember from an earlier practical, one does not need a special template class, sets are part of the language. There are probably many reasons why, in his genius, Wirth invented Pascal and Modula-2 to have native set types. Their usefulness for writing parsers and compilers is, I suspect, one of them.)

A note on testing

To test your parser you might like to make use of the data files supplied. One of these (CORRECT.CPP) has a number of correct declarations. The other (WRONG.CPP) has a number of incorrect declarations. Your parser should, of course, be able to parse CORRECT.CPP easily, and you should be able to "compile" this same file by issuing a command like BCC -c CORRECT.CPP just to verify this. Using BCC to "compile" WRONG.CPP should be fun, but parsing WRONG.CPP with your system will be a little more frustrating until you complete Task 5, as the parser will simply stop as soon as it finds the first error. You might like to create a number of "one- liner" data files to make this testing stage easier. Feel free to experiment!

Task 5

Extend your parsing routines to incorporate the error recovery strategy discussed in section 10.3.

Task 7 - for interest only

C declarations in general require a rather more complex grammar than the one given above. For example, the grammar will not allow declarations like

    int x = 10;                   // initializer
    int x[4] = {1 , 2 , 3 , 4};   // initializer
    static int x[10];             // modifier
    void func (int **);           // no parameter name
    struct { int a; char b} x, y; // structures

and while it will correctly allow declarations like

    void main (int argc, char *argv[]);

it will incorrectly allow declarations like

    int x[];                      // no size specified

You might like to ponder whether the grammar and parser can easily be changed to handle some or more of these situations.