Computer Science 3 - 2000

Programming Language Translation

Practical for Week 23, beginning 2 October 2000

Hand in this prac sheet on your next practical day, correctly packaged in a transparent folder and with your solutions. This prac sheet forms the "cover sheet". Since the practical will have been done on a group basis, please hand in one copy of the prac sheet for each member of the group. These will be returned to you in due course, signed by the marker, and you will be asked to sign to acknowledge that you have received your own copy.

Your surname: (PRINT CLEARLY)          Your prac day:

Your student number: (PRINT CLEARLY)

Your signature:                        Your group partners:


Other people with whom you collaborated (you need not give tutors' names):



Mark awarded:                          Tutor's signature

Objectives

The objective of this practical is to allow you to develop a recursive descent parser "from scratch" for a simple application.

To hand in

Please hand in a listing of the final version of your source program (in either Pascal or C++) and some listings of input and output files, and a diskette with the source file, the executable, and files of test data.

Task 1 - a trivial task

Unpack the prac kit PRAC23.ZIP.

You won't find much this time. C++ users will find the files misc.h and set.h, and Pascal users will find a Pascal translation of the appendextension routine useful for deriving one file name from another. There are also two simple data files, but you really need to learn to develop your own test data.

Task 2 - a chance to do some real development!

WARNING. This exercise really requires you to do some real thinking and planning. Please do not just sit at a computer and hack away as most of you are wont to do. Sit in your groups and discuss your ideas with one another and with the demonstrators. If you don't do this you will probably find that the whole exercise turns into a nightmare, and I don't want that to happen.

In the text, on pages 50-52, you will find a description of regular expressions. If we incorporate various of the extensions mentioned on page 52 we get to the point where we might allow regular expressions like

a* s d | g | ( i o o ) ; \ a very simple one \
a.s.d? \ d is optional \ | 2 | [ a-s BC A-S 0-9]* (e | f+) ;

where ; is used as a meta character to separate regular expressions in these examples - it is not normally special in regular expression theory. We can describe these interesting animals using Cocol something as follows:

      COMPILER RE
      /* Regular expression grammar
         P.D. Terry, Rhodes University, 2000 */

      CHARACTERS
        control  = CHR(1) .. CHR(31) .
        noquote1 = ANY - control - "'" - CHR(0) .
        noquote2 = ANY - control - '"' - CHR(0) .
        meta     = "()*|.;[]-?+" .
        simple   = ANY - control - "'" - '"' - meta - CHR(0) .

      IGNORE  control

      COMMENTS FROM "\" TO "\"

      TOKENS
        atomic  = simple .
        escaped = "'" noquote1 "'" | '"' noquote2 '"' .

      PRODUCTIONS
        RE         = { Expression ";" } EOF .
        Expression = Term { "|" Term } .
        Term       = Factor { [ "." ] Factor } .
        Factor     = Element [ "*" | "?" | "+" ] .
        Element    = Atom | Range | "(" Expression ")" .
        Range      = "[" OneRange { OneRange } "]" .
        OneRange   = Atom [ "-" Atom ] .
        Atom       = atomic | escaped .
      END RE.

Tempting as it might be simply to use Coco/R to produce a program that will recognize regular expressions, this week we should like you to produce such a recognizer more directly, by developing a program in the spirit of those you will find in the textbook on pages 131 and 133.

To do this proceed as follows:

Develop a main function that will

- check that the program has been given a parameter (the file name of a data file)
- from this file name derive an output file name with a different extension
- open these two files, checking that no mishaps occur
(all of that is straightforward, or should be if you did Prac 20 properly!)

- initialise the "character handler"
- initialise the "scanner"
- call the "parser" by calling the routine that is to parse the goal symbol
- close the output file and report that the system parsed correctly.

Develop the character handler as a very simple routine, say GetChar, that will simply read the next character CH from the input file each time it is called - but be careful to have it behave sensibly if there is no "next" character (ie if you have already reached the end of file). I suggest that you write the routine to return a known (NUL) character if it tries to read past the end of file.

Develop the scanner as a routine GetSym, whose goal in life is to recognize tokens. Tokens for this application could be defined by an enumeration

  enum SYMBOLS { EOFSym, SemicolonSym, BarSym, LParenSym, LBrackSym, RParenSym,
                 RBrackSym, ConcatSym, OptionSym, RepeatSym, PlusSym, HyphenSym,
                 AtomSym, EscapedSym};

or, in Pascal

  TYPE
    SYMBOLS = (EOFSym, SemicolonSym, BarSym, LParenSym, LBrackSym, RParenSym,
               RBrackSym, ConcatSym, OptionSym, RepeatSym, PlusSym, HyphenSym,
               AtomSym, EscapedSym);

The scanner can (indeed, must) be developed on the pretext that an initial character CH has been read. When called, it must (if necessary) read past any comments and/or spaces in the input file until it comes to a character that can form part (or all) of a token. It must then read as many characters as are required to identify a token, and assign the corresponding value from the enumeration to a variable called, say, Sym - and then read the next character CH (remember that the parsers we are discussing always look one position ahead in the source).

Develop the parser as a set of routines, one for each of the non-terminals suggested in the grammar above. These functions should, where necessary, simply call on the GetSym scanner routine to deliver the next token from the input. As before, the system hinges on the premise that each time a parsing routine is called (including the initial call to the goal routine) there will already be a token waiting in the variable Sym, and whenever a parsing routine returns, it will have obtained the follower token in readiness for the caller to continue parsing (see discussion on page 130). To make communication between the routines easy, declare the lookahead character CH and the lookahead token Sym to be global variables.

Of course, anyone can write a recognizer for input that is correct. The clever bit is to be able to spot incorrect input, and to react by reporting an appropriate error message. For the purposes of this exercise it will be sufficient first to develop a simple routine on the lines of the accept routine that you will see on page 133, that simply issues a stern error message, closes the output file, and then abandons parsing altogether. In task 3 we shall attempt to be kinder to our users!

Something else to think about. If you look at the Cocol specification above, and if you have been following the lectures, you should be able to see that associated with each nonterminal A is a set FIRST(A) of the terminals that can appear first in any string derived from A. Alarums and excursions (as they say in the classics). So that's why we have been learning about the SET template class - we can profitably use it here to help write nicely structured parsers that can check whether the lookahead token Sym is a member of an appropriate set.

(Aside: In Pascal one does not need a special template class, sets are part of the language. There are probably many reasons why, in his genius, Wirth invented Pascal and Modula-2 to have native set types. Their usefulness for writing parsers and compilers is, I suspect, one of them.)

Task 3

Extend your parsing routines to incorporate the error recovery strategy discussed in section 10.3