(a) Develop a regular expression and/or a Cocol token definition that describes identifiers of the sort used in many programming languages, subject to the restriction that underscores may not appear at the start or end of an identifier. but may appear in the interior of an identifier.
In common RE notation we might write
RE1 = letter = [A-Za-z] RE2 = digit = [0-9] RE3 = identifier = letter ( letter | digit | underscore+ ( letter | digit ) )*
In Cocol notation we would write something that looks much the same but note that the CHARACTERS section defines sets of characters, not strings!
CHARACTERS letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" . digit = "0123456789" . TOKENS identifier = letter { letter | digit | "_" { "_" } ( letter | digit ) } .
It might be thought that we could write the Cocol definition more simply, as
TOKENS identifier = letter { letter | digit | "_" ( letter | digit ) } .
but this is not quite correct. Can you see why not? What would the equivalent be in common RE notation?
(b) Develop a regular expression and/or a Cocol description that describes comments of the form allowed in the C family of languages.
CHARACTERS inComment = ANY - "*" - "/" . TOKENS Comment = "/*" { inComment } "*/" .
However, the regular expression forming part of this directive has some fairly obvious shortcomings. The trick is that * and / characters can occur "internally", as can /*, but of course the combination */ cannot occur internally.
The problem as posed hinted that the following were some examples of strings that might be valid comments, and invited the reader to spot the odd one out.
(1) /* comment */ (2) /* multiply a * b */ (3) /* divide a / b */ (4) /* ***** */ (5) /*********/ (6) /* ** /* // */ (7) /* // */ ** */ (8) /*a*b*/ (9) /* **a/b* ** */
It is, of course, (7) which is better described as a comment /* // */ followed by a bit of garbage ** */.
Finding a correct solution is quite tricky, though like so many classic exercises, once you see the solution it become "obvious". Here is a deceptively appealing attempt:
CHARACTERS inComment = ANY - "*" - "/" . TOKENS Comment = "/*" { inComment | "/" | "*" inComment } { "*" } "*/" .
Casual testing will suggest that the problem is now solved. However, this more complicated regular expression will still not handle some of the valid comments given above. In particular it will still not recognize 4, 6 , 7, 9. A better definition might take some time to think through and verify:
CHARACTERS inComment = ANY - "*" - "/" . TOKENS Comment = "/*" { inComment | "/" | "*" { "*" } inComment } { "*" } "*/" .
It is left as an exercise to write this in common RE form - try it, and it may convince you how much clearer the Cocol notation is!
(c) Develop a Cocol grammar to describe a list of names of some well loved (well, well respected) folk who have titles, first names and/or initials, surnames, and (usually) qualifications - for example
Professor P. D. Terry, MSc, PhD . George Clifford Wells, BSc(Hons), MSc, PhD . C. Hubert H. Parry, BMus. Greg G. Foster, PhD . Dave A. Sewry, PhD . Dr Karen Lee Bradshaw, MSc, PhD . Caro Watkins, BSc . Professor Richard J. Foss, PhD. Your Name One Day, BSc(InfSys), PhD . Mx Mic Halse . Madonna . Dr Mabizela.
This can be attempted in several ways. As always, it is useful to try to introduce non-terminals for the items of semantic interest. Here is one attempt at a solution:
COMPILER Staff1 $CN /* Describe a list of academic staff in a department Non-LL(1), and will not work properly P.D. Terry, Rhodes University, 2015 */ CHARACTERS uLetter = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" . lLetter = "abcdefghijklmnopqrstuvwxyz" . letter = uLetter + lLetter . TOKENS name = uLetter { letter | "'" uLetter | "-" uLetter } . initial = uLetter "." . IGNORE CHR(0) .. CHR(31) PRODUCTIONS Staff1 = { Person } EOF . Person = [ Title ] { name | initial } surname { "," Degree } SYNC "." . Title = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" . surname = name . Degree = "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci" | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" . END Staff1.
Note that this allows for names like Smith and Malema and also for names like MacKay, O'Neill and Lee-Ann.
Although this correctly describes a staff list, it is useless for a simple parser, as surnames and names are lexically indistinguishable. This is an example of an LL(1) violation (about which we shall learn more shortly).
Attempting to define a better grammar affords interesting insights. It is not difficult to come up with productions that permit full names with leading initials (only) or to contain complete names only:
PRODUCTIONS Staff2 = { Person } EOF . Person = [ Title ] FullName { "," Degree } SYNC "." . FullName = InitialsName | name { name } . InitialsName = initial { initial } name . Title = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" . Degree = "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci" | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" . END Staff2.
but this is too restrictive. Another approach is to relax the distinction between initials and complete names:
PRODUCTIONS Staff3 = { Person } EOF . Person = [ Title ] FullName { "," Degree } SYNC "." . FullName = ( initial | name ) { initial | name } . Title = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" . Degree = "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci" END Staff3.
but this has the unfortunate effect that it allows a name made of initials only, or a name to have an initial as its last component, as exemplified by
P. Terry D.
One might be tempted to be very rigid about punctuation, and insist that a surname incorporate a final periond (if the person has no qualifictions) or a final comma (for persons that have qualifications. This is too restrictive, but it is possible to find a much better solution. In effect, this picks up from the partial solution given earlier for the "initials only" and the "names only" restrictions. We have to ensure that the productions ensure that a full name is a sequence of initials and names terminated by a name:
PRODUCTIONS Staff9 = { Person } EOF . Person = [ Title ] FullName { "," Degree } SYNC "." . FullName = NameLast { NameLast } . NameLast = InitialsName | name . InitialsName = initial { initial } name . Title = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" . Degree = "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci" | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" . END Staff9.
and it can be refactorised in various ways, for example (various other solutions are given in the source kit):
PRODUCTIONS Staff11 = { Person } EOF . Person = [ Title ] FullName { "," Degree } SYNC "." . FullName = NameLast { NameLast } . NameLast = [ initial { initial } ] name . Title = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" . Degree = "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci" | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" . END Staff11.