Tutorial 4 - Solutions

Computer Science 301 - Translators

Tutorial 4 - Regular Expressions and Cocol Grammars - Solutions

(a) Develop a regular expression and/or a Cocol token definition that describes identifiers of the sort used in many programming languages, subject to the restriction that underscores may not appear at the start or end of an identifier. but may appear in the interior of an identifier.

In common RE notation we might write

      RE1  =  letter     = [A-Za-z]
      RE2  =  digit      = [0-9]
      RE3  =  identifier = letter ( letter | digit | underscore+ ( letter | digit ) )*

In Cocol notation we would write something that looks much the same but note that the CHARACTERS section defines sets of characters, not strings!

    CHARACTERS
      letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
      digit  = "0123456789" .

    TOKENS
      identifier = letter { letter | digit | "_" { "_" } ( letter | digit ) } .

It might be thought that we could write the Cocol definition more simply, as

    TOKENS
      identifier = letter { letter | digit | "_" ( letter | digit ) } .

but this is not quite correct. Can you see why not? What would the equivalent be in common RE notation?

(b) Develop a regular expression and/or a Cocol description that describes comments of the form allowed in the C family of languages.

    CHARACTERS
      inComment = ANY - "*" - "/" .
    TOKENS
      Comment = "/*" { inComment } "*/" .

However, the regular expression forming part of this directive has some fairly obvious shortcomings. The trick is that * and / characters can occur "internally", as can /*, but of course the combination */ cannot occur internally.

The problem as posed hinted that the following were some examples of strings that might be valid comments, and invited the reader to spot the odd one out.

     (1)   /* comment */
     (2)   /* multiply a * b */
     (3)   /* divide a / b */
     (4)   /* ***** */
     (5)   /*********/
     (6)   /* ** /* // */
     (7)   /* // */ ** */
     (8)   /*a*b*/
     (9)   /* **a/b* ** */

It is, of course, (7) which is better described as a comment /* // */ followed by a bit of garbage ** */.

Finding a correct solution is quite tricky, though like so many classic exercises, once you see the solution it become "obvious". Here is a deceptively appealing attempt:

    CHARACTERS
      inComment = ANY - "*" - "/" .
    TOKENS
      Comment = "/*" { inComment | "/" | "*" inComment } { "*" } "*/" .

Casual testing will suggest that the problem is now solved. However, this more complicated regular expression will still not handle some of the valid comments given above. In particular it will still not recognize 4, 6 , 7, 9. A better definition might take some time to think through and verify:

    CHARACTERS
      inComment = ANY - "*" - "/" .
    TOKENS
      Comment = "/*" { inComment | "/" | "*" { "*" } inComment } { "*" } "*/" .

It is left as an exercise to write this in common RE form - try it, and it may convince you how much clearer the Cocol notation is!

(c) Develop a Cocol grammar to describe a list of names of some well loved (well, well respected) folk who have titles, first names and/or initials, surnames, and (usually) qualifications - for example

          Professor P. D. Terry, MSc, PhD .
          George Clifford Wells, BSc(Hons), MSc, PhD .
          C. Hubert H. Parry, BMus.
          Greg G. Foster, PhD .
          Dave A. Sewry, PhD .
          Dr Karen Lee Bradshaw, MSc, PhD .
          Caro Watkins, BSc .
          Professor Richard J. Foss, PhD.
          Your Name One Day, BSc(InfSys), PhD .
          Mx Mic Halse .
          Madonna .
          Dr Mabizela.

This can be attempted in several ways. As always, it is useful to try to introduce non-terminals for the items of semantic interest. Here is one attempt at a solution:

    COMPILER Staff1 $CN
    /* Describe a list of academic staff in a department
       Non-LL(1), and will not work properly
       P.D. Terry, Rhodes University, 2015 */

    CHARACTERS
      uLetter  = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" .
      lLetter  = "abcdefghijklmnopqrstuvwxyz" .
      letter   = uLetter + lLetter .

    TOKENS
      name     = uLetter { letter | "'" uLetter | "-" uLetter } .
      initial  = uLetter "." .

    IGNORE CHR(0) .. CHR(31)

    PRODUCTIONS
      Staff1       = { Person } EOF .
      Person       = [ Title ] { name | initial } surname { "," Degree } SYNC "." .
      Title        = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" .
      surname      = name .
      Degree       =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
                     | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" .
    END Staff1.

Note that this allows for names like Smith and Malema and also for names like MacKay, O'Neill and Lee-Ann.

Although this correctly describes a staff list, it is useless for a simple parser, as surnames and names are lexically indistinguishable. This is an example of an LL(1) violation (about which we shall learn more shortly).

Attempting to define a better grammar affords interesting insights. It is not difficult to come up with productions that permit full names with leading initials (only) or to contain complete names only:

    PRODUCTIONS
      Staff2       = { Person } EOF .
      Person       = [ Title ] FullName { "," Degree } SYNC "."  .
      FullName     = InitialsName | name { name } .
      InitialsName = initial { initial } name .
      Title        = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" .
      Degree       =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
                     | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" .
    END Staff2.

but this is too restrictive. Another approach is to relax the distinction between initials and complete names:

    PRODUCTIONS
      Staff3       = { Person } EOF .
      Person       = [ Title ] FullName { "," Degree } SYNC "."  .
      FullName     = ( initial | name ) { initial | name } .
      Title        = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" .
      Degree       =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
    END Staff3.

but this has the unfortunate effect that it allows a name made of initials only, or a name to have an initial as its last component, as exemplified by

P. Terry D.

One might be tempted to be very rigid about punctuation, and insist that a surname incorporate a final periond (if the person has no qualifictions) or a final comma (for persons that have qualifications. This is too restrictive, but it is possible to find a much better solution. In effect, this picks up from the partial solution given earlier for the "initials only" and the "names only" restrictions. We have to ensure that the productions ensure that a full name is a sequence of initials and names terminated by a name:

    PRODUCTIONS
      Staff9       = { Person } EOF .
      Person       = [ Title ] FullName { "," Degree } SYNC "."  .
      FullName     = NameLast { NameLast } .
      NameLast     = InitialsName | name .
      InitialsName = initial { initial } name .
      Title        = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" .
      Degree       =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
                     | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" .
    END Staff9.

and it can be refactorised in various ways, for example (various other solutions are given in the source kit):

    PRODUCTIONS
      Staff11      = { Person } EOF .
      Person       = [ Title ] FullName { "," Degree } SYNC "."  .
      FullName     = NameLast { NameLast } .
      NameLast     = [ initial { initial } ] name .
      Title        = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" .
      Degree       =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
                     | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" .
    END Staff11.