Tutorial for week 4

Computer Science 301 - Translators

Tutorial 4 - Regular Expressions and Cocol Grammars - Solutions

(a) Develop a regular expression and/or a Cocol token definition that describes identifiers of the sort used in many programming languages, subject to the restriction that underscores may not appear at the start or end of an identifier. but may appear in the interior of an identifier.

In common RE notation we might write

      RE1  =  letter     = [A-Za-z]
      RE2  =  digit      = [0-9]
      RE3  =  identifier = letter ( letter | digit | underscore+ ( letter | digit ) )*

In Cocol notation we would write something that looks much the same but note that the CHARACTERS section defines sets of characters, not strings!

    CHARACTERS
      letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
      digit  = "0123456789" .

    TOKENS
      identifier = letter { letter | digit | "_" { "_" } ( letter | digit ) } .

It might be thought that we could write the Cocol definition more simply, as

    TOKENS
      identifier = letter { letter | digit | "_" ( letter | digit ) } .

but this is not quite correct. Can you see why not? What would the equivalent be in common RE notation?

(b) Develop a regular expression and/or a Cocol description that describes comments of the form allowed in the C family of languages.

    CHARACTERS
      inComment = ANY - "*" - "/" .
    TOKENS
      Comment = "/*" { inComment } "*/" .

However, the regular expression forming part of this directive has some fairly obvious shortcomings. The trick is that * and / characters can occur "internally", as can /*, but of course the combination */ cannot occur internally.

The problem as posed hinted that the following were some examples of strings that might be valid comments, and invited the reader to spot the odd one out.

     (1)   /* comment */
     (2)   /* multiply a * b */
     (3)   /* divide a / b */
     (4)   /* ***** */
     (5)   /*********/
     (6)   /* ** /* // */
     (7)   /* // */ ** */
     (8)   /*a*b*/
     (9)   /* **a/b* ** */

It is, of course, (7) which is better described as a comment /* // */ followed by a bit of garbage ** */.

Finding a correct solution is quite tricky, though like so many classic exercises, once you see the solution it become "obvious". Here is a deceptively appealing attempt:

    CHARACTERS
      inComment = ANY - "*" - "/" .
    TOKENS
      Comment = "/*" { inComment | "/" | "*" inComment } { "*" } "*/" .

Casual testing will suggest that the problem is now solved. However, this more complicated regular expression will still not handle some of the valid comments given above. In particular it will still not recognize 4, 6 , 7, 9. A better definition might take some time to think through and verify:

    CHARACTERS
      inComment = ANY - "*" - "/" .
    TOKENS
      Comment = "/*" { inComment | "/" | "*" { "*" } inComment } { "*" } "*/" .

It is left as an exercise to write this in common RE form - try it, and it may convince you how much clearer the Cocol notation is!

(c) Develop a Cocol grammar to describe a list of names of some well loved (well, well respected) folk who have titles, first names and/or initials, surnames, and (usually) qualifications - for example

          Professor P. D. Terry, MSc, PhD .
          George Clifford Wells, BSc(Hons), MSc, PhD .
          C. Hubert H. Parry, BMus.
          Greg G. Foster, PhD .
          Dave A. Sewry, PhD .
          Dr Karen Lee Bradshaw, MSc, PhD .
          Caro Watkins, BSc .
          Professor Richard J. Foss, PhD.
          Your Name One Day, BSc(InfSys), PhD .
          Mx Mic Halse .
          Madonna .
          Dr Mabizela.

This can be attempted in several ways. As always, it is useful to try to introduce non-terminals for the items of semantic interest. Here is one attempt at a solution:

    COMPILER Staff1 $CN
    /* Describe a list of academic staff in a department
       Non-LL(1), and will not work properly
       P.D. Terry, Rhodes University, 2015 */

    CHARACTERS
      uLetter  = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" .
      lLetter  = "abcdefghijklmnopqrstuvwxyz" .
      letter   = uLetter + lLetter .

    TOKENS
      name     = uLetter { letter | "'" uLetter | "-" uLetter } .
      initial  = uLetter "." .

    IGNORE CHR(0) .. CHR(31)

    PRODUCTIONS
      Staff1       = { Person } EOF .
      Person       = [ Title ] { name | initial } surname { "," Degree } SYNC "." .
      Title        = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" .
      surname      = name .
      Degree       =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
                     | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" .
    END Staff1.

Note that this allows for names like Smith and Malema and also for names like MacKay, O'Neill and Lee-Ann.

Although this correctly describes a staff list, it is useless for a simple parser, as surnames and names are lexically indistinguishable. This is an example of an LL(1) violation (about which we shall learn more shortly).

Attempting to define a better grammar affords interesting insights. It is not difficult to come up with productions that permit full names with leading initials (only) or to contain complete names only:

    PRODUCTIONS
      Staff2       = { Person } EOF .
      Person       = [ Title ] FullName { "," Degree } SYNC "."  .
      FullName     = InitialsName | name { name } .
      InitialsName = initial { initial } name .
      Title        = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" .
      Degree       =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
                     | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" .
    END Staff2.

but this is too restrictive. Another approach is to relax the distinction between initials and complete names:

    PRODUCTIONS
      Staff3       = { Person } EOF .
      Person       = [ Title ] FullName { "," Degree } SYNC "."  .
      FullName     = ( initial | name ) { initial | name } .
      Title        = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" .
      Degree       =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
    END Staff3.

but this has the unfortunate effect that it allows a name made of initials only, or a name to have an initial as its last component, as exemplified by

P. Terry D.

One might be tempted to be very rigid about punctuation, and insist that a surname incorporate a final periond (if the person has no qualifictions) or a final comma (for persons that have qualifications. This is too restrictive, but it is possible to find a much better solution. In effect, this picks up from the partial solution given earlier for the "initials only" and the "names only" restrictions. We have to ensure that the productions ensure that a full name is a sequence of initials and names terminated by a name:

    PRODUCTIONS
      Staff9       = { Person } EOF .
      Person       = [ Title ] FullName { "," Degree } SYNC "."  .
      FullName     = NameLast { NameLast } .
      NameLast     = InitialsName | name .
      InitialsName = initial { initial } name .
      Title        = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" .
      Degree       =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
                     | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" .
    END Staff9.

and it can be refactorised in various ways, for example (various other solutions are given in the source kit):

    PRODUCTIONS
      Staff11      = { Person } EOF .
      Person       = [ Title ] FullName { "," Degree } SYNC "."  .
      FullName     = NameLast { NameLast } .
      NameLast     = [ initial { initial } ] name .
      Title        = "Professor" | "Dr" | "Mr" | "Mrs" | "Ms" | "Miss" .
      Degree       =   "BA" | "BSc" | "BCom" | "BMus" | "BEd" | "BSocSci"
                     | "BSc(Hons)" | "BCom(Hons)" | "MA" | "MSc" | "PhD" .
    END Staff11.

(d) The contents of a very useful textbook (if you can find a copy) begins as follows:

    All you need to know to be able to pass your compiler examination.

                                by

                             Pat Terry.

    Chapter 1  Bribery is unlikely to succeed.

    Chapter 2  Understand the phases of compilation.
        2.1    Lexical and syntactic analysis are easily confused
        2.2    Constraint analysis involves the concept of type
        2.3    Code generation for the PVM is a breeze

    Chapter 3  Get clued up on grammars.
        3.1    Terminals
        3.2    Sentences and sentential forms
        3.3    Regular Expressions
        3.4    Productions
        3.5    EBNF and Cocol
        3.6    Ambiguity is bad news

and so on. Develop a Cocol grammar that describes the contents. Assume that the words in the contents are all fairly simple, but notice that full stops (periods) appear in significant places. Since this is a long course, you should be able to handle more than three chapters, and some of might them have many sub- sections.

Here is the sort of solution one might suggest for a grammar to describe the contents pages of a book. We can recognise phrase structure concepts like Title, Author, Chapter heading, Subsection headings and so on, so introduce non-terminals for these into the PRODUCTIONS section. We can also recognise lexical concepts like word and number - in fact, two kinds of numbers - so we introduce these into the TOKENS section based on pretty obvious CHARACTER sets. Finally, there are some candidates for "key words" - terminals like "Chapter" and "by". Of course, these last words might appear in the "wrong" places quite legitimately.

3.1 Revisit the key concepts in Chapter 2.4 one by one

but those are the sort of complications that are not in the "spirit" of this silly example, and one does not expect beginners to worry too much about them - please accept that some of these exercises are a bit artificial!

    COMPILER Contents1 $NC
    /* Describe the contents pages of a book
       P.D. Terry, Rhodes University, 2015 */

    CHARACTERS
      uLetter = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" .
      lLetter = "abcdefghijklmnopqrstuvwxyz" .
      letter  = uLetter + lLetter .
      digit   = "0123456780" .

    TOKENS
      word    = letter { letter } .  /* any letter will do, but there must be at least one */
      number  = digit { digit } .    /* for numbering chapters */
      section = digit { digit } "." digit { digit } . /* for subsections */

    IGNORE CHR(0) .. CHR(31)


    PRODUCTIONS
      Contents1  = Title "by" Author { Chapter } EOF .
      Title      = word { word } "." .
      Author     = word { word } "." .
      Chapter    = "Chapter" number Title { Subsection } .
      Subsection = section word { word } . /* notice no final "." */
    END Contents1.

There are, obviously, other ways of doing this. For example, since the combination word { word } comes up so often, one might use

    PRODUCTIONS
      Contents2  = Sentence "by" Sentence { Chapter } EOF .
      Sentence   = Words "." .
      Words      = word { word } .
      Chapter    = "Chapter" number Sentence { Subsection } .
      Subsection = section Words . /* notice no final "." */
    END Contents2.

    PRODUCTIONS
      Contents3  = Words "." "by" Words "." { Chapter } EOF .
      Chapter    = "Chapter" number Words "." { section Words } .
      Words      = word { word } .
    END Contents3.

One might want to insist that the first word in a title or section heading had to start with an uppercase letter. This gets a bit messier, but something like this might come to mind:

    COMPILER Contents4 $NC
    /* Describe the contents pages of a book
       P.D. Terry, Rhodes University, 2015 */

    CHARACTERS
      uLetter = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" .
      lLetter = "abcdefghijklmnopqrstuvwxyz" .
      letter  = uLetter + lLetter .
      digit   = "0123456780" .

    TOKENS
      uWord   = uLetter { letter } .  /* must start with an upper case letter */
      lWord   = lLetter { letter } .  /* must start with a lower case letter */
      number  = digit { digit } .     /* for numbering chapters */
      section = digit { digit } "." digit { digit } . /* for subsections */

    IGNORE CHR(0) .. CHR(31)

    PRODUCTIONS
      Contents4  = Sentence "by" Sentence { Chapter } EOF .
      Sentence   = Words "." .
      Words      = uWord { uWord | lWord } .
      Chapter    = "Chapter" number Sentence { Subsection } .
      Subsection = section Words . /* notice no final "." */
    END Contents4.

Two frequent mistakes that beginners make at this stage of their development are (a) to use non-terminals where tokens would be better:

          Subsection = number "." number Words .  /* don't do this  - 3.4 is one token! */

and (b) to add punctuation like full stops into the token definition for words or numbers in the wrong way

    TOKENS   /* if necessary or useful */
      word    = letter { letter | "."  } .  /* would allow hello.word as one word */
      number  = digit { digit | "." } .     /* would allow 34...56 as one number */