Computer Science 301 - 2000


Tutorial for week 23 - Solutions - more practice in Cocol grammars

Develop a Cocol grammar that will describe the language of Roman numbers.

This is a bit long-winded, because we have to make sure that there is at least one digit in a number! It can be done in several ways; care being taken to keep it restrictive, and at the same time in LL(1) form.

   COMPILER RomanNumbers $CXN
   /* Describe list of Roman numbers
      P.D. Terry, Rhodes University, 2000 */

   IGNORE CHR(9) .. CHR(13)
   CHARACTERS
     eol = CHR(13) .
   TOKENS
     EOL = eol.


   PRODUCTIONS
     RomanNumbers = { Number EOL } .
     Number    =   Thousands [ Hundreds ] [ Tens ] [ Ones ]
                 | Hundreds [ Tens ] [ Ones ]
                 | Tens [ Ones ]
                 | Ones .
     Thousands = "M" { "M" } .
     Hundreds  = "C" [ "M" | "D" | "C" [ "C" ] ] | "D" [ "C" [ "C" [ "C" ] ] ] .
     Tens      = "X" [ "C" | "L" | "X" [ "X" ] ] | "L" [ "X" [ "X" [ "X" ] ] ] .
     Ones      = "I" [ "X" | "V" | "I" [ "I" ] ] | "V" [ "I" [ "I" [ "I" ] ] ] .

   END RomanNumbers.


Develop a Cocol specification for a program that would recognize the staff list in a Science Faculty. Assume that staff all have a title (Dr, Prof, Mr, Miss, Mrs or Ms), that they all have at least a surname (some might have names like Harley-Davidson), that some of them prefer to be known by their initials, some by their names, and some in combination, and that they possibly have degrees. Degrees are odd things. Scientists only have three possibilities: BSc, MSc and PhD. You can't get a PhD unless you have an MSc, and you can't get an MSc unless you have a BSc, and you always quote the degrees in the order you obtain them. The following would be a valid Science Faculty list of names:

          Dr Nospots Onhym, BSc, MSc, PhD.
          Mr I. Feildim-Orle.
          Miss Fitt, BSc.
          Mr P. D. Wossname, BSc, MSc.
          Prof Heile Unlykelee.
          Mrs Sheik M. Drye, BSc.
          Ms Y. Ima Raver.

Here is one attempt, but not a very happy one. There is no real syntactic difference between surnames and firstnames, so the grammar is non-LL(1):

   COMPILER StaffList $CXN
   /* Describe list of names in a Science Faculty
      This is non-LL(1) and so cannot generate a reliable parser.
      P.D. Terry, Rhodes University, 2000 */

   IGNORE CHR(9) .. CHR(13)

   CHARACTERS
     UCletter    = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" .
     LCletter    = "abcdefghijklmnopqrstuvwxyz" .
     letter      = UCletter + LCletter .

   TOKENS
     name        = UCletter { letter | "'" | "-" } letter .
     initial     = UCletter "." .

   PRODUCTIONS
     StaffList   = { StaffMember } .
     StaffMember = Title { initial | FirstName } Surname Degrees "." .
     Title       = "Dr" | "Prof" | "Mr" | "Mrs" | "Ms" | "Miss" .
     FirstName   = name .
     Surname     = name .
     Degrees     = [ "," "BSc" [ "," "MSc" [ "," "PhD" ] ] ] .
   END StaffList.

The discussion in class one year on this problem was really good. Following on a suggestion from a student, I did some more exploring. If one fudges the required form of the surname to insist that it terminate with a comma or period, and that it has other internal restrictions (so as to be able to distingush words like "MSc," from words like "Terry,") then one can get an LL(1) grammar after all!

   COMPILER StaffList $CXN
   /* Describe list of names in a Science Faculty
      This is LL(1) and so can generate a reliable parser, but it imposes nasty
      constraints on the form a surname can take
      P.D. Terry, Rhodes University, 2000 */

   IGNORE CHR(9) .. CHR(13)

   CHARACTERS
     UCletter       = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" .
     LCletter       = "abcdefghijklmnopqrstuvwxyz" .
     letter         = UCletter + LCletter .

   TOKENS
     name           = UCletter { letter | "'" | "-" } letter .
     NodegreeName   = UCletter { LCletter | "'" UCletter | "-" UCletter } LCletter "." .
     DegreeName     = UCletter { LCletter | "'" UCletter | "-" UCletter } LCletter "," .
     initial        = UCletter "." .

   PRODUCTIONS
     StaffList      = { StaffMember } .
     StaffMember    = Title { initial | FirstName } NameAndDegrees .
     Title          = "Dr" | "Prof" | "Mr" | "Mrs" | "Ms" | "Miss" .
     FirstName      = name .
     NameAndDegrees = NodegreeName | DegreeName Degrees .
     Degrees        = "BSc" [ "," "MSc" [ "," "PhD" ] ] "." .
   END StaffList.


The following Cocol grammar describes a set of EBNF productions:

  COMPILER EBNF $XCN
    CHARACTERS
      letter   = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
      digit    = "0123456789" .
      noquote  = ANY - '"' .
    IGNORE  CHR(9) .. CHR(13)
    TOKENS
      nonterminal = letter { letter | digit } .
      terminal    =  '"' noquote { noquote } '"' .
    PRODUCTIONS
      EBNF       = { Production } EOF .
      Production = nonterminal "=" Expression "." .
      Expression = Term { "|" Term } .
      Term       = Factor { Factor } .
      Factor     =   nonterminal | terminal | "[" Expression "]"
                   | "(" Expression ")" | "{" Expression "}" .
  END EBNF.

The PRODUCTIONS section here makes use of the EBNF "meta brackets" { } in the definitions of EBNF, Expression and Term. How would you write the PRODUCTIONS section without making use either of the { } or [ ] form of meta brackets?

We have to introduce further non-terminals and use recursion to produce the repetition implicitly:

    PRODUCTIONS
      EBNF            = Production MoreProductions EOF .
      MoreProductions = Production MoreProductions | (* eps *) .
      Production      = nonterminal "=" Expression "." .
      Expression      = Term RestExp .
      RestExp         = "|" Term RestExp | (* eps *) .
      Term            = Factor RestTerm .
      RestTerm        = Factor RestTerm | (* eps *) .
      Factor          =   nonterminal | terminal | "[" Expression "]"
                        | "(" Expression ")" | "{" Expression "}" .


In section 5.9.5 it is shown how simple strings can be specified in Cocol

   CHARACTERS
     noquote1 = ANY - "'" .
     noquote2 = ANY - '"' .

   TOKENS
     string =   "'" noquote1 { noquote1 } "'"
              | '"' noquote2 { noquote2 } '"' .

and in section 8.7.2 it is shown how Clang/Pascal string literals can be specified (ones in which two successive apostrophes can be used as an "escape sequence" to denote a single apostrophe):

   CHARACTERS
     cr       = CHR(13) .
     lf       = CHR(10) .
     instring = ANY - "'" - cr - lf .

   TOKENS
     string = "'" (instring | "''") { instring | "''" } "'" .

What are the CHARACTERS and TOKENS clauses that you would need to be able to specify the form of strings in C++ - strings with other escape sequences, as demonstrated by

"A bit of string with newline \n characters \t tab characters and \\ backslash characters"

This needs a little thought - we have to find some way of defining the graphic characters that can appear in a string "unescaped" and then define the nastier escape sequences separately. Because of the difficulty of introducing the backslash character, it is most easily fudged by useing the CHR(n) notation, where in this case n = 92:

  CHARACTERS
      cr         = CHR(13) .
      lf         = CHR(10) .
      back       = CHR(92) .
      letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
      digit      = "0123456789" .
      graphic    = ANY - '"' - "'" - back - cr - lf - CHR(0) .

  TOKENS
      identifier = letter { letter | digit | "_" } .
      number     = digit { digit } .
      string     = '"' { graphic | back graphic | "\'" | back back | '\"' } '"' .
      character  = "'" ( graphic | back graphic | "\'" | back back | '\"' ) "'" .