Tutorial for week 19 - more practice in Cocol grammars

Develop a Cocol specification for a program that would recognize the staff list in a Science Faculty. Assume that staff all have a title (Dr, Prof, Mr, Miss, Mrs or Ms), that they all have at least a surname (some might have names like Harley-Davidson), that some of them prefer to be known by their initials, some by their names, and some in combination, and that they possibly have degrees. Degrees are odd things. Scientists only have three possibilities: BSc, MSc and PhD. You can't get a PhD unless you have an MSc, and you can't get an MSc unless you have a BSc, and you always quote the degrees in the order you obtain them. The following would be a valid Science Faculty list of names:

          Dr Nospots Onhym, BSc, MSc, PhD.
          Mr I. Feildim-Orle.
          Miss Fitt, BSc.
          Mr P. D. Wossname, BSc, MSc.
          Prof Heile Unlykelee.
          Mrs Sheik M. Drye, BSc.
          Ms Y. Ima Raver.

Here is one attempt, but not a very happy one. There is no real syntactic difference between surnames and firstnames, so the grammar is non-LL(1):

   COMPILER StaffList $CXN
   /* Describe list of names in a Science Faculty
      This is non-LL(1) and so cannot generate a reliable parser.
      P.D. Terry, Rhodes University, 2001 */

   IGNORE CHR(9) .. CHR(13)

   CHARACTERS
     UCletter    = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" .
     LCletter    = "abcdefghijklmnopqrstuvwxyz" .
     letter      = UCletter + LCletter .

   TOKENS
     name        = UCletter { letter | "'" | "-" } letter .
     initial     = UCletter "." .

   PRODUCTIONS
     StaffList   = { StaffMember } .
     StaffMember = Title { initial | FirstName } Surname Degrees "." .
     Title       = "Dr" | "Prof" | "Mr" | "Mrs" | "Ms" | "Miss" .
     FirstName   = name .
     Surname     = name .
     Degrees     = [ "," "BSc" [ "," "MSc" [ "," "PhD" ] ] ] .
   END StaffList.

The discussion in class one year was really good. Following on a suggestion from a student, I did some more exploring. If one fudges the required form of the surname to insist that it terminate with a comma or period, and that it has other internal restrictions (so as to be able to distingush words like "MSc," from words like "Terry,") then one can get an LL(1) grammar after all!

   COMPILER StaffList $CXN
   /* Describe list of names in a Science Faculty
      This is LL(1) and so can generate a reliable parser, but it imposes nasty
      constraints on the form a surname can take
      P.D. Terry, Rhodes University, 2001 */

   IGNORE CHR(9) .. CHR(13)

   CHARACTERS
     UCletter       = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" .
     LCletter       = "abcdefghijklmnopqrstuvwxyz" .
     letter         = UCletter + LCletter .

   TOKENS
     name           = UCletter { letter | "'" | "-" } letter .
     NodegreeName   = UCletter { LCletter | "'" UCletter | "-" UCletter } LCletter "." .
     DegreeName     = UCletter { LCletter | "'" UCletter | "-" UCletter } LCletter "," .
     initial        = UCletter "." .

   PRODUCTIONS
     StaffList      = { StaffMember } .
     StaffMember    = Title { initial | FirstName } NameAndDegrees .
     Title          = "Dr" | "Prof" | "Mr" | "Mrs" | "Ms" | "Miss" .
     FirstName      = name .
     NameAndDegrees = NodegreeName | DegreeName Degrees .
     Degrees        = "BSc" [ "," "MSc" [ "," "PhD" ] ] "." .
   END StaffList.


The following Cocol grammar describes a set of EBNF productions:

  COMPILER EBNF $XCN
    CHARACTERS
      letter   = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
      digit    = "0123456789" .
      noquote  = ANY - '"' .
    IGNORE  CHR(9) .. CHR(13)
    TOKENS
      nonterminal = letter { letter | digit } .
      terminal    =  '"' noquote { noquote } '"' .
    PRODUCTIONS
      EBNF       = { Production } EOF .
      Production = nonterminal "=" Expression "." .
      Expression = Term { "|" Term } .
      Term       = Factor { Factor } .
      Factor     =   nonterminal | terminal | "[" Expression "]"
                   | "(" Expression ")" | "{" Expression "}" .
  END EBNF.

The PRODUCTIONS section here makes use of the EBNF "meta brackets" { } in the definitions of EBNF, Expression and Term. How would you write the PRODUCTIONS section without making use either of the { } or [ ] form of meta brackets?

We have to introduce further non-terminals and use recursion to produce the repetition implicitly:

    PRODUCTIONS
      EBNF            = Production MoreProductions EOF .
      MoreProductions = Production MoreProductions | (* eps *) .
      Production      = nonterminal "=" Expression "." .
      Expression      = Term RestExp .
      RestExp         = "|" Term RestExp | (* eps *) .
      Term            = Factor RestTerm .
      RestTerm        = Factor RestTerm | (* eps *) .
      Factor          =   nonterminal | terminal | "[" Expression "]"
                        | "(" Expression ")" | "{" Expression "}" .


In section 5.9.5 it is shown how simple strings can be specified in Cocol

   CHARACTERS
     noquote1 = ANY - "'" - CHR(0) .
     noquote2 = ANY - '"' - CHR(0).

   TOKENS
     string =   "'" noquote1 { noquote1 } "'"
              | '"' noquote2 { noquote2 } '"' .

and in section 8.7.2 it is shown how Clang/Pascal string literals can be specified (ones in which two successive apostrophes can be used as an "escape sequence" to denote a single apostrophe):

   CHARACTERS
     cr       = CHR(13) .
     lf       = CHR(10) .
     instring = ANY - "'" - cr - lf - CHR(0).

   TOKENS
     string = "'" (instring | "''") { instring | "''" } "'" .

What are the CHARACTERS and TOKENS clauses that you would need to be able to specify the form of strings in C++ - strings with other escape sequences, as demonstrated by

"A bit of string with newline \n characters \t tab characters and \\ backslash characters"

This needs a little thought - we have to find some way of defining the graphic characters that can appear in a string "unescaped" and then define the nastier escape sequences separately:

I apologise - the answer posted for some weeks was incorrect. Here are some better ideas

  CHARACTERS
    letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
    cr         = CHR(13) .
    lf         = CHR(10) .
    back       = CHR(92) .
    digit      = "0123456789" .
    graphic    = ANY - '"' - "'" - back - cr - lf - CHR(0) .
    quote      = CHR(39) .
    dquote     = CHR(34) .

    cntl       = CHR(0) .. CHR(31).
    noQuote1   = ANY - '"' - cntl - back .
    noQuote2   = ANY - "'" - cntl - back .
    graphic1   = ANY - cntl .
  TOKENS
    identifier = letter { letter | digit | "_" } .
    number     = digit { digit } .

/* The bad answer posted previously:

    string     = '"' { graphic | back graphic | "\'" | back back | '\"' } '"' .
    character  = "'" ( graphic | back graphic | "\'" | back back | '\"' ) "'" .

*/

/* That idea should be expressed this way

    string     = '"' { graphic | back graphic | "\\\'" | back back | '\\\"' } '"' .
    character  = "'" ( graphic | back graphic | "\\\'" | back back | '\\\"' ) "'" .

*/

/* this one works fine too and is easier to understand */

    string     = '"' { graphic | back graphic | back quote | back back | back dquote } '"' .
    character  = "'" ( graphic | back graphic | back quote | back back | back dquote ) "'" .

/* */

/* this next alternative is a bit looser and also seems to work

  string       = '"' { noQuote1 | back graphic1 } '"' .
  character    = "'" { noQuote2 | back graphic1 } "'" .

*/


Consider the by now rather hackneyed grammar for expressions which has been written out again below. How would you add attributes so that a parser, after recognising an Expression, would be able to determine whether that Expression was a "constant expression" (meaning that in principle it could be evaluated at compile time) or was a "variable expression" (meaning that it could not be evaluated until run time)? Assume that the Designator production can be attributed so that it returns the result of searching the symbol table to determine whether the associated identifier denotes a variable or a constant.

  Expression<bool &isConst>
  =                                   (. bool alsoConst; .)
      (   "+" Term<isConst>
        | "-" Term<isConst>
        | Term<isConst>
      ) { AddOp
      Term<alsoConst>                 (. isConst = isConst && alsoConst; .)
      }
      .

  Term<bool &isConst>
  =                                   (. bool alsoConst; .)
      Factor<isConst>
      { MulOp
      Factor<alsoConst>               (. isConst = isConst && alsoConst; .)
      }
  .

  Factor<bool &isConst>
  =     Designator<isConst>
      | number                        (. isConst = true; .)
      | "(" Expression<isConst> ")"
  .