Tutorial for week 21

Computer Science 301 - 2012

Tutorial for week 21 - Regular Expressions and Cocol Grammars

(a) Develop a regular expression and/or a Cocol token definition that describes identifiers of the sort used in many programming languages, subject to the restriction that underscores may not appear at the start or end of an identifier. but may appear in the interior of an identifier.

In common RE notation we might write

      RE1  =  letter     = [A-Za-z]
      RE2  =  digit      = [0-9]
      RE3  =  identifier = letter ( letter | digit | underscore+ ( letter | digit ) )*

In Cocol notation we would write something that looks much the same but note that the CHARACTERS section defines sets of characters, not strings!

    CHARACTERS
      letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
      digit  = "0123456789" .

    TOKENS
      identifier = letter { letter | digit | "_" { "_" } ( letter | digit ) } .

It might be thought that we could write the Cocol definition more simply, as

    TOKENS
      identifier = letter { letter | digit | "_" ( letter | digit ) } .

but this is not quite correct. Can you see why not? What would the equivalent be in common RE notation?

(b) Develop a regular expression and/or a Cocol description that describes comments of the form allowed in the C family of languages.

    CHARACTERS
      inComment = ANY - "*" - "/" .
    TOKENS
      Comment = "/*" { inComment } "*/" .

However, the regular expression forming part of this directive has some fairly obvious shortcomings. The trick is that * and / characters can occur "internally", as can /*, but of course the combination */ cannot occur internally.

The problem as posed hinted that the following were some examples of strings that might be valid comments, and invited the reader to spot the odd one out.

     (1)   /* comment */
     (2)   /* multiply a * b */
     (3)   /* divide a / b */
     (4)   /* ***** */
     (5)   /*********/
     (6)   /* ** /* // */
     (7)   /* // */ ** */
     (8)   /*a*b*/
     (9)   /* **a/b* ** */

It is, of course, (7) which is better described as a comment /* // */ followed by a bit of garbage ** */.

Finding a correct solution is quite tricky, though like so many classic exercises, once you see the solution it become "obvious". Here is a deceptively appealing attempt:

    CHARACTERS
      inComment = ANY - "*" - "/" .
    TOKENS
      Comment = "/*" { inComment | "/" | "*" inComment } { "*" } "*/" .

Casual testing will suggest that the problem is now solved. However, this more complicated regular expression will still not handle some of the valid comments given above. In particular it will still not recognize 4, 6 , 7, 9. A better definition might take some time to think through and verify:

    CHARACTERS
      inComment = ANY - "*" - "/" .
    TOKENS
      Comment = "/*" { inComment | "/" | "*" { "*" } inComment } { "*" } "*/" .

It is left as an exercise to write this in common RE form - try it, and it may convince you how much clearer the Cocol notation is!

      Goal       = Expression .
      Expression = Term | Expression "-" Term .
      Term       = Factor | Term "*" Factor .
      Factor     = "a" | "b" | "c" | "d" .

Show the parse tree corresponding to the string "a - b - c"

                            Goal
                              |
                          Expression
                              |
                  -------------------
                  |           |     |
             Expression       |   Term
                  |           |     |
          ----------------    |     |
          |       |      |    |     |
       Expression |    Term   |   Factor
          |       |      |    |     |
        Term      |    Factor |     |
          |       |      |    |     |
       Factor     |      |    |     |
          |       |      |    |     |
          a       -      b    -     c

(d) Another grammar is

      Goal       = Expression .
      Expression = Term | Term "-" Expression .
      Term       = Factor | Factor "*" Term .
      Factor     = "a" | "b" | "c" | "d" .

Show the parse tree corresponding to the string "a - b - c" and discuss the differences between this grammar and the last.

                            Goal
                              |
                          Expression
                              |
                  ------------------------------
                  |           |                |
                Term          |           Expression
                  |           |                |
                  |           |        ----------------
                  |           |        |       |      |
                Factor        |     Expression |    Term
                  |           |        |       |      |
                  |           |      Term      |    Factor
                  |           |        |       |      |
                  |           |     Factor     |      |
                  |           |        |       |      |
                  a           -        b       -      c

The two grammars are "equivalent" in the sense that any string that can be derived from the goal of one can be derived from the goal of the other. If and when we start applying "meaning" by applying some tree walking algorithm from the nodes to the root to get the meaning of a complete expression we would find that there are differences. The left-recursive grammar (the first one) provides "left to right" associativity - so that, for example "a-b-c" "means" (a-b)-c, whereas the right recursive grammar provides "right to left" associativity - so that "a-b-c" would "mean" a-(b-c), which would be at variance with the usual mathematical convention.

(e) Yet another grammar for describing expressions might be:

      Goal       = Expression .
      Expression = Term | Expression "*" Term .
      Term       = Factor | Term "-" Factor .
      Factor     = "a" | "b" | "c" | "d" .

Show the parse tree corresponding to the string "a - b * c"

                            Goal
                              |
                          Expression
                              |
                  -------------------
                  |           |     |
             Expression       |   Term
                  |           |     |
                 Term         |     |
          ----------------    |     |
          |       |      |    |     |
        Term      |    Factor |  Factor
          |       |      |    |     |
       Factor     |      |    |     |
          |       |      |    |     |
          a       -      b    *     c

Once again we have an equivalent grammar, in the sense that all three grammars can derive the same set of strings. But in this case it should be readily seen that the "precedence" of the - operator is now greater than the precedence of the * operator, which, again, would conflict with mathematical convention.

Even simpler equivalent grammars - but with all operators of equal precedence - are

      Goal       = Expression .
      Expression = Expression "*" Term | Expression "-" Term | Term
      Term       = "a" | "b" | "c" | "d" .

or even

      Goal       = Expression .
      Expression = Term "*" Expression | Term "-" Expression | Term
      Term       = "a" | "b" | "c" | "d" .

(you might like to wonder whether these have any semantic differences in the form of the trees they produce).

(f) The contents of a very useful textbook (if you can find a copy) begins as follows:

    All you need to know to be able to pass your compiler examination.

                                by

                             Pat Terry.

    Chapter 1  Bribery is unlikely to succeed.

    Chapter 2  Understand the phases of compilation.
        2.1    Lexical and syntactic analysis are easily confused
        2.2    Constraint analysis involves the concept of type
        2.3    Code generation for the PVM is a breeze

    Chapter 3  Get clued up on grammars.
        3.1    Terminals
        3.2    Sentences and sentential forms
        3.3    Regular Expressions
        3.4    Productions
        3.5    EBNF and Cocol
        3.6    Ambiguity is bad news

and so on. Develop a Cocol grammar that describes the contents. Assume that the words in the contents are all fairly simple, but notice that full stops (periods) appear in significant places. Since this is a long course, you should be able to handle more than three chapters, and some of might them have many sub- sections.

Here is the sort of solution one might suggest for a grammar to describe the contents pages of a book. We can recognise phrase structure concepts like Title, Author, Chapter heading, Subsection headings and so on, so introduce non-terminals for these into the PRODUCTIONS section. We can also recognise lexical concepts like word and number - in fact, two kinds of numbers - so we introduce these into the TOKENS section based on pretty obvious CHARACTER sets. Finally, there are some candidates for "key words" - terminals like "Chapter" and "by". Of course, these last words might appear in the "wrong" places quite legitimately.

3.1 Revisit the key concepts in Chapter 2.4 one by one

but those are the sort of complications that are not in the "spirit" of this silly example, and one does not expect beginners to worry too much about them - please accept that some of these exercises are a bit artificial!

    COMPILER Contents1 $NC
    /* Describe the contents pages of a book
       P.D. Terry, Rhodes University, 2012 */

    CHARACTERS
      uLetter = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" .
      lLetter = "abcdefghijklmnopqrstuvwxyz" .
      letter  = uLetter + lLetter .
      digit   = "0123456780" .

    TOKENS
      word    = letter { letter } .  /* any letter will do, but there must be at least one */
      number  = digit { digit } .    /* for numbering chapters */
      section = digit { digit } "." digit { digit } . /* for subsections */

    IGNORE CHR(0) .. CHR(31)


    PRODUCTIONS
      Contents1  = Title "by" Author { Chapter } EOF .
      Title      = word { word } "." .
      Author     = word { word } "." .
      Chapter    = "Chapter" number Title { Subsection } .
      Subsection = section word { word } . /* notice no final "." */
    END Contents1.

There are, obviously, other ways of doing this. For example, since the combination word { word } comes up so often, one might use

    PRODUCTIONS
      Contents2  = Sentence "by" Sentence { Chapter } EOF .
      Sentence   = Words "." .
      Words      = word { word } .
      Chapter    = "Chapter" number Sentence { Subsection } .
      Subsection = section Words . /* notice no final "." */
    END Contents2.

    PRODUCTIONS
      Contents3  = Words "." "by" Words "." { Chapter } EOF .
      Chapter    = "Chapter" number Words "." { section Words } .
      Words      = word { word } .
    END Contents3.

One might want to insist that the first word in a title or section heading had to start with an uppercase letter. This gets a bit messier, but something like this might come to mind:

    COMPILER Contents4 $NC
    /* Describe the contents pages of a book
       P.D. Terry, Rhodes University, 2012 */

    CHARACTERS
      uLetter = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" .
      lLetter = "abcdefghijklmnopqrstuvwxyz" .
      letter  = uLetter + lLetter .
      digit   = "0123456780" .

    TOKENS
      uWord   = uLetter { letter } .  /* must start with an upper case letter */
      lWord   = lLetter { letter } .  /* must start with a lower case letter */
      number  = digit { digit } .     /* for numbering chapters */
      section = digit { digit } "." digit { digit } . /* for subsections */

    IGNORE CHR(0) .. CHR(31)

    PRODUCTIONS
      Contents4  = Sentence "by" Sentence { Chapter } EOF .
      Sentence   = Words "." .
      Words      = uWord { uWord | lWord } .
      Chapter    = "Chapter" number Sentence { Subsection } .
      Subsection = section Words . /* notice no final "." */
    END Contents4.

Two frequent mistakes that beginners make at this stage of their development are (a) to use non-terminals where tokens would be better:

          Subsection = number "." number Words .  /* don't do this  - 3.4 is one token! */

and (b) to add punctuation like full stops into the token definition for words or numbers in the wrong way

    TOKENS   /* if necessary or useful */
      word    = letter { letter | "."  } .  /* would allow hello.world as one word */
      number  = digit { digit | "." } .     /* would allow 34...56 as one number */