Complete sources to these solutions can be found on the course WWW pages in the files PRAC21A.ZIP or PRAC21AC.ZIP
In the source kit you were given Calc.atg. This is essentially the calculator grammar on page 62 of the textbook, and you were invited to extend it to allow for parentheses, leading unary + or -operators, an abs() funcion, a factorial capability, numbers with decimal points and so on.
Extending the calculator grammar can be done in several ways. Here is one of them, which corresponds to the approach taken to expressions in languages like Pascal, which do not allow two signs to appear together:
COMPILER Calc1 $CN /* Simple four function calculator (extended) P.D. Terry, Rhodes University, 2011 */ CHARACTERS digit = "0123456789" . hexdigit = digit + "ABCDEF" . TOKENS decNumber = digit { digit } [ "." { digit } ] | "." digit { digit } . hexNumber = "$" hexdigit { hexdigit } . IGNORE CHR(0) .. CHR(31) PRODUCTIONS Calc1 = { Expression "=" } EOF . Expression = [ "+" | "-" ] Term { "+" Term | "-" Term } . Term = Factor { "*" Factor | "/" Factor } . Factor = Primary { "!" } . Primary = decNumber | hexNumber | "(" Expression ")" | "abs" "(" Expression ")" . END Calc1.
Another approach, similar to that taken in C++, is as follows:
PRODUCTIONS Calc2 = { Expression "=" } EOF . Expression = Term { "+" Term | "-" Term } . Term = Factor { "*" Factor | "/" Factor } . Factor = ( "+" | "-") Factor | Primary { "!" } . Primary = decNumber | hexNumber | "(" Expression ")" | "abs" "(" Expression ")" . END Calc2.
This allows for expressions like 3 + - 7 or even 3 * -4 or even 3 / + - 4. Because of the way the grammar is written, the last of these is equivalent to 3 / ( + ( - (4))).
Here are some other attempts. What, if any, differences are there between these and the other solutions presented so far?
PRODUCTIONS Calc3 = { Expression "=" } EOF . Expression = ["+" | "-" ] Term { "+" Term | "-" Term } . Term = Factor { "*" Factor | "/" Factor } . Factor = Primary { "!" } | "abs" "(" Expression ")" . Primary = decNumber | hexNumber | "(" Expression ")" . END Calc3. PRODUCTIONS Calc4 = { Expression "=" } EOF . Expression = Term { "+" Term | "-" Term } . Term = Factor { "*" Factor | "/" Factor } . Factor = ( "+" | "-" ) ( Factor | Primary ) | "abs" "(" Expression ")" ) . Primary = decNumber | hexNumber | "(" Expression ")" { "!" } . END Calc4.
It may be tempting to suggest a production like this
Primary = ( decNumber | hexNumber | "(" Expression ")" | "abs(" Expression ")" ) .
However, a terminal like "abs(" is restrictive. It is invariably better to allow white space to appear between method names and parameter lists if the user prefers this style.
Several submissions tried to define a number token to incorporate an optional sign. While this is used as an illustration in The Book, it is not the best way of doing it when one is trying to describe free-format expressions, where one might like to separate leading + and - signs from the numbers that might follow them. Furthermore, describing numbers in the PPRODUCTIONS section, for example as
Number = decNumber [ "." decNumber ] .
is not a good idea either. Numbers with points in them are nearly always written as contiguous characters, hence 3.45 and not 3 . 45. Remember that in Cocol spaces may be insered between tokens (but very rarely within tokens, save when these are bracketed by unique delimiters such as quotes).
This was meant to be relatively straightforward and should not have caused too many difficulties. A criticism of several submissions was that they were too restrictive. Here is one solution in the spirit of the exercise:
COMPILER Family1 $CN /* Describe a family P.D. Terry, Rhodes University, 2011 */ CHARACTERS control = CHR(0) .. CHR(31) . uLetter = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" . lLetter = "abcdefghijklmnopqrstuvwxyz" . digit = "0123456789" . TOKENS name = uLetter { lLetter | "'" uLetter | "-" uLetter } . number = digit { digit } . IGNORE control PRODUCTIONS Family1 = { Section SYNC } EOF . Section = Surname | Parents | Grandparents | Children | Grandchildren | Possession . Surname = "Surname" ":" name { name } . Parents = "Parents" ":" PersonList . Grandparents = "Grandparents" ":" PersonList . Children = "Children" ":" PersonList . Grandchildren = "Grandchildren" ":" PersonList . PersonList = OnePerson { "," OnePerson } . OnePerson = name { name } [ "(" "deceased" ")" ] { Description } [ Spouse ] . Spouse = "=" name { name } . Description = "[" Relative "of" OnePerson "]" . Relative = "son" | "daughter" | "mother" | "father" | "wife" | "husband" | "partner" | "mistress" . Possession = number [ "small" | "large" ] ( "cat" | "cats" | "dog" | "dogs" | "bagpipe" | "bagpipes" | "house" | "houses" | "car" | "cars" ) . END Family1.
That solution does not insist that the surname should be part of all descriptions. Here is an alternative PRODUCTIONS set that does just that, and also factorizes the grammar slightly differently:
PRODUCTIONS Family2 = { Generation SYNC } Surname SYNC { Generation SYNC } { Possession } EOF . Surname = "Surname" ":" name { name } . Generation = ( "Parents" | "Grandparents" | "Children" | "Grandchildren" ) ":" PersonList . PersonList = OnePerson { "," OnePerson } . OnePerson = name { name } [ "(" "deceased" ")" ] { Description } [ Spouse ] . Spouse = "=" name { name } . Description = "[" Relative "of" OnePerson "]" . Relative = "son" | "daughter" | "mother" | "father" | "wife" | "husband" | "partner" | "mistress" . Possession = number [ "small" | "large" ] ( "cat" | "cats" | "dog" | "dogs" | "bagpipe" | "bagpipes" | "house" | "houses" | "car" | "cars" ) . END Family2.
Four points are worth making (a) the Surname section should not have allowed the possibility of listing the name as deceased (b) it is better to use a construct like "(" "deceased" ")" than "(deceased)" as a single terminal (c) relationships are best between OnePerson and another OnePerson, and not simply between OnePerson and some names (d) there is no need to make line feeds significant in this example - although no harm is done if you do, and they certainly make the text easier for a human reader to decode.
Note how we have defined "cat" and "cats" as keywords. We might alternatively have introduced a token
item = lLetter { lLetter } .
and changed the production
Possession = number [ "small" | "large" ] item .
The exercise called for you to develop a Cocol grammar that describes the words of a song and the notes sung to those words, expressed in "Tonic Solfa".
This toy problem is straightforward, but note the way in which an lf singleton character set is introduced from which the single character EOL token is defined - this is a rather unusual case (in most languages end-of-line is insignificant). Note also that a line of words might also contain some tonic solfa key words as ordinary words - for example "so" and "me". Note how the token word has been defined - multiple - and ' characters are allowed, but at most one trailing punctuation mark. We probably would not want to cater for sequences like Tom!!,Dick,Harry as making up one word.
It is preferable to use CHR(10) = lf as the line mark and to ignore CHR(13) = cr. Then the system will work equally well on Windows and on Linux systems. On a Mac, just to be perverse, they choose to use a single cr character to mark line breaks. You might like to decide how one could define a line feed token that could suffice on all three operating systems.
COMPILER Solfa $CN /* Describe the words and notes of a tune using tonic solfa P.D. Terry, Rhodes University, 2011 */ CHARACTERS lf = CHR(10) . control = CHR(0) .. CHR(31) . letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" . TOKENS word = letter { letter | "'" | "-" letter } [ "." | "," | "!" | "?" ] . EOL = lf . IGNORE control - lf PRODUCTIONS Solfa = { Line } EOF . Line = Words EOL Tune EOL EOL { EOL } . Words = ( word | Note ) { word | Note } . Tune = Note { Note } . Note = "do" | "re" | "me" | "fa" | "so" | "la" | "te" . END Solfa.
There are, in fact, other note names in Tonic Solfa, which can be handled in the same way.
These "words" were pretty minimal. One might have wanted to have lines in a song like
"Ha ha" said the clown
(you are too young to remember that one, I bet - about 1967). To handle this one might describe a word as having an alternative that could be in the form of a string (similar to strings in Parva). Fill in the details for yourselves.
The Parva extensions produced some interesting submissions. Many of them (understandably!) were too restrictive in certain respects, while others were too permissive. Admittedly there is a thin line between what might be "nice to have" and what might be "sensible to have" or "easy to compile". Here is a heavily commented suggested solution:
COMPILER Parva1 $CN /* Parva level 1.5 grammar (Extended) This version uses C/Java/C#-like precedences for operators P.D. Terry, Rhodes University, 2011 */ CHARACTERS lf = CHR(10) . backslash = CHR(92) . control = CHR(0) .. CHR(31) . letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" . digit = "0123456789" . nonZeroDigit = "123456789" . binDigit = "01" . hexDigit = digit + "abcdefABCDEF" . stringCh = ANY - '"' - control - backslash . charCh = ANY - "'" - control - backslash . printable = ANY - control . TOKENS /* Insisting that identifiers cannot end with an underscore is quite easy */ identifier = letter { letter | digit | "_" { "_" } ( letter | digit ) } . /* but a simpler version is what many people think of identifier = letter { letter | digit | "_" ( letter | digit ) } . Technically this is not quite what was asked. The restriction is really that an identifier cannot end with an underscore. Identifiers like Pat_____Terry are allowed: */ /* Allowing (and restricting) numbers to be of the various forms suggested is easy enough */ number = "0" | nonZeroDigit { digit } | digit { hexDigit } 'H' | binDigit { binDigit } '%' . /* But be careful. There is a temptation to define something like digit = "123456789" . number = "0" | digit { digit | "0" } . and then forget that identifier = letter { letter | digit | "_" } . would not allow identifiers to have 0 in them */ stringLit = '"' { stringCh | backslash printable } '"' . charLit = "'" ( charCh | backslash printable ) "'" . COMMENTS FROM "//" TO lf COMMENTS FROM "/*" TO "*/" IGNORE control PRODUCTIONS Parva1 = "void" identifier "(" ")" Block . Block = "{" { Statement } "}" . /* The options in Statement are easily extended to handle the new forms */ Statement = ( Block | ConstDeclarations | VarDeclarations | AssignmentStatement | IfStatement | WhileStatement | ReturnStatement | HaltStatement | ReadStatement | WriteStatement | ReadLineStatement | WriteLineStatement | ForStatement | BreakStatement | ContinueStatement | DoWhileStatement | SwitchStatement | ";" ) . /* Declarations remain the same as before */ ConstDeclarations = "const" OneConst { "," OneConst } ";" . OneConst = identifier "=" Constant . Constant = number | charLit | "true" | "false" | "null" . VarDeclarations = Type OneVar { "," OneVar } ";" . OneVar = identifier [ "=" Expression ] . /* AssignmentStatements require care to avoid LL(1) problems */ AssignmentStatement = ( Designator ( AssignOp Expression | "++" | "--" ) | "++" Designator | "--" Designator ) ";" . /* In all these it is useful to maintain generality by using Designator, not identifier */ Designator = identifier [ "[" Expression "]" ] . /* The if-then-else construction is most easily described as follows. Although this is not LL(1), this works admirably - it is simply the well-known dangling else ambiguity, which the parser resolves by associating the else clauses with the most recent if */ IfStatement = "if" "(" Condition ")" Statement [ "else" Statement ] . /* The switch statement has to be handled carefully. The labelled case "arms" are optional, but the "default" option can only appear once. The selection can best be done by a general Expression rather than an identifier, and the case "labels" can be constants of any type that would match the type of the selector expression, including identifiers defined in a ConstDeclarations section */ SwitchStatement = "switch" "(" Expression ")" "{" { OneCase } [ "default" ":" { Statement } ] "}" . OneCase = CaseLabel ":" { Statement } . CaseLabel = "case" ( Constant | ("+" | "-") ( number | ConstantIdentifier ) ) . ConstantIdentifier = identifier . /* You might like to consider the differences (if any) between the preceding definition of a switch statement and the alternative below SwitchStatement = "switch" "(" Expression ")" "{" { CaseLabelList Statement { Statement } } [ "default" ":" { Statement } ] "}" . CaseLabelList = CaseLabel { CaseLabel } . CaseLabel = "case" [ "+" | "-" ] ( Constant | Constantidentifier ) ":" . */ /* The case arms usually have to contain a "break" statement, which is syntactically simply another form of statement. There is actually a context-sensitive feature embedded in this - break statements cannot really be placed "anywhere", but we reserve further discussion for a later occasion. */ /* Remember that the DoWhileStatement must end with a semicolon! */ DoWhileStatement = "do" Statement "while" "(" Condition ")" ";" . /* The ForStatement needs to avoid using AssignmentStatement as one might be tempted to do. It is sensible to control a for loop using a simple identifier rather than a general Designator for reasons that might be discussed later in the course */ ForStatement = "for" identifier "=" Expression ( "to" | "downto" ) Expression [ "by" Expression ] Statement . /* Break and Continue statements are very simple. They are really "context dependent" but we cannot impose such restrictions in a context free grammar */ BreakStatement = "break" ";" . ContinueStatement = "continue" ";" . /* ReadLine and WriteLine statements must allow for an empty argument list */ ReadLineStatement = "readLine" "(" [ ReadElement { "," ReadElement } ] ")" ";" . WriteLineStatement = "writeLine" "(" [ WriteElement { "," WriteElement } ] ")" ";" . /* Much of the rest of the grammar remains unchanged: */ WhileStatement = "while" "(" Condition ")" Statement . ReturnStatement = "return" ";" . HaltStatement = "halt" ";" . ReadStatement = "read" "(" ReadElement { "," ReadElement } ")" ";" . ReadElement = stringLit | Designator . WriteStatement = "write" "(" WriteElement { "," WriteElement } ")" ";" . WriteElement = stringLit | Expression . Condition = Expression . /* The basic form of Expression introduces "in", effectively as another relational operator with the same precedence as the other relational operators */ Expression = AddExp [ RelOp AddExp | "in" ExpList ] . AddExp = [ "+" | "-" ] Term { AddOp Term } . Term = Factor { MulOp Factor } . Factor = Designator | Constant | "new" BasicType "[" Expression "]" | "!" Factor | "(" Expression ")" . /* The ExpList used after the "in" operator can be quite general, sytntactically */ ExpList = "(" Range { "," Range } ")" . Range = Expression [ ".." Expression ] . Type = BasicType [ "[]" ] . BasicType = "int" | "bool" . AddOp = "+" | "-" | "||" . MulOp = "*" | "/" | "%" | "&&" . RelOp = "==" | "!=" | "<" | "<=" | ">" | ">=" . AssignOp = "=" . END Parva1.
The problem suggested a Cocol grammar that describes correctly marshalled trains (Trains.atg):
COMPILER Trains $CN /* Grammar for simple railway trains P.D. Terry, Rhodes University, 2011 */ IGNORECASE COMMENTS FROM "(*" TO "*)" NESTED IGNORE CHR(0) .. CHR(31) // '\u0000' .. '\u001f' // Linz version PRODUCTIONS Trains = { OneTrain } EOF . OneTrain = LocoPart [ [ GoodsPart ] HumanPart ] SYNC "." . LocoPart = "loco" { "loco" } . GoodsPart = Truck { Truck } . HumanPart = "brake" | { "coach" } "guard" . Truck = "coal" | "closed" | "open" | "cattle" | "fuel" . END Trains.
and went on to suggest modifying the grammar to build in restrictions that fuel trucks may not be marshalled immediately behind the locomotives, or immediately in front of a passenger coach.
In my experience the wheels come off in many attempts at solving this problem. It is quite hard to get right, and at first one may not easily find an LL(1) grammar that really matches the problem as set.
Given that passenger trains do not have a safety complication, one might be tempted to refactor the grammar to give the equivalent one below, which seems more closely to define a train in terms of the three ways in which it can be classified:
COMPILER Train1 $CN /* Grammar for simple railway trains P.D. Terry, Rhodes University, 2011 */ IGNORECASE COMMENTS FROM "(*" TO "*)" NESTED IGNORE CHR(0) .. CHR(31) PRODUCTIONS Train1 = { OneTrain } EOF . OneTrain = LocoPart [ Passengers | FreightOrMixed ] SYNC "." . LocoPart = "loco" { "loco" } . FreightOrMixed = Truck { Truck } ( "brake" | Passengers ) . Passengers = { "coach" } "guard" . Truck = "closed" | "coal" | "open" | "cattle" | "fuel" . END Train1.
Here is an attempt at safety. But this one insists on at least two safe trucks in any train, and is not LL(1):
PRODUCTIONS Train2 = { OneTrain } EOF . OneTrain = LocoPart [ Passengers | FreightOrMixed ] SYNC "." . LocoPart = "loco" { "loco" } . FreightOrMixed = SafeTruck { AnyTruck } LastPart . LastPart = "brake" | SafeTruck Passengers . Passengers = { "coach" } "guard" . SafeTruck = "closed" | "coal" | "open" | "cattle" . AnyTruck = SafeTruck | "fuel" . END Train2.
Why is it not LL(1) compliant? We could apply all the theory of Chapter 7 of the textbook, but maybe an example will suffice. Suppose we have a valid train like
loco coal coal coal coal coach guard
The first coal truck is parsed by the leading SafeTruck in GoodsPart. The next two coal trucks must be parsed by the repetitive part { AnyTruck }, but you can probably see that the last coal truck would have to be parsed by the alternative within LastPart. Unfortunately an LL(1) parser can't see far enough ahead to make that decision, and would be tempted to treat this last coal truck as part of the { AnyTruck } sequence.
Here is one that is LL(1)
PRODUCTIONS Train3 = { OneTrain } EOF . OneTrain = LocoPart [ Passengers | FreightOrMixed ] SYNC "." . LocoPart = "loco" { "loco" } . FreightOrMixed = SafeTruck MoreTrucks HumanPart . MoreTrucks = { "fuel" { "fuel" } SafeTruck | SafeTruck } . HumanPart = "brake" | Passengers . Passengers = { "coach" } "guard" . SafeTruck = "coal" | "closed" | "open" | "cattle" . END Train3.
At first you might thing that this is, at last, a correct solution. But no, it isn't quite. This solution does not allow you to have a train like:
loco loco open fuel fuel brake .
as the last fuel truck in a sequence has now to be followed by at least one safe truck. The grammar does, however, allow trains like
loco open coach coach guard .
with only one truck in the freight section.
It is remarkable that something that at first sight looks so simple might turn out to be frustratingly difficult. Not being able to find an LL(1) grammar is not a train smash - one quite often cannot find an LL(1) grammar for a language. But it's usually worth a try, as parsers for LL(1) grammars are so easy to write. The clue is to be found in a suggestion that one should factorize the grammar not to concentrate on the "obvious" types of train, but on the requirement that at any point along a train that might incorporate fuel trucks, the last part of the train should be "safe". Thus:
PRODUCTIONS Train4 = { OneTrain } EOF . OneTrain = LocoPart [ SafeLoad | "brake" | Passengers ] SYNC "." . LocoPart = "loco" { "loco" } . Passengers = { "coach" } "guard" . SafeLoad = SafeTruck { SafeTruck } ( "brake" | Passengers | SafeFuel ) . SafeFuel = "fuel" { "fuel" } ( SafeLoad | "brake" ) . SafeTruck = "coal" | "closed" | "open" | "cattle" . END Train4.