Compiling with C# and Java (2005)

P.D. Terry, Rhodes University

Chapter exercises

This document comprises extracts from a final draft of the complete 2005 textbook, and is offered "as is".

Exercises 1

1.1 Make a list of as many translators as you can think of that can be found on your computer system.

1.2 Make a list of as many other systems programs (and their functions) as you can think of that can be found on your computer system.

1.3 Make a list of existing features in your favourite (or least favourite) programming language that you find irksome. Make a similar list of features that you would like to have seen added. Then examine your lists and consider which of the features might possibly be related to the difficulty of implementation.

Exercises 2

2.1 Make a list of as many translators as you can think of that can be found on your system.

2.2 Are there any pre-processors on your system? What are they used for?

2.3 The Java Development Kit (J2 SDK) includes a disassembler (javap) that can produce pseudo-assembler source from Java class files. A similar disassembler is Gnoloo (Engel 1999). Experiment with these tools by compiling some simple Java programs and then disassembling them again. Can you track down any more sophisticated decompilers for Java class files?

2.4 Microsoft's .NET development systems include a disassembler (ildasm) that can produce "textual CIL" assembler code from Common Language Runtime (CLR) assemblies. Experiment with this tool by compiling some simple C# or Visual Basic .NET programs and then disassembling them. Can you track down any more sophisticated decompilers for .NET assemblies?

2.5 A pretty-printer is a program that will take a source program and reformat it to make it conform to a particular style as regards layout (for example, indentation of statements and spacing around operators). A pretty-printer can be thought of as a compiler. How would this fit into the classification system just described? Draw T-diagrams that represent a pretty-printer and the operation of pretty-printing a messy program source text to produce an aesthetically pleasing version of this source.

2.6 Which of the translators known to you are of the load-and-go type?

2.7 Do you know whether any of the translators you use produce relocatable code? Is this of a standard form? Do you know the names of the linkage editors or loaders used on your system?

2.8 What sort of problems can you foresee a Fortran compiler having in analyzing statements beginning

                       IF ( I(J) - I(K) ) ........
                       CALL IF (4 ,    ...........
                       IF (3 .EQ. MAX) GOTO ......
                 100   FORMAT(X3H)=(I5)

2.9 What sort of code would you have produced had you been coding a statement like "while (1 < p && p < 9) p = p + q;" into your favourite ASSEMBLER language?

2.10 Write a simple C# or Java program incorporating this statement, compile it, and then use the appropriate disassembler (ildasm or javap/Gnoloo respectively) to examine the sort of code produced by the C# or Java compiler. Does this appear very different from the 80x86 code illustrated in this section?

2.11 Draw the concrete syntax tree for the C++ version of the while statement used for illustration in this section.

2.12 Are there any reasons why short-circuit evaluation should be preferred over the Boolean operator approach? Can you think of any algorithms that would depend critically on which approach was adopted?

2.13 Write down a few other high-level constructs and try to imagine what sort of ASSEMBLER-like machine code a compiler would produce for them.

2.14 What do you suppose makes it relatively easy to compile Pascal? Can you think of any aspects of Pascal which could prove really difficult?

2.15 Descriptions of translators often use two terms which at first seem interchangeable, namely "separate" and "independent" compilation. See if you can discover what the differences are.

2.16 Try to find out which of the compilers you have used are single-pass, and which are multi-pass. For the latter, find out how many passes are involved. Which produce relocatable code needing further processing by linkers or linkage editors?

2.17 Do any of the compilers in use on your system produce ASSEMBLER, C or other such code during the compilation process? Can you foresee any particular problems that users might experience in using such compilers?

2.18 One of several compilers that translates from Modula-2 to C is called mtc, and is freely available from several file transfer protocol (FTP) sites. If you are a Modula-2 programmer, obtain a copy, and experiment with it.

2.19 An excellent compiler that translates Pascal to C is called p2c, and is widely available for UNIX systems from several FTP sites. If you are a Pascal programmer, obtain a copy, and experiment with it.

2.20 Can you foresee any practical difficulties in using C as an intermediate language?

2.21 Try to find out which of the translators you have used are interpreters, rather than full compilers.

2.22 If you have access to both a native code compiler and an interpreter for a programming language known to you, attempt to measure the loss in efficiency when the interpreter is used to run a large program (possibly one that does substantial number-crunching).

Exercises 3

3.1 Draw the T-diagram representations for the development of a P-code to M-code assembler, assuming that you have a C++ compiler available on the target system.

3.2 Later in this text we shall develop an interpretive compiler for a small language called Parva, using C# or Java as the host language. Draw T-diagram representations of the various components of the system as you foresee them.

Exercises 4

4.1 Write, assemble and test programs for the PVM that will solve the following problems. It is suggested that you develop the algorithms, simple though they may be, in a high-level language first, and then translate them to stack machine code.

(a) Find the largest of three numbers.
(b) Find the largest and the smallest of a list of numbers terminated by a zero (which is not regarded as a member of the list).
(c) Find the average of a list of non-zero numbers, the list being terminated by a zero.
(d) Compute n! for small n.
(e) Read a list of numbers and then write it backwards. Assume that the list is terminated by a zero.
(f) Determine the prime numbers between 0 and 255.
(g) Determine the longest repeated sequence in a sequence of digits terminated with zero. For example, for data reading 1 2 3 3 3 3 4 5 4 4 4 4 4 4 4 6 5 5 0 report that "4 appeared 7 times".
(h) Read an input sequence of numbers terminated with zero, and extract the embedded monotonically increasing sequence. For example, from 1 2 12 7 4 14 6 23 0 extract the sequence 1 2 12 14 23.
(i) Read a small array of integers and sort them into order.

4.2 Based on your experiences with Exercise 4.1, comment on the usefulness, redundancy and any other features of the instruction set for the PVM. For example, are the AND, OR and NOT opcodes really necessary or could their effect be achieved in other ways?

4.3 How well does the informal description of the PVM instruction set allow you to develop programs and understand the interpreter for the machine?

4.4 Do you suppose interpreters might find it difficult to handle I/O errors in user programs?

4.5 How difficult would it be to hand translate programs written in this stack machine code into your favourite ASSEMBLER?

4.6 One of the advantages of an emulated machine is that it is usually very easy to extend it - provided the host language for the interpreter can support the features required. Try introducing two new instructions, say INPC and PRNC, which will read and print single character data. Use these new opcodes in programs that solve the following problems.

(a) Read a sentence terminated by a period and write it out backwards.
(b) Read a sentence terminated by a period and produce a frequency count giving the number of times each letter appeared. Treat upper- and lower-case letters as equivalent.

4.7 Notice that there is a BZE instruction, but not a complementary BNZ (one that would branch if TOS were non-zero). Do you suppose this is a serious omission? How could it be added?

4.8 Can you think of ways in which the interpreter can be improved, both as regards efficiency and user friendliness? In particular, try adding debugging aids over and above the simple stack dump already provided. Can you think of any ways in which the interpreter could be made to detect infinite loops in a user program, or to allow itself to be manually interrupted by an irate or frustrated user?

4.9 The interpreter attempts to prevent corruption of the memory by detecting when the machine registers go out of bounds. The implementation is not totally foolproof so, as a useful exercise, improve on it.

4.10 The interpreter checks for division by zero but does no other checking that arithmetic operations will stay within bounds. Improve it so that it does so, bearing in mind that one has to predict overflow, rather than wait for it to occur.

4.11 As an alternative, extend the machine so that overflow detection does not halt the program, but sets an overflow flag in the processor. Provide instructions whereby programmers can check this flag and take whatever action they deem appropriate.

4.12 Extend the interpreter to incorporate the various short-cut instructions suggested in this section. Pay attention to the checks needed to prevent errors from causing the system to collapse.

4.13 Recall that in short-circuit semantics

A AND B means IF A THEN B ELSE FALSE END
A OR B means IF A THEN TRUE ELSE B END

Does the instruction set lend itself easily to coding short-circuit semantics into complex Boolean expressions such as occur (and are even necessary) in code like

if (Y != 0 && X / Y > Z) ...

If not, suggest new instructions that could be used to ease the programmer's lot, and show how these could be implemented in the interpreter.

4.14 As a further variation on the emulated machine, develop a variation where the branch instructions are "relative" rather than "absolute". This makes for rather simpler transition to relocatable code.

4.15 If the LDL N, STL N, LDE and STE instructions are introduced, could one eliminate the LDA N, LDV, LDXA and STO instructions and still have a viable machine?

4.16 If the Bxx "test and branch" instructions are introduced, could one eliminate the Cxx "compare" instructions and still have a viable machine?

4.17 It might be deemed unsatisfactory to locate the literal pool in high memory. An alternative arrangement would be to locate it immediately above the executable code, on the lines of Figure 4.8. Develop a variation on the assembler (and, if necessary, the interpreter) to exploit this idea.

      ┌───────────┬──────────┬────────────┬───────────────┬──────────────┬────────────┐
      │           │          │            │               │              │            │
      │   Code    │  String  │── Heap ───≻│    Unused     │≺── Operand ──│   Local    │
      │           │  pool    │            │               │     stack    │  variables │
      │           │          │            │               │              │            │
      └───────────┴──────────┴────────────┴───────────────┴──────────────┴────────────┘
      0     ↑                             ↑               ↑                           ↑
            │                             │               │                           │
    Program counter                  Heap pointer   Stack pointer               Frame pointer

    Figure 4.8  Alternative memory layout in a stack-oriented computer

4.18 (Slightly harder) Develop a C# or Java implementation of the PVM that treats the memory as an array of 8-bit bytes rather than 32-bit words. This will save a great deal of space so far as instructions are concerned, but some means has to still to be found to store 32-bit integer values that represent integer operands to instructions, or variables stored in memory.

Exercises 5

5.1 How would the regular expression for even binary numbers need modification if the string 0 (zero) were allowed to be part of the language?

5.2 In some programming languages, identifiers may have embedded underscore characters. However, the first character may not be an underscore, nor may two underscores appear in succession. Write a regular expression that generates such identifiers.

5.3 Suppose we define an alphabet A = { a , b }. Write down a regular expression for each of the following languages:

(a) L(R) = { all strings with exactly one occurrence of a };
(b) L(R) = { all strings with at most one occurrence of a };
(c) L(R) = { all strings with at least one occurrence of a }.

5.4 Find a regular expression that generates strings like "facetious" and "abstemious" that contain all five vowels, in order, but appearing only once each.

5.5 Can you find regular expressions that describe the form of 'floating-point' literal constants in Pascal? In C# and Java? In Modula-2?

5.6 Describe the stack assembler language for the PVM of Chapter 4 by means of regular expressions.

5.7 (Louden 1997) A grammar G = { N , T , S , P } is defined to have

N = { A }
T = { ( , ) }
S = A
P = { A → ( A ) A , A → ε }

Give some examples, and describe the general form, of the strings in L(G). What differences would there be if the productions were changed to

P = { A → A ( A ) , A → ε }

or to

P = { A → A ( A ) , A → a , A → ε }

5.8 A grammar G = { N , T , S , P } is defined to have

N = { A , B }
T = { a , b }
S = A
P = { A → Aa, A → a, A → BaB, B → bb }

Give some examples, and describe the general form, of the strings in L(G).

5.9 A grammar G = { N , T , S , P } is defined to have

N = { A , B }
T = { a , b , c }
S = A
P = { A → BAB, A → c, B → a, B → b, B → ab }

Give some examples, and describe the general form, of the strings in L(G).

5.10 Which of the languages in 5.8 and 5.9 are (i) left recursive (ii) right recursive (iii) self-embedding in terms of the definitions in this section?

5.11 What would be the shortest sentence in the language defined by our first example? What would be the longest sentence? Would there be a difference if we used the alternative productions (3a, 3b)?

5.12 Draw the phrase structure trees that correspond to the expressions a - b - c and a * b * c using the second grammar.

5.13 Extend the grammar for expressions so as to incorporate the + and / operators.

5.14 Modify the first of the Sample grammars above to allow an ident token to include digits and underscores, subject to the restriction that an ident must start with a letter, and that two underscores may not appear in succession.

5.15 The following represents an attempt to write a Cocol grammar describing the form of PVM assembler programs like those illustrated in Chapter 4.

  COMPILER Parva

  CHARACTERS
    letter     = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                 + "abcdefghijklmnopqrstuvwxyz" .
    digit      = "0123456789" .
    printable  = CHR(32) .. CHR(127) .
    lf         = CHR(10) .
    stringCh   = ANY - '"' .

  TOKENS
    identifier = letter { letter | digit } .
    label      = letter { letter | digit } ":" .
    comment    = ";" { printable } .
    number     = digit { digit } .
    string     = '"' { stringCh } '"' .
    EOL        = lf .

  IGNORE CHR(9) + CHR(12) + CHR(13)

  PRODUCTIONS
    Parva      = { Statement } .
    Statement  = [ Label ] [ Mnemonic [ Address ]] [ comment ] EOL .
    Label      = label .
    Mnemonic   =   "ADD"  | "AND"  | "ANEW" | "BRN"  | "BZE"
                 | "CEQ"  | "CGE"  | "CGT"  | "CLE"  | "CLT"
                 | "CNE"  | "DIV"  | "DSP"  | "HALT" | "INPB"
                 | "INPI" | "LDA"  | "LDC"  | "LDV"  | "LDXA"
                 | "MUL"  | "NEG"  | "NOP"  | "NOT"  | "OR"
                 | "PRNB" | "PRNI" | "PRNL" | "PRNS" | "REM"
                 | "STO"  | "SUB" .
    Address    = [ "-" ] number | identifier | string .

  END Parva.

(a) What would be the effect of replacing the production for Statement by

Statement = [ Label ] [ Mnemonic [ Address [ comment] ] ] .

(b) What would be the effect of replacing the production for Mnemonic by

Mnemonic = identifier .

Label = identifier ":" .

(d) Develop a grammar that uses the COMMENT facility in Cocol rather than defining a comment as a token.

(e) The above grammar could be criticized for being too permissive - it will accept statements that are clearly syntactically incorrect. How could it be improved?

5.16 (ICPC 1994) In the land of Phool the official language is Phoolish. A Phoolish professor noticed that many of his students had not mastered the syntax of Phoolish well. Tired of correcting their many syntactical mistakes, he decided to challenge his students and asked them to write a program that could check the syntactical correctness of any sentence they wrote. As befits the simple nature of Phools, the syntax of Phoolish is also pleasantly simple. Here are the rules:

the characters in the language are the 26 letters of the standard Western alphabet and the language distinguishes between lower-case and upper-case letters;

every lower-case letter is a simple correct sentence;

if s is a correct sentence, then so is a sentence consisting of an upper-case vowel followed by s - that is As Es Is Os or Us;

if s and t are correct sentences, then so is a sentence consisting of an upper-case consonant followed by st - for example Bst Cst Dst ... Zst.

Describe this language in Cocol.

5.17 Develop simple Cocol grammars to describe each of the following.

(a) A person's name, with optional title and qualifications (if any), for example

        S.B. Terry , BSc
        Mr Kenneth David Terry
        Helen Margaret Alice Terry

(b) A railway goods train, with one (or more) locomotives, several varieties of trucks, and a guard's van at the rear.
(c) A mixed passenger and goods train, with one (or more) locomotives, then one or more goods trucks, followed either by a guard's van or by one or more passenger coaches, the last of which should be a passenger brake van. In the interests of safety, try to build in a regulation to the effect that fuel trucks may not be marshalled immediately behind the locomotive or immediately in front of a passenger coach.
(d) A book with covers, contents, chapters and an index.
(e) A shopping list with one or more items, for example

        3 laboratory assignments
        124 bottles Castle Lager
        12 cases Rhine Wine
        large box aspirins

(f) Input to a postfix (reverse Polish) calculator. In postfix notation, brackets are not used but instead the operators are placed after the operands.
For example,
             infix expression             reverse Polish equivalent 

             6 + 9 =                      6 9 + =
             (a + b) * (c + d)            a b + c d + *
(g) A message in Morse code.
(h) UNIX or MS-DOS file specifiers.
(i) Numbers expressed in Roman numerals.
(j) Boolean expressions incorporating disjunction (OR), conjunction (AND) and negation (NOT).

5.18 Develop a Cocol grammar that defines the rules for expressing a set of EBNF productions, but which does not use the metabrackets { , } , [ and ] in its own production rules.

5.19 Develop a Cocol grammar that defines the rules for expressing a set of BNF productions, and which does not use the metabrackets { , } , [ and ] in its own production rules.

5.20 Develop an EBNF grammar that defines regular expressions as described in section 5.3.

5.21 What practical advantage (if any) does the Wirth notation using [ ] and { } afford over the use of the Kleene closure symbols?

5.22 In another variation on EBNF ε can be written into an empty right-hand side of a production explicitly, in addition to being handled by using the [ ] notation - for example:

Sign = "+" | "-" | ε .

Develop a Cocol grammar that caters for this. Can this be done so as to forbid redundant and unnecessary ε symbols from creeping into productions - as they have done in

Sign = "+" | "-" | ε | ε .

5.23 In yet another variation on EBNF ε can be incorporated in productions implicitly, as exemplified by

Sign = "+" | "-" | . (* the ε or null is between the last | and . *)

Productions like this cannot be described by the productions for EBNF given in section 5.9.1. Develop a Cocol grammar that describes EBNF productions that do allow an empty string to appear implicitly.

5.24 Members of the local Senior Citizens Association make a feature of Friday evenings, when they employ a mediocre group to play for dancing. At such functions the band perform a number of selections and these are interspersed with periods of silence which are put to other good use. The band has only four kinds of selection at present. The first of these consists of waltzes - such a selection always starts with a slow waltz, which may be followed by several more slow waltzes, and finally (but only if the mood of the evening demands it) by one or more fast waltzes. The second type of selection consists of several rock'n'roll numbers. The third is a medley consisting of a number of tunes of any sort played in any order. The last is the infamous "Paul Jones", which is a special medley in which every second tune is "Here we go round the mulberry bush". During the playing of this, the dancers all pretend to change partners, in some cases actually succeeding in doing so. Develop a grammar which describes the form that the evening assumes.

5.25 Scottish pipe bands often compete at events called Highland Gatherings where three forms of competition are traditionally mounted. There is the so-called "Slow into Quick March" competition, in which each band plays a single slow march followed by a single quick march. There is the so-called "March, Strathspey and Reel" competition, where each band plays a single quick march, followed by a single strathspey, and then by a single reel - this set may optionally be followed by a further quick march. And there is also the "Medley" in which a band plays a selection of tunes in almost any order. Each tune fall into one of the categories of march, strathspey, reel, slow march, jig and hornpipe but, by tradition, a group of one or more strathspeys within such a medley is always followed by a group of one or more reels.

Develop a grammar to describe the activity at a Highland Gathering at which a number of competitions are held, in each of which at least one band performs. Competitions are held in one category at a time. Regard concepts like "March", "Reel" and so on as terminals - in fact there are many different possible tunes of each sort, but you may have to be a piper to distinguish one tune from another.

5.26 Here is an extract from the index of my forthcoming bestseller "Hacking out a Degree".

abstract class 12, 45
abstraction, data 165
advantages of C# and Modula-2 1-99, 100-500, Appendix 4
aegrotat examinations -- see unethical doctors
class attendance, intolerable 745
deadlines, compiler course -- see sunrise
horrible design (C and C++) 34, 45, 85-96
lectures, missed 1, 3, 5-9, 12, 14-17, 21-25, 28
recursion -- see recursion
senility, onset of 21-24, 105
subminimum 30
supplementary exams 45-49
wasted years 2004-2006

Develop a Cocol grammar that describes this form of index.

5.27 You may be familiar with the "make" facility that is found on UNIX (and sometimes on MS-DOS) for program development. A "make file" consists of input to the make command that typically allows a system to be re-built correctly after possible modification of some of its component parts. A typical example for a system involving C++ compilation is shown below. Develop a Cocol grammar that describes the sort of make files that you may have used in your own program development.

            # makefile for maintaining my compiler
            CFLAGS = -Wall
            CC     = g++
            HDRS   = parser.h scanner.h generator.h
            SRCS   = compiler.cpp \
                     parser.cpp scanner.cpp generator.cpp
            OBJS   = compiler.o parser.o scanner.o generator.o

            %.o: %.cpp $(HDRS)
                    $(CC) -c $(CFLAGS) $<

            all:    compiler

            new:    clean compiler

            cln:    $(OBJS)
                    $(CC) -o cln $(CFLAGS) $(OBJS)

            clean:
                    rm *.o
                    rm compiler

5.28 C programmers should be familiar with the use of the standard functions scanf and printf for performing input and output. Typical calls to these functions are

scanf("%d %s %c", &n, string, &ch);printf("Total = %-10.4d\nProfit = %d\%%\n", total, profit);

in which the first argument is usually a literal string incorporating various specialized format specifiers describing how the remaining arguments are to be processed.

Develop a Cocol grammar that describes such statements as fully as you can. For simplicity restrict yourself to the situation where any arguments after the first refer to simple variables.

5.29 Attempt to express some of the solutions to previous exercises in terms of syntax diagrams.

Exercises 6

6.1 Draw syntax diagrams which reflect the different approaches taken to factorizing these grammars.

6.2 Another attempt at writing a grammar for expressions using the EBNF metasymbols is

(G6) Expression = { Term "-" } Term .       (1)
         Term = { Factor "*" } Factor .       (2)
         Factor = [ Factor "^" ] Primary .       (3)
         Primary = "a" | "b" | "c" .       (4, 5, 6)

Is this equivalent to grammar G5?

6.3 Develop sets of productions for algebraic expressions that will incorporate the operations of addition and division as well as subtraction and multiplication. Analyze your attempts in some detail, paying heed to the issues of associativity and precedence.

6.4 Suppose that we wish to allow our expressions to incorporate unary negation, as exemplified by - a * b - c. Either of the following variations on grammar G5 appears to offer what we require.

(G7) Expression = [ "-" ] Term { "-" Term } .       (1)
         Term = Factor { "*" Factor } .       (2)
         Factor = Primary [ "^" Factor ] .       (3)
         Primary = "a" | "b" | "c" .       (4, 5, 6)

(G8) Expression = Term { "-" Term } .       (1)
         Term = Factor { "*" Factor } .       (2)
         Factor = Primary [ "^" Factor ] .       (3)
         Primary = "a" | "b" | "c" | "-" Primary .     (4, 5, 6, 7)

Are these grammars really equivalent? If not, what are the differences between them?

6.5 Develop sets of productions which describe expressions exemplified by

- a + sin(b + c) * ( + ( b - a) )

that is to say, fairly general mathematical expressions, with parentheses, leading unary signs, the usual operations of addition, subtraction, division and multiplication, and simple function calls. Ensure that the productions correspond to the conventional precedence and associativity rules for arithmetic expressions.

6.6 We have noted that the exponentiation operator is normally regarded as right associative, in contrast to operators like multiplication and subtraction which are normally regarded as left associative. Have you come across other operators that are right associative?

6.7 Convince yourself that the last set of productions for IF ... THEN ... ELSE statements is unambiguous.

6.8 Our discussion of the "dangling else" has employed Pascal syntax, where the controlling expression is not parenthesized as is required in languages derived from C, and where the word THEN is explicitly present, whereas it is omitted in C derivatives. Do these apparently trivial changes affect the discussion? Why do you suppose Pascal insists on THEN and derivatives of C insist on the parentheses? Could their designers not have opted for simpler syntax, as exemplified by

IF i = j a := b ELSE a := c

6.9 (Aho, Sethi, Ullman 1986) Other possible productions for the IF ... THEN ... ELSE statement are

Statement = "IF" Condition "THEN" Statement
| Matched .
Matched = "IF" Condition "THEN" Matched "ELSE" Statement
| Assignment .

Do these productions solve the dangling else problem?

6.10 Long ago, the Romans represented the numbers 1 through 10 by the strings

I II III IV V VI VII VIII IX X

The following Cocol grammar attempts to describe a sequence of such numbers, separated by commas and terminated by a period.

        COMPILER Roman

        PRODUCTIONS
          Roman  = Number { "," Number }  "."  EOF .
          Number = StartI | StartV | StartX .
          StartI = "I" ( "V" | "X" | [ "I" [ "I" ] ] ) .
          StartV = "V" [ "I" ] [ "I" ] [ "I" ] .
          StartX = "X" .
        END Roman.

Why is it ambiguous? Develop an unambiguous equivalent grammar.

6.11 (Louden 1997) Why is the simple grammar

A = A A | "(" A ")" | ε .

ambiguous? Determine the form of the language it describes and attempt to find an unambiguous equivalent grammar.

6.12 Another attempt to describe expressions, very similar to G5, turns out to be ambiguous. Why is this?

(G10) Expression = Term { "-" Term } . (1)
            Term = Factor { "*" Factor } . (2)
            Factor = Primary [ "^" Expression ] . (3)
            Primary = "a" | "b" | "c" . (4, 5, 6)

6.13 Why are cyclic grammars ambiguous? (See section 6.3.3.)

6.14 Can you find a regular grammar that describes a language only a little more general that the one above

L(G) = { aⁿ b a^m | n, m ≥ 0 }

6.15 Can you find a regular grammar that describes the language

L(G) = { aⁿ b c^m | n, m ≥ 0 }

6.16 Would your grammars be simpler or easier if you were allowed to use ε-productions?

6.17 Find a context-free grammar that describes Modula-2 comments (unlike Pascal, Java, C# and C++, these may be nested).

6.18 Develop a context-free grammar that generates all palindromes constructed of the letters a and b (palindromes are strings that read the same from either end, like ababbaba).

6.19 Derive (or show how to parse) the strings

abc and aaabbbccc

using grammar G12.

6.20 Show informally that the strings

abbc , aabc and abcc

cannot be derived using grammar G12.

6.21 Derive a context-sensitive grammar for strings of 0s and 1s so that the number of 0s and 1s is the same.

6.22 Derive (or show how to parse) the strings

caccac caaccaac caabccaabc

using grammar G13.

Exercises 7

7.1 Test the following grammar for being LL(1)

G = { N , T , S , P }
N = { A , B }
T = { a , b , c , d }
S = A
P =
A → B ( b | d ) | ( a | d ) B
B → b B d | { c }

7.2 Show that the grammar describing EBNF itself (section 5.9.1) is LL(1).

7.3 The grammar for EBNF as presented in section 5.9.1 does not allow an implicit ε to appear in a production, although Exercise 5.25 implied that this was often useful in practice. What change could you make to the grammar to allow an implicit ε? Is your resulting grammar still LL(1)? If not, can you find a formulation that is LL(1)?

7.4 A simplified grammar for constant declarations in Pascal is described by the productions

ConstDeclarations = "CONST" OneConst { OneConst } .
OneConst = identifier "=" number ";" .

Is this part of the grammar LL(1)? What would be the effect if one were to factorize the grammar

ConstDeclarations = "CONST" OneConst { ";" OneConst } ";" .
OneConst = identifier "=" number .

7.5 Consider the following grammar expressed in EBNF for describing the progress of a typical university course:

Course = Introduction Section { Section } Conclusion .
Introduction = "lecture" [ "handout" ] .
Section = { "lecture" | "practical" "test" | "tutorial" | "handout" } "test" .
Conclusion = [ "panic" ] "examination" .

Is this an LL(1) grammar? If not, where does it break the rules?

7.6 As a more interesting example of applying an analysis to a grammar expressed in EBNF, let us consider how we might describe the theatrical production of a Shakespearian play with five acts. In each act there may be several scenes, and in each scene appear one or more actors who gesticulate and make speeches to one another (for the benefit of the audience, of course). Actors come onto the stage at the start of each scene and come and go as the scene proceeds - to all intents and purposes between speeches - finally leaving at the end of the scene (in the Tragedies some may leave dead, but even these usually revive themselves in time to go home). Plays are usually staged with an interval between the third and fourth acts.

Actions like "speech", "entry" and "exit" are really in the category of the lexical terminals which a scanner (in the person of a member of the audience) would recognize as key symbols while watching a play. So one description of such a staged play might be on the lines of

Play = Act Act Act "interval" Act Act .
Act = Scene { Scene } .
Scene = { "speech" } "entry" { Action } .
Action = "speech" | "entry" | "exit" | "death" | "gesticulation" .

This does not require all the actors to leave at the end of any scene (sometimes this does not happen in real life either). We could try to get this effect by writing

Scene = { "speech" } "entry" { Action } { "exit" } .

but note that this context-free grammar cannot force as many actors to leave as entered - in computer language terms the reader should recognize this as the same problem as being unable to specify that the number of formal and actual parameters to a procedure agree.

Analyze this grammar in detail. If it proves out to be non-LL(1), try to find an equivalent that is LL(1), or argue why this should be impossible.

7.7 Suppose, as many authors do, we had decided to define FIRST(ε) = { ε } rather than as ∅. How would we have to express the closure rules for computing FIRST sets in section 7.2.4, the rules for computing FOLLOW sets in section 7.2.8 and the two rules for conforming to an LL(1) grammar?

7.8 The following grammar attempts to describe expressions incorporating subtraction, multiplication and exponentiation with the correct precedence and associativity of these operations. Is it an LL(1) grammar? If not, why not, and can you find a suitable grammar that is LL(1)?

Expression = Term { "-" Term } .
Term = Factor { "*" Factor } .
Factor = Primary [ "^" Expression ] .
Primary = "a" | "b" | "c" .

7.9 The following simple grammar describes a railway train with a locomotive at one end, a guard's van at the other - and a sequence of various types of trucks in between.

Train = "Locomotive" { Truck } "GuardsVan" .
Truck = "OpenTruck" | "ClosedVan" | "PetrolTruck" .

The railway company have decided, in the interests of safety, that in future no train may have a petrol truck immediately behind the locomotive or immediately in front of the guard's van. Can you find a grammar that describes such trains? Is your grammar LL(1)? If not, can you find an equivalent grammar that is LL(1)? If not, why not?

7.10 Palindromes are character strings that read the same from either end. The following represent various attempts at finding grammars to describe palindromes made only of the letters a and b.

Palindrome = "a" Palindrome "a" | "b" Palindrome "b" .
Palindrome = "a" Palindrome "a" | "b" Palindrome "b" | "a" | "b" .
Palindrome = "a" [ Palindrome ] "a" | "b" [ Palindrome ] "b" .
Palindrome = [ "a" Palindrome "a" | "b" Palindrome "b" | "a" | "b" ] .

Which grammars achieve their aim? Which of them are LL(1)? Can you find other (perhaps better) grammars that describe palindromes and which are LL(1)?

7.11 Which of the following statements are true?

(a) An LL(1) grammar cannot be ambiguous.
(b) A non-LL(1) grammar must be ambiguous.
(c) An ambiguous language cannot be described by an LL(1) grammar.
(d) It is possible to find an LL(1) grammar to describe any non-ambiguous language.

7.12 An equivalent syntactic description of Parva that uses only BNF style productions for Parva can be found in the Resource Kit. Does this description use right or left recursion? Write an equivalent grammar which uses the opposite form of recursion.

7.13 Develop a set of syntax diagrams for Parva (see section 5.10).

7.14 We have made no attempt to describe the semantics of programs written in Parva, but to a reader familiar with similar languages they should be self-evident. Write simple programs in the language to do the following tasks.

(a) Find the sum of the numbers between two input data, which can be supplied in either order.

(b) Use Euclid's algorithm to find the HCF of two integers.

(c) Read a list of year dates and print those that correspond to leap years.

(d) Read a sequence of numbers and write out the embedded monotonic increasing sequence.

(e) Use a "sieve" algorithm to determine all the prime numbers less than 255.

In the light of your experience in preparing these solutions, and from the intuition which you have from your background in other languages, can you foresee any gross deficiencies in Parva as a language for handling problems in integer arithmetic (apart from its lack of procedural facilities, which we shall deal with in Chapter 14)?

7.15 Suppose someone came to you with the following draft program, seeking answers to the questions currently found in the comments next to some statements. How many of these can you answer by referring only to the syntactic description given earlier? How many more can you answer by referring to the description of Parva found in the Resource Kit? (The program is not supposed to do anything useful.)

  void main() {
    const
      max = 100000000000000, // Can I have a constant this big?
      min = -89,             // Can I declare a negative constant?
      heading = "Title";     // Can I declare a constant string?
    int
      i = -6,                // Can I initialize as I declare it?
      main;                  // Can I use 'main' as a variable name?
    bool b = i > 4 && i < 7; // Is this a valid assignment?
    write(heading);          // Can I write a string constant?
    int[] l1 = new int[10],  // Can I allocate space as I declare l1?
          l2, l3;            // Are l2 and l3 also allocated space?
    l1[1] = true;            // Is this a valid assignment?
    l1[1] = null;            // Is this a valid assignment?
    l2 = null;               // Is this a valid assignment?
    l1 = new int[-100];      // Can arrays have negative subscripts?
    l1[-1] = 345;            // If so, is this a valid assignment?
    ;; i = j ;;              // Do extra semicolons cause trouble?
    l2 = new int[100];
    l3 = new int[i];         // Can I set the length with a variable?
    if (l2 == l3)            // Can I compare l2 and l3 like this?
      write("equal");
    if (l1[3])               // Can I use l1[3] as a condition?
      write("true", true);   // What does this do?
    i = -i;                  // Is this a valid assignment?
    i = + - i * -5;          // Is this sequence of operators valid?
    if (true) ;              // Is this a valid statement?
    if (i > 4) {;}           // Is this a valid statement?
    {}                       // Can I just have {} as a statement?
    if (i > 4) int k;        // Can I have a declaration like this?
  }

7.16 As a more challenging exercise, consider a variation on Parva, one that resembles Modula-2 rather more closely than it does C#. Using the translation below of the sample program given earlier as a guide, derive a grammar that you think describes this language, which we shall call Mikra (after the Greek feminine adjective for "small").

  MODULE Main;
    CONST
      votingAge = 18;
    VAR
      age, eligible, total : INTEGER;
      allEligible, canVote : BOOLEAN;
      voters : ARRAY OF INTEGER;
    BEGIN
      voters := NEW ARRAY [100] OF INTEGER;
      age := 0;
      total := 0;
      allEligible := TRUE;
      READ(age);
      WHILE age > 0 DO
        canVote := age > votingAge;
        allEligible := allEligible AND canVote;
        IF canVote THEN
          voters[eligible] := age;
          eligible := eligible + 1;
          total := total + voters[eligible - 1]
        END;
        READ(age)
      END;
      WRITE(eligible, " voters.  Average age ", total / eligible);
      IF allEligible THEN
        WRITE("Everyone was above voting age")
      END
    END Main.

7.17 In the light of your experience with Exercises 7.15 and 7.16, discuss the ease of "reverse-engineering" a programming language description by consulting only a few example programs. Why do you suppose so many students attempt to learn programming by imitation?

7.18 Modify the Parva language definition to incorporate C#-like forms of:

(a) the do ... while statement;
(b) the if ... else statement;
(c) the switch statement;
(d) the for loop;
(e) the % operator.

7.19 Repeat the last exercise for the Mikra language suggested by Exercise 7.16, using syntax that resembles that found in Pascal or Modula-2.

7.20 Has Parva been described by an LL(1) grammar? If not, what must be changed to achieve this?

7.21 Is your description of Mikra LL(1)? If not, is it possible to find an equivalent LL(1) grammar?

7.22 In Modula-2 and Mikra, structured statements are each terminated with their own END. How would you have to change the Parva language definition to use something similar for the existing statements, and for the extensions suggested in Exercise 7.18? What advantages, if any, do these forms have over those found in C#?

7.23 Study how the specification of string tokens has been achieved in Cocol. Some languages, like Modula- 2, allow strings to be delimited by either single or double quotes, but not to contain the delimiter as a member of the string (so that we might write "David's Helen's brother" or 'He said "Hello"', but not 'He said "That's rubbish!"'). How would you specify string tokens if these had to match those found in Modula-2, or those in Pascal (where various quote characters within strings are represented by two quotes in succession, as exemplified by 'That''s a clever idea'?

7.24 Whether to regard the semicolon as a statement separator or as a terminator has been a matter of some controversy. Do we need semicolons at all in languages like Parva and Mikra? Try to write productions for versions of these language where they are simply omitted, and check whether the grammars you produce satisfy the LL(1) conditions. If they do not, try to modify them until they do.

7.25 A close look at the syntax of Pascal, Modula-2, C#, Java, Parva or Mikra should suggest that all of them allow for effectively empty statement sequences. Can you think of any reasons at all why one should not simply forbid such empty sequences?

7.26 In a language like Modula-2 or Pascal there are two classes of statements that start with identifiers, namely assignment statements and procedure calls. Is it possible to find a grammar that allows this potential LL(1) conflict to be resolved? Does the problem arise in C# or Java and can it be resolved there too?

7.27 The grammar for expressions in Parva employs very few levels of operator precedence, corresponding exactly to the levels found in Pascal and its derivatives. Develop a grammar that uses precedences equivalent to those found in C# or Java. Why do you suppose C derivatives have so many levels of precedence and the rules they have for associativity? What do all these levels offer to a programmer that Pascal derivatives might appear to withhold? Do these derivatives really withhold these features?

7.28 Do you suppose there may be any correlation between the difficulty of writing a grammar for a language (which programmers do not usually try to do) and learning to write programs in that language (which programmers often do)?

Exercises 8

8.1 In section 6.4 an unambiguous set of productions was given for the IF ... THEN ... ELSE statement. Is the corresponding grammar LL(1)? Whatever the outcome, can you construct a recursive descent parser to handle such a formulation of the grammar?

8.2 Add Boolean negation (represented by the token "!") to the grammar and parser.

8.3 Does our grammar for expressions involving relational operators (Relop) give these the same precedence relative to && and || as in C# and Java? If not, develop a grammar and parser that match Boolean expressions in those languages more closely.

8.4 Modify the parsing routines given here to make use of as many statically constructed sets as possible. Can all the sets be statically constructed?

8.5 Experiment with the parser by deliberately giving it incorrect input. Do the error messages make sense? How good is the error recovery? Can you improve it further?

8.6 Provide error recovery for the parsers suggested in earlier exercises that incorporate relational and negation operators.

8.7 Our scanner algorithms have all had the property that they consume at least one character. Suppose that the initial character could not form part of a token (that is, did not belong to the vocabulary of the language). Might it be better not to consume it?

8.8 Suppose our scanner was also required to recognize quoted strings, subject to the common restriction that these should not be allowed to carry across line breaks in the source. How could this be handled? Consider the extensions that would be needed to the ad hoc scanner and also to the FSA scanner.

8.9 Develop recursive descent parsing routines for the language used in this case study.

8.10 Modify the programs supplied in the Resource Kit to incorporate the suggestion just made that efficiency can be improved by first reading the entire source text into a large buffer. Can you find a way to measure the suspected improvement in performance?

8.11 Modify the FSA scanner developed in the previous section to recognize keyword alternatives "and" (for &&) and "or" (for ||) by introducing extra states as suggested in Figure 8.3.

8.12 Develop an ad hoc scanner for Parva, the language described in section 7.4. Use the techniques suggested here for searching a table of keywords in various ways and comment on the relative efficiency of each one.

8.13 Develop a robust technique for evaluating numeric literals as scanning proceeds. An alternative technique is to pass the unconverted lexeme on to the parser and let this assume the responsibility for conversion and possible error detection. Which technique do you suppose is preferable?

8.14 Modify the ad hoc scanner developed in section 8.5 to discard Pascal-style comments. Go on to develop a scanner that can discard Modula-2 style comments.

8.15 In C# and Java there are two forms of comments.

// Comments can extend from // to the end of line
/* Comments can extend from /* to a closing */

Develop a scanner that can recognize and deal with either form, assuming also that / forms part of the token alphabet.

8.16 The following Cocol description is of a series of regular expressions of the sort discussed in section 5.3, incorporating various metasymbols.

      COMPILER RE
      /* Regular expression grammar */

      CHARACTERS
        control  = CHR(1) .. CHR(31) .
        noquote1 = ANY - control - "'" - CHR(0) .
        noquote2 = ANY - control - '"' - CHR(0) .
        meta     = "()*|.;[]-?+" .
        simple   = ANY - control - "'" - '"' - meta - CHR(0) .

      TOKENS
        atomic  = simple .
        escaped = "'" noquote1 "'" | '"' noquote2 '"' .

      COMMENTS FROM "\" TO "\"

      IGNORE  control

      PRODUCTIONS
        RE         = { Expression ";" } EOF .
        Expression = Term { "|" Term } .
        Term       = Factor { [ "." ] Factor } .
        Factor     = Element [ "*" | "?" | "+" ] .
        Element    = Atom | Range | "(" Expression ")" .
        Range      = "[" OneRange { OneRange } "]" .
        OneRange   = Atom [ "-" Atom ] .
        Atom       = atomic | escaped .
      END RE.

Here ; is being used as a metasymbol to separate regular expressions in these examples - it is not normally special in regular expression theory. Develop a parser and scanner from this specification and incorporate them in a program that can recognize input like

a* s d | g | ( i o o ) ; \ a very simple one \
a.s.d? \ d is optional \ | 2 | [ a-s BC A-S 0-9]* (e | f+) ;

8.17 In the Resource Kit will be found implementations of Coco/R for Java and C#. Submit the sample grammars given earlier to the version of your choice, and compare the code generated with that produced by hand in earlier sections.

8.18 Exercises 5.18 through 5.28 required you to produce Cocol descriptions of a number of grammars. Submit these to Coco/R and explore its capabilities for testing grammars, listing FIRST and FOLLOW sets and constructing scanners and parsers.

Exercises 9

9.1 Extend the grammar and the parsers so as to handle an expression language in which one may have an optional leading + or - sign (as exemplified by + a * ( - b + c ) ).

9.2 In section 5.9.1 we gave a set of productions that described Wirth's EBNF notation. Other forms of EBNF make use of an explicit ε and the Kleene Closure operator. Thus, whereas in the Wirth notation we might write

Family = "father" "mother" { "son" | "daughter" } [ "dog" | "cat" ] .

in the other notation we would write

Family = "father" "mother" ( "son" | "daughter" )* ( "dog" | "cat" | ε ) .

Using the grammar in section 5.9.1 as the basis, develop a recursive descent system that will read a set of productions expressed in the Wirth convention and rewrite them in the other convention.

9.3 Develop an attribute grammar and corresponding parser to handle the evaluation of an expression where there may be an optional leading + or - sign (as exemplified by + 9 * ( - 6 + 5 ) ).

Exercises 10

10.1 Study the code produced by Coco/R from the grammar used in this case study. How closely does it correspond to what you might have written by hand?

10.2 Experiment with the grammar suggested in the case study. What happens if the CONTEXT clause is omitted in the scanner specification? What happens if the placement of the WEAK and SYNC keywords is changed?

10.3 Extend the system in various ways. For example, direct output to a file other than standard output, or arrange that ranges can be correctly interpreted in either order.

Exercises 11

11.1 Perusal of the original grammar (and of the equivalent LL(1) version) will suggest that the following declarations would be allowed. Some of them are, in fact, illegal in C.

            int f()[100];  // Functions cannot return arrays
            int g()();     // Functions cannot return functions
            int x[100]();  // We cannot declare arrays of functions
            int p[12][20]; // We are allowed arrays of arrays
            int q[][100];  // We can declare arrays like this
            int r[100][];  // We cannot declare arrays like this

Can you write a Cocol specification for a parser that accepts only the valid combinations of suffixes? If not, why not?

11.2 Extend the grammar to cater for the declaration of more than one item based on the same type, as exemplified by

            int f[100], *x, (*g)[100];

11.3 Extend the grammar and the parser to allow function prototypes to describe parameter lists, and to allow variable declarators to have initializers, as exemplified by

            int x = 10, y[3] = { 4, 5, 6 };
            int z[2][2] = {{ 4, 5 }, { 6, 7 }};
            double f(int x, char &y, double *z);

11.4 Develop a system that will do the reverse operation - read in a description of a declaration (such as might be output from the program we have just discussed) and construct the C code that corresponds to this.

11.5 The parser above allows only single character variable names. Extend it to allow variable names that consist of an initial letter, followed by any sequence of digits and letters.

11.6 Suppose that we wished to be able to generate code for expressions that permit leading signs, as for example + x * ( - y + z). Extend the grammar to describe such expressions, and then develop a program that will generate appropriate code. Do this in two ways: firstly, assume that there is no special machine instruction for negating a register; secondly, assume that such an operation is available (NEG Rx).

11.7 Suppose the machine also provided logical operations:

                AND Rx,Ry        ;  Rx := Rx AND Ry
                OR  Rx,Ry        ;  Rx := Rx OR Ry
                XOR Rx,Ry        ;  Rx := Rx XOR Ry
                NOT Rx           ;  Rx := NOT Rx

Extend the grammar to allow expressions to incorporate infix and prefix logical operations, in addition to arithmetic operations, and develop a program to translate them into simple machine code. This will require some decision as to the relative precedence of all the operations. NOT always takes precedence over AND, which in turn takes precedence over OR. In Pascal and Modula-2, NOT, AND and OR are deemed to have precedence equal to unary negation, multiplication and addition respectively. However, in languages derived from C, NOT has precedence equal to unary negation, while AND and OR have lower precedence than the arithmetic operators - the 16 levels of precedence in C, like the syntax of declarations, are another example of baroque language design that causes some difficulty to beginners. Choose whatever relative precedence scheme you prefer or, better still, attempt the exercise both ways.

11.8 (Harder). Try to incorporate short-circuit Boolean semantics into the language suggested by Exercise 11.7, and then use Coco/R to write a translator for it. The reader will recall that these semantics demand that

A AND B is defined to mean IF A THEN B ELSE FALSE
A OR B is defined to mean IF A THEN TRUE ELSE B

that is to say, in evaluating the AND operation there is no need to evaluate the second operand if the first one is found to be FALSE, and in evaluating the OR operation there is no need to evaluate the second operand if the first is found to be TRUE. You may need to extend the instruction set of the machine to provide conditional and other branch instructions - feel free to do so!

11.9 It is unrealistic to assume that one can simply allocate registers numbered from 1 upwards. More usually a compiler has to select registers from a set of those known to be free at the time the expression evaluation commences, and to arrange to release the registers once they are no longer needed for storing intermediate values. Modify the grammar (and hence the program) to incorporate this strategy. Choose a suitable data structure to keep track of the set of available registers - in C# or Java you could make use of the class for set handling discussed briefly in section 8.3.

11.10 It is also unreasonable to assume that the set of available registers is inexhaustible. What sort of expression requires a large set of registers before it can be evaluated? How big a set do you suppose is reasonable? What sort of strategy do you suppose has to be adopted if a compiler finds that the set of available registers becomes exhausted?

11.11 The constant folding demonstrated here is dangerous in that it has assumed that arithmetic overflow will never occur. Try to improve it.

11.12 We have chosen to implement a single BinNode class, rather than using this as a base class from which are derived AddNode, SubNode, MulNode and DivNode classes. Was this a sensible decision?

11.13 The tree handler is readily extended to perform other simple optimizations. For example, binary expressions like x * 1, 1 * x, x + 0, x * 0 are quite easily detected, and the otherwise redundant operations can be eliminated. Try to incorporate some of these optimizations into the routines given earlier. Is it better to apply them while the tree is under construction or when it is later walked?

11.14 Rework Exercises 11.6 - 11.10 to use ASTs for intermediate representations of source expressions.

11.15 Improved two-address code of the form discussed in this section can be generated without the construction of an explicit AST. A technique described by Wirth (1996) arranges for each parser routine to return an Item descriptor of the value represented by that routine. A suitable class for achieving this is defined by

  class Items {
    const int // kinds of Items
      constant = 0, variable = 1, register = 2;

    int
      kind,   // kinds
      value,  // if kind = constant
      reg;    // if kind = register
    char
      name;   // if kind = variable
  }

From the descriptor pairs (constant, value), (variable, name) or (register, reg) it turns out to be easy to generate high quality code. An implementation of this technique is to be found in the Resource Kit. Study the code, and experiment to discover whether the two-address code it generates differs in any respects from that generated from an explicit AST.

11.16 The ArrayList is useful for prototyping, but in serious applications the use of this class and of simple linear searching algorithms might become expensively inefficient. Explore the use of alternative data structures, either of your own design or making use of some of those to be found on the .NET or Java platforms.

11.17 While the conditions indicated above are necessary for a grammar to be reduced (in the sense described in section 6.3.1), they are not sufficient. Consider, for example, the productions

One = "a" .
Two = "b" Two | "c".

A stronger test might be that each non-terminal (with the exception of the goal) should occur on the right-hand side of at least one production other than the one in which it appears on the left-hand side. Extend the code above to include this stronger test. Is this a sufficient test for the productions to form a reduced set, or can you find a simple set of productions that satisfies the test but is clearly not a reduced set?

11.18 The PRNS opcode in the above grammar can handle only very simple strings. As an interesting exercise, extend the grammar to allow for strings defined by

       CHARACTERS
         control    = CHR(0) .. CHR(31) .
         backSlash  = CHR(92) .
         digit      = "0123456789" .
         printable  = ANY - control .
         stringCh   = printable - '"' - backSlash .

       TOKENS
         stringLit  = '"' { stringCh | backSlash printable } '"' .

Once recognized, a stringLit token will need to be processed to convert any escape sequences like \t to their correct internal values.

11.19 The ASSEMBLER language defined by the above grammar does not allow the user to use named variables. While keeping track of absolute offset numbers for variables is not as much of an inconvenience as keeping track of absolute addresses for branch instructions, programs would be more easily developed if the LDA mnemonic could only be followed by an identifier. Investigate the possibility of implementing this, by building up a list of such identifiers and assigning them offsets automatically as each new one is encountered. A little thought should suggest that the value used as the argument of the initial DSP instruction should initially be regarded as undefined - at the end of assembly it can be backpatched to reflect the total number of variables that have been introduced.

11.20 The simple approach suggested in Exercise 11.19 for automatically declaring variables leaves the programmer vulnerable to hard-to-find bugs if a variable name is typed incorrectly. Investigate the possibility of extending the assembler to require that all variables be declared, as suggested by

             ASSEM
               VAR  I, J, K
             BEGIN
               ...

11.21 A study of the source code for this system in the Resource Kit will reveal that the mapping from mnemonic to opcode is achieved by the method OpCode in the PVM class that uses a very inefficient linear search. Given that this search has to take place once for each line of ASSEMBLER source code, it might be useful to replace it with a more efficient method. Explore various possibilities - for example, using a binary search or a hash table technique.

11.22 Exercise 8.16 presented a Cocol grammar for parsing a list of regular expressions. Extend this grammar so that it will give the alphabet used in each expression - that is, the set of characters that could appear in strings generated from the regular expression. For example:

RE = a|bd|[A-D] Alphabet = a b d A B C D
RE = a*b+xyz("?"|"-") Alphabet = a b x y z ? -

The various expression parsers that have been used in earlier case studies have all assumed that the operands are simple integers. Consider the largely unattributed grammar below, which describes an extended calculator that handles Boolean or integer expressions and can store the results of expressions in a memory that provides for 26 integer values (identified by lower-case names a ... z) and for 26 Boolean values (identified by upper-case names A ... Z).

  COMPILER Calculator $NC
  /* Extended Calculator grammar */

  static int[] intVar = new int[26];
  static bool[] boolVar = new bool[26];

  static void InitializeVars() {
    for (int i = 0; i < 26; i++) {
      boolVar[i] = false; intVar[i] = 0;
    }
  }

  CHARACTERS
    lf          = CHR(10) .
    lowerLetter = "abcdefghijklmnopqrstuvwxyz" .
    upperLetter = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" .
    digit       = "0123456789" .

  TOKENS
    intId       = lowerLetter .
    boolId      = upperLetter .
    number      = digit { digit } .
    EOL         = lf .

  IGNORE CHR(9) + CHR(11) .. CHR(13)

  PRODUCTIONS
    Calculator =                        (. InitializeVars(); .)
                 { Operation } EOF .
    Operation  = [ Print | Assign ] EOL .
    Print      = "print" Expression .
    Assign     = Variable "=" Expression .
    Variable   = intId | boolId .
    Expression = AddExp [ RelOp AddExp ] .
    AddExp     = [ "+" | "-" ] Term { AddOp Term } .
    Term       = Factor { MulOp Factor } .
    Factor     =   number | Variable | "true" | "false"
                 | "(" Expression ")" | "!" Factor .
    AddOp      = "+" | "-" | "||" .
    MulOp      = "*" | "/" | "&&" .
    RelOp      = "==" | "!=" | "<" | "<=" | ">" | ">=" .
  END Calculator.

By itself this grammar does not handle much in the way of semantics (other than to establish the usual sorts of precedence and associativity for the operators), but it acts as a splendid system on which to base various exercises. Consider a few (or all) of the following exercises:

11.23 How could the part of the grammar that parses expressions be attributed so that an Expression could synthesize a Boolean output that would be true if all the operands in the expression were constants, and false otherwise?

11.24 How could the part of the grammar that parses expressions be attributed to ensure that an Expression would synthesize a result that would denote the type of that expression, but would complain if it were presented with an expression that mixed the types of operands and operators in a meaningless way - for example

Acceptable                 Not acceptable

3 + 4 * 6            3 + 4 < 6
(x > y) && (a < b) x < y || a < b
A || B && !C         a || b && -c

Similarly, attempts to assign Boolean values to integer variables or vice versa should be flagged as erroneous.

11.25 Add the actions and attributes needed to develop a complete working calculator from the above Cocol grammar.

11.26 Rather than initialize all integer variables to zero and all Boolean variables to false, suppose one were required to assign a value to a variable (by means of an Assign operation) before using it as an operand in a later expression. How could this constraint be imposed?

11.27 Suppose that the definition of a Variable was relaxed to allow longer (but still distinctive) names, by redefining

     TOKENS
       intId   = lowerLetter { lowerLetter | digit } .
       boolId  = upperLetter { upperLetter | digit } .

Two simple arrays would no longer suffice to model the machine's memory. Design and implement a more flexible system that would allow a user to choose names at will, but still guard against attempting to use a variable before its name had been used as the target of an Assign operation.

A number of interesting projects can be developed from the basic grammar used to describe EBNF productions themselves, as illustrated in section 11.4.

11.28 Develop a system that will analyze a set of productions and tell you which non-terminal has the longest right-hand side, in terms of the number of terminals or non-terminals that appear in one of its possible alternatives.

11.29 Develop a system that will analyze a set of productions, listing them on an output file as they are read, and with each one preceded by a production number. Follow that listing with a cross-reference table, listing for each non-terminal the numbers of the productions in which it appears. This list should clearly distinguish the number of the production in which the non-terminal appears on the left-hand side from the numbers of the productions in which it appears on the right-hand side.

A simple system would probably list the non-terminals in the order in which they had first been encountered on either side of any production rule. How difficult is it to sort the cross-reference list into alphabetical order?

11.30 Until now we have made frequent use of the Cocol/EBNF notation for expressing production rules, and in particular the use of the {} and [] meta-brackets introduced by Wirth (1977) - for example,

  Home       = Family { Pets } [ Vehicle ] "house" .
  Pets       = "dog" [ "cat" ] | "cat" .
  Vehicle    = ( "scooter"  "bicycle" ) | "fourbyfour" .
  Family     = Parents { Children | OldFolks } .
  Parents    = "Dad" "Mom" | "Mom" "Dad" .
  Parents    = "Dad" | "Mom" .
  OldFolks   = "Grandad" | "Grandma" .
  Child      =   "Helen" | "Margaret" | "Alice" | "Robyn" | "Donna"
               | "Lauren" | "Ntombizodwa" | "Ntombizanele" .

In analyzing such productions and/or testing the grammar it was suggested in section 7.2.10 that they might be rewritten without using the Wirth brackets, but using right recursive productions and an explicit ε - as for example,

  Home       = Family AllPets Transport "house" .
  AllPets    = Pets AllPets | ε .
  Transport  = Vehicle | ε .
  Pets       = "dog" PussyCat | "cat" .
  PussyCat   = "cat" | ε .
  Vehicle    = ( "scooter"  "bicycle" ) | "fourbyfour" .
  Family     = Parents Dependants .
  Dependants = ( Children | OldFolks ) Dependants | ε .
  Parents    = "Dad" "Mom" | "Mom" "Dad" .
  Parents    = "Dad" | "Mom" .
  OldFolks   = "Grandad" | "Grandma" .
  Child      =   "Helen" | "Margaret" | "Alice" | "Robyn" | "Donna"
               | "Lauren" | "Ntombizodwa" | "Ntombizanele" .

Use Coco/R to develop a system that will perform this task for you. It will be necessary to invent additional non-terminal names. One suggestion is to do this in a formalized way, as illustrated by

    Home = Family HomeSeq1 HomeOpt2 "house" .

After converting a set of productions from EBNF to BNF, try converting the BNF productions once again to check that your system works consistently. You might also pay some attention to the possibility that the original set of productions is malformed - note, for example, that there are two rules defining Parents, no rules for Children,and that Child is unreachable.

11.31 Develop a system that will analyze a set of productions and compute the FIRST and FOLLOW sets for each non-terminal.

In later chapters we shall study how Coco/R might be used to construct a complete compiler for a simple programming language. As a prelude to this you might like to consider some or all of the following language related projects.

11.32 A pretty-printer (see Exercise 2.5) is a "compiler" that takes a source program and "translates" the source into the same language. That probably does not sound very useful! However, the "object code" is formatted neatly and consistently, according to some simple conventions, making it far easier for humans to understand.

Develop a pretty-printer for programs written in the simple Parva language (for which the grammar was given in section 7.4) or for programs written in the Mikra language (suggested in Exercise 7.16). The good news is that you will not have to develop any semantic analyzers, code generators or symbol table handlers in this project, but can assume that the source program is semantically correct if it is syntactically correct. The bad news is that you may have some difficulty in retaining the comments. They can no longer be ignored, but should preferably be copied across to the output in some way.

A starting point is to enhance the grammar with actions that simply write each terminal as it is parsed, intermingled with actions that insert extra line feeds and spaces at appropriate points. An example will clarify. In Mikra, as in Pascal, statements can be grouped into so-called Compound Statements, in a similar way to the Block of Parva. Part of a pretty-printer grammar for Mikra might read

      CompoundStatement =
        "BEGIN"            (. PP.Append("BEGIN"); PP.IndentNewLine(); .)
           Statement
           { ";"           (. PP.Append(";"); PP.NewLine(); .)
             Statement }
        "END"              (. PP.ExdentNewLine(); PP.Append("END"); .) .

Of course, the productions for all the variations on Statement append their appropriate text as they are unravelled.

Once again, an auxiliary class (PP) should be introduced to give the support needed for these semantic actions, with a specification that might include methods suggested by the following.

    class PP {
      public static void OpenOutput(string fileName)
      // Opens output file from specified fileName

      public static void CloseOutput()
      // Closes output file

      public static void Append(string s)
      // Appends string s to output

      public static void IndentNewLine()
      // Writes line mark to output, and then prepares to indent
      // further lines by a fixed amount more than before

      public static void ExdentNewLine()
      // Writes line mark to output, and then prepares to indent
      // further lines by a fixed amount less than before

      public static void NewLine()
      // Writes line mark to output; leaves indentation unchanged

      public static void Indent()
      // Increments indentation level

      public static void Exdent()
      // Decrements indentation level

      public static void SetIndentationStep(int stepSize)
      // Sets indentation stepSize
    }

11.33 If two high-level languages are very similar, a translator from one to the other can often be developed by taking the idea of a pretty-printer one stage further - rather than writing the same terminals as it reads, it writes slightly different ones. For example, a Mikra CompoundStatement would be translated to a neatly laid out equivalent Parva Block by attributing the production as follows:

      CompoundStatement =
        "BEGIN"           (. PP.Append("{"); PP.IndentNewLine(); .)
           Statement
           { ";"          (. PP.NewLine(); .)
             Statement }
        "END"             (. PP.ExdentNewLine(); PP.Append("}"); .) .

Develop a complete Mikra to Parva translator in this way.

A little reflection will show that translating Parva programs to Mikra is rather more difficult. In Mikra a CompoundStatement cannot contain any declarations - these must all have been made before the first BEGIN is encountered.

11.34 A rather useful tool to have when dealing with large amounts of source code is a "cross-reference generator". This is a program that will analyze the source text and produce a list of all the identifiers that appear in it, along with a list for each identifier of the line numbers on which it can be found. Construct a cross-reference generator for programs written in Parva or one for programs written in Mikra. This can be done at various levels of sophistication - you should at least try to distinguish between the line on which an identifier is "declared", and those where it is "applied".

You should find that the actions needed to enhance the grammar are very straightforward; the bulk of any programming effort falls on the development of a data structure for storing the list of identifiers and each of the lists of line numbers associated with them.

Coco/R can also be used to assist in the production of other syntax-directed systems that are not quite so obviously compiler related.

11.35 As you are probably aware, IP addresses as used in Internet communication are typically expressed in "dotted decimal" form, as exemplified by 146.231.128.6. The IP address actually represents a 32-bit integer; the four numbers in the quadruple corresponding to successive 8-bit components of this integer. For humans, machine addressing is usually more memorable when expressed in "DNS" format, as exemplified by terrapin.ru.ac.za. Some systems maintain tables of matching addresses, for example

   146.231.122.13      cspt1.ict.ru.ac.za    #comments like this
   146.231.128.6       terrapin.ru.ac.za
   146.231.56.10       thistle-sp.ru.ac.za
   147.28.0.62         psg.com

When a computer science department moved to new premises last year, a decision was made to rename and uniquely renumber the many machines in their possession. The system administrators tried to draw up a table like the one above, which was then merged with an existing table in the IT division. Unfortunately, a few mistakes were made which caused havoc until they were ironed out.

The Cocol grammar below might be used to develop a system to parse and check a file in this format.

  COMPILER Check $NC
  /* Grammar to check a list of IP numbers and host names */

  IGNORECASE

  CHARACTERS
    digit  = "0123456789" .
    letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" .
    eol    = CHR(10) .

  TOKENS
    number = digit { digit } .
    name   = letter { letter | digit | "." | "-" } .

  COMMENTS FROM "#" TO eol

  IGNORE CHR(0) .. CHR(31)

  PRODUCTIONS
    Check    = { Entry } EOF .
    Entry    = IPNumber name .
    IPNumber = Quad "." Quad "." Quad "." Quad .
    Quad     = number .
  END Check.

As you may know, multiple names can be given to a host (associated with one IP number) on the Internet. On the basis that the system administrators might wish to do this for their computers, but still assign each name only once, extend the application so that it might react sensibly to data of the form exemplified below.

   146.231.122.13    cspt1.ict.ru.ac.za    #comments like this
                     scifac.ru.ac.za       #another name for it
                     www.scifac.ru.ac.za   #and yet another
   146.231.128.6     terrapin.ru.ac.za
   147.28.0.62       psg.com
   147.28.0.67       psg.com               #name already in use
   146.231.122.1123  cspt2.ict.ru.ac.za    #bad IP address
   146.231.122.15    cspt3.ict.ru.ac.za
   146.231.122.15    cspt4.ict.ru.ac.za    #non-unique IP address

11.36 A computer centre has decided to introduce a system of charging users for electronic mail messages. The scale of charges will be as follows:

message charge - 20 units plus a charge per word of message text; 10 units for each word with at most 8 characters, 60 units for each word with more than 8 characters;
the total charge is applied to every copy of the message transmitted - if a message is addressed to N multiple users, the sender's account is debited by N * Charge.

The program will be required to process data files exemplified by the following (read the messages - they give you some hints).

            From: p.terry@ru.ac.za
            To:   reader@in.bed, guru@uni-rhodes.ac.za
            CC:   staff@listserv.ru.ac.za, mentor@coldmail.com
            This is a message containing twenty-seven words
            The charge will be 20 plus 24 times 10 plus 3 times
            60 units - total 440 multiplied by 4
            ####
            From: tutor@cs.ru.ac.za
            To:   students@lab.somewhere
            You should note that messages contain only words
            composed of plain text or numbers or possible - signs

            Assume for this project that no punctuation marks or
            other extra characters will ever appear - this will
            make it much easier to do

            User names and addresses may also contain digits
            and - characters
            ####

Each message has mandatory "From" and "To" lines, and an optional "CC" (carbon copy) line. Users are addressed in the usual Internet form, and case is insignificant. Ends of lines are, however, significant in addressing, and hence an EOL token must be catered for.

The chargeable text of a message starts after the To or CC line, and is terminated by the (non-chargeable) #### line.

Describe this input by means of a suitable grammar, and then enhance it to provide the necessary attributes and actions to construct a complete charging system that will read and process a file of messages and then list the charges.

11.37 (This project requires some familiarity with music). Tonic Solfa is a system often used to help singers learn the relationships amongst the notes of the major scale so as to learn to pitch musical intervals accurately. Many readers may have met this - the notes of the major scale are named doh, ray, me, fah, soh, lah, te (and, as Julie Andrews taught us in The Sound of Music, that brings us back to doh).

Although historically the syllables were introduced first, essentially as mnemonics, the system was later refined to provide a complete system for musical notation - one which also had the advantage that it could be disseminated cheaply in the days when the production of music expressed in staff notation could only be achieved with expensive engraving techniques. In formal notation the syllables are indicated by their initial letters only - d, r, m, f, s, l, t. Sharpened notes are indicated by adding the letter e, and flattened notes by adding the letter a (so that if the major scale were C major, fe would indicate F sharp and la would indicate A flat). Notes in octaves above the "starting" doh are indicated by superscript numbers, and notes below the "starting" doh are indicated by subscripts. Although the system is basically designed to indicate relative pitch, a particular key can optionally be specified by attaching a letter name to doh at the beginning of the piece (for example, doh = F).

If, for the moment, we ignore timing information, the notes of the well-known jingle Happy Birthday To You could be represented by

s₁ s₁ l₁ s₁ d t₁ s₁ s₁ l₁ s₁ r d
s₁ s₁ s m d t₁ l₁ f f m d r d

Of course we cannot really ignore timing information, which, unfortunately, complicates the picture considerably. In this notation, bar lines ║ and double bar lines ║║ appear much as in staff notation. Braces, { and }, are used at the beginning and end of every line (except where a double bar line occurs).

The notation indicates relative note lengths according to the basic pulse of the music. A bar line is placed before a strong pulse, a colon is placed before a weak pulse and a shorter vertical line | indicates the secondary accent at the half bar in quadruple time. Horizontal lines indicate notes lasting longer than one beat (including dotted or tied notes). Pulses are divided in half by using dots as separators, and half pulses are further divided into quarter pulses by commas. Rests are indicated simply by leaving spaces. For example

║ d : d ║ indicates duple time with notes on each pulse (two crotchets, if it were 2/4 time)

║ d : - | d : d ║ indicates quadruple time (minim followed by two crotchets, in 4/4 time)

║ d : - . d : ║ indicates triple time (dotted crotchet, quaver, crotchet rest, in 3/4 time)

║ d : d . d : d,d . d ║ indicates triple time (crotchet, two quavers, two semiquavers, quaver, in 3/4
time)

Happy Birthday To You might then be coded fully as

doh = F
{║ : : s₁ . - , s₁ ║ l₁ : s₁ : d ║ t₁ : - : s₁ . - , s₁ }
{║ l₁ : s₁ : r ║ d : - : s₁ . - , s₁ ║ s : m : d }
{║ t₁ : l₁ : f . - , f ║ m : d : r ║ d : - : ║║

Clearly this is fairly complex, and one suspects that singers may learn the rhythms "by ear" rather than by decoding this as they sing!

Write a Cocol grammar that describes this notation. Then go on to use it to develop a program that can read in a tune expressed this way and produce "machine code" for a device that is driven by a long stream of pairs of numbers, the first indicating the frequency of the note in Hz, and the second the duration of that note in milliseconds.

Recognizing superscripts and subscripts is clearly awkward, and it is suggested that you might initially use d0 r0 m0 f0 s0 l0 t0; d r m f s l t; d1 r1 m1 f1 s1 l1 t1 to give a range of three octaves, which will suffice for most songs.

Initially you might like to assume that a time signature (like the key) will preface the piece (which will simplify the computation of the duration of each note) and that the timing information has been correctly transcribed. As a later extension you might like to consider how varying time signatures and errors in transcription could be handled, while still assuming that each bar takes the same time to play.

Exercises 12

The first few exercises concentrate on ways of making the system more beginner-friendly:

12.1 Experiment with where best to place the SYNC and WEAK directives for error recovery. While doing so, you might like to attempt to parse source code that deliberately contains a mixture of likely errors (missing semicolons, mismatched brackets) and a mixture of gross errors (keywords omitted, used in completely the wrong context or spelt incorrectly). Can the suggestions made in section 12.5 be improved upon? In particular, can some (or all) of the many WEAK directives be eliminated?

12.2 Consider what happens when an attempt is made to parse programs that use language features that might be expected but which are not included. For example, in Parva, as we have defined it so far, the IfStatement does not have an else clause. If else is not recognized as a keyword, it will be taken for an undeclared identifier and the resulting error messages may be less than helpful. Can you devise a scheme whereby words like this - or keywords in the wrong case - trigger off helpful error messages (or, at least, warnings) to the user?

12.3 As we have used it, the error reporter makes no real distinction between context-free and semantic or context-sensitive errors. Would be an improvement to do this and, if so, how could it be done?

12.4 Balanced comments are actually dangerous. If not properly closed, they may consume valid source code, as exemplified by

         /* this comment opens here and does not close

            a = b + c ;  // this comment and code are swallowed

         /* the /* is ignored; the first comment thus ends here */

One way of assisting the coder find such bugs is to issue a warning if a semicolon is found within a comment. Can you find a way to do this?

12.5 Why do you suppose languages generally impose a restriction that a literal string must be contained on a single line of code?

In C++, two or more literal strings that appear in source with nothing but white space between them are automatically concatenated into a single string. This provides a mechanism for breaking up strings that are too long to fit on one line of source code. How might this feature be added to a Parva parser? This feature is not needed in languages like C# and Java that have proper strings, as the concatenation can be done with a + operator. How would the Parva grammar be modified to allow for this concatenation operator?

A cynic might contend that if a language has features which are the cause of numerous beginners' errors, then one should redesign the language. Consider a selection of the following.

12.6 Bailes (1984) made a plea for the introduction of a "Rational Pascal". According to him, the keywords DO (in WHILE and FOR statements), THEN (in IF statements) and the semicolons used as terminators at the ends of declarations and as statement separators should all be discarded. (He had a few other ideas, some even more contentious). Can you excise semicolons from Parva and Mikra? If, indeed, semicolons seem to serve no purpose other than to confuse learner programmers, why do you suppose language designers use them?

12.7 Although not strictly illegal, the appearance of a semicolon in a program immediately following the condition in an IfStatement or WhileStatement or immediately preceding a closing brace may be symptomatic of omitted code. The use of a so-called "empty statement" means that the example below almost certainly will not behave as its author intended.

         read(i);
         while (i > 10);
         { write(i); i = i - 1; }

Is it possible to warn the user when this sort of code is parsed, and if so, how?

12.8 Brinch Hansen (1983) did not approve of implicit "empty" statements. In what ways would an explicit empty statement (like the SKIP suggested by Brinch Hansen) be any improvement?

12.9 It would be trivially easy to forbid empty statements by simply removing the option from the concrete syntax for Statement. A similar device could be used in Pascal-like languages. Why then do you suppose language designers allow them - are they ever of use?

12.10 Even if one did exclude the empty statement from Parva, a language lawyer would be quick to derive examples showing that it would still be possible to get into trouble. Consider the code below.

         i = 10;
         while (i > 10)
         { /* write(i); i = i - 1; */ }

or, perhaps even more ridiculously

         while (i > 10) {}

Can you find a way to warn a user of the existence of an empty block? We stress that such blocks are legal - but are they ever of practical use?

12.11 We are not finished with pathological cases yet! Consider the following code, which falls into none of the above categories but is still effectively useless:

         while (i > 10) int k;

Can you find a way either to prevent this sort of code from being accepted, or at least warn the user if it is detected?

12.12 Warnings are all very well. The generated parsers have a method Warning that can be called to report them in the same style as hard errors are reported, but a case can be made for not issuing warnings at all (on the basis that it should not be the compiler's responsibility to protect users from their own stupidity to such an extent) or for being able to suppress warning messages (which tend to irritate experienced users). This can be done by making use of the pragma features and command-line parameter handling in Coco/R and you might like to explore this further.

The next few exercises concentrate on variations on scope rules and symbol table implementation. Before attempting them, study the implementation of the Table class carefully.

12.13 Simple linear lists and sequential search algorithms are probably perfectly adequate for implementing symbol table handlers for the small Parva or Mikra programs that one is likely to write. They would becomes highly inefficient for large applications. It may be better to store the table using binary search trees, like those that you may have encountered in data structure courses. Develop such an implementation, noting that it should not be necessary to alter the public interface to the Table class, and also that you will still have to implement the scope rules of the language.

12.14 Yet another approach is to construct the symbol table using an underlying hash table, which probably yields the shortest times for retrievals. Hash tables should also be familiar from other courses you may have taken in computer science. Once again, any implementation you attempt will have to take cognizance of scope rules.

12.15 Notwithstanding the difficulties mentioned in section 12.6.2, modify the scope rules in Parva to handle code like the following.

         int b, i, j, k;  // outer scope
         while (i > j) {  // open inner scope
           int k = 16;      // acceptable - new k masks out earlier k
           bool b = true;   // acceptable - bool b masks out int b
           bool k = false;  // unacceptable - int k still applies
           i = i - 4;
         }                // close inner scope

12.16 An ingenious way for a single-pass compiler to check that the scope of an identifier extends over the whole of the block in which it has been declared was suggested by Sale (1979). The basic algorithm requires that every block be numbered sequentially as it compiled (notice that these numbers do not represent nesting levels). Each identifier entry in the symbol table is given an extra numeric attribute. This is originally defined to be the unique number of the block making the insertion. Each time that the identifier is referenced thereafter, this attribute is reset to the number of the block making the reference. Each time an identifier is declared, and needs to be entered into the table, a search is made of all the identifiers that are in scope to see if a duplicate identifier entry can be found that is already attributed with a number equal to or greater than that of the block making the declaration. If this search succeeds, it implies that the scope rules are about to be violated. Implement this scheme.

12.17 Given that more sophisticated compilation techniques like those suggested in Exercise 12.16 can overcome the problems discussed earlier, why do you suppose the designers of Java deemed it a good idea not to allow identifiers to be redeclared in inner blocks?

12.18 Identifiers that are undeclared by virtue of mistyped declarations tend to be annoying, for they result in many subsequent errors being reported. Perhaps we could adopt a strategy where each undeclared identifier is flagged as undeclared at the point of first reference, and then quietly entered as a variable of the noType pseudo-type in the symbol table. Is this a good idea? Can it be implemented easily? Can you extend the idea to recognize an undeclared array reference variable and thereby reduce the plethora of "unexpected subscript" messages that this might otherwise generate?

The next few exercises explore the possibility of adding extra features to Parva.

12.19 Parva affords a limited opportunity for names to be associated with simple constant values. For convenience we quote the relevant productions again:

         ConstDeclarations
           = "const" OneConst { "," OneConst } ";" .
         OneConst
           = identifier "=" Constant .
         Constant
           = number | charLit | "true" | "false" | "null" .

This mechanism is useful for a programmer (it allows for descriptive names to be given to otherwise cryptic or unmemorable constants) and it is efficient in that it is easily implemented and carries no run- time overhead when incorporated into code generation (as we shall see in the next chapter). It can be extended in a number of ways:

it might be used to give alternative names to already named constants;
it might be used to give alternative names to variables.

These could be handled by extending the production for Constant to read

         Constant = Ident | number | charLit | "true" | "false" | "null" .

and arranging that the attributes for an original entry matching the Ident (other than its name) were simply copied to a new entry - after checking, of course, that such a previous entry had been made. Implement this suggestion. Do you have to take any special precautions to ensure that scope rules are obeyed consistently?

12.20 Perhaps a more useful extension than that suggested by Exercise 12.19 would be to replace the production for OneConst to read

         OneConst = identifier "=" Expression .

but this clearly carries semantic implications. The motivation for introducing constant declarations is to be able to determine (at compile-time) a value that can be stored in the symbol table, and we shall not be able to do this if the Expression is defined in terms of one or more variables. A way around this problem might be to define

         OneConst = identifier "=" ConstExpression .

where ConstExpression leads to a set of productions that are essentially identical to those previously used for Expression, differing only in the semantic checks they perform to ensure that the only acceptable Designators are associated with previously declared constants. The reader might like to pursue this rather wasteful approach, or a possibly better alternative of using the existing hierarchy with further inherited attributes to each production that provide the appropriate context for the semantic checks needed.

12.21 As a rather different way of solving the problems mentioned in Exercise 12.20, we might consider dispensing with the ConstDeclarations production completely and replacing the productions for variable declarations by

         VarDeclarations
           = [ "const" ] Type OneVar { "," OneVar } ";" .
         OneVar
           = identifier [ "=" Expression ] .

in which the implication is that the keyword const specifies that the variables introduced by the declaration sequence - after possible initialization - cannot be altered thereafter. An example may clarify this. Code like the following should be accepted or rejected in the ways indicated in the comments.

         const int i = 10;               // acceptable
         const int n;                    // unacceptable
         const int[] list = new int[i];  // acceptable
         int j = i;
         while (j < 100) {
           const int k = j;              // acceptable (a)
           read(i);                      // unacceptable
           i = j;                        // unacceptable
           list[4] = 5;                  // acceptable
           j = j + list[4];              // acceptable
           list = new int[12];           // unacceptable
         }

The reader might like to ponder why the statement marked (a) should be acceptable - surely each pass around the loop has the effect of changing j and hence of changing k?

12.22 (Harder) Yet another use for the const keyword might be to give alternative names for types. Readers may be familiar with languages like Pascal and Modula-2 where code that names types, as exemplified by

           TYPE
             IdKinds = INTEGER;
             IdTypes = INTEGER;
             Entry = RECORD
                       entryName : STRING;
                       entryKind : IdKinds;
                       entryType : IdTypes;
                     END;

is often encouraged, on the basis that it portrays the purpose of the various fields in the Entry record type rather better than the "everything is an integer" approach beloved of C and Java programmers that we have had to adopt in the code in this text.

Consider the implications of changing the syntax for ConstDeclarations

         ConstDeclarations
           = "const" OneConst { "," OneConst } ";" .
         OneConst
           = identifier "=" Constant .
         Constant
           = Type | Expression .

The reader might wonder why this simple syntactic change results in it being marked "harder". The reason is as follows - if this facility is to be of any use, we shall have to extend the production for VarDeclarations as well. At first all that seems necessary is to write

         VarDeclarations
           = ( Type | Ident ) OneVar { "," OneVar } ";" .
         OneVar
           = identifier [ "=" Expression ] .

but a little reflection will show that, like this, the grammar is no longer LL(1). Show that it is possible to generate an LL(1) parser by refactorizing the grammar appropriately.

12.23 The productions for Expressions in Parva are (deliberately) modelled on those used in Pascal, which had remarkably few levels of precedence. Languages derived from C have many more levels of precedence. Without adding any more operators, replace the part of the grammar dealing with Expression with one modelled on C# or Java, paying attention to type checking.

12.24 C, C++, C# and Java have a range of extended assignment operators, such as +=, -= and *=. It is fairly trivial to add these to the concrete syntax. Are there any implications for constraint analysis that would need special treatment?

12.25 Time to move up to Parva++? Extend Parva to allow for statements of a form exemplified by

         age[dad]++;  // birthday - buy something big
         income--;

12.26 (Harder) In C, C++, C# and Java there is no Assignment statement as such. Instead, the assignment operators are treated as low-precedence infix operators in expressions, and a "statement" like x = y; is really an expression that: (a) computes the value of y; (b) assigns this to x; (c) has an intrinsic value which can then be discarded. This means that it is possible to write useful expressions like x = (y = z + 5); in which the value of z + 5 is first assigned to y and the value of the subexpression (y = z + 5) is then assigned to x. Because the assignment operator is right-associative one can write this concisely as x = y = z + 5;.

Suppose that we modified the relevant parts of the grammar for Parva to read

         Statement  = Block | Expression ";" | /* rest as before */
         Expression = SimpleExp [ AssignOp Expression ] .
         SimpleExp  = AddExp [ RelOp AddExp ] .
         AddExp     = [ "+" | "-" ] Term { AddOp Term } .
         Term       = Factor { MulOp Factor } .
         Factor     =   Designator | Constant
                      | "new" BasicType "[" Expression "]" .
                      | "!" Factor | "(" Expression ")" .
         AddOp      = "+" | "-" | "||" .
         MulOp      = "*" | "/" | "%" | "&&" .
         RelOp      = "==" | "!=" | "<" | "<=" | ">" | ">=" .
         AssignOp   = "=" | "+=" | "-=" | "*=" | "/=" | "%=" .

Would this suffice to develop a more C-like compiler? If not, can you find a better alternative? Can the productions be attributed so as to forbid expression "statements", exemplified by

         a + b = (c) = d + e;
         6;
         a[10];

which seem to be acceptable to the above grammar?

In passing we might comment that the ability to write terse C-style expressions in which there are side- effects arising from the application of assignment operators like =, ++ and -- (which can appear in prefix or suffix form, each with their own dynamic semantics) is one that is often exploited by programmers. Many people would agree that it makes for what some call write-only code - code that is very hard for a human to understand and very prone to error if the precedence rules, associativity and side-effects are not properly understood. Compilers that must be able to unravel this complexity unambiguously are probably correspondingly complex to develop.

12.27 It is trivially easy to add an optional else clause to Parva's IfStatement. If you do so you will find that Coco/R reports an LL(1) violation. Examine the code produced by Coco/R and verify that it correctly applies the disambiguating rule mentioned in sections 7.3 and 8.3.

12.28 Suppose we try an extended form of IfStatement similar to that found in some other languages:

         IfStatement = "if" "(" Condition ")" Statement
                         { "elsif" "(" Condition ")" Statement }
                         [ "else" Statement ] .

Does this have any advantages over the standard arrangement - in particular, does it resolve the "dangling else" problem neatly?

12.29 Repeat Exercise 12.28 for yet another syntactic variation that is sometimes found, where a special keyword fi (if spelt backwards!) marks the end of the statement.

         IfStatement = "if" "(" Condition ")" { Statement }
                         { "elsif" "(" Condition ")" { Statement } }
                         [ "else" Statement ]
                       "fi" .

12.30 As you can see, many variations on the IfStatement have been proposed. One that certainly does not exhibit the "dangling else" ambiguity simply replaces the associated Statements with Blocks.

         IfStatement = "if" "(" Condition ")" Block [ "else" Block ] .

This has some nuisance value. Frequently IfStatements make use of Blocks anyway, but requiring that they always be used clutters up simple code with seemingly extraneous braces - as exemplified by

         if (a > b) { write("a wins"); } else { write("b wins"); }

and this gets more and more tedious when IfStatements are cascaded. As a compromise, someone has suggested that the following syntax be employed.

         IfStatement
           = "if" "(" Condition ")" Block
             [ "else" ( Block | IfStatement ) ] .

What is your reaction?

12.31 The WhileStatement or "pre-test" loop is but one of several variations found in languages for expressing repetition. Sometimes a "post-test" loop is better suited to an algorithm. Pascal-like languages express this with syntax like

         RepeatStatement
           = "REPEAT"
                Statement { ";" Statement }
             "UNTIL" Condition .

and the usual corresponding syntax in C-like languages is

         DoWhileStatement
           = "do" Statement "while" "(" Condition ")" ";" .

Since one often requires several statements between the do and the while, examine whether one could use syntax closer to the Pascal form - say

         DoWhileStatement
           = "do"
                { Statement }
             "while" "(" Condition ")" ";" .

12.32 Another looping construct that most languages provide in one way or another is the familiar ForStatement. In Pascal the simplest of these loops have a syntax that can be expressed

         ForStatement
         = "FOR" identifier ":=" Expression "TO" Expression "DO"
           Statement .

as exemplified by

         FOR i := 1 TO 10 DO write(i)

the equivalent of which would probably be coded in a C-like language as

         for (i = 1; i <= 10; i++) write(i);

The variable exemplified by i is usually known as the control variable.

In fact, although every reader is doubtless familiar with ForStatements and has probably used them correctly on many occasions, they have semantic properties that are deceptively complex and many languages impose quite severe constraints on them, affording computer language lawyers ample opportunity to point out subtleties that the beginner might not have considered. We shall explore a few of these in Chapter 13. For the moment let us adopt a syntax suggested by Pascal and pose the following problems.

(a) Allowing the associated Statement to alter the value of the control variable would allow for fairly chaotic behaviour. Consider, for example, the following meaningless code.

         for i = 1 to 10
           i = i - 1;
         for j = 1 to 10
           read(j);
         for k = 1 to 10
           for k = 1 to 10
             write(k); // a nested loop;

How could you attribute the production for a ForStatement to prevent such statements?

(b) One school of thought maintains that in a ForStatement the control variable should be implicitly declared at the start of the loop, so that it is truly local to the loop - in other words, that a new scope should apply to this variable. Examine the implications of this carefully, and come up with a way of implementing it.

12.33 The simplest applications of the loops we have considered so far are are structured - they have only one entry point, and only one exit point. Some languages allow a slightly less structured loop, which has only one entry point, but which allows exit from various places within the loop body. An example of this might be as follows.

         loop {
           read(a);  if (a > 100) exit; ──────────────────┐
           loop {                                         │
             write(a); read(b);                           │
             if (b > 10) { write('Last '); exit; }        │
             a = a + b;                      │            │
             if (a > 12) exit; ────≻─────┐   │            │
           }                             │   │            │
           write("Total ", a); ≺─────────┴───┘            │
         }                                                │
         write("Finished")  ≺─────────────────────────────┘

The syntax for these new statements appears to be trivial.

         LoopStatement = "loop" Statement .
         ExitStatement = "exit" ";" .

but this is deceptively ingenuous. LoopStatements can be nested, and ExitStatements may only appear within the body of a LoopStatement. Can you find context-free productions that will allow you to incorporate these statements into Parva? If not, can you impose the necessary constraints in some other way?

12.34 An equivalent idea to that suggested in Exercise 12.33 is already to be found in C-like languages in the break or continue statements that they allow within their various structured statements like switch, do, for and while. How would you extend the grammar and parser for Parva to incorporate these statements subject to the same sort of constraints?

12.35 Brinch Hansen (1983) incorporated only one form of loop into Edison - the WhileStatement - arguing that the other forms of loops were unnecessary. What particular advantages and disadvantages do these loops have from the points of view of a compiler writer and a compiler user respectively? If you were limited to only one form of loop, which would you choose, and why?

12.36 The classic "unstructured" statement is the GoToStatement, which in its simplest incarnation allows users to write spaghetti code of the worst possible form. Attempts to control this, and yet still allow some of the flexibility that the statement affords for neatening up some otherwise very clumsily coded combinations of structured statements, usually deem it permissible to go to any part of a Block from within that block or from blocks nested within it, but not permissible to go to an arbitrary point in an inner block from outside that block. Code like the following should be accepted or rejected in the ways indicated in the comments.

         void main () {
           10: int i;
           read(i);
           15: if (i == 0) goto 30;    // acceptable
           goto 20;                    // unacceptable
           while (i > 100) {
             i = i - 1;
             if (i % 2 == 0) goto 20;  // acceptable
             write(i);
             20: read(i);
             if (i < 0) goto 30:       // acceptable
             goto 20;                  // acceptable
           }
           30: write("finished");
         }

A little reflection on this should suggest that labels, like identifiers, have scope and that similar mechanisms to those used to handle variable scoping might handle the rules just suggested. Explore these possibilities, noting that while it might be allowable to define a label that is never the target of a GoToStatement (as is the case with 15 in the above example), no label that is such a target can be left undefined. The above example is not supposed to represent any profound algorithm, but you might like to consider how the analysis would change were the statement 10: int i; to be replaced by 20: int i;.

12.37 A multiple-selection switch is found in many languages, in slightly differing forms. The SwitchStatement of C# and Java has syntax that can be described by

         SwitchStatement   = "switch" "(" Expression ")" "{"
                               { OneSwitch }
                             "}" .
         OneSwitch         = CaseLabel ":" StatementSequence .
         StatementSequence = { Statement } .
         CaseLabel         = "case" Constant | "default" .
         Constant          =   IntConst | CharConst
                             | "true" | "false" | "null" .

but, as usual, this has to be supplemented by semantic constraints. Among these are, fairly obviously, that no Constant may appear more than once as a CaseLabel within a single SwitchStatement, that the types of all the constants and the type of the selector Expression must be compatible, and that there may be only one default label. Although not illegal, a SwitchStatement without any instances of OneSwitch is of little use and might elicit a warning to the user if it is detected.

In C, C++ and Java the statements that follow each CaseLabel are not restricted in any way, although it is common to find that the last of them is a BreakStatement or ReturnStatement - if absent the semantics prescribe that control then falls through to the next option. This is expressly forbidden in C#, which requires that if a StatementSequence is not empty (or is comprised only of empty statements) then it must not be possible to fall through to the next option. To a good approximation, this means that each StatementSequence must terminate with an unguarded explicit BreakStatement or ReturnStatement.

Develop the productions given above in such a way that all these constraints are properly checked.

12.38 Another possibility for the syntax of a multiple-selection statement is given below, inspired by the form of the statement found in Modula-2.

         CaseStatement     = "case" "(" Expression ")" "{"
                               OneCase { "|" OneCase }
                               [ "default" ":" StatementSequence ]
                             "}" .
         OneCase           = [ Range { "," Range } ":"
                               StatementSequence ] .
         StatementSequence = { Statement } .
         Range             = Constant [ ".." Constant ] .

This is more convenient when one wishes to write code such as is exemplified by

         case (age) {
             0 .. 12 : write("weeny-bopper");
          | 13 .. 19 : write("teenager");
          | 100, 200 : write("another century");
            default  : write("adult");
         }

Note that what might at first appear to be a rather perverse way of defining OneCase means that the above could also have been coded

         case (age) {
          |  0 .. 12 : write("weeny-bopper");
          | 13 .. 19 : write("teenager");
          | 100, 200 : write("another century");
          | default  : write("adult");
         }

and, as in several other exercises, the reader is invited to dwell on whether this apparent redundancy is either useful or necessary.

In Modula-2 the presence of a BreakStatement is implicit (the user did not supply it; the compiler did). Clearly the CaseStatement is otherwise subject to the same sorts of constraints as mentioned in Exercise 12.37, with the added complication that the case labels are effectively prescriptions for sets and none of these must have an element in common with any other.

Develop these productions to give a parser that checks these constraints.

12.39 Parva, as so far defined, has only two basic types (integer and Boolean) which are quite distinct from one another. Languages derived from C have another fundamental type char that is somewhat of a hybrid. One can usually combine operands of char and int types quite freely, treating them effectively all as integers. A character literal can be thought of as being of char type so that code like

          char c = 'A'; write(c);

would output the character A. Code like

          int c = 'A'; write(c);

is also acceptable but would output the number 65 (the ASCII encoding of the character A). In short, an expression containing integer and character operands is deemed to be of integer type unless it is composed of a single character operand. When necessary, one is allowed to make explicit casts, as exemplified by

          char c = (char) 34;

Add a character type to Parva, paying careful attention to constraint analysis. As an alternative to the rather permissive rules inspired by C-like languages you might like to consider following the Pascal/Modula-2 model, which dictates that character and integer operands cannot be mixed freely in expressions or assigned freely one to another - any mixing has to be done by making use of explicit conversion mechanisms.

Exercises 13

Many of the previous suggestions for extending Parva will act as useful sources of inspiration for projects. Some of those may call for extra interfaces to the code generator, and will be suggested later, but the following ones will be found to require no more than we have already discussed.

13.1 How do you generate code for an IfStatement that allows an else option, with the "dangling else" ambiguity resolved in the usual way? Bear in mind that the else part may or may not be present, and ensure that your solution can handle both situations.

13.2 Generate code for the extended form of IfStatement described by

         IfStatement = "if" "(" Condition ")" Statement
                       { "elsif" "(" Condition ")" Statement }
                       [ "else" Statement ] .

13.3 Repeat Exercise 13.1 for yet another syntactic variation that is sometimes found

         IfStatement = "if" "(" Condition ")" Block
                       [ "else" ( Block | IfStatement ) ] .

13.4 Generate code for the DoWhileStatement in its traditional form

DoWhileStatement = "do" Statement "while" "(" Condition ")" ";" .

This may require some ingenuity, as the code generator does not have a BranchTrue method.

13.5 Modify the language and compiler to allow you to handle the BreakStatement and ContinueStatement as they are found in C-like languages. Bear in mind that loops may be nested, that these statements are not allowed except within the context of a loop of some form and that there might be several of them within any one loop.

13.6 The GoToStatement was the subject of Exercise 12.36. If you take into account all the constraints mentioned there, can you generate code for such statements? Bear in mind that there might be several statements that all (legitimately) target the same address.

13.7 Brinch Hansen (1983) introduced an extended form of the WhileStatement into the language Edison. If we make some minor changes to match the rest of our syntax, his suggestion could be described by

         WhileStatement = "while" "(" Condition ")" Block
                          { "or" "(" Condition ")" Block } .

The Conditions are evaluated one at a time in the order written until one is found to be true, when the corresponding Block is executed, after which the process is repeated. If no Condition is true, the loop terminates. How could this be implemented? Can you think of any algorithms where this statement would be useful?

13.8 What changes would be needed to add an explicit WriteLine statement to Parva?

13.9 Modify the HaltStatement to allow it to take an optional string parameter that would be displayed before execution ceased.

13.10 The production for OneVar effectively allocates a fresh element in the activation record for each variable that is declared. It may have struck you that it is possible to reuse elements after the statements in a Block have completed execution. Reflection on the example below will show that the sets of variables {p, q} and {s, t, u} are never all in scope together at compile-time, and need not both exist simultaneously at run-time. Hence they could share memory in the activation record if they were assigned the offsets indicated in the comments.

    void main () {
      int i = 0;                   // offset 0
      int[] list = new int[10];    // offset 1
      while (i < 10) {             // local block scope opens
        int p, q;                  // offsets 2, 3
        read(p, q);
        list[i] = p + q;
        i = i + 1;
      }                            // local block scope closes
      while (i >= 0) {             // local block scope opens
        int s, t, u;               // offsets 2, 3, 4
        s = list[i];
        t = s * list[i];
        u = t * list[i];
        write(s, t, u);
        i = i - 1;
      }                            // local block scope closes
    }

Modify the compiler to economize on activation record space in this way.

The following problems are probably best solved by making use of opcodes in addition to those in the basic set suggested in section 4.4.

13.11 Implement the full range of assignment operations in Parva and the ++ and -- statements, by following the ideas outlined in section 13.5.1.

13.12 Complete the implementation of a SwitchStatement, on the lines suggested in section 13.5.3, generating code of the form required by the SWI and SEL opcodes that are implemented in the interpreters in the Resource Kit.

13.13 If you add a character data type to Parva, as suggested in Exercise 12.39, how do you generate any special opcodes needed to support it? A set of suitable opcodes has been provided in the source code in the Resource Kit. Study these carefully, noting that operations on character data, to be safe, require run-time constraint checking (for example, to ensure that incrementing or decrementing a character variable does not take its value outside the equivalent integer range 0 ... 255). No special operations have been provided to support "casting". Is this a serious omission?

13.14 Various space-saving opcodes such as LDC_n and LDA_n were suggested in section 4.10.1 for handling the frequent cases where the LDC N and LDA N opcodes use small values of N. Add these to the interpreter, and modify the code generator to make use of them.

13.15 (Harder) If you study the code generated by the compiler you should be struck by the fact that the sequence LDA x; LDV occurs frequently. Much better code can be generated if the stack machine is extended to provide the operations LDL N and STL N (as suggested in section 4.10.2) for loading and storing variables, and the operations LDE and STE for optimizing the loading and storing of array elements. As an aside, this sort of optimization is much easier if one is not restricted to using an incremental compilation technique, but with care and ingenuity it can still be accomplished.

13.16 Extend the system so that you can declare constants to be literal strings and print these, and automatically concatenate strings that appear with nothing but white space between them (see Exercise 12.5). For example,

                void main () {
                  const Planet = "Dear" " World";
                  write("Hello ", Planet);
                }

13.17 The PVM supports an opcode (ALEN) for determining the length of an array, which might be useful for handling code like that probably familiar from C#

i = 0; while (i < array.Length) { write(array[i]); i++; }

Extend Parva to support this syntax and generate the appropriate code.

13.18 As you will recall, code of the form

int[] list2 = new int[100], list2 = list1;

only allocates space for one array object, and assigns the address of that object to both list1 and list2. It does not "clone" the array. This, and attempted comparisons of the form

if (list1 == list2) write("Equal arrays");

are a frequent source of confusion in the minds of programmers trained in the Wirth school of languages where the assignment operation has the effect of cloning, and where the comparison operation is semantically prohibited. Give some thought to extending Parva and (if necessary) enhancing the PVM to support the easier "cloning" and "deep comparison" of arrays.

13.19 Some users like to live dangerously. How could you arrange for the compiler/interpreter to have an option whereby generation of subscript range checks could be suppressed?

13.20 As was mentioned in Exercise 12.32, many languages provide for a ForStatement in one form or other. The semantics of these can differ in surprising ways. For example, the Pascal statement

FOR i := 1 TO 10 DO write(i)

and its apparent C#/Java equivalent

for (i = 1; i <= 10; i++) write(i);

are often explained to beginners as both being semantically equivalent to

         i = 1;
         while (i <= 10) {
           write(i);
           i = i + 1;
         }

However, it is dangerous to extrapolate from this example. Consider a similar statement pair

FOR i := i + 4 TO i + 10 DO write(i)

and its apparent C#/Java equivalent

for (i = i + 4; i <= i + 10; i++) write(i);

A Pascal compiler is require to generate code that corresponds closely to

         i = i + 4;
         int final = i + 10;
         while (i <= final) {
           write(i);
           i = i + 1;
         }

while the apparently equivalent C-like statement leads to an potentially infinite loop

         i = i + 4;
         while (i <= i + 10) {
           write(i);
           i = i + 1;
         }

In principle, of course, the condition i < i + 10 would now always be true, although the loop would probably terminate when the value assigned to i overflowed the capacity of an integer.

How would you implement a Pascal-style ForStatement if you were to define a ForStatement in Parva to be of the form

ForStatement = "for" identifier "=" Expression ( "to" | "downto" ) Expression Statement .

How would you implement a ForStatement with syntax closer to that found in C#/Java, restricting yourself to statement forms analogous to those suggested here?

(A Pascal compiler is required to make sure that overflow does not jeopardize the execution of the loop, but consideration of the subtle nuances of this is beyond the scope of the present discussion. The interested reader will find a good treatment of the peculiarities of the ForStatement in the book by Grune et al. (2000).)

13.21 Why do you suppose languages are often defined so as to allow a ForStatement to terminate with its control variable "undefined"?

13.22 Many development environments incorporate "debuggers" - sophisticated tools that will trace the execution of a compiled program in conjunction with the source code, referring run-time errors to source code statements, allowing the user to interrogate (and even alter) the values of variables by using the identifiers of the source code, and so on. Development of such a system could be a very open- ended project. As a less ambitious project, extend the interpretive compiler for Parva that, in the event of a run-time error, will relate this to the corresponding line in the source, and then print a post- mortem dump showing the values of the variables at the time the error occurred. A (handcrafted) system of this sort was described for the well-known subset of Pascal known as Pascal-S (Wirth 1981, Rees and Robson 1987), and was also used in the implementation of the simple teaching language Umbriel (Terry 1995).

13.23 The implementations of Parva provided in the Resource Kit produce a text file of PVM code. They do this by "decompiling" the code produced by the compiler. Develop a version of the code generator that produces assembler code directly, suitable for input to the sort of assembler discussed in section 11.5. You will have to extend that assembler if you have made additions to the PVM opcode set.

Systems that have an expression grammar of some form as a fundamental component provide a rich source of projects. Here are some suggestions.

13.24 The spreadsheet has become a very popular tool in recent years. A modern commercial package provides many thousands of features - we shall be less ambitious. In essence, a simple two-dimensional spreadsheet is based on the concept of a matrix of cells, typically identified by a letter-digit pair (such as E7) in which the letter specifies a row, and the digit specifies a column. Part (or all) of this matrix is displayed on the terminal screen - one cell is taken as the active cell and is usually highlighted in some way (for example, in inverse video).

Input to a spreadsheet is then provided in the form of expressions typed by the user, interleaved with commands that can reselect the position of the active cell. Each time an expression is typed, its formula is associated with the active cell, and its value is displayed in the correct position. Changing the contents of one cell may affect the values of other cells. In a very simple spreadsheet implementation, each time one cell is assigned a new expression the values of all the other cells are recomputed and redisplayed.

For this exercise assume that the expressions are confined to integer expressions of the sort exhaustively discussed in this text. The operands may be integer literals or the designators of cells. No attempt need be made to handle string or character values.

A simple session with such a spreadsheet might be described as follows.

   (* we start in cell A1 *)
   1 RIGHT        (* enter 1 in cell A1 and move on to cell A2 *)
   99 RIGHT       (* enter 99 in cell A2 and move on to cell A3 *)
   (A1 + A2) / 2
   ENTER          (* cell A3 contains the average of A1 and A2 *)
   DOWN LEFT LEFT (* move to cell B1 *)
   2 * A1         (* cell B1 now contains twice the value of A1 *)
   UP             (* move back to cell A1 *)
   5              (* alter expression in A1, A3 and B1 affected *)
   GOTO B3        (* move to cell B3 *)
   A3 % 3 ENTER   (* B3 contains remainder when A3 is divided by 3 *)
   QUIT

At the point just before we quit, the grid displayed on the top left of the screen might display

                    1        2        3       (* these are column numbers *)
               ┌────────────────────────────────────
            A  │    5       99       52
            B  │   10                 1
               │

It is possible to develop such a system using Coco/R in a number of ways, but it is suggested that you proceed as follows.

(a) Derive a context-free grammar that will describe the form of a session with the spreadsheet like that exemplified above.

(b) Enhance your grammar to provide the necessary attributes and actions to enable a complete system to be generated that will read and process a file of input, and compute and display the spreadsheet, updating the display each time new expressions become associated with cells.

Make the following simplifying assumptions.

(a) A spreadsheet is normally run "interactively". However, Coco/R generates systems that most conveniently take their input from a disk file. If you want to work interactively you will need to modify the scanner frame file considerably.

(b) Assume that the spreadsheet has only 20 rows and 9 columns, extending from A1 through S9.

(c) Apart from accepting expressions typed in an obvious way, assume that the movement commands are input as LEFT, RIGHT, UP, DOWN, HOME and GOTO Cell as exemplified above. Assume that attempts to move too far in one direction either "wrap around" (so that a sequence like GOTO A1 UP results in cell S1 becoming the active cell; GOTO A12 actually moves to A3, and so on), or simply remain at the edge, as you please.

(d) An expression may also be terminated by ENTER, which does not affect the selection of the active cell.

(e) Input to the spreadsheet is terminated by the QUIT operation.

(f) The semantics of updating the spreadsheet display are captured in the following pseudo-code.

           When Expression is recognized as complete
             Store Expression[CurrentRow, CurrentColumn] in a form
                   that can be used for future interpretation
             Update value of Value[CurrentRow, CurrentColumn]
             FOR Row FROM A TO S DO
               FOR Column FROM 1 TO 9 DO
                 Update Value[Row, Column] by
                   evaluating Expression[Row, Column]
                 Display new Value[Row, Column]
               END
             END

(g) Arrange that the spreadsheet starts with the values of each cell set to zero, and with no expressions associated with any cell.

(h) No facilities for editing an expression need be provided - if a cell's expression is to be altered it must be typed afresh.

Hint: The most intriguing part of this exercise is deciding on a way to store an expression so that it can be evaluated again when needed. It is suggested that you associate a simple auxiliary data structure with each cell of the spreadsheet. Each element of this structure can store an operation or operand for a simple interpreter.

13.25 Imagine that you are a lecturer called upon to give a course in Boolean Algebra, and wish to familiarize the class with the basic concepts of Boolean algebra, which is defined on a set of operands, say B = {a, b, ... }, two binary infix operators . and + (AND and OR), and a unary postfix operator ' (NOT).

As you will recall, these operators have a precedence ordering. NOT takes precedence over AND, which takes precedence over OR. This means that an expression written without parentheses, such as a.b + c (or, equivalently, a AND b OR c) means the same as the parenthesized expression (a.b) + c, and does not mean the same as the parenthesized expression a.(b + c). Sometimes the AND operation becomes implicit - as in simple algebra where "a b" can be read as "a times b", so in Boolean algebra "a b" can be read as "a AND b".

Other symbols are sometimes used for the operators. (Expressed as an apostrophe ' , the NOT operator is one of the very few mathematical operators that is written after its operand, as what is called a postfix operator. Another example of a postfix operator is the "factorial" one, as in n!) In computer language terms it is usual to find prefix operators preferred to postfix ones, and to avoid the use of . and +, which tend to get confused with decimal points and addition signs. Hence one finds the use of the full words (as in Pascal and Modula-2) or the use of other symbols entirely, such as & to mean AND, | to mean OR and ! or ~ to mean NOT.

Here is the same Boolean expression written in several different ways, using these various notations, with parentheses used in the expressions in the right-hand column to make the precedence quite explicit.

               A AND B OR NOT C AND D  (A AND B) OR ((NOT C) AND D)
               a.b + c'.d              (a.b) + (c'.d)
               a b + c' d              (a b) + (c' d)
               a & b | ~c & d          (a & b) | ((~c) & d)
               a & b | !c & d          (a & b) | ((!c) & d)

When beginners are taught Boolean algebra and combinational circuits in introductory computer science courses, they are often introduced to the concept of a truth table, like those given below (sometimes these tables are expressed in terms of 0s and 1s; sometimes in terms of TRUE and FALSE).

             ┌────────┬───────┐
             │   X    │ NOT X │
             ├────────┼───────┤
             │  FALSE │  TRUE │
             │   TRUE │ FALSE │
             └────────┴───────┘

        ┌─────────────┬─────────────┐
        │  ONE   TWO  │ ONE XOR TWO │
        ├─────────────┼─────────────┤
        │   0     0   │     0       │
        │   0     1   │     1       │
        │   1     0   │     1       │
        │   1     1   │     0       │
        └─────────────┴─────────────┘

Such tables are easy enough to derive by hand, though it is tedious to do so for expressions that involve more than two inputs, and the process is error prone. Design a tool that will help you produce them quickly for arbitrarily complex expressions. It should be able to accept a list of (independent) functions - for example

BOOLEAN F(x) = x';
BOOLEAN f(x, y) = x | y;
xor(one, two) = one & two' or one' & two;
CanReturn(S1, S2, S3, Excluded) = (S1 S2 OR S2 S3 + S3 S1) AND ~ Excluded;
BOOLEAN Equality(x) = x ' ' & ! (NOT x);
NUMERIC Silly(Billy) = 1 . 0' + Billy;
BOOLEAN Silly(Billy) = TRUE AND NOT FALSE OR Billy

each one defining some function of its explicit arguments, and separated from the next by the traditional semicolon. For each function a table should be produced showing the value of the function for each combination of the possible values of its parameters. If present, the keyword BOOLEAN indicates that the table should be produced in terms of combinations of TRUE and FALSE - the absence of a keyword, or the presence of the keyword NUMERIC, indicates that the table should be produced in terms of combinations of 0s and 1s. Notice that the expressions defining the functions:

may only refer to variables in the parentheses of the function header;
may refer to these variables more than once in an expression, as illustrated by CanReturn;
may not refer to other functions;
must be able to accept any of the equivalent tokens for operators within an expression (as exemplified in CanComplete, where one sees word tokens AND and OR, the alternative ~ used for NOT, and the alternative + used for OR).

This problem can be solved in various ways. Consider the following pseudo-code algorithm which would tabulate a truth table for a function F(X, Y, Z) of three variables.

           BEGIN
             X := FALSE;
             REPEAT
               Y := FALSE;
               REPEAT
                 Z := FALSE;
                 REPEAT
                   Write(X, Y, Z, F(X,Y,Z));
                   Z := NOT Z
                 UNTIL Z = FALSE; (* again *)
                 Y := NOT Y
               UNTIL Y = FALSE; (* again *)
               X := NOT X
             UNTIL X = FALSE  (* again *)
           END.

We can recognize the same sort of "prologue" and "epilogue" for setting up a loop to cycle through the values for each variable, while in the innermost body there is simply the important code to evaluate the expression, and then write out its value. The suggestion is now made that the actions inserted into the grammar generate code for a stack machine like the PVM. Code for the evaluation of the function itself can be generated in a similar way to that discussed earlier in this chapter, surrounded by code for the prologues and epilogues that can be generated in a "boilerplate" fashion for each variable.

13.26 The manufacturers of a microprocessor that is currently being developed have been provided with an assembler for it, but are keen to work at a higher level and have called upon various software houses to bid for a contract to provide a simple compiler that will generate assembler code from a simple source language, which is itself in the process of being developed.

Potential suppliers have been asked to meet an initial specification that calls for a compiler to support a language that can declare simple variables and have simple assignment statements, if-then-else statements and compound statements. Because things are not yet stable, they would like to be able to mix inline assembler code with higher level source code, so as to compensate for the absence of some high-level features.

The microprocessor design is also rather unstable. At this stage all that is known of the instruction set is that it has five opcodes MOV, CMP, BRN, BZE and BNZ. These can be described as follows.

The load/store instructions are of the form

             MOV  Rx, Ry     ;    Rx := Ry
             MOV  Rx, a      ;    Rx := a
             MOV  [a], Rx    ;    Mem[a] := Rx
             MOV  [a], b     ;    Mem[a] := b
             MOV  [a], [b]   ;    Mem[a] := Mem[b]

where Rx and Ry are chosen from R1, R2, R3, R4, R5, the machine's five registers, and a and b are positive integers.

The comparison instructions are of the form

             CMP  x, y       ;    CPU.Z := (x = y)
             CMP  x, [b]     ;    CPU.Z := (x = Mem[b])
             CMP  [b], y     ;    CPU.Z := (Mem[b] = y)

where x and y can be any of R1, R2, R3, R4, R5 or a simple positive integer, and b is a positive integer.

The absolute branch instructions are of the form

             BRN  x          ;    CPU.PC := x
             BZE  x          ;    if CPU.Z then CPU.PC := x
             BNZ  x          ;    if not CPU.Z then CPU.PC := x

where x is a positive integer.

In the high-level language, variable names can be freely chosen and no restriction is imposed against accidentally using a register name or opcode mnemonic as a variable name.

Just to make things interesting, the company has requested that the compiler is not case sensitive and that variables can be tied, at the point of declaration, to a specific machine register. Variables that are not tied in this way are sequentially assigned addresses in memory, decreasing from location 255.

Within the inline assembler code, however, while the opcode names are "reserved", other variable names may be used as operands.

All this has rather perplexed the host of firms who are competing for the lucrative contract, so the company has released a few sample programs, showing all these features in programs otherwise devoid of any real semantic interest. One of these is as follows.

   program example;
   (* simple example of mixed language program *)
     int
       x, r4, r5,      (* some with register names *)
     bool
       CMP, MOV, c, d; (* some with opcode names *)
     int
       r3 = r3,        (* r3 is attached to to register R3 *)
       register1 = r1, (* register1 is attached to register R1 *)

     begin
       x := r4;
       if x = r4
         then
           begin CMP := MOV; x := c; r3 := c; c := 203 end
         else
           if x = r5
             then MOV := c
             else MOV := r3;
       r3 := register1;
       if c = d
         then begin
           d := CMP;
           asm
             mov r1, r2;
             mov r4, 34;
             mov [34], 34;
             mov x, r3;
             mov r3, register1;
             cmp r5, CMP;
             bne 34;
           end
         end
         else (* nothing *)
     end.

Here is the sort of output its compiler must produce (there is no need for the compiler to provide the comments - these are simply to help the reader). Labels can be used as indicated.

      MOV   [255], [254]     ; x := r4
      CMP   [255], [254]     ; if x = r4
      BNE   L0               ;   then begin
      MOV   [252], [251]     ;     CMP := MOV
      MOV   [255], [250]     ;     x := c
      MOV   R3, [250]        ;     R3 := c
      MOV   [250], 203       ;     c := 203
      BRN   L1               ;   end
   L0                        ;   else
      CMP   [255], [253]     ;     if x = r5
      BNE   L2               ;       then
      MOV   [251], [250]     ;         MOV := c
      BRN   L3               ;
   L2                        ;       else
      MOV   [251], R3        ;         MOV := r3
   L3                        ;
   L1                        ;
      MOV   R3, R1           ; r3 := register1
      CMP   [250], [249]     ; if c = d
      BNE   L4               ;   then begin
      MOV   [249], [252]     ;     d := CMP
                             ;     asm
      MOV   R1, R2           ;       mov r1, r2
      MOV   R4, 34           ;       mov r4, 34
      MOV   [34], 34         ;       mov [34], 34
      MOV   [255], R3        ;       mov x, r3
      MOV   R3, R1           ;       mov r3, register1
      CMP   R5, [252]        ;       cmp r5, CMP
      BNE   34               ;       bne 34
      BRN   L5               ;     end
   L4                        ;   else nothing
   L5                        ;

Naturally, the compiler must be able to recognise semantic errors; here is the sort of listing that should accompany a (deliberately) wrong submission.

      1  program example;
      2  (* simple example of incorrect mixed language program *)
      3    int
      4      x, r4, r5;        (* some with register names *)
      5    bool
      6      CMP, MOV, c2, d;  (* some with operator names *)
      7    int
      8      r3 = r43,         (* attach to register *)
  *****           ^ No such register
      9      register1 = r1,
     11
     12    begin
     13      x := r4;
     14      if x = r4
     15        then
     16          begin CMP := MOV; x := c; r3 := c; c := 203 end
  *****                                 ^ Undeclared
     17        else
     18          if x = r5
     19            then MOV := c
  *****                        ^ Undeclared
     20            else MOV := r3;
     22      base := x;
     23      r4 := base;
     24      if c = d
  *****         ^ Undeclared
     25        then begin
     26          d := CMP;
     27          asm
     28            mov r1, r2;
     29            mov r4, 34;
     30            mov 34, 34;
     31            mov x;
  *****                ^ Too few operands
     32            move x, r3;
  *****            ^ Invalid opcode
     33            mov r3, register1;
     34            cmp r5, CMP;
     35            bne 34, xx;
  *****                ^ Too many operands
  *****                    ^ Undeclared
     36          end
     37        end
     38        else (* nothing *)
     39    end.

Confine yourself to simple assignment statements and comparisons like those in the example program - that is, where the right-hand side of an assignment consists of a single constant or variable, and where comparisons are made for equality only, between operands that are single constants or variables - and consider only if-then-else statements where both the then and else clauses are present.

Exercises 14

14.1 The implementation of Parva found among the source code for this chapter in the Resource Kit does not incorporate the many suggestions for additional statements and data types that were the subjects of the exercises in earlier chapters. Extend the Parva compiler to handle some or all of these extensions (most of which are orthogonal to the material of this chapter).

14.2 Our specification of Parva has insisted that only a void function can be used in the guise of a Statement. In most C-like languages a value-returning function can also be used. All that happens is that the return value is discarded - the function is called merely to take advantage of any side effects it produces. Extend Parva to include this feature.

14.3 If you study the interpreter that we have been developing, you should be struck by the fact that this does a great deal of checking that the stack pointer stays within bounds. This check is strictly necessary, although unlikely to fail if the memory is large enough. It would probably suffice to check only for opcodes that push a value or address onto the stack. Even this would severely degrade the efficiency of the interpreter. Suggest how the compiler and run-time system could be modified so that at compile-time a prediction is made of the extra depth needed by the run-time stack by each function. This will enable the run-time system to do a single check that this limit will not be exceeded, as the function begins execution.

14.4 Extending the syntax of Parva to suggest that parameters are to be passed using call-by-reference semantics is straightforward - inspired by the syntax of C# we might simply change the productions for FormalParameters given earlier to

    FormalParameters = [ OneParam { "," OneParam } ] .
    OneParam         = [ "ref" ] Type identifier .

Extend the Parva compiler to support both modes of parameter passing. This is not too difficult to do - call-by-reference relies on passing address values as arguments, so you will need to add an extra attribute to the Entry class to distinguish formal parameter identifiers that denote addresses from those that denote simpler values, and arrange for code within a function to dereference the corresponding argument values where necessary.

14.5 Follow up the suggestion that parameters can be pushed on the stack before the frame header is allocated and can then be accessed through positive offsets from the frame pointer register FP.

14.6 In previous exercises we have suggested that undeclared identifiers could be entered into the symbol table at the point of first declaration, so as to help with suppressing further spurious errors. What is the best way of doing this if we might have undeclared variables, constants, arrays or functions?

14.7 Complete the implementation of the extension to Parva suggested in section 14.5 for incorporating function prototypes as a syntactic device for allowing forward references to be made in a set of mutually recursive functions. How do you guard against trying to execute programs in which function prototypes have been declared, but the corresponding functions have never been properly defined?

14.8 Extend the definition of Parva to allow for functions to be nested in the way suggested in section 14.6, and go on to extend the compiler to generate code for a version of the PVM extended to support either the static link approach to run-time storage management or the use of a display.

14.9 If you allow for nested functions, how do you allow also for forward declarations of these? Bear in mind that when a complete definition of a function is finally encountered, a check will have to be made that this has been done at the right "level".

14.10 Explore the possibility of providing a fairly sophisticated post-mortem dump in an extended interpreter. For example, provide a trace of the function calls up to the point where an error was detected, and give the values of the local variables in each activation record. To be really user-friendly the run-time system will need to refer to the user names for such entities. How would this alter the implementation of the symbol table?

14.11 (Harder) Develop a reference-counting heap manager for Parva. Develop appropriate specialized opcodes to maintain the reference counts and link together the chunks in such a way that adjacent chunks returned to the pool of free space can easily be coalesced so as to reduce fragmentation as far as possible. The only objects that can be allocated from the heap are simple arrays of integers, Booleans (and characters, if you have extended Parva as suggested in earlier exercises), which means that complications arising from cyclic references cannot arise.

14.12 Develop a compiler that translates a Parva program to C# (or to Java).

14.13 How would the Sale algorithm (see Exercise 12.16) be applied to multi-function programs?

The following exercises invite the reader to reflect critically on some aspects of language design and implementation:

14.14 The example given in section 14.6 made no use of function prototypes, yet it handled two mutually recursive functions. Is there any real need to provide function prototypes if you can nest functions?

14.15 In an extended PVM the computation of every run-time address by invoking a function call to traverse the static link chain might prove to be excessively slow if the idea were extended to a native-code generator. Since references to "intermediate" variables are likely to be less frequent than references to "local" or "global" variables, some compilers (for example Turbo Pascal) generate code that unrolls the loop implicit in the base function for such accesses - that is, they generate an explicit sequence of N assignments, rather than a loop that is performed N times - thereby sacrificing a marginal amount of space to obtain speed. Explore the implications and implementation of this idea.

14.16 If you use a display for up-level addressing, is there any real need to use the frame pointer register FP as well?

14.17 Follow up the suggestion that the display does not have to be restored after every return from a function. When should the compiler generate code to handle this operation, and what form should the code take? Are the savings worth worrying about? (The Pascal-S system took this approach (Wirth 1981, Rees and Robson 1987).)

14.18 One possible approach to the problem of running out of display elements is to store as large a display as will be needed in the frame header for the activation record itself. Explore the implementation of this idea and comment on its advantages and disadvantages.

14.19 Some authors suggest that value-returning function subprograms are not really necessary - one can simply use void functions with call-by-reference parameter passing instead. Examine the relative merits of providing both in a language, from the compiler writer's and the user's viewpoints.

14.20 Now that you have a better understanding of how recursion is implemented, study the compiler you are writing with new interest. It uses recursion a great deal. How deeply do you suppose this recursion goes when the compiler executes? Is recursive descent "efficient" for all aspects of the compiling process? Do you suppose a compiler would ever run out of space in which to allocate new activation records for itself when it was compiling large programs?

14.21 In classic Pascal the ordering of the components in a program or procedure block is very restrictive. It can be summarized in EBNF on the lines of

Block = [ ConstDeclarations ]
               [ TypeDeclarations ]
               [ VarDeclarations ]
               { ProcDeclaration }
               CompoundStatement .

In Modula-2, however, this ordering is highly permissive

Block = { ConstDeclarations | TypeDeclarations | VarDeclarations | ProcDeclaration }
CompoundStatement .

Oberon (Reiser and Wirth 1992) introduced an interesting restriction

Block = { ConstDeclarations | TypeDeclarations | VarDeclarations }
{ ProcDeclaration }
CompoundStatement .

Umbriel (Terry 1995) imposed a different restriction

Block = { ConstDeclarations | TypeDeclarations | ProcDeclaration }
{ VarDeclarations }
CompoundStatement .

Although allowing declarations to appear in any order makes for the simplest grammar, languages that insist on a specific order presumably do so for good reasons, perhaps not only for ease of compilation. Can you suggest what these might be?

14.22 How would you write a Cocol grammar to insist on a particular declaration order, and yet recover satisfactorily if declarations were presented in any order?

14.23 Originally, in Pascal a function could only return a scalar value, and not, for example, an ARRAY, RECORD or SET. Why do you suppose this annoying restriction was introduced?

14.24 Exercise 12.32 suggested that if a ForStatement were to be added to Parva some method should be found to prevent a malevolent program from altering the value of the loop control variable within the loop. Some attempts are doing so are easily detected, but those involving function calls are a little trickier, as study of the following might reveal.

     int i;

     void nasty() {
       i = 10;
     }

     void main() {
       for i = 0 to 10 // Corrupt by using as inner control variable
         for i = 0 to 5 {
           read(i);    // Corrupt by reading a new value
           i = 6;      // Corrupt by direct assignment
           nasty();    // Corrupt by calling function with i in scope
         }
     }

Can you find ways to prevent these, and any other similar crimes you might think of committing?

14.25 Several language designers decry function subprograms for the reason that most languages do not prevent a programmer from writing functions that have side effects. The program below illustrates several esoteric side effects. Given that one really wishes to prevent these, to what extent can a compiler detect them?

      int A;
      int[] B = new int[12];

      void F1(int[] X) {
        X[3] = 1;    // modifies an element in X
      }

      void F2() {
        A = 1;       // modifies global variable
      }

      void F3() {
        F2();        // indirect attack on a global variable
      }

      int F4(int[] Y) {
        A = 3;       // side effect
        read(A);     // side effect
        Y[4] = 4;    // side effect
        F1(B);       // side effect
        F2();        // side effect
        F3();        // side effect
        F1(Y);       // side effect
        return 42;   // desired effect!
      }

      void main() {
        A = F4(B);
      }