Computer Science 3 - 2000

Programming Language Translation


Practical for Week 25-26, beginning 16 October 2000 - Solutions to Task 2


Task 2 - Getting to understand Topsy's compiler

Here are some awkward questions. Answer them by editing the file QUIZ in the kit and submit it as your solution by next Tuesday. Your solutions can refer to the line numbers on the listings that form part of this handout.

(a) In Clang a declaration like

VAR List[10];

specifies the upper range of an index, and is equivalent to Pascal's

VAR List : ARRAY [0 .. 10] OF INTEGER;

In Topsy++, however, a declaration like

int List[10];

specifies the size of the array, not the upper limit on an index, and would be closer to Pascal's

VAR List : ARRAY [0 .. 9] OF INTEGER;

So what on earth would happen if you submitted a declaration like

int x[0];

to the Topsy++ compiler?

See the attribute grammar, line 106 (C++) or line 113 (Pascal). The compiler would report a semantic error.

Does it mean anything to have an array of size 0? What?

Well, it would be hard to find any sensible meaning...

What does it mean in C++,? Is it allowed in C++?

Hey-hey-hey. For once C++ adopts a sane approach. Or does it? Submitting code like the above to the Borland compiler produced the message "arrays must have at least one element".

But just look at this nonsense! I compiled the code below with gcc under Unix, which raised no error message.and ran it and received the assurance that yy[0] had been assigned the value 1.

   #include <stdio.h>
   void main (void) {
     int y[10];
     int yy[0];
     yy[0] = 1;
     printf("%d", yy[0]);
     exit(1);
   }

I did not bother to check what it had corrupted behind my back.

(b) Study how the compiler handles escape sequences like \t and \a and \n that appear in string literals. Does it handle them all correctly? If not, where or why not?

The attribute actions for handling escaped characters (lines 371 - 383 for C++ or 390 - 404 for Pascal) should be easily understandable. The actions for escaped characters in strings are a bit more abstruse (lines 386 - 409 for C++ or 406 - 438 for Pascal). The C++ version erases the \ lead in character, and does some substitution on the character that follows it; so does the Pascal one. But the C++ version gets away with substituting CR only for \n; the Pascal one has to substitute CR-LF. This is because of the different ways in which the interpreters' I/O systems ultimately work. These are some of the trickiest actions to understand in the entire grammar, I think.

(c) The for loop is quite a complex construct. Why do you suppose you are not allowed to alter the control variable (see section 10.6)?

Because you would then be allowed to write code that would get you into a mess; for example

for (i = 1; i < 10; i++) { i--; }

(this would produce an infinite loop).

Where is this checked by the compiler?

Lines 232 - 235 (C++) or 244 - 247 (Pascal) are arranged to "protect" the control variable before you parse the associated "body" statement, by marking the symbol table entry so that if the control variable is threatened in an Assignment or ReadStatement in the body of the loop this can be detected (see line 147 (C++) or line 155 (Pascal)).

Is the check robust, or could you write a Topsy++ program that defeated the check? If so, how?

At first I thought it was robust, but it is not. Consider the following:

    void main (void) {
      int i;
      for (i = 1; i < 3; i++) {
        cout << "outer " << i;
        for (i = 1; i < 3; i++) {
          cout << "inner " << i;
        }
      }
    }

Unfortunately this messes things up. The inner loop would corrupt the value of i that is used to control the outer loop (incidentally this sort of mess is frequently got into by beginners when they first write nested loops, and without compiler help they can have a tough time finding out why their programs don't work - usually, of course, it is not quite as obvious as in the above example.)

To fix this, the Topsy attribute grammar needs but a simple repair - lines 208 - 209 (C++) should be replaced by

                 (. if (!entry1.v.scalar || !arithtypes.memb(controltype))
                      SemError(225);
                    if (!entry1.v.canchange) SemError(229); .)

or, for the Pascal version, replace lines 221 - 222 by

                 (. IF NOT Entry1.Scalar OR NOT (ControlType IN ArithTypes)
                      THEN SemError(225);
                    IF NOT Entry1.CanChange THEN SemError(229) .)

In more general programs, implementing this sort of check is very difficult; almost impossible in fact. For example, suppose that one has the control variable declared in one file, and the for loop in another, and that these files are compiled separately. Or suppose that the control variable is passed as a parameter to a function called within the for loop body. Some examples of programs that are deliberately nasty in this sort of way are discussed in that well-known (and by now well-loved?) book (A4 edition; pages 279 - 280).

Every year students come up with things I had not thought of, which is fun. This year some people pointed out that another way to "break" a for loop might be to write silly code like

    void main (void) {
      int i, x = 10;
      for (i = 1; i < x; i++) x++;  // alter top limit as you go through the loop
    }

which, although it does not corrupt the control variable also results in an infinite loop. Of course many other examples can be devised on those lines. In fact, advanced books on compiler practice will tell you that the for loop, delightfully easy as it at first appears, is fraught with all sorts of traps for code generators.

(d) Study carefully how the compiler handles declarations of variable local to inner block statements - Consider, for example, programs like

  $D+ // Turn diagnostic mode on for testing the compiler
  // The dynamic semantics of this program are not really meaningful!

  void main (void) {
    int i;                 // i is global
    { int j, k = 10;       // j and k are local to this inner statement
      j = 2 * k;
      int l = j + k;       // l is also local to this statement
      cin >> i;            // i is still in scope
      if (i > 0) {
        l = k;             // l and k are still in scope
        int m = 90;        // m is local to the body of the if statement
      }
    }
    i = 2 * j + m;         // i is in scope, but j and m are not
  }

  $D+ // Turn diagnostic mode on for testing the compiler
  // The dynamic semantics of this program are not really meaningful!

  void main (void) {
    int i;                 // i is global
    { int i, j, k = 10;    // j and k, and a new i, are local to this statement
      j = 2 * k;
      int l = j + k;       // l is also local to this inner statement
      cin >> i;            // the second i is the one in scope
      if (i > 0) {         // the second i is the one in scope
        l = k;             // l and k are still in scope
        int m = 90;        // m is local to the body of the if statement
        bool i = true;     // yet another i is allowed, local to this body
      }
      if (i > 0) {
        bool i = true;     // yet another i
        int m = 90;        // yet another m
      }
    }
  }

What offsets (addresses) from the base pointer CPU.BP are associated with each variable?

Leaving out all but the declarations for the moment, the first one is easy:

  void main (void) {
    int i;                 // i is global       - offset 1
    { int j, k = 10;       // j and k are local - offsets 2 and 3
      int l = j + k;       // l is also local   - offset 4
      if (i > 0) {
        int m = 90;        // m is local        - offset 5
      }
    }
  }

The second one is more intriguing:

  void main (void) {
    int i;                 // i is global            - offset 1
    { int i, j, k = 10;    // a new i, and j and k   - offsets 2, 3, 4
      int l = j + k;       // l is also local        - offset 5
      if (i > 0) {
        int m = 90;        // m is local             - offset 6
        bool i = true;     // yet another i          - offset 7
      }
      if (i > 0) {         // the last m and i have gone out of scope so
        bool i = true;     // yet another i          - offset 6
        int m = 90;        // yet another m          - offset 7
      }
    }
  }

(e) In the second example above it appears that we can have more than one variable in existence at run time with the apparent name i, and that these variables can even be of different types. Does this really make sense?

Perfectly, though it can confuse the hell out of beginners. As each Block is parsed at compile time it introduces a further level of scope, in a stack-like way, and can add (push onto the stack) its own locally-declared identifiers into the symbol table. After the Block has been parsed, these entries are popped off the symbol table. To see where this is done in the attribute grammar, study lines 64 - 67 (C++ version) or lines 68 - 71 (Pascal version) in conjunction with the code for the table handler (and maybe you will start to appreciate why I keep telling you to understand the workings of simple table handlers).

If so, how do the compiler and interpreter work out which i is which?

At compile time, when an identifier has to be looked up in the table, we plant a sentinel in the bottom of the table and look down from the top. If more than one entry has been pushed into the symbol table for a given identifier spelling, the most recently added will be found first, so we get the effect we require very easily. The interpreter, of course, does not have any problem at all. By the time we get to interpret the code the identifier names no longer appear in any way, the ADR instructions work merely in terms of numbers.

But note that this means that we cannot generate the DSP instruction at the start of parsing the main routine. We have to be prepared to compute the variable space slowly as we work through all the inner statements and inner blocks. So that is why the "statement" parsers now all have a framesize parameter, passed by reference.

You may be interested to learn that this "scope" business formed the basis of the 24 hour exam question in 1996. In the prac course that year, students had developed Topsy for themselves starting from Clang, almost to the point that you were given for this practical, but with the restriction that all variables must be declared before any statements were encountered. For the 24 hour question they were asked to work out how to modify the compiler to handle statements and declarations in any order. And if you think this was tough; well, it was. But a great many outstandingly good answers were received.

(f) In the definition of the symbol table you will find an enumeration of the possible types that variables and expressions can have:

enum TABLE_types {TABLE_none, TABLE_ints, TABLE_chars, TABLE_bools };

in C++, or the equivalent in Pascal:

TYPES = (none, ints, chars, bools);

Surely this is ridiculous? We can only declare variables to be of int, bool or char types. So why is there a none type in the enumeration?

This is a device to be user-friendly. If you look at the attribute grammar in lines 55 - 56 (C++) or 59 - 60 (Pascal) you will find that the none type is included in the set of possibilities for booleans as well as in the set for numerics. And if you look further, for example at line 296 (C++ version) or 313 (Pascal version) you will find that when a type incompatibility in an expression component is encountered, the type of that component is set to none after reporting the error. However, this component will then be compatible with any and all other components in the expression. So the idea is to suppress what might otherwise be many, many "incompatible types" error messages - the first one generated should be enough to alert the user to the fact that he (or maybe she) has got it Badly Wrong. After all, who reads past the first error message in a compilation?

(g) In the Topsy++ manual you will find that a Block is defined as

Block = "{" { Declaration | Statement } "}" .

but in the TOPSY.ATG file you will find

Block = "{" { Statement } "}" .

with the production for Statement augmented to include the forms of Declaration. Is there something wrong with the ATG file, or with the manual, with both, or with neither?

Nothing is wrong. I have mentioned before that grammars are often written in one way to help people understand the "semantic overtones" of the constructs (declarations and statements are fundamentally rather different to most users), but written in another way to make parsing easier. You will find another example of a discrepancy if you look - the Report (page 7, section 10.7) makes no mention of a ReadElement, yet the attribute grammar does, in lines 244 - 253 (C++ version) and 256 - 265 (Pascal version).

(h) In the Topsy++ manual, section 9.2.1 you will find a description of semantics of the || and && operations. What sort of code is produced for statements like

           i = 5;
           j = 6;
           if ((i > 3) || (j > 4)) cout << "passed";

The code would have been

         0 DSP        2
         2 LIT        5
         4 POP       -1
         6 LIT        6
         8 POP       -2
        10 PSH       -1
        12 LIT        3
        14 GTR
        15 BOR       22
        17 PSH       -2
        19 LIT        4
        21 GTR
        22 BZE       26
        24 PRS   'passed'
        26 HLT

How would this be interpreted at run time (study the operation of the stack machine - it is easy to generate the code using the compiler if you don't feel like doing it by hand).

The point of this was to draw attention to the "short circuit" OR operation at location 15, which will branch to 22 if the first subexpression has left a 1 on the top of the evaluation stack (of course it will pop this 1). If there is a 0 on the top of the stack, then the second subexpression is evaluated, and a more normal BZE instruction used to check the condition. The code generation, of course, involves another forward reference - backpatch combination (look at lines 299 - 303 and 317 - 321 (C++) or lines 316 - 320 and 335 - 339 (Pascal) to see how clever all this is. It is another example of where we cheat a bit - we have introduced new opcodes into the machine to make the whole thing easy. If you had to use only BRN and BZE it would be rather more difficult (if you don't believe me, just try).

(i) Suppose we write a loop like this

char c;
for (c = 'a'; true; c++) cout << c;

which looks a bit odd, but is syntactically legal. What sort of code is produced, and how does it behave at run time?

The code produced is

         0 DSP        1
         2 LIT       97
         4 LIT      255
         6 POR       -1
         8 LIT        1  <-----------.
        10 BZE       25  --------------.
        12 BRN       20  -----.      | |   branch down to loop "body"
        14 ADR       -1  <----|---.  | |
        16 PPR      255       |   |  | |
        18 BRN        8  -----|---|--' |  branch back to start of loop
        20 PSH       -1  <----'   |    |
        22 PRC                    |    |
        23 BRN       14  ---------'    |  branch back to "increment section"
        25 HLT           <-------------'

and the point of this was to draw attention to two features.

Firstly, the ++ operation generates a PPR 255 opcode, which is an "increment and check that the silly fool is not trying to get a value out of the range 0 ... 255) because characters are meaningless outside that range". So if that code runs it will soon go out of range - the body of the loop does nothing, the c++ clocks c up and up, and the interpreter sits with a sly grin on its ex-physicist's face waiting to rub its hands in glee! Note further the use of the LIT 255 and POR -1 combination to check that the value initially assigned to c is also in this limited range.

Secondly, note the form of the code generated for a for loop "control section". The whole thing is a bit spaghetti like, but since it is automatically generated it all works properly. The "increment" section is defined by source code that textually occurs before the statement that forms the loop body, yet must, of course, be executed after this body. To reinforce this point, the loop above is equivalent to

         c = 'a';
         while (true) {
           cout << c;
           c++;
         }

(j) Consider this gem of a program. What is the problem, and how does Topsy handle it?

       void main (void) {
         char c = 'z';
         while (true) { // infinite loop, we should be able to generate an error
           c++;         // haven't I told you to avoid c++?
           cout << c;
         }
       }

The code generated is

         0 DSP        1
         2 LIT      122
         4 LIT      255
         6 POR       -1
         8 LIT        1
        10 BZE       21
        12 ADR       -1
        14 PPR      255
        16 PSH       -1
        18 PRC
        19 BRN        8
        21 HLT

Semantically this is almost equivalent to the program used in (j), but of course the control structure is rather different (and cleaner). Again, note the range-checking operations used at 4-6 and at 14.

(k) The version of Topsy that you have cunningly uses the PSH and POP opcodes where it can. Study the code generation process carefully and describe it briefly in your own words.

Essentially we want to be able to replace sequences like

           ADR -n            with      PSH -n
           VAL

and

           ADR -n            with      Evaluate value
           Evaluate value              POP -n
           STO

where possible. This optimization can only be performed if the address concerned is the address of a scalar - where array manipulations are concerned we have to retain the ADR instructions as part of the code that computes the array element's address at runtime and checks that this is in range.

In handling assignments the choice of whether to use a STO or a POP instruction is made in line 131 (C++ version) or lines 139 - 141 (Pascal version). This necessitates knowing the appropriate offset to use, which is why the Variable nonterminal has the entry parameter to retrieve this information from the Designator nonterminal, which is where the symbol table search takes place (line 153 (C++ version) or line 162 (Pascal version).

The Designator actions generate the ADR opcode only if it is needed as part of an array access, in line 167 (C++) or 176 (Pascal version). Of course we must also distinguish the situations where PSH can be used in place of VAL - this is done in the Factor parser in lines 330 - 331 (C++ version) or 347 - 349 (Pascal version).

Incidentally, the Designator parser embodies another little trick which is worth commenting on. If an identifier is not found in the symbol table, then this compiler goes ahead and makes an entry for that identifier anyway, the aim being to suppress a whole spate of "undeclared identifier" messages for that identifier from then on. This auto-entry is handled in 156 - 161 (C++ version) or lines 166 - 170 (Pascal version). The system looks ahead at the next token to guess whether it is an undeclared array or scalar identifier. The scheme used here is a bit messy, because it relies on knowing the names Coco/R will generate for tokens (line 158, C++ or line 167, Pascal). There must be a better way (there is; students found it a few years ago - see if you can do so too).

(l) Well, Topsy++ may be a nice little toy, but it has a number of annoying differences from C-plus-whatsit. For example:

(a) C++ has a do-while loop and Topsy++ currently has a do-until loop.

(b) C++ has a baroque precedence ordering for its operators (something like 19 levels, while Topsy++ currently has only about 4).

(c) C++ has that cute little break statement that allows you to leave a loop early.

The changes needed to TOPSY0.ATG to allow for the suggested changes to statement forms are very simple:

  Statement
  =    Block           | Assignment
     | IfStatement     | WhileStatement | ForStatement
     | DoStatement     | BreakStatement | EmptyStatement
     | ReturnStatement | ReadStatement  | WriteStatement .

  WhileStatement = "while" "(" Condition ")" Statement .
  DoStatement    = "do" Statement "while" "(" Condition ")" ";" .

  BreakStatement = "break" ";" .

The changes to Expression are more interesting. C++ has an unbelievable set of about 15 precedence levels (textbooks seem to differ on exactly how many, which you should find intriguing). If we restrict ourselves to the operators given in the problem, then a grammar follows below. Note that this allows expressions like

     7 + 5 * 71          (obviously)
     + - ! a             (totally unlike anything allowed in Clang, Pascal, Modula-2)

  Expression  = AndExp      { "||" AndExp } .
  AndExp      = EqualExp    { "&&" EqualExp }.
  EqualExp    = RelationExp { EqualOp RelationExp } .
  RelationExp = AddExp      { RelOp AddExp }.
  AddExp      = MultExp     { AddOp MultExp } .
  MultExp     = UnaryExp    { MulOp UnaryExp } .
  UnaryExp    = ( "+" | "-" | "!" ) UnaryExp | Factor .
  Factor      =   Designator | number | character | "true" | "false"
                | "(" Expression ")" .

with the other operators themselves defined as

  AddOp       = "+" | "-" .
  MulOp       = "*" | "/" | "%" .
  EqualOp     = "==" | "!=" .
  RelOp       = "<" | "<=" | ">" | ">=" .