Examiners: Time 3 hours Prof P.D. Terry Marks 180 Prof E.H. Blake Pages 7 (please check!)
Answer all questions. Answers may be written in any medium except red ink.
(For the benefit of future readers of this paper, various free information was made available to the students 24 hours before the formal examination. This included the full text of Section B. During the examination, candidates were given machine executable versions of the Coco/R compiler generator, access to a computer and machine readable copies of the questions.)
1. Draw a diagram clearly depicting the various phases found in a typical compiler. Indicate which phases belong to the "front end" and which to the "back end" of the compiler. [ 9 marks ]
This is essentially bookwork. The diagram in the text was as follows:
Source code | | | .- Character handler (non-portable) ----. | | | | | | | | | Analytic | .------- Lexical analyser ------------| phase | | (Scanner) | | | | | | | | | (Front end) | |-------- Syntax analyser ------------| | | (Parser) | | | | | | | | | | Table --+------- Constraint analyser -----------+-- Error `- handler | (Static semantic analyser) | reporter | | | .- | | | | |-- Intermediate code generator ------| | | | | | | | | | | | | Synthetic | `--------- Code optimizer ------------| phase | | | | | | | | | (Back end) | Code generator (non-portable) ------' | | | | | | `- Peephole optimizer (non-portable) | | Object code
2. (a) What is meant by the term "self-compiling compiler"? [ 2 marks ]
A self-compiling compiler is one written in the source language that it is intended to compile. When it compiles this source code it should, of course, reproduce itself.
(b) Describe (with the aid of T-diagrams) how you would perform a "half bootstrap" of a compiler for language X, given that you have access to the source and object versions of a compiler for X that can be executed on machine OLD and wish to produce a self-compiling compiler for language X that can be executed on machine NEW. [ 10 marks ]
This is essentially book work. The solution I was looking for is:
(a) Rewrite the back end of the X compiler to produce NEW code and compile this to give a cross-compiler
.--------------------------. .--------------------------. | XtoNEW.X | | XtoNEW.OLD | | X ------> NEW code | | X ------> NEW code | | | | | `-------. .--------------------------. .-------' | | XtoOLD.OLD | | | X | X ------> OLD-code | OLD-code | | | | | `------------------. .------------------' | | | OLD-code | | | `----------'
(b) Compile the source code of the modified compiler with the cross-compiler to produce the NEW compiler
.--------------------------. .--------------------------. | XtoNEW.X | | XtoNEW.OLD | | X ------> NEW code | | X -------> NEW code | | | | | `-------. .--------------------------. .-------' | | XtoNEW.OLD | | | X | X ------> NEW code | NEW code | | | | | `------------------. .------------------' | | | OLD-code | | | `----------'
3. Consider the following grammar expressed in EBNF for describing the progress of a typical university course:
Course = Introduction Section { Section } Conclusion . Introduction = "lecture" [ "handout" ] . Section = { "lecture" | "prac" "test" | "tut" | "handout" } "test" . Conclusion = [ "Panic" ] "Examination" .
(a) Develop a recursive descent parser for the grammar (matching the EBNF as given above). Assume that you have suitable Abort, Accept and GetSym routines (which you need not develop), and that GetSym decodes Sym as one of the tokens below. Your system should detect errors, of course, but need not incorporate "error recovery". [ 12 marks ]
{ EOFSym, lectSym, hanSome, pracSym, testSym, tutSym, PANICSYM, examSym, unknownSym }
A satisfactory solution would look something like
void Course (void) { // Course = Introduction Section { Section } Conclusion . Introduction(); Section(); while (Sym == lectureSym || Sym == pracSym || Sym == tutSym || Sym == hanSome) Section(); Conclusion(); } void Introduction (void) { // Introduction = "lecture" [ "handout" ] . Accept(lectSym, "lecture expected"); if (Sym == HanSome) GetSym(); } void Section (void) { // Section = { "lecture" | "prac" "test" | "tut" | "handout" } "test" . while (Sym == lectureSym || Sym == pracSym || Sym == tutSym || Sym == hanSome) { switch (Sym) { case lectSym : GetSym(); break; case pracSym : GetSym(); Accept(testSym, "test expected"); break; case tutSym : GetSym(); break; case hanSome : GetSym(); break; } } Accept(testSym, "final test expected"); } void Conclusion (void) { // Conclusion = [ "Panic" ] "Examination" . if (Sym = PANICSYM) GetSym(); Accept(examSym, "Exam Expected"); }
(b) What do you understand by the statement "two grammars are equivalent"? [ 2 marks ]
Equivalent grammars produce the same set of sentences, using different sets of productions.
(c) Rewrite these productions so as to produce an equivalent grammar in which no use is made of the EBNF meta-brackets { ... } or [ ... ]. [ 5 marks ]
Course = Introduction Section MoreSections Conclusion .
MoreSections = Section MoreSections | e .
Introduction = "lecture" OptionalHandout .
OptionalHandout = "handout" | e .
Section = Components "test" .
Components = Component Components | e .
Component = "lecture" | "prac" "test" | "tut" | "handout" .
Conclusion = OptionalPanic "Examination"
OptionalPanic = "Panic" | e .
(d) Analyse the equivalent grammar derived in (c) to determine whether it obeys the LL(1) constraints. [ 8 marks ]
The grammar is non-LL(1). OptionalHandout is nullable, and
FIRST(OptionalHandout) = { "handout" } FOLLOW(OptionalHandout) = { "handout", "lecture", "prac", "tut", "test" }
In other respects it is properly LL(1), but candidates should have been able to justify that in some way.
(e) If you found that the grammar did not obey the LL(1) constraints, does that mean that the parser produced in (a) would fail for some valid inputs? If so, give an example of input that could not be parsed; otherwise justify your claim that it would always succeed. [ 4 marks ]
The parser would work. The situation is completely analogous to the "dangling else" problem - if a handout follows the opening lecture it is bound to Introduction, rather than forming part of a subsequent Section.
4. The CS301 language for which a compiler was developed in this course allows for various statements, including a "while" loop. Relevant parts of the attributed grammar are shown below.
StatementSequence = Statement { WEAK ";" Statement } . Statement = SYNC [ Assignment | IfStatement | WhileStatement | ReadStatement | WriteStatement | ReturnStatement | CaseStatement ] . WhileStatement = (. CGEN_labels startloop, testlabel, dummylabel; .) "WHILE" (. CGen->storelabel(startloop); .) Condition "DO" (. CGen->jumponfalse(testlabel, CGen->undefined); .) StatementSequence (. CGen->jump(dummylabel, startloop); CGen->backpatch(testlabel); .) "END" .
An enthusiastic language extender has suggested that CS301 would be greatly improved by the addition of a post-test loop, and has come up with two possibilities:
PostTestStatement = "DO" StatementSequence "WHILE" Condition .
or
PostTestStatement = "DO" StatementSequence "UNTIL" Condition .
(a) Advise her, with reasons, as to whether or not either or both of these suggestions would be acceptable, and which (if either) would be preferable. [ 5 marks ]
PostTestStatement = "DO" StatementSequence "WHILE" Condition .
This won't work. A WhileStatement starts with WHILE, so one would not be able to distinguish a WhileStatement in the StatementSequence from the terminating WHILE of the PostTestStatement.
PostTestStatement = "DO" StatementSequence "UNTIL" Condition .
This of course works.
(b) For the form of your choice, show how the grammar above would be extended to recognise the statement form and generate correct code. [ 5 marks ]
Statement
= SYNC [ Assignment | PostTestStatement
| IfStatement | WhileStatement
| ReadStatement | WriteStatement
| ReturnStatement | CaseStatement
] .
PostTestStatement
= (. CGEN_labels startloop, dummylabel; .)
"DO" (. CGen->storelabel(startloop); .)
StatementSequence
"UNTIL" Condition (. CGen->jumponfalse(dummylabel, startloop); .) .
5. As you should recall, CS301 is a "strictly typed" language, and expressions of the form
NOT 6
or
TRUE > 56
or
3 + 4 AND 5
are unacceptable. Most strictly typed languages allow for programmers to circumvent (bypass) these restrictions, typically by allowing so-called "type casting", as exemplified by
NOT BOOL(6)
or
INT(TRUE) > 56
or
3 + INT(BOOL(4) AND BOOL(5))
(a) In the free information for this paper appears an attributed grammar for generating code to evaluate expressions which does not incorporate such type casting. Show how the grammar could be altered to do so. (Only write out those parts that would have to change.) [ 6 marks ]
This is trivial if you understand the system! We need a simple modification to Factor
Factor<TABLE_types &f> = (. int value; TABLE_entries entry; .) Designator<classset(TABLE_consts, TABLE_vars), entry> (. f = entry.type; switch (entry.idclass) { case TABLE_vars : CGen->dereference(); break; case TABLE_consts : CGen->stackconstant(entry.c.value); break; } .) | Number<value> (. CGen->stackconstant(value); f = TABLE_ints; .) | "TRUE" (. CGen->stackconstant(1); f = TABLE_bools .) | "FALSE" (. CGen->stackconstant(0); f = TABLE_bools .) | "INT" "(" Expression<f> (. if (f != TABLE_bools) SemError(218); .) ")" (. f = TABLE_ints; .) | "BOOL" "(" Expression<f> (. if (f != TABLE_ints) SemError(218); .) ")" (. f = TABLE_bools; .) | "(" Expression<f> ")" .
(b) If strict type checking can be bypassed in this way, what advantages or disadvantages do strictly typed languages possess over languages like C++, where Boolean, integer and character types are all compatible? [ 4 marks ]
The advantage is that one gets some measure of protection against grossly stupid programming errors - one has to make a conscious effort to mix types.
6. As you should be aware, IP addresses as used in Internet communication are typically expressed in "dotted decimal" form, as exemplified by 146.231.128.6. The IP address actually represents a 32 bit integer; the four numbers in the quadruple corresponding to successive 8 bit components of this integer. For humans, machine addressing is usually more memorable when expressed in "DNS" format, as exemplified by terrapin.ru.ac.za. Some systems maintain tables of matching addresses, for example
146.231.122.13 cspt1.ict.ru.ac.za #comments appear like this 146.231.128.6 terrapin.ru.ac.za 146.231.56.10 thistle-sp.ru.ac.za 147.28.0.62 psg.com
When we moved our CS and IS departments to new premises recently, a decision was made to rename and uniquely renumber all the many machines in our possession. Our system administrators tried to draw up a table like the one above, which was then merged with the existing table in the IT division. Unfortunately, a few mistakes were made, which caused havoc until they were ironed out. For example, there were lines reading
146.231.122.11235 cspt1.ict.ru.ac.za #invalid IP address 146.231.122.15 cspt2.ict.ru.ac.za 146.231.122.15 cspt3.ict.ru.ac.za # non-unique IP address
Complete the ATG file below to show how Coco/R could be used to develop a system that would enable a file in this format quickly to be checked and the errors identified. (Hint: make use of the template list handling class that proved useful in various other applications in this course, the interface to which is provided in the free information for this paper). [ 24 marks ]
This is a longish example (hence the high mark allocation). But there is nothing very difficult in it, as the solution will show. The students in this class had written many Cocol grammars that do this sort of token definition and recognition, combined with the use of a list handler template. In fact this example is simple enough that the IPList object could have been declared "globally" which would allow one to eliminate some parameter passing.
COMPILER Check $XCN #include "misc.h" #include "genlist.h" typedef List<long> IPList; IGNORE CASE IGNORE CHR(1) .. CHR(31) CHARACTERS digit = "0123456789" . letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" . eol = CHR(10) . COMMENTS FROM "#" TO eol TOKENS number = digit { digit } . name = letter { letter | digit | "." | "-" } . PRODUCTIONS Check = (. IPList L; .) { Entry<L> } (. if (Successful()) printf("\n all correct"); .) EOF . Entry<IPList &L> = IPNumber<L> name . IPNumber<IPList &L> = (. long n, m; .) Number<n> "." Number<m> (. n = 256 * n + m; .) "." Number<m> (. n = 256 * n + m; .) "." Number<m> (. n = 256 * n + m; if (L.isMember(n)) SemError(1001); else L.add(n); .) . Number<long &n> = (. char str[100] .) number (. LexString(str, sizeof(str) - 1); n = atoi(str); if (n > 255) SemError(1000); .) . END Check.
/* Simple generic linked list class. Only the interface is shown here George Wells -- 27 February 1996 */ template<class T> class List { public: List (void); // Constructor List (const List& lst); // Copy constructor ~List (void); // List destructor void add (T item, int position = INT_MAX); // Place new item in a List void remove (int position, T &item); // Remove item at position in List int length (void); // Return number of elements in List T& operator[] (int index); // Subscript a List int position (T item); // Return position item in a List (or -1) int isMember (T item); // True if item is in List }; // class List
Expression<TABLE_types &e> = (. TABLE_types a; CGEN_labels shortcircuit; .) AndExp<e> { "OR" (. if (shortBoolean) CGen->booleanop(shortcircuit, CGEN_opor) .) AndExp<a> (. if (!(booltypes.memb(e) && booltypes.memb(a))) { SemError(218); e = TABLE_none; } else e = TABLE_bools; if (shortBoolean) CGen->backpatch(shortcircuit); else CGen->binarybooleanop(CGEN_orrop); .) } . AndExp<TABLE_types &a> = (. TABLE_types e; CGEN_labels shortcircuit; .) RelExp<a> { "AND" (. if (shortBoolean) CGen->booleanop(shortcircuit, CGEN_opand) .) RelExp<e> (. if (!(booltypes.memb(a) && booltypes.memb(e))) { SemError(218); a = TABLE_none; } else a = TABLE_bools; if (shortBoolean) CGen->backpatch(shortcircuit); else CGen->binarybooleanop(CGEN_andop); .) } . RelExp<TABLE_types &r> = (. TABLE_types a; CGEN_operators op; .) AddExp<r> [ RelOp<op> AddExp<a> (. if (r == TABLE_bools || a == TABLE_bools) SemError(218); r = TABLE_bools; CGen->comparison(op) .) ] . AddExp<TABLE_types &a> = (. TABLE_types m; CGEN_operators op; .) MultExp<a> { AddOp<op> MultExp<m> (. if (!(arithtypes.memb(a) && arithtypes.memb(m))) { SemError(218); a = TABLE_none; } else CGen ->binaryintegerop(op); .) } . MultExp<TABLE_types &m> = (. TABLE_types u; CGEN_operators op; .) UnaryExp<m> { MulOp<op> UnaryExp<u> (. if (!(arithtypes.memb(m) && arithtypes.memb(u))) { SemError(218); m = TABLE_none; } else CGen ->binaryintegerop(op); .) } . UnaryExp<TABLE_types &u> = Factor<u> | "+" UnaryExp<u> (. if (!arithtypes.memb(u)) { SemError(218); u = TABLE_none; } .) | "-" UnaryExp<u> (. if (!arithtypes.memb(u)) { SemError(218); u = TABLE_none; } else CGen->negateinteger(); .) | "NOT" UnaryExp<u> (. if (!booltypes.memb(u)) SemError(218); else CGen->negateboolean(); u = TABLE_bools; .) . Factor<TABLE_types &f> = (. int value; TABLE_entries entry; .) Designator<classset(TABLE_consts, TABLE_vars), entry> (. f = entry.type; switch (entry.idclass) { case TABLE_vars : CGen->dereference(); break; case TABLE_consts : CGen->stackconstant(entry.c.value); break; } .) | Number<value> (. CGen->stackconstant(value); f = TABLE_ints; .) | "TRUE" (. CGen->stackconstant(1); f = TABLE_bools .) | "FALSE" (. CGen->stackconstant(0); f = TABLE_bools .) | "(" Expression<f> ")" .
Please note that there is no obligation to produce a machine readable solution for this section. Coco/R and other files are provided so that you can enhance, refine, or test your solution if you desire. If you choose to produce a machine readable solution, you should create a working directory, unpack EXAM.ZIP, modify any files that you like, and then copy all the files back to the blank diskette that will be provided.
For several years your predecessors in this course - and even yourselves - have been expected, as part of the practical course, to gain an understanding of a stack machine architecture by preparing programs written in a very limited form of assembler language in which all addressing had to be done in terms of numerical values which students were supposed to calculate for themselves - with understandable frustration setting in every time they inserted or deleted a few statements into programs as they debugged them. During this time students have begged me to give them a "real" assembler in which alphanumeric labels could be used to identify constants, variables, and the destinations of branch instructions. I have, of course, always been too busy to do this, but with 24 hours at your disposal and the expert knowledge you have amassed after studying the translators course this year, you should be able to remedy this situation - and if you succeed you may be able to make some useful pocket money selling your system to the class next year!
We start by observing that, rather than writing code as exemplified by the columns on the left, most people would prefer to write code as exemplified by the columns on the right
ASSEM $D+ | ASSEM $D+ # Read a list and write it backwards | CONST # High level declarations | Max = 10; # constants | Width = 6; | INT # variables BEGIN | List[Max], I, Item; DSP 13 | BEGIN # DSP 13 can be generated automatically ADR -12 | ADR I # I := 0; LIT 0 | LIT 0 # STO | STO # ADR -13 | READ ADR Item # LOOP INN | INN # Read(Item); ADR -13 | ADR Item # VAL | VAL # BZE 32 | BZE DONE # IF Item = 0 THEN EXIT END; ADR -1 | ADR List # ADR -12 | ADR I # VAL | VAL # LIT 11 | LIT SIZE(List) # IND | IND # ADR -13 | ADR Item # VAL | VAL # STO | STO # List[I] := Item; ADR -12 | ADR I # I++; PPP | PPP # BRN 7 | BRN READ # END; PRS 'Reversed' | DONE PRS 'Reversed' # Write('Reversed'); ADR -12 | PRINT ADR I # WHILE I > 0 DO VAL | VAL # LIT 0 | LIT 0 # GTR | GTR # BZE 59 | BZE EXIT # ADR -12 | ADR I # I--; MMM | MMM # ADR -1 | ADR List # ADR -12 | ADR I # VAL | VAL # LIT 11 | LIT Max + 1 # IND | IND # VAL | VAL # LIT 6 | LIT Width # Write(List[I] : 6); PRN | PRN # BRN 34 | BRN PRINT # END HLT | EXIT HLT # RETURN END. | END.
Since it might be dangerous to place too much reliance on what would be required of an assembler, or indeed determine exactly what is permitted in the assembler language itself from studying this one single example, here are some suggestions for deriving a complete system. (In the exam kit will be found some other example programs to assist in your development of the assembler.)
(a) You can make use of Coco/R, and in particular derive a solution by making use of the attributed grammar and support modules (symbol table handler, code generator, error handler, frame files etc) that were useful in the development of a CS301 compiler/interpreter.
(b) The assembler statements should appear between a bracketing BEGIN and END, and may optionally be preceded by declarations of constants and variables (like Max, Width, List, I and Item) using similar syntax to that found in CS301 programs.
(c) The assembler system should be able to assemble simple programs in which the addressing is all given in absolute form (as in the example on the left), as well as those with alphanumeric names and labels.
(d) Treat the mnemonics as key (reserved) words. Since AND and NOT are mnemonic opcodes, use !, && and || for Boolean operators.
(e) Alphanumeric labels (like READ, PRINT, DONE and EXIT) used as the targets of branch instructions must be uniquely defined. For simplicity, these labels should not be allowed to duplicate identifiers used in the declaration of named constants or variables.
(f) It is acceptable to define labels without ever having branch instructions that referred to them, to have multiple labels defined at one point, or to have multiple branches to one point, for example
BRN START # Unnecessary, but legal START LOOP LIT 6 LIT 7 PRN BRN START # equivalent to BRN LOOP
(g) It would not be acceptable to have branch instructions refer to labels that are never defined, for example
BEGIN LOOP LIT 6 LIT 7 PRN BRN START # Start is undefined END
(h) The LIT and DSP mnemonics should be allowed to take a constant-generating expression as a parameter:
DSP 6 # Absolute form LIT Max # Equivalent to LIT 10 LIT Max * 10 + Width # Equivalent to LIT 106 LIT Size(Array) # Equivalent to LIT 11
where Size is a pseudo function that can return the storage space needed for the variable quoted as its actual argument (this would clearly be useful in applications that use arrays in particular).
(i) The ADR mnemonic should be allowed to take a (possibly signed) number or a variable name as its parameter. In the case where this name refers to an array a possible extension would be to allow it to have a constant subscript indicating a further offset that could be computed at assemble time, for example:
ADR -1 # absolute addressing ADR Item # equivalent to ADR -13 ADR List # equivalent to ADR -1 ADR List[0] # equivalent to ADR -1 ADR List[2] # equivalent to ADR -3
(j) Not much attention need be paid to type checking - at this level programmers should be relied on to get these semantics correct for themselves.
(k) Apart from situations where they are necessary for separating other alphanumeric quantities, whitespace characters may be used at the coder's discretion to improve the appearance of source code.
(l) In the extended compiler for CS301 you may have made use of additional opcodes to the ones listed below, in particular to handle switch/case statements. For the purposes of this examination you may confine your assembler to the opcodes in the table on page 7.
Several of these operations belong to a category known as zero address instructions. Even though operands are clearly needed for operations such as addition and multiplication, the addresses of these are not specified by part of the instruction, but are implicitly derived from the value of the stack pointer SP. The two operands are assumed to reside on the top of the stack and just below the top; in our informal descriptions their values are denoted by TOS (for "top of stack") and SOS (for "second on stack"). A binary operation is performed by popping its two operands from the stack into (inaccessible) internal registers in the CPU, performing the operation, and then pushing the result back onto the stack.
AND Pop TOS and SOS, and SOS with TOS, push result to form new TOS ORR Pop TOS and SOS, or SOS with TOS, push result to form new TOS ADD Pop TOS and SOS, add SOS to TOS, push sum to form new TOS SUB Pop TOS and SOS, subtract TOS from SOS, push difference to form new TOS MUL Pop TOS and SOS, multiply SOS by TOS, push product to form new TOS DVD Pop TOS and SOS, divide SOS by TOS, push quotient to form new TOS REM Pop TOS and SOS, divide SOS by TOS, push remainder to form new TOS EQL Pop TOS and SOS, push 1 to form new TOS if SOS = TOS, 0 otherwise NEQ Pop TOS and SOS, push 1 to form new TOS if SOS ¹ TOS, 0 otherwise GTR Pop TOS and SOS, push 1 to form new TOS if SOS > TOS, 0 otherwise LSS Pop TOS and SOS, push 1 to form new TOS if SOS < TOS, 0 otherwise LEQ Pop TOS and SOS, push 1 to form new TOS if SOS £ TOS, 0 otherwise GEQ Pop TOS and SOS, push 1 to form new TOS if SOS ³ TOS, 0 otherwise NEG Integer negation of TOS NOT Boolean negation of TOS STK Dump stack to output (useful for debugging) PRN Pop TOS and SOS, write SOS to output as an integer value in field width TOS PRB Pop TOS and SOS, write SOS to output as a Boolean value in field width TOS PRS A Write the nul-terminated string that is assumed to be stacked in the literal pool from Mem[A] NLN Write a newline (carriage-return-line-feed) sequence INN Read integer value, pop TOS, store the value that was read in Mem[TOS] INB Read Boolean value, pop TOS, store the value that was read in Mem[TOS] DSP A Decrement value of stack pointer SP by A LIT A Push the integer value A onto the stack to form new TOS ADR A Push the value BP + A onto the stack to form new TOS. (This value is conceptually the address of a variable stored at an offset A within the stack frame pointed to by the base register BP.) IND (Range checked indexing of array) Pop TOS to yield Size; pop TOS and SOS; if 0 £ TOS < Size then subtract TOS from SOS, push result to form new TOS INX (Unchecked indexing of array) Pop TOS and SOS, subtract TOS from SOS, push result to form new TOS VAL (Dereferencing) Pop TOS, and push the value of Mem[TOS] to form new TOS DUP Push TOS to form duplicate copy STO Pop TOS and SOS; store TOS in Mem[SOS] PPP Pop TOS and increment Mem[TOS] by 1 MMM Pop TOS and decrement Mem[TOS] by 1 HLT Halt BRN A Unconditional branch to instruction A BZE A Pop TOS, and branch to instruction A if TOS is zero BAN A Branch to instruction A if TOS is false; else pop TOS BOR A Branch to instruction A if TOS is true; else pop TOS NOP No operation
The instructions in the first group are concerned with arithmetic and logical operations, those in the second group afford I/O facilities, those in the third group allow for the access of data in memory by means of manipulating addresses and the stack, and those in the last group allow for control of flow of the program itself.
COMPILER ASSEM $XCN /* Stack assembler level 1 grammar - this version for csC 301 exam solution 2001 P.D. Terry, Rhodes University, 2001 */ /* This version is just a simple "absolute" assembler - stage 1 */ #include "misc.h" #include "set.h" #include "cgen.h" #include "table.h" #include "report.h" bool debug; extern TABLE *Table; extern CGEN *CGen; extern REPORT *Report; // needed for debugging pragma /*--------------------------------------------------------------------------*/ IGNORE CASE IGNORE CHR(9) .. CHR(13) PRAGMAS DebugOn = "$D+" . (. Report->debugging = true; .) DebugOff = "$D-" . (. Report->debugging = false; .) CHARACTERS cr = CHR(13) . lf = CHR(10) . letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" . digit = "0123456789" . instring = ANY - "'" - cr - lf - CHR(0) . COMMENTS FROM "{" TO "}" COMMENTS FROM "#" TO lf /* ========= addition to original */ TOKENS identifier = letter { letter | digit } . number = digit { digit } . string = "'" (instring | "''") { instring | "''" } "'" . PRODUCTIONS ASSEM = "ASSEM" Block "." . Block = "BEGIN" StatementSequence "END" . StatementSequence = { Statement } . Statement = ( OneWord | TwoWord | WriteString ) . OneWord = ( "ADD" | "AND" | "DUP" | "DVD" | "EQL" | "GEQ" | "GTR" | "HLT" | "INB" | "IND" | "INN" | "INX" | "LEQ" | "LSS" | "MMM" | "MUL" | "NEG" | "NEQ" | "NLN" | "NOP" | "NOT" | "ORR" | "PPP" | "PRB" | "PRN" | "REM" | "STK" | "STO" | "SUB" | "VAL" ) (. char mnemonic[4]; LexName(mnemonic, sizeof(mnemonic) - 1); CGen->oneword(mnemonic); .) . TwoWord = ( "DSP" | "LIT" | "BRN" | "BZE" | "BAN" | "BOR" | "ADR" ) (. char mnemonic[4]; int value; LexName(mnemonic, sizeof(mnemonic) - 1); .) SignedNumber<value> (. CGen->twoword(mnemonic, value); .) . WriteString = (. char str[600]; CGEN_labels startstring; .) "PRS" String<str> (. CGen->stackstring(str, startstring); CGen->writestring(startstring); .) . SignedNumber<int &value> = (. int sign = 1 .) [ "+" | "-" (. sign = -1; .) ] Number<value> (. value = sign * value; .) . /* the next productions are straight out of the original compiler */ String<char *str> = string (. char local[100]; LexString(local, sizeof(local) - 1); int i = 0; while (local[i]) /* strip quotes */ { local[i] = local[i+1]; i++; } local[i-2] = '\0'; i = 0; while (local[i]) /* find internal quotes */ { if (local[i] == '\'') /* single quote */ { int j = i; while (local[j]) { local[j] = local[j+1]; j++; } } i++; } strcpy(str, local); .) . Number <int &num> = number (. char str[100]; int i = 0, l, digit, overflow = 0; num = 0; LexString(str, sizeof(str) - 1); l = strlen(str); while (i <= l && isdigit(str[i])) { digit = str[i] - '0'; i++; if (num <= (maxint - digit) / 10) num = 10 * num + digit; else overflow = 1; } if (overflow) SemError(200); .) . END ASSEM.
(b) The definition of constants and variables and the autogeneration of DSP can be lifted verbatim out of the CS301 compiler.
(c) The LIT/DSP opcodes and their constant expression address fields can simply use the constant expression parser productions and actions straight out of the CS301 compiler. The only extra is the ability to add "size(array)" which is requires a simple symbol table lookup and about three lines of code similar to other look ups students should be familiar with from the CS301 compiler.
(d) The ADR opcode and its variable name parameter similarly requires a symbol table look up. It makes sense to hack the Designator production and actions from the original CS301 compiler to do this.
(e) The only real challenge lies in the labels for branch instructions. The obvious way to do this is to have a new "class" of identifier in the symbol table, and to set the info in the symbol table when the labels are declared. But if forward references are found the entries will have to be made ahead of time and a way found to fix up the forward references when the label is finally declared. This is actually closely analogous with similar fixups students should have learned to do in a practical where they had to deal with multiple forward references as "break" points in a switch statement.
(f) One non-obvious point is that it makes sense not to define a labelled "statement" in th obvious way
Statement = [ Label ] ( OneWord | Branch | DSPorLIT | LValue | WriteString ) .
but instead to treat a Label as a sort of "empty" statement.
Statement = ( Label | OneWord | Branch | DSPorLIT | LValue | WriteString ) .
This allows one conceptually to attach multiple labels to a single code-generating statement. Other attempts to do this lead too easily to non-LL(1) grammars.
A complete solution turns out to rely on only 150 lines or so of new code at the most, and a well prepared student should have been able to extract the essential point in preparing an answer that could be written down under exam conditions very quickly.
The basic grammar looks like this:
/* this first part is all straight out of the extant compiler */ COMPILER ASSEM /* Stack assembler level 1 grammar - this version for csC 301 exam solution 2001 P.D. Terry, Rhodes University, 2001 */ IGNORE CASE IGNORE CHR(9) .. CHR(13) CHARACTERS cr = CHR(13) . lf = CHR(10) . letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" . digit = "0123456789" . instring = ANY - "'" - cr - lf - CHR(0) . COMMENTS FROM "{" TO "}" COMMENTS FROM "#" TO lf TOKENS identifier = letter { letter | digit } . number = digit { digit } . string = "'" (instring | "''") { instring | "''" } "'" . PRODUCTIONS ASSEM = "ASSEM" Block "." . Block = { ConstDeclarations | VarDeclarations } "BEGIN" StatementSequence "END" . ConstDeclarations = "CONST" OneConst { OneConst } . OneConst = Ident "=" CExpression ";" . VarDeclarations = ( "INT" | "BOOL" ) OneVar { "," OneVar } ";" . OneVar = Ident [ UpperBound ] . UpperBound = "[" CExpression "]" . CExpression = CAndExp { "||" CAndExp } . CAndExp = CRelExp { "&&" CRelExp } . CRelExp = CAddExp [ RelOp CAddExp ] . CAddExp = CMultExp { AddOp CMultExp } . CMultExp = CUnaryExp { MulOp CUnaryExp } . CUnaryExp = CFactor | "+" CUnaryExp | "-" CUnaryExp | "!" CUnaryExp . CFactor = Ident | Number | "TRUE" | "FALSE" | "(" CExpression ")" . AddOp = "+" | "-" . MulOp = "*" | "%" | "/" . RelOp = "=" | "<>" | "<" | "<=" | ">" | ">=" . Ident = identifier . String = string . Number = number . /* this next part is effectively new */ StatementSequence = { Statement } . Statement = ( Label | OneWord | Branch | DSPorLIT | LValue | WriteString ) . Label = Ident . OneWord = "ADD" | "AND" | "DUP" | "DVD" | "EQL" | "GEQ" | "GTR" | "HLT" | "INB" | "IND" | "INN" | "INX" | "LEQ" | "LSS" | "MMM" | "MUL" | "NEG" | "NEQ" | "NLN" | "NOP" | "NOT" | "ORR" | "PPP" | "PRB" | "PRN" | "REM" | "STK" | "STO" | "SUB" | "VAL" . DSPorLIT = ( "DSP" | "LIT" ) CExpression . WriteString = "PRS" String . Branch = ( "BRN" | "BZE" | "BAN" | "BOR" ) ( Number | Target ) . Target = Ident . LValue = "ADR" ( SignedNumber | Designator ) . SignedNumber = [ "+" | "-" ] Number . Designator = Ident [ "[" CExpression "]" ] . END ASSEM.
In more detail - here are suggested additions to the code generator. This is slight overkill, because these routines could be replaced with simple calls to CGen->emit in the few places in the attributed grammar where they appear, and the error checking is really overcautious:
void CGEN::twoword(char *mnemonic, int adr) { STKMC_opcodes opcode = Machine->opcode(mnemonic); if (opcode <= STKMC_prs) { emit(int(opcode)); emit(adr); } else Report->error(231); } void CGEN::oneword(char *mnemonic) { STKMC_opcodes opcode = Machine->opcode(mnemonic); if (opcode > STKMC_prs && opcode != STKMC_nul) emit(int(opcode)); else Report->error(232); }
Here are the additions needed to the table handler - we need an option in the union for recording labels, which are characterised by their object code addresses and whether they have been declared or not. The address field here is used in a clever way later on to help set up a back-patch list for the situations where a label is referenced before it has been declared/defined.
enum TABLE_idclasses { TABLE_consts, TABLE_vars, TABLE_progs, TABLE_labels }; struct TABLE_entries { TABLE_alfa name; // identifier TABLE_idclasses idclass; // class TABLE_types type; int self; union { struct { int value; } c; // constants struct { int size, offset; bool scalar; } v; // variables struct { CGEN_labels address; bool defined; } l; // labels }; };
Here are the additions needed to the compiler frame file - a few more error messages need to be defined:
char *cs301Error::GetUserErrorMsg(int n) { switch (n) { ... case 229: return "Undefined labels"; case 230: return "Redefined labels"; case 231: return "Opcode requires addressfield"; case 232: return "Opcode may not have addressfield"; default: return "Compiler error"; } }
And here is the complete attribute grammar:
COMPILER ASSEM $CNX /* Stack assembler level 1 grammar - this version for csC 301 exam solution 2001 P.D. Terry, Rhodes University, 2001 */ /* The first part of this is effectively identical to the same grammar used in the CS301 compiler. A few lines have been deleted to tidy things up, but they could have remained in place with no difficulty */ #include "misc.h" #include "set.h" #include "cgen.h" #include "table.h" #include "report.h" typedef Set<7> classset; typedef Set<4> typeset; bool debug; extern TABLE *Table; extern CGEN *CGen; extern REPORT *Report; // needed for debugging pragma typeset arithtypes, booltypes; /*--------------------------------------------------------------------------*/ IGNORE CASE IGNORE CHR(9) .. CHR(13) PRAGMAS DebugOn = "$D+" . (. Report->debugging = true; .) DebugOff = "$D-" . (. Report->debugging = false; .) CHARACTERS cr = CHR(13) . lf = CHR(10) . letter = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" . digit = "0123456789" . instring = ANY - "'" - cr - lf - CHR(0) . COMMENTS FROM "{" TO "}" COMMENTS FROM "#" TO lf /* ========= addition to original */ TOKENS identifier = letter { letter | digit } . number = digit { digit } . string = "'" (instring | "''") { instring | "''" } "'" . PRODUCTIONS Block = (. int framesize = 0; .) SYNC { ( ConstDeclarations | VarDeclarations<framesize> ) SYNC } (. /* reserve space for variables */ CGen->openstackframe(framesize); .) "BEGIN" StatementSequence (. /* ====== CGen->leaveprogram(); can be deleted */ if (Report->debugging) Table->printtable(stdout); .) "END" . ConstDeclarations = "CONST" OneConst { OneConst } . OneConst = (. TABLE_entries entry; .) Ident<entry.name> (. entry.idclass = TABLE_consts; .) WEAK "=" CExpression<entry.type, entry.c.value> ";" (. Table->enter(entry); .) . VarDeclarations<int &framesize> = (. TABLE_types t; .) ( "INT" (. t = TABLE_ints; .) | "BOOL" (. t = TABLE_bools; .) ) OneVar<framesize, t> { WEAK "," OneVar<framesize, t> } ";" . OneVar<int &framesize, TABLE_types t> = (. TABLE_entries entry; .) (. entry.idclass = TABLE_vars; entry.v.size = 1; entry.v.scalar = true; entry.type = t; entry.v.offset = framesize + 1; .) Ident<entry.name> [ UpperBound<entry.v.size> (. entry.v.scalar = false; .) ] (. Table->enter(entry); framesize += entry.v.size; .) . UpperBound<int &size> = (. TABLE_types type; .) "[" CExpression<type, size> (. if (!arithtypes.memb(type)) SemError(226); size++; .) "]" . CExpression<TABLE_types &e, int &value> = (. TABLE_types a; int aval; .) CAndExp<e, value> { "||" CAndExp<a, aval> (. if (!(booltypes.memb(e) && booltypes.memb(a))) { SemError(218); e = TABLE_none; } else e = TABLE_bools; value = value || aval; .) } . CAndExp<TABLE_types &a, int &value> = (. TABLE_types e; int eval; .) CRelExp<a, value> { "&&" CRelExp<e, eval> (. if (!(booltypes.memb(a) && booltypes.memb(e))) { SemError(218); a = TABLE_none; } else a = TABLE_bools; value = value && eval; .) } . CRelExp<TABLE_types &r, int &value> = (. TABLE_types a; int aval; CGEN_operators op; .) CAddExp<r, value> [ RelOp<op> CAddExp<a, aval> (. if (r == TABLE_bools || a == TABLE_bools) SemError(218); r = TABLE_bools; switch (op) { case CGEN_opeql : value = value == aval; break; case CGEN_opneq : value = value != aval; break; case CGEN_oplss : value = value < aval; break; case CGEN_opleq : value = value <= aval; break; case CGEN_opgtr : value = value > aval; break; case CGEN_opgeq : value = value >= aval; break; default : break; } .) ] . CAddExp<TABLE_types &a, int &value> = (. TABLE_types m; int mval; CGEN_operators op; .) CMultExp<a, value> { AddOp<op> CMultExp<m, mval> (. if (!(arithtypes.memb(a) && arithtypes.memb(m))) { SemError(218); a = TABLE_none; } else switch (op) { case CGEN_opadd : value += mval; break; case CGEN_opsub : value -= mval; break; default : break; } .) } . CMultExp<TABLE_types &m, int &value> = (. TABLE_types u; int uval; CGEN_operators op; .) CUnaryExp<m, value> { MulOp<op> CUnaryExp<u, uval> (. if (!(arithtypes.memb(m) && arithtypes.memb(u))) { SemError(218); m = TABLE_none; } else switch (op) { case CGEN_opmul : value *= uval; break; case CGEN_opmod : if (uval != 0) value %= uval; else SemError(224); break; case CGEN_opdvd : if (uval != 0) value /= uval; else SemError(224); break; default : break; } .) } . CUnaryExp<TABLE_types &u, int &value> = CFactor<u, value> | "+" CUnaryExp<u, value> (. if (!arithtypes.memb(u)) { SemError(218); u = TABLE_none; } .) | "-" CUnaryExp<u, value> (. if (!arithtypes.memb(u)) { SemError(218); u = TABLE_none; } value = - value; .) | "!" CUnaryExp<u, value> (. if (!booltypes.memb(u)) SemError(218); value = ! value; u = TABLE_bools; .) . CFactor<TABLE_types &f, int &value> = (. TABLE_entries entry; TABLE_alfa name; bool found; value = 1; // play safe .) Ident<name> (. Table->search(name, entry, found); if (!found) { SemError(202); /* force declaration */ strcpy(entry.name, name); entry.type = TABLE_none; entry.idclass = TABLE_consts; entry.c.value = 1; Table->enter(entry); } if (entry.idclass == TABLE_consts) value = entry.c.value; else SemError(206); f = entry.type; .) | Number<value> (. f = TABLE_ints; .) | "TRUE" (. f = TABLE_bools; value = 1; .) | "FALSE" (. f = TABLE_bools; value = 0; .) | "SIZE" "(" /* ========== this small addition needed == */ Ident<name> (. Table->search(name, entry, found); if (!found) { SemError(202); /* force declaration */ strcpy(entry.name, name); entry.type = TABLE_none; entry.idclass = TABLE_vars; entry.v.scalar = true; entry.v.size = 1; entry.v.offset = 0; } if (entry.idclass == TABLE_vars) value = entry.v.size; else SemError(206); f = TABLE_ints; .) ")" | "(" CExpression<f, value> ")" . AddOp<CGEN_operators &op> = (. op = CGEN_opnop; .) ( "+" (. op = CGEN_opadd; .) | "-" (. op = CGEN_opsub; .) ) . MulOp<CGEN_operators &op> = (. op = CGEN_opnop; .) ( "*" (. op = CGEN_opmul; .) | "%" (. op = CGEN_opmod; .) | "/" (. op = CGEN_opdvd; .) ) . RelOp<CGEN_operators &op> = (. op = CGEN_opnop; .) ( "=" (. op = CGEN_opeql; .) | "<>" (. op = CGEN_opneq; .) | "<" (. op = CGEN_oplss; .) | "<=" (. op = CGEN_opleq; .) | ">" (. op = CGEN_opgtr; .) | ">=" (. op = CGEN_opgeq; .) ) . Ident<char *name> = identifier (. LexName(name, TABLE_alfalength); .) . String<char *str> = string (. char local[100]; LexString(local, sizeof(local) - 1); int i = 0; while (local[i]) /* strip quotes */ { local[i] = local[i+1]; i++; } local[i-2] = '\0'; i = 0; while (local[i]) /* find internal quotes */ { if (local[i] == '\'') /* single quote */ { int j = i; while (local[j]) { local[j] = local[j+1]; j++; } } i++; } strcpy(str, local); .) . Number <int &num> = number (. char str[100]; int i = 0, l, digit, overflow = 0; num = 0; LexString(str, sizeof(str) - 1); l = strlen(str); while (i <= l && isdigit(str[i])) { digit = str[i] - '0'; i++; if (num <= (maxint - digit) / 10) num = 10 * num + digit; else overflow = 1; } if (overflow) SemError(200); .) . /* The productions below are specific to this application */ ASSEM = (. arithtypes = typeset(TABLE_none, TABLE_ints); booltypes = typeset(TABLE_none, TABLE_bools); .) "ASSEM" Block "." (. Table->checklabels(stdout); .) . StatementSequence = { Statement } . Statement = SYNC ( OneWord | Branch | DSPorLIT | LValue | WriteString | Label ) . Label = (. TABLE_alfa name; TABLE_entries entry; bool found; .) Ident<name> (. Table->search(name, entry, found); if (!found) { // new label to be defined strcpy(entry.name, name); entry.type = TABLE_none; entry.idclass = TABLE_labels; CGen->storelabel(entry.l.address); entry.l.defined = true; Table->enter(entry); } else { // only now defined but had forward reference if (entry.idclass != TABLE_labels) SemError(206); else { // must not already been defined if (entry.l.defined) SemError(230); else { // fix the forward references CGen->backpatchlist(entry.l.address); // and update the symbol table CGen->storelabel(entry.l.address); entry.l.defined = true; Table->update(entry); } } } .) . OneWord = ( "ADD" | "AND" | "DUP" | "DVD" | "EQL" | "GEQ" | "GTR" | "HLT" | "INB" | "IND" | "INN" | "INX" | "LEQ" | "LSS" | "MMM" | "MUL" | "NEG" | "NEQ" | "NLN" | "NOP" | "NOT" | "ORR" | "PPP" | "PRB" | "PRN" | "REM" | "STK" | "STO" | "SUB" | "VAL" ) (. char mnemonic[4]; LexName(mnemonic, sizeof(mnemonic) - 1); CGen->oneword(mnemonic); .) . DSPorLIT = (. char mnemonic[4]; TABLE_types type; int value; .) ( "DSP" | "LIT" ) (. LexName(mnemonic, sizeof(mnemonic) - 1); .) CExpression<type, value> (. CGen->twoword(mnemonic, value); .) . WriteString = (. char str[600]; CGEN_labels startstring; .) "PRS" String<str> (. CGen->stackstring(str, startstring); CGen->writestring(startstring); .) . Branch = ( "BRN" | "BZE" | "BAN" | "BOR" ) (. int destination; char mnemonic[4]; LexName(mnemonic, sizeof(mnemonic) - 1); .) ( Number<destination> | Target<destination> ) (. CGen->twoword(mnemonic, destination); .) . Target<int &destination> = (. TABLE_alfa name; TABLE_entries entry; bool found; .) Ident<name> (. Table->search(name, entry, found); if (found) { if (entry.idclass == TABLE_labels) { destination = entry.l.address; if (!entry.l.defined) { // another forward reference CGen->storelabel(entry.l.address); Table->update(entry); } } else SemError(206); } else { // first forward reference strcpy(entry.name, name); entry.type = TABLE_none; entry.idclass = TABLE_labels; CGen->storelabel(entry.l.address); entry.l.defined = false; destination = CGen->undefined; Table->enter(entry); } .) . LValue = (. int value, offset; .) "ADR" ( SignedNumber<value> (. CGen->twoword("ADR", value); .) | Designator<offset> (. CGen->twoword("ADR", -offset); .) ) . SignedNumber<int &value> = (. int sign = 1 .) [ "+" | "-" (. sign = -1; .) ] Number<value> (. value = sign * value; .) . /* This next one is very similar to the designator used in the CS301 compiler but needs to return the offset address of the variable or array element */ Designator<int &offset> = (. TABLE_alfa name; TABLE_types exptype; TABLE_entries entry; bool found; int value; .) Ident<name> (. Table->search(name, entry, found); if (!found) { SemError(202); // force declaration strcpy(entry.name, name); entry.type = TABLE_none; entry.idclass = TABLE_vars; entry.v.scalar = true; entry.v.size = 1; entry.v.offset = 0; } if (entry.idclass == TABLE_vars) offset = entry.v.offset; else { SemError(206); offset = 0; return; } .) ( "[" (. if (!found) { entry.v.scalar = false; Table->enter(entry); } if (entry.v.scalar) SemError(204); .) CExpression<exptype, value> (. if (!arithtypes.memb(exptype)) SemError(227); offset += value; if (value < 0 || value >= entry.v.size) SemError(223); .) "]" | (. if (!found) Table->enter(entry); .) ) . END ASSEM.