When I speak to you, how do you understand what I am saying? It is important that we communicate in a shared language, say, English, and it is important that I speak grammatically correctly (e.g., ``Eaten house horse before.'' is an incorrect, useless communication). Finally, you must know how to attach meanings to the words and phrases that I use.
The same ideas are as important when you talk to a computer, by means of a programming language: the computer must ``know'' the language you use. This includes:
This chapter gives techniques for stating precisely a language's syntax and semantics in terms of parser and interpreter programs.
In the 1950s, Noam Chomsky realized that the syntax of a sentence can be represented by a tree, and the rules for building syntactically correct sentences can be written as an equational, inductive definition. Chomsky called the definition a grammar. (John Backus and Peter Naur independently discovered the same concept, and for this reason, a grammar is sometimes called BNF (Backus-Naur form) notation.)
A grammar is a set of equations (rules), where each equation defines a set of phrases (strings of words). The best way to learn this is by example.
=================================================== EXPRESSION ::= NUMERAL | ( EXPRESSION OPERATOR EXPRESSION ) OPERATOR ::= + | - NUMERAL is a sequence of digits from the set, {0,1,2,...,9} ===================================================The words in upper-case letters (nonterminals) name phrase and word forms: an EXPRESSION phrase consists of either a NUMERAL word or a left paren followed by another (smaller) EXPRESSION phrase followed by an OPERATOR word followed by another (smaller) EXPRESSION phrase followed by a right paren. (The vertical bar means ``or.'')
If the third equation is too informal for you, we can replace it with
these two:
NUMERAL ::= DIGIT | DIGIT NUMERAL
DIGIT ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
but usually the spelling of individual words is stated informally,
like we did originally.
Using the rules, we can verify that this sequence of symbols is
a legal EXPRESSION phrase:
(4 - (3 + 2))
Here is the explanation, stated in words:
Next, here is the justification, stated as a derivation,
where we calculate with the equations (rules) the phrase, (4 - (3 + 2)):
EXPRESSION => ( EXPRESSION OPERATOR EXPRESSION )
=> ( NUMERAL OPERATOR EXPRESSION )
=> (4 OPERATOR EXPRESSION)
=> (4 + EXPRESSION)
=> (4 + ( EXPRESSION OPERATOR EXPRESSION ))
=> (4 + ( NUMERAL OPERATOR EXPRESSION ))
=> (4 + ( 3 OPERATOR EXPRESSION ))
=> . . .
=> (4 + (3 - 2))
The derivation is compactly drawn
as a derivation tree:
===================================================
===================================================
Indeed, a sequence of words is an EXPRESSION phrase if and only
if one can build a derivation tree for the words with the grammar rules.
In this way, we use grammars to define precisely the syntax of a language. This is a crucial first step towards building programs --- compilers and interpreters --- that read sentences (programs) in the language and calculate their meanings (execute them).
Exercise: Write a derivation and derivation tree for ((4 - 3) + 2). Why is 4 - 3 + 2 not a legal EXPRESSION phrase?
=================================================== PROGRAM ::= COMMANDLIST COMMANDLIST ::= COMMAND | COMMAND ; COMMANDLIST COMMAND ::= VAR = EXPRESSSION | print VARIABLE | while EXPRESSION : COMMANDLIST end EXPRESSION ::= NUMERAL | VAR | ( EXPRESSION OPERATOR EXPRESSION ) OPERATOR ::= + | - NUMERAL is a sequence of digits VAR is a sequence of letters but not 'print', 'while', or 'end' ===================================================The definition says that a program is a list (sequence) of commands, which can be assignments or prints or while-loops. The body of a while-loop is itself a list of commands. The grammar does not explain what the phrases mean, so we have no information that tells us how a command like, while x : x = (x - 1) end, operates --- this is an issue of semantics, which we consider momentarily.
Here is the derivation tree for the program,
x = (3 + 2); print x:
===================================================
===================================================
Exercise: Draw the derivation trees for these PROGRAMs:
Here is the operator tree produced from the
derivation for (4 - (3 + 2)):
It is called an ``operator tree'' because the operators rest
in the places where the phrase names once appeared.
It's lots more compact than the original derivation tree,
but it has the same branching structure, which is the crucial part.
When we program with grammars and trees, we should
use a dynamic data-structures language, like Python or Scheme or Ruby or ML or Ocaml,
that lets us build nested lists. Then, we can represent an operator
tree as a nested list. Here is the nested-list representation
of the above operator tree:
["-", "4", ["+", "3", "2"]]
Here is a second example: For ((2+1) - (3-4)), we have this operator
tree (nested list):
["-", ["+", "2", "1"], ["-", "3", "4"]]
Finally, this program,
x = 3; while x : x = (x -1) end; print x,
has this operator tree:
which can be coded as this nested list:
[["=", "x", "3"],
["while", "x", [["=", "x", ["-", "x", "1"]]]],
["print", "x"]
]
(You can read it all on one line, if you wish:
[["=", "x", "3"], ["while", "x", [["=", "x", ["-", "x", "1"]]]], ["print", "x"]}.)
The COMMANDLIST is formatted as a list of command operator trees.
Exercise: Draw graphical and nested-list operator trees for all the derivation trees you have drawn in the previous exercises.
The meanings of the subtrees (subphrases) are combined to determine the tree's overall meaning.
What about commands? The meaning of a command is a storage-update function.
That is, we determine a command's meaning once we supply it a storage vector to examine and update. Here is an example semantics for the compound command,
x = 2; y = (x + 1), where the value of global storage is initially empty
(represented by {}):
The cyan arrows indicate the progress of the storage vector through the semantics calculation. You can pretend the store is an argument or a parameter to the subtrees, or you can pretend there is an access-right/capability/pointer/handle to the storage vector that is given to the subtrees, or you can pretend the store is a global variable that can be inspected and altered by the subtrees in turn ---
we have all these options for implementing the semantics in code.
Finally, the semantics of a loop command causes multiple calculations of the loop's test and body.
Consider the loop, while x : x = (x - 1) end, with the input store,
{'x':two}: (Note: read the test, x, as saying "x != 0").
The loop starts with
input store, {'x':two}, and evaluates the test, x. Since x's value is nonzero, the body computes its meaning from the input store, producing
{'x':one}. Now the loop must calculate meaning for
{'x':one}, which makes the body compute a second time, producing the meaning,
{'x':zero}. The loop's test now computes to zero, and the output meaning of the loop is {'x':zero}.
Our job is to write an interpreter that calculates semantics of operator trees like the ones suggested by the pictures.
When a compiler or interpreter processes a computer program, it first builds the program's operator tree. Then, it calculates the semantics of the tree with a recursively defined tree-traversal function, like the ones you learned in CIS300.
The list-based
operator trees for arithmetic have just two forms, which we can define
precisely with a grammar rule, like this:
TREE ::= NUMERAL | [ OP, TREE, TREE ]
OP ::= + | -
NUMERAL ::= a string of digits
That is, every operator tree is either just a single numeral
or a list holding an operator symbol and two subtrees/sublists.
We will write a function, eval, that implements recursive calls
that match the recursions in the grammar rule.
For example,
===================================================
for tree, "2", eval("2") returns 2;
for tree, ["+", "2", "1"], eval(["+", "2", "1"])
calls eval("2") which returns 2
calls eval("1") which returns 1
and returns 2 + 1 == 3
for tree, ["-", ["+", "2", "1"], ["-", "3", "4"]],
eval( ["-", ["+", "2", "1"], ["-", "3", "4"]] )
calls eval(["+", "2", "1"]) which returns 3
calls eval(["-", "3", "4"]]) which returns -1
and returns 3 - (-1) == 4
===================================================
The interpreter (written in Python)
is a function that calls itself:
===================================================
def eval(t) :
"""pre: t is a list-represented TREE,
where TREE ::= NUMERAL | [ OP, TREE, TREE ]
OP ::= "+" | "-"
NUMERAL ::= a string of digits
post: ans is the numerical meaning of t
returns: ans
"""
if isinstance(t, str) and t.isdigit() : # is t a string of all digits?
ans = int(t) # cast the string to an int
else : # t is a list, [op, t1, t2]
op = t[0]; t1 = t[1]; t2 = t[2]
ans1 = eval(t1) # assert: ans1 is the numerical meaning of t1
ans2 = eval(t2) # assert: ans2 is the numerical meaning of t2
if op == "+" :
ans = ans1 + ans2
else : # op == "-":
ans = ans1 - ans2
return ans
===================================================
The function's recursions match exactly the recursions in the grammar
rule that defines the operator trees.
The recursion computes the numerical meanings
of the subtrees and combines them to get the meaning of
the complete tree.
Exercise: Copy the function into a file named eval.py, and start Python on it: python -i eval.py. Call the function: eval( ["-", ["+", "2", "1"], ["-", "3", "4"]] ). Try more examples.
Here is a sketch of the execution of
eval(["-", ["+", "2", "1"], ["-", "3", "4"]]),
which computes to 3 - (-1) = 4.
Each time eval is called, we copy freshly its body, to mimick how a computer
restarts eval (and creates a new activation record).
The way we write this is called
copy-rule semantics.
===================================================
eval( ["-", ["+", "2", "1"], ["-", "3", "4"]] )
=> op = "-"
t1 = ["+", "2", "1"]
t2 = ["-", "3", "4"]
ans1 = eval(t1) => op = "+"
t1 = "2"
t2 = "1"
ans1 = eval(t1) => ans = 2
= 2
ans2 = eval(t2) => ans = 1
= 1
ans = 2+1 = 3
= 3
ans2 = eval(t2) => op = "-"
t1 = "3"
t2 = "4"
ans1 = eval(t1) => ans = 3
= 3
ans2 = eval(t2) => ans = 4
= 4
ans = 3-4 = -1
= -1
ans = 3 - (-1) = 4
= 4
===================================================
Each => represents a recursive call (restart)
of eval on a subtree of the original
operator tree. Each restart keeps its own namespace (activation record)
of its local variables, which it uses to compute the answer for its subtree.
Eventually, the answers are returned and combined.
The pattern of calls to eval matches
the pattern of structure in the original operator tree.
A compiler (translator) is a program that converts a program into an operator tree and then converts the operator tree into a program in a different programming language (e.g., machine language or byte code) that will compute the semantics. Think about a compiler for C, which translates a C program into a machine-language program that, when executed, does the semantics of the original C program.
We use the coding style shown above to compile
an operator tree for arithmetic into its
postfix-string representation, which these days is called
``byte code.'' This is essentially what a Java compiler does ---
translates Java programs into byte code:
===================================================
def postfix(t) :
"""pre: t is a TREE, where TREE ::= NUM | [ OP, TREE, TREE ]
post: ans is a string holding a postfix (operator-last) sequence
of the symbols within t
returns: ans
"""
if isinstance(t, str) and t.isdigit() : # is t a string holding an int?
ans = t #(*) the postfix of a NUM is just the NUM itself
else : # t is a list, [op, t1, t2]
op = t[0]
t1 = t[1]
t2 = t[2]
ans1 = postfix(t1)
# assert: ans1 is a string holding the postfix form of t1
ans2 = postfix(t2)
# assert: ans2 is a string holding the postfix form of t2
# concatenate the subanswers into one string:
if op == "+" :
ans = ans1 + ans2 + "+" #(*)
elif op == "-" :
ans = ans1 + ans2 + "-" #(*)
else :
print "error:", t, "is illegal"
raise Exception # stops the program
return ans
===================================================
This code is exactly the code for eval except for the positions marked
#(*), where postfix strings are generated instead of numerical semantics.
For example, postfix(["+", ["-", "2", "1"], "4"]) builds the postfix
string, "21-4+". The string states what to compute for the semantics
but in postfix style.
We draw the compilation like
this:
===================================================
postfix(["+", ["-", "2", "1"], "4"])
=> op = "+"
t1 = ["-", "2", "1"]
t2 = "4"
ans1 = postfix(t1) => op = "-"
t1 = "2"
t2 = "1"
ans1 = postfix(t1) => ans = "2"
= "2"
ans2 = postfix(t2) => ans = "1"
= "1"
ans = "2" + "1" + "-"
= "21-"
= "21-"
ans2 = postfix(t2) => ans = "4"
= "4"
ans = "21-" + "4" + "+"
= "21-4+"
===================================================
Again,
each => represents a recursive call (restart)
of postfix on a subtree of the original
operator tree.
If the previous explanation was not enough for you, you can
insert print commands into postfix so that the computer
shows you the path it takes to translate the tree:
===================================================
def postfixx(t, level = 0, trace = True) :
"""pre: t is a TREE, where TREE ::= INT | [ OP, TREE, TREE ]
level is an int, indicating at what depth t is situated
in the overall tree being postfixed. The default value is 0.
trace is an optional boolean parameter that states whether
the function should print a trace of its computation.
If not supplied, the param defaults to True.
post: ans is a string holding a postfix (operator-last) sequence
of the symbols within t
returns: ans
"""
if trace : print level * " ", "Entering subtree t =", t
if isinstance(t, str) : # is t a numeral?
ans = str(t)
else : # t is a list, [op, t1, t2]
op = t[0]
t1 = t[1]
t2 = t[2]
ans1 = postfixx(t1, level + 1)
ans2 = postfixx(t2, level + 1)
ans = ans1 + ans2 + op # the answer combines the two subanswers
if trace : print level * " ", "Exiting subtree t =", t, " ans =", ans, "\n"
return ans
===================================================
If you use Python and call this function, say,
postfixx(["+", ["-", "3", "2"], "4"]),
you will see this printout:
===================================================
Entering subtree t = ['+', ['-', '3', '2'], '4']
Entering subtree t = ['-', '3', '2']
Entering subtree t = 3
Exiting subtree t = 3 ans = 3
Entering subtree t = 2
Exiting subtree t = 2 ans = 2
Exiting subtree t = ['-', '3', '2'] ans = 32-
Entering subtree t = 4
Exiting subtree t = 4 ans = 4
Exiting subtree t = ['+', ['-', '3', '2'], '4'] ans = 32-4+
'32-4+'
===================================================
The computer descended into the levels of the tree
from left to right, computing answers for its leaves that were combined
into an answer for the entire tree.
The Java compiler works like the above example: a Java program is parsed
into its tree form and the tree is translated into postfix (byte code).
The Java-byte code for the above example is wordier and looks like this:
loadconst 3
loadconst 2
subtract
loadconst 4
add
Here is the postfix function reformatted to produce byte code:
def translateETREE(e, trace = False) :
"""pre: e is an expression represented as an ETREE:
ETREE ::= NUMERAL | [OP, ETREE1, ETREE2]
where OP is either "+" or "-"
trace is an optional boolean parameter that states whether
the function should print a trace of its computation.
If not supplied, the param defaults to False.
post: ans holds the byte code that computes the meaning of e
"""
if trace :
print "enter with", e
if isinstance(e, str) and e.isdigit() : # a numeral
ans = "LOADCONST " + e + "\n"
else : # [op, e1, e2]
op = e[0]
ans = translateETREE(e[1], trace) + translateETREE(e[2], trace)
if op == "+" :
ans = ans + "ADD\n"
elif op == "-" :
ans = ans + "SUB\n"
else :
print "illegal arithmetic operator"
raise Exception # stop computation
if trace :
print "exit with", e, "whose meaning is", ans
return ans
We can also do translations from one form of tree to another form of tree.
For example, maybe we wish to reformat our operator trees so that the operator
comes last in the list representation:
PTREE ::= NUM | [ PTREE1, PTREE2, OP ]
so that postfixTree(["+", "2" , ["-", "3" , "4"]]) returns this PTREE:
["2" , ["3" , "4", "-"], "+"]. Here's how we use the recursion pattern:
===================================================
def postfixTree(t) :
"""pre: t is a TREE, where TREE ::= NUM | [ OP, TREE1, TREE2 ]
post: ans is t reformatted as a PTREE so that the operators are last:
PTREE ::= NUM | [ PTREE1, PTREE2, OP ]
"""
if isinstance(t, str) : # is t an instance of a NUM (a simple string)?
ans = t # just the NUM itself
else : # t is a list, [op, t1, t2], that is, isinstance(t, list)
op = t[0]
ans1 = postfixTree(t[1])
# assert: ans1 is the PTREE form of t[1]
ans2 = postfixTree(t[2])
# assert: ans2 is the PTREE form of t[2]
ans = [ans1, ans2, op] # combine the subanswers into a PTREE
# assert: ans holds the PTREE form of t
return ans
===================================================
=================================================== PROGRAM ::= COMMANDLIST COMMANDLIST ::= COMMAND | COMMAND ; COMMANDLIST COMMAND ::= VAR = EXPRESSSION | print VARIABLE | while EXPRESSION : COMMANDLIST end EXPRESSION ::= NUMERAL | VAR | ( EXPRESSION OPERATOR EXPRESSION ) OPERATOR is + or - NUMERAL is a sequence of digits from the set, {0,1,2,...,9} VAR is a string beginning with a letter; it cannot be 'print', 'while', or 'end' ===================================================For program,
x = 3 ; while x : x = x - 1 end ; print xhere is its operator tree:
[["=", "x", "3"], ["while", "x", [["=", "x", ["-", "x", "1"]]]], ["print", "x"] ]The operator tree has a commandlist level, a command level, and an expression level. A program is just a commandlist. Here is the grammar of operator trees:
PTREE ::= CLIST CLIST ::= [ CTREE+ ] where CTREE+ means one or more CTREEs CTREE ::= ["=", VAR, ETREE] | ["print", VAR] | ["while", ETREE, CLIST] ETREE ::= NUMERAL | VAR | [OP, ETREE, ETREE] where OP is either "+" or "-"So, we write four interpreter functions (why?), and the functions define the semantics of the computer language.
The interpreter has a simple virtual-machine architecture:
The input is an operator tree, which is traversed by the
interpret functions, which share a global variable, ns, that models a storage vector.
Here is the coding:
===================================================
"""Interpreter for a mini-language with variables and loops.
There is one crucial data structure:
"""
ns = {} # ns is a namespace/dictionary --- it holds the input program's
# variable names and associated values.
# For example, if
# ns = {'x': 2, 'y': 0}, then variable x names int 2
# and variable y names int 0.
def interpretPTREE(program) :
"""pre: program is a PTREE ::= CLIST
post: ns holds all the updates commmand by program
"""
global ns # Python requires that we state intention to alter global var, ns
ns = {}
interpretCLIST(program)
print "final namespace =", ns
def interpretCLIST(p) :
"""pre: p is a program represented as a CLIST ::= [ CTREE+ ]
where CTREE+ means one or more CTREEs
post: ns holds all the updates commanded by program p
"""
for command in p :
interpretCTREE(command)
def interpretCTREE(c) :
"""pre: c is a command represented as a CTREE:
CTREE ::= ["=", VAR, ETREE] | ["print", VAR] | ["while", ETREE, CLIST]
post: ns holds all the updates commanded by c
"""
operator = c[0]
if operator == "=" : # assignment command, ["=", VAR, ETREE]
var = c[1] # get left-hand side
exprval = interpretETREE(c[2]) # evaluate the right-hand side
ns[var] = exprval # do assignment. NOTE: we are altering part of ns,
# not ns itself, so "global ns" not required (!)
elif operator == "print" : # print command, ["print", VAR]
var = c[1]
if var in ns : # see if variable name is defined
ans = ns[var] # fetch its value
print ans
else :
crash("variable name undefined")
elif operator == "while" : # while command, ["while", ETREE, CLIST]
expr = c[1]
body = c[2]
if interpretETREE(expr) != 0 : # check if text expr is nonzero
interpretCLIST(body) # yes, so execute loop body
interpretCTREE(c) # repeat by restarting semantics with updated ns
else : # test was false, so exit loop
pass # means "do nothing"
else : # error
crash("invalid command")
def interpretETREE(e) :
"""pre: e is an expression represented as an ETREE:
ETREE ::= NUMERAL | VAR | [OP, ETREE, ETREE]
where OP is either "+" or "-"
post: ans holds the numerical value of e
returns: ans
"""
if isinstance(e, str) and e.isdigit() : # a numeral, all digits
ans = int(e)
elif isinstance(e, str) and e[0].isalpha() : # var name, all letters
if e in ns : # see if variable name is defined
ans = ns[e] # look up its value
else :
crash("variable name undefined")
else : # [op, e1, e2]
op = e[0]
ans1 = interpretETREE(e[1])
ans2 = interpretETREE(e[2])
if op == "+" :
ans = ans1 + ans2
elif op == "-" :
ans = ans1 - ans2
else :
crash("illegal arithmetic operator")
return ans
def crash(message) :
"""pre: message is a string
post: message is printed and interpreter stopped
"""
print message + "! crash! core dump: ", ns
raise Exception # stops the interpreter
===================================================
The interpretPTREE function starts the interpreter:
interpretPTREE([["=", "x", "3"],
["while", "x", [["=", "x", ["-", "x", "1"]], ["print", "x"]]],
])
This program
assigns 3 to "x";
decrements "x"'s value to 2 to 1 to 0;
prints each decrement;
then stops the loop.
Exercise: install and test the interpreter.
Another name for the interpreter is a virtual machine, because the interpreter adapts the computer hardware we have into a machine that understands the instruction set of a new language.
As we saw in the section on postfix expressions, we can revise the
interpreter for commands so that it generates a compiled program.
Here is the command interpreter revised the same way that we did earlier
to generate
postfix, bytecode expressions:
===================================================
"""Compiler for a mini-language with variables and loops.
Translates a program into byte code using these instructions:
LOADNUM n # int n is pushed on top of the value stack
LOAD v # looks up value of var v and pushes it
ADD # pops top two ints, adds them, pushes answer
SUBTRACT # like ADD
STORE v # pops top int and stores it in v's memory cell
PRINT v # prints value in v's memory cell
IFZERO EXITLOOP # pops top int and if =0 exits innermost enclosing loop
BEGINLOOP # marks the code of a loop test+body
ENDLOOP # goto the corresponding BEGINLOOP
"""
symboltable = [] # list of variable names that will be stored in runtime memory;
# used for simple "declaration checking" of the program
def compileCLIST(p) :
"""pre: p is a program represented as a CLIST ::= [ CTREE+ ]
where CTREE+ means one or more CTREEs
post: ans holds the compiled byte code for p
returns: ans
"""
ans = ""
for command in p :
ans = ans + compileCTREE(command)
return ans
def compileCTREE(c) :
"""pre: c is a command represented as a CTREE:
CTREE ::= ["=", VAR, ETREE] | ["print", VAR] | ["while", ETREE, CLIST]
post: ans is a string that holds the compiled byte code for c
returns: ans
"""
operator = c[0]
if operator == "=" : # assignment command, ["=", VAR, ETREE]
var = c[1] # get left-hand side
exprcode = compileETREE(c[2])
symboltable.append(var) # remember the var in the symboltable list
ans = exprcode + "STORE " + var + "\n"
elif operator == "print" : # print command, ["print", VAR]
var = c[1]
if var in symboltable : # see if variable name is defined
ans = "PRINT " + var + "\n"
else :
crash("variable name undefined")
elif operator == "while" : # while command
expr = c[1]
body = c[2]
exprcode = compileETREE(c[1])
bodycode = compileCLIST(c[2])
ans = "BEGINLOOP \n" + exprcode \
+ "IFZERO EXITLOOP \n" + bodycode + "ENDLOOP \n"
else : # error
crash("invalid command")
return ans
def compileETREE(e) :
"""pre: e is an expression represented as an ETREE:
ETREE ::= NUMERAL | VAR | [OP, ETREE, ETREE]
where OP is either "+" or "-"
post: ans is a string that holds the compiled byte code for e
returns: ans
"""
if isinstance(e, str) and e.isdigit() : # a numeral
ans = "LOADNUM " + e + "\n"
elif isinstance(e, str) and e[0].isalpha() : # var name
if e in symboltable : # see if variable name is defined
ans = "LOAD " + e + "\n"
else :
crash("variable name undefined")
else : # [op, e1, e2]
op = e[0]
ans1 = compileETREE(e[1])
ans2 = compileETREE(e[2])
if op == "+" :
ans = ans1 + ans2 + "ADD \n"
elif op == "-" :
ans = ans1 - ans2 + "SUBTRACT \n"
else :
crash("illegal arithmetic operator")
return ans
def crash(message) :
"""pre: message is a string
post: message is printed and interpreter stopped
"""
print message + "! crash!", symboltable
raise Exception # stops the interpreter
def compilePTREE(program) :
"""pre: program is a PTREE ::= CLIST
post: code holds the compiled byte code for program
"""
global symboltable # ns is global to main
symboltable = []
code = compileCLIST(program)
print code
===================================================
For example, the program,
x = 2; y = (x + 1); print y,
would be compiled like this:
compilePTREE([['=','x','2'], ['=','y',['+','x','1']], ['print','y']])
LOADNUM 2
STORE x
LOAD x
LOADNUM 1
ADD
STORE y
PRINT y
The compiler traversed the operator tree and translated its meaning
into a sequence of byte-code instructions that can be executed
by a processor that understands byte code.
There is a small but important technical point to notice: The original interpreter collected all the assignments within a namespace structure, ns. But when we compile a program, the assignments are not done by the compiler, but by the machine that executes the bytecode. For this reason, ns is removed from the compiler. But in its place, there is a new structure, symboltable, that remembers which variables might be created and assigned when the byte-code program executes.
The symboltable is a kind of ``ghost'' that foreshadows the run-time memory, and it keeps track of variable declarations and can help spot illegal uses of variables. In this way, the compiler can print error messages that indicate when the program will fail at run-time. (If the command language contained declaration statements with data types, this information would be saved in symboltable and used to type check the program's commands during the translation process.)
Finally, note that the byte code generated for while-loops is overly
simplistic --- a real-life compiler would generate jump commands and
labels when translating a loop.
For example,
x = 2; while x : x = x - 1; print x
compiles to where a realistic compiler produces this:
LOADNUM 2 LOADNUM 2
STORE x STORE x
BEGINLOOP LABEL1:
LOAD x LOAD x
IFZERO EXITLOOP JUMPZERO LABEL2
LOAD x LOAD x
LOADNUM 1 LOADNUM 1
SUBTRACT SUBTRACT
STORE x STORE x
ENDLOOP JUMP LABEL1
PRINT x LABEL2:
PRINT x
We have omitted label generation to keep the compiler
small and simple.
=================================================== """Interpreter for a mini-language with variables and loops. """ def interpretCLIST(p, ns) : """pre: p is a program represented as a CLIST ::= [ CTREE+ ] where CTREE+ means one or more CTREEs ns is a dictionary, the current value of the storage (namespace) returns: ns, holds all the updates commanded by program p """ nextns = ns for command in p : nextns = interpretCTREE(command, nextns) return nextns def interpretCTREE(c, ns) : """pre: c is a command represented as a CTREE: CTREE ::= ["=", VAR, ETREE] | ["print", VAR] | ["while", ETREE, CLIST] ns is a dictionary, the current value of the storage (namespace) returns: ansns, holds all the updates commanded by program p """ operator = c[0] if operator == "=" : # assignment command, ["=", VAR, ETREE] var = c[1] # get left-hand side exprval = interpretETREE(c[2], ns) # evaluate the right-hand side ansns = dict.copy(ns); ansns[var] = exprval # build updated dictionary elif operator == "print" : # print command, ["print", VAR] var = c[1] if var in ns : # see if variable name is defined ans = ns[var] # look up its value print ans else : crash("variable name undefined") ansns = ns elif operator == "while" : # while command, ["while", ETREE, CLIST] expr = c[1] body = c[2] if interpretETREE(expr, ns) != 0 : nextns = interpretCLIST(body, ns) ansns = interpretCTREE(c, nextns) # repeat else : # exit loop ansns = ns else : # error crash("invalid command") return ansns def interpretETREE(e, ns) : """pre: e is an expression represented as an ETREE: ETREE ::= NUMERAL | VAR | [OP, ETREE, ETREE] where OP is either "+" or "-" ns is a dictionary, the current value of the storage (namespace) post: ans holds the numerical value of e returns: ans """ if isinstance(e, str) and e.isdigit() : # a numeral ans = int(e) elif isinstance(e, str) and e[0].isalpha() : # var name if e in ns : # see if variable name is defined ans = ns[e] # look up its value else : crash("variable name undefined") else : # [op, e1, e2] op = e[0] ans1 = interpretETREE(e[1], ns) ans2 = interpretETREE(e[2], ns) if op == "+" : ans = ans1 + ans2 elif op == "-" : ans = ans1 - ans2 else : crash("illegal arithmetic operator") return ans def crash(message) : """pre: message is a string post: message is printed and interpreter stopped """ print message + "! crash! core dump: ", ns raise Exception # stops the interpreter def interpretPTREE(program) : """pre: program is a PTREE ::= CLIST post: ns holds all updates within program """ ns = interpretCLIST(program, {}) # start with ns == {} print "final namespace =", ns ===================================================
If time-and-space performance requirements are critical, revise the interpreter into a compiler: The compiler does the tedious steps of operator-tree traversal/disassembly, and it performs the routine tests related to variable declarations, expression compatibilities, etc. This leaves only the primitive computation steps that must be translated into the compiled code. The result will be a smaller, faster program to execute.
If the program must be placed in firmware or burned into a chip, make the interpreter into a compiler so that the process of operator-tree disassembly can be done once and for all and need not be included in the code placed in the firmware/chip.
If program correctness is critical, stay with the interpreter --- It is easier to analyze and prove an interpreter correct than it is a compiler.
If the language will be repeatedly revised or extended, stay with the interpreter. Revising compilers is a painful job.
For example, say we want a program, parseExpr, that can read a line of text, like ((2+1) - (3 - 4) ), and build the operator tree, ["-", ["+", "2", "1"], ["-", "3", "4"]]. The program's algorithm will go like this:
For the first step, here is a little function that disassembles
a line of text and makes a list of words that were found in the text:
===================================================
def scan(text) :
"""scan splits apart the symbols in text into a list.
pre: text is a string holding a program phrase
post: answer is a list of the words and symbols within text
(spaces and newlines are removed)
"""
OPERATORS = ("(", ")", "+", "-", ";", ":", "=") # must be one-symbol operators
SPACES = (" ", "\n")
SEPARATORS = OPERATORS + SPACES
nextword = "" # the next symbol/word to be added to the answer list
answer = []
for letter in text:
# invariant: answer + nextword + letter equals all the words/symbols
# read so far from text with SPACES removed
# see if nextword is complete and should be appended to answer:
if letter in SEPARATORS and nextword != "" :
answer.append(nextword)
nextword = ""
if letter in OPERATORS :
answer.append(letter)
elif letter in SPACES :
pass # discard space
else : # build a word or numeral
nextword = nextword + letter
if nextword != "" :
answer.append(nextword)
return answer
===================================================
For example, scan("((2+1) - (3 - 4) )") returns as its answer,
['(', '(', '2', '+', '1', ')', '-', '(', '3', '-', '4', ')', ')'].
Now, we use the grammar rule to guide writing the
function that reads the list of words and constructs the operator tree.
Here is the grammar rule for arithmetic expressions:
EXPRESSION ::= NUMERAL | VAR | ( EXPRESSION OPERATOR EXPRESSION )
where OPERATOR is "+" or "-"
NUMERAL is a sequence of digits
VAR is a string of letters
For each construction in the grammar rule,
there is an operator tree to build:
for NUMERAL, the tree is NUMERAL
for VAR, the tree is VAR
for ( EXPRESSION1 OPERATOR EXPRESSION2 ), the tree is [OPERATOR, T1, T2]
where T1 is the tree for EXPRESSION1
T2 is the tree for EXPRESSION2
We write a function, parseEXPR, that reads the words of an arithmetic
expression and builds the tree, based on the words. Like the eval function
seen earlier, the grammar rules show us what to do.
There is a critical property of the EXPRESSION grammar rule that lets us build parseEXPR --- notice that each construction of EXPRESSION begins with a distinct "keyword". That is, the first construction, NUMERAL, begins with a digit-word; the second construction, VAR, begins with a letter-word, and the third construction, ( EXPRESSION OPERATOR EXPRESSION ), begins with "(". These leading keywords tell parseEXPR which form of tree to build.
We will use these global variables and a helper function
to parcel out the input words one at a time to parseEXPR:
===================================================
# say that inputtext holds the text that we must parse into a tree:
wordlist = scan(inputtext) # holds the remaining unread words
nextword = "" # holds the first unread word
# global invariant: nextword + wordlist == all remaining unread words
EOF = "!" # a word that marks the end of the input words
getNextword() # call this function to move the first word into nextword:
def getNextword() :
"""moves the front word in wordlist to nextword.
If wordlist is empty, sets nextword = EOF
"""
global nextword, wordlist
if wordlist == [] :
nextword = EOF
else :
nextword = wordlist[0]
wordlist = wordlist[1:]
===================================================
The function that builds expression-operator trees
uses if commands to learn the leading keyword and
choose the correct tree to build:
===================================================
def parseEXPR() :
"""builds an EXPR operator tree from the words in nextword + wordlist,
where EXPR ::= NUMERAL | VAR | ( EXPR OP EXPR )
OP is "+" or "-"
also, maintains the global invariant (on wordlist and nextword)
"""
if nextword.isdigit() : # a NUMERAL ?
ans = nextword
getNextword()
elif isVar(nextword) : # a VARIABLE ?
ans = nextword
getNextword()
elif nextword == "(" : # ( EXPR op EXPR ) ?
getNextword()
tree1 = parseEXPR()
op = nextword
if op == "+" or op == "-" :
getNextword()
tree2 = parseEXPR()
if nextword == ")" :
ans = [op, tree1, tree2]
getNextword()
else :
error("missing )")
else :
error("missing operator")
else :
error("illegal symbol to start an expression")
return ans
def isVar(word) :
"""checks whether word is a legal variable name"""
KEYWORDS = ("print", "while", "end")
ans = ( word.isalpha() and not(word in KEYWORDS) )
return ans
def error(message) :
"""prints an error message and halts the parse"""
print "parse error: " + message
print nextword, wordlist
raise Exception
===================================================
We tie the pieces together like this to make a parser that builds
EXPRESSION trees:
===================================================
# global invariant: nextword + wordlist == all remaining unread words
nextword = "" # holds the first unread word
wordlist = [] # holds the remaining unread words
EOF = "!" # a word that marks the end of the input words
def main() :
global wordlist
# read the input text, break into words, and place into wordlist:
text = raw_input("Type an arithmetic expression: ")
wordlist = scan(text)
# do the parse:
getNextword()
tree = parseEXPR()
print tree
if nextword != EOF :
error("there are extra words")
===================================================
COMMAND ::= VAR = EXPRESSSION | print VAR | while EXPRESSION : COMMANDLIST endare VAR, print, and while. Since these are distinct, the parseCOMMAND function can build the correct form of tree:
=================================================== def parseCOMMAND() : """builds a COMMAND operator tree from the words in nextword + wordlist, where COMMAND ::= VAR = EXPRESSSION | print VARIABLE | while EXPRESSION : COMMANDLIST end also, maintains the global invariant (on wordlist and nextword) """ if nextword == "print" : # print VARIABLE ? getNextword() if isVar(nextword) : ans = ["print", nextword] getNextword() else : error("expected var") elif nextword == "while" : # while EXPRESSION : COMMANDLIST end ? getNextword() exprtree = parseEXPR() if nextword == ":" : getNextword() else : error("missing :") cmdlisttree = parseCMDLIST() if nextword == "end" : ans = ["while", exprtree, cmdlisttree] getNextword() else : error("missing end") elif isVar(nextword) : # VARIABLE = EXPRESSION ? v = nextword getNextword() if nextword == "=" : getNextword() exprtree = parseEXPR() ans = ["=", v, exprtree] else : error("missing =") else : # error -- bad command error("bad word to start a command") return ans ===================================================We finish with the function that collects the commands in a COMMANDLIST:
=================================================== def parseCMDLIST() : """builds a COMMANDLIST tree from the words in nextword + wordlist, where COMMANDLIST ::= COMMAND | COMMAND ; COMMANDLIST that is, one or more COMMANDS, separated by ;s also, maintains the global invariant (on wordlist and nextword) """ anslist = [ parseCOMMAND() ] # parse first command while nextword == ";" : # collect any other COMMANDS getNextword() anslist.append( parseCOMMAND() ) return anslist def main() : """reads the input mini-program and builds an operator tree for it, where PROGRAM ::= COMMANDLIST Initializes the global invariant (for nextword and wordlist) """ global wordlist text = raw_input("Type the program: ") wordlist = scan(text) getNextword() # assert: invariant for nextword and wordlist holds true here tree = parseCMDLIST() # assert: tree holds the entire operator tree for text print tree if nextword != EOF : error("there are extra words") ===================================================This style of parsing is also called top-down parsing, predictive parsing, and LL-parsing because it constructs the operator tree from the root at the tree's top downwards towards the leaves, predicting the correct structure by looking at the words in the input program, one at a time. You can see how important it is to have exactly the correct number of keywords and brackets at the exactly correct positions in the input program so that this technique will succeed. Parsing theory is the study of how to write grammars and parsers successfully.
Once you have mastered writing parsers by hand, you realize that the process is almost completely mechanical --- starting from the grammar definition, you mechanically write the correct code. Now, you are ready to use a tool called a parser generator to do the code writing for you: The input to a parser generator is the set of grammar rules and the output is the parser. Yacc is a well known parser generator, and PLY is a version of Yacc coded to work with Python. Antlr is another popular parser generator.
Exercise: Copy the scanner and parser into one file and the command interpreter into another. In a third file, write a driver program that reads a source program and then imports and calls the scanner-parser and interpreter modules.
Exercise: Add an if-else command to the parser and interpreter.
Exercise: Add parameterless procedures to the language:
COMMAND ::= . . . | proc I : C | call I
In the interpreter, save the body of a newly defined procedure
in the namespace. (That is, the semantics of
proc I: C is similar to I = C.) What is the semantics of
call I? Implement this.
Someday, you will be asked to design a language of your own. Indeed, this happens each time you design a piece of software, because the inputs to the software must arrive in a sensible order --- syntax --- for the software to process them. Software used by humans requires an input language so that a human knows the rules (grammar) for communicating with the software. Sometimes, the grammar is just a matter of the order of mouse drags and clicks; this is a kind of point-and-shove language that a human might use to another human when the two people are unwilling to speak words to each other.
But when a human wishes to speak (type) words to a program, a real language of words, phrases, and sentences is required. What should this language look like? What operations, data, control are needed within it? When a GUI must map a sequence of mouse drags and clicks into target code, what kind of code should it generate? If you are designing a piece of software, you must design its input language, and you must write the parser and the interpreter for the language. This is why we must learn the technques in this chapter and this course.
Programming languages that are designed to solve problems in some specialty area (e.g., avionics, telecommunications, word processing, database access, game playing) must have operations that are tailored to the specialty area and must have data and control structures that support the forms of computation in the area. A language oriented towards a specialty area is called a domain-specific programming language. By the end of this course, you will have basic skills to design such languages.
By the way, grammar (BNF) notation is a domain-specific programming language --- for writing parser programs! There are automated systems (Yacc, PLY, Antlr, Bison, ...) that can read a grammar definition and automatically write the parser code that we wrote by hand in this chapter. So, when you write a grammar, you have, for all practical purposes, already written its parser --- what's left to do is purely mechanical.