Copyright © 2008 David Schmidt

Chapter 8:
Grammars and trees


8.1 Grammars (BNF)
    8.1.1 Example: arithmetic expressions
    8.1.2 Example: logic propositions
    8.1.3 Example: mini-programming language
8.2 Operator trees
8.3 Semantics of operator trees
8.4 Parsing: How to construct an operator tree


When I speak to you, how do you understand what I am saying? First, it is important that we communicate in a common language, say, English, and it is important that I speak in grammatically correct English (e.g., ``Eaten house horse before.'' is a grammatically incorrect, useless communication). Finally, you must know how to attach meanings to the words and phrases that I use.

The same ideas are just as important when you talk to a computer, by means of a program written in a programming language. For the computer to understand what you say, the computer must have knowledge of the language you use. This includes:

  1. syntax: the spelling and grammatical structure of the computer language
  2. semantics: the meanings of the words and phrases.

In the 1950s, Noam Chomsky realized that the syntax of a sentence (or computer program) can be represented as a tree, and the rules for building syntactically correct sentences can be written as an equational, inductive definition. Chomsky called the definition a grammar. (John Backus and Peter Naur independently discovered the same notation, and for this reason, a grammar is sometimes called BNF (Backus-Naur form) notation.)


8.1 Grammars (BNF)

Grammars are best introduced by example.


8.1.1 Example: arithmetic expressions

Say that we wish to define precisely how to write an arithmetic expression. We might say that such expressions consist of numerals composed with addition and subtraction operators. But we should be more precise. Here are the equations (``grammar rules'') that define the syntax of arithmetic expressions:
===================================================

EXPRESSION ::= NUMERAL  |  ( EXPRESSION OPERATOR EXPRESSION )
OPERATOR is  +  or   -
NUMERAL  is a sequence of digits from the set, {0,1,2,...,9}

===================================================
The words in upper-case letters (nonterminals) name phrase and word forms: an EXPRESSION phrase consists of either a NUMERAL word or a left paren followed by another (smaller) EXPRESSION phrase followed by an OPERATOR word followed by another (smaller) EXPRESSION phrase followed by a right paren. (The vertical bar means ``or.'')

(We can also write equations for OPERATOR and NUMERAL, like this:

OPERATOR ::=  +  |  -
NUMERAL ::=  DIGIT  |  DIGIT NUMERAL
DIGIT ::=  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 
but usually the spelling of individual words is stated informally, like we did above.)
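
The recursive rule for NUMERAL can be read almost word for word as a recursive check. Here is a minimal sketch; the names DIGITS, is_digit, and is_numeral are made up for this illustration and are not used elsewhere in the chapter:

```python
DIGITS = "0123456789"

def is_digit(word):
    """DIGIT ::= 0 | 1 | 2 | ... | 9"""
    return len(word) == 1 and word in DIGITS

def is_numeral(word):
    """NUMERAL ::= DIGIT | DIGIT NUMERAL"""
    if is_digit(word):                 # first choice: a single DIGIT
        return True
    # second choice: a DIGIT followed by a (shorter) NUMERAL
    return len(word) > 1 and is_digit(word[0]) and is_numeral(word[1:])
```

The second choice in the rule is the reason the function calls itself on word[1:], the rest of the word after its first digit.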

Using the rules, we can verify that this sequence of words is a legal EXPRESSION phrase:

(4 - (3 + 2))
There is a precise formal justification:
  1. 4 is a NUMERAL (as are 3 and 2)
  2. all NUMERALs are legal EXPRESSION phrases, so (3 + 2) is an EXPRESSION phrase, because + is an OPERATOR and 3 and 2 are EXPRESSIONs
  3. (4 - (3 + 2)) is an EXPRESSION, because 4 and (3 + 2) are EXPRESSIONs, and - is an OPERATOR.
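This reasoning is nicely drawn as a derivation tree (shown here in outline form, with the parts of each phrase indented beneath the phrase's name):
===================================================

EXPRESSION
    (
    EXPRESSION
        NUMERAL  4
    OPERATOR  -
    EXPRESSION
        (
        EXPRESSION
            NUMERAL  3
        OPERATOR  +
        EXPRESSION
            NUMERAL  2
        )
    )

===================================================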
Indeed, a sequence of words is an EXPRESSION phrase if and only if one can build a derivation tree for the words using the grammar rules.
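
One way to see the "if and only if" at work is to write a checker that tries to build the derivation, choice by choice, without keeping the tree. This is only a sketch: the function names match_expression and is_expression are invented here, and the input is assumed to be already split into a list of words.

```python
def match_expression(words, i):
    """Tries to match one EXPRESSION phrase starting at words[i].
       Returns the index just past the phrase, or -1 if no phrase fits.
       EXPRESSION ::= NUMERAL  |  ( EXPRESSION OPERATOR EXPRESSION )
    """
    if i >= len(words):
        return -1
    if words[i].isdigit():             # choice 1: NUMERAL
        return i + 1
    if words[i] == "(":                # choice 2: ( EXPRESSION OPERATOR EXPRESSION )
        j = match_expression(words, i + 1)
        if j == -1 or j >= len(words) or words[j] not in ("+", "-"):
            return -1
        k = match_expression(words, j + 1)
        if k == -1 or k >= len(words) or words[k] != ")":
            return -1
        return k + 1
    return -1

def is_expression(words):
    """True exactly when the whole word list is one EXPRESSION phrase."""
    return match_expression(words, 0) == len(words)
```

For the words of (4 - (3 + 2)), is_expression answers True; for the words 4 - 3, which lack the parentheses the grammar demands, it answers False.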


8.1.2 Example: logic propositions

The rules for writing correct propositions can be stated as a grammar:
===================================================

PROP ::= TERM BINOP TERM  |  TERM
TERM ::= UNOP FACTOR  |  FACTOR
FACTOR ::= PRIM  |  ( PROP )
BINOP ::=  ^  |  v  |  -->
UNOP ::= ~
PRIM  is a word that begins with a letter

===================================================
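Here is the derivation tree for an example, (A v B) --> ~C (shown in outline form, with the parts of each phrase indented beneath the phrase's name):

PROP
    TERM
        FACTOR
            (
            PROP
                TERM
                    FACTOR
                        PRIM  A
                BINOP  v
                TERM
                    FACTOR
                        PRIM  B
            )
    BINOP  -->
    TERM
        UNOP  ~
        FACTOR
            PRIM  C
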
As the previous examples show, spaces within the grammar rules do not imply that spaces are required within the phrases defined by the rules.


8.1.3 Example: mini-programming language

We can use a grammar to define the structure of an entire programming language. Here is the grammar for a mini-programming language:
===================================================

PROGRAM ::=  COMMANDLIST
COMMANDLIST ::=  COMMAND  |  COMMAND ; COMMANDLIST
COMMAND ::=  VAR = EXPRESSION
             |  print VAR
             |  while EXPRESSION : COMMANDLIST end
EXPRESSION ::= NUMERAL  |  VAR  |  ( EXPRESSION OPERATOR EXPRESSION )
OPERATOR is  + or  -
NUMERAL  is a sequence of digits from the set, {0,1,2,...,9} 
VAR  is a string beginning with a letter, not  'print', 'while', or 'end'

===================================================
The definition says that a program is a list (sequence) of commands, which can be assignments or prints or while-loops. The body of a while-loop is itself a list of commands. The grammar does not explain what the phrases mean, so we cannot determine here how a command like, while x : x = (x - 1) end, operates.

You should draw the derivation trees for these PROGRAMs:


8.2 Operator trees

A derivation tree shows the internal structure of a sentence --- the phrases and words. The grouping of the words into phrases is what is important (and not the phrase names, EXPRESSION, TERM, COMMANDLIST, etc). There is a useful way to format a derivation tree so that the words and their groupings are preserved but the phrase names themselves are dropped. This is called an operator tree or abstract-syntax tree.

Here is the operator tree that is produced from the derivation tree for (4 - (3 + 2)):

      -
     / \
    4   +
       / \
      3   2

It is called an ``operator tree'' because the operators rest in the places where the phrase names once appeared. It is much more compact than the original derivation tree, but it keeps the same branching structure, which is the crucial part.

When we program with grammars and trees, we can use a dynamic data-structures language, like Python, that lets us build nested lists. Then, we can represent an operator tree as a nested list. Here is the nested-list representation of the above operator tree:

["-", "4", ["+", "3", "2"]]
Here is another example: For ((2+1) - (3-4)), we have this operator tree (nested list):
["-", ["+", "2", "1"], ["-", "3", "4"]]

For the proposition example, (A v B) --> ~C, its operator tree is

["-->", ["v", "A", "B"], ["~", "C"]]

Finally, this program, x = 3; while x : x = (x - 1) end; print x, has this operator tree:

[["=", "x", "3"],
 ["while", "x",  [["=", "x", ["-", "x", "1"]]]],
 ["print", "x"]
]
Operator trees are easy for computer programs to use, and we will work with them from here on.
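
To make the step from derivation tree to operator tree concrete, here is a small sketch for the arithmetic-expression grammar. It assumes an encoding that the chapter does not itself use: a derivation tree is written as a nested list, [PHRASENAME, part, part, ...], whose leaves are the words. The function name to_operator_tree is likewise invented for the illustration.

```python
def to_operator_tree(d):
    """d is a derivation tree, written as [PHRASENAME, part, part, ...].
       Returns the operator tree (nested list) with the phrase names dropped.
    """
    name = d[0]
    parts = d[1:]
    if name == "NUMERAL" or name == "OPERATOR":
        return parts[0]                        # keep just the word
    # otherwise, name == "EXPRESSION"
    if len(parts) == 1:                        # EXPRESSION ::= NUMERAL
        return to_operator_tree(parts[0])
    # EXPRESSION ::= ( EXPRESSION OPERATOR EXPRESSION )
    left, op, right = parts[1], parts[2], parts[3]
    return [to_operator_tree(op), to_operator_tree(left), to_operator_tree(right)]

# the derivation tree for  (4 - (3 + 2))  in the assumed encoding:
example = ["EXPRESSION", "(",
              ["EXPRESSION", ["NUMERAL", "4"]],
              ["OPERATOR", "-"],
              ["EXPRESSION", "(",
                  ["EXPRESSION", ["NUMERAL", "3"]],
                  ["OPERATOR", "+"],
                  ["EXPRESSION", ["NUMERAL", "2"]],
               ")"],
           ")"]
```

Calling to_operator_tree(example) gives ["-", "4", ["+", "3", "2"]]: the parentheses and phrase names are gone, but the grouping survives in the nesting of the lists.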


8.3 Semantics of operator trees

When a compiler processes a program, it first builds the program's operator tree. Then, it must calculate the meaning --- the semantics --- of the tree. The process of giving meaning can be done with a recursively defined tree-traversal function.

Let's review the pattern for processing a binary operator tree. A binary operator tree has two forms, which we can define precisely yet again with a grammar:

TREE ::=  NUMERAL  |  [ OP, TREE1, TREE2 ]
That is, every binary operator tree is either just a single numeral or a list holding an operator symbol and two subtrees. If we wish to examine (traverse) a binary operator tree and compute with all its substructures, we should write a function that implements a recursion that matches the recursion in the grammar rule.

The pattern of recursion looks like this:

def process(tree) :
    if tree is an instance of a NUMERAL :
        ans = ... compute with the NUMERAL tree ...
    else :  # tree is an instance of  [OP, TREE1, TREE2]
        op = tree[0]
        subans1 = process(tree[1])
        subans2 = process(tree[2]) 
        ans = ... assemble  op,  subans1,  and  subans2
    return ans
The process function uses recursion to process the smaller trees, tree[1] and tree[2] embedded within tree to get their answers, and then it computes on these answers for its own answer.
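
As a first, very small instance of the pattern, here is a sketch of a function that counts the operator symbols in a tree. The name count_operators is made up for this example; the shape of the function is the pattern's shape, with the two "..." parts filled in.

```python
def count_operators(tree):
    """pre:  tree is a TREE, where TREE ::= NUMERAL | [ OP, TREE1, TREE2 ]
       post: ans is the number of operator symbols within tree
    """
    if isinstance(tree, str):          # tree is a NUMERAL: it holds no operators
        ans = 0
    else:                              # tree is [OP, TREE1, TREE2]
        subans1 = count_operators(tree[1])
        subans2 = count_operators(tree[2])
        ans = 1 + subans1 + subans2    # one for OP, plus those in the subtrees
    return ans
```

For ["-", ["+", "2", "1"], ["-", "3", "4"]], the function returns 3.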

The classic use of this technique is for evaluation, that is, computing the underlying meaning (semantics) of the tree. Evaluation is used for languages like Python, Lisp, and Prolog. Let's write an evaluator to compute the integer value represented by an operator tree. (For example, ["-", ["+", "2", "1"], ["-", "3", "4"]] computes to the integer, 4.) Here is the function that evaluates an operator tree to its integer meaning:

===================================================

def eval(t) :
    """pre:  t  is a TREE, 
             where TREE ::= NUMERAL | [ OP, TREE1, TREE2 ]
                   OP is  "+" or "-"
       post: ans is the numerical meaning of t
    """
    if isinstance(t, str) and t.isdigit() : # is  t  a string holding an int?
        ans = int(t)
    else :  # t is a list, [op, t1, t2]
        op = t[0]
        t1 = t[1]
        t2 = t[2]
        ans1 = eval(t1)
        # assert:  ans1  is the numerical meaning of t1
        ans2 = eval(t2)
        # assert:  ans2  is the numerical meaning of t2
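        if op == "+" :
            ans = ans1 + ans2
        elif op == "-" :
            ans = ans1 - ans2
        else :  # op is not a legal OP, so  t  violates the precondition
            raise ValueError("illegal operator: " + str(op))
    return ans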

===================================================
This time, we use the recursion to compute the numerical meanings of the subtrees, and combine these meanings to reach the meaning of the complete tree.

Here is a sketch of the execution of eval( ["-", ["+", "2", "1"], ["-", "3", "4"]] ), which computes to 3 - (-1) = 4:

===================================================

eval( ["-", ["+", "2", "1"], ["-", "3", "4"]] )

=>  op = "-"
    t1 = ["+", "2", "1"]
    t2 = ["-", "3", "4"]
    ans1 = eval(t1)      =>  op = "+"
                             t1 = "2"
                             t2 = "1"
                             ans1 = eval(t1)  =>  ans = 2
                                  = 2
                             ans2 = eval(t2)  =>  ans = 1
                                  = 1
                             ans = 2+1 = 3
         = 3
    ans2 = eval(t2)      =>  op = "-"
                             t1 = "3"
                             t2 = "4"
                             ans1 = eval(t1)  =>  ans = 3
                                  = 3
                             ans2 = eval(t2)  =>  ans = 4
                                  = 4
                             ans = 3-4 = -1
         = -1
    ans = 3 - (-1) = 4
= 4

===================================================
Again, each => represents a recursive call (restart) of eval on a subtree of the original operator tree. Each restart keeps its own namespace of its local variables, which it uses to compute the answer for its subtree. Eventually, the answers are returned and combined. You can see that the pattern of calls to eval matches the pattern of structure in the original operator tree.

This style of processing can be used on operator trees for any grammar at all. Let's review the grammar for the mini-programming language:

===================================================

PROGRAM ::=  COMMANDLIST
COMMANDLIST ::=  COMMAND  |  COMMAND ; COMMANDLIST
COMMAND ::=  VAR = EXPRESSION
             |  print VAR
             |  while EXPRESSION : COMMANDLIST end
EXPRESSION ::= NUMERAL  |  VAR  |  ( EXPRESSION OPERATOR EXPRESSION )
OPERATOR is  +  or  -
NUMERAL  is a sequence of digits from the set, {0,1,2,...,9}
VAR  is a string beginning with a letter; it cannot be  'print', 'while', or 'end'

===================================================
For the program,
x = 3 ;  while x : x = (x - 1) end ;  print x
here is its operator tree:
[["=", "x", "3"],
 ["while", "x",  [["=", "x", ["-", "x", "1"]]]],
 ["print", "x"]
]
The operator tree has three ``levels'' --- a commandlist level, a command level, and an expression level, so it is better for us to keep these separate and say that operator trees look like this:
CLIST ::=  [ CTREE+ ]
           where  CTREE+  means  one or more CTREEs
CTREE ::=  ["=", VAR, ETREE]  |  ["print", VAR]  |  ["while", ETREE, CLIST]
ETREE ::=  NUMERAL  |  VAR  |  [OP, ETREE1, ETREE2]
           where OP is either "+" or "-"
Now, we write three functions: one for CLIST trees, one for CTREEs, and one for ETREEs. This set of functions is called an interpreter. It is like the interpreter that underlies the computer implementation of Python and Java. It defines the semantics of the sentences in the computer language.
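
The same one-function-per-level idea can be tried first on a simpler job than interpretation: checking that a nested list really has the shape of a CLIST. The sketch below is not part of the interpreter, and its names (KEYWORDS, is_var, is_etree, is_ctree, is_clist) are invented for the illustration; it follows the grammar's statement that a VAR begins with a letter and is not a keyword.

```python
KEYWORDS = ("print", "while", "end")

def is_var(w):
    """VAR is a string beginning with a letter, and not a keyword."""
    return isinstance(w, str) and w[:1].isalpha() and w not in KEYWORDS

def is_etree(e):
    """ETREE ::= NUMERAL | VAR | [OP, ETREE1, ETREE2]"""
    if isinstance(e, str):
        return e.isdigit() or is_var(e)
    return (isinstance(e, list) and len(e) == 3 and e[0] in ("+", "-")
            and is_etree(e[1]) and is_etree(e[2]))

def is_ctree(c):
    """CTREE ::= ["=", VAR, ETREE] | ["print", VAR] | ["while", ETREE, CLIST]"""
    if not isinstance(c, list) or len(c) == 0:
        return False
    if c[0] == "=":
        return len(c) == 3 and is_var(c[1]) and is_etree(c[2])
    if c[0] == "print":
        return len(c) == 2 and is_var(c[1])
    if c[0] == "while":
        return len(c) == 3 and is_etree(c[1]) and is_clist(c[2])
    return False

def is_clist(p):
    """CLIST ::= [ CTREE+ ]   (one or more CTREEs)"""
    return isinstance(p, list) and len(p) > 0 and all(is_ctree(c) for c in p)
```

Notice that is_ctree calls is_clist for the body of a while-command, and is_clist calls is_ctree for each command: the functions refer to one another exactly where the grammar rules do.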

Here is the interpreter for the programming language; it reads an operator tree and computes the meaning (executes the commands) of the tree. It would be best for you to note first the global variable, memory, then study interpretETREE (which works like eval above), and then study interpretCTREE, which executes the assignment, print, and while-commands:

===================================================

memory = {}  # a global variable that holds the values of the VARIABLES
             # used in program,  p.  It is implemented as a Python dictionary.

def interpretCLIST(p) :
    """pre: p  is a program represented as a  CLIST ::=  [ CTREE+ ]
                  where  CTREE+  means  one or more CTREEs
       post:  memory  holds all the updates commanded by program  p
    """
    for command in p :
        interpretCTREE(command)


def interpretCTREE(c) :
    """pre: c  is a command represented as a CTREE:
         CTREE ::= ["=", VAR, ETREE] | ["print", VAR] | ["while", ETREE, CLIST]
       post:  memory  holds all the updates commanded by  c
    """
    operator = c[0]
    if operator == "=" :   # assignment command
        var = c[1]
        exprval = interpretETREE(c[2])
        memory[var] = exprval
    elif operator == "print" :   # print command
        print memory[c[1]]
    elif operator == "while" :   # while command
        expr = c[1]
        body = c[2]
        while (interpretETREE(expr) > 0) :
            interpretCLIST(body)
    else :   # error
        crash("invalid command")


def interpretETREE(e) :
    """pre: e  is an expression represented as an ETREE:
           ETREE ::=  NUMERAL  |  VAR  |  [OP, ETREE1, ETREE2]
                      where OP is either "+" or "-"
      post:  ans  holds the numerical value of  e
    """
    if isinstance(e, str) and  e.isdigit() :   # a numeral
        ans = int(e)
    elif isinstance(e, str) and len(e) > 0  and  e[0].isalpha() :  # var name
        if e in memory :
            ans = memory[e]
        else :
            crash("variable name undefined")
    else :   #  [op, e1, e2]
        op = e[0]
        ans1 = interpretETREE(e[1])
        ans2 = interpretETREE(e[2])
        if op == "+" :
            ans = ans1 + ans2
        elif op == "-" :
            ans = ans1 - ans2
        else :
            crash("illegal arithmetic operator")
    return ans


def crash(message) :
    """pre: message is a string
       post: message is printed and interpreter stopped
    """
    print message + "! crash! core dump: ", memory
    raise Exception   # stops the interpreter

===================================================
To start the interpreter with a program, type interpretCLIST(p), where p is a program represented as an operator tree. For example,
interpretCLIST([["=", "x", "3"],
                ["while", "x",  [["=", "x", ["-", "x", "1"]]]],
                ["print", "x"]
               ])
will assign 3 to "x" in global variable, memory; will next decrement "x"'s value from 3 to 2 to 1 to 0 and stop the loop; and will print x's final value in memory, namely, 0. The interpreter works in the same way that Python's and Java's interpreters work. In particular, interpretCTREE shows that the semantics of the while-command is that the expression part is evaluated and then the body part is executed, over and over, as long as the expression part evaluates to an int that is positive.
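
If you want to experiment without a global variable, the interpreter can be rearranged so that each run gets a fresh memory and the printed values are collected in a list. The following is a compact sketch of that variation, not a replacement for the code above: the name run and its nested helper functions are made up here, and errors are reported with Python's built-in exceptions instead of crash.

```python
def run(program):
    """Interprets a CLIST and returns (memory, output), where output lists
       the values that the print commands would have displayed."""
    memory = {}
    output = []

    def expr(e):                           # ETREE level
        if isinstance(e, str):
            if e.isdigit():
                return int(e)
            return memory[e]               # raises KeyError if e is undefined
        op, left, right = e
        a, b = expr(left), expr(right)
        if op == "+":
            return a + b
        if op == "-":
            return a - b
        raise ValueError("illegal arithmetic operator: " + str(op))

    def command(c):                        # CTREE level
        if c[0] == "=":
            memory[c[1]] = expr(c[2])
        elif c[0] == "print":
            output.append(memory[c[1]])
        elif c[0] == "while":
            while expr(c[1]) > 0:
                clist(c[2])
        else:
            raise ValueError("invalid command: " + str(c[0]))

    def clist(p):                          # CLIST level
        for c in p:
            command(c)

    clist(program)
    return memory, output
```

Running the example program through run returns the memory {"x": 0} and the output list [0], which agrees with the description above.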

Another use of recursively defined functions is for translation, the systematic rewriting of a program from one language to another. Here is an example, where we process an operator tree to assemble the postfix-string representation of the tree:

===================================================

def postfix(t) :
    """pre:  t  is a TREE,  where TREE ::= NUM | [ OP, TREE1, TREE2 ]
       post: ans  is a string holding a postfix (operator-last) sequence
             of the symbols within  t
    """
    if isinstance(t, str) :  # is  t  an instance of a NUM (a simple string) ?
        ans = t              # the postfix of a NUM is just the NUM itself
    else :  # t is a list, [op, t1, t2], that is,  isinstance(t, list)
        op = t[0]
        t1 = t[1]
        t2 = t[2]
        ans1 = postfix(t1)
        # assert:  ans1  is a string holding the postfix form of t1
        ans2 = postfix(t2)
        # assert:  ans2  is a string holding the postfix form of t2
        ans = ans1 + ans2 + op  # the answer combines the subanswers
    # assert: ans holds the postfix form of  t
    return ans

===================================================
The function's recursion matches exactly the recursion in the grammar rule that defines the set of operator trees.

For example, postfix(["+", ["-", "2", "1"], "4"]) builds the postfix string, "21-4+". We can draw the results of the function call like this:

===================================================

postfix(["+", ["-", "2", "1"], "4"])

=> op = "+"
   t1 = ["-", "2", "1"]
   t2 = "4"
   ans1 = postfix(t1)  =>  op = "-"
                               t1 = "2"
                               t2 = "1"
                               ans1 = postfix(t1) => ans = "2"
                                    = "2"
                               ans2 = postfix(t2) => ans = "1"
                                    = "1"
                               ans = "2" + "1" + "-"
                                   = "21-"
        = "21-"
   ans2 = postfix(t2)  => ans = "4"
        = "4"
   ans = "21-" + "4" + "+"
       = "21-4+"

===================================================
Each => represents a recursive call (restart) of postfix on a subtree of the original operator tree. Each restart keeps its own copy (namespace) of its local variables, which it uses to compute the answer for its subtree. Eventually, the answers are returned and combined. You can see that the pattern of calls to postfix matches the structure in the original operator tree.

If the previous explanation was not enough for you, you can insert print commands into postfix so that the computer shows you the path it takes to analyze an operator tree:

===================================================

def postfixx(level, t) :
    """pre:  t  is a TREE,  where TREE ::= NUMERAL | [ OP, TREE1, TREE2 ]
             level is an int, indicating at what depth  t  is situated
               in the overall tree being postfixxed
       post: ans  is a string holding a postfix (operator-last) sequence
             of the symbols within  t
    """
    print level * "   ", "Entering subtree t=", t
    if isinstance(t, str) :  # is  t a numeral?
        ans = str(t)
    else :  # t is a list, [op, t1, t2]
        op = t[0]
        t1 = t[1]
        t2 = t[2]
        ans1 = postfixx(level + 1, t1)
        ans2 = postfixx(level + 1, t2) 
        ans = ans1 + ans2 + op  # the answer combines the two subanswers
    print level * "   ", "Exiting subtree t=", t,  "  ans=", ans
    print
    return ans

===================================================
If you call this function, say, postfixx(0, ["+", "2" , ["-", "3" , "4"]]), you will see this printout:
===================================================

 Entering subtree t= ['+', '2', ['-', '3', '4']]
    Entering subtree t= 2
    Exiting subtree t= 2   ans= 2

    Entering subtree t= ['-', '3', '4']
       Entering subtree t= 3
       Exiting subtree t= 3   ans= 3

       Entering subtree t= 4
       Exiting subtree t= 4   ans= 4

    Exiting subtree t= ['-', '3', '4']   ans= 34-

 Exiting subtree t= ['+', '2', ['-', '3', '4']]   ans= 234-+

===================================================
This shows that the computer descended into the levels of the tree from left to right, computing answers for its leaves that were combined into an answer for the entire tree.

A compiler for a programming language like Java or C# does a series of such translations to convert Java code into byte code: (i) the original Java program is translated into an operator tree; (ii) the operator tree is translated into a nested-list representation, known as three-address code; (iii) the three-address code is translated into a long string called byte code.
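
To give a feel for step (ii), here is a toy sketch that translates an arithmetic operator tree into a list of three-address instructions. Everything specific in it is an assumption made for the illustration: the instruction layout, [dest, op, arg1, arg2], the temporary names t1, t2, ..., and the function name three_address. Real compilers use their own, richer formats.

```python
def three_address(t):
    """pre:  t is a TREE, where TREE ::= NUMERAL | [ OP, TREE1, TREE2 ]
       post: returns (code, place), where  code  is a list of instructions,
             [dest, op, arg1, arg2],  and  place  names where t's value is held
    """
    code = []

    def walk(tree):
        if isinstance(tree, str):            # a NUMERAL is its own "place"
            return tree
        op, t1, t2 = tree
        place1 = walk(t1)                    # emit the code for the subtrees first
        place2 = walk(t2)
        dest = "t" + str(len(code) + 1)      # a fresh temporary: t1, t2, ...
        code.append([dest, op, place1, place2])
        return dest

    place = walk(t)
    return code, place
```

For ["-", ["+", "2", "1"], ["-", "3", "4"]], the instructions produced are [t1, +, 2, 1], then [t2, -, 3, 4], then [t3, -, t1, t2], and the answer is held in t3. As with postfix, the answers for the subtrees are assembled before the answer for the whole tree.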


8.4 Parsing: How to construct an operator tree

Because operator trees are processed so easily by recursively defined functions, it is best to write a program that reads the original input text and builds the operator tree straightaway. (The derivation tree itself is never built!) This activity is called parsing. The compiler or interpreter for every programming language does parsing before it does translation or interpretation.

For example, say we want a program, parseEXPR, that can read a line of text, like ((2+1) - (3 - 4) ), and build the operator tree, ["-", ["+", "2", "1"], ["-", "3", "4"]]. The program's algorithm will go like this:

  1. Separate the symbols in the text line into their individual operators and words, discarding blanks. Make a list of the words.
  2. Read the words in the list one by one, using the grammar rules to guide building the operator tree.
The second step looks like a serious challenge, but we can use the same technique seen in the previous section: we write a family of functions, one per grammar rule, that reads the words and builds the tree. This technique is called recursive-descent parsing.

For the first step, here is a little function that disassembles a line of text and makes a list of words that were found in the text:

===================================================

def scan(text) :
    """scan splits apart the symbols in  text  into a list.

       pre:  text is a string holding an arithmetic expression or a program
       post: answer is a list of the words and symbols within  text
             (spaces and newlines are removed)
    """
    OPERATORS = ("(", ")", "+", "-", ";", ":", "=")  # must be one-symbol operators
    SPACES = (" ", "\n")
    SEPARATORS = OPERATORS + SPACES
 
    nextword = ""  # the next symbol/word to be added to the answer list
    answer = []

    for letter in text:
        # invariant:  answer + nextword + letter  equals all the words/symbols
        #     read so far from  text  with  SPACES  removed

        # see if  nextword  is complete and should be appended to answer:
        if letter in SEPARATORS  and  nextword != "" :
            answer.append(nextword)
            nextword = ""

        if letter in OPERATORS :
            answer.append(letter)
        elif letter in SPACES :
            pass  # discard space
        else :    # build a word or numeral
            nextword = nextword + letter

    if nextword != "" :   
        answer.append(nextword)

    return answer

===================================================
For example, scan("((2+1) - (3 - 4) )") returns as its answer, ['(', '(', '2', '+', '1', ')', '-', '(', '3', '-', '4', ')', ')'].
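
The same splitting can also be sketched with Python's re module. This is an alternative, not the chapter's scan function, and the names TOKEN and scan_re are invented here. It assumes that words are built only from letters and digits; unlike scan, it silently skips any other character that is not one of the listed operators.

```python
import re

# a word (letters and digits)  or  one of the one-symbol operators
TOKEN = re.compile(r"[A-Za-z0-9]+|[()+\-;:=]")

def scan_re(text):
    """returns a list of the words and symbols within text, spaces removed"""
    return TOKEN.findall(text)
```

On the text, "((2+1) - (3 - 4) )", scan_re produces the same list of words as scan does.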

Now, we use the grammar rule to guide us to writing the function that reads the list of words and constructs the operator tree. Here is the grammar rule for arithmetic expressions:

EXPRESSION ::= NUMERAL  |  VAR  | ( EXPRESSION OPERATOR EXPRESSION )
               where OPERATOR is  "+"  or "-"
                     NUMERAL  is a sequence of digits
                     VAR  is  a string of letters
For each construction in the grammar rule, there is an operator tree to build:
for NUMERAL,  the tree is  NUMERAL

for VAR,  the tree is  VAR

for ( EXPRESSION1 OPERATOR EXPRESSION2 ),  the tree is [OPERATOR, T1, T2]
      where T1 is the tree for EXPRESSION1
            T2 is the tree for EXPRESSION2
We write a function, parseEXPR, that reads the words of an arithmetic expression and builds the tree, based on the words. Like the eval function seen earlier, the grammar rules show us what to do.

It is simplest to use these global variables and a helper function to parcel out the input words one at a time:

===================================================

# say that  inputtext  holds the text that we must parse into a tree:

wordlist = scan(inputtext)  # holds the remaining unread words
nextword = ""  # holds the first unread word
# global invariant:  nextword + wordlist == all remaining unread words

EOF = "!"      # a word that marks the end of the input words

def getNextword() :
    """moves the front word in  wordlist  to  nextword.
       If wordlist is empty, sets  nextword = EOF
    """
    global nextword, wordlist
    if wordlist == [] :
        nextword = EOF
    else :
        nextword = wordlist[0]
        wordlist = wordlist[1:]

getNextword()   # call this function to move the first word into  nextword

===================================================
The function that builds expression-operator trees looks like this:
===================================================

def parseEXPR() :
    """builds an EXPR operator tree from the words in  nextword + wordlist,
          where  EXPR ::=  NUMERAL  |  VAR  |  ( EXPR OP EXPR )
                 OP is "+" or "-"
      also, maintains the global invariant (on wordlist and nextword)
    """
    if  nextword.isdigit() :   # a NUMERAL ?
        ans = nextword        
        getNextword()
    elif  isVar(nextword) :    # a VARIABLE ?
        ans = nextword
        getNextword()
    elif nextword == "(" :     # ( EXPR op EXPR ) ?
        getNextword()
        tree1 = parseEXPR()
        op = nextword
        if op == "+"  or  op == "-" :
            getNextword()
            tree2 = parseEXPR()
            if nextword == ")" :
                ans = [op, tree1, tree2]
                getNextword()
            else :
                error("missing )")
        else :
            error("missing operator")
    else :
        error("illegal symbol to start an expression")

    return ans


def isVar(word) :
    """checks whether  word  is a legal variable name"""
    KEYWORDS = ("print", "while", "end")
    ans = ( word.isalpha()  and  not(word in KEYWORDS) )
    return ans


def error(message) :
    """prints an error message and halts the parse"""
    print "parse error: " + message
    print nextword, wordlist
    raise Exception

===================================================
Function parseEXPR uses the grammar rule to ask the appropriate questions about nextword, the next input word, to decide which form of operator tree to build. It is not an accident that the grammar rule for EXPRESSION is defined so that each of the three choices for an expression begins with a unique word/symbol. This is the key to choosing the appropriate form of tree to build.

We tie the pieces together like this:

===================================================

# global invariant:  nextword + wordlist == all remaining unread words
nextword = ""  # holds the first unread word
wordlist = []  # holds the remaining unread words
EOF = "!"      # a word that marks the end of the input words

def main() :
    global wordlist
    # read the input text, break into words, and place into wordlist:
    text = raw_input("Type an arithmetic expression: ")
    wordlist = scan(text)
    # do the parse:
    getNextword()
    tree = parseEXPR()
    print tree
    if nextword != EOF :
       error("there are extra words")

===================================================
We can use the above functions to build a parser for the mini-programming language. Using the same technique, we make functions that parse commands and lists of commands:
===================================================

def parseCOMMAND() :
    """builds a COMMAND operator tree from the words in  nextword + wordlist,
       where  COMMAND ::=  VAR = EXPRESSION
                        |  print VAR
                        |  while EXPRESSION : COMMANDLIST end
      also, maintains the global invariant (on wordlist and nextword)
    """
    if nextword == "print" :      # print VARIABLE ?
        getNextword()
        if isVar(nextword) :
            ans = ["print", nextword]
            getNextword()
        else :
            error("expected var")
    elif nextword == "while" :    # while EXPRESSION : COMMANDLIST end ?
        getNextword()
        exprtree = parseEXPR()
        if nextword == ":" :
            getNextword()
        else :
            error("missing :")
        cmdlisttree = parseCMDLIST()
        if nextword == "end" :
            ans = ["while", exprtree, cmdlisttree]
            getNextword()
        else :
            error("missing end")
    elif isVar(nextword) :       # VARIABLE = EXPRESSION ?
        v = nextword
        getNextword()
        if nextword == "=" :
            getNextword()
            exprtree = parseEXPR()
            ans = ["=", v, exprtree]
        else :
            error("missing =")
    else :                       # error -- bad command
        error("bad word to start a command")
    return ans

===================================================
We finish with the function that collects the commands in a COMMANDLIST:
===================================================

def parseCMDLIST() :
    """builds a COMMANDLIST tree from the words in  nextword + wordlist,
       where  COMMANDLIST ::=  COMMAND  |  COMMAND ; COMMANDLIST
                          that is,  one or more COMMANDS, separated by ;s
       also, maintains the global invariant (on wordlist and nextword)
    """
    anslist = [ parseCOMMAND() ]   # parse first command
    while nextword == ";" :        # collect any other COMMANDS
        getNextword()
        anslist.append( parseCOMMAND() )
    return anslist

def main() :
    """reads the input mini-program and builds an operator tree for it,
       where  PROGRAM ::= COMMANDLIST
       Initializes the global invariant (for nextword and wordlist)
    """
    global wordlist
    text = raw_input("Type the program: ")
    wordlist = scan(text)
    getNextword()
    # assert: invariant for nextword and wordlist holds true here
    tree = parseCMDLIST()   
    # assert: tree holds the entire operator tree for  text
    print tree
    if nextword != EOF :
       error("there are extra words")

===================================================
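
As a closing variation, recursive descent does not depend on global variables. The sketch below parses expressions only, and passes the position in the word list as a parameter and returns the new position along with the tree. The names parse_expr and parse are invented for this illustration, and the sketch reports errors with Python's built-in SyntaxError instead of the error function used above.

```python
KEYWORDS = ("print", "while", "end")

def parse_expr(words, i):
    """Parses one EXPR starting at words[i].
       Returns (tree, j), where j indexes the first word after the EXPR.
       EXPR ::= NUMERAL | VAR | ( EXPR OP EXPR ),   OP is "+" or "-"
    """
    if i >= len(words):
        raise SyntaxError("unexpected end of input")
    word = words[i]
    if word.isdigit() or (word.isalpha() and word not in KEYWORDS):
        return word, i + 1                   # NUMERAL or VAR
    if word == "(":                          # ( EXPR OP EXPR )
        tree1, i = parse_expr(words, i + 1)
        if i >= len(words) or words[i] not in ("+", "-"):
            raise SyntaxError("missing operator")
        op = words[i]
        tree2, i = parse_expr(words, i + 1)
        if i >= len(words) or words[i] != ")":
            raise SyntaxError("missing )")
        return [op, tree1, tree2], i + 1
    raise SyntaxError("illegal symbol to start an expression: " + word)

def parse(words):
    """Parses the entire word list as one EXPR and returns its operator tree."""
    tree, i = parse_expr(words, 0)
    if i != len(words):
        raise SyntaxError("there are extra words")
    return tree
```

Given the words of ((2+1) - (3 - 4)), parse returns ["-", ["+", "2", "1"], ["-", "3", "4"]]. The questions asked about words[i] are the same ones that parseEXPR asks about nextword; only the bookkeeping differs.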