Copyright © 2010 David Schmidt

Chapter 2:
Two virtual-machine architectures for imperative languages


2.1 The C-virtual machine
    2.1.1 An interpreter for a C-like language
    2.1.2 Data-type declarations
    2.1.3 A compiler for C
2.2 The object virtual machine
    2.2.1 An interpreter for an object-oriented language


The previous chapter showed how to implement a language's syntax with a parser and the semantics with an interpreter. The former converts a sequence of textual words into an operator tree, which we represent as a nested list. The latter traverses the operator tree and updates internal data structures (e.g., the namespace).

The interpreter in the previous chapter is overly naive. In this chapter we study two more realistic virtual-machine architectures, both for assignment-based ("imperative") languages. One is for C and one is for Smalltalk. Here are the key features of each:


2.1 The C-virtual machine

To understand the previous paragraph, we look at the virtual machine for C. The controller ("processor unit") is implemented by the interpreter functions, and there are two data structures: an environment ("symbol table") and linear memory. Data is stored in the memory's cells, and the location numbers of the cells are saved in the environment. Here is a sample configuration:


The symbol table maps each variable name to a location number in memory. The memory holds ints in its cells.

The picture shows that variable x names location 0 in the memory; y names location 1, and z names location 4. In the memory itself, location 0 holds 8 and locations 3 and 4 hold 9. Perhaps the configuration was constructed by these commands:

int x;
int[3] y;
int z;
x = 8;  z = x + 1;  y[2] = z;
The first three lines are declarations, which in C are storage allocation commands. ("Data types" in C tell the C compiler how many storage cells to reserve in memory.) The locations of the cells are saved in the environment. So, x names 0, and y names 1, which is the starting, or "base," address of a three-celled array. z names 4.

A C-programmer must be conscious of a variable's meaning (its location number) and the dereferencing of the meaning (the number stored in the cell indexed by the location).

For example, x = 8 says ``store int 8 in the location that x means'', that is, store 8 in location 0.

But the meaning of x in the expression, x + 1, on the right-hand side of the assignment, z = x + 1, is not 0, but 8. When a variable appears in an expression, its meaning is its location dereferenced to extract the int stored in the corresponding memory cell. So, the meaning of z = x + 1 is ``store int 8+1 = 9 in location 4.''

The meaning of the left-hand side of an assignment in C is always a location. The meaning of y[2] in the last assignment, y[2] = z, is location 1 + 2 = 3 --- y's base address plus offset 2. This means location 3 receives int, 9, which was dereferenced from z's location, 4.
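The difference between a variable's location and its dereferenced value can be traced in a few lines of Python (written in Python 3 syntax). This is only a sketch of the configuration described above: the helper names lval and rval are illustrative, and cells that the C program never assigns are shown as 0, which is an assumption.

```python
# Symbol table: each name maps to a location number (y maps to its base address).
env = {"x": 0, "y": 1, "z": 4}
# Linear memory with five cells; never-assigned cells are shown as 0 (an assumption).
memory = [0, 0, 0, 0, 0]

def lval(name, offset=0):
    """Meaning of a left-hand side: a location number."""
    return env[name] + offset

def rval(name, offset=0):
    """Meaning inside an expression: the int stored at the location."""
    return memory[lval(name, offset)]

memory[lval("x")] = 8                # x = 8      stores 8 at location 0
memory[lval("z")] = rval("x") + 1    # z = x + 1  stores 9 at location 4
memory[lval("y", 2)] = rval("z")     # y[2] = z   stores 9 at location 1 + 2 = 3

print(memory)   # [8, 0, 0, 9, 9]
```

The left-hand side of each assignment goes through lval only, while every variable on the right-hand side goes through rval, which is exactly the asymmetry described in the text.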

C allows a programmer to write code about locations and dereferencing via the operations, & and * --- & means "the location of" and * means "do a dereference of." These two operators are used with "pointer variables". Say that we extend the above example with these lines of C-code:

int* p = &x;    // pointer  p's  meaning is location 5,
                //   whose cell receives  0
y[0] = p;       // location 1's cell receives  0,
                //   the value from dereferencing location 5

y[1] = *p;      // location 2's cell receives  8,
                //   found by dereferencing location 5 (giving 0)
                //   and then dereferencing location 0
y[2] = &p + 1;  // location 3 holds 6, which is  p's location plus one (!)

C lets you use locations the same way you use ints. You can even do arithmetic with them. They are used a lot in systems programming, and it is critical that you understand them.
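The four pointer assignments can be replayed on a small Python model (Python 3 syntax). The sketch assumes the configuration left by the earlier commands, places p at location 5 as the comments state, and shows p's fresh cell as 0. Locations are modelled as plain list indexes counted in cells, following the chapter's picture, not as real machine addresses.

```python
# State after  x = 8; z = x + 1; y[2] = z,  plus one fresh cell for p.
env = {"x": 0, "y": 1, "z": 4, "p": 5}
memory = [8, 0, 0, 9, 9, 0]

# p = &x : location 5 receives x's location, 0
memory[env["p"]] = env["x"]
# y[0] = p : location 1 receives the value stored at location 5, which is 0
memory[env["y"] + 0] = memory[env["p"]]
# y[1] = *p : dereference location 5 (giving 0), then location 0 (giving 8)
memory[env["y"] + 1] = memory[memory[env["p"]]]
# y[2] = &p + 1 : p's own location plus one, no dereference at all
memory[env["y"] + 2] = env["p"] + 1

print(memory)   # [8, 0, 8, 6, 9, 0]
```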


2.1.1 An interpreter for a C-like language

Here is the (syntax of the) declaration-free command language from the previous chapter augmented by the two operations, &L (``the location number named by variable L'') and *L (``the value stored in the memory cell labelled by location number L'').
===================================================

P : Program 
CL : CommandList           L : LefthandSide
C : Command                I : Variable
E : Expression             N : Numeral

P ::=  CL

CL ::= C  |  C ; CL
C ::=  L = E  |  while E : CL end  |  print L 

E ::=  N  |  ( E1 + E2 )  |  L  |  & L

L ::=  I  |  * L

N ::=  string of digits
I ::=  strings of letters, not including the keywords,  while,  print,  end

===================================================
There is a new syntax domain, LefthandSide, which names the phrases that can appear on the left-hand side of an assignment. The semantics of a lefthand-side phrase, L, is a location number. The semantics of an expression, E, is an int (or a location number).

Here is a sample configuration:


Perhaps the configuration was constructed by these commands:
y = 5;  z = 0;  x = (6 + y);
Since there are no declarations in our little language, the environment is updated and memory cells are allocated when new variables are first mentioned. (See the Exercise at the end of this section for declarations.)

This is the crucial feature of our C-like language: Each variable has a location number, and each location number labels a memory cell that holds an int. For this assignment,

x = x + 1;
the meaning of x on the left-hand side of the assignment is location 2 --- call this x's L-value. The meaning of x on the right-hand side of the assignment is 11 (the int stored in location 2). Call this x's R-value. Thus, location 2's cell will be updated to 12.

Now, return to the above picture. That same environment-memory layout is constructed by these commands, which manipulate location numbers as well as ints:

===================================================

y = 4;
z = &y;       # store in  z's  cell the location named by  y
x = (7 + *z);  # add 7 to the int stored at the location stored in  z's  location
               #  and store the sum in the location named by  x  (whew!)
*z = (y + 1)   # add 1 to the int in  y's  cell and store the sum at the location saved in
               #  z's location

===================================================
The first command works like usual, but the second stores y's location (0) into z's location --- read &y as ``y's location.'' Some people say that ``z points to y.'' The third command adds 7 to the integer found at location 0:
  1. z's value is 0
  2. *z's value is 4, which is the int stored at location 0.
The sum, 11, is stored in location 2. You can read *z as ``the value in the location pointed to by z.'' The * is called a dereferencing operator.

The final command computes y + 1, that is, 5, and stores it at location 0: z names location 1, location 1 holds 0, and so the L-value of *z is location 0.

The syntax of a C-assignment is expressed like this:

L = E
where both L and E can be compound phrases. The meaning of the Lefthandside L-phrase is called an L-value; the meaning of the Expression E-phrase is called an R-value. In our version of C, an L-value is always a location, and an R-value is always an int (locations are ints in C!). For example, for
z = &y
The L-value is location 1 and the R-value is (int/location) 0. For
*z = (y + 1)
the L-value is location 0 and the R-value is int 5.

We study the interpreter to understand & and *. Here are the operator trees the interpreter uses; they are the same as in the last chapter, plus the two new forms:

===================================================

PTREE ::=  CLIST

CLIST ::=  [ CTREE+ ]
           where  CTREE+  means  one or more CTREEs

CTREE ::=  ["=", LTREE, ETREE]  |  ["while", ETREE, CLIST]  |  ["print", LTREE]

ETREE ::=  NUM  |  ["+", ETREE, ETREE]  |  ["&", LTREE]  |  LTREE

LTREE ::=  VAR  |  ["*", LTREE]

NUM ::=  string of digits

VAR ::=  string of letters but not  "while" or "print" or "end"

===================================================
For example, the PTREE for the earlier example, y = 4; z = &y; x = (7 + *z); *z = (y + 1), looks like this:
[["=", "y", "4"],
 ["=", "z", ["&", "y"]],
 ["=", "x", ["+", "7", ["*", "z"]]],
 ["=", ["*", "z"], ["+", "y", "1"]]
]

Here is the coded interpreter --- the entire virtual machine, controller plus data structures. Read the comments at the top and then read the coding of interpretLTREE, which calculates the meaning of an assignment's left-hand side, namely, a location number.

Next, study interpretETREE to see how ["&", LTREE] computes to a meaning different from just LTREE. This is crucial, because the former computes to a location number and the latter computes to the number stored at the location number. It may look strange, but that's how it's done in C!

===================================================

"""Interpreter for a C-like mini-language with pointers.
   There are two crucial data structures:
"""
memory = []  # a global variable that models primary storage.
             # It is a list (array), where the indexes to the array are
             # meant to model addresses.  For example, if
             #   memory = [5,0,11],  then location 2 of  memory  holds  11.

env = {}     # a global variable that holds the program's variable names
             # and the locations that each denotes. 
             # It is a Python hash table (dictionary).  For example, if
             #   env = {'x':2, 'y':0, 'z':1}, then variable x names location 2
             # in  memory,  and  memory[2] == 11.   This is how computer
             # storage is used in an implementation of C.   

def interpretPTREE(program) :
    """pre: program  is a program represented as a CLIST
       post:  memory  holds all the updates commanded by program  p
    """
    global memory, env       # these variables are global to main
    memory = [];  env = {}   # reset them both
    try :
        interpretCLIST(program)
    except Exception :
        print "Due to error, interpreter must quit prematurely.  Sorry."
    print "final env =", env
    print "final memory =", memory


def interpretCLIST(p) :
    """pre: p  is a program represented as a  CLIST ::=  [ CTREE+ ]
                  where  CTREE+  means  one or more CTREEs
       post:  memory  holds all the updates commanded by program  p
    """
    for command in p :
        interpretCTREE(command)


def interpretLTREE(x) :
    """pre: x  is an L-value represented as an LTREE:
         LTREE ::= VAR | ["*", LTREE]
       post: ans  holds the location named by  x
       returns:  ans
    """
    if isinstance(x, str) :   # a VAR ?
        if x in env :
            ans = env[x]  # look up its location
        else :  # it's a brand new var, so allocate a memory cell for it:
            memory.append(0)  # add a cell at the end of  memory
            env[x] = len(memory) - 1  # remember the location
            ans = env[x]
    else :  # a pointer dereference,  ["*", LTREE]
        y = interpretLTREE(x[1])   # get value of  LTREE, a location number
        ans = memory[y]    # dereference it and return the location therein
    return ans


def interpretCTREE(c) :
    """pre: c  is a command represented as a CTREE:
         CTREE ::= ["=", LTREE, ETREE] | ["print", LTREE] | ["while", ETREE, CLIST]
       post:  memory  holds all the updates commanded by  c
    """
    operator = c[0]
    if operator == "=" :   # assignment command, ["=", LTREE, ETREE]
        lval = interpretLTREE(c[1])
        exprval = interpretETREE(c[2])  # evaluate the right-hand side
        memory[lval] = exprval  # do the assignment
    elif operator == "print" :   # print command,  ["print", VAR]
        num = interpretETREE(c[1])
        print num
    elif operator == "while" :   # while command
        expr = c[1]
        body = c[2]
        while (interpretETREE(expr) != 0) :
            interpretCLIST(body) 
    else :   # error
        crash("invalid command")


def interpretETREE(e) :
    """pre: e  is an expression represented as an ETREE:
          ETREE ::=  NUMERAL  |  LTREE  |  ["&", LTREE]  |  [OP, ETREE1, ETREE2]
                     where OP is either "+" or "-"
      post:  ans  holds the value of  e
    """
    if isinstance(e, str) and  e.isdigit() :   # a NUMERAL
        ans = int(e)
    elif isinstance(e, list) and (e[0] == "+" or e[0] == "-"):  # [OP, E1, E2]
        op = e[0]
        ans1 = interpretETREE(e[1])
        ans2 = interpretETREE(e[2])
        if op == "+" : 
            ans = ans1 + ans2
        elif op == "-" : 
            ans = ans1 - ans2
        else :
            crash("illegal arithmetic operator")
    elif isinstance(e, list) and e[0] == "&" :  # ["&", LTREE] 
        ans = interpretLTREE(e[1])   # return the location named by the LTREE
                                     # DO NOT DEREFERENCE !
    else:  # is an LTREE that MUST BE DEREFERENCED:
        x = interpretLTREE(e)
        ans = memory[x]
    return ans


def crash(message) :
    """pre: message is a string
       post: message is printed and interpreter stopped
    """
    print message + "! crash! core dump: ", memory
    raise Exception   # stops the interpreter

===================================================
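For readers running a current Python, here is a condensed sketch of the same virtual machine in Python 3 syntax. It is not the listing above: it handles assignments only (no while or print), returns the final env and memory instead of printing them, and the names run, lvalue and rvalue are chosen for this sketch.

```python
def run(ptree):
    """Execute a list of ["=", LTREE, ETREE] commands; return (env, memory)."""
    env, memory = {}, []

    def lvalue(l):
        """L-value of an LTREE: a location number."""
        if isinstance(l, str):                  # a VAR
            if l not in env:                    # first mention: allocate a cell
                memory.append(0)
                env[l] = len(memory) - 1
            return env[l]
        return memory[lvalue(l[1])]             # ["*", LTREE]: the location stored there

    def rvalue(e):
        """R-value of an ETREE: an int (possibly a location number)."""
        if isinstance(e, str) and e.isdigit():  # a NUM
            return int(e)
        if isinstance(e, list) and e[0] == "+":
            return rvalue(e[1]) + rvalue(e[2])
        if isinstance(e, list) and e[0] == "&":
            return lvalue(e[1])                 # the location itself, not its contents
        return memory[lvalue(e)]                # an LTREE: dereference its location

    for op, lhs, rhs in ptree:
        loc = lvalue(lhs)                       # left-hand side first, as in the listing
        memory[loc] = rvalue(rhs)
    return env, memory

# y = 4; z = &y; x = (7 + *z); *z = (y + 1)
env, memory = run([["=", "y", "4"],
                   ["=", "z", ["&", "y"]],
                   ["=", "x", ["+", "7", ["*", "z"]]],
                   ["=", ["*", "z"], ["+", "y", "1"]]])
print(sorted(env.items()))   # [('x', 2), ('y', 0), ('z', 1)]
print(memory)                # [5, 0, 11]
```

The only difference between the "&" case and the plain LTREE case in rvalue is one memory lookup, which is the point the text asks the reader to study in interpretETREE.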

Exercise: Test the interpreter with these programs:

x = 3

interpretPTREE([["=", "x", "3"]])


x = 2;  p = &x;  y = *p

interpretPTREE([["=", "x", "2"],
      ["=", "p", ["&", "x"]],
      ["=", "y", ["*", "p"]]
     ])


x = 1;  p = &x;  *p = 2

interpretPTREE([["=", "x", "1"],
      ["=", "p", ["&", "x"]],
      ["=", ["*", "p"], "2"]
     ])


x = 0;  y = (6 + &x);  *x = 999;  print y

interpretPTREE([["=", "x", "0"],
      ["=", "y", ["+", "6", ["&", "x"]]],
      ["=", ["*", "x"], "999"], 
      ["print", "y"]
     ])

Exercise: In C, the primary purpose of a declaration is to allocate memory. In particular, arrays are laid out in memory as a sequence of cells. Say we add int and array declarations, e.g.,

int x;
int[4] r;
x = 2;
r[x] = x + 1;
int y;
would allocate one memory cell to hold an int, four memory cells to hold a linear array, and one cell for another int. The resulting environment and memory are
env = {"x": 0,  "r": (1,4),  "y": 5}

memory = [2, 0, 0, 3, 0, 0]
Note that array r has a location and a length. (Actually, x has a location and a length, too: (0,1). So does y: (5,1).)

Implement the syntax and semantics of simple int and array declarations:

D : Declaration
C : Command

D ::=  int I  |  int [ N ] I
C ::=  D  |  L = E  | ...
L ::=  I  |  I [ E ]  |  * L
E ::=  & L  |  N  |  ...
Should your interpreter check to see if the index, E, in I[E] is "in bounds"? (You can switch on or off this feature in a C compiler.)
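As one possible starting point for the allocation part of this exercise (not a full solution), the Python 3 sketch below records a (location, length) pair for every declared name and computes the L-value of I[E] as base plus offset. The helper names declare and index_lvalue, the uniform use of pairs for ints as well as arrays, the zero-filled fresh cells, and the switchable bounds check are all assumptions of this sketch.

```python
env = {}      # name -> (base location, number of cells)
memory = []   # linear storage; fresh cells are initialized to 0 (an assumption)

def declare(name, length=1):
    """Reserve `length` consecutive cells for `name` at the end of memory."""
    env[name] = (len(memory), length)
    memory.extend([0] * length)

def index_lvalue(name, offset, check_bounds=True):
    """L-value of  name[offset]:  the base address plus the offset."""
    base, length = env[name]
    if check_bounds and not (0 <= offset < length):
        raise IndexError("index %d out of bounds for %s" % (offset, name))
    return base + offset

declare("x")                                   # int x;
declare("r", 4)                                # int[4] r;
memory[env["x"][0]] = 2                        # x = 2;
x_val = memory[env["x"][0]]                    # R-value of x
memory[index_lvalue("r", x_val)] = x_val + 1   # r[x] = x + 1;
declare("y")                                   # int y;

print(env["r"])   # (1, 4)
print(memory)     # [2, 0, 0, 3, 0, 0]
```

With check_bounds switched off, index_lvalue("r", 4) quietly returns location 5, which is y's cell; that is the kind of trouble the bounds question is about.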


2.1.2 Data-type declarations

Since C's virtual machine mixes ints and location numbers in memory, there is often trouble. For this reason, C lets you declare variables with a type tag that indicates how the variable will be used, and the compiler will check some (not all) compatibilities. Here is the program seen earlier with C-style declarations embedded into it:
int y;
int* z;
y = 5;
z = &y;
int x = 6 + *z;
The declaration for z indicates that z should be used only as a pointer, that is, its R-value is a location number. The following program will generate a warning/error from the C compiler, which announces that y is misused in the assignment:
int x = 0 ;
int y;
y = &x;   // should not assign a location number to an  int  var
Although it is not so common, it is legal to declare a variable to be a pointer to a pointer to ...:
int* y;
int** z = &y;
The levels of pointers will make the interpreter work hard to check consistency of expressions and assignments. Here is the language with declarations:
===================================================

P ::=  CL
CL ::= C  |  C ; CL
C ::=  L = E  |  while E : CL end  |  print L  |  T I
T ::=  int  |  T *
E ::=  N  |  ( E1 + E2 )  |  L  |  & L
L ::=  I  |  * L
N ::=  string of digits
I ::=  strings of letters, not including the keywords,  while,  print,  end


PTREE ::=  CLIST

CLIST ::=  [ CTREE+ ]
           where  CTREE+  means  one or more CTREEs

CTREE ::=  ["=", LTREE, ETREE]  |  ["while", ETREE, CLIST]  |  ["print", LTREE]
           | ["declare", VAR, TYPE]

TYPE ::=  a list holding a sequence of  "ptr"  and  "int"  strings, e.g.,
          ["int"],  ["ptr", "int"],  ["ptr", "ptr", "int"], etc.

ETREE ::=  NUM  |  ["+", ETREE, ETREE]  |  ["&", LTREE]  |  LTREE

LTREE ::=  VAR  |  ["*", LTREE]

NUM ::=  string of digits

VAR ::=  string of letters but not  "while" or "print" or "end" or "declare" or "int"

===================================================

The interpreter remembers the type tags in the environment. The tags are checked every time a variable is used. Here is a picture of the configuration generated by the first example program in this section:


Since z is a pointer variable, its type tag, int*, is saved internally as ["ptr", "int"]. Now, read the interpreter's code to see how the declarations create type tags, how interpretCTREE uses them to check the correctness of assignments, and how interpretETREE uses them to check the correctness of additions:
===================================================

"""Interpreter for a C-like mini-language with variables, pointers,
   declarations and loops.    There are two crucial data structures:
"""
memory = []  # a global variable that models primary storage.
             # It is a list (array), where the indexes to the array are
             # meant to model addresses.  For example, if
             #  memory = [5,0],  then location 0 of  memory  holds  5.

env = {}     # a global variable that holds the program's variable names
             # and the  type,location  that each denotes. 
             # It is a Python hash table (dictionary).  For example, if
             #   env = {'x': (["int"],0),  'p': (["ptr", "int"], 1)}, 
             # then variable  x  names location 0, which holds an int,
             # and  p  names location 1, which holds a pointer to an int;
             # that is, location 1 holds a location that holds an int.


def interpretPTREE(p) :
    """pre: p  is a program represented as a  CLIST ::=  [ CTREE+ ]
                  where  CTREE+  means  one or more CTREEs
       post:  memory  holds all the updates commanded by program  p
    """
    global memory, env  # these variables are global to main
    memory = []         # reset them both
    env = {}
    interpretCLIST(p)
    print "final env =", env
    print "final memory =", memory


def interpretCLIST(p) :
    """pre: p  is a program represented as a  CLIST ::=  [ CTREE+ ]
                  where  CTREE+  means  one or more CTREEs
       post:  memory  holds all the updates commanded by program  p
    """
    for command in p :
        interpretCTREE(command)


def interpretLTREE(x) :
    """pre: x  is an L-value represented as an LTREE:
         LTREE ::= VAR | ["*", LTREE]
       returns:  datatype,location  pair named by  x
    """
    if isinstance(x, str) :   # a VAR ?
        if x in env :
            ans = env[x]  # look up its type and location
        else :  # undeclared
            crash("variable " + x + " undeclared")
    else :  # a pointer dereference,  ["*", LTREE]
        datatype, loc = interpretLTREE(x[1])  
        if datatype[0] == "ptr" :
            ans = (datatype[1:], memory[loc])    # dereference it
        else :
            crash("variable not a pointer")
    return ans
    


def interpretCTREE(c) :
    """pre: c  is a command represented as a CTREE:
         CTREE ::= ["=", LTREE, ETREE] | ["print", LTREE]
                   |  ["while", ETREE, CLIST]  |  ["declare", VAR, TYPE]
         where  TYPE ::= [ "ptr"*, "int" ]
           and  "ptr"*  means  zero or more occurrences of  "ptr"
                separated by commas

       post:  memory  holds all the updates commanded by  c
    """
    operator = c[0]
    if operator == "=" :   # assignment command, ["=", LTREE, ETREE]
        type1,lval = interpretLTREE(c[1])
        type2,exprval = interpretETREE(c[2])
        if type1 == type2 :
            memory[lval] = exprval  # do the assignment
        else :
            crash("incompatible types for assignment")
    elif operator == "print" :   # print command,  ["print", VAR]
        type,num = interpretETREE(c[1])
        print num
    elif operator == "while" :   # while command
        expr = c[1]
        body = c[2]
        type,val = interpretETREE(expr)
        while (val > 0) :
            interpretCLIST(body) 
            type,val = interpretETREE(expr)  # re-evaluate the test after each pass
    elif operator == "declare" :  # declaration
        x = c[1]
        if x in env :
            crash("variable " + x + " redeclared")
        else :
            memory.append("err")  # add a cell at the end of  memory
            env[x] = (c[2], len(memory) - 1)  # save type and location for x
    else :   # error
        crash("invalid command")


def interpretETREE(e) :
    """pre: e  is an expression represented as an ETREE:
          ETREE ::=  NUMERAL | [OP, ETREE1, ETREE2] | ["&", LTREE] | LTREE
                     where OP is either "+" or "-"
      post:  ans  holds the datatype and value of  e
      returns:  ans,  a  datatype,value  pair
    """
    if isinstance(e, str) and  e.isdigit() :   # a NUMERAL
        ans = (["int"], int(e))
    elif isinstance(e, list) and (e[0] == "+" or e[0] == "-"):  # [OP, E1, E2]
        op = e[0]
        type1, ans1  = interpretETREE(e[1])
        type2, ans2 = interpretETREE(e[2])
        if type1 == ["int"]  and  type2 == ["int"] :
            if op == "+" : 
                ans = (["int"], ans1 + ans2)
            elif op == "-" : 
                ans = (["int"], ans1 - ans2)
            else :
                crash("illegal arithmetic operator")
        else :
            crash("cannot do arithmetic on non-ints")
    elif isinstance(e, list) and e[0] == "&" :  # ["&", LTREE] 
        type0,val0 = interpretLTREE(e[1]) 
        ans = (["ptr"] + type0,  val0)
    else:  # is an LTREE that must be dereferenced:
        type0,loc0 = interpretLTREE(e)
        ans = (type0, memory[loc0])
    return ans


def crash(message) :
    """pre: message is a string
       post: message is printed and interpreter stopped
    """
    print message + "! crash! core dump: ", memory
    raise Exception   # stops the interpreter

===================================================
When you read interpretETREE, you see that the meaning of an expression must be a pair: a data type and a value. For example, 3 computes to (["int"], 3). The type tag ensures that the 3 is assigned only to a variable whose type is ["int"]. A pointer to an int will have type ["ptr", "int"]. For example, if we have the declaration, int x, which allocates location 0 for x, then ["&", "x"] evaluates to (["ptr", "int"], 0).
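The type arithmetic performed by the two interpret functions can be isolated in a short Python 3 sketch. The helper names address_type and deref_type are illustrative and do not appear in the interpreter; they only restate how ["&", LTREE] prepends a "ptr" tag and how ["*", LTREE] strips one. TypeError stands in for the interpreter's crash function.

```python
def address_type(t):
    """Type of  &L  when  L  has type  t: one more level of pointer."""
    return ["ptr"] + t

def deref_type(t):
    """Type of  *L  when  L  has type  t; only pointers may be dereferenced."""
    if t[0] != "ptr":
        raise TypeError("variable not a pointer")
    return t[1:]

int_t = ["int"]
ptr_t = address_type(int_t)        # type of  &y  when  y  is an int
ptr_ptr_t = address_type(ptr_t)    # type of  &z  when  z  is an int*

print(ptr_t)                       # ['ptr', 'int']
print(ptr_ptr_t)                   # ['ptr', 'ptr', 'int']
print(deref_type(ptr_ptr_t))       # ['ptr', 'int']
print(ptr_t == int_t)              # False, so  y = &x  is rejected when y is an int
```

Because tags are plain lists, the equality test type1 == type2 in interpretCTREE compares every level of pointer at once.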

Exercises:

  1. Install the interpreter and run these programs:
    int x;  x = 3;  int y;  y = (x + 1)
    
    interpretPTREE([["declare", "x", ["int"]],
          ["=", "x", "3"],
          ["declare", "y", ["int"]],
          ["=", "y", ["+", "x", "1"]]
        ])
    
    
    int x;  int* p;  x = 2;  p = &x;  int y;  y = *p
    
    interpretPTREE([["declare", "x", ["int"]],
          ["declare", "p", ["ptr", "int"]],
          ["=", "x", "2"],
          ["=", "p", ["&", "x"]],
          ["declare", "y", ["int"]],
          ["=", "y", ["*", "p"]]
         ])
    
    int x;  int* p;  x = 1;  p = &x;  x = *x
    
    interpretPTREE([["declare", "x", ["int"]],
          ["declare", "p", ["ptr", "int"]],
          ["=", "x", "1"],
          ["=", "p", ["&", "x"]],
          ["=", "x", ["*", "x"]]
         ])
    
    

  2. Write the parser for the syntax and combine it with the interpreter. (Note: If you write the parser by hand, then change the syntax of variable declaration to this:
    C ::=  ... |  declare I : T
    T ::=  int  |  * T
    
    This is easier to parse.)

  3. Try every combination of pointers and ints you can imagine to try to break the interpreter.

  4. Try this example:
    int x ;
    int* y ;
    int** z ;
    z = &y ;
    y = &x ;
    **z = 99
    
    What happens?

  5. Try this example:
    int x;
    x = 3 ;
    while x:  int y; y = 1; x = (x - y); print x  end ;
    print x
    
    What happens? Is the declaration of y in the loop body legal in C, even when the loop repeats? (Yes.) Repair the interpreter so that it is legal here, too.

  6. Modify the syntax of declarations so that you can declare and initialize a variable together:
    C ::= . . .  |  declare I : T = E
    
    Revise both the parser and interpreter.

  7. Add procedures to the language:
    C ::=  . . .  |  proc I : C
    
    Note: if you worked the previous exercise, you have basically worked this one, too, in this format:
    C ::=  . . .  |  declare I : proc = C
    
    Add procedure types to the type tags:
    T ::=  int  |  proc  |  * T
    
    Revise the interpreter so that a variable can point to procedure code:
    declare x : int ;
    proc p :  x = (x + 1) ;
    declare y : * proc ;
    y = &p ;
    call *y
    
    C lets you do this. (Have you heard of a ``function pointer''?)


2.1.3 A compiler for C

The usual implementation for C is a compiler, which translates a program's operator tree into executable code. The architecture of a C-compiler looks like this: (Say that the input program is int y; int* z; int x; y = 5; ....)

The translate functions traverse the operator tree and emit target-code instructions that state the actions meant by the tree. The translate functions construct the symbol table (environment) so that the data type and memory location of each program variable is known while the target code is generated. This means the translate functions can check all data-type constraints and can replace all variable names by location numbers. As a result, the executable target code contains no types, no names --- only loads and stores with memory locations.

For example, the C program,

int y;
int* z;
int x;
y = 5;
z = &y;
x = (6 + *z) 
generates the symbol table shown above and translates into this target code:
malloc 3          # allocate cells for y,z,x

loadaddr 0        # addr of y
loadconst 5
store             # y = 5

loadaddr 1        # addr of z
loadaddr 0        # addr of y
store             # z = &y

loadaddr 2        # addr of x
loadconst 6
loadaddr 1        # addr of z
load              # z
load              # *z
add
store             # x = (6 + *z)
When the CPU reads and performs the instructions, it updates memory as shown in the diagram above.
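To check that the target code has the intended effect, here is a minimal Python 3 sketch of a stack machine that executes it. The chapter does not define the instruction semantics, so the ones used here are assumptions inferred from the comments in the listing: loadaddr and loadconst push their operand, load replaces an address on the stack by the contents of that cell, add replaces two ints by their sum, and store pops a value and then an address and writes the value to that address.

```python
def execute(code):
    """Run target code (one instruction per line) and return the final memory."""
    memory, stack = [], []
    for line in code.splitlines():
        line = line.split("#")[0].strip()       # drop comments and blank lines
        if not line:
            continue
        parts = line.split()
        op = parts[0]
        if op == "malloc":
            memory.extend([0] * int(parts[1]))  # fresh cells hold 0 (an assumption)
        elif op in ("loadaddr", "loadconst"):
            stack.append(int(parts[1]))         # push an address or a constant
        elif op == "load":
            stack.append(memory[stack.pop()])   # address -> contents of that cell
        elif op == "add":
            right = stack.pop()
            stack.append(stack.pop() + right)
        elif op == "store":
            value = stack.pop()                 # value is on top, address beneath it
            memory[stack.pop()] = value
        else:
            raise ValueError("unknown instruction: " + op)
    return memory

target = """
malloc 3
loadaddr 0
loadconst 5
store
loadaddr 1
loadaddr 0
store
loadaddr 2
loadconst 6
loadaddr 1
load
load
add
store
"""
print(execute(target))   # [5, 0, 11]
```

Under these assumed semantics the two consecutive load instructions are what implement *z: the first fetches the location stored in z's cell, and the second fetches the int stored at that location.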

Earlier in the chapter, we saw how a compiler is made from an interpreter by replacing the interpreter's computation actions on memory by code-generating instructions. Since the C-interpreter already has the operator-tree traversal strategy and already has the declaration processing and type-checking actions, we can convert it into a compiler by replacing all the interpreter's actions on memory by target-code generations:

===================================================

def translateETREE(e) :
    """pre: e  is an expression represented as an ETREE:
          ETREE ::=  NUMERAL | [OP, ETREE1, ETREE2] | ["&", LTREE] | LTREE
                     where OP is either "+" or "-"
      returns:  ans,  a  datatype,code  pair
      post:  code  is a string, the target code for computing  e's  meaning
    """
    if isinstance(e, str) and  e.isdigit() :   # a NUMERAL
        ans = (["int"], "loadconst " + e + "\n")  # replaces  int(e)
    elif isinstance(e, list) and (e[0] == "+" or e[0] == "-"):  # [OP, E1, E2]
        op = e[0]
        type1, code1  = translateETREE(e[1])
        type2, code2 = translateETREE(e[2])
        if type1 == ["int"]  and  type2 == ["int"] :
            if op == "+" : 
                ans = (["int"], code1 + code2 + "add\n")  # replaces  v1 + v2
            elif op == "-" : 
                ans = (["int"], code1 + code2 + "sub\n")
            else :
                crash("illegal arithmetic operator")
        else :
            crash("cannot do arithmetic on non-ints")
    elif isinstance(e, list) and e[0] == "&" :  # ["&", LTREE] 
        type0,code0 = translateLTREE(e[1]) 
        ans = (["ptr"] + type0, code0)
    else:  # is an LTREE that must be dereferenced:
        type0,code0 = translateLTREE(e)
        ans =  (type0, code0 + "load\n")   # replaces  memory[loc0]
    return ans


def translateCTREE(c) :
    """pre: c  is a command represented as a CTREE:
         CTREE ::= ["=", LTREE, ETREE] | ["print", LTREE]
                   |  ["while", ETREE, CLIST]  |  ["declare", VAR, TYPE]
         where  TYPE ::= [ "ptr"*, "int" ]
           and  "ptr"*  means  zero or more occurrences of  "ptr"
                separated by commas

       post: returns  code,  a string that holds the target code that does
             the actions commanded by  c
    """
    operator = c[0]
    if operator == "=" :   # assignment command, ["=", LTREE, ETREE]
        type1,lcode = translateLTREE(c[1])
        type2,ecode = translateETREE(c[2])
        if type1 == type2 :
            code = lcode + ecode + "store\n"  # replaces  memory[lval] = exprval
        else :
            crash("incompatible types for assignment")
    elif operator == . . .
     . . .
    return code

. . .

===================================================
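The listing above elides translateLTREE. As a hedged illustration of what such a function might look like, the Python 3 sketch below pairs a guessed translateLTREE with a cut-down translateETREE and translates x = (6 + *z) against a fixed symbol table. The symbol-table contents, the helper translateAssign, and the use of ValueError in place of crash are assumptions of this sketch, not the author's code.

```python
# Symbol table as built by the declarations  int y; int* z; int x;
env = {"y": (["int"], 0), "z": (["ptr", "int"], 1), "x": (["int"], 2)}

def translateLTREE(l):
    """Return a (datatype, code) pair; the code leaves a location on the stack."""
    if isinstance(l, str):                        # a VAR: its address is known now
        datatype, loc = env[l]
        return (datatype, "loadaddr %d\n" % loc)
    datatype, code = translateLTREE(l[1])         # ["*", LTREE]
    if datatype[0] != "ptr":
        raise ValueError("variable not a pointer")
    return (datatype[1:], code + "load\n")        # fetch the location stored there

def translateETREE(e):
    """Return a (datatype, code) pair; the code leaves a value on the stack."""
    if isinstance(e, str) and e.isdigit():        # a NUMERAL
        return (["int"], "loadconst " + e + "\n")
    if isinstance(e, list) and e[0] == "+":
        type1, code1 = translateETREE(e[1])
        type2, code2 = translateETREE(e[2])
        if type1 != ["int"] or type2 != ["int"]:
            raise ValueError("cannot do arithmetic on non-ints")
        return (["int"], code1 + code2 + "add\n")
    if isinstance(e, list) and e[0] == "&":       # the location is the value
        datatype, code = translateLTREE(e[1])
        return (["ptr"] + datatype, code)
    datatype, code = translateLTREE(e)            # an LTREE: dereference it
    return (datatype, code + "load\n")

def translateAssign(c):
    """Translate ["=", LTREE, ETREE] after checking that the types agree."""
    type1, lcode = translateLTREE(c[1])
    type2, ecode = translateETREE(c[2])
    if type1 != type2:
        raise ValueError("incompatible types for assignment")
    return lcode + ecode + "store\n"

code = translateAssign(["=", "x", ["+", "6", ["*", "z"]]])
print(code)
```

The seven instructions printed are the same ones the target-code listing in this section gives for x = (6 + *z). All type checks happen while translating, so the emitted code carries no types and no names.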

This example shows that even when a language is meant to have a compiler as its standard implementation, an interpreter should be built first, because the interpreter

  1. defines the run-time machine data structures (e.g., memory layout)
  2. defines the architecture of the compiler's translation functions (each interpret function becomes a translate function)
  3. defines all the type- and constraint checking that should be done before target code is generated (see the type checking above)
  4. shows where actions on the run-time data structures should be replaced by generated target code (see where the memory references are replaced above)
The conversion of an interpreter into a compiler is largely mechanical, but the compiler writer must be an expert in the target language so that good, fast translations will be generated by the compiler.


2.2 The object virtual machine

The second important architecture for imperative languages is the object-oriented one, first used by Smalltalk. This architecture differs from the one for C because it replaces the linear memory by a heap, which is a collection of namespaces. Recall that a namespace (also called a dictionary) is a dynamic record that pairs field names with values --- it is a "struct" that can grow.

Here is a baby object-oriented program that creates two objects:

x = 7;
y = new {f, g};  // constructs a new object with two fields
y.f = x;
y.g = new {x};
y.g.x = 1 + y.f;
The two global variables are x and y, the former an int and the latter an object with two fields, f and g. A second object, holding variable x, is created and assigned to y's g field. Although this program lacks classes and methods, it is easy to guess what should be allocated in the heap. The virtual machine consists of a controller (the interpreter's functions), the heap, and a cell that remembers the address ("handle") of the namespace that holds the program's global variables:

The program has its own namespace/object for global variables, at handle α, and the two objects are saved at handles β and γ. We use the term, ``handle'', to talk about the location of an object, e.g., y's handle is β. (Also, you cannot do arithmetic with handles, unlike in C!)

Assignment in an object language works with handles and variables, like this: For

x = 7;
the meaning of the left-hand side, here x, is the base-offset pair, (α, 'x'). That is, the ``coordinates'' for storing 7 in the heap lead to object α, index x. For
y.f = x;
the meaning of the left-hand side is calculated in stages:
  1. The meaning of y is (α, 'y'), which is dereferenced to the handle, β.
  2. The meaning of y.f is the base-offset pair, (β, 'f').
The meaning of the assignment's right-hand side is the dereferencing of the pair, (α, 'x') to int 7. So, 7 is stored in the heap at the coordinates, (β, 'f').

In an object language, the "dot" dereferences a base-offset pair into a handle and uses the handle to make a base-offset pair. There is no other "arithmetic" on handles!
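
The staged calculation for y.f = x can be sketched in a few lines of Python. The helper names deref and dot, and the handle strings "a" and "b" (standing in for α and β), are assumptions made for this sketch; the interpreter in the next section organizes the same steps differently.

```python
# Assumed starting heap: "a" holds the globals, "b" is the object bound to y.
heap = {"a": {"x": 7, "y": "b"},
        "b": {"f": "nil", "g": "nil"}}

def deref(pair):
    """Return the value stored at a base-offset pair, (handle, field)."""
    handle, field = pair
    return heap[handle][field]

def dot(pair, field):
    """Meaning of  L.field : dereference L's pair to a handle, then re-pair."""
    return (deref(pair), field)

# y.f = x, step by step
y_pair = ("a", "y")             # meaning of  y
yf_pair = dot(y_pair, "f")      # meaning of  y.f
rval = deref(("a", "x"))        # meaning of the right-hand side
handle, field = yf_pair
heap[handle][field] = rval      # store the R-value at the computed coordinates
```

Note that dot is the only operation that consumes a handle, and all it does is pair the handle with a field name.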


2.2.1 An interpreter for an object-oriented language

We must understand how the base-offset pairs are constructed and used. Here is the syntax of the baby object language:

===================================================

P : Program          
CL : CommandList            L : LefthandSide
C : Command                 I : Variable
E : Expression              N : Numeral

P ::=  CL

CL ::= C  |  C ; CL
C ::=  L = E  |  if E : CL1 else CL2  end  |  print L

E ::=  N  |  ( E1 + E2 )  |  L  |  new { I,*  }
       where  I,*  means zero or more I-phrases, separated by commas

L ::=  I  |  L . I

N ::=  string of digits
I ::=  strings of letters, not including keywords

===================================================
The new constructions are new (which allocates a new object) and field indexing, written with a dot (.). In a later chapter, we add methods, classes, etc.

Once again, an assignment has the form, L = E. But the L-values and R-values used in an object language are markedly different from those used in C: an L-value is a base-offset pair, (handle, fieldname), and an R-value is an int or a handle.

For example, for x = 7, the L-value is (α,x) and the R-value is 7. The R-value, 7, is stored in the heap in the object named by α at position x.

For y.f = x, the L-value is (β,f) and the R-value is 7 (because the value saved at position x in object α is 7). This means 7 will be stored in the field f of the object labelled by β.

For y.g.x = y, the L-value is (γ, x) (because y.g has R-value γ!), and the assignment's R-value is β (because β is stored at position (α, y)). This means β will be stored in the field x of the object labelled by γ. (It also means that y and y.g.x are aliases to one and the same object.)
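
The aliasing in the last example can be checked with a small sketch. It assumes the heap left by the five-line program of Section 2.2, with "alpha", "beta", and "gamma" standing in for α, β, and γ; the helpers lvalue and rvalue, which take a dotted path as a list of names, are hypothetical and exist only for this illustration.

```python
heap = {"alpha": {"x": 7, "y": "beta"},
        "beta":  {"f": 7, "g": "gamma"},
        "gamma": {"x": 8}}

def lvalue(path):
    """L-value of a dotted path such as ["y", "g", "x"], starting at "alpha"."""
    handle = "alpha"
    for name in path[:-1]:
        handle = heap[handle][name]   # dereference to the next handle
    return (handle, path[-1])

def rvalue(path):
    """R-value of a dotted path: the value stored at its L-value."""
    handle, field = lvalue(path)
    return heap[handle][field]

# y.g.x = y
h, f = lvalue(["y", "g", "x"])    # the pair (gamma, x)
heap[h][f] = rvalue(["y"])        # stores the handle held by y

# y and y.g.x now hold the same handle, so y.f and y.g.x.f share one cell.
same_cell = lvalue(["y", "f"]) == lvalue(["y", "g", "x", "f"])
```

Because both paths resolve to the same base-offset pair, a store through either one is visible through the other.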

The operator trees for the language are

===================================================

PTREE ::=  CLIST

CLIST ::=  [ CTREE+ ]
           where  CTREE+  means  one or more CTREEs

CTREE ::=  ["=", LTREE, ETREE]  |  ["if", ETREE, CLIST, CLIST]
        |  ["print", LTREE]

ETREE ::=  NUM  |  ["+", ETREE, ETREE] |  ["deref", LTREE]  |  ["new", OB ]

OB ::=  [ ID* ]
            where ID*  means zero or more IDs
LTREE ::=   ID  |  ["dot", LTREE, ID]

NUM   ::=  a nonempty string of digits
ID    ::=  a nonempty string of letters

===================================================
Study the forms of ETREE; they are important. Also, note that an LTREE is a sequence of variable/field names that must be traversed to find a location.

For example, the CTREEs for a = new {f,g}; a.f = 3; a.g = (1 + a.f) look like this:

[ ["=", "a", ["new", ["f","g"]]],
  ["=", ["dot", "a", "f"], "3"],
  ["=", ["dot", "a", "g"], ["+", "1", ["deref", ["dot", "a", "f"]]]]  ]
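
One way to study the tree forms is to write a checker that accepts exactly the lists the grammar describes. The sketch below is an aid for experimenting, not part of the interpreter that follows; all of the function names are assumptions, and it does not exclude keywords from IDs.

```python
def is_id(t):
    """ID ::= a nonempty string of letters (keywords are not excluded here)."""
    return isinstance(t, str) and t.isalpha()

def is_ltree(t):
    """LTREE ::= ID | ["dot", LTREE, ID]"""
    if isinstance(t, str):
        return is_id(t)
    return (isinstance(t, list) and len(t) == 3 and t[0] == "dot"
            and is_ltree(t[1]) and is_id(t[2]))

def is_etree(t):
    """ETREE ::= NUM | ["+", ETREE, ETREE] | ["deref", LTREE] | ["new", OB]"""
    if isinstance(t, str):
        return t.isdigit()
    if not isinstance(t, list) or len(t) == 0:
        return False
    if t[0] == "+":
        return len(t) == 3 and is_etree(t[1]) and is_etree(t[2])
    if t[0] == "deref":
        return len(t) == 2 and is_ltree(t[1])
    if t[0] == "new":
        return (len(t) == 2 and isinstance(t[1], list)
                and all(is_id(i) for i in t[1]))
    return False

def is_ctree(t):
    """CTREE ::= ["=", LTREE, ETREE] | ["if", ETREE, CLIST, CLIST] | ["print", LTREE]"""
    if not isinstance(t, list) or len(t) == 0:
        return False
    if t[0] == "=":
        return len(t) == 3 and is_ltree(t[1]) and is_etree(t[2])
    if t[0] == "if":
        return (len(t) == 4 and is_etree(t[1])
                and is_clist(t[2]) and is_clist(t[3]))
    if t[0] == "print":
        return len(t) == 2 and is_ltree(t[1])
    return False

def is_clist(t):
    """CLIST ::= [ CTREE+ ]"""
    return isinstance(t, list) and len(t) >= 1 and all(is_ctree(c) for c in t)

program = [["=", "a", ["new", ["f", "g"]]],
           ["=", ["dot", "a", "f"], "3"],
           ["=", ["dot", "a", "g"], ["+", "1", ["deref", ["dot", "a", "f"]]]]]
well_formed = is_clist(program)
missing_deref = is_clist([["=", "a", "b"]])   # b on the right needs a "deref"
```

The second check fails because a variable used as an expression must be wrapped as ["deref", LTREE]; a bare ID is a legal LTREE but not a legal ETREE.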

Here is the interpreter; it uses a heap and a variable, ns, that remembers the handle of the namespace holding the global variables. The heap is grouped with a collection of maintenance functions; if the code were any larger, the heap and its functions should be moved into a module of their own.

===================================================

"""Interpreter for object-oriented language:

PTREE ::=  CLIST
CLIST ::=  [ CTREE+ ]
           where  CTREE+  means  one or more CTREEs
CTREE ::=  ["=", LTREE, ETREE]  |  ["if", ETREE, CLIST, CLIST]
        |  ["print", LTREE]
ETREE ::=  NUM  |  ["+", ETREE, ETREE] |  ["deref", LTREE]  |  ["new", OB ]
OB ::=  [ ID* ]
        where ID*  means zero or more IDs
LTREE ::=   ID  |  ["dot", LTREE, ID]
NUM   ::=  a nonempty string of digits
ID    ::=  a nonempty string of letters
"""

### THE HEAP:

heap = {}
heap_count = 0  # how many objects stored in the heap

"""The program's heap is a dictionary that maps handles to namespaces.
   An object is a namespace.

      heap : { (HANDLE : NAMESPACE)+ }
             where  HANDLE = a string of form, "hn", where n is a numeral
                    NAMESPACE = a dictionary that maps var names to values
   Example:
     heap = { "h0": {"x":7, "y":"h1", "z":"h2"},
              "h1": {"f":"nil", "g":5, "h":"h2"}, 
              "h2": {"x":12}  }
     heap_count = 3
        is an example heap, where handle "h0" names a namespace
        whose "x" field holds int 7, "y" field holds handle "h1"
        (a handle to the object at heap location "h1"), and "z" field
        holds handle "h2".
       "h1" labels a three-field namespace, and
       "h2" labels a namespace with a single field, "x"

   The above example heap was generated from this sample program:
        x = 7;
        y = new {f, g, h};
        y.g = 5;
        z = new {x};
        z.x = (y.g + x);
        y.h = z
"""
# Maintenance functions for the heap:

def clearTheHeap():
    """resets the heap to be empty"""
    global heap_count, heap
    heap_count = 0
    heap = {}


def allocateNS() :
    """allocates a new, empty namespace in the heap and returns its handle"""
    global heap_count
    newhandle = "h" + str(heap_count)
    heap[newhandle] = {}
    heap_count = heap_count + 1
    return newhandle


def lookup(lval) :
    """looks up the value of  lval  in the heap
       param: lval is a pair,  (handle,fieldname),

       returns: Say that  lval = (h,f).  The function extracts the
               object at  heap[h], indexes it with f,  and returns
               the value found there:   return  heap[h][f]
               For example, for the heap at the top of this program,
                 lookup(("h0","x")) returns 7
                 lookup(("h1","f")) returns "nil"
    """
    ans = "nil"
    handle,fieldname = lval
    if handle in heap  and  fieldname in heap[handle] :  # a legal heaploc
        ans = heap[handle][fieldname]
    else :
        crash("dereference error; lval =" + str(lval))
    return ans


def store(lval, rval) :
    """stores the  rval  into the heap at  lval
       params:  lval -- a pair,  (handle, fieldname), as described above
                rval -- an int or a handle

       Say that  lval = (h,f).  The function finds the object at  h
         and saves  rval  at index  f  within that object:  heap[h][f] = rval.
       For example, for the heap at the top of this program,
                 store(("h1","f"), 98)  would replace "nil"  by 98.
    """
    handle,fieldname = lval
    if handle in heap :  # a legal heaploc
        heap[handle][fieldname] = rval
    else : crash("store error; lval =" + str(lval))



### NAMESPACE (environment):  is the handle to the namespace in the heap
#                             that holds the program's global variables

ns = "nil"  # This will be initialized in the main function, interpretPTREE, as
            #    ns = allocateNS()       See below.


### INTERPRETER FUNCTIONS:

def interpretPTREE(tree) :
    """interprets a complete program tree
       pre: tree is a  PTREE ::=  CLIST
       post: final values are deposited in the heap;  ns  holds the handle
             of the namespace of global variables
    """
    # initialize heap and ns:
    global ns   # asking for permission to assign to global var,  ns
    clearTheHeap()
    ns = allocateNS()

    interpretCLIST(tree)
    print "final namespace =", ns
    print "final heap =", heap
    raw_input("Press Enter key to terminate")


def interpretCLIST(clist) :
    """pre: clist  is a program represented as a   CLIST ::=  [ CTREE+ ]
                  where  CTREE+  means  one or more CTREEs
       post:  heap  holds all the updates commanded by  clist
    """
    for command in clist :
        interpretCTREE(command)


def interpretCTREE(c) :
    """pre: c  is a command represented as a CTREE:
       CTREE ::=  ["=", LTREE, ETREE]  |  ["if", ETREE, CLIST1, CLIST2]
        |  ["print", LTREE] 
       post:  heap  holds the updates commanded by  c
    """
    operator = c[0]
    if operator == "=" :   # ["=", LTREE, ETREE]
        lval = interpretLTREE(c[1])
        rval = interpretETREE(c[2])
        store(lval, rval)
    elif operator == "print" :   # ["print", LTREE]
        v = lookup(interpretLTREE(c[1]))   # fetch the value stored at the L-value
        print  v
    elif operator == "if" :   # ["if", ETREE, CLIST1, CLIST2]
        test = interpretETREE(c[1])
        if test != 0 :
            interpretCLIST(c[2])
        else :
            interpretCLIST(c[3])
    else :   # error
        crash("invalid command")


def interpretETREE(etree) :
    """interpretETREE computes the meaning of an expression operator tree.
         ETREE ::=  NUM  |  ["+", ETREE, ETREE] |  ["deref", LTREE] 
                 |  ["new", OB ]
         OB = [ ID* ]
        post: updates the heap as needed and returns the  etree's value,
              either an int, a handle, or "nil"
    """
    ans = "nil"
    if isinstance(etree, str) and etree.isdigit() :  # NUM
        ans = int(etree)
    elif  etree[0] == "+" :      # ["+", ETREE, ETREE]
        ans1 = interpretETREE(etree[1])
        ans2 = interpretETREE(etree[2])
        if isinstance(ans1,int) and isinstance(ans2, int) :
            ans = ans1 + ans2
        else : crash("addition error --- nonint value used")
    elif  etree[0] == "deref" :
        lval = interpretLTREE(etree[1])
        ans = lookup(lval)
    elif  etree[0] == "new" :
        handle = allocateNS()
        fields = etree[1]
        for f in fields :
            store((handle,f), "nil")
        ans = handle  
    else :  crash("invalid expression form")
    return ans


def interpretLTREE(ltree) :
    """interpretLTREE computes the meaning of a lefthandside operator tree.
          LTREE ::=  ID  |  ["dot", LTREE, ID]
       Here are some example  ltree values:
         "y"                                  #  y
         ["dot", "y", "h"]                    #  y.h
         ["dot", ["dot", "y", "h"], "r"]      #  y.h.r

       Returns a pair, (handle, field), where  handle is the location
         in the heap where an object/namespace is stored,
         and  field is an ID that indexes into the object.

       Look again at the example heap at the very top of this program. 
       For that heap, here is what the function returns:
          "y"                #  returns ("h0","y")
          ["dot", "y", "f"]  #  returns ("h1", "f")
          ["dot", ["dot", "y", "h"], "r"]  #  returns ("h2", "r")
    """
    if isinstance(ltree, str) :  #  just a single ID ?
        ans = (ns, ltree)        #  use  ns  as handle to field  ltree
    else :  # ltree has form,  ["dot", LTREE1, ID]
        lval = interpretLTREE(ltree[1]) 
        handle = lookup(lval)
        ans = (handle, ltree[2])
    return ans


def crash(message) :
    """pre: message is a string
       post: message is printed and interpreter stopped
    """
    print message + "! crash! ns=", ns, "heap=", heap
    raise Exception   # stops the interpreter

===================================================
Exercise: Here are three sample programs. Use the interpreter on each and study the final configurations:
  1. x = 2;
    y = new {f, g};
    y.f = x;
    y.g = new {h};
    print y.f
    
    interpretPTREE( [ ["=", "x", "2"],
            ["=", "y", ["new", ["f","g"]]],
            ["=", ["dot", "y","f"], ["deref", "x"]],
            ["=", ["dot", "y", "g"], ["new", ["h"]]],
            ["print", ["dot", "y", "f"]]
          ]
        )
    
  2. x = 7;
    y = new {f,x};
    y.f = x;
    y.x = new {r};
    y.x.r = y.f
    
    interpretPTREE( [ ["=", "x", "7"],
            ["=", "y", ["new", ["f","x"]]],
            ["=", ["dot", "y", "f"], ["deref", "x"]],
            ["=", ["dot", "y", "x"], ["new", ["r"]]],
            ["=", ["dot", ["dot", "y", "x"], "r"], ["deref", ["dot", "y", "f"]]],
          ])
    
  3. y = new {f,g};
    x = y;
    y.h = 5;   # the object grows as needed!
    print y
    
    interpretPTREE( [ ["=", "y", ["new", ["f","g"]]],
            ["=", "x", ["deref", "y"]],
            ["=", ["dot", "y", "h"], "5"],
            ["print", "y"]
          ])
    

Exercise: Notice there is no trouble in reusing a global variable name as a field name:

x = 2;
y = new {x, y};
y.x = x;  y.y = y;  x = y.x
The indexing makes clear which variable should be consulted.
  1. Now, let's make the language look closer to an object language. Say that the fields in a new object can be initialized. For example,
    x = 2; 
    y = new {a = x;  b = a}
    
    This says that y's field a is set to x's value, and field b is set to a's value.

    We will use this syntax for initialization commands:

    E ::=  . . .  |  new { F* }
           where  F*  means zero or more occurrences of  F,  separated by  ;
    F ::=  I = E
    
    Revise the parser and interpreter for these new constructions. How does your semantics operate on the example? (That is, does the semantics locate the value of x when it is initializing a? Does it locate the value of a when it is initializing b? Make certain that it does.) Hint: in the interpreter, replace variable ns by a stack of handles to namespaces. Use the stack when computing the semantics of a LefthandSide.

  2. In most object-oriented languages, there are pronouns, this and super, which can be used inside an object to refer to a local field and a nonlocal field, respectively. For example,
    x = 2;  y = new {x = super.x;  y = this.x}
    
    states that y's field x is initialized with the value of global variable x and field y is initialized with the value of field x in this object being constructed.

    Revise the syntax of LefthandSide to read:

    L ::=  I  |  L . I  |  this  |  super
    
    and modify parser and interpreter to implement the new constructions.

  3. If you have solved the previous two problems, you are ready for this one: Modify the syntax of new { F* } to be new { CL }, that is:
    E ::=  N  |  ( E1 + E2 )  |  L  |  new { CL }
    
    When a new object is allocated, the commands, CL, execute. Here is an example:
    x = 7;
    y = new { if x :
            a = x
          else
            b = 0 end;
          c = x;
          print c
        }
    
    This code sets x to 7 and sets y to the handle of a new object, {a:7, c:7}. You see that we are not far from modern objects, which contain a mix of declarations and executable constructor code.