Copyright © 2010 David Schmidt

Chapter 9:
Domain-Specific Languages


9.1 Domain-specific software architecture
9.2 Domain-specific language
9.3 Domain-specific programming languages as ``little languages''
    9.3.1 Top-down (``external'') DSPL
    9.3.2 Bottom-up (``internal'') DSPL
9.4 Designing a top-down DSPL
9.5 Developing a bottom-up DSPL
9.6 Hybrid DSPL
9.7 Implementing a DSPL
9.8 Further reading


It is unlikely that you will ever design a general-use language like Fortran, C++, ML, or Prolog, but if you become a professional software engineer or software architect, it is highly likely that you will specialize in some problem area, like telecommunications, aviation, banking, or gaming. You will become expert at building systems in your problem area, and you may well design a notation, a language, that helps you and others write solutions to their problems in this area. In this case, you are a designer of a domain-specific language that is used to build domain-specific software architectures.

This chapter introduces these concepts, applying the concepts already learned.


9.1 Domain-specific software architecture

Every large system is built from software and hardware components; the pattern of layout and connection of the components is called its architecture. A software architecture is the layout of software components. The software architecture is deployed (installed) on the hardware architecture.

Specific problem areas, e.g., flight-control or telecommunications or banking, use specific hardware architectures, and they also use specific software architectures. When a new model of airplane is designed, the hardware architecture (the airplane hardware, including its computers) is based on a hardware design that has succeeded in the past. (It is too great a risk to start from scratch; it is also better to build on and refine what is known to work.) The software architecture for the plane will also be based on some standard layout that is known to work well.

Software architects use a collection of concepts, techniques, and patterns to build a new system in an established problem area; this collection is called a domain-specific software architecture:

  1. application domain: this is the problem area, and it is understood in terms of (i) fundamental concepts and terminology (words, phrases, and actions that the clients, designers, and builders use to discuss the problem); (ii) customer requirements (what must the system do); (iii) scenarios (examples of behaviors); (iv) configuration models (the high-level blueprints of the system and its operation --- entity-relationship (dependency) diagrams, data flow diagrams, deployment diagrams, etc.)

  2. reference requirements: These are the ``features'' or ``customizations'' or ``attributes'' or ``ordering options'' that the clients (customers/users) select to configure the desired system. (Think about all the choices you make when you order a brand new car from an auto dealer --- colors, engine options, accessories --- these are the reference requirements for the car you want.)

    Strictly speaking, the reference requirements form part of the terminology of the application domain, but they are often specially identified because they are often treated specially in the implementation methodology.

  3. reference architecture: the software and hardware architectures that will support the implementation, usually documented by blueprints (e.g., UML)

  4. supporting environment/infrastructure: hardware and software languages, libraries, frameworks, tools, and platforms for modelling, designing, implementing, and evaluating the system.

  5. a process or methodology for designing, implementing, and evaluating the system using the requirements, reference architecture, and environment.

We can't study all these topics here. But it is important to know that professionals use domain-specific software architecture to build complex, working systems. What we will study here is primarily the first item, namely the terminology/language one uses to discuss and solve problems within an application domain. This language is called a domain-specific language (DSL). The DSL helps us use, define, develop, and improve the reference architecture, the environment, and the methodology.


9.2 Domain-specific language

English is a general-purpose language. Legal English is a special-purpose language, dedicated to the writing of contracts and laws --- it is specific to the domain of contracts and laws. Algebra is a domain-specific language for stating numerical relationships.

A language that is designed for discussing problems, behaviors, and solutions within a problem domain is a domain-specific language (DSL). The language's vocabulary includes concepts and notation from the problem domain: the nouns, pronouns, adjectives, verbs, and adverbs of the language. The language lets participants (people and machines) discuss and implement solutions within the domain. Because its vocabulary is limited to the specific domain, a DSL is often useless for discussing and solving problems outside the domain.

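A DSL uses concepts familiar to people who work in the domain. For example, say that you must install an alarm system in an office building, and you must discuss the setup with the building's owners and employees. A DSL for sensor-alarm networks would discuss the domains of interest and their elements (rooms, movement detectors, cameras, guards), along with their features, operations, actions, and events.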

Here is a scenario, stated in the DSL:
``when a movement detector detects an intruder in a room, it generates a movement-event for a camera and sends a message to a guard....''
The DSL lets you talk about the behaviors of the alarm system so that you can extract, design, and even implement the system using the DSL's vocabulary.

Compare the lingo of sensor alarms to the lingo you write in Java --- in the latter, the ``nouns'' are numbers, arrays, objects, and variables that name numbers, arrays, objects, etc. The ``adjectives'' are data types and other declaration modifiers. The ``operations'' are arithmetic, data-structure indexing, method call, etc. ``Actions'' are commands, or groups of commands. ``Events'' can be GUI events or a call to a method to start execution. Java is a ``DSL'' for computation on numbers and arrays and objects.

Here is a second, less ambitious example --- a scenario expressed in ``recipe language.'' I found this example at http://weblog.jamisbuck.org/2006/4/20/writing-domain-specific-languages; it is a scenario for making a sandwich:

===================================================

title:
PBJ Sandwich

ingredients:
- two slices of bread
- one tablespoon of peanut butter
- one teaspoon of jam

instructions:
1. spread peanut butter on one side of one slice of bread until evenly distributed
2. spread desired amount of jam on top of peanut butter
3. place other slice of bread on top 

servings: 1
prep time: 2 minutes

===================================================
In the above scenario, there are domains, features, operations, and actions. (The ``event'' that triggers the above action is ``when hungry...''.) Notice how conditionals and repetitions are expressed implicitly (``spread desired amount'' and ``until evenly distributed'') so that sequencing is the only explicit control structure.

The example points out that ``recipe English'' is a DSL for communicating with a chef or a cooking machine.
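
As a rough sketch of how a ``cooking machine'' might begin to read recipe language, the following Python fragment splits the recipe text into its sections. The function name parse_recipe and the parsing rules (a header is a lowercase phrase followed by a colon; an entry may start with a dash or a step number) are assumptions made for this illustration and are not part of the cited example.

```python
import re

RECIPE_TEXT = """\
title:
PBJ Sandwich

ingredients:
- two slices of bread
- one tablespoon of peanut butter
- one teaspoon of jam

instructions:
1. spread peanut butter on one side of one slice of bread until evenly distributed
2. spread desired amount of jam on top of peanut butter
3. place other slice of bread on top

servings: 1
prep time: 2 minutes
"""

HEADER = re.compile(r"^([a-z ]+):\s*(.*)$")   # e.g., "title:" or "servings: 1"
MARKER = re.compile(r"^(-|\d+\.)\s*")         # a leading "- " or "1. "

def parse_recipe(text):
    """Return a dict that maps each section name to a list of its entries."""
    recipe = {}
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            section = header.group(1)
            rest = header.group(2)
            recipe[section] = [rest] if rest else []
        else:
            # an entry of the current section, with its list marker removed
            recipe[section].append(MARKER.sub("", line))
    return recipe

recipe = parse_recipe(RECIPE_TEXT)
print(recipe["title"], len(recipe["ingredients"]), len(recipe["instructions"]))
```

The parser recovers only the structure (the sections and their entries). Understanding a phrase like ``until evenly distributed'' is exactly the part that must still be refined before a machine could act on it.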

Now, think again of a domain like sensor alarms, or network protocols, or music composition, or gaming, or cooking. What are the domains of interest, their elements, the features, operations, actions, and events? How many of these are directly implemented (that is, ``understood'') by a computer? How many must be ``refined'' to be computational (understood by a computer)?

A DSL lets stakeholders (the participants in a systems project) communicate their ideas (needs, suggestions, solutions, implementations, orders). The DSL is a modelling language that lets us discuss models, structures, and behaviors specialized to a problem domain like telecommunications, banking, transportation, gaming, algebra, typesetting, etc.

If the computer is a ``participant,'' that is, we can use the DSL to tell the computer what to do --- we can program the computer --- then the DSL is a domain-specific programming language (DSPL).

Other problem domains and their DSLs

Domain-specific languages are especially useful for describing reactive systems --- alarm systems, telecommunications systems, vending machines, multi-player games, and protocols --- hence the previous example and its classification of the language into events, actions, features, nouns, and operations.

But not all computational mechanisms are reactive. For example, the equational language of algebra is a DSL, and the computation underlying its equation sets is the application of simplification laws.
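
To illustrate computation by simplification laws, here is a minimal Python sketch. It assumes that expressions are encoded as nested tuples such as ("+", "x", 0), an encoding invented for this example, and it applies only a handful of laws for + and *.

```python
def simplify(expr):
    """Apply simplification laws bottom-up to a nested-tuple expression."""
    if not isinstance(expr, tuple):
        return expr                          # a number or a variable name
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        # both operands are constants: do the arithmetic
        return left + right if op == "+" else left * right
    if op == "+":
        if left == 0:
            return right                     # 0 + x = x
        if right == 0:
            return left                      # x + 0 = x
    if op == "*":
        if left == 0 or right == 0:
            return 0                         # x * 0 = 0 * x = 0
        if left == 1:
            return right                     # 1 * x = x
        if right == 1:
            return left                      # x * 1 = x
    return (op, left, right)                 # no law applies

# (x * 1) + (y * 0)  simplifies to  x
print(simplify(("+", ("*", "x", 1), ("*", "y", 0))))
```

No event triggers anything here: the ``program'' is an expression, and running it means rewriting it until no law applies.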

Yet another variation is a domain related to constraint solving, such as crossword puzzles or Sudoku or spreadsheets or database queries, where the domain language is a set of clues or constraints that must be computed to solve (fill in) a puzzle.

In these cases, the appropriate DSL might not be ``event-action oriented'' but in any case, it will certainly remain the appropriate language that the stakeholders use to discuss their problems and the solutions.

General-purpose languages

Why are languages like C, Java, and Prolog called ``general purpose'' languages? After all, each such language is specific to data domains like numbers, strings, tables, structs, objects, relations, and so on.

One might argue that a general-purpose computer language is ``domain un-specific'' because it favors no one application domain very much over another. As a result, a user of a general-purpose language must become an expert modeller of real-life application domains in the domains of the general-purpose language.

When the complexities in domain modelling become too great, the general-purpose language must be abandoned for a domain-specific one.


9.3 Domain-specific programming languages as ``little languages''

One uses a domain-specific programming language to tell a computer how to solve a problem in a domain. The examples in the previous section suggest that one might define a DSPL by extracting from a DSL its computational part. This treats the computer as but one stakeholder in the community of participants.

But there is another origin of DSPLs that comes from totally within the programming world: It is inconvenient to drag out a general-purpose language to code a solution to something small and simple. (For example, do you code a Java program each time you have to do some calculator arithmetic? No --- you use a calculator language instead.) Indeed, it is better to use a smaller, simpler language --- a DSPL --- that is customized to solving exactly the problem you face.

For this reason, programmers sometimes call DSPLs ``little languages'' (e.g., ``here is a little language for drawing figures''; ``here is a little language for linking files''). Here is a short list of ``little language'' DSPLs that have/had wide use:

  1. Make --- for linking files
  2. Matlab (and Mathematica) --- for doing linear algebra
  3. SQL --- for doing lookup queries and updates in databases
  4. VHDL and Verilog --- for laying out hardware circuits
  5. Yacc, Bison, Antlr --- for programming a parser
  6. Excel --- for laying out and computing a spreadsheet
  7. HTML, CSS, PHP --- for generating web-browser documents
  8. groff and LaTeX --- for typesetting documents
  9. eqn --- for typesetting math formulas
Admittedly, the current versions of many of these ``little languages'' aren't so little anymore, but almost all of them came about because someone thought,
``It would be nice to have a little language to help me do ...this little job....''
So, that person designed a little language to do the little job.
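
To make the idea concrete, here is a sketch of about the smallest useful little language: calculator arithmetic, with its own tokenizer, parser, and evaluator written in Python. The function names (tokenize, calculate) and the grammar (numbers, + - * /, and parentheses) are choices made for this illustration only.

```python
import re

TOKEN = re.compile(r"\s*(\d+\.?\d*|[-+*/()])")

def tokenize(text):
    """Split the input into number and operator tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError("unexpected character at position %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def calculate(text):
    """Parse and evaluate:  expr ::= term { (+|-) term }
                            term ::= factor { (*|/) factor }
                            factor ::= NUMBER | ( expr )"""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def factor():
        token = peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        advance()
        if token == "(":
            value = expr()
            if peek() != ")":
                raise SyntaxError("missing )")
            advance()
            return value
        if token in "+-*/)":
            raise SyntaxError("unexpected " + token)
        return float(token)

    def term():
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def expr():
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    result = expr()
    if peek() is not None:
        raise SyntaxError("unexpected " + peek())
    return result

print(calculate("2 + 3 * (4 - 1)"))   # prints 11.0
```

The user of this language never sees Python: the only notation available is the notation of arithmetic.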

In terms of domain-specific software architecture, someone might ask you,

``It would be nice to have a little language to help ...somebody... do ...some little job in this domain.... Can you put together something for us?''

For example, ``It would be nice to have a little language to help us lay out the wiring and sensors for a building's alarm system.''

Or, ``It would be nice to have a little language to help us write the protocols for how the movement detectors send/receive messages to/from the other devices and people in the network.''

This kind of wishful thinking can lead to a domain-specific programming language, in particular, a top-down domain-specific programming language.


9.3.1 Top-down (``external'') DSPL

Each of the languages in the list in the previous section does one thing well in one application domain; by no means should any of these languages be used for general-purpose computing. All the examples in the list are called top-down (or ``external'') DSPLs because they are designed as stand-alone languages that implement domain concepts and nothing more. Since a top-down DSPL is a ``little language,'' it should be easier to learn and use than a general-purpose language. (If it isn't, then it is a failure!) In many cases, less-experienced and maybe even non-programmers should be able to use a top-down DSPL to write solutions.

Excel is a good example --- it has a nice mix of graphical and textual notations that falls within the grasp of a user who has rudimentary math and problem-solving skills. The user can lay out a spreadsheet that computes totals of rows and columns. (If you have never used Excel or a spreadsheet tool, you can read a tutorial here.)

Another good example is Yacc --- a user writes the BNF rules of a language, and this gives the information the Yacc compiler uses to build a parser matching the BNF rules. (Here is a tutorial.) The programmer can attach semantic-processing components to the generated parser by pairing the components with the BNF rules.

Another good example is SQL --- without knowing the internal layout of a database, a user can write a query in terms of an implicit logic of sets and set operations, and the SQL interpreter executes the query as if it were a data-structures lookup algorithm. (There is a demo and tutorial here.)
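
Here is a small sketch of that idea, using Python's standard sqlite3 module to run SQL against a throwaway in-memory database. The table name sensors and its rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # a temporary, in-memory database
conn.execute("CREATE TABLE sensors (id INTEGER, room TEXT, kind TEXT)")
conn.executemany(
    "INSERT INTO sensors VALUES (?, ?, ?)",
    [(1, "lobby", "movement"), (2, "lobby", "smoke"), (3, "vault", "movement")],
)

# The query states WHAT is wanted (a set of rows), not HOW to search for it.
rows = conn.execute(
    "SELECT room FROM sensors WHERE kind = 'movement' ORDER BY room"
).fetchall()
print(rows)
conn.close()
```

Nothing in the query mentions loops, indexes, or file layouts; those decisions are left to the SQL engine.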

HTML lets a user format a web document in terms of paragraphs, lists, and fonts and hides the details of spacing, line breaks, and painting text and pictures. (Use the View/PageSource menu option on your web browser to see this chapter's HTML coding. Here is a tutorial.) CSS is another little language, used with HTML, to set default layouts, fonts, and colors for an HTML file.

There is one critical standard for the success of a top-down DSPL:

Programmers must ``see'' their domain and the actions within it in the DSPL.
That is, the top-down DSPL lets the programmer think and talk directly in the problem domain; there are no distracting complications. (For example, a typical Excel user sees (figuratively and literally) a spreadsheet and does not want the complication of writing for-loops over columns of numbers in the spreadsheet!)

Upon first hearing, it sounds like top-down DSPLs are wonderful --- a language for just my problem that lets me say exactly what I want! --- but in reality, a top-down DSPL is a ``mixed bag'' of assets and drawbacks:

For these reasons, top-down DSPLs are not always the best tool for solving domain-specific problems.

Script versus editor

Computer programmers treat language as script that must be typed. But all programmers use development tools, such as text editors, IDEs, and debuggers; indeed, many software engineers don't write programs as script --- they interact with an IDE, choosing menu options and completing templates until the IDE announces that a program is completed.

Users of top-down DSPLs are even more ``IDE-dependent'' than programmers. For example, an Excel user will interact with Excel's GUI to insert data into cells of a spreadsheet and write equations that are embedded into the spreadsheet's ``logic'' so that the row-and-column totals are correctly computed and displayed. Exactly where is the ``program''?

In a note at http://martinfowler.com/articles/languageWorkbench.html Martin Fowler coins the term, ``projecting editor'' for the GUI portion of a DSPL's IDE:


Here, the editor keeps an abstract representation of the program that is filled in bit by bit, not necessarily sequentially, not always as script. The program's abstract representation might be a parse-tree-plus-symbol-table or some other data structure that stores the program's semantic intent. The tool must have a back end that can interpret the abstract program or can generate a script that can be interpreted. (The ``storage representation'' in the diagram is some file format that archives the abstract representation at the end of the IDE session.)
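
The following Python sketch suggests how such a tool might be organized, under heavy simplifying assumptions: the abstract representation is just a dictionary of named cells, each ``editing action'' is a call to a hypothetical set_cell function, and the back end handles only formulas built with +. None of these names come from Fowler's article.

```python
model = {"cells": {}}                     # the abstract representation

def set_cell(name, formula):
    """One editing action, e.g., triggered by a GUI form or a menu choice."""
    model["cells"][name] = formula

# The user fills in the program bit by bit, in no particular order:
set_cell("total", "a1 + a2")
set_cell("a2", "30")
set_cell("a1", "12")

def generate_script(model):
    """Back end: emit each cell as an assignment, dependencies first."""
    cells = model["cells"]
    emitted, lines = set(), []

    def emit(name):
        if name in emitted:
            return
        emitted.add(name)
        for word in cells[name].replace("+", " ").split():
            if word in cells:
                emit(word)                # a cell that this formula uses
        lines.append(name + " = " + cells[name])

    for name in cells:
        emit(name)
    return "\n".join(lines)

script = generate_script(model)
print(script)
```

The user never typed the generated script; it is only one projection of the stored model, produced when an interpreter needs it.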

If you are an Emacs or a vi or an Eclipse or a Visual Studio user, you are using a DSPL for document generation, presented in the format of a GUI (even a command-window GUI). The most extreme view is that any user interface for any application is a DSPL.

If you are developing a top-down DSPL for non-programmer users, then you are almost certainly forced to develop an IDE to go with it.


9.3.2 Bottom-up (``internal'') DSPL

There is another variant of DSPL, one that is used by an experienced programmer who wants to ``extend'' a general-purpose language with concepts specific to the problem domain. In this situation, the DSPL is programmed in the general-purpose, host language as a library of data structures and operations.

This is called a bottom-up (or ``internal'') DSPL.

Background: GUI-building frameworks

Frameworks are ``not-quite bottom-up DSPLs.'' (We will explain the remark later.) For example, consider these libraries for GUI-building: Each library is married to a host language, because a GUI by itself is useless --- the GUI must be connected to components that do something.

These libraries are called frameworks, because each has its own collection of components that implement nouns (``window,'' ``frame,'' ``button,'' ``layout,'' ...) and verbs (``setTitle,'' ``getText,'' ``paint,'' ...) of the GUI domain. They usually come with sample programs that suggest patterns for assembling the components. But they are implemented in their host languages, and a programmer must write (lots of) code in the host language to assemble a working GUI from the GUI framework.

A GUI framework is an almost-DSPL for GUIs, because it is a library that implements GUI-domain data structures that a programmer assembles to make a GUI --- there is no ``programming language'' for GUI building, only components for assembly, where assembly is written in host-language code.

A GUI framework is often ``married'' to its host language by means of a visual editor; Visual Basic and Visual C++ are standard examples. The visual editor tries to fill the gap between framework and DSPL by giving the user some guidance in GUI building.

A bottom-up DSPL evolves over time

Experienced programmers naturally become bottom-up DSPL designers, because over time they assemble a library of custom data structures, control structures (templates/macros), and linking code that they use over and over again to solve problems in the same domain. Eventually, the programs they write consist mostly of the components of their library, their control-structure macros, and pre-written linking code and less and less of new code. Finally, the programmer reaches the point where the host programming language acts merely as minimal ``glue code'' for connecting the library's components, control structures, and link code.

At this point, the host language plus the library is a bottom-up DSPL, because the library has become ``more important'' to problem solving than the host language itself. What has happened is this:

The programmer has extended the host language ``upwards'' towards the problems to be solved.
This makes the host(glue)-language-plus-its-library a bottom-up DSPL.

The custom-written library for the problem area is written in the host language, and it is oriented towards encoding ``domain-concepts-as-code'' (nouns as data structures, verbs as operations/control-structure templates) so that the scenarios discussed in the domain's DSL can be readily converted into programs in the bottom-up DSPL. Experienced programmers have good instincts for coding domain concepts as code and saving them as libraries. It is almost a matter of survival --- there is never enough time to build a new solution completely from scratch!

Many of the design ideas from object-oriented design can be applied in the development of a bottom-up DSPL, where classes and methods are used as nouns and verbs and adjectives. Languages like Scheme (via lambda abstraction and hygienic macros) and Smalltalk and Ruby (via blocks and macros) let a programmer easily define custom control structures and templates directly in source-code syntax. But any general-purpose language can serve as a host language.
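
As a small illustration, here is how the alarm-system scenario from earlier in the chapter might look as a bottom-up DSPL hosted in Python. The class Device and its methods (when, signal, record, alert) are hypothetical names invented for this sketch; a real library would model the domain in far more detail.

```python
class Device:
    """A noun of the domain: any named thing in the alarm network."""

    def __init__(self, name):
        self.name = name
        self.log = []                  # what this device has been told to do
        self.rules = []                # (event, action) pairs

    # verbs of the domain
    def record(self, room):
        self.log.append("record " + room)

    def alert(self, room):
        self.log.append("intruder in " + room)

    def when(self, event, action):
        """Register an action to run whenever this device sees the event."""
        self.rules.append((event, action))
        return self                    # allows chaining: d.when(..).when(..)

    def signal(self, event, room):
        """Announce an event; run every action registered for it."""
        for rule_event, action in self.rules:
            if rule_event == event:
                action(room)


camera, guard = Device("camera"), Device("guard")
detector = Device("movement detector")

# The scenario, written with the library; the host language is only glue.
detector.when("movement", camera.record).when("movement", guard.alert)
detector.signal("movement", "lobby")
print(camera.log, guard.log)
```

The two lines that state the scenario use almost nothing of Python beyond method calls, which is the sense in which the library has become ``more important'' than the host language.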

A bottom-up DSPL has its strengths and weaknesses also:


9.4 Designing a top-down DSPL

Here are some rough hints to get you started.

First, become an expert in the problem domain: learn the vocabulary --- nouns, verbs, and adjectives. Develop many scenarios (case studies) within the domain. Extract from the scenarios patterns or schemes of structure, behavior, computation.

You can approach top-down DSPL design as a little exercise in language design and develop the language in stages:

  1. data structures: What are the elements/data/actors (the ``nouns'' of the DSL) that participate in the computations? What are the operations (the ``verbs'') that transform the elements? How are elements collected/structured together?

  2. control structures: What patterns of operations and behaviors (the ``adverbs'' and ``adjectives'' of the DSL) commonly appear?

  3. component structures: How are partial solutions (the ``paragraphs'' of the DSL), which reference global names and other solution parts, connected/linked into a complete solution?
There is one last, critical question to ask:
Who are the intended users of the DSPL?
This is critical: the concepts expressed within the DSL must be natural and comprehensible to the language's users. The language must be friendly towards these persons' views of the domain. If the top-down DSPL is for non-expert programmers (e.g., like Excel or HTML), then you must de-emphasize assignments, imperative control, and data-structures and use more definitional concepts, like equations, arithmetic-style functions, and Prolog-style predicates, which appear more often in other areas of science and technology.

Most non-experts have difficulty with control structures of any form --- sequencing is about the most they can handle. Repetition is often a challenge.

Data structures must be kept simple, resembling real-life, physical structures (a sheet of graph paper, a chest of drawers, a filing cabinet, a dictionary) or resembling the structures that are fundamental to the problem domain (hallways, buildings, wiring bundles...).

Keep this directive in mind, always:

The programmers must ``see'' their domain and the actions within it in the DSPL!

If the DSPL's users are forced to code in notation and concepts that lie outside their problem domain, the users will get lost. (That's why non-programmers don't use Java as a DSPL for spreadsheet building!) In a serious development effort, you will design an IDE-like tool as well.


9.5 Developing a bottom-up DSPL

Some of the following was already stated but bears repeating:

Experienced programmers are the natural users of a bottom-up DSPL because they design it themselves, over time, as a library of custom data structures, control structures/templates, and link code. Eventually the host programming language acts merely as ``glue'' for connecting the components selected from the library: The programmer has extended the host language ``upwards'' towards the problems to be solved.

If you work in the same problem domain on a regular basis, you will do well to consciously organize your work into a library that evolves into a bottom-up DSPL. To plan for this, you can develop code that fits into a standard format for any programming language:

  1. data structures: You will write data structures and operations that model the nouns and verbs of the problem domain. The structures and operations are often coded as classes (or modules). You want a good match between each concept in the domain and a piece of code, so that you can convert somewhat mechanically from concepts to software. Your data-structures library is a framework for the problem domain.

  2. control structures: As you build more and more systems with your framework and the host language, note the patterns of coding you do --- what patterns of control (loops, patterns of function call, recursion, inductively defined processing of structures, iterators) appear over and over again. These patterns should be extracted as templates that you code as separate macros or higher-order functions. In future system building, you use your macro/higher-order function instead of recoding the same control pattern from scratch.

  3. component structures: No one builds a nontrivial program in one sitting, as one file. There are always pieces, or partial solutions, and each piece refers to ``global variables'' or other partial solutions not yet defined. Eventually, these pieces must be assembled or linked into a complete program. It is useful to have some linking or ``weaving'' program that connects together partial solutions into a final program. Many host languages have primitive devices for global-variable dereference and assembly (e.g., module/package import or interface declaration) but for a specific problem domain, you can often do much better than this by writing your own linker/weaver program that assembles files in a standard way. Do it.
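
As one possible illustration of the second item, here is a control pattern (inductively defined processing of a nested structure) extracted once as a higher-order function and then reused. The function name fold and the building data are assumptions made for this sketch.

```python
def fold(structure, leaf, combine):
    """Process an inductively defined structure: nested lists with leaves.

    leaf(x) handles one non-list element; combine(results) merges the
    results that were computed for the parts of a list.
    """
    if isinstance(structure, list):
        return combine([fold(part, leaf, combine) for part in structure])
    return leaf(structure)

# A building is a list of floors, a floor is a list of rooms,
# and a room is a list of sensor names (all hypothetical data).
building = [
    [["movement", "smoke"], ["movement"]],   # floor 1: two rooms
    [["smoke"]],                             # floor 2: one room
]

# The same control pattern, reused for two different jobs:
sensor_count = fold(building, lambda s: 1, sum)
movement_count = fold(building, lambda s: 1 if s == "movement" else 0, sum)
print(sensor_count, movement_count)          # prints 4 2
```

Once fold is in the library, new traversals of a building are written by supplying two small functions instead of recoding the nested loops.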

Once you start your library of data structures and control structures, force yourself to use the library (and improve it!) as much as possible, instead of writing from scratch something similar. You should program by selecting code from your library and ``gluing'' it together with minimal code from the underlying host language.

Your ultimate goal is to make the library ``stand-alone,'' where you write programs just with your library and with almost zero new glue code from the host language. This means you will use the underlying host language only as a ``trap door'' to ``escape'' from the problem-domain area to execute code from some other library or application.

Implicit in the previous paragraphs are the notions of framework and product line from mainstream Software Engineering: A framework is a data-structures library that often comes with sample program skeletons, written in the host language, that one studies to learn to use the framework. The programmer can modify an appropriate skeleton and fill in gaps with a mix of framework code and custom-written code. GUI libraries, server libraries, and protocol libraries are almost always organized as frameworks. You can find some Python-coded frameworks for mail servers and networking at http://effbot.org/librarybook/.

A product line is a family of programs based on one, fixed program skeleton, differing only in minor customizations. Consider a product line of cars all based on the same engine-chassis assembly. Now, consider the control software for the varieties of engine that can be installed in the car. The software is almost the same for all the engine variations, and each engine controller is generated from a standard parameterized program that is instantiated by data for the intended engine. Another example is Notepad/Wordpad/Word, which are based on the same word-processor structure but have different degrees of customizations for font choices, formatting, and file formats.

A product line of software is built from a library when one and the same program skeleton is used for all software products, and the gaps in the skeleton are filled by other library components. This is a kind of bottom-up DSPL, where the selections of data structures, control structures, and component structures are tightly restricted.
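
A very small sketch of the idea, in Python: one parameterized skeleton, instantiated by data to give each member of the product line. The function make_engine_controller, its parameters, and the engine data are invented for the example and do not describe any real engine controller.

```python
def make_engine_controller(cylinders, max_rpm):
    """Instantiate the one shared skeleton with data for a specific engine."""

    def controller(requested_rpm):
        # skeleton logic, identical for every engine in the product line:
        # keep the requested speed within the limits of this engine
        rpm = max(0, min(requested_rpm, max_rpm))
        return {"rpm": rpm, "cylinders": cylinders}

    return controller

# Two products from the same skeleton; they differ only in their data.
economy = make_engine_controller(cylinders=4, max_rpm=6000)
sport = make_engine_controller(cylinders=8, max_rpm=7500)
print(economy(7000), sport(7000))
```

Here the only freedom left to the ``programmer'' is the choice of parameter values, which is why a product line counts as a tightly restricted bottom-up DSPL.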


9.6 Hybrid DSPL

Most DSPLs use a mix of top-down and bottom-up concepts.

Mostly top-down

We might say that a DSPL is ``mostly top-down'' if it is designed top-down to express scenarios-in-code and has its own parser (or IDE editor) and interpreter/translator. But perhaps the DSPL also lets you call library components and execute code written in an implementation language.

To do this, we must add a ``trap door'' to the DSPL so that the execution of the DSPL program can be paused and the implementation-language code can be executed instead. Many scripting languages provide such a trap door, in the guise of an eval operation, which takes as its argument a string that holds executable code --- eval runs the code. Here are three useful forms of trap door in Python:

  1. An eval-like function that executes a string as Python script: For example, exec("y = 2 + x; print(y)") executes the program y = 2 + x; print(y), using the variables that are visible at the position where exec appears. If one does not want the executed string to affect existing variables, one can invent a namespace exclusively for the string's use, like this: exec("y = 2 + x; print(y)", {'x':0}).

    Here is a program that builds a string and runs it:

    x = 2;  y = 3;  z = 5
    invar = input("Type name of variable (x, y, or z) to zero out: ")
    if invar in ("x", "y", "z") :
        code = invar + " = 0"
    else :
        code = "pass"
    exec(code)   # at the top level of a script, this updates the named global variable

    The exec command can also execute the text read from an opened text file:
    with open("MyPythonProgram.py", "r") as handleToCodefile :  # open a readable file
        exec(handleToCodefile.read())  # read its contents and execute them


  2. Python's os module contains procedures for querying the operating system and performing simple OS commands, e.g.,
    import os

    cwd = os.getcwd()     # get current working directory
    if os.path.basename(cwd) == "MyPictures": # is the lowest-level dir "MyPictures" ?
         # then, move up one level to parent directory:
         os.chdir(os.pardir)
    print("Current path is ", os.getcwd())
    

  3. We can pause execution and execute any external program we wish:
    # run an external program from within Python code:
    import subprocess
    # general format:   subprocess.call(["program-name", "param1", "param2", ...])
    subprocess.call(["C:/Python26/Python.exe", "MyPythonPgm.py"])
    
A mostly-top-down DSPL will include some form of trap door so that bottom-up defined components written in the implementation language can be executed.
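
Here is a sketch of how a trap door might sit inside the interpreter of a tiny top-down DSPL. The language (one sensor declaration per line, plus lines starting with python: that escape to the implementation language) and the function name run are hypothetical. Note that exec runs whatever code it is given, so a trap door like this is reasonable only when the DSPL programs come from trusted authors.

```python
def run(program):
    """Interpret a tiny line-oriented DSPL (hypothetical syntax).

    Commands:   sensor NAME ROOM     declare a sensor in a room
                python: CODE         trap door: run CODE as Python
    """
    sensors = {}
    namespace = {"sensors": sensors}      # what trap-door code may see
    for line in program.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("python:"):
            # pause the DSPL and execute implementation-language code
            exec(line[len("python:"):].strip(), namespace)
        else:
            command, name, room = line.split()
            if command != "sensor":
                raise SyntaxError("unknown command: " + command)
            sensors[name] = room
    return namespace

result = run("""
sensor m1 lobby
sensor m2 vault
python: rooms = sorted(set(sensors.values()))
""")
print(result["rooms"])
```

The DSPL itself has no notion of sets or sorting; the one python: line borrows them from the implementation language without making the little language any bigger.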

Mostly bottom-up

A DSPL is ``mostly bottom-up'' if it consists of a library of host-language-coded components that model the problem domain. But perhaps the language is supposed to have some custom control- or linking-structures that express key domain concepts and look like they are built into the host language. (We want to avoid writing ugly dot-notation, like packageName.objectName.methodName(arg1, arg2, ...), each time we use a custom-coded domain concept.) This way, the programmer pretends that the domain concepts are built into the host language.

A good host language will give you a technique to add custom control or linking structures and/or abbreviations for ugly dot-notation. Here is a simple example:

Say that your problem domain has lots of solutions that use the phrase, ``repeat ACTION until CONDITION holds''. This is a control structure that should be added to the DSPL library. Some languages let you define higher-order functions (functions that take code/closures as parameters) in mix-fix keyword notation, like this:

def repeat(action)until(condition)holds :
    """executes the command,  action,  until expression,  condition,  is true"""
    action()         # do the action step
    if condition():  # finished ?
        return
    else:            # do it again:
        repeat(action)until(condition)holds 
This defines a function named repeat..until..holds. The function is used in a program like this:
...
repeat([x = x - 1])until([x == 0])holds
...
The brackets, [..], are quoting the code .., that is, constructing a closure holding the code. Functional languages, like Scheme and Haskell, support this approach, as do Ruby and Smalltalk to a lesser degree.
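
Python has no mix-fix notation, but the same idea can be approximated with an ordinary higher-order function whose arguments are closures. The following is a minimal sketch in Python 3; the names repeat_until and countdown are hypothetical, and a loop replaces the recursion because Python does not optimize tail calls:

```python
def repeat_until(action, condition):
    """executes  action()  repeatedly until  condition()  is true;
       as in the sketch above, the action runs at least once"""
    while True:
        action()            # do the action step
        if condition():     # finished ?
            return

def countdown(start):
    """counts  x  down from  start  to 0 and returns the values seen"""
    x = start
    trace = []
    def step():             # closure playing the role of  [x = x - 1]
        nonlocal x
        x = x - 1
        trace.append(x)
    def done():             # closure playing the role of  [x == 0]
        return x == 0
    repeat_until(step, done)
    return trace

print(countdown(3))
```

The call repeat_until(step, done) is still ordinary function-call syntax; what the mix-fix notation adds is only the keyword-like appearance.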

For older programming languages, the traditional way to add custom control structures is with a macro processor (``preprocessor''). A macro processor is a program that reads as input a program in the host language that has the custom structures mixed into the code. The macro processor locates the occurrences of the custom structures and replaces them with the instructions in the host language that perform the intended operations.

C's preprocessor is a standard but not-too-exciting example. A segment of C code like this,

#define PI 3.14159
#define Double(x)  (x + x)
// now,  PI  and  Double  act like they are built-in C functions:
y = Double(PI * 5) ;
defines two macros, PI and Double. Double looks like a function and is called like one. When the above code is input to C's preprocessor, this text is the output:
y = (3.14159 * 5 + 3.14159 * 5) ;
The macro definitions are removed, and the calls are replaced by C-text, giving a program in pure C.

When a macro is called, its arguments are text and not computed values! At a macro call, the text argument is bound to the parameter, and the text is inserted for occurrences of the parameter in the macro body. The text computed by the macro's body is copied back in place of the macro call. In the example, y = Double(PI * 5) is rewritten to y = (PI * 5 + PI * 5), which is rewritten to y = (3.14159 * 5 + PI * 5), which is rewritten to y = (3.14159 * 5 + 3.14159 * 5). The example shows why the macro processor must be a separate program, run first, before the parser, interpreter or translator.

There is a preprocessor, called GPP, that can be used stand-alone to process any program that contains C-like macros. Like C's preprocessor, GPP requires that a macro call look like a function call, of the form, MACRONAME(ARG1, ... ARGn). The m4 macroprocessor lets its user write macro definitions whose calls look somewhat like the mix-fix notation seen in the previous repeat..until..holds example.
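
The text-substitution steps described above for Double and PI can be imitated in a few lines of Python 3. This is only a toy model of the rewriting, not C's actual preprocessor; the function name expand is hypothetical, and the pattern assumes that a macro argument contains no parentheses:

```python
import re

def expand(line):
    """Expands  Double(ARG)  and  PI  by pure text substitution."""
    # Double(ARG)  ==>  (ARG + ARG);  ARG is copied as text, not computed
    line = re.sub(r"\bDouble\(([^()]*)\)", r"(\1 + \1)", line)
    # PI  ==>  3.14159
    line = re.sub(r"\bPI\b", "3.14159", line)
    return line

print(expand("y = Double(PI * 5) ;"))
print(expand("y = Double(i++) ;"))
```

The second call shows the consequence of binding text rather than values: the argument i++ is copied twice into the expansion, so the resulting C code would increment i twice.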

In Section 9.7, we will see how to use a language's regular-expression library to code a simple but useful macro processor.

Here are some references for existing macro processors:

Ruby supports a ``block'' construction (written with braces or do..end, and playing the role of the [..] notation above) that makes it possible to code simple customized control structures directly in Ruby. There are some Ruby-implementation approaches at http://weblog.jamisbuck.org/2006/4/20/writing-domain-specific-languages


9.7 Implementing a DSPL

Perhaps this is obvious, but the first question to ask is: What language is understood by the hardware platform that you use in the problem domain? If the hardware language lets you code a parser and interpreter, then you can readily implement a top-down DSPL on the hardware. (Note: a hardware platform might ``understand'' several languages, if there already exist quality interpreters or compilers for the languages on the hardware.)

If the hardware language is not expressive enough, or it is limited in space and speed, you must prototype the top-down DSPL interpreter in a different language and then convert the interpreter into a compiler that translates into the hardware language. Do this as a last resort, since compiler development and maintenance are expensive tasks.

In the case of a bottom-up DSPL, you should select a host language that either (i) is directly understood by the hardware or (ii) has an efficient compiler from the host language to the hardware language. In all cases, the chosen host language must support components and libraries, so that you can extend the host language bottom up.

Although it is almost never done, it is an excellent project to implement a DSPL one way and then use the acquired knowledge to implement it the ``inverse way.'' That is, if you designed a DSPL top-down, try to extract from its interpreter the parts that become components for a bottom-up implementation. Dually, if you first built a bottom-up DSPL, next use its components as the ``logic'' within a top-down interpreter implementation. The second version of the language might be the one that you prefer!

Top-down DSPLs and trap doors

If you have designed a top-down DSPL, you should add a ``trap door'' so that code in the implementation language can be embedded in the programs you write. The simplest way to do this is to use an implementation language that has an eval/exec operation.
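
As a minimal sketch of these two operations in Python 3 (the chapter's own code is Python 2): eval computes the value of an expression held in a string, and exec runs statements held in a string. Both accept a dictionary that serves as the namespace for the code. The names state, namespace, and extra are hypothetical:

```python
state = {"count": 3}     # stands in for an interpreter's internal state

# eval  computes the value of an expression given as text:
value = eval("count * 2", {"count": state["count"]})
print(value)

# exec  runs statements given as text.  The statements can see  state
# because it is placed in the namespace dictionary passed to  exec:
namespace = {"state": state}
exec("state['count'] = state['count'] + 1\nextra = 'new'", namespace)
print(state["count"])        # the shared dictionary was updated
print(namespace["extra"])    # new variables land in the namespace dictionary
```

Passing a dictionary controls which names the embedded code sees at the start, but it is not a security sandbox, so a trap door of this kind should only run code from a trusted programmer.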

Here is a small example. Perhaps you have designed a game-app for a cell phone, where a child can tell birds to eat bugs. The game has a GUI front-end, but the mouse moves and clicks on the GUI generate code in this syntax:

===================================================

CL : CommandList       A : Atom
C : Command            S : String

CL ::=  C |  C . CL
C  ::=  A1 eats A2 | do S
A  ::=  bird  |  bug
S  is a quoted string

===================================================
An example program that the GUI might generate is
bird eats bug.
bird eats bug.
bug eats bird
The game has limited functionality (haha), but notice the do command, which is a trap door that lets a programmer insert Python code that directly manipulates the language's interpreter, say, like this:
bird eats bug.
do "census['cat'] = 1\ncensus['bird'] = 0\nprint 'uh oh!'"
The string holds Python code:
census['cat'] = 1
census['bird'] = 0
print 'uh oh!'
Here is the interpreter for the bird-cage language:

===================================================

"""Interpreter for mini top-down DSL for bird-cage domain of birds and bugs.
   Includes trap-door operation,  do S,   for embedding Python source code.  
   Source language syntax to be parsed:
     CL : CommandList       A : Atom
      C : Command           S : String 
           CL ::=  C |  C . CL
           C  ::=  A1 eats A2 | do S
           A  ::=  bird  |  bug
           S  is a quoted string

   Operator-tree structures resulting from the parser:
      CLIST ::=  [ CTREE* ]
      CTREE ::=  ["eat", A1, A2 ]  |  ["do", S]
      A     ::=  "bird"  |  "bug"  
      S     ::=  a quoted string
"""
# Global variable: remembers count of entities in bird cage:
census = {"bird": 9,  "bug": 99}

def interpretCLIST(p) :
    """interprets CLIST  p"""
    for command in p :
        interpretCTREE(command)

def interpretCTREE(c) :
    """interprets CTREE  c"""
    operator = c[0]
    if operator == "eat" :
        eater = c[1] 
        lunch = c[2]
        if census[eater] > 0 and census[lunch] > 0 :
            census[lunch] = census[lunch] - 1
    elif operator == "do" :  # trap-door ``eval'' operation ---
        exec(c[1], globals())  # executes  c[1]  as python code in the
                      # interpreter's global namespace.  Can affect  census,
                      # add new global variables to interpreter's namespace,
                      # print trace information, etc.
    else :  
        crash("invalid command")

def crash(message) :
    print message + "! crash! core dump: ", census
    raise Exception  

def main(program) :
    """interprets the operator tree,  program"""
    interpretCLIST(program)
    print "final census =", census

===================================================
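
The interpreter above expects its input already parsed into operator trees. A minimal sketch of such a parser, written in Python 3 (the interpreter itself is Python 2), might look like this. The names tokenize and parse are hypothetical, and the sketch assumes that the quoted string S follows Python's string-literal syntax, so that ast.literal_eval can decode escapes like \n:

```python
import ast
import re

# one token: a double-quoted string, a word, or a period
TOKEN = re.compile(r'\s*("(?:[^"\\]|\\.)*"|[A-Za-z]+|\.)')
ATOMS = ("bird", "bug")

def tokenize(text):
    """splits  text  into a list of tokens"""
    text = text.strip()
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unreadable input at position %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse(text):
    """parses a CommandList; returns a CLIST (a list of CTREEs)"""
    tokens = tokenize(text)
    clist = []
    i = 0
    while True:
        rest = tokens[i:i+3]
        if len(rest) >= 2 and rest[0] == "do" and rest[1].startswith('"'):
            # ast.literal_eval decodes the quoted string, including  \n
            clist.append(["do", ast.literal_eval(rest[1])])
            i = i + 2
        elif (len(rest) == 3 and rest[0] in ATOMS
              and rest[1] == "eats" and rest[2] in ATOMS):
            clist.append(["eat", rest[0], rest[2]])
            i = i + 3
        else:
            raise ValueError("invalid command at token %d" % i)
        if i == len(tokens):       # all tokens consumed
            return clist
        if tokens[i] != ".":       # commands are separated by periods
            raise ValueError("expected  .  at token %d" % i)
        i = i + 1

print(parse("bird eats bug. bug eats bird"))
```

Following the grammar, a period separates commands and does not terminate them, so the sketch rejects a trailing period.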
Here are some sample uses of the interpreter:
python -i top.py
>>> main([["eat", "bird", "bug"]])
final census = {'bird': 9, 'bug': 98}

>>> main([["eat", "bird", "bug"], ["do", "census['cat'] = 1\ncensus['bird'] = 0\nprint 'uh oh!'"]])
uh oh!
final census = {'bird': 0, 'bug': 97, 'cat': 1}
The do command lets a programmer escape from the limited functionality of the DSPL and use the operations of the implementation language.
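
Following the earlier suggestion to reimplement a DSPL the ``inverse way,'' the ``eat'' logic of the interpreter can be extracted into a component for a bottom-up version of the bird-cage language. Here is a minimal sketch in Python 3; the class BirdCage and its method eats are hypothetical names, not part of the interpreter above:

```python
class BirdCage:
    """Component extracted from the interpreter's ``eat'' logic.
       (Hypothetical class, for illustration only.)"""
    def __init__(self, birds=9, bugs=99):
        self.census = {"bird": birds, "bug": bugs}

    def eats(self, eater, lunch):
        """eater  eats one  lunch,  if both are present in the cage"""
        if self.census[eater] > 0 and self.census[lunch] > 0:
            self.census[lunch] = self.census[lunch] - 1
        return self      # returning self lets calls be chained

cage = BirdCage()
cage.eats("bird", "bug").eats("bird", "bug")
# no trap door is needed: host-language code is written directly
cage.census["cat"] = 1
print(cage.census)
```

In the bottom-up version the trap door disappears, because every program is already host-language code; what is lost is the restricted syntax that the GUI could generate safely.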

Bottom-up DSPLs and macro expansion

If you have developed a bottom-up DSPL, you should also define control-structure patterns and linking patterns for your DSPL library. It is always best to use host-language facilities to do this.

Some host languages (e.g., Scheme and C) come with their own macro processors. Others (e.g., Smalltalk and Ruby) have flexible procedure-call syntax for defining new patterns. Others (e.g., Perl, PHP, Python, Ruby) supply regular-expression libraries that have powerful pattern-matching operations that you can use to write your own macro processor.

Here is an example of using regular-expression string matching in Python. We use Python's regular-expression module, re, to define a pattern, match the pattern in a string, and replace it. The comments in the code explain how this operates:

===================================================

import re    # re  is the module of regular-expression operations

# Here is a pattern that matches strings of form,  
#    @DOUBLE alpha END
# where  alpha  is some substring that holds no occurrences of  @ :
#     "(\\s*)@DOUBLE\\b([^@]*?)\\bEND\\b"
# where 
#     \\s  means a whitespace character
#     \\b  means a word boundary
#     E*   means match  E  zero or more times as much as possible for success
#     E*?  means match  E  zero or more times as little as possible for success
#     [^c] means match any character that is NOT character  c
# The parens mark _groups_ that are used below.

# p  is a string-matching object compiled from the pattern string:
p = re.compile("(\\s*)@DOUBLE\\b([^@]*?)\\bEND\\b")

# try this multi-line example:
source = """
x = 0
x =  @DOUBLE x END
print x
"""
print "source text ="
print source

# search for compiled pattern  p  in  source:
m = p.search(source)
print

# if the match succeeds,  m  is an object; else  m = None
print "match result =", m
# m  holds a list of substrings that matched parenthesized groups in the pattern:
print "matched groups =", m.groups()

# m  also holds the start and end indexes of the matched string:
print "span of matched text =", m.span()
# the start and end indexes can be referenced individually, too:
print "matched text =", source[m.start() : m.end()]
print

# let's replace the matched string by something else:
matches = m.groups()
source = source[:m.start()]  \
       + matches[0] + "(2 * " + matches[1] + ")" \
       + source[m.end():]
print "updated text ="
print source

# We have completed a simple macro-expansion of  "@DOUBLE alpha END"
# into  "(2 * alpha )", preserving any leading spacing

===================================================
Here is the output from the above script:
source text =

x = 0
x =  @DOUBLE x END
print x


match result = <_sre.SRE_Match object at 0x7ff3d6e0>
matched groups = ('  ', ' x ')
span of matched text = (10, 25)
matched text =   @DOUBLE x END

updated text =

x = 0
x =  (2 *  x )
print x
The example shows that patterns can be complex. There is a tutorial on writing patterns at http://docs.python.org/howto/regex.html and there is a mostly complete listing of pattern options at http://docs.python.org/library/re.html.
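
The manual splice in the script above (slice, expand, concatenate) can also be written with the sub method of a compiled pattern, which accepts a function that maps each match object to its replacement text. Here is a minimal sketch in Python 3; the function name expandDOUBLE and the two-call source text are hypothetical:

```python
import re

p = re.compile(r"(\s*)@DOUBLE\b([^@]*?)\bEND\b")

def expandDOUBLE(m):
    """m  is the match object for one macro call; returns its expansion"""
    return m.group(1) + "(2 * " + m.group(2) + ")"

source = "x = 0\nx =  @DOUBLE x END\ny = @DOUBLE x + 1 END\n"
# sub  replaces every non-overlapping match, scanning left to right:
expanded = p.sub(expandDOUBLE, source)
print(expanded)
```

Note that sub makes a single left-to-right pass, so a call nested inside another call's argument would need a second pass; this is one reason the macro processor below repeats its search until no pattern matches.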

We now use the ideas in the example to write a macroprocessor in Python that searches for macro-call patterns and replaces them with expansions. Here are the two macro calls the processor will expand:

@REPEAT Code FOR Expr TIMES  ===>  newvar = Expr
                                   while newvar > 0 :
                                       Code
                                       newvar = newvar - 1

@DOUBLE Expr END  ===>  ((Expr) * 2) 
Each macro call on the left is coded as a pattern string, and each translation is done by a Python-coded handler function. The macro processor's main data structure is a list of (compiled-pattern, handler-function) pairs.

Here is the macro processor:

===================================================

"""Simplistic macroprocessor based on regular expressions.

   main data structure:
      macrotable : list of (COMPILED_PATTERN, HANDLER) pairs

   Example:
   macrotable = [ (re.compile("(\\s*)@REPEAT\\b(\\s*)([^@]*?)\\bFOR\\b([^@]*?)\\bTIMES\\b"),
                   translateREPEAT),
                  (re.compile("@DOUBLE\\b([^@]*?)\\bEND\\b"), translateDOUBLE) ]

      holds these two macro definitions:
         indent1 @REPEAT indent2 alpha FOR beta TIMES  
                        =>  translateREPEAT(indent1,indent2,alpha,beta)
         @DOUBLE alpha END  =>  translateDOUBLE(alpha,)

   Compiled patterns, as written above, match macro-call symbol, @,
   followed by keywords (which are required to be separate words by  \\b )
   such that included text arguments do not include any call symbols, @.  

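   Note that  E*?  denotes the minimal match of  E*  such that
   the overall pattern match succeeds.  Also, because an argument may not
   contain  @,  a call that encloses another call cannot match until the
   inner call has been expanded.  Thus, the macro processor expands macro
   calls inside-out, so that nested calls are never confused.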

   The pattern for @REPEAT  also records the amount of indentations via  (\\s*).

   Macro-processor algorithm:
   read  source
   repeat until no more macro matches:
       search  source  for each compiled pattern in  macrotable
       if successful match,
          then call accompanying  handler  function,
            which assembles appropriate translation
            insert translation in place of matched pattern in  source
   write source
"""
### This portion should be a separate module that holds the translation
### functions.  It is embedded here for simplicity.

#GENSYM function:
var_count = 0   # count of new names generated for expanded macros
def genNewVar() :
    """genNewVar is a gensym function, generating unique new names
       returns: a string of form, "_varN", where N is a unique nonneg int
    """
    global var_count
    newvar = "_var" + str(var_count)
    var_count = var_count + 1
    return newvar


def translateREPEAT(args) :
    """translateREPEAT  expands this macro call:
       indent1 @REPEAT 
               indent2 Code
               FOR Expr TIMES  =into=>  indent1 newvar = Expr
                                        indent1 while newvar > 0 :
                                                indent2 Code
                                                indent2 newvar = newvar - 1

       where indent1 = args[0]  and  indent2 = args[1]
             Code = args[2]     and  Expr = args[3]
       (indent1  and  indent2  are leading white-space)
       returns: ans, a string holding the macro-expanded call
    """
    indent1 = args[0]
    indent2 = args[1]
    bodycode = args[2]
    exprcode = args[3]
    # the call to REPEAT is replaced by this python code, as documented above:
    newvar = genNewVar()
    ans = indent1 + newvar + " = " + exprcode  \
          + indent1 +  "while " + newvar + " > 0:"  \
          + indent2 + bodycode  \
          + indent2 + newvar + " = " + newvar + " - 1"
    return ans


def translateDOUBLE(arg) :
    """translates   @DOUBLE(arg,) =into=>  '((arg) * 2)' """
    ans = "((" + arg[0] + ") * 2)"
    return ans

### END OF HANDLER-FUNCTION MODULE

### MACRO PROCESSOR CONTROL ALGORITHM:

import re   # import regular-expression module
# initialize macrotable:
macrotable = [ (re.compile("(\\s*)@REPEAT\\b(\\s*)([^@]*?)\\bFOR\\b([^@]*?)\\bTIMES\\b"),
                translateREPEAT), 
               (re.compile("@DOUBLE\\b([^@]*?)\\bEND\\b"), translateDOUBLE)
             ]

# read source:
import sys
if len(sys.argv) < 2 : 
    inputfilename = raw_input("Type input file to expand: ")
else :
    inputfilename = sys.argv[1]
input = open(inputfilename, "r")
source = input.read()
input.close()  

# replace all macro calls:
still_matching = True
while still_matching :
    still_matching = False
    for (pattern, handler) in macrotable :
        match = pattern.search(source)
        if match :  # != None
            replacement = handler(match.groups())
            source = source[:match.start()] + replacement + source[match.end():]
            still_matching = True

# write source (e.g., input  test.py  is written to  testout.py):
import os.path
(root, extension) = os.path.splitext(inputfilename)  # works even if no ".py"
outputfilename = root + "out" + ".py"
output = open(outputfilename, "w")
output.write(source)
output.close()

print
print "Contents of " + outputfilename + ":"
print source

===================================================
Say we have this file, test.py, whose contents are:
x = 0
@REPEAT
    x = @DOUBLE x + 1 END
    @REPEAT
        pass
    FOR 2 TIMES
FOR 3 TIMES
print x
When we use the macroprocessor to rewrite the file (python macrop.py test.py), we get this report:
Contents of testout.py:
x = 0
_var1 =  3 
while _var1 > 0:
    x = (( x + 1 ) * 2)
    _var0 =  2 
    while _var0 > 0:
        pass
    
        _var0 = _var0 - 1

    _var1 = _var1 - 1
print x
All macro calls are expanded.
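
The inside-out expansion order claimed in the macro processor's documentation can be checked on a nested call. The following Python 3 sketch reuses the @DOUBLE pattern and the same search-and-splice step, but records every intermediate text; the function name expandAll is hypothetical:

```python
import re

DOUBLE = re.compile(r"@DOUBLE\b([^@]*?)\bEND\b")

def expandAll(source):
    """repeatedly expands the first matching @DOUBLE call until none remain;
       returns the list of intermediate texts, ending with the final one"""
    steps = [source]
    match = DOUBLE.search(source)
    while match:
        replacement = "((" + match.group(1) + ") * 2)"
        source = source[:match.start()] + replacement + source[match.end():]
        steps.append(source)
        match = DOUBLE.search(source)
    return steps

for text in expandAll("y = @DOUBLE @DOUBLE x END END"):
    print(text)
```

The outer call cannot match at first, because its argument would have to contain the @ of the inner call. The inner call is therefore expanded first, and the outer call matches only on the next search.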


9.8 Further reading

It is tough finding good tutorial material about DSL development. Here are two (dated) background papers.

When and how to develop domain-specific languages by M. Mernik, J. Heering, A.M. Sloane, CWI, Amsterdam, 2005.

Domain-Specific Languages: An Annotated Bibliography by A. van Deursen, P. Klint, and J. Visser, 2000.

There is also a text that might make some sense to you at this point: Domain-Specific Languages by Martin Fowler.