M.G.J. van den Branda,b,
P. Klinta,b,
C. Verhoefa
aUniversity of Amsterdam,
Programming Research Group
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
bCentre for Mathematics and Computer Science
P.O. Box 4079, NL-1009 AB Amsterdam, The Netherlands
markvdb@cwi.nl, paulk@cwi.nl, x@wins.uva.nl
Term rewriting has a large potential for industrial applications, but these applications are always larger than one could ever dream of: huge sets of rewrite rules and gigantic terms to rewrite pose interesting challenges for implementors and theoreticians alike. We give a brief overview of the generation of term-rewriting-based tools as done in the ASF+SDF Meta-Environment and then we sketch two major applications of term rewriting: transformation of legacy COBOL systems and compilation of ASF+SDF to C. Based on these experiences we suggest the study of topics that could further advance the use of term rewriting in industrial applications: persistent term databases, generalized LR parsing versus parallel term rewriting, and coordination languages versus strategy languages. It will turn out that we have an ``alien'' view on research in term rewriting: properties like confluence and termination are of very limited use when selling term rewriting to industry.
Term rewriting is omnipresent in computer science. It has been used
to formalize the notion of computability, to study computational
procedures, to implement functional languages, to analyze and
implement abstract data types, to decide word problems, to proof
theorems, and so on [41]. In the early eighties an ESPRIT
project was started to support the generation of interactive
programming environments from language specifications using term
rewriting as one of the basic implementation paradigms [30].
This research has resulted in the design of the ASF+SDF formalism and
its support environment, the ASF+SDF Meta-Environment. ASF stands for Algebraic
Specification Formalism [6]. It is an implementation of
conditional term rewriting [37,8] possibly containing
negative premises [38,46,47,27]. It also uses a form of
prioritized rewriting [3,45]: there are default rewrite
rules that apply if other rewrite rules do not match. SDF stands for
Syntax Definition Formalism [29]. In fact, SDF is the
part to conveniently define the signature
of a term rewriting
system and in ASF this signature is used to specify the rewrite
rules R. Terms are many-sorted and varyadic constructor functions
are available for defining lists.
The formalism ASF+SDF and the ASF+SDF Meta-Environment have been described in a variety of papers and books [5,39,22]. In the context of this paper, the most significant properties of the ASF+SDF technology are:
The ASF+SDF view on the world is grammar-centric and can be characterized by the slogan
Every expression has a grammar, so it can be represented as a term .
All languages are treated equal, be it the language of truth values or the programming language COBOL. Simple Boolean expressions and COBOL programs are all treated as terms. The major surprise is that this paradigm scales up so well to large industrial applications, like:
An overview of industrial applications of the ASF+SDF Meta-Environment can be found in [9]. Other large, non-commercial, applications of the ASF+SDF technology are:
In Section 2 we briefly discuss issues that have to be taken into account when cooperating with IT industry. Next, we sketch the global approach we use for the automatic generation of tools and applications (Section 3). Then, we highlight two major applications: the construction of analysis and transformation tools for COBOL (Section 4) and the ASF+SDF compiler, which is written in ASF+SDF (Section 5). Based on these experiences we suggest in Section 6 the study of topics that could further advance the use of term rewriting in industrial applications: persistent term databases, generalized LR parsing versus parallel term rewriting, and coordination languages versus strategy languages. It will turn out that we have an ``alien'' view on research in term rewriting: properties like confluence and termination are of very limited use when selling term rewriting to industry.
We advocate the use of formal techniques in a limited application area: the construction of grammar-centric tools based on term rewriting. This is a modest goal compared to solving software engineering problems in general by means of formal techniques. The latter requires major paradigm shifts and re-education of the users of these formal techniques. The former limits the use of formal techniques to the tools that are being developed and these tools can be fitted in traditional software engineering frameworks. Even with this limited ambition, we have to face serious technology transfer problems.
Capers Jones states that the main cause of slow technology transfer appears to be based in the normal human psychology: the human mind establishes a set of paradigms about methods, tools, and technologies. Once such a paradigm is established, all evidence to the contrary tends to be discarded. As a result, the time from first creation of an idea to widespread deployment is usually no less than 14 years [35]. Perhaps it is human to fail to understand concepts when they are approached from a different viewpoint, in a different language. This phenomenon applies to using term rewriting in the IT industry as well. It is, however, interesting that the speed of adoption for new products seems to be increasing [48].
We give an indication of the time we measured from first contacts between CWI/University of Amsterdam and IT industry until the moment that a solid relation with the begin of technology transfer was established.
Adequate functionality, robustness and efficiency are, of course, important for industrializing new technologies. In our experience, however, the key inhibitor is the long time needed for technology transfer, industrialization and commercialization. The danger being that either researchers or industry (or both!) loose their interest in the transfer. This topic is extensively discussed in the excellent textbook [34].
In a first approximation, the ASF+SDF approach is based on the generation of tools as shown in Figure 1. Starting points are a context-free grammar defining the lexical, concrete, abstract and unparsing (prettyprinting) syntax of a language, analysis and transformation rules defining what information is to be extracted from programs and what changes are to be performed on them, and coordination rules defining how the analysis and transformation rules have to be glued together in the tool that is to be generated. These definitions are fed into a tool generator that produces the desired tool.
The generated tool takes source text in the defined language as input, applies the analysis and transformation rules to it as prescribed by the coordination rules and produces analysis results or complete (transformed) programs.
In a second approximation, the structure of the ASF+SDF Meta-Environment is as shown in Figure 2. At the bottom we see, once more, a generated tool but now its internal structure is shown. At the top, we see an interactive development and prototyping environment in which specifications can be edited and executed. In older versions of the ASF+SDF Meta-Environment the development and prototyping environment and the generated tools were closely intertwined. In newer versions, they are completely separated so that stand-alone tools can be generated.
Syntax, unparsing rules, and the specification modules
are all written in ASF+SDF. For the coordination rules various formalism
are in use (SEAL [42], TOOLBUS scripts [7]).
The modules Si are either created manually or they may be the result
of a generation process (this is not further detailed in the figure).
The generation process consists of the following steps. The context-free
grammar (possibly extended with unparsing rules) is transformed into
a parser and unparser by, respectively, a parser generator and an
unparser generator. The analysis and transformation rules are captured
by specification modules
. They are compiled to
executable components (rewriting systems)
by the
ASF+SDF compiler. The coordination rules are compiled to an executable
coordinator. The picture is completed by a number of standard
components
that provide common services like
persistent storage and syntax-directed editing. For instance, for the
user-interface of the new ASF+SDF Meta-Environment we use wish (a
user-interface builder) in combination with dot (a graph drawing
tool) as basic, off-the-shelf, components that can be effortlessly
integrated in the open TOOLBUS architecture using the component-based
software engineering philosophy described in [40].
Our first example of generated tools are renovation factories as needed for the analysis and remediation of legacy systems. Such factories are generated tools with the architecture as already shown in Figure 2. All the ingredients of such factories are generated from language specifications. Although our approach is completely generic, we will illustrate it here by way of examples for the COBOL language. We give now first a quick overview of the main ingredients of software renovation factories, then we discuss applications, and finally we give some measurements.
These problems do not occur when Generalized LR parser technology is used. The key property of GLR parsers is that they can parse the complete class of context-free languages. As a result, the composition of two grammars (an essential operation for modular structuring) can again be parsed by a GLR parser. In other words, GLR parsing is essential for parsing modular grammars.
Modular grammars make it possible to place dialect specific parts of a language in separate modules. When we need them we can use them at will, without creating a maintenance problem. We can also easily add new grammar rules to existing ones without having to solve conflicts: they are solved by the GLR algorithm.
In Section 6.2 we give a more detailed description of GLR parsing. Parsing technology in reengineering and its problems are further discussed in [19].
A lesson that we learned from IT industry is that these patterns should be designed very carefully. The ideal situation is that the patterns for recognizing code resemble the code itself as much as possible.
Given the strong correspondence between concrete and abstract syntax in ASF+SDF (and the use of GLR parsing, see below), we can implement this requirement quite effectively. Given a grammar (in SDF) for some language we generate a so-called native pattern language [55] for it. This contains, amongst others, appropriately named variable declarations for each language construct. An example of a native COBOL pattern is shown in Figure 3. As can be seen, the COBOL pattern uses all the keywords that are already available in the COBOL standard [1]. Variables like procedure-name-1 and Statement-1+ act as placeholders in the pattern. For more details we refer to [55].
The generation of these patterns is facilitated by the use of GLR parsing. In [51] it is observed that grammars are often structured to permit ease of parsing and not ease of conceptual understanding, i.e., they are parser-oriented. In many cases, the grammar has to be expressed in an unnatural form to satisfy the restrictions that the parser generator imposes, such as one token look-ahead, and certain conflicts during parsing (see Section 6.2 or [44]). This same observation was made in the design of SDF [29].
The conclusion that one can draw from these observations is that it is a bad idea to use such a parser-oriented grammar for constructing other functionality like, e.g., the generation of a native pattern language. Since we use GLR parsing technology, there are no parser-related restrictions on the form of the grammar rules, we can write ``natural'' grammar rules, and therefore, the underlying structure of the native patterns is natural as well.
To overcome this problem, we generate for a given grammar generic analysis and transformation functions [15] that can be specialized for different applications. For each sort in the grammar we generate a separate analysis and traversal function. The names of these functions are derived from the sort name used in the grammar. Here, again, we generate functionality that is as natural as possible for the people who need to work with these generated analysis and transformation functions in a renovation factory. As default behaviour, the generic analysis function performs a generic calculation for each subtree encountered, whereas the generic transformation function performs the identity transformation. This default behaviour can be easily redefined for each sort of subtree. In this way, it is only necessary to define what should happen for certain subtrees independently of where they occur.
An example may clarify the issues involved. Suppose we want to change
patterns of the form shown in Figure 3 into more
structured COBOL fragments not containing GO TO statements.
For this purpose we want to define a traversal function foo.
Our generator for analysis and transformation functions will produce
for each sort S in the COBOL grammar the syntax and
(transformation) semantics for the function foo_S. Its default
behaviour is the identity transformation. In our example we only need
to specify the behaviour on COBOL paragraphs (e.g.,
foo_Paragraph) as is done in Figure 4. This simple
transformation will transform all GO TO statements that are in
fact while loops. The idea is that only for the sort Paragraph
there is non-default behaviour: a change should only be made when this
specific GO TO pattern is matched. While loops are represented
in COBOL by way of the PERFORM UNTIL statement. Elimination
of GO TO statements in COBOL is further described
in [18].
The productivity gains using these tools can be considerable. On the one hand, we measured a productivity of 300 production rules per month for writing grammars by hand [17]: this involved the manual extraction of a COBOL grammar from language reference manuals. On the other hand, we measured a productivity of 3000 (generated) production rules using a CALE factory in one afternoon [54]: this involved the automatic extraction of the grammar for a language used by a telecommunications company from the source code of a compiler for that language. Obviously, it is no longer possible to process such large grammars without appropriate tool support.
Once we are convinced that the assembly lines have the intended behaviour, we start with the generation of a production environment. We compile all components to C using the ASF+SDF compiler (see Section 5). Finally, we combine all the components as discussed in Section 3 resulting in generated tools that make extensive use of rewriting.
As it comes to software renovation factories, we have automated some changes that are needed for correcting the Year 2000 problem. In [55] native patterns are used to remedy a difficult to recognize leap year problem. Another example is the rejuvenation of old systems. In [52] we report on transformations for the restructuring of COBOL/SQL systems. In [18] the control flow of a COBOL/CICS legacy system is normalized. This restructuring improves maintainability, improves performance, and enables connection to the Internet/Intranet using an commercial-off-the-shelf Java to CICS Internet gateway.
For the unparser we generate 700 production rules for additional syntax (needed by the unparser) and 700 rewrite rules. Usually, only a few of these generated rules have to be redefined by hand in order to satisfy specific formatting requirements.
The COBOL programs that we transform have at the moment sizes between small (100 LOC) and medium (5-10 KLOC). A transformation that matches 350 times in a 5000 LOC program takes less than a minute to transform.
Our second example of a generated tool is the ASF+SDF compiler itself.
The first goal of the ASF+SDF compiler is to provide a flexible compilation scheme for the incremental compilation of large modular specifications. The second goal is to generate C code which can process large terms in an efficient way regarding both execution time and memory usage. A third goal is the generation of stand-alone tools (generated C code) that can be shipped to IT industry, without having to ship the ASF+SDF Meta-Environment as well (see the example of the Risla flattener at the end of this section). We have used our own tools, techniques and formalism while developing the ASF+SDF compiler. This approach helps achieving the first goal mentioned above, since it makes us the first guinea pigs of our own compiler, The ASF+SDF specification of the compiler is circa 150 pages (1748 rewrite rules, see Table 1).
There are four aspects of ASF+SDF that have to be taken into account during compilation:
).
The translation of ASF+SDF specifications to C code is mainly influenced by the above four aspects of ASF+SDF. For instance, the fact that ASF+SDF is based on innermost rewriting is a pleasant convenience because of the call-by-value mechanism of C. This enables us to translate the right-hand side of rewrite rules directly to C function calls. The first aspect (modular structure) on the other hand implies extra work when generating C code. For each ASF+SDF function a separate C function will be generated. The corresponding equations are translated to conditional statements in this C function. This means that all equations have to be available before the C function can be generated and thus a reshuffling of equations is needed. One consequence of the distribution of equations over several modules is that modules can not serve as incremental compilation units. We have chosen functions as the incremental compilation unit. The C code generated for the default equations should always be executed after the code that is generated for the ordinary equations. List matching patterns are either transformed into ordinary term matching, if the pattern contains no or exactly one list variable, or the patterns are translated to (nested) while-statements in which all elements of the list are inspected until either a successful match is found or no elements are left. See [49] for a general treatment of this topic.
The global compilation strategy is as follows. An ASF+SDF specification is parsed and for each function with its equations a new ASF+SDF module is generated containing this function and all corresponding equations. We call this flattening of the specification followed by reshuffling of the equations. All constructor functions defined in one module are kept together. These modules are initially used by the compiler for the actual code generation but also when the specification is recompiled to decide for which functions new code should be generated. This phase is entirely independent of the code generation phase. Each new created module is fed to the compiler to generate the actual C code.
The generation of C code is performed in a number of (small) steps.
We start with the ASFIX representation of each module, essentially
the full parse tree of the module still containing layout, comments,
keywords and the like. After reshuffling the ASF+SDF module, it is
transformed into another intermediate representation that has been
designed to simplify the compilation process. This intermediate
formalism
ASF is a simplified prefix representation of ASF+SDF,
see Figure 6 for the
ASF representation of the
Booleans. Given this
ASF code, a number of transformations is
performed to simplify the actual code generation phase and to increase
the performance of the resulting code, e.g., list matching patterns
containing at most one list variable are transformed into ordinary
term matching. From the resulting transformed
ASF code the C code
is generated. The left-hand sides of the
ASF equations are
transformed into a matching automaton which is directly included in
the C function. The right-hand sides of the equations are directly
translated to function calls. See Figure 7 for an
example of generated code from the Booleans of
Figure 6.
The generated code depends heavily on a general term library (see Section
6.1), which ensures efficient manipulation of terms as well as
maximal sharing subterms. The term library provides the basic
functionality to construct normal forms, functionality to check the
outermost function symbols of terms, and functionality to manipulate the
list data structure. Note that the function calls check_sym
and
make_nf2 in Figure 7 are provided by this library.
The reshuffling of equations already discussed above, is performed by the component Module-DB and the C code generation is performed by the component Compiler (the actual ASF+SDF compiler). In the current implementation of the new ASF+SDF Meta-Environment only the compiler and the parser generator are compiled ASF+SDF specifications. The parser generator is not yet shown in Figure 8. The other components are either implemented in C or Tcl. We stress that some of these are off-the-shelf components.
For each specification we give the number of equations, the number of lines of ASF+SDF specification without comments, the number of lines of generated C code, the time needed by the ASF+SDF compiler to generate C code, and finally the time it took the C compiler (using the -O optimization option) to compile the generated code. We prefer the native cc compiler since it optimizes tail recursion, which is not the case for gcc and this yields a better performance in some cases. Measurements were performed on an ULTRA SPARC-IIi (300 MHz) with 576 Mb of memory. We are currently preparing detailed benchmarks comparing the efficiency of ASF+SDF with that of other, related, formalisms [14].
We now further explain the Risla flattener. Risla is a domain-specific language for specifying financial products developed by Cap Gemini in cooperation with MeesPierson (a Dutch business bank) and CWI and UvA [2,21]. The syntax and semantics of the Risla language have been developed using ASF+SDF. Several years after the initial design, the language was extended with modular constructs yielding the new language Modular Risla. Since there was already a compiler that generated COBOL from the original Risla language, Modular Risla is translated (``flattened'') to original, non-modular, Risla. Both Modular Risla and the flattener have been specified in ASF+SDF using the ASF+SDF Meta-Environment. The result is a Modular Risla programming environment.
Unfortunately, flattening large Modular Risla specifications took much time (about 40 minutes CPU time for an average Modular Risla specification and several hours for large ones) because the ASF+SDF specification defining the flattener is executed by an interpreter in the ASF+SDF Meta-Environment. The new ASF+SDF compiler was then used to compile the Risla flattener to a stand-alone C program. Cap Gemini incorporated this stand-alone C code within the original Risla compiler. The 40 minutes CPU time mentioned above reduce to about 30 seconds when using the generated stand-alone C code.
Before presenting a number of research issues that may interest people working in the field of term rewriting, we give first some further sobering observations on the use of formal techniques in industry,
As already indicated in Section 2, IT industry is wary for new concepts. Some researchers may think that, e.g., correctness proofs will be beneficial for IT industry. This is not true. The return on investment (ROI) is poor: $0.90 after one year $0.80 after two years, $0.70 after three years and $0.60 after four years [36, loc. cit. p. 108]. The ROI of the reusability of high-quality artefacts is excellent: $3.75 after one year $9.15 after two years, $21.75 after three years and $43.75 after four years [36, loc. cit. p. 105]. The list of ROI of selected software technologies contains 143 entries, reuse is number one and correctness proofs are number 136. So if one thinks that we might have an unusual viewpoint on the research issues, namely no termination proofs and no confluence proofs, note that this is confirmed by cost estimations from IT industry (this does not necessarily mean that we do not like proofs, see [27] for an example).
In the following subsections we propose therefore research issues that encourage the amalgamation of several theoretical directions. The goal is to achieve high-level, high-quality artefacts that can be reused in a wide range of applications.
Currently, we have found reasonable answers to most of the above questions in the design and implementation of the Annotated Term Format (ATF), a language-independent prefix format that is supported by efficient implementations in C and Java. However, several issues still need further research. For instance, how do we organize the access of several concurrent tools to a common term database? Initial work on waitfree algorithms to achieve this appears in [31].
LR parsing is based on shift/reduce parsing: symbols from the input stream are either ``shifted'' (i.e., placed on an auxiliary stack) or a ``reduce'' action is performed (i.e., the top symbols pushed on the stack represent one side of a grammar rule and can be replaced by the other side of the rule). For the class of LR-parsable languages, one can always make a unique choice between shifting or reducing. For the larger class of all context-free grammars this is no longer the case and there may shift/reduce conflicts (reading the next symbol or reducing the current stack) or reduce/reduce conflicts (the stack can be reduced according to different grammar rules).
The basic idea of GLR parsing is to fork the parse process at points where several choices are possible. All parses are synchronized on reading the next input symbol and parallel parses are joined whenever possible. In this manner, a parse forest is build that maximizes the sharing between the parallel parses.
A striking similarity exists between GLR parsing and parallel rewriting where a rewrite system may contain critical pairs and parallel reduction takes place for all possible redexes. In principle, the same (parallel) rewriting machinery could be used for both rewriting and GLR parsing.
In the context of the ASF+SDF Meta-Environment the benefits would be substantial:
One of the great challenges of software engineering is to find effective mechanisms for the evolutionary development and integration of software components. The basic premise is to separate computation (the basic operations and calculations to be performed) from coordination (the ways is which computations interact and cooperate). It turns out that this separation improves the flexibility and maintainability of software systems.
One example of this approach is the TOOLBUS coordination architecture, in which a process-oriented calculus is used to describe the cooperation protocol between different tools that may be written in different languages. Essentially, tools perform computation steps and the process script tells which steps are performed when by whom.
If we compare this approach to strategy languages for rewriting there is, again, a striking similarity: the strategy language describes when and where each rewrite rule has to be applied.
We have found that questions regarding the confluence of huge rewrite systems are seldomly posed: the specification writer has a mental model of the problem at hand that provides sufficient guidance for confluent execution of the specification as rewrite system. This is fortunate, since the actual determination of the confluence of these huge specifications is hardly feasible. Similar considerations hold for the termination of the term rewriting systems we encounter.
It is, however, another form of confluence that we have argued for in this paper: the confluence of seemingly disparate areas like parallel rewriting versus GLR parsing, and coordination languages versus strategy languages.
Since GLR parsing and coordination languages have turned out to be crucial for implementing industry-sized applications based on term rewriting, we expect that a further exploration and unification of these areas, as suggested in this paper, might be beneficial for both theory and for selling term rewriting to industry.