Alex Sellink and Chris Verhoef
University of Amsterdam,
Programming Research Group
Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
alex@wins.uva.nl, x@wins.uva.nl
We discuss tools that aid in the development, the assessment and the reengineering of language descriptions. Languages description is understood as the syntactical description of a language here. We develop assessment tools that give an indication as to what is wrong with an existing language description, and give hints towards correction. From a correct and complete language description, it is possible to generate a parser, a manual, and on-line documentation. The parser is geared towards reengineering purposes, but is also used to parse the examples that are contained in the documentation. The reengineered language description is a basic ingredient for a reengineering factory that can manipulate this language. The described tool support can also be used to develop a language standard without syntax errors in the language description and its code examples.
Categories and Subject Description:
D.2.6 [Software Engineering]: Programming Environments--Interactive;
D.2.7 [Software Engineering]: Distribution and Maintenance--Restructuring;
D.3.4. [Processors]: Parsing.
Additional Key Words and Phrases: Reengineering, System renovation, Language description development, Grammar reengineering, Document generation, Computer aided language engineering (CALE), Message Sequence Charts.
Since the emerge of computer languages, the need to describe languages in a precise way became an indispensable part of computer science. In his paper on the syntax and semantics of the proposed international algebraic language, Backus [1] writes: `we shall need some metalinguistic conventions for characterizing various strings of symbols. To begin, we shall need metalinguistic formulae.' Then he introduced using an example what is now widely known as the Backus-Naur Formalism. In virtually all documents that give a precise language description the method of Backus is used: first the syntax description notation is explained using an example accompanied with some conventions, and then the language description itself follows. In this way myriads of dialects of the Backus-Naur Formalism emerged. They are referred to as BNF, or EBNF, for extended BNF, or metasyntax, metalanguage.
Language descriptions serve more than one purpose: they are used as a guide to implement tools such as compilers or they serve as a reference manual for users. We use language descriptions to implement tools that serve the reengineering of those languages. Such grammar descriptions form the basis of our approach towards reengineering. Let us give an idea to make this more concrete. It is possible to generate all kinds of prefab components that are useful in an environment for reengineering. We can generate a native pattern language from a context-free grammar that can be used to recognize code fragments [22]. It is also possible to generate full documentation for the language [6]. A sophisticated parser can be generated from the grammar [14]. A structured editor can be generated from the grammar [18]. In [3] we generate components for software renovation factories. It is also possible to generate complete programming environments from context-free grammars. In order to generate such environments, one needs an environment as well. The ASF+SDF Meta-Environment [16] is such an environment. We use it for the generation of tool factories [3]. SDF stands for Syntax Definition Formalism [13], it can be used to define the syntax of a language. ASF stands for Algebraic Specification Formalism [2], it can be used to describe the semantics of a language. The combination is thus adequate for defining syntax and semantics of languages and the ASF+SDF Meta-Environment is the supporting environment for both formalisms.
It is not a trivial task to construct a grammar for reengineering purposes. First of all, such a grammar should have certain properties that make reengineering easy. Secondly, since reengineering problems do not have the habit to reside in small languages, the development process is time consuming. For instance, many academics and companies have struggled with a language definition for COBOL in order to create a decent parser for reengineering targets. Since in reengineering, the grammar seems to be the variable and the problem the constant, grammars should be modifiable and tool support should be insensitive to such modifications. Therefore, generating everything from a grammar is in our opinion a solution. According to [21] there are two problems with parser-based technology: first the stringent constraints on the input, and second it is problematic to extend existing parsers. We solve this by using modular grammars that are easily modifiable, and we use unification of grammar rules that are not important for reengineering tasks [4]. Despite the use of new technology, it is still not easy to develop a new grammar for reengineering purposes or reengineering old grammars to migrate them to the new technology. Therefore, we developed tools to support development, the assessment and the reengineering for language descriptions.
Let us give an example of current practice in serious language description documentation. The language description document (an ITU standard) of so-called Message Sequence Charts [15], is an MS-Word document. In order to extract the BNF syntax, the PostScript version of the Word document is first converted to ASCII, then using a script called extract.perl [9], the (nonlexical) BNF rules are extracted. Then, fourteen manual corrections are needed (see the comments in the script [9]). Then the BNF rules are fed to another script that generates an HTML document so that the BNF rules can be browsed.
Our approach to develop a standard would be to write a complete grammar using the ASF+SDF Meta-Environment and put the accompanying text in comments in the grammar specification. Then, by using the formatting technology discussed in [6] we generate a LATEX document and produce hard copy. Using technology presented in [12], we can generate the HTML version, and using the ASF+SDF Meta-Environment we can generate a programming environment for the language. As can be seen, we take the opposite route to develop a standard. The advantage of our approach is that the grammar is complete, its syntax is checked and the examples in the document are parsed using the generated parser. So maybe its an idea to use our strategy for producing a language standard.
We think of metasyntax as a domain specific language. It is a language geared towards the definition of the syntax of (programming) languages. We will call such languages syntax definition languages. Any dialect of the Backus-Naur Formalism (BNF) is an example of a syntax definition language. The Syntax Definition Formalism (SDF) [13] is another example of a syntax description language. In contrast with BNF, SDF is not available in a myriad of dialects: it is part of a support environment (the ASF+SDF Meta-Environment) as a means to define syntax.
A document containing a language description can be seen as a program written in a syntax definition language; the text in natural language is the comment. This phenomenon is widely known as literate programming [17]. Seen from this perspective, maybe the most literate programs that one can think of are language description documents, in particular a standard. In our opinion, it would be natural to parse and compile such documents. For, it is a fact of life that programs that are neither parsed, nor compiled have a high risk of containing errors. Indeed, we have experienced that the language descriptions that we have parsed and compiled often contain errors. These errors can be divided into two different classes discussed in the next subsection.
The first step in language description development is, in our opinion, parsing the language description itself. For, a typographical error in a language description now becomes a syntax error during parsing. Although this phase is obvious to us, this is not a common approach (but see [7] where errors in a book on action semantics [20] are detected in a similar way). Many language description documents, including standards that we have parsed contain syntax errors.
Apart from syntax errors in language description documents they also contain semantic errors. Let us first explain what we consider the semantics of a language description document. The semantics of a language description program can be seen as the set of objects that can be recognized by this program. Suppose, for instance, that we have the following BNF program x ::= 'a'. This program can recognize a single a. The BNF program consisting of the rules x ::= 'a' and x ::= x x recognizes a, aa, aaa, etc. A semantic error is recognition of other objects than intended. So if it was our intention to recognize a b there is a semantic error in the BNF program.
A typical situation is that the parser (generated from the language description document) does not or faulty recognize the example programs that are contained in the document. Then either the language description contains an error, or the example. In either case, the problem should be resolved.
We have seen errors where it is obvious from the context what their cause was. But we have also encountered situations where we have no idea. If the methodology described in this paper is used to develop standards for languages, such errors are revealed before the standard is published.
Creation of a parser amounts to defining the grammar of the metasyntax in SDF. The parser is generated on-the-fly by the ASF+SDF Meta-Environment [14]. Let us depict the SDF module containing the grammar of SBNF:
imports Layout
exports
sorts
SBNF-Terminal SBNF-NonTerminal SBNF-Element
SBNF-Elements SBNF-Rule SBNF-Program
lexical syntax
"'" ~[']* "'" -> SBNF-Terminal
[A-Za-z0-9\-]+ -> SBNF-NonTerminal
context-free syntax
SBNF-Terminal -> SBNF-Element
SBNF-NonTerminal -> SBNF-Element
"[" SBNF-Element* "]" -> SBNF-Element
"{" SBNF-Element* "}" -> SBNF-Element
"<" { SBNF-Element* "|" }+ ">" -> SBNF-Element
SBNF-NonTerminal "::=" SBNF-Element* -> SBNF-Rule
SBNF-Rule+ -> SBNF-Program
In the so-called exports section, we describe the complete syntax of SBNF. We declare the used nonterminals in the sorts paragraph. In the lexical syntax paragraph we define terminals and nonterminals of SSL: the first lexical rule says that the terminal quote (') followed by zero or more characters that are not a quote followed by a quote (') is an SBNF-Terminal. This is the first convention of the metasyntax. The second convention is defined in the other rule: a string that consists of one or more upper case letters, lower case letters, digits, or minus signs is an SBNF-NonTerminal. Then we enter the context-free syntax paragraph. We construct from the smallest parts of the grammar the elements that can be present in an SBNF rule. Since an SBNF-NonTerminal and an SBNF-Terminal can occur in an SBNF rule, we inject them into the sort SBNF-Element. Then we implement the next three conventions. If we have zero or more SBNF-Element's and put text brackets around them, they become a new SBNF-Element. The intended interpretation is, of course, the optional part of a construction. In a similar way we implement one or more occurrences using the curly braces. The next rule expresses that a list of one or more lists of zero or more SBNF-Element's separated by the symbol | and surrounded by angle brackets is again an SBNF-Element. Now we can construct an SBNF-Rule: it is a SBNF-NonTerminal followed by the terminal ::= followed by zero or more SBNF-Element's. Finally, one or more SBNF-Rule's forms an SBNF-Program.
Constructing and testing this grammar was not a hard job. Furthermore, removing the syntax errors was also easy. It was maybe one hour work to extract the SBNF rules from the HTML file, to make the grammar, and to remove the errors. Using the approach that we describe here makes it a trivial task to remove all the syntax errors in a language description document.
In order to judge the quality of an existing language description, it is useful to have tools that can give an indication of the quality of a grammar. To give the reader a flavor of the nature of such tools we discuss three of the tools that we have implemented.
language description we reported 107 top sorts and 118
bottom sorts. We note that this situation is not uncommon, after all,
such documents were never parsed before. The high number of bottom sorts
is explained is as follows: since it is not possible in BNF to express
lexical syntax, there is not a single lexical syntax rule in the SBNF
program.
The high number of top sorts is caused by the fact that the rules connecting the language constructs are often missing. A reason for this could be that the authors of the manual focussed on describing the individual language constructs, and not on the overall structure of the language.
In the previous section we discussed some useful tools to detect syntax errors in a language description. As we explained earlier, there can also be semantic errors in the lnguge description. These semantic errors can be revealed when we execute the language description. In practice however, we need to reengineer the language description before we can execute it. There are at least three aspects that play a role in reengineering a language description. We can reengineer the original language description to make it complete, we can modularize it, and we can convert it to other syntax definition languages. The first aspect is obvious: a complete description is the target of a language description. The second aspect (modularization) not only improves the quality of the documentation in case it is generated from the grammar, it also helpful when the grammar is used for reengineering as explained in [5]. The third aspect (conversion) is useful for the semantical issues. If we convert the completed grammar to a syntax description language that is supported by a parser generator tool, we can generate a parser for the language description and then do semantical checks. We can, for instance, test the example code that is present in the manual, or real world code written in the language. In this way semantical errors can be traced. This is also interesting for standardization, since it prevents semantical errors before the standard is published.
In this paper we proposed an approach to develop, assess and reengineer language description documents. The described methods and tools can be used to develop language description documents that are correct in the sense that the grammar of the language does not contain errors in the description, and that the examples in the document can be parsed by a parser that can be generated from the language description. From the language description it is possible to generate typeset documents, or on-line documents (using already existing technology). This is particularly useful for the development of standards. We applied our tools to a real-world example: a proprietary language from Ericsson. Our tools generated useful listings containing the information that is necessary to correct the errors. We also developed a method to test the definition in order to obtain a correct language description. Future work with respect to this proprietary language depends on input by Ericsson. This includes sending us SSL code, a compiler, an emulator, and more documentation.
Since many reengineering activities start with a grammar in good shape, the methods and tools that we discussed in this paper are extremely helpful for us to accomplish a cost-effective approach towards reengineering.