Hello.


element, component, constituent

Rebuttal

METASYNTAX

METACOMPILER

Metacompilers take as input a metasyntax: a parser programming language that includes output transformation operators. The parser is programmed using constituent formulas. A reductive method of syntax analysis starts with a program formula.

The formulas include token recognizing functions. They come at several levels: character class groupings, token formulas, and higher structure formulas. Together they are a reductive analytical grammar whose rules include transformation operators that combine recognized constituents into abstract parse trees.

Thank you Rp for the kind words and complete explanation you gave on the grammar subjects. You seem very knowledgeable on the subjects of grammars and parsers.

I agree

As far as I'm concerned, the discussion is not on whether to cover this technology in Wikipedia; the discussion is on how to cover it and where, and how to avoid describing the same things in different places by different names.-Rp

Avoiding describing the same things in different places is the reason I originally entered into this discussion. This and other grammar, syntax and language topics should relate to each other. The neutral point of view should apply.

Historically I think reductive grammars are of importance because the compiler writing systems developed in the 1960s were documented as using them.

So when you say:

My idea on the relationship between generative and reductive grammars is different from how you (Rp), Arthur Rubin and standard compiler technology textbooks see it.

or:

That I believe Chomsky's generative grammars and the reductive grammars used by the parser generators I have seen are two different kinds of things

and:

That I claim we need something other than generative grammars, we need reductive grammars, which are specially designed for parsing.

I must answer that you have not understood my point. Hell, I didn't understand what made reductive grammars distinctive myself, so it was futile of me to attempt explaining them.

For the record: I did not read about reductive grammars in connection with parser generators. I found the terminology used in Schorre based META "programming language" documents that said the language could be described as a reductive or analytical grammar. Reading that, I took reductive to mean analytical. The reductive grammar definition I found in the McGraw-Hill dictionary of technical terms seemed to agree.

I know these compiler writing systems to be different. They originated in the early 1960s, starting with the development of META II by Dewey Val Schorre. I wrote a compiler-compiler based on them and used it to write a COBOL compiler. I understand how to program in these languages.

The confusion is my fault for trying to explain a class of programming languages I understand as a grammar type I did not fully understand. I do believe the parser programming languages are in part a grammar, because their function is to precisely define programming languages. I also believe they are a reductive grammar because in several papers they are stated to be a reductive grammar. These languages were all developed before the first Dragon Book was published. There was a technology shift: when yacc came with Unix and Unix became the prevalent operating system at universities, academia was on the yacc and lex bandwagon. When I worked for Kontron I was not directly involved with parser code. I primarily worked on code generation and optimization.


In researching META II on the various web sites I found it being talked about as a metacompiler. I fell into thinking of (or describing) them as computer programs or metacompilers. In hindsight that was wrong. Ignoring that they are a programming language and talking only of the compiler program leads to the misunderstanding that I see in the summary and conclusions, where it is expressed that META II is a parser generator.

In programming we merge the compiler and programming language together. That is, we may talk about FORTRAN, COBOL, C++, PASCAL, ADA, MODULA II, etc. as compilers or languages. In general they are more commonly referred to as programming languages. A distinction separating parser generators from compilers is that a parser generator such as yacc does not take as input a program written in yacc, while we say the X compiler takes as input a program written in the X programming language; i.e. a COBOL compiler takes as input a program written in the COBOL programming language. Something that needs correcting is the way META II is commonly being described as a metacompiler or compiler-compiler, when it should be foremost recognized as a programming language. It is a metacompiler, but the innovative breakthrough is the META II language. META II was first coded in the META II language, or a subset of it, and hand translated into machine code. That machine code version was designated META I. The language came first. It is the META II programming language that is of significant innovative importance.

Other META "programming languages" I have found documented as being based on META II: META I, META II, META III, META IV, META V, TREE-META, CWIC, and two predecessors of CWIC written in LISP 2. Several others are referenced in documents of those above. The 1968 TREE-META paper for the XDS 940 lists many META language projects based on META II.

Parser programming language is an apt description of the technology, though these languages are much more than just parser programming languages. They are designed to transform programs written in a programming language into object code. TREE-META, CWIC and SLIC create abstract syntax trees using tree forming operators and pass them to generator functions whose unparse rules disassemble trees into arguments. The term unparse rule, originally used by TREE-META, is an apt description of the tree crawling pattern matching rules used in code generation by TREE-META, CWIC, and SLIC.

In the parser programming language description I try to explain the parser languages, showing how they are different from parser generators. I found PDFs of the Dragon Books and now remember having bought one. I gave it to one of my team at Kontron, who was working on the ADA parser. I had read a bit of it but basically shelved it; the parsing and lexing I found unnecessarily complicated. I had already written several parsers, written in assembly or SLIC.

I am primarily an assembly language programmer. In fact the only high level language I have programmed in for which I haven't written a compiler is C++. I am passionate about these old systems. When it comes to writing a compiler for a programming language, a grammar that defines the valid strings of the language does not cut it. These old systems use a reductive grammar that defines a valid program.

Parser programming languages

The Schorre line of parser programming languages is similar to PEGs or TDPLs. However, META II preceded TDPLs by a decade. After reading Ford's paper, I would say the Schorre parser programming languages have a lot in common with PEGs, except that all Schorre based parser programming languages include programming constructs directing semantic transformations to assembly code or abstract syntax trees. Almost everything that distinguishes PEGs as top down analytical grammars suited for defining programming languages is descriptive of the technology I am talking about. The operational interpretation of PEG expressions is the same.

Each nonterminal in a parsing expression grammar essentially represents a parsing function in a recursive descent parser, and the corresponding parsing expression represents the "code" comprising the function. Each parsing function conceptually takes an input string as its argument, and yields one of the following results:

  • success, in which the function may optionally move forward consuming one or more characters of the input.
  • failure, in which case no input is consumed.
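
As a rough illustration of that operational interpretation, here is a minimal C sketch of one such parsing function. The input handling and names are assumptions for illustration, not any particular system's code:

#include <ctype.h>
#include <stddef.h>

/* Minimal sketch: each nonterminal becomes a function returning
   success (1) or failure (0); on failure no input is consumed. */
static const char *input;    /* source text being parsed */
static size_t pos;           /* current scan position    */

static int digit(void) {
    if (isdigit((unsigned char)input[pos])) { pos++; return 1; }
    return 0;                /* failure: input position unchanged */
}

/* number = digit $digit ; one digit followed by zero or more digits */
static int number(void) {
    if (!digit()) return 0;  /* first test fails, nothing consumed   */
    while (digit())          /* $ (zero-or-more) loop                */
        ;
    return 1;
}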

There are many features equivalent to the listed features given in the Parsing expression grammar topic:

Given any existing parsing expressions e, e1, and e2, a new parsing expression can be constructed using the following PEG or Schorre language operators:
PEG        META II        CWIC                 Operation
e1 e2      e1 e2          e1 e2                e1 followed by e2 (implied AND sequence)
N.A.       e1 / e2        e1 | e2              Non-backtracking ordered choice
e1 / e2    N.A.           e1 \ e2              Backtracking ordered choice
e*         $e             $e                   Zero-or-more sequence operator
e+         e $e           e $e                 One-or-more
e?         e / .EMPTY     e / .EMPTY           Optional e
&e         N.A.           ?e                   And-predicate (look ahead)
!e         N.A.           -e                   Not-predicate (look ahead)
( ... )    ( ... )        ( ... )              Grouping
N.A.       N.A.           $(e1 .BREAK / e2)    Zero-or-more with forced termination
N.A.       N.A.           .FAIL                Forced failure
N.A.       N.A.           .ANY                 Recognize any character
N.A.       N.A.           +[ e ]+              Make list
N.A.       N.A.           :<node>              Push node
N.A.       N.A.           !<number>            Form tree
N.A.       N.A.           g[<args>]            Call functions in other languages
N.A.       .OUT( ... )    N.A.                 Text output
N.A.       .LABEL( ... )  N.A.                 Label output

Backtracking is not a programmed feature of PEGs or of a parser generator's input. Explicitly programmed backtrack points and long failure provide far more efficient backtracking. Backtracking is also different from backup: a failure recognizing a lexeme causes a backup affecting only the input stream, while backtracking applies when recognized elements have to be undone. Machine code generation cannot be undone, so a long failure with no backtrack point in effect causes a compile failure. Backtracking is programmed using the backslash "\" alternative operator. On a failure parsing the left alternative, the input is reset to its initial state and tree transformations and pending dictionary changes are released.

These are recursive descent parser programming languages. A backtrack alternative creates a backtrack stack frame. Backtracks may be nested, each stack frame linked to the previous backtrack stack frame. A backtrack failure loads the stack pointer with the top backtrack frame and returns failure; this is similar to a long jump in C. Backtracking resets the call stack, node stack and parse stack, releasing created objects, and returns to the backtrack alternative. An initial system backtrack frame catches a long failure when no programmed backtrack is in effect and results in a compiler failure.
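
The long-failure mechanism can be pictured with C's setjmp/longjmp. This is only a sketch of the idea, not the actual CWIC or SLIC implementation:

#include <setjmp.h>

/* Hypothetical illustration: a stack of backtrack frames.        */
/* A long failure unwinds to the most recent frame, like longjmp. */
typedef struct backtrack_frame {
    jmp_buf env;                     /* saved call-stack state      */
    struct backtrack_frame *prev;    /* link to enclosing frame     */
    /* saved input position, parse and node stack depths, etc.     */
} backtrack_frame;

static backtrack_frame *backtrack_top;   /* innermost frame in effect */

static void long_fail(void) {
    /* reset to the most recent backtrack point and report failure */
    longjmp(backtrack_top->env, 1);
}

/* usage sketch for a backtracking "\" alternative:
     backtrack_frame f = { .prev = backtrack_top };
     backtrack_top = &f;
     if (setjmp(f.env) == 0) {
         ... attempt the left alternative ...
     } else {
         ... restore input, parse and node stacks; try the right alternative ...
     }
     backtrack_top = f.prev;
*/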

Look ahead is a very powerful feature. It can:

  • test ambiguous sequences and select the correct path.
  • be used to recover from errors by skipping over unrecognizable elements.
  • recognize bounded sequences such as string literals.

Comparing PEGs to Schorre parser programming languages one can see the similarities. CWIC extends the parsing features, adding token rules, a backtracking alternative operator, loop break, forced failure, and calling functions in other languages. With SLIC, separately compiled modules are linked, allowing common language features to be shared across compilers and languages. How is it that these older languages are more advanced than the later TDPLs and PEGs?

SLIC

SLIC is a metacompiler that allows the compiler writer to describe a programming language and the transformation of programs written in that language into machine executable code. SLIC produces fast executable machine code, not tables. SLIC has five distinct functions:

SYNTAX equations transform the input source (characters) into a list/tree structure representation of the statements being compiled.

GENERATOR functions operate on lists or trees produced by syntax equations or other generators to produce PSEUDO machine code sequences.

ISO (In Sequence Optimization) is an optional process that transforms a sequence of PSEUDO instructions, produced by generators, into a new optimized sequence.

PSEUDO instructions produce a sequence of MACHOPs, machine instructions, much like a MACRO in assembly language. A PSEUDO machine is defined to best implement the language being compiled.

MACHOPs are special procedures called from PSEUDO and GENERATOR code. They define an internal assembler instruction or data storage operator (DB, DW, DD...) and its transformation into the binary code of the target machine.

SLIC SYNTAX

SYNTAX equations are rules of a special-purpose non-procedural language, designed to permit the specification of the syntax of a programming language and the transformation of programs written in that language into an intermediate list/tree structure. This tree structure may then be processed by functions written in the GENERATOR language.

Any compiler may be thought of as implementing an idealized pseudo-machine whose order code and memory organization are reflected in the operators and data structures of the language processed by the compiler. The pseudo-machine specified by the SYNTAX language part of the compiler has four major components:

  • An input stream, which contains the source code (characters) to be parsed.
  • A push down accumulator, called the parse stack, which contains the parse tree or elements parsed by the SYNTAX language.
  • An operator stack, which holds items that are later combined with parse stack items to form trees.
  • The symbol table.

A language is composed of elements that exist on at least three hierarchical levels. In the context of natural language, these elements are sentences, words, and characters. In programming languages, these elements are statements or expressions, tokens, and characters. The SYNTAX language has equation forms for recognizing and processing each type of element; it distinguishes them by using a different first operator in the equation.

The language element at the lowest level is the character. Characters are used to form tokens, the members of the next level.

Tokens are formed from a subset of the characters of the language. This subset is called a character class, and its operator, the colon (:), identifies the syntax equation that describes it. The characters that are members of the class are written on the right side of the definition. Each printable character is enclosed in primes. Nonprinting and special characters may be represented as constants. Characters are separated by the bar character (|) OR operator. The equation ends with a semicolon. For example, the equation

digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';

defines a digit as a 0 OR 1 OR 2, etc.

Character classes may be defined in terms of other character classes, as well as in terms of terminal characters. Thus the equation

HEXdigit: digit | 'A' | 'B' | 'C' | 'D' | 'E' | 'F';

defines a hexadecimal digit as an A or B or C or D or E or F, or one of the ten characters identified as a digit. Character class definitions may not reference any syntax equation other than another character class definition.

Basic language elements such as identifiers and numbers are built up from character classes. It is also necessary to save these elements for further processing. These functions are performed in the token making syntax equations, whose operator is the double dot (..). The sequence operator ($) is also commonly employed to specify that zero or more elements are to be recognized. For example, the equation

number .. digit $digit;

defines a number as a digit followed by a sequence of zero or more digits -- i.e., as one or more digits.

Leading skip class characters (blanks or characters that are treated as blanks) are skipped on the input stream until a non-skip character or a digit is encountered.

If the character is not a member of the character class digit, the scan does not advance, the operation terminates, the input stream is positioned back to the point where we entered number, and number returns failure to whatever syntax equation called it. If the character is a digit, it is put into a token buffer and the input scan advances.

Succeeding characters are scanned and placed into the buffer until a non-digit is encountered. The input stream is advanced as many positions as there were digits. The operation will return true even if no digits beyond the first were gathered, as the sequence operator specifies zero or more occurrences of digit. Once the sequence terminates, all characters gathered are formed into a token and placed onto the push down accumulator (the parse stack). The token is also entered into a dictionary so that all subsequent instances of the same token will reference the same dictionary entry.
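
A rough C sketch of what a token equation like number .. digit $digit; does at run time. The buffer handling and names are assumptions for illustration, not SLIC's actual runtime:

#include <ctype.h>
#include <stddef.h>

/* Hypothetical runtime sketch for: number .. digit $digit; */
static int scan_number(const char *src, size_t *posp, char *tokbuf, size_t bufsize) {
    size_t scan = *posp, len = 0;
    while (isspace((unsigned char)src[scan]))   /* skip leading skip-class characters */
        scan++;
    if (!isdigit((unsigned char)src[scan]))     /* first digit test fails:            */
        return 0;                               /* input not advanced, return failure */
    do {                                        /* gather digit $digit into the buffer */
        if (len + 1 < bufsize) tokbuf[len++] = src[scan];
        scan++;
    } while (isdigit((unsigned char)src[scan]));
    tokbuf[len] = '\0';
    /* here the gathered characters would be formed into a token object,
       pushed on the parse stack, and entered into the dictionary */
    *posp = scan;                               /* input advances only on success */
    return 1;
}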

Note that skip class skipping will stop if a character is matched by the rule, allowing skip class characters to be used in a token rule. An example of this: suppose a line starting with an * is to be treated as a comment.

cr: 13;

CommentLine.. cr'*' $-cr;

cr (the line separator), normally a skip class character, would not be skipped here and would terminate skipping.

It may be the case that a symbol is not the desired result. For example, a number should be retained as a number, rather than as a collection of digits. The functions MakeNumber and MakeHex, which perform the conversion, may be called as in the following definitions:

digit: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';

HEXdigit: 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | digit;

decimal.. digit $digit MakeNumber[];

HEXnum.. '0' ('X' | 'x') HEXdigit $HEXdigit MakeHex[];

These conversion routines intercept the characters gathered in the token buffer, convert them, and place the result on top of the parse stack. No dictionary entry is made.

No special function, however, is required in the following case:

alpha: 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' |
       'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' |
       'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' |
       'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z';

alphanum: alpha | digit;

ID.. alpha $alphanum;

Here the resultant token is precisely what is desired. As normal operation has not been interceded by MakeNumber or MakeHex, the token is placed on the parse stack and is entered into the dictionary via an implicit call to the system function MakeSymbol.

You may have observed that the following definition of ID is equivalent to the one above.

ID.. alpha $(alpha | digit);

However, the earlier version results in more efficient code. As a general rule, better code results if an "or" operation is avoided by making its components a character class.

Entities other than character classes may be referenced in token making equations. The programmer must specify whether or not he wishes to retain the characters as part of the token. For example, an identifier might be defined in this manner:

ID.. alpha $(alphanum | '_');

This specifies that, once the initial alpha is recognized, succeeding elements may be either a member of the class alphanum or an "underscore". The alpha and alphanum will be retained if recognized; the underscore will not. It will be recognized and the input advanced, but the resultant token will appear in the dictionary exactly as if the underscore had not been coded. That is, This_Name will be precisely identical to ThisName. The operator to be employed if the character is to be kept is the plus sign (+):

ID.. alpha $(alphanum | +'_');

This is the definition of an identifier in the SLIC languages.

The functions ToUpper and ToLower may be used to convert the token buffer to all upper case or all lower case characters respectively. The comma operator may also be used to output characters or a variable to the token buffer. When the variable contains NULL, no output to the token buffer occurs.

ID.. alpha $(alphanum | '_',UnderBarOption);

When UnderBarOption contains '_', symbols will be gathered into the token buffer with the underbars retained. But when UnderBarOption contains NULL, symbols will have underbars removed when gathered into the token buffer.

In some cases, it is necessary to specify that a character may not appear on the input. The "not" operator (-) is used for this purpose, as demonstrated hereafter in the definition of string. Note that, whether or not the proscribed character is present, the input will not advance. The input advances only when affirmative recognition of characters occurs.

string .. '''' $(-'''' .ANY | '''''' ,'''') '''';

A string is defined as a sequence of characters of arbitrary length enclosed by a pair of primes; e.g., 'ABC'. If the enclosed characters are anything but primes, no conflict occurs. If a prime is to be one of the enclosed characters, two primes are written; they are recognized, and the comma operator appends a single prime to the token buffer. Thus, the word "ISN'T" would be coded 'ISN''T'.
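
A rough C illustration of that doubled-prime convention (an assumed sketch, not SLIC's implementation):

#include <stddef.h>

/* Hypothetical sketch: scan a prime-quoted string, where a doubled
   prime '' in the input stands for one prime character in the token. */
static int scan_string(const char *src, size_t *posp, char *buf, size_t bufsize) {
    size_t pos = *posp, len = 0;
    if (src[pos] != '\'') return 0;                    /* opening prime required       */
    pos++;
    for (;;) {
        if (src[pos] == '\0') return 0;                /* unterminated: fail, no move  */
        if (src[pos] == '\'' && src[pos + 1] == '\'') {
            if (len + 1 < bufsize) buf[len++] = '\'';  /* comma operator appends one prime */
            pos += 2;
        } else if (src[pos] == '\'') {
            pos++;                                     /* closing prime                */
            break;
        } else {
            if (len + 1 < bufsize) buf[len++] = src[pos];  /* -'''' .ANY: keep any non-prime */
            pos++;
        }
    }
    buf[len] = '\0';
    *posp = pos;                                       /* input advances only on success */
    return 1;
}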

The techniques applied to recognition of syntactic entities are well illustrated by the case of arithmetic expressions. The most important factor in recognition of arithmetic expressions is the definition of the operator hierarchy. The example adheres to tradition: on the highest level are exponentiation and unary minus, followed by multiplication and division, followed by addition and subtraction. Within each level the order of operation is from left to right, except in the case of exponentiation. This hierarchy may be overridden by parentheses.

Beginning with the left to right problem, let us look at the definition of an expression. An EXP is a TERM followed by zero or more instances of plus or minus TERM.

EXP = TERM $(('+' | '-') TERM);

This demonstrates how the use of the sequence operator permits the replacement of recursion in BNF by iteration in the syntax language.

TERM = FACTOR $(('*' | '/') FACTOR);

TERM has already been used as something that is added to or subtracted from another TERM. Now it can be seen that a TERM may be composed of multiplication or division of two or more FACTOR's. As an EXP is evaluated multiplication and division must be performed prior to addition and subtraction. Consider the expression.

A + B * C

Which means add A to the product of B and C. This implies the following parenthesization:

(A + (B * C))

FACTOR = '-' ITEM | ITEM ('^' FACTOR | empty);

The SYNTAX language primitive empty means that the element with which it alternates may or may not be present in a legal statement. It follows, from the definition of FACTOR, that the unary minus and exponentiation will take priority over other arithmetic operations. The part of the equation dealing with exponentiation defines a FACTOR as an ITEM raised to a FACTOR, which is an item raised to a factor, etc. That is, until the '^' FACTOR fails and empty is found, the compiler will continue to search for more exponentiation. Thus, the expression.

A**B**C

Would be implicitly parenthesized from right to left as

(A**(B**C))

Note that the infinite iteration problem is not caused by right recursion. The right recursive definition of FACTOR states that ITEM, unless it is preceded by a - (minus), is always followed by another FACTOR or empty. Succeeding calls on FACTOR will recognize characters and so move the input stream until empty provides an escape. Neither an infinite loop nor incomplete recognition is possible. Finally, to complete the definition:

ITEM = VARIABLE | '(' EXP ')' | NUMBER;

VARIABLE = ID ('(' EXP $(',' EXP) ')' | .empty);

An ITEM is either a VARIABLE (simple or with any number of subscripts or arguments), a parenthesized expression, or a number.

The techniques covered to this point are sufficient to permit the writing of a recognizer--that is, a program whose sole function is to check its input for correct syntax. To produce a compiler it is necessary to specify the details of the transformation of the source code into a form comprehensible to GENERATOR functions. This form, similar to the LISP language, is most easily represented in the tree structure format. The programmer's task is to direct the building of this tree by specifying the elements at its branches and the nodes connecting them.

In its simplest form, a tree is a pictorial representation of a functional notation. An operator such as + or * is a node, from which zero or more branches are extended, representing the operands.

Two operators are employed to specify the building of the tree. The colon (:) followed by an identifier directs that a node be pushed on a special operator stack, above any previous nodes. The exclamation point (!) followed by a number directs that that number of items on the parse stack be removed, and connected in the order they were gathered, to the top node (i.e., the node at the top of the operator stack). That node is then removed from the operator stack. This subtree, formed by the node connected to its branches, is then placed on top of the parse stack.

Consider the simple arithmetic expression A - B. It is desired that A and B be placed on the stack and tied to a SUB (for subtract) node. Thus, EXP might be rewritten as

EXP = TERM $('+' TERM :ADD!2 | '-' TERM :SUB!2);

However, there is a simpler way-- that of factoring the equation:

EXP = TERM $(('+':ADD | '-':SUB) TERM !2);

For every iteration, a plus or minus sign produces the appropriate node (via the colon operator). When the second term is recognized, the tree is constructed from the node and the two operands. Note that both sets of parentheses in the syntax equation are necessary.
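
A small C sketch of the node stack and parse stack mechanics behind :ADD and !2. The object layout here is assumed for illustration only:

#include <stdlib.h>

/* Hypothetical object layout: a tree is a node name plus branches. */
typedef struct obj {
    const char *node;       /* node name, NULL for a leaf token     */
    struct obj *branch[2];  /* branches of a binary tree            */
    const char *leaf;       /* token text when this is a leaf       */
} obj;

static const char *node_stack[64]; static int node_sp;   /* :<node> pushes here */
static obj *parse_stack[64];       static int parse_sp;  /* tokens and trees    */

static void push_node(const char *name) {        /* the : operator  */
    node_stack[node_sp++] = name;
}

static void form_tree2(void) {                   /* the !2 operator */
    obj *t = malloc(sizeof *t);
    t->leaf = NULL;
    t->node = node_stack[--node_sp];             /* pop the top node              */
    t->branch[1] = parse_stack[--parse_sp];      /* attach branches in parse order */
    t->branch[0] = parse_stack[--parse_sp];
    parse_stack[parse_sp++] = t;                 /* the tree replaces the operands */
}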

The remainder of the expression-handling syntax equations will now appear as follows:

TERM = FACTOR $(('*':MPY | '/':DIV) FACTOR !2);

FACTOR = ('-' ITEM:UNARY!1 | ITEM)('**' FACTOR:EXPON!2 | empty);

ITEM = VARIABLE | '(' EXP ')' | NUMBER;

ITEM remains as previously coded. VARIABLE presents a problem because the number of subscripts is uncertain, as may be seen in its definition, since the exclamation point operator specifies a precise number of entities to be attached to a node. Constructions such as parameter lists and subscripts, which contain a variable number of elements, are handled by grouping the elements into a list structure and attaching the list as a single branch to a node.

The following definition of VARIABLE employs the operators which perform this task:

VARIABLE = ID ('(' <EXP $(',' EXP)> ')' :SUBSC!2 | empty);

The use of the angle brackets (< and >) specifies that any expressions recognized within them are to be formed into a first in, first out list and placed on the parse stack as a single entity. The exclamation point operator attaches two elements to the SUBSC node: the identifier and the list of subscripts. The tree for the subscript expression a(i,j,k) might appear as follows:

SUBSC[a,[i,j,k]]

The same technique may be used to define a procedure or function call:

PROC_CALL = ID '(' <(EXP $(',' EXP) | empty)> ')':CALL !2;

An alternate coding for the same definition might appear as follows:

PROC_CALL = ID '(' (<EXP $(',' EXP)> | <>) ')':CALL !2;


The second version of PROC_CALL employs a pair of angle brackets with nothing between them; this is equivalent to LOAD[.NIL]. Both place the empty list on the parse stack.

Observe that angle brackets must be completely nested within parentheses and vice versa. It must also be emphasized that the parse stack must contain at least as many elements as the number specified with the exclamation point operator.

It is now possible to parse any arithmetic expression containing these operators. Consider the expression A + B - C * D(j,2). These are the steps:

A and B are each TERM's (because they are FACTOR's, ITEM's, VARIABLE's, and ID's), so their addition is all or part of an expression. They are placed on the parse stack, an ADD node is placed on the operator stack, and the tree that results from attaching the node to the operands replaces the operands on the stack. The iteration in EXP continues. The minus is recognized and a SUB node is placed on the operator stack. TERM is called, and it recognizes C * D as a multiplication of two FACTOR's. The node building operation creates an MPY node, which is attached to C and D, placing the resulting tree onto the parse stack. This tree is now the first element on the stack; the second is the tree headed by the ADD node created previously. The !2 in EXP now attaches the top operator stack entry (the SUB node) to the two top parse stack items (the trees created previously). A further iteration fails to turn up any more TERM's, so the parse stack finally contains a single entity, the parse tree: SUB[ADD[A,B],MPY[C,SUBSC[D,[j,2]]]]

There is a lot more:

  • THE PROBLEM OF AMBIGUITY
  • BACKUP
  • ERROR REPORTING AND RECOVERY
  • CALLING GENERATOR FUNCTIONS

If you have ever used BNF-like syntax directed compilers you might have noticed that, unlike them, SLIC will parse context sensitive languages like COBOL.

SLIC GENERATOR

The GENERATOR language is used to write programs which process list/tree structures. These structures may have been produced by SYNTAX rules or by other GENERATOR functions.

The GENERATOR language has many special features oriented toward the coding of the semantic phase of a language processor. However, many of the language constructs are general list and symbol manipulation functions, allowing SLIC to be used for many AI applications. If the GENERATOR language program is part of a compiler, the output is PSEUDO instruction lists.

A program written in the GENERATOR language is made up of functions and/or procedures called generators. The syntax of a generator may be described as follows:

GENERATOR = ID TRANSFORM $TRANSFORM;

A generator is a set of transforms that operate on the input. Each transform is made up of an input pattern and an action:

TRANSFORM = PATTERN '=' ACTION;

PATTERN = '[' (PARAMETER_DESCRIPTION / .EMPTY) ']';

Each transform describes a pattern and action to be performed for a specific input. For example:

FACTORIAL[0] = .RETURN 1;

[N] = .RETURN FACTORIAL[N-1] * N;

FACTORIAL is a generator with one input parameter. If the input parameter matches the first description -- namely, zero -- a value of one is returned. Otherwise, the input parameter is given the name N, and the value returned will be the value of the expression FACTORIAL[N-1] * N.
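
For comparison only, the same two transforms written as an ordinary C function:

/* The FACTORIAL generator's two transforms as plain C:
   the [0] pattern is the base case, the [N] pattern the general case. */
static long factorial(long n) {
    if (n == 0) return 1;           /* FACTORIAL[0] = .RETURN 1;          */
    return factorial(n - 1) * n;    /* [N] = .RETURN FACTORIAL[N-1] * N;  */
}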

An example that evaluates numerical expressions:

digit: '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | '0';

num .. digit $digit MakeNum[];

exp = term $(('+':add | '-':sub) term !2) print[*1];

term = factor $(('*':mul | '/':div) factor !2);

print[xpr] = output[eval[xpr], con];

eval [add[x,y]] = return eval(x) + eval(y);

[sub[x,y]] = return eval(x) - eval(y);

[mul[eval(x),eval(y)]] = return x * y;

[div[eval(x),eval(y)]] = return x / y;

[isnumeric(x)] = return x;

Note in the above that eval is called from the action in the case of the add and sub node transforms. The return values from eval are then added or subtracted. In the case of mul and div, eval is called in the pattern of their transforms. This is a special function call: when a function call is made in a pattern, the unnamed pattern element is passed to the function and the arguments of the function are the returned values of the function. In the above case, eval(x) of the mul and div transforms takes the unnamed left branch of the mul or div tree as its single argument and returns x as the result. Likewise eval(y) does the same for the right branch of the tree.
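
A loose C analogue of eval walking such a tree may make the recursion clearer. The struct layout here is an assumption for illustration, not CWIC's or SLIC's object representation:

#include <string.h>

/* Hypothetical tree node: an operator name with two branches,
   or a numeric leaf when node is NULL. */
typedef struct expr {
    const char *node;            /* "add", "sub", "mul", "div", or NULL for a leaf */
    struct expr *left, *right;
    long value;                  /* used only by leaves                            */
} expr;

static long eval(const expr *t) {
    if (t->node == NULL) return t->value;        /* [isnumeric(x)] = return x;    */
    long x = eval(t->left);                      /* evaluate the left branch      */
    long y = eval(t->right);                     /* evaluate the right branch     */
    if (strcmp(t->node, "add") == 0) return x + y;
    if (strcmp(t->node, "sub") == 0) return x - y;
    if (strcmp(t->node, "mul") == 0) return x * y;
    return x / y;                                /* div                           */
}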

A programming language may be described in terms of its basic data types and the operators that may be performed on them. These operators and data types are combined to form expressions, which direct the performance of operations and return as a value the result of those operations. In the GENERATOR language, these expressions may be composed of operations involving a variety of data types.

Some of these data types, such as integer, Boolean and string, are familiar from more conventional programming languages and need little explanation. Some, such as gensym and list, are less common, and it would be well to discuss them.

Tokens were encountered in the discussion of the SYNTAX language and defined as the "words" of the source language being compiled. In the GENERATOR language, this class of data is divided into three subsets: numbers, symbols, and strings. A symbol is that token which is created by the MakeSymbol function. Strings are a data type like symbol, consisting of a collection of characters; however, unlike symbol they have no dictionary entry and cannot have attributes. String and symbol resemble what in conventional programming languages is called a "string" or a "literal". In the GENERATOR language, however, a symbol is more than a collection of characters; it is an entity which may have attributes assigned to it.

The gensym is a data type essential to a compiler that is used to describe the processing of a language. It is a symbol generated by a compiler, a unique name that the compiler builder employs to provide a branch point or data storage location for which no name exists in the source program being compiled.

In a real sense each of these data types--numeric, Boolean, symbol, gensym, or string--is a subtype of the single class of data treated in the GENERATOR language. This class of data is the atom, so called because it is the smallest meaningful unit of information. Atoms may be either numeric or non-numeric, and all atoms may (in the appropriate context) have a Boolean meaning. Atoms may be combined into structures of arbitrary complexity, called lists (or trees).

An atom may appear in a variety of guises with a variety of characteristics. A numeric atom, for example, is the only type of atom upon which arithmetic may be performed. Of the nonnumeric atoms, only symbols and gensyms may have property lists. These lists contain attributes assigned to the atom by the programmer and are comparable to a compiler's or assembler's symbol table or dictionary of the identifiers used in a source program. A dictionary might contain such information about the items entered into it as their print name, memory address, data type, size, etc.

It must be emphasized that every single item of data in the GENERATOR language is an atom. The Boolean constant .TRUE is an atom. The constant .NIL, which is used to represent the Boolean value .FALSE, is an atom, even though it also represents the list of zero elements (the empty list) and is thus an exception to the normal rule that a list is not an atom (and vice versa). Anything whose value is not .NIL has the value .TRUE.

A GENERATOR language variable may reference or contain any type of atom, tree or list. Variables do not have any specific type associated with them, and nothing about a variable suggests the nature of what it contains. It may contain a number at one point in time, then the value .TRUE, then a string, then another number, etc. The GENERATOR language implements operators and functions that permit the programmer to interrogate a variable as to its nature at a specific time.

The classic assembler or compiler builds a symbol table to associate the various symbols encountered in processing a source program with the value of the location counter at the time the label is defined.

The symbol table is a simple function that maps names into addresses, generally by a table lookup operation. Since there is only one value associated with each symbol, there is no need to specifically name the attributes LOCATION or to identify the contents of the location counter as the value of each symbol's LOCATION attribute, since neither can have any other meaning.

In more sophisticated language processors, such as those produced with SLIC, many attributes, as well as values for these attributes, become associated with each symbol of the source program. For example, X might have the attribute LOCATION with a value of 8000, the attribute CLASS with the value VARIABLE, and the attribute TYPE with the value REAL.

Such a collection of attribute-value pairs associated with a symbol is called a property list. The collection of symbols encountered during operation of a compiler, each with its associated property list, is called the dictionary.

In order to allow a symbol to have nested scope there is a symbol table stack. A new symbol table instance is created using PushSymbols() and released using PopSymbols(). These functions return the current level of the symbol table nesting. ScopeLimit(n) is used to restrict symbols to scope n and above. These functions are used to force MakeSymbol to create symbols in the current scope. If it is not apparent why one would use these functions, you should be browsing somewhere else.
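
A rough C sketch of a dictionary with property lists and a scope stack. The function names only mirror PushSymbols, PopSymbols and MakeSymbol; the data layout is an assumption:

#include <stdlib.h>
#include <string.h>

/* Hypothetical dictionary entry: a symbol with a property list of
   attribute-value pairs (LOCATION, CLASS, TYPE, ...). */
typedef struct property { const char *attr; const char *value; struct property *next; } property;
typedef struct symbol   { const char *name; property *plist; struct symbol *next; } symbol;

static symbol *scope_stack[32];   /* one symbol list per nesting level */
static int     scope_top;         /* current scope level               */

static int push_symbols(void) { scope_stack[++scope_top] = NULL; return scope_top; }
static int pop_symbols(void)  { scope_stack[scope_top--] = NULL; return scope_top; }  /* released entries leaked in this sketch */

/* Look the symbol up from the innermost scope outward;
   create it in the current scope if it does not exist. */
static symbol *make_symbol(const char *name) {
    for (int level = scope_top; level >= 0; level--)
        for (symbol *s = scope_stack[level]; s != NULL; s = s->next)
            if (strcmp(s->name, name) == 0) return s;
    symbol *s = malloc(sizeof *s);
    s->name = name; s->plist = NULL;
    s->next = scope_stack[scope_top];
    scope_stack[scope_top] = s;
    return s;
}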

SLIC MACHOP

The MACHOP language is used to define the instruction set of the target processor, which may be used to translate the PSEUDO machine code into the target machine code. The function of a MACHOP is to define the binary output to the object file and the format of a pseudo assembly-listing file. A MACHOP operates in a bit addressable memory space, allowing any size of addressable unit (8 bits, 16 bits, 24 bits, 36 bits, etc.). In fact different sizes could be mixed; some processors have different size data and instruction storage. A MACHOP outputs bit fields. These fields may be combined for display in a pseudo assembly listing.

MACHOP syntax definition

The following MACHOP defines the output for the IPX86 family processor MOV instruction:

.MACHOP MOV dest[ixd],source[ixs] -
  .MORG 8: H(16): $/8;
  if isnumber(source) then
  { if isreg(dest) then
    { +H(4): 0b1011;
       (1): if size:(dest) = 8 then 0 else 1;
       (3): dest;}
    else
    { if !frame:(dest) and segment:(dest) != curSegment then
        +H(8): overide(dest)
      +H(7): 0b1100011;
       (1): if size:(dest) = 8 then 0 else 1;
      +H(2): mod(dest);
       (3): 0;
       (3): R_M(dest);
      +H(szoff): offset:(dest/8);}
    +H(size:(dest)): source}
  else if isreg(source) && isreg(dest) then
  { +H(7): 0b1000100;
     (1): if size:(dest) = 8 then 0 else 1;
    +H(2): 0b11;
     (3): source;
     (3): dest;}
  else if isreg(source) then
  { +H(7): 0b1000100;
     (1): if size:(dest) = 8 then 0 else 1;
    +H(2): mod(dest);
     (3): source;
     (3): R_M(dest);}
  else if isreg(dest) then
  { +H(7): 0b1000101;
     (1): if size:(source) = 8 then 0 else 1;
    +H(2): mod(source);
     (3): source;
     (3): R_M(source);}
  else if issegreg(dest) then
  { +H(8): 0b10001110;
     (2): mod(source);
     (3): issegreg(dest);
     (3): R_M(source);
    if !isreg(source);
      +H(szoff): offset:(source/8);}
  else
  { +H(8): 0b10001100;
     (2): mod(dest);
     (3): issegreg(source);
     (3): R_M(dest);
    if !isreg(source);
      +H(szoff): offset:(dest/8);}

The following uses the vectored opcode form to define many opcodes with one MACHOP. In the following, #OP contains the value associated with the vectored entry definitions following the body of the MACHOP. The following defines the output for the parameterless single byte instructions of the IPX86 family of processors:

.MACHOP #OP -
  .MORG 8: H(16): $/8;
  +H(8): OP;


#AAA   0b00110111;   #AAS   0b00111111;   #CBW   0b10011000;   #CLC   0b11111000;
#CLD   0b11111100;   #CLI   0b11111010;   #CMC   0b11110101;   #CWD   0b10011001;
#DAA   0b00100111;   #DAS   0b00101111;   #HLT   0b11110100;   #INT3  0b11001100;
#INTO  0b11001110;   #IRET  0b11001111;   #LAHF  0b10011111;   #LEAVE 0b11001001;
#NOP   0b10010000;   #POPA  0b01100001;   #POPF  0b10011101;   #PUSHA 0b01100000;
#PUSHF 0b10011100;   #SAHF  0b10011110;   #STC   0b11111001;   #STD   0b11111101;
#STI   0b11111011;   #XLAT  0b11010111;
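
The bit-field output a MACHOP describes can be pictured in C roughly as follows. This is a sketch of a bit-addressable output buffer, not SLIC's emitter:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical bit-addressable object code buffer: emit(n, v) appends
   the low n bits of v, the way a MACHOP field like +H(8): OP does. */
static uint8_t objcode[1 << 16];
static size_t  bitpos;                 /* plant point, in bits */

static void emit(unsigned nbits, uint32_t value) {
    for (unsigned i = nbits; i-- > 0; ) {            /* most significant bit first */
        unsigned bit = (value >> i) & 1u;
        objcode[bitpos >> 3] |= (uint8_t)(bit << (7 - (bitpos & 7)));
        bitpos++;
    }
}

static void morg(unsigned modulo) {    /* .MORG: align the bit address */
    bitpos = (bitpos + modulo - 1) / modulo * modulo;
}

/* e.g. the single-byte #NOP 0b10010000 entry above would be: morg(8); emit(8, 0x90); */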

phrase structure/constituency grammar

META II and languages based on it came out of early work on syntax directed compilers done by members of a Los Angeles ACM SIGPLAN special interest group on syntax directed compilers. They are compiler writing languages of which a part is a parser programming language. The transformation aspects need explaining as they depend on the reductive syntactic analysis methodology.

In these languages, the parser is programmed using test functions resembling phrase structure/constituency grammar rules.

<constituent_name> = <test conditional expression>;

A rule assigns a unique name to a constituent (test) function defined using a conditional expression. That function may be called by referencing its name to recognize the constituent.

These are lexerless parser programming languages in which the terms lexeme and token are a bit different than described in modern compiler text books.[1] A lexeme is recognized by a quoted string literal or a token rule. All tokens are lexemes but not all lexemes are tokens. A quoted string literal is a constant lexeme test used to recognize a specific sequence of characters in the immediate input stream. The recognized characters are not normally kept. A token rule is used to recognize a token lexeme, a variable character sequence that is kept. Where it gets confusing is that a string token rule recognizes a string token lexeme. A token is a variable lexeme; a quoted string is a constant lexeme. Constant lexemes may be significant to the input program's operation, but are usually handled by programmed transformations in the parser language. A string recognized by a string token rule is a variable character sequence that is a constant of the input source program. A character sequence recognized by a quoted string literal is a constant of the parser language. (That's about as clear as mud. Hopefully it will be made clearer by examples.) Early Schorre languages used built in token recognizers .ID, .NUM and .STRING. Later advanced parser programming languages have token rules.

In the early Schorre META parser programming languages, a token rule saved the last recognized character sequence, which could later be output as part of an assembly instruction. Later generations of Schorre META parser programming languages, having LISP based generator sub-languages, create typed objects: symbols are cataloged in a symbol dictionary, numeric and string objects are created by calling supplied library functions, and recognized tokens are pushed on a parse stack.

These parser languages are goal oriented. The syntactic analysis starts with a main goal that is the top driving rule. That rule defines the constituent make up of a valid program module. The recognizer achieves its main goal by looking for these constituent goals, from left to right and down, as individual subgoals. These subgoals are themselves defined in terms of other constituents, and so the process of recognizing a complete program module becomes a matter of looking for progressively smaller entities, right down to the level at which lexemes are recognized. This top-down syntax analysis method, commonly known as recursive descent, has advantages and disadvantages. The advanced LISP-enhanced language features of CWIC and SLIC overcome many if not all of the disadvantages associated with recursive descent parsers.

A generator function may be called anywhere a syntax rule may be called. Generator functions return success or failure status, so they affect parsing the same as calling a syntax rule. The first input to a SLIC produced compiler is the command line initiating it. It parses its initiating command line, calling generator functions to set up input, output, and compile options. Some options were built in to the SLIC system. Program listing and debug options are part of the compiler writing system; such options, if used, had to be programmed into the command line analysis. Supplied library functions handled built-in option setting.


Though they appear to be declarative programming languages, programming in them is more like programming in an imperative language. One writes a rule in imperative mode, knowing the rules are test functions that direct the syntactic analysis of an input character stream. A test is a boolean type, or an analysis action that returns a test (status) result of success or failure. A quoted string literal is a conditional construct used to recognize a specific sequence of characters in the immediate input stream. We write syntactic analysis test equations that are conditional expressions specifying language constituent tests, which are actually test functions. We are in an analysis mode of thinking, writing a program to analyze textual input. We use constituent names and quoted string literals to express constituent sequences. For example, suppose we wish to parse a simple algebraic expression, expressing the algebraic hierarchy using algebraic terminology: expression, term, and factor. In algebra an expression is a sequence of terms separated by the sum or difference operators + and -. We start with the expression rule:

expression = term (('+'|'-') term)*;

The parsing rule expression is a test function consisting of two test sub-expressions. The first test, term, references another rule. term may return failure, resulting in expression returning failure. If term is successful the second test, a grouped Kleene star (zero or more loop) (('+'|'-') term)*, is attempted. There is an implied AND connective between sequential tests. All must be successful for the sequence to be successful (true). Separating alternative sequences, the alternative operators \, /, | express the boolean OR condition. Alternatives are attempted in the order they are coded. The Kleene star loop applies to a grouped sequence of two tests

('+'|'-') and term.

The grouped loop sequence is used to recognize a + or - operator that is followed by a term. The grouped alternative ('+'|'-') tests are attempted in order, first attempting to recognize the string '+'. On success the grouped test succeeds. On failure the alternative '-' is attempted. On failure of the '-' the grouped loop's first test fails and the Kleene star zero or more loop terminates successfully. If the test ('+'|'-') succeeds the term test function is called. On success the loop repeats. The term test returning failure, and not being the first test of a sequence, is indicative of the sequence failing. Recovery after having successfully completed one or more tests of a sequence requires backtracking. Languages not having backtracking aborted the compile. The advanced languages used operators setting backtrack points, allowing recovery from such a state. A backtrack or long failure loads the call stack with the top backtrack stack frame and returns failure.

Above I used the more familiar Kleene star for simplicity of discussion. These languages use a preceding $ zero or more loop operator:

expression = term $(('+'|'-') term);

I know when I write term and the quoted '+' and '-' strings in expression they are test functions or actions that succeed or fail. I know a non-backtracking alternative will catch a failure recognizing a lexeme. On entry a lexeme test saves the input state. On failure the input state is restored before returning. I use the term backup as only the input stream state is involved.


In order to produce code we need to transform the recognized entities into a manipulable or executable form. META II and other early Schorre metasyntax programming languages used output constructs to write stack machine assembly code lines to an intermediate file: META II uses .OUT(...) and .LABEL(...).

-Algebraic expression coded in META II-

EXPR = TERM $('+' TERM .OUT('ADD') /'-' TERM .OUT('SUB'));
TERM = FACTOR $('*' FACTOR .OUT('MPY')/'/' FACTOR .OUT('DIV'));
FACTOR = '(' EXPR ')'/.ID .OUT('LD ' *)/.NUM .OUT('LDL ' *);

In the .OUT intermediate file writer operation, * references the most recently recognized token. .OUT(...) writes instructions beginning in the opcode field. In FACTOR, when .ID recognizes an identifier, .OUT('LD ' *) outputs a

LD <id string just recognized>

LDL (load literal) is used when a number is recognized. The code generated is for a stack machine: the load instructions push their operand onto the stack and arithmetic operates on the top two stack entries. The code generated is a reverse Polish sequence.

a+3*b-2
generates:
Label   Operation   Operand     // Stack
        LD          a           a
        LDL         3           3,a
        LD          b           b,3,a
        MPY                     3×b,a
        ADD                     a+3×b
        LDL         2           2,a+3×b
        SUB                     a+3×b-2

.LABEL is used to write labels beginning in column one. Two generated symbols may be created local to a rule's execution. On the first reference of *1 or *2 within a rule a generated symbol of the form A01, A02 ... B01 ... etc. is created, each symbol being unique to the rule's invocation. Using *1 or *2 in a .OUT or .LABEL outputs the generated symbol string. .LABEL(*) outputs the most recently matched token.

It is important to note that these lexerless parsing languages used special rules and quoted strings to recognize lexemes.

TREE-META separated code generation from syntax analysis, creating an abstract syntax tree using :<node>[<number>].

-Algebraic expression coded in TREE-META-

EXPR = TERM $('+' TERM:ADD[2] /'-' TERM:SUB[2]);
TERM = FACTOR $('*' FACTOR:MPY[2] /'/' FACTOR:DIV[2]);
FACTOR = '(' EXPR ')' / .ID / .NUM;

The <node>[<number>] tree forming operation creates a tree having <number> branches.

  a+3*b-2

          SUB
         /   \
       ADD    2
      /   \
     a    MPY
         /   \
        3     b

I believe TREE-META had a gen-sym strategy similar to META II. Of note is that the tree building operation combines node and branch specification. I do not know the details of how parsed entities or nodes and trees are implemented in TREE-META. The intermediate tree generation allows a more general code generation strategy, but the tree building operation disallowed factoring node and branch directives. There have been different TREE-META implementations, the oldest described in a 1967 paper. A 1974 paper claimed to have both a TREE-META compiler and a parser generator for a reduced subset of the parser language, and describes backtracking implemented using <- preceding the first alternative of an alternative sequence.

CWIC and SLIC, whose generator functions are LISP 2 dialects, have dynamic memory objects. Objects are typed and may be tested as to their type. Push down node and parse stacks are used to build syntax trees. A tree is a list whose first element is a node object. Node objects are created and pushed on the node stack using :<node name string>. !<number> creates a tree, popping the top node stack entry and the top <number> of parse stack entries; the tree is pushed on the parse stack. Tokens are pushed on the parse stack as they are recognized. A string object may be created and pushed on the parse stack using the ,<string literal> sequence. A +<string literal> sequence is used to recognize a string literal, creating a string object and pushing it on the parse stack if recognized. CWIC added token and character class rules using consistent parsing language syntax. The character class, being the simplest, is designated by the : define operator:

CLASS = ID ':' CHAR $('|' CHAR) ';';
CHAR = ''' DCHR ''' | integer;
DCHR: 'A'|'B'|'C'|'D'|'E'|'F'|
  ''all displayable characters''

Character classes define literal character recognizers. They can be thought of as a literal define matching any character listed. A class name used in a token rule recognizes any character in the class, which becomes part of the token. Used in a syntax rule they work like a quoted string literal test, not normally being kept. Token rules are designated by the '..' rule operator.

-Algebraic expression as might be coded in CWIC or SLIC-

EXPR = TERM $(('+':ADD|'-':SUB)TERM!2);
TERM = FACTOR $(('*':MPY|'/':DIV)FACTOR!2);
FACTOR = '(' EXPR ')' | ID | INT;

DGT: '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
UPR: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
     'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';
LWR: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
     'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';

LET: UPR|LWR;

ALPHA_NUM: LET|DGT;

ID .. LET $(ALPHA_NUM|+'_');
INT .. DGT $DGT MAKENUM[];

I know when I write the EXPR equation that TERM is a test function and expect it to return a term on the parse stack. The parser is usually written top down. When syntax or token rules are referenced it is usually the case that a tree is left on the parse stack. It is expected that EXPR, TERM, and FACTOR fully recognize their constituent and transform it into a tree that is left on the parse stack. The double dot .. token rules create token objects. Intercede functions like MAKENUM[] in the INT rule intercede the default operation of cataloging a symbol. MAKENUM creates an integer numeric object that is pushed on the parse stack. When no intercede function is used, as in the ID rule, the token is assumed to be a symbol. Symbols are cataloged, locating the symbol in the dictionary or creating a dictionary entry if it does not exist. In any case a symbol object is created and pushed on the parse stack.

LET, DGT, and ALPHA_NUM are character class rules. In SLIC character class rules create a character class table that is indexed by a character's numeric code. The class name gets defined as a bit mask that is tested (logical AND operation) against the indexed location. A non-zero result indicates membership in the class. A pre-defined skip_class matches non-printing and white space character codes. In SLIC you may override the default skip_class by defining a skip_class character class. White space is automatically handled. Token rules skip leading skip_class characters.

In SLIC a token rule tries repeatedly to make a first character match; failing the first character match, it tests for a skip_class match. On matching a skip_class character the input is advanced and the first match test repeated. Once a first match is made, skip_class skipping stops and following characters continue being matched against the rule. In CWIC skip_class characters are advanced over until a non-skip_class character terminates skipping. The token rule then proceeds matching characters. CWIC tokens may not start with a skip_class character. That was originally done in SLIC so symbols could be constrained to start in column 1 of a line. A line_break test was later implemented.
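
In C terms, the character class table described above might look like this; the masks and names are invented for illustration:

#include <stdint.h>

/* Hypothetical class table: one byte per character code, one bit per class. */
enum { DGT_MASK = 1 << 0, LET_MASK = 1 << 1, SKIP_MASK = 1 << 2 };

static uint8_t class_table[256];

static void init_classes(void) {
    for (int c = '0'; c <= '9'; c++) class_table[c] |= DGT_MASK;
    for (int c = 'A'; c <= 'Z'; c++) class_table[c] |= LET_MASK;
    for (int c = 'a'; c <= 'z'; c++) class_table[c] |= LET_MASK;
    class_table[' ']  |= SKIP_MASK;  class_table['\t'] |= SKIP_MASK;
    class_table['\n'] |= SKIP_MASK;  class_table['\r'] |= SKIP_MASK;
}

/* Membership is a single indexed AND: non-zero means the character is in the class. */
static int in_class(unsigned char c, uint8_t mask) { return (class_table[c] & mask) != 0; }

/* e.g. ALPHA_NUM: LET|DGT; becomes in_class(c, LET_MASK | DGT_MASK) */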

Understanding the stack manipulation, specifying parsing order, etc. is imperative mode programming.

The CWIC compiler writing system has three separate languages: the SYNTAX language, the GENERATOR language, and MOL-360, each having a separate compiler. The MOL (Machine Oriented Language) compiler is a separate development supplied with CWIC for writing low level compiler support code.

SLIC

SLIC is a single compiler having five sublanguages. It has basically the same SYNTAX and GENERATOR languages as CWIC. The code GENERATOR languages are LISP 2 based. SLIC extended the target machine code generation capabilities. Where CWIC's generator language used an additive method of building machine code instructions in byte addressable memory sections, SLIC used symbolic macro-like PSEUDO instructions, appending them to section lists. The term plant (planting or planted) is used to describe code generation. CWIC and SLIC plant code to named sections. Sections allow separating generated code by function, i.e. separate code and data sections. In CWIC, BYTE (8 bit) code is planted into a section's byte addressable memory; a memory block associated with the section holds generated code. Angle brackets < > surround the plant expression. BYTE code planted by CWIC is an expression that can be an 8, 16, or 32 bit value. The byte address plant point is auto incremented over the value planted. An IBM 360 add register-register instruction:

<AR + (x*16) + y;>

generates a 16 bit, 2 byte instruction, assuming x and y are register codes. There is no checking that the plant expression produces valid code.

SLIC also plants code to a named section using < ... > around the code. However, the code is in machine independent symbolic PSEUDO instruction form.

<ADD x,y;>

Planting appends PSEUDO instructions to a list associated with a section. The PSEUDO entry contains the PSEUDO's named entry and arguments. A section's code is output to the object file using a .FLUSH <section name>; statement. CWIC simply wrote the section's memory block. When SLIC flushes a section the PSEUDO procedures are called in order and deleted. PSEUDO procedures call MACHOP procedures to output code to the object file.

A MACHOP is a function programmed in the MACHOP sub-language. MACHOPs define machine instructions as a sequence of bit fields making up a target processor's instruction. Conditional expressions select field sequences. Vectored entry by opcode mnemonic allows a MACHOP to define a family of similarly formatted instructions. Overloaded operands also allow operand selection of appropriate instruction formats. MACHOPs define a target machine's assembly language: its instruction formats and their parameters. When executed they produce a sequence of bit fields making up an instruction or datum. They define data allocation as well as instructions. MACHOPs can be thought of as outputting bit sequences to a bit addressable memory. ".MORG" allows alignment to a modulo boundary: .MORG 8 ... aligns the section's bit address to an 8 bit "byte" boundary; .MORG 36 ... aligns the bit address to a 36 bit "word" boundary. The plant address is kept internally as a bit address, modulo arithmetic being used to calculate an addressable unit and the offset within the addressable unit. MACHOPs are meant to be able to define any target machine code instruction set architecture.

A machine code assembly listing is a standard SLIC produced compiler option. MACHOPs also define the instruction's numeric print formatting. A program listing may optionally include the generated code in assembly listing format. Generated code in the listing appears at flush points.
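
A C sketch of the plant/flush idea described above (planting queues a call, flushing runs the queue in order); all names here are invented:

#include <stdlib.h>

/* Hypothetical section: a queue of planted PSEUDO calls with their arguments. */
typedef struct planted {
    void (*pseudo)(int, int);   /* the PSEUDO procedure */
    int a, b;                   /* its arguments        */
    struct planted *next;
} planted;

typedef struct { planted *head, **tail; } section;

static void plant(section *s, void (*pseudo)(int, int), int a, int b) {
    planted *p = malloc(sizeof *p);
    p->pseudo = pseudo; p->a = a; p->b = b; p->next = NULL;
    if (s->head == NULL) { s->head = p; s->tail = &p->next; }
    else { *s->tail = p; s->tail = &p->next; }
}

static void flush(section *s) {          /* .FLUSH <section>;            */
    for (planted *p = s->head; p != NULL; ) {
        planted *next = p->next;
        p->pseudo(p->a, p->b);           /* PSEUDOs call MACHOPs to emit */
        free(p);                         /* entries deleted after flush  */
        p = next;
    }
    s->head = NULL; s->tail = NULL;
}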

The intention of these extensions to CWIC was separating the target machine from the language translation. The theory was that PSEUDO instructions would be target machine independent; that is, pseudo names and arguments would be consistent across target machines, defined at a level able to target many machine architectures. PSEUDOs might be language specific or completely language and machine independent. The SYNTAX and GENERATOR phases would then be target machine independent, linking with different target PSEUDO, MACHOP and system libraries to target specific computer systems. In other words, a pseudo machine would be defined as having a set of pseudo instructions, and those pseudo instructions could then be coded for various machines.

SLIC and CWIC operate in phases. CWIC has SYNTAX and GENERATOR phases; SLIC adds ISO and PSEUDO. A phase should not be confused with a pass as in multipass compilers. A syntax rule calling a generator initiates the generator phase. In SLIC, flushing a section initiates the ISO phase followed by the PSEUDO phase. ISOs are optional; they are organized into named groups and can be selectively performed using command line switches.

The SLIC linker combines relocatable object files into a load module for the target machine.

ISO (In Sequence Optimizer) is an optional phase that operates on a section's pseudo instruction list when it is flushed. The process of generating code is programmed: a SYNTAX rule calls a generator, usually passing the top parse stack entry as its argument, so code generation is determined by when and which generator functions are called. Generators may plant pseudo instructions. The pseudo phase is initiated by a .FLUSH <section>; statement. An optional set of ISO transforms may be run on the section when it is flushed. The pseudo list is then processed, calling each pseudo instruction in turn; as they are called they may produce object file output by calling MACHOPs to emit object code. Like generator functions, PSEUDOs may also plant pseudo instructions; a constant may need planting to a constant section, for example.
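
A rough C sketch of planting and flushing a section as I have described it: PSEUDO entries go on a per-section list, and a flush calls each entry's procedure in order and deletes it. The struct and function names here are invented for illustration.

#include <stdio.h>
#include <stdlib.h>

typedef struct pseudo {
    void (*proc)(int *args);   /* the PSEUDO procedure              */
    int args[4];               /* its arguments                     */
    struct pseudo *next;
} pseudo;

typedef struct { pseudo *head, *tail; } section;

/* Plant: append a PSEUDO entry to the section's list. */
static void plant(section *s, void (*proc)(int *), int a0, int a1) {
    pseudo *p = malloc(sizeof *p);
    p->proc = proc; p->args[0] = a0; p->args[1] = a1; p->next = NULL;
    if (s->tail) s->tail->next = p; else s->head = p;
    s->tail = p;
}

/* .FLUSH: run the PSEUDOs in order, releasing each entry. */
static void flush(section *s) {
    for (pseudo *p = s->head; p; ) {
        pseudo *next = p->next;
        p->proc(p->args);      /* a real PSEUDO would call MACHOPs here */
        free(p);
        p = next;
    }
    s->head = s->tail = NULL;
}

static void pseudo_add(int *a) { printf("ADD r%d,r%d\n", a[0], a[1]); }

int main(void) {
    section code = { NULL, NULL };
    plant(&code, pseudo_add, 3, 5);   /* <ADD x,y;>   */
    flush(&code);                     /* .FLUSH code; */
    return 0;
}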

In SLIC and CWIC variables may contain any type of object. Unlike today's object oriented languages, objects are not programmed in the language but are part of the language. All variables hold dynamic objects that may contain any object type and can be interrogated as to their type. I am not sure how they would be classed today.



CWIC and SLIC


In CWIC and SLIC an abstract parse tree is built on the parse stack. The abstract parse tree or APT is an abstract syntax tree whose formation is directed by the parser language's transform operators. In the expression rule, term is expected to leave the recognized term's APT on the parse stack. In the parsing language, strings are normally only recognized: the '+' and '-' strings match the corresponding lexemes in the input stream, which are passed over with no other action. The parser language uses a node stack to save tree nodes, usually corresponding to recognized lexeme strings. When a '-' is recognized a node should be created for the operation, if we wish to express subtracting the second term from the first; or an ADD node if '+' is recognized. The equation is written with tree transformations:

expression = term $(('+':ADD|'-':SUB) term!2);

The !2 operator creates a 2-branch tree, in this case a binary APT whose node represents the recognized operation. The top two parse stack entries are popped to make the expression tree branches, and the tree is pushed on the parse stack. a-1 is transformed into the tree:

    SUB
   /   \
  a     1

or as displayed by SLIC's debugger:

 SUB[a,1]
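
To make the stack mechanics concrete, a small C sketch of the parse stack, node stack, and the !2 operation, printing the result in the SUB[a,1] debugger form. The names are illustrative only, not CWIC or SLIC code.

#include <stdio.h>

typedef struct obj {
    const char *name;                 /* node name or terminal lexeme */
    struct obj *branch[2];            /* NULL for terminals           */
} obj;

static obj *parse_stack[32]; static int psp = 0;        /* parse stack */
static const char *node_stack[32]; static int nsp = 0;  /* node stack  */

static obj *terminal(const char *lexeme) {
    static obj pool[32]; static int n = 0;
    obj *o = &pool[n++]; o->name = lexeme; o->branch[0] = o->branch[1] = NULL;
    return o;
}

/* :SUB  -- push a node name on the node stack */
static void push_node(const char *name) { node_stack[nsp++] = name; }

/* !2 -- pop two parse stack entries and the top node, push NODE[a,b] */
static void make_tree2(void) {
    static obj pool[32]; static int n = 0;
    obj *t = &pool[n++];
    t->name = node_stack[--nsp];
    t->branch[1] = parse_stack[--psp];
    t->branch[0] = parse_stack[--psp];
    parse_stack[psp++] = t;
}

static void print_tree(const obj *o) {
    printf("%s", o->name);
    if (o->branch[0]) {
        printf("["); print_tree(o->branch[0]);
        printf(","); print_tree(o->branch[1]); printf("]");
    }
}

int main(void) {                          /* parse of  a-1            */
    parse_stack[psp++] = terminal("a");   /* term recognizes a        */
    push_node("SUB");                     /* '-':SUB                  */
    parse_stack[psp++] = terminal("1");   /* term recognizes 1        */
    make_tree2();                         /* !2                       */
    print_tree(parse_stack[0]);           /* prints SUB[a,1]          */
    printf("\n");
    return 0;
}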

The equation creates a left-handed tree result. For example the expression:

a+b-c

is parsed left to right, recognizing the variable a, which is pushed on the parse stack. '+':ADD recognizes the + and :ADD creates and pushes an ADD node on the node stack. The variable b is then recognized and pushed on the parse stack. !2 creates a tree, popping the ADD node and the top two parse stack entries a and b, forming the tree:

 ADD[a,b]

    ADD 
   /   \
  a     b

The ADD[a,b] tree is pushed on the parse stack. The zero-or-more sequence loop continues, recognizing the - and pushing a SUB node on the node stack. c is recognized as a term and again !2 creates a tree from the top two parse stack entries and the top node stack entry:

 SUB[ADD[a,b],c]

        SUB
       /   \
    ADD     c
   /   \
  a     b

The expression equation using the $ sequence operator:

expression = term $(('+':ADD|'-':SUB) term!2);

The equation can be changed to produce a right-handed tree by using recursion:

expr = term (('+':ADD|'-':SUB) expr!2|.EMPTY);
 a+b-c

    ADD
   /   \
  a     SUB
       /   \
      b     c

The set of CWIC parse equations for an expression:

expr = term $(('+':ADD|'-':SUB)  term!2);
term = factor $(('*':MPY|'/':DIV) factor!2);
factor = '(' expr ')'|id|number;

dgt: '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
     'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';
lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
     'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';

let: upr|lwr;

alphanum: let|dgt;

id     .. let $alphanum;
number .. dgt $dgt MAKENUM[];

CWIC has 3 grammar rule types:

= syntax
.. token
: character class

SLIC adds a fourth:

== backtracking syntax

Character class rules are like a define in other languages. They define a literal test that may match any one of a list of characters. The alternative | (or) operator is used for consistency. They are simplistic, allowing only a single character or its numeric code as each alternative. Token rules define the tokens of the language. The default is to create a symbol object; symbols are entered into the dictionary. An intercept function is used to create other types of objects from the recognized lexeme:

MAKENUM decimal conversion.
MAKEHEX hex conversion.
MAKEOCT octal conversion.
MAKEBIN binary conversion.
MAKESTR string object.

The integer rule used in SLIC:

bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dec: oct|'8'|'9';
hex: dec|'A'|'B'|'C'|'D'|'E'|'F'
        |'a'|'b'|'c'|'d'|'e'|'f';

integer .. ("0H"|"0h") hex $hex MAKEHEX[]|
           ("0O"|"0o") oct $oct MAKEOCT[]|
           ("0B"|"0b") bin $bin MAKEBIN[]|
           dec $dec MAKENUM[];

TOKEN rules are also simplistic, able only to reference character classes and to call a generator function as the last operation. In the above, the numeric conversion functions are only allowed as the last thing in an outermost alternative sequence.

In SLIC I added a shortcut rule. The == rule is equivalent to the = rule with a .FAIL backtrack alternative:

x == y;

is equivalent to:

x = (y)\.FAIL;

CWIC and SLIC have two forms of backtracking; I distinguish them as backup and backtracking. Backtracking involves a long failure. A long failure is implemented similarly to a longjmp in C. A backtrack alternative or == rule creates a backtrack stack frame on the call stack and saves the parse state in it. Backtrack frames are linked. Backtracking is more complicated than a backup, which only needs to reset the input stream. These are a type of recursive descent parser whose parse and node stack entries are dynamic memory objects that may require removal, and possibly release of their memory, on a failure. Symbols may have been created and attributes set or changed. Luckily backtracking can be avoided in most cases by factoring; it is commonly only used for error recovery. Object code generation cannot be backtracked, and attempting it causes an internal compiler error. The backtrack alternative does not enable backtracking; it simply sets a backtrack recovery point. An initial system backtrack block, created on startup, catches unaccounted-for compiler backtrack errors.

The backtracking alternative operator \ catches a long backtracking failure. The == equation catches a backtrack failure and returns failure to its caller.

A backup occurs on a lexeme recognition failure. A failing test may cause a backtrack failure if it is not the first test of a sequence; if the first lexeme test of a sequence fails, the parse simply proceeds to an alternative of the present equation or, lacking an alternative, returns failure to its caller. Backup occurs within lexeme tests. Once a lexeme has successfully been recognized, a subsequent recognition failure causes a long failure regardless of whether a backtrack alternative is in effect. Such a long failure is caught by the initial backtrack block, which reports it as a compiler error. Backtracking is not necessarily local to the syntax equation in which it occurs; it may span any number of calls. Syntax rules do not create a stack frame; only the return address is on the call stack. Backtrack stack frames are backward linked. A backtrack frame maintains all the information required to reset the parse to the saved state, including dictionaries of symbols created and/or whose attributes have changed. On success these update the previous backtrack frame's dictionary; on failure their memory is released.
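
A hedged C sketch of the long-failure mechanism described above, using setjmp/longjmp for the backtrack frame; the parse_state structure and the function names are invented for illustration.

#include <setjmp.h>
#include <stdio.h>

typedef struct { long input_pos; int psp, nsp; } parse_state;

typedef struct bt_frame {
    jmp_buf jb;
    parse_state saved;
    struct bt_frame *prev;       /* frames are backward linked */
} bt_frame;

static bt_frame *bt_top = NULL;
static parse_state state = {0, 0, 0};

/* Raise a long failure: unwind to the most recent backtrack frame. */
static void long_fail(void) {
    if (bt_top) longjmp(bt_top->jb, 1);
    fprintf(stderr, "internal compiler error: unhandled backtrack\n");
}

/* x = (y) \ .FAIL; -- try y under a backtrack frame, restoring state on a long failure */
static int try_with_backtrack(int (*y)(void)) {
    bt_frame frame;
    frame.saved = state;
    frame.prev  = bt_top;
    bt_top = &frame;
    if (setjmp(frame.jb)) {      /* a long failure arrived here    */
        state = frame.saved;     /* restore the saved parse state  */
        bt_top = frame.prev;
        return 0;                /* report failure to the caller   */
    }
    int ok = y();
    bt_top = frame.prev;         /* success or plain failure: pop frame */
    return ok;
}

static int y_rule(void) {        /* stand-in for a syntax rule body      */
    state.input_pos = 42;        /* pretend some input was consumed      */
    long_fail();                 /* a later test in the sequence fails   */
    return 1;
}

int main(void) {
    printf("result %d, input_pos %ld\n", try_with_backtrack(y_rule), state.input_pos);
    return 0;                    /* prints: result 0, input_pos 0        */
}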

program = $(decl \ ERRORX["syntax error?"] $(-';' .ANY) ';');

The backtrack alternative above catches any error from decl. ERRORX is a CWIC library function that reports an error; the farthest point parsed is flagged with the error message. Error recovery is attempted by looking for a ';' in the programmed loop $(-';' .ANY). The -';' matches any character that is not a semicolon; a negative test match does not advance the parse. .ANY matches any character, advancing the parse one character. When a semicolon is reached the negated test fails, ending the loop, and the ';' following the loop advances the parse over it.
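
A minimal C sketch of the $(-';' .ANY) ';' recovery idiom, with the input stream reduced to a string and an index for illustration.

#include <stdio.h>

/* Skip characters up to (but not over) a semicolon, then consume the ';'. */
static int recover_to_semicolon(const char *input, int *pos) {
    while (input[*pos] != ';' && input[*pos] != '\0')  /* $(-';' .ANY)           */
        (*pos)++;                                      /* .ANY consumes one char  */
    if (input[*pos] == ';') { (*pos)++; return 1; }    /* ';' advances the parse  */
    return 0;                                          /* ran off the input       */
}

int main(void) {
    const char *src = "garbage tokens here ; next decl";
    int pos = 0;
    recover_to_semicolon(src, &pos);
    printf("resume at: \"%s\"\n", src + pos);   /* resume at: " next decl" */
    return 0;
}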

These are analytical grammars. With the ability to parse over unrecognized language constructs, you cannot equate them to production grammars. These languages have two types of lexemes: literals and tokens. Token lexemes are recognized by token equations and create token objects. Literal lexemes are recognized by quoted strings and normally do not create token objects. Operators preceding a string can change its operation.

Char class

Operator   Operation
,<string>  Creates a string object, pushing it on the parse stack.
+<string>  If matched, creates a string object, pushing it on the parse stack.
-<test>    Complements a test result.
~<char>    Complements a single character test, advancing over any single character not matching the given <char>. (A SLIC language extension.)

Reductive


Reductive grammar is defined in the McGraw-Hill Dictionary of Scientific & Technical Terms, 6E, Copyright © 2003 as:

Reductive grammar (computer science) A set of syntactic rules for the analysis of strings to determine whether the strings exist in a language.

The definition is not very specific, but it does prove the term exists and that the grammar is analytical. My position on reductive grammars is simply that they are of historical significance. I did not invent the terminology or technology. I do however think the technology is misunderstood and overlooked, so I will attempt explaining it.

In the definition, "syntactic rules for the analysis of strings" describes syntactic rules used for syntactic analysis, or parsing. So we are not that far apart on reductive grammars. However we must include the full meaning of reductive.

Rp, you say that BNF is the very same thing as Chomsky's context-free grammars; that there is no difference, other than (extended) BNF allowing alternation and iteration in right-hand sides.

First, BNF included alternation from the start. You need to study the original IAL and ALGOL 60 documents, all of them, including the notes. When I referenced BNF I was talking of its original use:

BNF, as described by Peter Naur in the ALGOL 60 report, consists of metalinguistic formulae. Sequences of characters enclosed in the brackets <> represent metalinguistic variables whose values are sequences of symbols. The marks "::=" and "|" (the latter with the meaning of "or") are metalinguistic connectives. Any mark in a formula, which is not a variable or a connective, denotes itself. Juxtaposition of marks and/or variables in a formula signifies juxtaposition of the sequence denoted. [2]

If you study the IAL and ALGOL papers it is apparent that BNF is not a production grammar but a metalanguage for defining and discussing languages.

Another example from the ALGOL 60 report illustrates a major difference between the BNF metalanguage and a Chomsky context-free grammar. Metalinguistic variables do not require a rule defining their formation; their formation may simply be described in natural language within the <> brackets. The following, from ALGOL 60 report section 2.3 Comments Specification, exemplifies how this works:

For the purpose of including text among the symbols of a program the following "comment" conventions hold:

The sequence of basic symbols:                         is equivalent to
; comment <any sequence not containing ;>;             ;
begin comment <any sequence not containing ;>;         begin
end <any sequence not containing end or ; or else>     end

In rule form the above is somewhat like Chomsky context-sensitive rules having multiple symbols on the left. However the rules are not adding text to the string but removing text: they strip comments out of the source text. The sequence <any sequence not containing ;> does not name a metalinguistic variable but provides a description of a sequence of characters. The comment rules expressed in table form were added in a revision. Naur's notes explaining metalinguistic variables were already in place, as was the meaning of "::=" being "is defined as", which stands in opposition to Chomsky's "is replaced by" production rule. Using the table form left the original meaning of "::=" unchanged.

Rp, you claim that what is reductive is not the BNF rules themselves, but the parsers that are automatically generated from them. This was the innovation brought about in 1962, apparently by META II: the automatic generation of parsers from grammars.

Schorre META languages META II and those that followed were not common parser generators.

I fell into the common thinking of META II and the other Schorre META languages as compilers. In rereading the original META II paper, trying to figure out how to explain META II, I realized it was a language. In programming we merge the compiler and language together; that is, we may talk about FORTRAN, COBOL, or C++ compilers or languages. In their documents they are more commonly referred to as languages. META II is a programming language, as described in its original document: a programming language for writing compilers. The paper describes the compiler-generated code. The parser programming languages of META II and TREE-META (1968) are identical except in code generating constructs; that is, on removing the code transforms their grammar languages are identical. TREE-META describes its syntax language as a reductive grammar. (The 1974 TREE-META added backtracking.)

META II is a META programming language, part of which is a syntax analysis parser language that is described as a reductive grammar. It is in later Schorre META language compilers, having a parsing or syntax analysis language identical to META II's, where you find their parser language described as a reductive grammar. I first learned of the term reductive grammar in a TREE-META paper dated 1968. The TREE-META grammar language is identical to META II's; only the code generation aspects have changed. The term reductive grammar is also found in other Schorre-based META language/compiler papers.

I am not claiming we need reductive grammars, as I did not make up the term. It has been used in historical programming META language documents. My position is that it needs to be included here because metacompilers using reductive grammar metalanguages are topics here.


Reductive grammars as used in Schorre META language compilers are best described in a 1974 TREE-META paper:

TREE-META's Method of Syntax Analysis

A top-down method of syntax analysis is used in the TREE-META system. The main goal is to match the complete program with the Syntax Rule whose name follows .META at the head of the program. The name of the Syntax Rule appears at the start of the rule followed by the = symbol. The right-hand side of this Syntax Rule indicates which entities have to be recognised to achieve this goal. The recogniser achieves its main goal by looking for these smaller entities, from left to right, as individual subgoals. These subgoals are themselves defined in terms of other entities, and so the process of recognising a complete program becomes a matter of looking for progressively smaller entities, right down to the level at which basic symbols such as characters and numbers are recognised. (TREE-META)

The subjects of compiler-compiler, metacompiler and parser generator have been confused and mangled together over the years to the point that they have lost all distinctiveness. I learned about these systems when metacompiler referred to a specific type of compiler-compiler, and a compiler-compiler was a real compiler compiler producing complete running compilers. That was before yacc's prominence. That was when a metacompiler actually compiled a metalanguage program, taking as input what would today be a metasyntax programming language. And make no mistake, they are programming languages designed for the purpose of writing compilers.

Because of our mangled terminology parser generator and compiler-compiler are no longer distinguishable.

What is a Reductive Grammar?


My views are different because the technology is not taught; I do not think it ever was. When yacc came with Unix and Unix became the prevalent operating system at universities, academia was on the yacc and lex bandwagon.

I intimately understand the technology because between 1969 and 1974 I wrote SLIC (System of Languages for Implementing Compilers), a compiler-compiler written in itself, based on the technology, and taught others to use it. In 1975-76 SLIC was used to write a COBOL compiler and a TI 990 assembler. SLIC is a complete compiler writing system, not a parser generator. SLIC was based on CWIC (Compiler for Writing and Implementing Compilers), developed at System Development Corporation. SLIC has basically the same parsing language; the code generating languages of both are LISP 2 based. SLIC extended the target machine code generation capabilities: where CWIC used a crude additive method of building byte code instructions, SLIC used symbolic instructions and added two additional levels of sequential instruction generation. The code generation output assembly-macro-like instructions that expanded into numeric machine code. SLIC operates in 5 phases.

The first phase analyzes the source, creating symbol tables and transforming the input source code into an abstract parse tree. At some point a call is made to a generator function, passing it the tree.

Phase 2 crawls around the tree emitting pseudo instructions. Like CWIC, the SLIC code generators output code into named sections; PSEUDO instructions are appended to a section's instruction list. A flush statement is used to output code to the object file. CWIC simply wrote the section's memory block.

But this discussion is about the parser language. I did not just discover it, or read about it in connection with parser generators. None of the Schorre-based META languages are parser generators. They are true compiler compilers: code generation and syntactic analysis integrated into a single system producing an object file. On a flush, SLIC went through the pseudo instruction list executing each PSEUDO instruction. As PSEUDO instructions executed they called MACHOPs that again expanded to numeric machine code. The MACHOP language allowed target machine instructions to be defined as bit field sequences. Addresses are to bits; with arithmetic manipulation bit addresses are converted to target machine addressable units.

The point is these were true compiler-compilers not parser generators.

My understanding of reductive grammars is from how they are used in several Schorre-based META languages. I have always known these as programming languages, as that is how I learned them and how they are documented.

To understand what reductive grammars are, I looked at how reductive methods are applied in mathematics, computer science, and philosophy. I found reductive is a type of top-down analysis that breaks a problem down into smaller and smaller parts.

A paper published in 2007 shows reductive logic is an established methodology of analysis in math, logic and computer science today:

The process of calculating a proof of a given putative conclusion, for which non-deterministic choices between premises must be resolved, is called proof-search and is an essential enabling technology throughout the computational sciences. This study suggests that the reductive view of logic is as fundamental as the deductive view, and discusses some of the problems that must be addressed in order to provide a semantics of proof-searches of comparable value to the corresponding semantics of proofs. Just as the semantics of proofs is intimately related to the model theory of the underlying logic, so too should be the semantics of reductions and of proof-search. [Published: Oxford Scholarship Online: September 2007 http://www.oxfordscholarship.com/view/10.1093/acprof:oso/9780198526339.001.0001/acprof-9780198526339]

In mathematics we have more specific explanations that Arthur Rubin should also understand of how reduction is used in solving math problems: matrix row and column reduction, reductions in solving integrals by parts, differential equation and Laplace transform reduction methods.

In computer science:

In the Prolog family of languages, logic programs have a procedural interpretation as goal-reduction procedures.

Reduction is a scientific descriptive term used in mathematics and computer science. Mathematical reductions basically break a problem down into smaller and/or simpler parts that have known solutions or are simpler to solve. Similarly, in computer science and applied logic, reductive relates to top-down, goal-seeking methods of solution that break a problem down into simpler parts.

In TREE-META papers we find its parser language described as a reductive grammar, and explained as a top-down, goal-seeking method of syntax analysis.

From the various uses of reduction we can define a reductive grammar:

Reductive grammar: A set of top-down, goal-oriented formation rules for the strings of a formal language. It is a form of phrase structure grammar. Each rule names a linguistic constituent and defines its possible formations. It is a specific organization of grammar rules, starting with a top rule defining the language using other rules, each rule in turn dividing its structure in a like fashion.

Reductive is the method of syntactic analysis used in Schorre META languages.

A reductive grammar defines the syntax that one must use when writing in the language. Reductive grammar rules define parsing objectives. We can define a book:

book = table_of_contents chapter chapter* index;

A COBOL program has divisions:

program = IDENTIFICATION_Division
          ENVIRONMENT_Division
          DATA_Division
          PROCEDURE_Division;

A C program is made up of declarations:

 program = declaration*;

Programming a parser in these languages is simple. Many compiler books start by introducing students to recursive descent parsers. The advanced Schorre parser programming languages are used to program recursive descent parsers on steroids. With their program-controlled backtracking, negative and positive look ahead, plus symbol attribute testing, they are not limited to simple context-free languages. They do not simply define valid statements or sentences of a language; they define the complete valid structure of programs.

edit b


The term derivation does not really apply to these parser programming languages. Transformations of the input source are not inferred from the way grammar rules are written but programmed using explicit directives that are part of the language. Instead of writing several rules that infer transformations, a single rule can be written, eliminating the need for backtracking by factoring out ambiguity.

Rp, You used an example of ambiguity along the way:

when doing top-down, left-to-right parsing, these rules

A := B C | D
D := B E

present a problem: you don't know whether to interpret A as B C or as B E as they both start with B. This can be solved by backtracking, but such a parser can take an exponential time to run.

You also claimed the alternative operators were just syntax sugar:

A = B (C | E);

You said nothing about grouping. (Why?) I would say that what you call syntax sugar is syntax sweetness, for it is exactly how reductive grammars like META II eliminate the need for backtracking.

CWIC and SLIC have backtracking, designated using the backtrack alternative operator \. They also have backup, which only affects the input stream when recognizing a lexeme. The old documents call both backtracking; I find that confusing.

I have been trying to understand and categorize their parser capabilities, examining various parser types. The only thing one can say is that these languages are used to program a type of top-down recursive descent parser used for syntactic analysis. They are recursive descent, building an abstract parse tree using separate parse, node, and call stacks. Tokens are recognized by token .. rules and pushed on the parse stack. Terminal symbols are recognized using quoted strings. CWIC and SLIC have token rules, while the earlier languages have built-in token functions .ID, .NUM, and .STRING. The term token in these languages is slightly different than its current use in computer science: token lexemes are recognized by token rules, while string lexemes recognized by quoted string literals are not tokens. A bit confusing to explain. We have string token rules that create string objects. A string literal recognizes a character sequence that is not normally kept; a + preceding a string literal will create a string token if a match occurs. A rule such as expr = term (('+'|'-') term)*; would not keep the + or - lexemes.

Unparse rules began with TREE-META. Unparse rules recognize tree patterns, invoking action procedures. CWIC and SLIC have named, LISP 2 based unparse generator functions, with LISP procedural actions invoked on an unparse rule's recognition of a parse construct. A generator is in a way like an overloaded function: it is a named group of <unparse> => <action> pairs. Generators are called from syntax rules. The unparse part is an active, tree-crawling recognizer. Unparse rules may call other generators and/or assign tree parts to local variables.

I had not given the linguistic grammar type of their syntax rules any thought until trying to describe these old systems in today's lingo. In fact, having written compilers, assemblers, JCL languages and designed communication protocols using these same grammars, I can say that at the time I did not give a crap about the linguistic grammar type. I just did what was necessary to get the job done. When writing an assembler it needed to recognize location tag symbols starting in column 1, so I just added a line break matching feature to the parsing language. Implementing a line break test was simple: .LB recognized line separators. Like literal tests, the line separators were not kept. Used as the first test in a token equation it locked the token recognition to column one. I have forgotten the operator used for line break; I thought it to be .BREAK, but .BREAK comes from CWIC and is used to break out of a loop.

Anyway Rp, as you and Rubin seem truly concerned with grammars and parsers, I would appreciate your help. I think that first you need a complete understanding of the operation of these compiler writing or implementing systems.

edit c


Having already tried explaining them as a grammar type and failed, I will show how they are used in writing a parser.

The CWIC and SLIC compiler writing/implementing systems are true compiler compilers that produce executable code. They are programming languages and can be used for other purposes.


CWIC and SLIC, both producing linkable object files, have almost the same parser language. CWIC, developed on an IBM 360 computer system, was more punch card oriented; SLIC was developed on a DEC-System-10, taking ASCII input, and included an interactive debugger. CWIC had a SYNTAX parser language and a GENERATOR LISP 2 based code generator language. SLIC also had SYNTAX and GENERATOR languages but added MACHOP (machine instruction architecture), PSEUDO (pseudo machine instruction), and ISO (In Sequence Optimization) languages. A linker/load formatter was also part of the SLIC system.

The generator language of CWIC generated binary byte (8 bit) code. SLIC's generator produced PSEUDO code instructions; pseudo instructions expanded into machine instructions using MACHOPs defined in the MACHOP language. PSEUDO instructions were in a way like assembly macro instructions: procedures written in a subset of the generator language. The CWIC and SLIC generator languages planted code into named sections. Plant is a term originally used by CWIC; instructions are planted in sequence, one after the other, like planting a row in a garden. In CWIC, planting was into a byte-addressable memory block: eight bit byte numeric values were calculated and planted into a named section, and a .FLUSH <section name> statement wrote the section's memory block to the output file. SLIC also planted instructions into named sections, only the instructions were pseudo calls, and planting a pseudo instruction appended it and its arguments to a list. On flushing, the pseudo instruction list is executed, calling the PSEUDO instruction procedures in list order and deleting each entry. The PSEUDOs expanded into machine code, calling MACHOPs to output relocatable binary code to the object file.

MACHOPs are defined to produce bit field sequences. The MACHOP view of memory is a sequence of addressable bits: the memory model is bit addressable and addresses are bit addresses. To address an 8 bit byte, the SLIC bit address is divided by 8; likewise a 16 bit word-addressed target machine would divide the bit address by 16, and the DEC-10, a 36 bit word machine, required the bit address to be divided by 36. In defining MACHOPs, a .MORG is used to align output to the target machine's addressable boundary.

edit d


I think at one time the Schorre-based metacompilers may have been properly called parser generators, but today's parser generators do not really describe them. A more descriptive name for the technology is parser programming language, or metasyntax programming language, the emphasis being on programming language.

Importance


Production grammars have on the left side a pattern (or regular expression) string having at least one non-terminal symbol. There may be multiple rules having the same pattern string. These rules are randomly selected and applied to a string initially containing a non-terminal symbol. However, when we have equations of the form:

name is-defined-as pattern-structure

e.g.

expr = term $(('+'/'-') term);

And whose semantics are:

The pattern-structure is matched against strings to determine whether the strings exist in the language.

The pattern notation has these operators:

/ alternative
\ backtrack alternative
$ zero or more
: push node
! make tree
+[ ]+ make list
( ) group

Reductive grammar parser programming languages have different semantics. Calling them production rules is like claiming that a programming language having flat mathematical operator hierarchy is the same as one having algebraic hierarchy because they appear visually to be the same.

Is algebraic notation the same as Reverse Polish Notation because they have the same expressive power? That seems to be the argument here: that productive, analytical and reductive grammars are no different because they have the same expressive power. I think, just as with programming languages, grammars having different semantics are justified as being different. Such grammars do not fit into linguistic theory but do fit into computer science parser technology. Reductive grammars are really more aligned with mathematics and logic than linguistics.

The subjects of my concern are metasyntax programming languages that are used to program a parser. They also serve to define the language they parse, so they are both a grammar and a programming language. Though one might use equivalent production rules to define the grammar, those rules would no longer be the same language. The languages I am interested in classifying could, I believe, be described as top-down analytical phrase structure grammars. Grammar terms like constituent apply to the parser language's equations: each equation or rule uniquely names a language constituent and defines its structure. It specifies the parsing of the constituent. They are goal-oriented languages made up of equations, each equation defining a goal. Goals are tests that recognize some part of a language and transform it into an internal structure or output. Rules or equations are boolean functions returning success or failure. Each equation or rule is a uniquely named goal.

As an analytical phrase structure grammar, each equation or rule uniquely names a language constituent and defines its possible formations. The advanced parser programming languages have character class, token, and syntax equations. A character class rule could itself be defined by a syntax rule:

class = id ':' (char|id) $('|' (char|id)) ';';

Character class rules may include previously defined classes by referencing their ids and may use character constants. A char token rule is defined as:

char .. '''' .ANY '''';

Note that the character constants used here are themselves chars as defined by the char rule; it is used in its own defining equation. Numeric classes are used in defining numeric tokens:

bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dec: oct|'8'|'9';
hex: dec|'a'|'b'|'c'|'d'|'e'|'f'|
         'A'|'B'|'C'|'D'|'E'|'F';

An integer token defined using the classes above:

integer .. ("0H"|"0h") hex hex* MAKHEX[]|
           ("O"|"o") oct oct* MAKOCT[]| 
           ("B"|"b") bin bin* MAKBIN[]|
                     dec dec* MAKNUM[];

MAKHEX[], MAKOCT[], MAKBIN[] and MAKNUM[] are supplied library functions that create a numeric object. Token rules parse character sequences, gathering them into a token string. The default is to enter the string into the dictionary, creating a symbol object that is pushed on the parse stack. Intercept functions such as MAKHEX, used above, instead operate on the token buffer, creating numeric objects that are pushed on the parse stack. Character class rules define character groups that may be recognized using the class id in a token or syntax equation. In a token ".." equation, recognizing a class character copies it to the token. In a syntax "=" or "==" equation, a character class is treated as a literal string, and literal strings are normally not kept. Token ".." equations recognize character sequences and create token objects. Symbol tokens (the default) are entered, or cataloged, into the dictionary; symbols may be assigned attribute values. Other, uncataloged objects are created by intercept functions that create numeric and string objects.
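
As an illustration of what the numeric intercept functions conceptually do with the token buffer, a small C sketch; the radix handling is the point, and the names are mine, not the library's.

#include <stdio.h>
#include <stdlib.h>

static char token_buffer[64];   /* characters gathered by the token rule */

static long make_radix(int radix) {        /* MAKHEX -> 16, MAKOCT -> 8, ... */
    return strtol(token_buffer, NULL, radix);
}

int main(void) {
    snprintf(token_buffer, sizeof token_buffer, "1F");  /* gathered after "0H" */
    printf("%ld\n", make_radix(16));                    /* prints 31           */
    return 0;
}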

I agree that the rules may be used as documentation on writing programs in the language specified by the rules. But the rules are a program that is compiled into a syntax analyzer.

The parser language has the ability to recognize invalid language sequences, report them, and recover by skipping unrecognized character sequences. An error reporting and recovery example programmed in the parser language:

prog = $(decl \ (".END" .BREAK / ERRORX["syntax error"] $(-';' .ANY) ';'));

The above looks for a sequence of decl followed by .END. The \ backtrack alternative is used to reset the parse to the start of the most recent decl recognition attempt when decl fails. Recognition of the string ".END" executes .BREAK, a successful loop termination; otherwise ERRORX is called. ERRORX reports an error, specifying the point of failure. The parse expression $(-';' .ANY) skips any character that is not a ';'. The - negates a test, returning success if the test is unsuccessful; .ANY matches any character; the $ sequence operator repeats the following test until it fails. The parse expression sequence thus skips the decl in error, ignoring characters up to the decl-terminating ';' character, where the compile attempts to continue.

Success or Failure (semantics)


The metasyntax programming languages are compiled into executable code.

The terms success and failure refer to the status of a test. A test may be considered a boolean type having a value of success or failure, but in most cases success and failure are the state of a bit in the processor's status register, most commonly the zero status flag bit.

A rule is compiled into an executable test function returning success or failure. SLIC optionally shows generated code in the program listing, with generated symbols of the form %<letter><number>: %g1, %l3, etc. A backtrack failure resets the call stack to the most recent backtrack frame before returning failure.

After a test in the generated code, a machine branch-on-condition is used to test the returned status (success or failure):

     call   a_test
     je     a_success_loc
     jne    a_failure_loc

This assumes a_test returns to the instruction after the call instruction. The je instruction can be used to transfer control to a_success_loc, or a jne to a_failure_loc. Note: the branch instructions above illustrate the condition testing, not actual code. Normally only one or the other would be generated; whatever follows would be the opposite condition.

expr = term $(('+':ADD|'-':SUB) term !2);
''Example code is IA86 based, for illustration only. It resembles the SLIC assembly list option.''

It would generate the following IA86 code sequence:

%S1:    ASCIIZ   "+"
%S2:    ASCIIZ   "-"
_ADD_N: DB   "ADD",0
_SUB_N: DB   "SUB",0

expr:   call term      // term
        je   %g1       // success?
        ret            // return failure.
%g1:    lea  si,%S1    // "+"
        call _cmp_str  // string test
        jne  %g2       // failure?
        lea  si,_ADD_N // success: ADD node
        jmp  %g3
%g2:    lea  si,%S2    // failure: look for "-"
        call _cmp_str  // string test
        je   %g4       // success: SUB node
        cmp  al,al     // neither matched: make EQ status
        ret            // $ loop exit, return success
%g4:    lea  si,_SUB_N // success: SUB node
%g3:    call _node     // create and push node obj
        call term      // term test
        je   %g5       // success: build tree
        jmp  _backtrk_ // backtrack fail
%g5:    mov  ecx,2     // tree branches
        call _tree     // create and push tree
        jmp  %g1       // $ loop

Functions _cmp_str, _node, _tree and _backtrk_ are supplied library functions. The point here is that the translation to code is simple and straightforward.

_cmp_str skips white space (_skip_class characters) preceding the first character match. Compiling to machine code, the compilers produced by these compiler compilers are generally faster than their table-driven counterparts.

SLIC produced relocatable machine code link files. Polish fixup link blocks allowed complex forward references to be handled by single pass compilers. The SLIC target linker required a backend target load file formatter to be written; a target linker was then built by linking the formatter with the linker library.

Machine dependent support code for SLIC was written in MACRO-10 or BLISS.

What is a Reductive Grammar?


I have always known these as programming languages, as that is how I learned them and how they are documented.

In programming we merge the compiler and language together; that is, we may talk about FORTRAN, COBOL, or C++ as compilers or languages. In their documents they are more commonly referred to as languages. META II is a programming language as described in its original ACM paper: a programming language for writing compilers. The paper describes the compiler-generated code; we know it is the META II compiler, but it does not explicitly name it as the META II compiler. The parser programming languages of META II and TREE-META are basically identical except in code generating constructs; that is, on removing the code transforms their grammar languages are identical. TREE-META describes its syntax language as a reductive grammar.

To derive a consistent definition of reductive grammar, I looked at how reductive methods are applied. Research into reductive as used in mathematics, computer science, and philosophy shows reductive is a type of top-down analysis that breaks a problem down into smaller and smaller parts.

A paper published in 2007 shows reductive logic is an established methodology of analysis in math, logic and computer science today:

The process of calculating a proof of a given putative conclusion, for which non-deterministic choices between premises must be resolved, is called proof-search and is an essential enabling technology throughout the computational sciences. This study suggests that the reductive view of logic is as fundamental as the deductive view, and discusses some of the problems that must be addressed in order to provide a semantics of proof-searches of comparable value to the corresponding semantics of proofs. Just as the semantics of proofs is intimately related to the model theory of the underlying logic, so too should be the semantics of reductions and of proof-search. [Published: Oxford Scholarship Online: September 2007 http://www.oxfordscholarship.com/view/10.1093/acprof:oso/9780198526339.001.0001/acprof-9780198526339]

The above establishes that reductive is a term currently used in computer science. In mathematics we have more specific explanations that Arthur Rubin should also understand of how reduction is used in solving math problems: matrix row and column reduction, reductions in solving integrals by parts, differential equation and Laplace transform reduction methods. And in computer science:

In the Prolog family of languages, logic programs have a procedural interpretation as goal-reduction procedures.

Reduction is a scientific descriptive term used in mathematics and computer science. In understanding its general meaning we can derive a definition of reductive grammar.

Mathematical reductions basically break a problem down into smaller and/or simpler parts that have known solutions or are simpler to solve. Similarly, in computer science and applied logic, reductive relates to top-down, goal-seeking methods of solution that break a problem down into simpler parts.

In a TREE-META paper we find its parser language described as a top-down, goal-seeking method of syntax analysis:

A top-down method of syntax analysis is used in the TREE-META system. The main goal is to match the complete program with the Syntax Rule whose name follows .META at the head of the program. The name of the Syntax Rule appears at the start of the rule followed by the = symbol. The right-hand side of this Syntax Rule indicates which entities have to be recognised to achieve this goal. The recogniser achieves its main goal by looking for these smaller entities, from left to right, as individual subgoals. These subgoals are themselves defined in terms of other entities, and so the process of recognising a complete program becomes a matter of looking for progressively smaller entities, right down to the level at which basic symbols such as characters and numbers are recognised. TREE-META 1974

The above applies to all Schorre-based metacompilers: META II, META 3, META 4, META 5, TREE-META, META 6, CWIC, SLIC and others.

From the various uses of reduction we can define a reductive grammar:

Reductive grammar: A set of top-down, goal-oriented formation rules for the strings of a language. It is a form of phrase structure grammar. Each rule names a linguistic constituent and defines its possible formations.
Reductive grammar (computer science): A set of syntactic rules for the analysis of strings to determine whether the strings exist in a language. McGraw-Hill Dictionary of Scientific & Technical Terms, 6E, Copyright © 2003 by The McGraw-Hill Company.

That reference came from an online source that most likely only shows the initial paragraph. The definition is not wrong; I simply did not perceive its preciseness at the time. It defines a reductive grammar as one used for syntactic analysis: a grammar used by a parser. So it is describing a grammar that a parser generator might take as input. Reductive is the method of syntactic analysis. However, the documents in which I found "reductive grammar" appearing precede the advent of today's parser generators. The fact that it is described in the McGraw-Hill Dictionary of Scientific & Technical Terms as computer science related proves that reductive grammar is a valid computer science term, despite the opinions of you (Rp) or Arthur Rubin.


NPOV?


My interest is in the grammar type of these metasyntax languages as they are used in programming parsers. They define the language they parse, so they must be a grammar as used.

In various places, you find it explained that a formal language is defined by a formal grammar. The Parsing or syntactic analysis topic is an example:

Parsing or syntactic analysis is the process of analysing a string of symbols, either in natural language or in computer languages, conforming to the rules of a formal grammar. The term parsing comes from Latin pars meaning part.

READ ARTICLE

The fact is: analytic grammars, like PEGs and the reductive grammars used in Schorre parser programming languages, do just fine defining a programming language.

It can be confusing when Formal Grammar starts off explaining itself as productive. To get the full implication of this, use the Android app and click the link to Formal Grammar:

Formal Grammar:

In formal language theory, a grammar is a set of production rules for strings in a formal language. The rules describe how to form strings from the language's alphabet that are valid according to the language's syntax.
READ ARTICLE

From the app you only see the above paragraph displayed in a sub window. Assuming an analytical or reductive grammar can be a formal grammar, would not an NPOV, direction-indeterminate, descriptive intro be more appropriate? For example:

A formal grammar is a set of formation rules for strings in a formal language. The rules describe valid string formations that may be formed from the language's alphabet.

or:

A formal grammar is a set of rules defining valid string formations of a formal language.


Since formation can be the action or process of something being formed, or a structure or arrangement of something, it leaves open the specifics of a formal grammar being analytical, reductive, or productive.


Subjects having multiple usages should start with a general concept and then branch out. I am not sure how this should be done. Reductive grammars are of historical significance.


PEGs are claimed to be analytical but are described using productive terminology. That works for PEGs. But still, there is no concept of ordered choice in production grammars. That is one distinctive difference setting some analytical grammars apart from production grammars; the Schorre metalanguages also use ordered choice. The Formal grammar topic makes it out that only a production grammar is a formal grammar. Yes, there is an analytical grammar section, but not starting out with an NPOV leads one to believe that productive grammars are the only valid formal grammar, and then articles like the parser article seem to be wrong.

I hope to explain the advanced parser programming languages of CWIC and SLIC that really can not be expressed in a productive grammar.

History


The specific languages I am talking about originated with Dewey Val Schorre's META II compiler compiler language in 1964, or META I, a hand-coded executable from which META II was bootstrapped. A series of metacompilers sprang from META II. The Schorre META languages are all metasyntax programming languages that are compiled into executable code. The early META languages were compiled into byte code for a virtual stack machine. CWIC compiled to IBM 360 machine code. SLIC compiled to DEC-10 machine code; that is, compilers produced using SLIC ran on a DEC-System-10, while SLIC has the ability to generate code for a large number of computer instruction sets. Scannerless parsing is employed in all Schorre-based compiler writing systems.

All these languages have transforming operations included in the parsing language. META II used .OUT to write assembly code to a file; * addressed the most recently parsed token entity; .ID, .NUM, and .STRING recognized tokens conforming to ALGOL 60 rules. The example below shows how an assignment statement is parsed and translated to executable assembly code text.

ASIGN = .ID '=' EXPR ';' .OUT('STORE' *);
EXPR = TERM $('+' TERM .OUT('ADD') / '-' TERM .OUT('SUB'));
TERM = FACTOR $('*' FACTOR .OUT('MPY') / '/' FACTOR .OUT('DIV'));
FACTOR = .NUM .OUT('LDL '  *)/.ID .OUT('LD ' *)/'(' EXPR ')';

META II was simple, generating code for a stack machine, using * to address the most recently recognized token of the current rule. .OUT output a line of code, starting in the opcode field; .LBL output labels on separate lines starting in column 1.
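
As a worked example, parsing a + b * c with the EXPR, TERM and FACTOR rules above would emit the following stack machine code, assuming each .OUT call writes one line in the opcode field:

 LD a
 LD b
 LD c
 MPY
 ADD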

In 1968 TREE-META made a notable change, adding code generation unparse rules and changing the syntax equations to transform input source code into abstract syntax tree structures. The :<node name> operator creates a node and a bracketed number, [<number of branches>], creates the tree:

ASIGN = .ID '=' EXPR:ASGN[2];
EXPR=TERM $('+'TERM:ADD[2]/'-'TERM:SUB[2]);
TERM=FACTOR $('*'FACTOR:MPY[2]/'/' FACTOR:DIV[2]);
FACTOR = .NUM/.ID/'(' EXPR ')';

 :ASGN[2] creating a tree of the form:
    ASGN
   /    \
 ID      EXPR

The EXPR rule in turn creating ADD[2] or SUB[2] trees:

     ADD            SUB
    /   \          /   \
TERM     TERM  TERM     TERM

Likewise TERM creates MPY[2] and DIV[2] trees:
       MPY                DIV
      /   \              /   \
FACTOR     FACTOR  FACTOR     FACTOR

At about the same time, with Schorre on the development team, CWIC did the same, only with a far more advanced code generator language based on LISP 2. The CWIC syntax language gained token and character class rules, positive and negative look ahead, and program-controlled backtracking. !<number> is used to create trees, equivalent to the [<number>] operation in TREE-META. In CWIC there are two operational stacks and a call stack. The parse stack contains recognized entities and lists created by the !<number> tree-forming operator or the +[...]+ list-forming construct. The node stack contains nodes pushed by the :<node> operation. The !<number> operation removes the top <number> of parse stack entries, combining them with the top node stack entry into a list, and pushes the list onto the parse stack.

ASIGN = ID '=' EXPR:ASGN!2;
EXPR=TERM $(('+':ADD/'-':SUB)TERM!2);
TERM=FACTOR $(('*':MPY/'/':DIV)FACTOR!2);
FACTOR = NUM/ID/'(' EXPR ')';
DGT: '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9';
UPR: 'A'/'B'/'C'/'D'/'E'/'F'/'G'/'H'/'I'/
     'J'/'K'/'L'/'M'/'N'/'O'/'P'/'Q'/'R'/
     'S'/'T'/'U'/'V'/'W'/'X'/'Y'/'Z';
LWR: 'a'/'b'/'c'/'d'/'e'/'f'/'g'/'h'/'i'/
     'j'/'k'/'l'/'m'/'n'/'o'/'p'/'q'/'r'/
     's'/'t'/'u'/'v'/'w'/'x'/'y'/'z';
LET: UPR/LWR;
ALPHANUM:  LET/DGT;
NUM .. DGT $DGT MAKENUM[];
ID  .. LET $(ALPHANUM/+'_');

The string equation in CWIC is a prime example of CWIC's analytical grammar recognition power. CWIC used the ' mark to delineate strings; two ' marks within a string represent a single ' mark. The string token equation is written using strings conforming to its own definition:

string .. '''' $(-'''' .ANY/ '''''','''') '''' MAKESTR[];

The above string parsing equation recognizes a string token. Characters recognized by .ANY are copied to the token. Quoted strings in token equations are normally only recognized, not copied to the token string. Operators preceding a string modify its function: a , preceding a string does not recognize the string but inserts it into the token; a - is a negative look ahead.

-''''

tests for a single quote character, returning failure if matched and success when the string is not matched. ? is a look ahead, evaluating to the test result but not advancing over any characters regardless of the result.

'''''',''''

tests for two ' characters together and if successful inserts a single ' into the token.

'don''t'

becomes the string don't. Two single quotes within a string are interpreted as one single quote. A string literal preceded by a +, if matched, is copied to the token. In syntax rules the , and + operators create string entities and push them on the parse stack.
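
A small C sketch of the quote-doubling logic spelled out above, scanning a '...' literal and copying its characters to a token buffer; plain C for illustration only, not the CWIC implementation.

#include <stdio.h>

/* Scan a '...' literal starting at src, writing the token text to out.
 * Returns the number of source characters consumed, or 0 on failure.   */
static int scan_string(const char *src, char *out) {
    int i = 1, j = 0;                     /* src[0] must be the opening '  */
    if (src[0] != '\'') return 0;
    while (src[i]) {
        if (src[i] == '\'') {
            if (src[i + 1] == '\'') {     /* two quotes together ...       */
                out[j++] = '\'';          /* ... insert one into the token */
                i += 2;
                continue;
            }
            out[j] = '\0';                /* closing quote: token complete */
            return i + 1;
        }
        out[j++] = src[i++];              /* -'''' .ANY : copy the char    */
    }
    return 0;                             /* unterminated string           */
}

int main(void) {
    char token[64];
    if (scan_string("'don''t'", token))
        printf("%s\n", token);            /* prints: don't */
    return 0;
}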

A character class rule is designated by the : rule operator:

dgt: '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9';

defines a list (group) of character literals and binds it to a name. The name may then be used to match any single character in the list. In token equations, character class matches become part of the token.

SLIC is based on CWIC, extending it with features enabling the description of machine instruction set architectures and separating target machine code generation into separate phases. The parsing language changes were mostly syntax sugar. It had single-character string literals:

char .. '''' .ANY '''' MAKESTR[];
string .. ('''' .ANY '''' | '"' $(-'"' .ANY/ """""",'"') '"') MAKESTR[];

A char string was then used in character class equations. MAKESTR[] is a function call. MAKESTR, MAKENUM, MAKEHEX, MAKEOCT, and MAKEBIN are supplied functions that convert the token buffer character sequence into a string or numeric atom.

Development of these compiler writing systems stopped in the early to mid 1970s, or further work was not publicly available.

I first learned about the technology at an ACM SIGPLAN meeting in 1969, where Erwin Book gave a talk on the CWIC compiler writing system. There is a paper on CWIC available from the ACM digital library dated 1970. Erwin brought a few programming manuals to the meeting and I got a set. At that time I was employed at Cerritos College, where I later wrote SLIC.

Schorre metacompilers are classical examples of syntax-directed translation. A compiler's source language specification and syntactic analysis are programmed using boolean-valued parsing functions that appear to be phrase structure grammar rules, each specifying a linguistic constituent test against the input source character stream. With their boolean algebraic base, ambiguous language constructs are combined and common ambiguous sub-expressions factored out, eliminating the need for backtracking in most cases. Backtracking was in the original design of CWIC; the early Schorre metalanguages, not having backtracking, relied completely on factoring to eliminate ambiguous constituents. All have backup, which is used in their lexical analysis functions. Backup only saves the input state, whereas backtracking saves the parse state. Backtracking is also limited to parsing: generating code within an active backtrack causes a compiler error. Note that generators may be called to do tests.

Relation to Parsing Expression Grammar and Boolean Algebra


Rp, I see you have been active over on the Talk:Parsing expression grammars. The languages I am trying to explain are quite similar to PEGs. After reading Ford's paper on parsing expression grammars I have to say PEGs are not a new concept; the resemblance is uncanny. The main differences are token recognition syntax and backtracking details. Other differences are minor: different operators expressing identical operations. PEGs are claimed to be related to the family of top-down parsing languages introduced in the 1970s. As a description, Top Down Parsing Language is descriptive of the technology I am talking about. In fact, after reading Ford's paper, I would say these are parsing expression grammars; it could be describing the parser programming languages in the Schorre line. However, PEGs only cover parsing. All Schorre parser languages include semantic transformations directly producing assembly code or abstract syntax trees. Almost everything that distinguishes PEGs as top-down analytical grammars suited to parsing programming languages applies to these languages, except that the Schorre metasyntax languages were developed before 1970, are compiled into executable code, and include semantic transforming features. Compiling to executable code is an important defining property.

The Schorre metasyntax language parser equations have the same operational interpretation as parsing expression grammar rules. Each parsing equation conceptually takes input from the input character stream (or string) as its argument, and yields one of the following results:

  • success, in which case the operation moves forward, consuming one or more characters of the input; or, in the case of look ahead, the parse state is left unaffected.
  • failure, in which case no input is consumed.

Failure is a bit complicated. There are two forms of failure. On the failure of a string or token test the input stream is reset to the initial entry state and a normal return of failure status occurs from the failing equation to the calling expression. That will in turn cause a failure. In a sequence of tests the failure of the first test results in a non-backtracking failure. If a test other than the first fails, the result is a long failure. CWIC called this long failure backtracking. A \ backtrack alternative creates a backtrack save state stack frame before its left alternative is attempted. A backtrack failure resets the call stack pointer to the top backtrack stack frame, restores the parse state and returns failure. This may be analogous to a packrat parser. I cannot find a detailed description of a packrat parser, but I do not see different alternative operators being used in PEGs, and what I have found on packrat parsing describes a bottom up parsing algorithm. A backtrack alternative is used when ambiguities cannot be factored out. Look ahead on anything other than a string also uses backtracking.

X = (A | B | C) \ D
A = a E
B = b F G
C = d

long failure, in which the parser state is reset to the top backtrack saved state and control is transferred to that saved state's backtrack alternative. If no backtrack exists, control transfers to a compiler exception handler that reports the unhandled backtrack failure, terminating the compile with prejudice.
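A minimal way to picture the two failure kinds is the Python sketch below (my own modeling, not CWIC's implementation): ordinary failure is a False return, while a long failure unwinds to the nearest backtrack (\) alternative, modeled here with an exception.

# Sketch (my modeling, not CWIC code): normal failure returns False; a long
# failure unwinds to the nearest backtrack alternative, which restores the
# saved parse state and tries its right alternative.
class LongFail(Exception):
    pass

class Parser:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def lit(self, s):                      # literal string test (backup only)
        if self.text.startswith(s, self.pos):
            self.pos += len(s)
            return True
        return False

    def seq(self, *tests):                 # a sequence:  t1 t2 t3 ...
        if not tests[0](self):
            return False                   # first test fails: normal failure
        for t in tests[1:]:
            if not t(self):
                raise LongFail()           # a later test fails: long failure
        return True

    def backtrack_alt(self, left, right):  # left \ right
        save = self.pos                    # a real backtrack frame saves far more state
        try:
            if left(self):
                return True
        except LongFail:
            pass
        self.pos = save                    # restore parse state, try the right alternative
        return right(self)

p = Parser("a c")
ab = lambda s: s.seq(lambda s: s.lit("a "), lambda s: s.lit("b"))
ac = lambda s: s.seq(lambda s: s.lit("a "), lambda s: s.lit("c"))
print(p.backtrack_alt(ab, ac))             # True: "a b" long fails, \ retries "a c"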

Each parsing equation is a parsing function of a recursive descent parser, and the corresponding parsing expression represents the "code" comprising the function.

A parse expression is a test or a sequence of tests that recognizes a sequence of input characters. A test returns or evaluates to a boolean status of success (true) or failure (false). The parsing expression rules are non-commutative boolean algebraic expressions where tests are ordered. The test:

a b c

is an implied non-commutative AND sequence (a AND b AND c), where a, b and c are tests applied in the order written.

The parsing equation languages and parsing expression grammars have in common being top down analytical parsing languages. They are written in the form of non-commutative boolean algebraic formulas. They take the first choice alternative matched, not the longest. They have look ahead and backtracking, though the backtracking is different. CWIC and SLIC use backup and long-failure backtracking technology. A backtrack alternative creates a call stack backtrack stack frame before starting its left alternative. A backtracking long failure occurs on a successive failure in a sequence of parse expressions once a successful literal or token match has occurred. The long failure occurs regardless of any backtrack stack frame having been set by a backtrack alternative. A long fail will be caught by the system and reported as an internal compiler error if no backtrack alternatives have been programmed to catch it. Some parser language constructs create backtrack alternatives that simply return failure after backtracking.

There are equivalent parsing equation features to the stated parsing features of PEGs:

Given any existing parsing expressions e, e1, and e2, a new parsing expression can be constructed using the following operators:

  1. Sequence: e1 e2
  2. Ordered choice: e1 / e2
  3. Zero-or-more: e*
  4. One-or-more: e+
  5. Optional: e?
  6. And-predicate: &e
  7. Not-predicate: !e

All the above PEG features have equivalent forms in CWIC:

  1. Sequence: e1 e2
  2. Ordered choice non-backtracking: e1 / e2
  3. Zero-or-more: $e
  4. One-or-more: e $e
  5. Optional: (e \ .EMPTY)
  6. And-predicate (look ahead) ?e
  7. Not-predicate(negative look ahead): -e
  8. Any character not matching: ~ch
  9. grouping ( ... )
  10. Zero-or-more termination: $(e1 .BREAK / e2)
  11. Forced failure .FAIL
  12. Ordered choice backtracking: e1 \ e2
  13. Optional (long failure on partial match) (e / .EMPTY)
  14. Recognize any character .ANY
  15. make list grouping +[ e $(',' e) ]+
  16. push node: :node
  17. form tree: !number
  18. call functions in other languages: g(args)
  • long failure may occur regardless of the alternative operator used.

CWIC and SLIC


My passion comes from writing SLIC (System of Languages for Implementing Compilers), which was based on CWIC (Compiler for Writing and Implementing Compilers). CWIC, developed at Systems Development Corporation, was a very advanced compiler writing system using a Schorre based metasyntax parser programming language and a LISP 2 based code generating language.

SLIC was written while I was at Cerritos College. Compiler technology was not taught there at the time. I taught SLIC to a few students and they were writing simple interpreters within a few hours, and compilers within a week. SLIC produced code that could be debugged with a symbolic debugger. Syntax equations could have trace points set that displayed the parsed string and the tree produced. Breakpoints and trace points set using the SLIC debugger helped student understanding of the parsing process. SLIC was implemented on TOPS-10, so interactive debugging was the norm.

These old systems used the term syntax language. I am choosing to use parser programming language or just parser language. The parser language is used to program top down recursive descent parsers. They have programming features that get around the drawbacks of simplistic recursive descent parsers. They use long-fail backtracking and backup technology.

CWIC and SLIC are advanced syntax driven metacompilers whose parser programming, ".SYNTAX", language is designed to simplify the following aspects of compiler construction:

  1. Parsing and lexical analysis.
  2. Detection and pictorial reporting of errors. (Could easily be changed to use stderr.)
  3. The recovery from errors to continue compilation. (Error recovery is programmed in the parser language.)
  4. Hashing and construction of dictionary or symbol table entries.
  5. The construction of an intermediate Abstract Parse Tree.

The code generation, or sequencing, languages are powerful list-processing programming languages having features designed to simplify:

  1. Recognizing and optimization of special case code sequences.
  2. Global optimization over the directed graph of a program.
  3. Evaluation of constant expressions at compile time.
  4. Elimination of common sub expressions.
  5. Mixed mode arithmetic.
  6. Detection of duplicate and undefined labels.
  7. The generation of data and instruction code sequences.
  8. Features for writing and reading tree and dictionary structures facilitating the creation of multi-pass compilers
  9. Stacked dictionaries that facilitate compilation of nested block structure languages.
  10. Attaching attributes to symbols, facilitating the association of symbolic or numeric data with symbols in the dictionary.
  11. Interrogation and manipulation of symbol attributes.


In CWIC and SLIC a tree is implemented internally as a list structure. Tree roots and branches are lists whose first element is a node name. A tree can then be represented in three textual formats:

Tree Diagram Form:                          ADD             
         NODE                              /   \
        /    \                            a     MPY
    left      right                            /   \
    branch    branch                          c     d

List Display Form:
   [NODE,left branch,right branch]      [ADD,a,[MPY,c,d]]

Tree Display Form:
   NODE[left branch,right branch]       ADD[a,MPY[c,d]]

These are stack oriented languages having program accessible parse and node stacks. The parse stack is where the parse tree is constructed. The node stack provides temporary storage of parse tree nodes used in constructing the parse tree.

A tree of any number of branches can be created. A list can be constructed using the list building operators +[ ]+. All parse stack entries created between the +[ and ]+ make-list operators are popped and placed in a list that is pushed on the parse stack. If, in using stack operators, the stack is popped below the opening +[ make-list operator's initial stack base, an internal compiler error is reported and the compile aborted.
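To make the stack mechanics concrete, the following Python sketch is my own modeling (not CWIC code) of the parse and node stacks and the :node, !n and +[ ]+ operations, building the [ADD,a,[MPY,c,d]] list shown above.

# Sketch (my modeling, not CWIC's implementation) of the parse-stack and
# node-stack tree building operators. Trees are lists whose first element
# is the node name.
parse_stack = []          # where the parse tree is built
node_stack = []           # temporary storage for node names

def push_node(name):      # the :name operator
    node_stack.append(name)

def make_tree(n):         # the !n operator: top node plus n popped parse entries
    branches = parse_stack[-n:]
    del parse_stack[-n:]
    parse_stack.append([node_stack.pop()] + branches)

def make_list(mark):      # the +[ ... ]+ operators; mark is the stack depth at +[
    items = parse_stack[mark:]
    del parse_stack[mark:]
    parse_stack.append(items)

# Parsing  a + c * d  would conceptually perform:
parse_stack.append('a')   # token a recognized
push_node('ADD')          # :ADD
parse_stack.append('c')   # token c
push_node('MPY')          # :MPY
parse_stack.append('d')   # token d
make_tree(2)              # !2 -> [MPY, c, d]
make_tree(2)              # !2 -> [ADD, a, [MPY, c, d]]
print(parse_stack[-1])    # ['ADD', 'a', ['MPY', 'c', 'd']]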

The following heavily commented example parser equations express a subset of the SLIC languages, written in the SLIC parser language:

parse = id // on id's success a symbol object is created. The next
           // symbol determines the test function type:
   ((  "==":BKTRK p_xp // Backtrack syntax parsing equation.
    |  '=' :RULE  p_xp // Syntax parsing equation.
    |  '..':TOKEN t_xp // Token object making equation.
    |  ':' :CLASS c_XP // Character class equation.
    ) !2    // The :node, id and ?_xp are combined to form a tree:
            // type    :  ==, =, .., :
            //  == BKTRK[id,p_xp] different trees are created
            //  =  RULE[id,p_xp]  depending on the equation type.
            //  .. TOKEN[id,t_xp] All starting with id. Factoring 
            //  :  CLASS[id,c_XP] eliminated backtracking id.
      syngen(*1)       //  call syngen with *1 = top parse stack entry. 

/* (Note. The parse equation continues after this comment.)
On recognizing an id there are 4 alternative equation types determined by the symbol following the id. A node is created for each equation type and the right side constituent parser is called: p_xp for syntax equations, t_xp for token/lexing equations and c_xp for character class definitions. On success of parsing the right constituent expression a tree is created by the !2 tree builder operator, combining the top node stack entry, created and pushed earlier on recognizing the equation type, with the top 2 parse stack entries popped into the list. The list
  [node,id,?_xp]
is then pushed on the parse stack and passed to the syngen generator function by using *1 (the number 1 being the relative parse stack entry from the top). The passed parse stack entries are removed from the parse stack.

A generator is a named group of unparse => action pairs. Expression trees passed to a generator are tested and decomposed into local variables by unparse rules. Unparse rules unparse tree or list structures, creating arguments out of tree branches and assigning them to local variables of the action code. An unparse rule (ADD[x,y]) recognizes an ADD tree, assigning its left branch to x and its right branch to y.
*/     // The parse equation continues testing for a generator function.
       // id having been matched and no equation type recognized a test
       // for a generator unparse action list follows.

    | +[ $( '(' +[+[(unparse $(',' unparse)/.EMPTY)]+ ')' '=>'
               action]+ )  // [[unparse*], action]*
      ]+ :GENR!2                 //GENR[id, [[unparse*], action]*]
          seqgen(*1)
   );

Parser equation and code generator functions start with an id (a unique identifier token symbol). Following the id is a symbol identifying the equation type. A backtrack equation, limiting backtracking to the equation, is signified by ==. An equation that a backtrack failure can bypass uses a single =. When a backtrack failure occurs it returns to the most recent backtrack alternative or backtrack equation. Lexical token making equations are identified by a .. symbol. Character class definitions use a : symbol. Grouping allows the parsing of an id and then determination of the equation type. A code generator function, also starting with an id, names a list of unparse action rules.

Alternatives and Backtracking


In CWIC and SLIC there are two alternative types that allow backtracking control flow to be programmed. Backtrack == equations and backtrack \ alternatives create a backtrack stack frame on the call stack. The == acts as if it had a backtrack alternative of .FAIL. A backtrack syntax equation always returns to its caller. A backtrack from a normal = syntax equation will return failure to whatever previous rule has a programmed backtrack alternative. An initial backtrack stack frame is created before the top driving parsing equation function is called. If a backtrack failure occurs and no backtrack alternative is in effect, the initial backtrack stack frame is returned to; that is a compiler failure and termination results. As parsing proceeds the parse state is kept in the top backtrack frame. Backtrack frames are back linked. On success of a backtrack alternative the top backtrack frame's parse state is transferred to the previous backtrack stack frame and the frame is released. On a backtrack failure the backtrack parse state is reset, created objects are released, and failure is returned to the setting backtrack alternative. On starting the last backtrack right alternative the backtrack stack frame is released. The dictionary is kept in the backtrack frames. Parse and node stack states are saved in backtrack frames. Created objects are tracked in the top backtrack frame. A backtrack stack frame pointer points to the top backtrack frame. On a backtrack or long failure the call stack pointer is reset using the backtrack stack frame pointer, the parse state is reset, and a return to the alternative set by the backtrack alternative is made. The example below shows the code generated for a series of backtrack alternatives.

           X = A B \ C D \ E F;

X:   push  (%g1)      // push alternative success
     call  _Backtrk   // set backtrack
     jne   %g2        // long failure test.
     call  A          // call A rule
     jne   %g2        // might return failure or long fail
     call  B          // call B rule
     ret              // ret success | failure
                      // to _backtrk
// second alternative is also a backtrack alternative.
%g2: call  _Bktkalt   // reset to saved backtrack state
     jne   %g3        // backtrack fail test
     call  C          // call rule C
     jne   %g3        // failure test
     call  D          // success call D
     ret              // return status to _Backtrack
                   // On success backtrack cleared and transfer to %g1
%g3: call  _ClrBktk   // Non-backtracking alternative
     call  E          // Call E rule
     jne   %g1        // on failure goto #g1
     call  F          // call rule F
     je    %g1
     jmp   _fail
%g1: ret              // return status to caller.

The code may seem strange. The backtrack support routines manipulate the call stack and use the stack pointer in determining what to do. Backtracking support code is not structured program code, but on the other hand it simplifies code generation.

Defining the p_xp parser function below involves restricting how non-backtracking and backtracking alternatives can be used. A sequence of alternatives may contain only one type of alternative; backtracking and non-backtracking alternatives may not be mixed, i.e.

a|b|c           is valid.
a\b\c           is valid.
a|(b\c)         is valid
(a|b)\c         is valid.
a|b\c           is not valid
a\b|c           is not valid

Grouped, (a|b) and (a\b) each constitute a single term in an alternative sequence.


In defining the p_xp parser, the above alternative combinations are enforced by the parsing rules. Once an alternative operator is used, any additional alternatives in the sequence must be of the same type.

p_xp = s_xp
        ( '|' a_xp:ALT!2
        | '\' b_xp:BKT!2
        | .EMPTY
        );

s_xp = '(' p_xp ')'
     | +[opr $opr]+;

a_xp = s_xp
        ( '|' a_xp:ALT!2
        | .EMPTY
        );

b_xp = s_xp
        ( '\' b_xp:BKT!2
        | .EMPTY
        );

opr = test | action;

test = string
     | '+' string:keep!1
     | ('-':not | '?':TRY)
          ( string 
          | id 
          | '(' p_xp ')'
           )!1
     | genr;

action = '!' int:TREE!1
       | ':' id:node!1
       | ',' string:insert!1;

genr = id 
       ( {?def:(*1):=gen}
         +[ '(' (sgarg 
             $(',' sgarg))|.EMPTY
             ')'
         ]+ :GCALL!2
       | .EMPTY);

The braced {?def:(*1):=gen} expression is an embedded generator action. It tests or sets the def attribute to gen. It returns success if the def attribute is gen or undefined, and if undefined sets it to gen. The embedded gen action test is a SLIC extension; CWIC would have called a generator function to do the attribute test and set. The changes I made to the CWIC parser language implementation would be considered syntax sugar.
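As I read the {?def:(*1):=gen} action, its semantics are a test-and-set on a symbol attribute. The Python sketch below is my own restatement; the helper name test_and_set and the dictionary representation are assumptions, not SLIC code.

# Sketch (my reading of the semantics, not SLIC source) of the embedded
# test-and-set action {?def:(*1):=gen} on a symbol's attribute.
def test_and_set(symbol, attr, value):
    # Succeed if the attribute is undefined or already equal to value;
    # define it when undefined.
    current = symbol.get(attr)
    if current is None:
        symbol[attr] = value
        return True
    return current == value

sym = {'name': 'expr_gen'}               # a dictionary entry with attribute slots
print(test_and_set(sym, 'def', 'gen'))   # True: undefined, now set to 'gen'
print(test_and_set(sym, 'def', 'gen'))   # True: already 'gen'
print(test_and_set(sym, 'def', 'rule'))  # False: conflicts with the existing attribute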

The grammar specification, parser programming language developed by Schorre resembles phrase structured analytic grammar rules that most closely align with the functional programming paradigm. There are no formal arguments to a parser function. All take their input source from an input character stream and return a success or failure status. On success the recognized input is advanced over and transformed into an equivalent semantic parse tree output to the parse stack. On failure the input is not advanced. Positive and negative forms of look ahead in CWIC and SLIC allow parsing ahead and reporting the status after resetting back to the initial state. The parser and code generator languages are functional stack-oriented programming languages.


The language specification starts with the top driving "program" parser function. The complete language specification is programmed using parser functions that define language constructs. Since a parser function is a boolean function expressed as a logical expression, we have the basis for the functional programming paradigm claim.


A parser equation specifies some language construct or token of the object language. As a function, each character from the input stream is analyzed, testing its compliance with the rule. If unsuccessful the rule returns failure. If successful it will have advanced the input stream over the characters in compliance.

expr = term $(('+'|'-') term);

The expr rule first calls term. If term returns failure, expr returns failure. If term is successful, the expr rule then specifies that zero or more terms preceded by a + or - may follow. The zero or more operator $ is a programmed loop. The loop successfully terminates when it fails to match the first, or start, element of the loop expression. In this case the loop expression is:

$(('+'|'-') term)

The first specified match in the loop expression is the grouped ('+'|'-'). A plus or minus character must be matched for the loop to proceed. If a plus or minus is in the input stream with only white space preceding it, the test will succeed and term will be called. When a plus or minus is not found the loop terminates successfully. Having recognized a plus or minus, a term must follow. Failing in a sequence after a match has been successful results in a backtrack failure.
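The Python sketch below is my own illustration (not generated SLIC code) of how the $ loop in expr = term $(('+'|'-') term); behaves: the loop ends successfully when its first test fails, and a failure after that first test is a long failure.

# Sketch (illustrative only) of the $ loop in:  expr = term $(('+'|'-') term);
class LongFail(Exception):
    pass

def lit(inp, s):                     # literal test with backup
    if inp['text'].startswith(s, inp['pos']):
        inp['pos'] += len(s)
        return True
    return False

def term(inp):                       # stand-in term: a single letter or digit
    if inp['pos'] < len(inp['text']) and inp['text'][inp['pos']].isalnum():
        inp['pos'] += 1
        return True
    return False

def expr(inp):
    if not term(inp):
        return False
    while True:
        if not (lit(inp, '+') or lit(inp, '-')):
            return True              # first test of the loop body failed: loop ends, success
        if not term(inp):
            raise LongFail()         # past the first test: a failure is a long failure

inp = {'text': 'a+b-c', 'pos': 0}
print(expr(inp), inp['pos'])         # True 5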

Character string constants are delimited by " marks. In SLIC single-character string constants may also be used, delimited by ' marks, and may themselves contain a ' character. Generally single ' character strings are used in character class declarations where single characters are required. They may be used anywhere a single character string is needed.

program


This example is similar to one in the ACM CWIC paper.

.EXTERNAL .GENERATOR DECL;
.EXTERNAL .GENERATOR C0MPILE;
.EXTERNAL .GENERATOR MAKENUM;
.EXTERNAL .GENERATOR ERRORX;

PR0GRAM = $((ID ( ":=":STORE EXP             // statement.
                 | "=":EQU NUM) ';'!2 C0MPILE(*1)// declaration.
            |".END" .BREAK ) // on matching .END break out of loop
           \ ERRORX('???') $(-";" .ANY) ";" );

EXP = TERM $(('+':ADD|'-':SUB) TERM !2);
TERM = FACTOR $(('*':MPY|'/':DIV) FACTOR !2);
/*
  The two syntax equations above create left handed binary trees. That
  is operations of equal hierarchy are expressed in tree form resulting
  in left to right evaluation. a - b + c is treated as: (a - b) + c
  Whereas a right handed tree is treated as: a - (b + c)
*/

FACTOR = ID | '(' EXP ')' | NUMBER;

// Character class definitions:

LET: 'A'| 'B'| 'C'| 'D'| 'E'| 'F'| 'G'| 'H'| 'I'| 'J'| 'K'| 'L' | 'M'|
     'N'| 'O'| 'P'| 'Q'| 'R'| 'S'| 'T'| 'U'| 'V'| 'W'| 'X'| 'Y' | 'Z';
DGT: '0'|'1'|'2'|'3'|'4'|'5'|'6'| '7'|'8'|'9';
SYMCHR: LET | DGT | '_';

// Token recognizing - object making equations.

ID .. LET $SYMCHR; // Symbol object catalogs an identifier.
NUM .. DGT $DGT MAKEINT(); // Numeric object created


COMPILE(EQU[X, Y])=>DEF:(X) := Y;
       (STORE[X, Y])=>{
    DEF:(X) := EVAL(Y);
    PRINT(DEF:(X));}
EVAL(ID(X)) => DEF:(X);
    (NUMBER(X)) => X;
    (#V1[EVAL(X), EVAL(Y)]) => #U1;
        #V = ADD, SUB, MPY, DIV;
        #U = X+Y, X-Y, X*Y, X/Y;
.STOP SETUP PROGRAM

The early Schorre compiler compilers did not have backtracking. In these languages most ambiguous language constructs can be factored out. The classic example:

A = a b | a c;

is simply coded:

A = a ( b | c);

expr = term $(('+':ADD|'-':SUB) term!2);

In CWIC and SLIC backtracking is different from PEGs. CWIC and SLIC have a LISP 2 based code generation language. With LISP dynamic memory object support in the parsing language, specialized token equations create token objects. Like PEGs there is the concept of white space: when recognizing lexemes, surrounding white space is skipped over. Character class rules, or defines, are similar to syntax and token rules, only strictly limited to defining a list of characters. Using CWIC parsing equations the character class rule syntax is defined:

char_class = id ':' chr $('|' chr) ';';

chr .. ''' .ANY ''' MAKESTR()|
      ("0B"|"0b") bin $bin MAKEBIN()|
      ("0O"|"0o") oct $oct MAKEOCT()|
      ("0H"|"0h") hex $hex MAKEHEX()|
       dec $dec MAKENUM();

bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dec: oct|'8'|'9';
hex: dec|'a'|'b'|'c'|'d'|'e'|'f'
        |'A'|'B'|'C'|'D'|'E'|'F';

id .. alpha $(alphanum|+'_');

Above, a char_class is defined as an id, a ':', and a chr followed by zero or more chr separated by the alternative operator |. The char_class syntax parsing equation defines the parse of a char_class (character class defining) equation using chr (a token) and the literal string '|'. The chr token equation defines a single character as a single character quoted string or a numeric ordinal, using the char_class equations bin, oct, dec, and hex. The bin class could also be defined using numeric ordinal ASCII codes:

bin: 0H30|0H31;

MAKEBIN, MAKEOCT, MAKEHEX, and MAKENUM are supplied functions that create numeric objects from recognized numeric token strings. Token equations copy characters recognized by class tests to a token buffer. Matched literal strings are not copied. In the id rule:

id .. alpha $(alphanum|'_');

the '_' string test, when matched, is not copied into the token. To keep the '_' as part of the id a +'_' is used:

id .. alpha $(alphanum|+'_');

The id rule above is used in CWIC and SLIC. The + operator applies to syntax and token equations. A +<string> is a string test that, in a syntax equation, pushes a string object on the parse stack if successful. In the id equation it would be more efficient to define a symchr class:

symchr: alphanum | '_';

id .. alpha $symchr;

symchr is a single class test whereas (alphanum | '_') is two tests.

Backtracking and backup


In the CWIC ACM paper | is apparently the backtracking alternative operator and / the non-backtracking one. In SLIC I used | for non-backtracking and \ for backtracking, and added the %e zero-or-more backtracking iteration operator, which backtracks the last failing iteration. It is syntax sugar equivalent to:

$(e \ .BREAK)

A == syntax rule:

e == e1 e2;

is equivalent to:

e = (e1 e2 \ .FAIL);

The CWIC manuals did not fully describe the implementation of CWIC. I do not know what terminology they used or the details of how backtracking was implemented, but it was well defined. The only uncertainty is the last, rightmost backtrack alternative long failing. In a sequence of backtrack alternatives:

 (a \ b \ c)

A backtrack frame is created before attempting the first alternative a. On a failing, the parse backtracks and attempts b. On b failing, it backtracks and attempts c. The backtrack frame is removed before attempting c. In general the last alternative is not ambiguous, and one could add a failing last alternative:

 (a \ b \ c \ .FAIL)

So if c fails (long fail or not) the backtrack .FAIL alternative catches it and simply fails the parenthetical expression or group. My assumption is that a long fail within the parenthesized group applied to the left alternatives, the backtrack stack frame being removed before the last right alternative. CWIC only talked about backtracking as being a long failure. Resetting the input state was part of the token rule and literal string recognition algorithms; it was not described as backtracking in CWIC.

In implementing SLIC I had backup and backtracking. CWIC had both; I just do not recall it having a term for backup. Backup is used in token recognition and literal string compares. A _cmp_str function, for example, is called with a pointer to a string to match against the input character stream. On entry the input stream state is saved. If the _cmp_str test fails the starting saved input state is restored. Backup applies to saving and restoring the input state. It occurs only in character sequence matching failures of token equations and literal strings. A backtrack occurs once a test is successful and a subsequent test fails. For example:

st = id ":=" expr ';':ASIGN!2;

When id is successful and the string test ":=" fails, a long fail occurs. However we might write:

st = id (":=" expr :ASN
        |"=" number:DEF)!2 ";";

If id is successful, ":=" is attempted. If ":=" fails, "=" is attempted. The parentheses bound a sequence. If expr, number, or ';' fail, a long fail results. However, if both ":=" and "=" fail, or the final ";" fails, a long fail from st results. In the Schorre META II documents the grouping to avoid backtracking is called factoring. In the early metacompilers you had to use factoring to get them to work.

Backtracking saves and restores the entire parse state. Recognizing tokens and literal strings does not directly invoke a backtrack. Backtracking is implemented using backtrack stack frames that are back linked from the current to the previous, back to an initial backtrack stack frame created and initialized on startup before the first parser function is called. The initial backtrack frame contains the system base and the dictionary head. Basically all dynamic object tracking is in a backtrack frame. All object changes are done relative to the top backtrack frame. On a long failure the backtrack data is reset, releasing created objects and shadow objects (pre-existing objects whose attributes have changed), restoring the saved input stream state, and so on. On success the top backtrack frame is used to update the previous backtrack frame and is removed. A long fail simply loads the call stack pointer with the backtrack stack pointer and returns; the backtrack routine takes it from there. A backtrack is kind of like a longjmp in C. Any number of parser equation returns may be skipped. A backtrack occurs whether or not a backtrack alternative has been used. Returning failure to the initial system backtrack frame is a compiler error. The system will report it as such and dump the source code around the input point. The parser state is not changed on returning to the system backtrack frame. The error is reported and the program exits with prejudice.


The PEG article's section on expressive power says PEG packrat parsers cannot recognize some unambiguous nondeterministic CFG rules, such as the following:

S ← 'x' S 'x' | 'x'

CWIC easily handles the problem:

S = 'x' $('x' 'x' \ .BREAK);

or SLIC's syntax-sugared version:

S = 'x' %('x' 'x')

Or if an even number should make the equation fail:

S = 'x' $('x' 'x');

The above equation, on its last iteration matching only the first 'x' and failing on the second, would long fail. That is, the sequence 'x' 'x' may fail on the first 'x', backing up to the start of the current iteration; but once the parse has moved past the first test of the sequence, a failure on the subsequent test causes a long failure. A backtrack can catch and limit the failure to the equation.

S = 'x' $('x' 'x') \ .FAIL;

or in SLIC, again the syntax-sugar equivalent using a == equation limits the long failure to S:

S == 'x' $('x' 'x');

The == equation is a backtrack equation. It sets a backtrack that returns success or failure accordingly.

A == equation creates a backtrack stack frame on entry. It always returns to its caller.
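The Python sketch below is my own illustration (not generated code) of S == 'x' $('x' 'x');, showing how the == form catches the long failure of the loop's last partial pair and turns it into an ordinary failure.

# Sketch (illustrative only) of:  S == 'x' $('x' 'x');
class LongFail(Exception):
    pass

def lit(inp, s):
    if inp['text'].startswith(s, inp['pos']):
        inp['pos'] += len(s)
        return True
    return False

def xx_pair(inp):
    # Loop body ('x' 'x'): plain failure if the first 'x' fails,
    # long failure if only one of the pair matches.
    if not lit(inp, 'x'):
        return False
    if not lit(inp, 'x'):
        raise LongFail()
    return True

def S(inp):
    save = dict(inp)                 # == equation: backtrack frame on entry
    try:
        if not lit(inp, 'x'):
            return False
        while xx_pair(inp):          # $ loop: stop when the pair's first 'x' fails
            pass
        return True
    except LongFail:
        inp.update(save)             # restore the parse state, return plain failure
        return False

print(S({'text': 'xxx', 'pos': 0}))   # True  (odd number of x's)
print(S({'text': 'xxxx', 'pos': 0}))  # False (last pair matched only one 'x')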


In summation it appears Ford has added nothing significant to what existed in 1969. The CWIC manuals I took home from the CWIC ACM SIGPLAN meeting were together about 3/4" thick, 8 1/2" x 11" paper documents. In the following, s represents a symbol string and n an integer. The CWIC parser model has a call stack, parse stack, and node stack. The language included tree and list transforming operators:


  1. :s Creates a node object of s and pushes it on the node stack.
  2. !n Creates a tree having n branches popped from the parse stack and the top node stack entry, pushing the tree on the parse stack.
  3. +[e]+ Parse stack entries created by e are popped, creating a list that is pushed on the parse stack.
  4. Token making .. equations parse symbols, strings, and numbers, creating a typed object that is pushed onto the parse stack.

The memory consumption disadvantages described for PEGs do not discuss the memory required to turn the analysis into parser output, obtaining the program structure that recursion handles naturally. The memory used in recursive descent holds the nested structure of the source language, which gives CWIC an advantage in constructing an abstract parse tree functionally expressing the source analyzed. An if statement:

if_st = "if" condition statement:IF !2 ("else" statement :ELSE!2|.EMPTY) ';'

The above parses an if_st: matching the "if" string, it looks for condition and statement. If successful it creates the tree:

           IF
          /  \
{condition}  {statement}

It then tests for the string "else". If the "else" test fails, the alternative .EMPTY is successful and if_st returns success. If "else" is successful, statement is called and, if successful, :ELSE !2 builds the ELSE tree:

              ELSE
             /    \
           IF     {statement}
          /  \
{condition}  {statement}

The ifgen code generator below uses unparse => action rules and is meant to be called with the IF[...] or ELSE[...] tree above:

ifgen(ELSE[c,es])=>{  ifgen(c,%T1);
                      codegen(es);
                   <%T1:>
                   };
     (IF[c,ts])=>  {  gof(c,%T1)
                      codegen(ts)
                   <%T1:>
                   };
    (IF[c,ts],w)=>{  gof(c,%T1)
                       codegen(ts)
                   <  goto W
                      %T1:>
                   }; 

A generator's unparse rules are in a way like overloaded functions. The generator matches its unparse rules against its arguments. Variables take on the values of whatever they are matched against. The pattern:

(IF[c,ts],w) =>

matches an ifgen call having two parameters. The first, a tree pattern IF[c,ts], matches a two-branch tree whose node is IF, the left branch being assigned to the local variable c and the right branch to ts. The action code calls a generator:

gof(c,%t1)

gof is a generator for generating code that will goto its second argument when the condition argument c evaluates to false. The sequence:

< ... >

bounds code production blocks. The example plants SLIC pseudo code instructions, which are coded in the pseudo language to produce specific target machine instructions. In creating SLIC I did this to separate target machine code generation and make generators more readable. One does not have to deal with machine specifics in producing sequential instruction code.
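A rough Python analogue of the unparse-rule dispatch (my sketch, not CWIC or SLIC code): the generator tests its argument tree against each pattern in order and binds the tree branches to the action's local variables.

# Rough analogue (my sketch) of unparse-rule matching in a generator.
def ifgen(tree):
    # (ELSE[c,es]) => ...   matches a two-branch ELSE tree
    if isinstance(tree, list) and len(tree) == 3 and tree[0] == 'ELSE':
        c, es = tree[1], tree[2]
        return f"else-form: cond={c}, else-part={es}"
    # (IF[c,ts]) => ...     matches a two-branch IF tree
    if isinstance(tree, list) and len(tree) == 3 and tree[0] == 'IF':
        c, ts = tree[1], tree[2]
        return f"if-form: cond={c}, then-part={ts}"
    return None                      # no unparse rule matched

print(ifgen(['IF', 'c1', 's1']))
print(ifgen(['ELSE', ['IF', 'c1', 's1'], 's2']))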

CWIC and SLIC


The term test as used here is like a data type. A parser equation is a test function. A literal string in a parse expression is a test. Test is the boolean result status, success or failure, of a test operation.

A parser is a test made up of tests, a set of functions written in the form of equations. They are each a test function defining a goal returning a boolean status of success or failure. The input to a test operation is the source code character stream. A test is any operation that matches the input character stream to a goal. Test is the boolean status result of a goal test.


An atomic test is an attempt to recognize a symbol: an identifier, number, or string object. In any sequence of tests, when a test other than the first atomic test fails, a backtrack failure occurs. A backtrack alternative creates a backtrack stack frame on the call stack. If a backtrack failure occurs the parser state is restored to the top backtrack stack frame's saved state. A backtrack occurs regardless of a backtrack alternative. If no backtrack stack frame has been created by a backtrack alternative, the backtrack failure is caught by a default system backtrack frame that simply reports it as an internal compiler error and exits. Backtracking is most commonly used for error detection and recovery.

In the parsing language there are 3 types of equations.


The parser programming languages of SLIC and CWIC are functionally identical. The | non-backtracking alternative and \ backtracking alternative are used in SLIC.

The parser language is a type of functional programming language in which each parse equation is compiled into a function returning a boolean status of success or failure. The input to a function is a character stream. Like most languages, streaming I/O is part of the language support libraries. The character stream input buffering in CWIC and SLIC provides backtracking support. A character class defines a set of characters. For example, the set of characters representing binary digits:

bin: '0'|'1';

Octal digits:

oct: bin|'2'|'3'|'4'|'5'|'6'|'7';

Decimal digits

dec: oct|'8'|'9';

and Hexadecimal digits:

hex: dec|'a'|'b'|'c'|'d'|'e'|'f'|'A'|'B'|'C'|'D'|'E'|'F';

Character classes do not compile to an executable function. The character class name, when used in a token rule, matches any character in the class. They are simple, consisting of character literals, ordinals, or previously defined classes separated by the alternative | operator.

edit point


These are scannerless parsers. Token equations recognize and create token objects. The default is a cataloged object. Supplied interceding functions MAKESTR, MAKENUM, MAKEDEC, and MAKTFLOT prevent the cataloging, instead converting the matched token string to other typed objects. Token equations recognize character sequences. The skip_class character class is generally ignored between tokens; it is a predefined character class. A token equation ignores skip_class characters until it recognizes its goal. A character not in the skip_class and not matching its goal will fail the equation. The SLIC string rule:

string .. (''' .any ''' | '"' $(-'"' .any | """""" ,'"') '"') MAKESTR(); 

The token string equation exemplifies several features of the syntax language. A token equation ignores leading skip_class characters until a match is made. On matching a character, skipping of skip_class characters stops. A string used in a rule is not normally kept as part of the match. A string following the , insert operator is not matched against the input characters; it is an argument of the insert operator. The - negative look ahead operator is used above in finding the string terminating " character. .any matches any character. The sequence

''' .any '''

recognizes a single character string bound by ' characters. The sequence

'''

recognizes a single ' character in the input character stream. .any is always successful except at the end of input, when there is no character to match. The string rule recognizes two forms of strings: a single character string bound by ' marks, or a multi-character string bounded by " characters, allowing "" within the " bounded string to represent a single ". That is, the """""" is actually a string containing two " characters. The ,'"' inserts a single " character into the token. MAKESTR intercedes the default cataloging operation and creates a string object. CWIC did not have a single character form. I added a char token rule used in defining character classes. It helped in detecting errors.

char .. ''' .any ''' MAKESTR();

I also added it to the string token rule for consistency. You could say the special single character string is syntax sugar, but on the other hand it adds clarity that only single character strings may be used in character class definitions.
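A small Python sketch of my own (not SLIC's scanner) of the two string forms the string token equation recognizes: 'c' single-character strings, and "..." strings in which "" stands for a single " character.

# Sketch (illustrative only) of scanning the two string forms.
def scan_string(text, pos):
    # Return (string_value, new_pos) on success, or None on failure.
    if pos + 2 < len(text) and text[pos] == "'" and text[pos + 2] == "'":
        return text[pos + 1], pos + 3            # ''' .any '''  single-character form
    if pos < len(text) and text[pos] == '"':
        out, i = [], pos + 1
        while i < len(text):
            if text[i] == '"':
                if text.startswith('""', i):     # doubled quote inside the string
                    out.append('"')              # the ,'"' insert keeps one "
                    i += 2
                    continue
                return ''.join(out), i + 1       # closing quote
            out.append(text[i])                  # -'"' .any keeps the character
            i += 1
    return None                                  # not a string

print(scan_string('"say ""hi"" now"', 0))        # ('say "hi" now', 16)
print(scan_string("'x'", 0))                     # ('x', 3)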

The integer rule exemplifies the use of syntax sweetness.

int .. ("0H"|"0h") hex $hex MAKEHEX()
       |("0O"|"0o") oct $oct MAKEOCT()
       |("0B"|"0b") bin $bin MAKEBIN() 
       |('+'|+'-'|.EMPTY) dec $dec MAKENUM();

The above int integer parser recognizes an integer in several bases. Strings not normally kept, when preceded by a +, become part of the token string. MAKENUM recognizes signed numeric strings.

On the fly, syntax directed lexing has advantages: only the allowable token types need be recognized, and recognized tokens are converted to objects that can be manipulated. The complete language is specified in a common metasyntax language. Character classes and tokens use subsets of the general syntax rules.

This simple example interpreter, based on the 1970 ACM digital library paper on CWIC,[1] illustrates the parser programming language's backtracking and look ahead. Also of note is its use of syntax sweetness (as opposed to syntax sugar) to avoid backtracking. It has been updated a bit to include multiplication, division, and error detection, reporting and recovery.

edit point


The parser language is a set of functions written in the form of equations. They are each a test function defining a goal and returning a boolean status of success or failure. The input to a test operation is the source code character stream. A test is any operation that matches the input character stream to a goal: a quoted string or character test, or a syntax or token equation test.

A top-down method of syntax analysis is used. Starting with the top program defining syntax equation, the main goal is to define and parse a complete program. The right-hand side of an equation indicates which entities must be recognised, from left to right, as individual subgoals to achieve the parsing equation goal. This top-down method divides the parsing task into smaller and smaller goals. These goals are themselves defined in terms of tests, and so the process of recognising a complete program becomes a matter of tests for progressively smaller entities, right down to the level at which basic symbols such as characters and numbers are recognised. The parser language can be said to be goal oriented.

In a syntax equation the logical AND operation is implied by a sequence, i.e.

A B C

The A, B, C sequence must each evaluate to success in the order written. On any failure the sequence terminates in failure.

There are two types of failure. A long failure is the most important to understand, as an unhandled long failure results in compiler termination. One must use backtracking alternatives to handle long failures. There is a backtrack set before the top syntax equation is called. It does not actually recover; it reports an unhandled long failure as a compiler error.

There are two or (alternative) operators. The | (bar) or (alternative) operator is non-backtracking:
'if' xp 'then' | alternative

If the string 'if' test result is failure the alternative is attempted. If the string 'if' test is successful xp is attempted. A failure from xp would then result in a backtrack failure. In a sequence of tests a backtrack occurs once a test is successful and any successive test fails. If there is no backtrack alternative in the call stack a compiler error occurs. A backtrack \ alternative sequence creates a backtrack stack frame on the call stack. Backtracking and non-backtracking alternatives may not be mixed within a sequence:

'a' (B \ C \ D) | Y
'a' (B | C | D) \ Y

are valid. The sequence:

'a' (B \ C | D) | Y

is not. In the first case, if 'a' fails, Y is attempted. If 'a' is successful and B, C and D fail, a long fail occurs. A long failure occurs on any failure of a successive test in a sequence once a first test is successful. A grouped test (B \ C \ D) constitutes a single test; (B | C | D) likewise constitutes a single test. B, C, or D may each be a sequence of tests, and the backtrack failure rule applies individually to each. On the first test in B failing, C would be attempted. On a successive test in B failing, a long failure occurs. On a long failure the call stack is reset to the backtrack stack frame, backtrack restores the parser state to the backtrack frame's saved state, and the next backtrack alternative is attempted. All backtracking alternatives failing results in a normal failure. In the sequence X = A ((B \ C) | D); what happens if C long fails? Is D attempted? In SLIC I chose to remove the backtrack stack frame on beginning the last alternative. One could write:

X = A ((B \ C \ .FAIL) | D);

explicitly keeping a backtrack in effect while attempting C. The .FAIL last alternative does not itself long fail, and D would then be attempted. I do not know how CWIC worked in such a case. A rule of the form:

X = A \ B \ C;

presents the same quandary.

X = A \ B \ C \ .FAIL;

makes it explicit that X never long fails, returning failure or success to its caller. I added the == rule that never long fails.

X == A \ B \ C;

It does a bit more, catching any long failure and returning failure to its caller.

There are three types of tests. A literal test is a quoted string or character. A token test is defined by a lexical equation that on success creates a token object. A syntax equation is also a test. A test is simply any operation that tests a given character sequence or pattern against the input character stream. A string test is normally not kept. A + preceding a string test creates a string object, or in a token equation the string is kept as part of the token object. Token equations by default create symbol objects that are cataloged in the dictionary.

CWIC program


First, external linkages are declared:

.EXTERNAL .GENERATOR DECL;
.EXTERNAL .GENERATOR C0MPILE;
.EXTERNAL .GENERATOR MAKENUM;
.EXTERNAL .GENERATOR ERRORX;

.SYNTAX // (equations):

The PR0GRAM syntax equation is a loop that recognizes statements beginning with an ID until the string '.END' is matched and .BREAK terminates the loop. Errors are reported by ERRORX(<message string>) and recovery is programmed to scan for a ';'.

PR0GRAM = $((ID ( ":=":STORE EXP             // statement.
                 | "=":EQU NUM)!2 C0MPILE(*1)// declaration.
            |".END" .BREAK ) // on matching .END break out of loop
           \ ERRORX('???') $(-";" .ANY) ";" );

There are two outer alternatives separated by the backtracking \ alternative operator. The first alternative starts with an ID followed by the grouped alternatives (":=":STORE EXP | "=":EQU NUM) that recognize ":=" creating a STORE node, or "=" creating an EQU node. Nodes are pushed on the node stack. An EXP is expected to follow the ":=", or a NUM the "=". Following the grouped alternatives, !2 creates a tree from the ID and EXP with the STORE node, or from the ID and NUM with the EQU node, the tree replacing the top two parse stack entries. C0MPILE(*1) passes the popped tree to the C0MPILE generator function, removing it from the parse stack. *1 represents the top parse stack entry, a carryover from META II.

   STORE          EQU
   /   \         /   \
 ID    (EXP)   ID     NUM


The ID may alternatively be followed by ':=' EXP or '=' NUM. The alternative to ID is '.END', which when matched breaks out of the loop with .BREAK, ending the program. The backtracking alternative reports an error and scans for a ';' character. The two alternatives

'.END' .BREAK

and

ID (ST | DECLARATI0N)

are separated by a | non-backtracking alternative. The third alternative:

ERRORX('???') (-';' .ANY)* ';'

reports the error, and the inner error recovery loop, (-';' .ANY)*, skips to a ';', hopefully past the error. Both ST and DECLARATI0N are preceded by an ID, a token rule that if successful pushes a symbol object onto the parse stack. Of note is factoring ID off those rules, eliminating the need to backtrack over it.

DECLARATI0N = '=' NUM ';' :EQU!2 DECL(*1); /* DECLARATI0N:

  Calls cmpstr('=') match test.
    If cmpstr returns failure, failure is returned.
  Calls NUM a token rule that parses an integer number.
    If NUM returns failure a long fail results
  Calls cmpstr(';')
    If cmpstr returns failure a long fail results
  :EQU  creates an EQU node pushing it on node stack.
  !2  creates a two branch tree pushing it on star stack.
    The left branch being an ID matched in the PR0GRAM rule.
  DECL(*1)  Calls the DECL generator passing the popped top parse
     stack entity.
  */

ST = ':=' EXP ';' :STORE!2 C0MPILE(*1); /* ST:

  Calls cmpstr(':=') string match test.
    If cmpstr returns failure ST returns failure.
  Calls EXP a syntax rule that parses an arithmetic expression.
     If EXP returns failure a long fail results
  Calls cmpstr(';')
    If cmpstr returns failure a long fail results
  :STORE  creates STORE node pushing it on star stack.
  !2  creates a two branch tree pushing it on star stack.
  C0MPILE(*1)  Calls the C0MPILE generator passing the popped
  top parse stack entity.
  */

EXP = TERM (('+':ADD|'-':SUB) TERM !2)*;
TERM = FACTOR (('*':MPY|'/':DIV) FACTOR !2)*; /*

 The two syntax equations above create left handed binary trees. That
 is operations of equal hierarchy are expressed in tree form resulting
 in left to right evaluation. a - b + c is treated as: (a - b) + c
 Whereas a right handed tree is treated as: a - (b + c)
  */

FACTOR = ID | '(' EXP ')' | NUMBER;

// Character class definitions:

LET: 'A'| 'B'| 'C'| 'D'| 'E'| 'F'| 'G'| 'H'| 'I'| 'J'| 'K'| 'L' | 'M'|
     'N'| 'O'| 'P'| 'Q'| 'R'| 'S'| 'T'| 'U'| 'V'| 'W'| 'X'| 'Y'| 'Z';
DGT: '0'|'1'|'2'|'3'|'4'|'5'|'6'| '7'|'8'|'9';
ALPHNUM: LET | DGT;

// Token recognizing - object making equations.

ID .. LET ALPHNUM*;            // Symbol object catalogs an identifier.
NUMBER .. DGT DGT* MAKEINT();  // Numeric object created.
.F1N15H
.STOP SETUP PROGRAM

edit point


Of note is the syntax expression:

('.END' .BREAK | ID (ST | DECLARATI0N))

from the PR0GRAM rule. Here the '.END' string test will fail on its first character, '.', if the input is an ID. If '.END' is successful, .BREAK breaks out of the loop and PR0GRAM succeeds, ending the program. Both a statement and a declaration may be preceded by an ID. An ID will have been pushed on the parse stack, so when the :EQU!2 in DECLARATI0N or :STORE!2 in ST executes, a tree will be built whose left branch is that ID. By factoring ID out of DECLARATI0N and ST we avoid backtracking. However, if we fail to successfully match either DECLARATI0N or ST, a long fail results. We have:

\ ERRORX('???') ...

a backtracking alternative (the \) that catches the backtrack failure. ERRORX('???') prints a ^ under the point in the line where the parser failed. The backtrack resets the parse state to the point of originally entering the backtracked expression. Of note here, Rp, is how what you call syntax sugar is used to avoid ambiguity and the need for backtracking. I call it syntax sweetness. These languages create symbol dictionary entries on the fly; using syntax sweetness, the undoing of that is avoided.

Creating an abstract parse tree is programmed using two operators and two stacks. A node is created and pushed on the node stack by the sequence :<node name>, i.e. :ADD creates an ADD node and pushes it on the node stack. !2 creates a 3 element list: the two top parse stack entries are popped into the last 2 list entries and the top node stack entry into the first, pushing the list on the parse stack. Similarly for other trees. A list whose first element is a node identifies the list as a tree.

The .. token making equations recognize tokens, creating cataloged dictionary objects or numeric objects by calling MAKEINT(). The dictionary is a stacked symbol table that accommodates symbol scope in nested block structure languages. Interceding object making functions like MAKENUM and MAKESTR prevent the default cataloging operation. The token equation name can be used to identify a token in an unparse rule.

.GENERATOR
DECL(EQU[X, Y]) => DEF:(X) := Y;
COMPILE(STORE[X, Y]) => {
    DEF:(X) := EVAL(Y);
    PRINT(DEF:(X));}
EVAL(ID(X)) => DEF:(X);
    (NUMBER(X)) => X;
    (#V1[EVAL(X), EVAL(Y)]) => #U1;
        #V = ADD, SUB, MPY, DIV;
        #U = X+Y, X-Y, X*Y, X/Y;
.F1N15H
.STOP SETUP PROGRAM

Another generator example producing IBM 360 instructions:

  expr_gen(#V1[expr_gen(x),expr_gen(y)]) => 
       {
         <#U1 + (x*16)+y;>
         releasereg(y);
         return x;
         #V1 = ADD, SUB;
         #U1 = AR,  SR;
       }
    (x)=>
       {
         r1 = getreg();
         load(r1, x);
         return r1;
       }

The above machine code example illustrates the problem I had with the readability of code generation. CWIC generated IBM 360 machine code into memory buffers. I had access to a DEC-System-10. Code generation was called planting. The following plant:

<AR + (x*16)+y;>

built an IBM 360 add register-to-register instruction and planted it into a named block of memory called a section. As you can see, code generation was very specific to the target machine, the IBM 360 being an 8 bit addressable machine and the DEC-System-10 a 36 bit word machine. Instructions in CWIC were built using arithmetic expressions, summing the various instruction fields. Instruction planting expressions were scattered throughout generator code. My goal was to make code generation more readable and easier to do for different computer instruction set architectures. The same instructions are commonly used in different contexts produced by different code generators. So my goal was to move target machine specifics out of the tree-crawling generator language.
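For concreteness, an RR format instruction packs an 8 bit opcode and two 4 bit register fields into a 16 bit halfword. The Python sketch below is my own restatement of the arithmetic behind a plant like <AR + (x*16)+y;>, under the assumption (mine) that the mnemonic stands for the opcode value already positioned in the high byte; the opcode value used is illustrative only.

# Sketch (my arithmetic, illustrative opcode value) of an RR instruction plant.
AR_OPCODE = 0x1A                    # illustrative 8-bit opcode value
AR = AR_OPCODE * 256                # opcode positioned in the high byte of the halfword

def plant_rr(op_hi, r1, r2):
    # Pack the opcode and two 4-bit register numbers into one 16-bit halfword.
    return op_hi + (r1 * 16) + r2   # same shape as the <AR + (x*16)+y;> expression

print(hex(plant_rr(AR, 3, 5)))      # 0x1a35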


I designed a machine code assembler defining language. An instruction, or family of like instructions, is defined by a MACHOP procedure. This sublanguage defines an instruction's mnemonic and its operand syntax: index, indirect, etc. A family of like instructions could be defined using the vector #<symbol> notation used in the generator language. The pseudo code language defines macro-like pseudo instructions. The pseudo language is used to define a machine independent set of operations. The SLIC generator language plants pseudo instructions. A planted pseudo instruction call is appended to a named code section list. That list is executed when the section is flushed. The pseudo list may optionally be processed by in-sequence optimization transform rules before execution. On execution a pseudo procedure calls MACHOPs to output binary code.

The overall concept is to divide compilation into separate phases, each phase having a specialized domain specific language. The pseudo operations can be defined and implemented in parallel with programming the parser and generators. Multiple sets of PSEUDOs and MACHOPs can target different machines.

edit point

.MACHOP #RR r1,r2
 #RR =  LR,  AR,  SR,  MR,  DR;
 #op = 0H18,0H19,0H1A,0H1B,0H1C;
 .MORG 8:H(16):$/8
 H(8):  op;
  (4):   r1;
  (4):   r2;

The above defines an IBM 360 register-to-register family of instructions: AR, add register to register, has the numeric hex opcode op=0H19; SR, subtract register to register, op=0H1A; etc. The opcode is an 8 bit field and the registers are 4 bit fields, making a 16 bit instruction that starts on an 8 bit boundary (.MORG 8). Planting into a bit addressable memory space allowed different computer system architectures having different addressable units to be coded. The DEC-10 and IBM 7090 series computers had 36 bit word addressable memory. Burroughs ALGOL machines were 48 bit word machines, with MCP writable flag bits. The MACHOP language defines a series of bit fields in a bit addressable memory, aligned to a modulo boundary by a modulo org.

 .MORG <modulo>:<print format>

.MORG sets the plant location to an evenly divisible boundary point. .MORG 8 aligns to an 8 bit boundary. $ represents the relative plant location.

.MORG 16:H(16):$/8:

sets $ to a location evenly divisible by 16 using truncating integer arithmetic:

$ = (($+15)/16) * 16;

If $ == 32 (on the boundary)

   $ = ((32+15)/16) * 16
     = (47/16) * 16
     = 2 * 16
     = 32 (it is unchanged)

If $ == 33, (between boundaries)

   $ = ((33+15)/16) * 16
     = (48/16) * 16
     = 3 * 16
     = 48 (set to next boundary)
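The alignment arithmetic above can be restated as a small Python helper (my restatement, not SLIC code):

# Sketch of the .MORG alignment arithmetic: round the bit location $ up to
# the next multiple of n using truncating integer division.
def morg(dollar, n):
    return ((dollar + n - 1) // n) * n

print(morg(32, 16))   # 32 - already on a 16-bit boundary
print(morg(33, 16))   # 48 - advanced to the next boundary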

You could have 8 bit byte addressable memory and alignment to a 16 bit word boundary. So on aligning to a 16 bit boundary (.MORG 16:H(16):$/8:) the address would be printed as ($/8), an 8 bit byte address. The SLIC system could print the generated assembly code with the compile listing. The MACHOP language specified the pseudo assembly listing code format. The sequence:

H(8):   op;
 (4):   r1;
 (4):   r2;

specifies 3 separate fields that are combined, printing as one 16 bit number in hex. Or:

H(8):   op;
H(4):   r1;
H(4):   r2;

would print the bit fields individually, separated by a space. Conditionals could select different instruction formats or field values; i.e. a single MACHOP could define a register-register instruction of 16 bits, a register-memory 32 bit instruction, or a memory-memory 48 bit instruction. The MACHOP language basically defined an assembly language. The SLIC linker formatted the MACHOP generated relocatable object code file to the target load format file. The MACHOP header line was a specialized syntax specification of its assembly parameters. The output is a relocatable object code file. Polish fix-up blocks handled forward references. Creating linker fix-up blocks is automatic: using an undefined symbol attribute in an expression on a MACHOP instruction call parameter would automatically create a polish fix-up block when used in a field expression. A linker fix-up block would automatically be generated once all undefined symbols in the fix-up expression were defined. Unresolved fix-ups were automatically reported as errors on closing the object file.

The pseudo and its parameter list were combined into a structure and appended to a list associated with a section. On flushing a section the pseudo list is executed. Pseudo instructions were basically named procedures. They used a subset of the generator action code and were even able to plant pseudo code.

The point is these are not parser generators. I may have rambled on a bit, but I cannot stress enough that the syntax defining language of these compilers is a programming metasyntax language. These compilers were self hosting compilers, or could have been. There were a few implementations that were not self-hosting. One, for example, a version of META II, had analysis code added into the assembly that came from a compile.

All the Schorre based metacompilers compiled to executable code (or assembly) and in turn produced executable or assembly code. They were not exactly parser generators. If the definition of parser generator is a compiler that takes a metasyntax language and produces a parser, these do fit. But does parser generator include transforming operators in the syntax rules? Or having different rule specification types that recognize tokens and character classes?

Looking for the definition


So to derive a consistent definition of reductive grammar I looked at how reductive methods are applied in mathematics and computer science, remembering mathematical reductions I haven't thought about since 1967. Not that I haven't applied them; I just have not thought about the name. A paper published in 2007 shows that reductive logic is an established methodology of analysis in mathematics and computer science today:

The process of calculating a proof of a given putative conclusion, for which non-deterministic choices between premises must be resolved, is called proof-search and is an essential enabling technology throughout the computational sciences. This study suggests that the reductive view of logic is as fundamental as the deductive view, and discusses some of the problems that must be addressed in order to provide a semantics of proof-searches of comparable value to the corresponding semantics of proofs. Just as the semantics of proofs is intimately related to the model theory of the underlying logic, so too should be the semantics of reductions and of proof-search. [Published: Oxford Scholarship Online: September 2007 http://www.oxfordscholarship.com/view/10.1093/acprof:oso/9780198526339.001.0001/acprof-9780198526339]

The above establishes that reductive is a term currently used in computer science. In mathematics we have more specific examples, which Arthur Rubin should also understand, of how reduction is used in solving math problems: matrix row and column reduction, reductions in solving integrals by parts, differential equation and Laplace transform reduction methods. And in computer science:

The Prolog family of languages, logic programs have a procedural interpretation as goal-reduction procedures.

My point is that reduction is a scientific descriptive term used in mathematics and computer science. In understanding it's general meaning we can derive a definatiom of reductive grammar.

Mathematical reductions are basicly breaking down a problem into smaller and simpler parts that have known soloutions or are simpler to solve. Similarly in computer science and applied logic reductive relates to top down goal seeking methods of soloutions.

edit point


I recently found a 1974 TREE-META document that describes its parser language:

A top-down method of syntax analysis is used in the TREE-META system. The main goal is to match the complete program with the Syntax Rule whose name follows .META at the head of the program. The name of the Syntax Rule appears at the start of the rule followed by the = symbol. The right-hand side of this Syntax Rule indicates which entities have to be recognised to achieve this goal. The recogniser achieves its main goal by looking for these smaller entities, from left to right, as individual subgoals. These subgoals are themselves defined in terms of other entities, and so the process of recognising a complete program becomes a matter of looking for progressively smaller entities, right down to the level at which basic symbols such as characters and numbers are recognised. TREE-META

edit point


A point I only recently noticed is that in several older documents the grammar rules were called syntax equations, not rules. I like that; it helps in distinguishing them from grammar production rules. I think I have nailed down the distinguishing features of a reductive grammar.

The next step is to explain how these languages actually work and get the technology straight. Then maybe you, Arthur and I can at least agree on the technology. I avoid calling them parser generators; it does them a great disservice.

I would like to introduce the term Parser Programming Language, or PPL. A PPL is a metasyntax programming language. Being programming languages, they have a formal specification. A syntax equation is a named goal that is compiled into a parameterless boolean function; you cannot overload a parameterless function. The term test is used to describe a boolean operative: a parser goal that tests the input against a specification, returning success (in which case the parse is advanced) or failure (in which case the parse is not advanced).
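As an illustration only (this is Python with my own helper names, not compiler output), a syntax equation such as X = A B; has the shape of a parameterless boolean test function over a shared scan position: it either advances the parse and succeeds, or leaves the parse where it was and fails.

# Sketch of a syntax equation compiled as a parameterless test function.
# The input and its scan position are shared state, as in the compiled parsers.

source = "begin end"
pos = 0

def string_test(lit):
    """Backup-style test: match a literal, advancing only on success."""
    global pos
    if source.startswith(lit, pos):
        pos += len(lit)
        return True          # success: parse advanced
    return False             # failure: parse not advanced

def A():                     # A = "begin ";
    return string_test("begin ")

def B():                     # B = "end";
    return string_test("end")

def X():                     # X = A B;  a named goal with two sub-goals
    return A() and B()       # the implied logical AND of a sequence

print(X(), pos)              # True 9: the whole input was recognized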

edit point


We have a good knowledge base on bottom-up parser algorithms. Properties like time and memory use are being attributed to these languages without the understanding to do so, all because of the preconception that they are production grammars used by a parser generator.

One fallacy concerns their memory usage. When parsing an ambiguous language they only ever have one possible path active; they have no need to track multiple possible paths. Alternatives are tried in the order programmed, and the first that completes solves the goal. Syntactic sweetness allows the factoring of alternatives, eliminating backtracking in most cases.

edit point ^


Ambiguity, Rp. In your summary you talked about how syntactic sugar does not add any expressive power, only more conciseness; how instead of two rules:

A :== B
A :== C
you can write:
A := B | C

And then later:

Another problem is ambiguity along the way: when doing top-down, left-to-right parsing, these rules
A :== B C | D
D :=  B E
present a problem: you don't know whether to interpret A as B C or as B E as they both start with B.

That shows you do not understand the grammars I am talking about. The problem is simply solved with syntactic sweetness:

A = B (C | E);

No more backtracking. Problem solved! In the next syntactic sugar example

instead of two rules A :== C and A :== CA, you can write A := B+

I agree that is syntactic sugar. But these are programming languages and do not allow syntax-equation function name overloading, so it would be written as:

A = C C*;

You also claim A := B* to be syntactic sugar. However, that operator is what is needed in place of left recursion:

expr = term (('+'|'-') term)*;

Left recursion is just a way of expressing a loop, and any decent programmer knows a non-terminating left-recursive loop is a problem. You might propose right recursion:

expr = term (('+'|'-') expr)*;

That solves the non-terminating loop. But these are programs, and are also programmed to produce a tree:

expr = term (('+':ADD|'-':SUB) expr!2)*;

Given: a-b+c would produce a right handed tree:

   SUB
  /   \
 a     ADD
      /   \
     b     c

Which is the wrong order of evaluation. The CWIC and SLIC grammar languages transform source code into abstract syntax trees. We could instead write the equation to produce a list, which gives the same structure either way:

expr = +[term (('+'| '-':NAG!1) expr!2)*]+:SUM!1;

That would transform the expression into

     SUM
      |
[a,NAG[b],c]

I hope you are at least beginning to see why I prefer not to call these parser generators. This is not how parser generators are normally explained.

Backtracking is programmed and only invoked when a backtracking alternative fails to complete. And actual backtracking applies only after the successful recognition of an entity is made and failure occurs on a following entity. I cannot say much about their theoretical compile time, but I can say that compilers written in CWIC and SLIC ran faster than the computer manufacturers' compilers for the same languages. A COBOL compiler written in SLIC compiled 5 to 25 percent more lines per runtime unit than the DEC COBOL compiler. Erwin Book claimed similar results for CWIC running on an IBM 360.

These parser languages are part of integrated compiler writing systems.


The Schorre metacompilers' syntax equations are goal-oriented logic equations. The prime, or top driving, syntax equation has the goal of parsing the complete program. Each syntax equation is a named goal whose right-hand side is a logical expression of sub-goals. The sub-goals reduce the language into smaller and smaller parts, down to the words of the language and the character sequences of which words are constructed.

edit point


I think it is this multi-stage reduction of goals into finer and finer resolution to which the term reductive grammar applied. That had not occurred to me until I recently found a 1974 TREE-META document explaining the top-down parser in terms of goals. Anyway, there are lots of other language features that rule out their being production grammars. I like the term syntax equation.

The languages come from the keypunched, limited-character-set, card-deck programming era. I will use modern equivalent forms in the examples: the Kleene star sequence operator in place of the $ loop operator, i.e.

(x Y z)*

is equivalent to:

$(x Y z)

both meaning the sequence (x Y z) may appear zero or more times.
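In Python terms (an illustration of the semantics only, not of any real implementation), both forms are just a loop that repeats a grouped test until it fails, and the loop itself always evaluates to success:

# Sketch: the $ / Kleene-star sequence operator as a loop.
# 'item' is any test function: it returns True and advances the parse,
# or returns False leaving the parse position unchanged.

def zero_or_more(item):
    def test():
        while item():        # repeat the grouped test until it fails
            pass
        return True          # zero or more: the loop always succeeds
    return test

text, pos = "aaab", 0

def char_a():                # a trivial test matching one 'a'
    global pos
    if pos < len(text) and text[pos] == 'a':
        pos += 1
        return True
    return False

print(zero_or_more(char_a)(), pos)   # True 3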

edit point


The parser language could be described as:

  1. goal oriented
  2. string processing
  3. stack oriented
  4. functional programming

In describing these languages the term test was commonly used. Think of it as akin to a boolean data type. Saying that XYZ is a test describes both its function and its possible result, equivalent to a boolean data type in a typed language. Just as an integer function returns an integer and an integer expression evaluates to an integer, a test evaluates to success or failure. In a goal-oriented language a test is a goal that can succeed or fail.


These metalanguages are scannerless Parser Programming Languages: they have token-recognizing equations that are distinct from syntax equations. Initially, in the early Schorre metacompilers, tokens were recognized by built-in token parsing functions following ALGOL 60 specifications:

.id identifier
.number integer number
.string a quoted string

In CWIC and later metalanguages special token equations are used. Token equations use strings and character classes to define valid character sequences. These equations use a subset of the features of general syntax equations, making them consistent and easy to understand.

Character classes are a kind of literal-defining equation that only uses single-character literals or numeric ordinal tests. For example, to define the digits usable in a binary number we define the bin character class as:

bin: '0'|'1'; // quoted glyph form.

or

bin: 0H30|0H31; // hex numeric codes

This defines the characters usable in a binary number, which may contain a 0 or 1. In these character class defining equations only single-character literals or numeric ordinal tests may be used, with one exception: they may include a previously defined character class. The alternatives are separated by the | alternative operator. Below, bin is written using ASCII character ordinals, then the classes are built up by inclusion:

bin: 0H30|0H31;
i.e.
bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dec: oct|'8'|'9';
hex: dec|'A'|'B'|'C'|'D'|'E'|'F';

The hex character class includes the dec character class, which includes the oct character class, which includes the bin character class. Once defined, a character class may be used in token and syntax equations; alternative order is irrelevant.

num .. ("0H"|"0h") hex hex* MAKEHEX() |
      ("0O"|"0o") oct oct* MAKEOCT() |
      ("0B"|"0b") bin bin* MAKEBIN() |
              dec dec* MAKENUM();

Token equations are defined using the token "defined as" operator (..). A symbol object is created by default unless an interceding function is called that makes a non-cataloged object. Symbols are automatically cataloged in the dictionary. Attributes are created by assigning values to them.

 upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'
     |'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';
 lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'
     |'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';
 let: upr|lwr;
symchar: let|'_';
 id .. let symchar*;

The id rule defines an identifier; it is common to use id for identifier. In token equations, matched literal strings are not kept as part of the token. The modifiers '+', '-', '~', ',', and '?' preceding a string change its test operation.

'+' keeps the string as part of the token when matched. '-' is a negative match: successful if the string is not matched; the string is never kept. '~' matches any single character except the one given: if the character does not match, it is kept and the input advanced; if it does match, the test fails and the input does not advance. '?' looks ahead, evaluating to the success or failure the plain string test would give, but the string is not kept and the input is not advanced in either case. ',' inserts the string into the token, always evaluating to success; no test is actually performed. The token matching algorithm skips leading skip_class characters. The skip_class is predefined but may be redefined as any other class, overriding the default. The SLIC token algorithm allows a skip_class character to be matched as the first character of a token. Skip_class characters may be part of a token in both systems, just not the first character in CWIC.
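A rough Python model of those string modifiers may help (the Scanner class and method names are mine, and skip_class handling is omitted, so this is only a sketch of the behaviour described above):

# Illustrative model of the token-equation string modifiers.

class Scanner:
    def __init__(self, text):
        self.text, self.pos, self.token = text, 0, ""

    def _ahead(self, lit):
        return self.text.startswith(lit, self.pos)

    def bare_str(self, lit):          # "lit"  matched but not kept
        if self._ahead(lit):
            self.pos += len(lit)
            return True
        return False

    def plus_str(self, lit):          # +"lit" matched and kept in the token
        if self._ahead(lit):
            self.token += lit
            self.pos += len(lit)
            return True
        return False

    def query_str(self, lit):         # ?"lit" look ahead only, never advances
        return self._ahead(lit)

    def minus_str(self, lit):         # -"lit" succeeds when lit is NOT next
        return not self._ahead(lit)

    def comma_str(self, lit):         # ,"lit" inserted into the token, no test
        self.token += lit
        return True

    def not_chr(self, ch):            # ~'c'  keep any one character except c
        if self.pos < len(self.text) and self.text[self.pos] != ch:
            self.token += self.text[self.pos]
            self.pos += 1
            return True
        return False

s = Scanner("0x1F")
s.bare_str("0x")                      # recognized but discarded
s.comma_str("0H")                     # normalized prefix kept in the token
s.plus_str("1"); s.plus_str("F")
print(s.token)                        # 0H1F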

One might ask how character class membership testing compares to regular expressions. That of course depends on the implementations. The class test is a single instruction on many machines. For the classes above, each rule creates a class test mask: each class containing character constants is assigned a bit, and each class's mask is the combination of all included classes' masks plus its own assigned bit, if one is assigned. The skip_class is predefined as 0b00000001. We first defined bin, which gets a mask of 0b00000010, then oct, dec and the rest in order:

skip_class 0b00000001
bin        0b00000010
oct        0b00000110
dec        0b00001110
hex        0b00011110
lwr        0b00100000
upr        0b01000000
let        0b01100000
symchar    0b11100000

There is a character class map table indexed by the character's numeric code. ANDing the class mask with the class map entry indexed by the character code is non-zero if the character has membership. So a class test can be a single machine instruction plus a branch-on-condition instruction, and quite efficient code can be generated for token equations. Token equations operate by copying matched characters to a token buffer. Supplied functions are called to create specific token objects; the numeric string-to-number conversions are used in the num equation above. If no conversion function is called, the catalog function is called. Cataloged symbols are dictionary objects, and the dictionary is a symbol table stack. The action performed by the cataloger depends on catalog flags. Catalog functions are supplied to set declaration mode: in declaration mode an error occurs if a symbol is already in the top symbol table; if not in declaration mode, the catalog action depends on auto-declaration mode.
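Here is a small Python sketch of that scheme, illustrative only, using the bit assignments from the table above: each class's mask ORs in the masks of the classes it includes, and membership is one indexed load and one AND.

# Sketch of character-class bit masks and the class map table.

SKIP = 0b00000001            # predefined skip_class (space, tab, newline)
BIN  = 0b00000010
OCT  = BIN | 0b00000100      # oct: bin | '2'..'7'
DEC  = OCT | 0b00001000      # dec: oct | '8' | '9'
HEX  = DEC | 0b00010000      # hex: dec | 'A'..'F'

class_map = [0] * 256        # indexed by the character's numeric code
for c in " \t\n":   class_map[ord(c)] |= SKIP
for c in "01":      class_map[ord(c)] |= BIN
for c in "234567":  class_map[ord(c)] |= 0b00000100
for c in "89":      class_map[ord(c)] |= 0b00001000
for c in "ABCDEF":  class_map[ord(c)] |= 0b00010000

def in_class(ch, mask):
    # one indexed load and one AND: the "single instruction" class test
    return class_map[ord(ch)] & mask != 0

print(in_class('7', HEX), in_class('7', BIN), in_class('F', DEC))
# True False False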

edit point


Logical AND is implied by a sequence.

X = A B;

For X to be successful, both A and B, evaluated in the specified order, must be successful. In fact, if A fails B is not attempted. X is a goal that requires both the A and B goals to succeed. The languages have two types of alternative "or" operators and three failure types. A backup failure occurs when a token or string match fails: the token or string compare function restores the input character stream to its original entry state and returns failure. For example, "begin" is a string match.

In a sequence of goals, if the first one fails the sequence fails. If there is an alternative it is taken; when no alternative exists the sequence fails, and the containing expression then fails. i.e.

X = (a B | c B | d A) C | x A;

In the equation above, if a fails c is attempted; if c then fails d is attempted; if d fails x is attempted. If a, c, d, or x succeeds and a following goal fails, a long backtracking failure will result; hopefully a backtrack alternative has been used in a calling rule. Note the above expression can be further factored:

X = ((a | c) B | d A) C | x A;

Look at the above example and say the equation calling it provides a backtrack alternative:

Y = X \ Z;

The backtrack alternative operator \ will take the Z alternative no matter the failure type from X. There can be any number of call levels between the backtrack alternative and the backtracking failure. Look ahead is another feature of these languages, and again we have both backtracking and backup look aheads, positive and negative. String and token look aheads employ backup to reset the input state. Look ahead uses ? for positive look ahead and - for negative look ahead; used with a string, backup is employed, while employing look ahead on a syntax expression or equation name invokes a backtrack set point. There are syntax equations and token (making) equations, and there are also character class defines. Character matching works differently in syntax and token rules: in token rules a matched character class character becomes part of the token, while in a syntax rule a class is identical to a character literal match. A +<string literal> is pushed on the parse stack if matched. ?<string literal> looks for the string, returning success or failure with the parse state unchanged. -<string literal> matches the string against the input stream and negates the result, leaving the parse state unchanged. A ~<character literal or character class name> in a syntax equation matches any character except the given <character literal or class>, skipping it.

Syntax equations can call a generator function anywhere a syntax equation call could be made. A generator can be used for many functions outside a syntax equation's abilities. Generator functions return success or failure, the same as syntax rules. They consist of sets of unparse rule / action pairs. Actions are programmed in a LISP 2 procedural dialect that can perform various analyses, returning success or failure. Testing and setting symbol attributes is one function a generator might perform: one could test whether a symbol is defined, or is a defined constant, or even test command line compile options.


In a syntax equation a + preceding a character string, if matched, will create a string object and push it on the parse stack; in a token equation the string would become part of the token string. A comma preceding a string literal in a syntax equation creates a string object and pushes it on the parse stack.


A backtrack creates a backtrack stack frame on the call stack. When a backtrack failure occurs, the top backtrack stack frame pointer is loaded into the call stack pointer and failure is returned; a backtrack failure is similar to a C longjmp. The call stack is primed with a backtrack stack frame before the top program-driving rule is called. This prime backtrack does not restore the parse state; it is there for consistency and to catch programming errors in the compiler. Personally I have only used backtracking for error reporting and recovery:

program = ((decl | ".end" .break) \ ERRORX("Syntax error") (-';' .any)* ';')*

The above specifies a valid program as a sequence of decl followed by .end, assuming .end is not a valid decl. If decl returns failure, the non-backtracking alternative ".end" .break is attempted. If ".end" is matched, .break ends the loop and program is successful. If a long backtracking failure occurs, the backtracking alternative:

ERRORX("Syntax error") (-';' .any)* ';'

is taken. ERRORX reports "Syntax error" and shows the furthest location the parse attained. If the non-backtracking alternative fails to match ".end", the backtrack alternative will also be taken. Assuming a ';' ends a decl, the (-';' .any)* ';' attempts to skip over the decl in question: the -';' succeeds on any character except a ';' without consuming it, .any then consumes that character, and the Kleene star * loops until failure occurs on encountering a ';'. The ';' outside the loop then matches the ';' ending the skipped text, and the outer Kleene * loop continues looking for decl.
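To make that concrete, here is a rough Python model of the mechanism (entirely my own sketch, not generated code): a backup failure is just a False return with nothing consumed, while a failure after partial success raises an exception that unwinds to the nearest backtrack alternative, which restores the saved parse state and runs the error-recovery alternative, here skipping to the next ';'.

# Backup vs. backtrack failure, modeled with an exception in place of the
# longjmp-style call-stack reset described above.

class BacktrackFailure(Exception):
    pass

text = "int x; int y y; int z;"       # the middle declaration is bad
state = {"pos": 0}

def string(lit):                      # backup-style test; skips leading blanks
    p = state["pos"]
    while p < len(text) and text[p] == ' ':
        p += 1
    if text.startswith(lit, p):
        state["pos"] = p + len(lit)
        return True
    return False                      # simple failure: nothing consumed

def decl():                           # decl = "int" id ';'   (much simplified)
    if not string("int "):
        return False                  # first sub-goal failed: plain failure
    p = state["pos"]
    if p < len(text) and text[p].isalpha():
        name = text[p]
        state["pos"] = p + 1
        if string(";"):
            print("declared", name)
            return True
    raise BacktrackFailure()          # failure after partial success

def backtrack_alt(goal, recovery):    # goal \ recovery
    saved = state["pos"]              # the backtrack set point
    try:
        if goal():
            return True
    except BacktrackFailure:
        pass
    state["pos"] = saved              # restore the saved parse state
    return recovery()

def skip_decl():                      # ERRORX(...) (-';' .any)* ';'  stand-in
    print("syntax error near:", text[state["pos"]:state["pos"] + 8].strip())
    while state["pos"] < len(text) and text[state["pos"]] != ';':
        state["pos"] += 1             # (-';' .any)*  skip up to the ';'
    state["pos"] += 1                 # then consume the ';'
    return True

while state["pos"] < len(text):       # program = (decl \ recovery)* ...
    backtrack_alt(decl, skip_decl)
# prints: declared x / syntax error near: int y y / declared z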

separate point of contention


A separate point of contention is your position that John Backus based BNF on Chomsky's work. There are no hard facts proving John knew of Chomsky grammars in 1958. He mentioned Emil Post, which makes a lot more sense. John was employed at IBM; as the major compiler developer he would have been exposed to the boolean equations used in circuit design, since compiler developers worked closely with design engineering. You can find ACM papers by IBM engineers from that time period on the use of boolean equations in computer circuit design. Post's doctoral thesis in mathematics was on Boolean algebra, and Backus was a math major. The ALGOL 60 BNF terminology is most similar to Post's: Post talks about rule names being boolean variables, while in the BNF description they are linguistic variables. We see the same use of operators as in Post's work. The ::= BNF symbol is interpreted to mean "is defined as"; the same is used in the boolean circuit design papers. I think there is simply a backward projection of 1970s compiler technology. And in actual use in the ALGOL report, BNF is simply a metalanguage: < > are not always linguistic variables. For example, <any sequence of characters not containing ;> can be found in section 2 of the report describing comments. Personally I believe Naur had more to do with how it was used in the ALGOL report; after all, he is the one who rewrote the report to use it. He replaced linguistic variable with class name in his own works; you can find it in notes written for an ALGOL class he taught.

personal


My brain seems to work a bit differently than the norm. I do not remember names or symbology and such unless I am constantly using them for some period of time; I retain concepts and just know how things work. I started as a math and physics major in college. In high school I got hooked on math in first-year algebra, learning algebra, trig, calculus and differential equations on my own, checking books out of the library. That was in the first few weeks of class. I aced the advanced math SAT test given in that algebra class towards the end of the year, solving every problem correctly, off the percentile scale. I started college concurrently with my senior year of high school. I aced the higher math college entrance exam: we were given 4 hours to complete it, and I finished it in 15 minutes, working all problems in my head using no scratch paper, missing only 3 multi-level algebraic substitution problems. It took me a long time to realize my memory limitation. Even writing, I often repeat myself, forgetting what I wrote only a few minutes earlier. But on the other hand I can keep extremely complex algorithms in my head. I do understand parser generators and how they work; I was just never interested in them, because I learned about CWIC first. Modern compiler courses skim over recursive descent parsers. What you describe as syntactic sugar eliminates backtracking on ambiguous language constructs. These were self-hosting compilers, so extending the language was fairly simple.

Most of my professional career I have worked on compilers, assemblers and operating systems. Though I worked on compilers that used the parser generator technology that you and Arthur know, I was not directly involved with the parser. Working for Kontron FutureData I managed the language development group and worked on code generation and optimization. From the little I did working with a programmer on the Modula-2 parser I can say that in the late 80s they were primitive.==END RSP==

We first need to agree on what we are talking about. It's like I am talking about apples and you are talking about pumpkins. I apologize again for my inadequate explanations in trying to explain reductive grammars. Since you understand a bit about compilers, I will explain how they work as parser programming languages and maybe together we can decide what kind of grammar they are.


Rp, you said my idea on the relationship between generative and reductive grammars is different from how you, Arthur Rubin and standard compiler technology textbooks see it. I have lots of compiler textbooks, none of which talk about reductive grammars; some mention analytic grammars. Actually few even talk about linguistic theory, other than the grammar types (i.e. CFG) that a specific parser type can handle. If you have a definition of reductive grammar that you can provide a reference for, I would like to hear it. I had never studied Chomsky grammars before now. I am 68 and have been programming since I was 17 in 1965. I have been active in compiler development, off and on, since 1968, last heading the FutureData language development group at Kontron Electronics in Irvine until they moved to Hayward in the mid 80s. For more details on my background see about me. The parser generators we used at Kontron were good for doing about half the job. It took a great amount of time fiddling with them to get an initial parser, and then the parser still had to be worked on to finalize it. And that was having a BNF language specification to start from; as you already touched on, BNF has to be written specifically for a parser. One programmer in my group had her Masters in Computer Science from UCI, a very smart Vietnamese girl who worked on our Ada parser. We were using a VAX 11/?? (I think 70) running Unix. The compilers and assemblers we developed ran on our development systems; both Z80 and MC68000 based development systems were supported at the time. We called them In-Circuit Development Systems. Our software ran on FutureData Z80 and Kontron MC68000 microcomputers with in-circuit emulators, fully integrated systems with symbolic debugging. Many of the FutureData In-Circuit Development Systems went to Atari game developers.

I have learned a few things researching old documents and remembered others from calculus and differential equation solving. I have found sources using analogous reduction techniques that I will get to eventually.


My passion and frustration comes from developing an advanced compiler-compiler that I named SLIC (System of Languages for Implementing Compilers). The inspiration for SLIC was CWIC (Compiler for Writing and Implementing Compilers), developed at System Development Corporation. I learned about CWIC at an L.A. ACM SIGPLAN meeting in 1969 where I first met Erwin Book, the lead developer of CWIC, who was giving a talk on it. After the talk there were informal discussions, and I talked with Erwin about my interest in compilers; he gave me a set of programming manuals on CWIC. Dewey Val Schorre, the author of META II, was also a member of the CWIC development team. I implemented SLIC on a DEC-System-10. SLIC was later used to write a COBOL cross compiler that was completed in less than 4 man-months; it ran on a DEC-System-10 producing TI990 machine code. I have known these as programming languages, not really concerned about their grammar type until recently. I think there should be a fitting grammar definition that agrees with their original documentation. I think this technology is very important and has been ignored and forgotten by mainstream computer science. The blame, I think, is UNIX and its popularity and widespread use in universities during the early 70s. yacc dominated mostly because of Unix. It fit with Chomsky linguistic theory. It also contributed to the changes to BNF from its original use in the ALGOL 60 report.

A really frustrating thing is that SLIC is about as close to an ideal compiler-compiler as possible. It can be divided into front ends that translate a language into sequential pseudo code, and back ends that define the target machine's instructions and translate the intermediate sequential pseudo code into executable code. All stages have the ability to perform optimizations. If planned, one can mix and match front ends and back ends. It was running in 1974. SLIC running on the DEC-10 included an interactive debugger: you could set break and/or trace points on grammar rules, in generator actions, in pseudo code or in MACHOPs.

My agenda is not to push SLIC. There are topics here that used the reductive grammar technology. A lot of the power of Wikipedia is its cross-referencing ability: technical articles can be more to the point, linking to other articles. I am trying to get reductive grammar explained, and I would appreciate help with the task. I use examples as they would be written for SLIC. CWIC's code generation is to binary, using arithmetic expressions to create machine code; SLIC's additional PSEUDO and MACHOP languages provide a much more readable symbolic output. In explaining the grammar language it is important to illustrate its compilation to machine code, which is easier with SLIC examples. This is about the grammar language. As far as I know, TREE-META and CWIC are the only documented Schorre languages that used an intermediate tree structure between the parser and code generation phases. But there were META 4 and META 5, of which I have found no source of documentation. SLIC's grammar and generator languages are 98% identical to CWIC's. The goal in writing SLIC was to separate machine code generation into separate phases. Earlier compiler-compiler systems produced 3-address micro instructions that could be coded in a macro assembly language; I simply built those into SLIC, only making them executable functions.

These languages share some properties with Parsing Expression Grammars. They are analytical: alternatives are tried left to right, in the order programmed, taking the first match (not the longest). That alone takes them out of the production grammar class.

You claim they are parser generators. Personally I avoid calling them parser generators. Parser generators are part of the newer mindset of generating a bottom-up parser, an LALR parser for example, from a grammar. What I am talking about could be called Parser Programming Grammar Languages; the difference is subtle. You talk about packrat parsers, LALR parsers and other types of parsers that grammar rules are translated into by parser generators, most of which are bottom-up parsers, and how the various parsers get around problems introduced by using production grammars. You have talked about the so-called advantages of bottom-up parsing. The languages I am talking about are all top-down. In fact I now think that is part of what they meant by reductive grammar (see the TREE-META goal), which is analogous to mathematical reduction: breaking up a complex problem into smaller parts that lead to the solution. A reductive grammar describes a programming language from the top, breaking it down into smaller meaningful parts, i.e. declarations, procedures, statements, tokens, and characters. When I hear parser generator I think yacc (only pronounced yuk).

The technology is analogous to how we solve complex problems: breaking a problem into manageable sub-goals.

All the Schorre-based metacompilers have a specialized parser programming language in which the main goal is to match the complete program with the starting syntax rule (see the TREE-META goal). The start rule drives the parser, describing a complete program.

program = declaration declaration*;

The above defines a program to be a declaration followed by a sequence of declarations, using the Kleene star * sequence operator. The Schorre languages used the prefix $ zero-or-more sequence operator:

program = declaration $declaration;

The above equivalent program rules are pretty general and could be part of a production grammar. However, they are programmed functions that are compiled into executable code:

(Hand compiled commented example code)

program: // = declaration $declaration;
     call  declaration
     jne   %g1          // on failure return to caller
%g2: call  declaration
     je    %g2          // loop while success
     cmp   al,al        // set success
%g1: ret                // return to caller

The above illustrates the translation of a rule into IA-86 machine code. Hand compilation is straightforward; the complicated assembly code is in the support library. The %g<number> is a gen_sym: compiler-generated symbols are part of the CWIC and SLIC compiler-compiler systems.

A declaration could be most anything; the declaration sub-goal is another rule. I cannot stress enough that these languages are programming languages: a rule compiles into an executable function that calls other functions. These are test functions that return (boolean) success or failure. In SLIC, success or failure is returned in the processor's status register zero flag; in the above 80x86 or IA-86 example code a conditional je or jne tests the zero flag, equal meaning success. Is Parser Programming Grammar Language a term used today? If not, it would fit these languages. No matter how one names the technology, they would be a grammar in specifying a language.

CWIC and SLIC are more general than compiler-compilers; they could, for instance, translate their input into a BNF form for publication. I taught a secretary to use SLIC. I forget the details other than that the task was searching text files. SLIC started up taking its input from the user terminal; it processed its own command line, parsing it in the syntax language. Supplied library generators open files given the file in a tree structure, and the source could be a list of files in tree-structure format. Part of the system is a list processing language: the CWIC generator actions are coded in a LISP 2 dialect. My action language added the operations first, butfirst, last and butlast from the LOGO programming language.

META II is extremely simple compared to CWIC, but the basic parsing concept is the same. META II compiled to an interpretive stack machine. Compilers written in META II compile to stack machine interpreters specific to the language: a subset of ALGOL was written in META II and ran on a stack machine interpreter written specifically for it. META II produced a parser exactly as coded in the META II metalanguage. It was limited to producing stack machine code, as it output object assembly code directly out of the syntax rules.

 EX3 = .ID .OUT('LD ' *)/'(' EX1 ')';
 EX2 = EX3 $('*' EX3 .OUT('MLT'));
 EX1 = EX2 $('+' EX2 .OUT('ADD'));

The META II example above is from the original META II UCLA paper by D. Val Schorre. .ID, .NUMBER and .STRING were built-in recognizers; they followed ALGOL token rules. When an identifier is matched by .ID in rule EX3, the string 'LD ' and the last parsed entity recognized by .ID are written to the output assembly file by .OUT('LD ' *). .OUT, a built-in function, writes to the output file; * writes the last parsed token (.ID, .NUMBER, or .STRING).

The input sequence:

A + B * (C + D)

Would output:

Compiler output:

 Assembly  Stack       Rule  Operation
 LD A      A           EX3   push A
 LD B      B,A         EX3   push B
 LD C      C,B,A       EX3   push C
 LD D      D,C,B,A     EX3   push D
 ADD       C+D,B,A     EX1   pop D and C, push C+D
 MLT       B*(C+D),A   EX2   multiply the top two stack entries
 ADD       A+B*(C+D)   EX1   add the top two stack entries
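As an illustration only (plain Python, my own paraphrase rather than anything META II produced; tokens are reduced to single letters), the three rules behave like this recursive descent, printing the assembly as it parses:

# Sketch of the EX1/EX2/EX3 rules as recursive descent that emits
# stack-machine code during the parse.

src = "A+B*(C+D)"
pos = 0

def peek():
    return src[pos] if pos < len(src) else ""

def take(ch):
    global pos
    if peek() == ch:
        pos += 1
        return True
    return False

def ex3():                         # EX3 = .ID .OUT('LD ' *) / '(' EX1 ')';
    if peek().isalpha():
        print("LD", peek()); take(peek())
        return True
    return take("(") and ex1() and take(")")

def ex2():                         # EX2 = EX3 $('*' EX3 .OUT('MLT'));
    if not ex3():
        return False
    while take("*"):
        ex3(); print("MLT")
    return True

def ex1():                         # EX1 = EX2 $('+' EX2 .OUT('ADD'));
    if not ex2():
        return False
    while take("+"):
        ex2(); print("ADD")
    return True

ex1()    # prints: LD A / LD B / LD C / LD D / ADD / MLT / ADD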

The example is greatly simplified for explanation. The EX1 and EX2 rules can be expanded to handle subtraction and division:

 EX3 = .ID .OUT('LD ' *)/'(' EX1 ')';
 EX2 = EX3 $('*' EX3 .OUT('MLT')/'/' EX3 .OUT('DIV'));
 EX1 = EX2 $('+' EX2 .OUT('ADD')/'-' EX2 .OUT('SUB'));

The syntax sweetness of grouping and alternative operators allows us to avoid backtracking.

TREE-META, CWIC and SLIC constructed abstract syntax trees. They have node and parse stacks. :<node> pushes a node onto the node stack. !<number> creates a list whose first element is the top node stack entry: the node is popped into the first list element, and <number> parse stack entries are popped into the list in reverse order, first pop to the last position, then the next to last, and so on, so that the list order matches the parsing order. A list whose first element is a node is recognized as a tree, the node followed by its branches.

expr = term $(('+':ADD|'-':SUB)term!2);
term = factor $(('*':MPY|'/':DIV) factor!2);
factor = id|number|'('expr')';
id .. let $alphanum;
number .. dgt $dgt MAKENUM[];

given: A + B * (C + D) would generate a tree:

 ADD[A,MPY[B,ADD[C,D]]]

   ADD
  /   \
 A    MPY
     /   \
    B    ADD
        /   \
       C     D
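Here is a small Python sketch (my own modelling, using nested lists for trees) of what the node stack, parse stack, :<node> and !2 operations do while parsing A + B * (C + D):

# Sketch of the node stack and parse stack used to build
# ADD[A,MPY[B,ADD[C,D]]].  A tree is a list whose first element is the node.

node_stack, parse_stack = [], []

def push_node(name):                  # the :NODE operation
    node_stack.append(name)

def build(n):                         # the !n operation
    branches = [parse_stack.pop() for _ in range(n)][::-1]   # parsing order
    parse_stack.append([node_stack.pop()] + branches)

# Parsing A + B * (C + D) performs, in order:
parse_stack.append("A")
push_node("ADD")
parse_stack.append("B")
push_node("MPY")
parse_stack.append("C")
push_node("ADD")
parse_stack.append("D")
build(2)                              # ADD[C,D]
build(2)                              # MPY[B,ADD[C,D]]
build(2)                              # ADD[A,MPY[B,ADD[C,D]]]

print(parse_stack.pop())
# ['ADD', 'A', ['MPY', 'B', ['ADD', 'C', 'D']]]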

As to code generation I'm not going into the details. A generator would be called at some point. It could rearrange the tree for least register usage:

 ADD[MPY[ADD[C,D],B],A]

         ADD
        /   \
      MPY    A
     /   \
   ADD    B
  /   \
 C     D

Now evaluating the left branch first we produce:

    LD  r1,C
    ADD r1,D
    MPY r1,B
    ADD r1,A

An interpreter example similar to one in the ACM CWIC paper:


.SYNTAX
PROGRAM = $((ID ('=' DECLARATION | ':=' ST)
            | '.END' .FINI) 
            \ ERRORX['???'] GARBOL);
DECLARATION = $((NUM ';':EQU!2 DECL[*1] .FINI
                | ERRORX['Numeric expected'] GARBOL)
               \ ERRORX['; assumed'] NUM:EQU!2 DECL[*1] .FINI );
ST = $((EXP ';' :STORE!2 COMPILE[*1]
       | ERRORX['Syntax error'] GARBOL)
      \ ERRORX['missing ; assumed'] EXP:STORE!2 COMPILE[*1] .FINI );
GARBOL = $(-';' .ANY) ';';
EXP = TERM $(('+':ADD/'-':SUB ) TERM !2);
TERM = FACTOR $(('*':MPY/'/':DIV) FACTOR !2);
FACTOR = ID / '(' EXP ')' / NUM;
LET: 'A'/ 'B'/ 'C'/ 'D'/ 'E'/ 'F'/ 'G'/ 'H'/ 'I'/ 'J'/ 'K'/ 'L'/ 'M'/
     'N'/ 'O'/ 'P'/ 'Q'/ 'R'/ 'S'/ 'T'/ 'U'/ 'V'/ 'W'/ 'X'/ 'Y'/ 'Z';
DGT: '0'/ '1'/ '2'/ '3'/ '4'/ '5'/ '6'/ '7'/ '8'/ '9';
ALPHNUM: LET / DGT;
ID  .. LET $ALPHNUM;
NUM .. DGT $DGT MAKENUM[];
.FINISH
.STOP SETUP PROGRAM
.GENERATOR
DECL(EQU[X, Y]) => DEF:(X) := Y
COMPILE(STORE[X, Y]) => DEF:(X) := EVAL(Y); PRINT(DEF:(X))
EVAL(IDP(X)) => DEF:(X)
    (NUMBER(X)) => X
    (#V1[EVAL(X), EVAL(Y)]) => #U1;

     #V =  ADD,   SUB,   MPY,   DIV;
     #U = X + Y, X - Y, X * Y, X / Y;

.FINISH
.STOP SETUP PROGRAM

The changes I made above are the error reporting and the additional arithmetic operators handled. I hope I got all the grouping right in the error detection. I used non-backtracking and backtracking alternatives in combination to report different errors: operator errors by non-backtracking, and backtracking for errors after an operator. Including subtraction and division was quite simply a matter of adding them to the vector arrays. Vectors, designated by the #, give multiplicative power to a generator function; they are a shorthand notation. The #V node-list vector and the corresponding #U operation vector allow a single unparse/action pair to match any one of the nodes in the node list and perform the corresponding operation in the action. Note the tree-crawling unparse rule (#V1[EVAL(X), EVAL(Y)]): when a node in #V1 is recognized, the unparse rule calls EVAL with the left tree branch, and X is the return value of EVAL(X), not its argument. When a generator call appears in an unparse rule, it is called with the object at its position and its parameters are returned values. In an action they are like functions in other languages, with arguments in the bracketed argument list. Generators may return multiple values.

(x,y) := gen(z);

Here gen takes a single object and returns two values that are placed in the variables x and y. Besides values, generators return success or failure: a generator failing to have a successful unparse returns failure to its caller, which may be a syntax rule. A generator can be used for many things during syntax analysis, such as creating objects in token rules. Generators are called to generate code from syntax rules; these are syntax-directed compilers. Single and multi-pass compilers can be programmed in these languages. Supplied library generator functions handle the writing and reading of lists and trees.
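As a rough illustration of the #V/#U vector idea only (plain Python, not the generator language), one rule can serve the whole family of arithmetic nodes by pairing each node with its operation:

# Sketch of the #V / #U vector idea: one unparse/action pair serves all
# four arithmetic nodes.  Trees are lists: [node, left, right].

OPS = {                                # V = ADD,   SUB,   MPY,   DIV
    "ADD": lambda x, y: x + y,         # U = X + Y, X - Y, X * Y, X / Y
    "SUB": lambda x, y: x - y,
    "MPY": lambda x, y: x * y,
    "DIV": lambda x, y: x / y,
}
DEF = {"A": 2, "B": 3}                 # stand-in for symbol attribute values

def EVAL(t):
    if isinstance(t, (int, float)):    # (NUMBER(X)) => X
        return t
    if isinstance(t, str):             # (IDP(X))    => DEF:(X)
        return DEF[t]
    node, left, right = t              # (#V1[EVAL(X),EVAL(Y)]) => #U1
    return OPS[node](EVAL(left), EVAL(right))

print(EVAL(["ADD", "A", ["MPY", "B", ["ADD", 4, 5]]]))   # 2 + 3*(4+5) = 29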

The SLIC compiler-compiler that I wrote includes the SYNTAX and GENERATOR languages of CWIC, with a few additions. One is a backtracking syntax rule, just syntactic sugar as you would say:


X == ... ;

It is equivalent to:

x = ( ... ) \ .FAIL;

It simply catches a backtrack failure and returns failure. Backup and backtracking are distinct features.

Backup is automatic; it applies to token and string matching operations. I use the term test as a type, a boolean type whose value is success or failure, though as I explained above the test type is really a processor status bit state. In talking about these rule functions we can say they are test functions, and we can look at a grouped expression in a syntax rule as a test. Almost every evaluating part of a rule evaluates to a test: every syntax or token rule is a test function, a literal string is a test ("XYZ" calls a built-in string compare test function), and a grouped expression is a test. Backup applies when only a single item, a token or a string for example, has been matched. Once a test has succeeded, a failure of a subsequent test results in a backtrack failure.

Backtracks are created on the call stack, and as they are created they are linked to the previous top backtrack. A backtrack is created by calling the backtracker, which creates the backtrack stack frame and calls its caller's return address with success. That happens at the beginning of the first alternative of a backtrack alternative operator. The backtrack frame contains many things in order to restore the state on failure, for example a temporary symbol table that contains created symbols and attribute changes to previous symbols. On success all of this is transferred to the previous backtrack stack frame, all the way back to the initial program backtrack frame. The program backtrack frame is created before the first driving start rule is called; it simplifies housekeeping as it contains the top structures that are backtracked, the root symbol table for example. It does not backtrack on a failure, as reaching it indicates an internal compiler error. A backtrack failure occurs whether or not a backtrack alternative exists. To understand backtracking a good grasp of a test is needed: a test may be a simple string or token test, a rule is a test, and parenthesis-grouped tests are themselves a test.

 A (B C D) E

is a sequence of three tests. The group (B C D) is a test, and B, C and D are tests. A rule:

X = A (B C D) E;

will return failure to its caller if A returns failure. If A is successful, then a failure of either of the two following tests results in a backtrack failure. A backtrack failure may fail back any number of levels to a rule having a backtrack alternative, or the base backtrack may catch it, resulting in an internal compiler error. It is not that hard to understand backtracking semantics. Maybe now you see why I prefer not to call these parser generators. I see you have a good grasp of modern compiler technology. Mine is not as good, mainly because everything I have looked at does not compare to the power CWIC had in 1970. The only gain I see is in formal proofs that do not apply to these languages; the exponential time just doesn't happen. Yes, backtracking is time-costly, but it isn't applied universally. In the COBOL compiler we ran into an ambiguity problem. We copied the DEC-10 syntax, including extensions DEC made, and ran into the problem with the record contains clause.

File_contains_clause = "FILE" "CONTAINS" number ("RECODS" | "RECORD"|.EMPTY);

Allowing "RECORD" was a DEC extension. The problem is the a Record_contains_clause starting with "RECORD" "CONTAINS" ...

You can probably see the problem. .EMPTY makes it optional that the File_contains_clause end with "RECODS" | "RECORD". .EMPTY is always successful. So when a Record_contains_clause follows the "RECORD" of the Record_contains_clause is taking as part of the File_contains_clause and a syntax error resulted. The soloutine took about 10 minutes to have a new fixed runing compiler in place. We changed

File_contains_clause = "FILE" "CONTAINS" number ("RECODS" | -Record_contains_claus "RECOD" | .EMPTY);

CWIC is not a parser generator; it is a complete compiler development system, including a powerful LISP 2 generator action language that can process the trees, performing optimizations before producing code. CWIC was however limited to generating 8-bit-byte numeric code. It ran on an IBM 360 system. Target machine instructions were put together using arithmetic expressions and were not easy to read. One of my goals in developing SLIC was to improve instruction generation readability.

That is basically the extent of the changes in SLIC. SLIC has three additional sub-languages. In the MACHOP language one codes an assembler of sorts. A machine operation is defined by a MACHOP; assembly data-allocation pseudo ops (BYTE, WORD, STRING or ASCII etc.) may also be defined. The opcode mnemonic and instruction operands are specified first; the opcode mnemonic may use the vector notation used in the CWIC example above. Instruction fields and attributes may then be specified. A MACHOP is an executable procedure that outputs bit sequences: an instruction is programmed as a sequence of bit fields, conditionals may be used to select the presence of a field or a different field sequence, and conditional expressions may be used in programming a field value. Since the goal is code generation for any target machine, there is no addressable unit other than the bit; that is to say, code is generated into a bit-addressable memory space. A .MORG statement is used in a MACHOP to align to an addressable boundary of the target machine. A secondary function is formatting the numeric output of machine instructions into the program listing: a compile option prints the generated assembly in the program listing, with the numeric output shown on the left of the assembly instruction. This was a great help in debugging code generation.

Assembly output alone doesn't get you to the ideal compiler; a PSEUDO operation language is the in-between. The generator language produces pseudo code instructions. I designed pseudo instructions to best fit the language, though they could be more general. A program does not consist of just instructions, and it is usual to separate code at least into data and machine instructions. Named code sections allow code to be separated by function. In CWIC a section was a block of memory into which 8-bit bytes were planted (plant is the term used in CWIC for emitting code sequences). Flushing a section, using a .FLUSH <section> statement, wrote out the code block; addresses in the code had to be resolved before flushing. In SLIC a section contains a pseudo list: pseudo instructions are appended to the list as they are planted, and the list is executed when the section is flushed. Pseudo instructions are procedures coded in the PSEUDO language, a subset of the LISP 2 generator action language; pseudo procedures call MACHOPs to output binary machine code. Unresolved address expressions create expression trees, and once all references are defined a polish fix-up block is output. An in-sequence optimizer was planned that would operate on the pseudo lists when they were flushed; the list is cleared when flushed. The generator and pseudo languages can create generated symbols, or gen_sym symbols. These symbols are local, unique in time and to the procedure creating them. They start with a % followed by a letter and number (%g1, %g2) and are used in actions and pseudo instructions to address branch points etc.
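As a sketch of the section, plant and flush idea only (illustrative Python, not SLIC's actual structures), planted pseudo instructions are deferred calls appended to the section's list, and flushing executes the list in order and clears it:

# Sketch of SLIC-style sections: planting appends a pseudo instruction
# (a procedure plus its arguments) to the section's pseudo list; flushing
# the section executes the list in order and clears it.

class Section:
    def __init__(self, name):
        self.name, self.pseudo_list = name, []

    def plant(self, pseudo, *args):
        self.pseudo_list.append((pseudo, args))

    def flush(self):
        # an in-sequence optimizer could rewrite self.pseudo_list here
        for pseudo, args in self.pseudo_list:
            pseudo(*args)
        self.pseudo_list = []

def p_load(reg, sym):                  # a pseudo instruction; the real thing
    print(f"  LD  {reg},{sym}")        # would call MACHOPs, not print()

def p_add(reg, sym):
    print(f"  ADD {reg},{sym}")

code = Section("code")
code.plant(p_load, "r1", "C")
code.plant(p_add, "r1", "D")
code.flush()                           # emits the two instructions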

I do not think this is the technology you mentioned your master's thesis was on.

As a summary.

The CWIC grammar language has three rule types: syntax, token, and character class.

A character class defines a named multi-literal that matches a character belonging to the class. Character classes in SLIC were used to generate a class table indexed by a character's numeric code; each class table entry is a set of ordered bits defining the character's class memberships.

bin:  '0' | '1';
oct:  bin | '2' | '3' | '4' | '5' | '6' | '7';
dec:  oct | '8' | '9';
hex:  dec | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'
          | 'A' | 'B' | 'C' | 'D' | 'E' | 'F';

Each class introducing characters is assigned a bit. A class's bit mask contains the class bits of the classes it includes and its own unique bit if actual characters are included in its definition.

A class like alphanum, containing only the alpha and dec classes, would contain only those classes' bits.

Token rules define symbols, strings, numbers etc. In CWIC a token rule first skipped skip_class characters. SLIC token rules skip skip_class characters differently: it is simpler to just say a token can start with a skip_class character. The rule skips skip_class characters until it finds a match, or fails on a non-skip_class character that doesn't match. An example would be a symbol starting with a line_break, a discarded positive match not kept as part of the symbol.

I really do not see the advantage of elaborate backtracking or packrat parsers when backtracking usually isn't required except on a very few language constructs.

Do you really take the position that for one single language phrase, as in the COBOL example, we need the backtracking overhead applied to the whole parsing process?

TREE-META added backtracking, using a <- on the first alternative, some time between 1968 when it was first released and 1974, the date of the later document I recently found.


You (Rp) said my idea on the relationship between generative and reductive grammars is different from how You, Arthur Rubin and standard compiler technology textbooks see it.

The only recent definition I have found is in the 2006 McGraw-Hill technical terms dictionary. The others are in old, pre-1975 metacompiler documents, where the grammar language is said to be reductive or analytical. I took that to imply reductive and analytical mean the same thing. I have changed my view on that: as I explained, reductive applies to reducing the whole into smaller and smaller, easier-to-handle problems. It is a specific form of analysis.

I never meant to express that a reductive grammar reduces a language to a start symbol as you explained. That is not how they work. I think I have explained them above.

Forget Chomsky for the moment. The McGraw-Hill Dictionary of Scientific & Technical Terms, 6E, 2003, defines reductive grammar (computer science) as: A set of syntactic rules for the analysis of strings to determine whether the strings exist in a language. [2]

Do I have to explain why I called BNF a reductive grammar? I am not sure I did. I would not say it is now.

I was talking about its use in the ALGOL 60 report, a point of history. BNF as used in the ALGOL 60 report had the | alternative operator; it was not added in EBNF. There is no real proof that John Backus knew of Chomsky at the time he presented BNF to the IAL. His original comment named Emil Post as his source; then it changed. It seems John was being very evasive. It is just as likely it came from an internal IBM source. The notation is not much different from the boolean logic equations used in computer circuit designs published by IBM engineers around that time. Being a compiler developer, he would have been working with engineering. In those papers rule names were referred to as boolean variables, and originally in BNF as linguistic variables. He may not have been able to talk about IBM engineering tools; non-disclosure employment agreements were not uncommon. It was Peter Naur who edited the ALGOL 60 report; he is the one who changed the draft to use BNF. It is really unclear who exactly is responsible for how it was used in the report, i.e. how much did Naur contribute to BNF when changing the report? The story is that John Backus brought some notes to a meeting illustrating BNF, and Naur took them and did the editing. From Naur's description of BNF, the meta symbol ::= is to be interpreted as "is defined as"; "is defined as", not "is replaced by".

BNF as used in the ALGOL 60 report is strictly a metalanguage. Look at the commenting specification in section 2.3 of the revised ALGOL 60 report.

; comment <any sequence of characters not containing ;>  ≡  ;

begin comment <any sequence of characters not containing ;>  ≡  begin

end <any sequence not containing end or ; or else>  ≡  end

The above is in the ALGOL 60 report in table form; I have just put it in rule form, using ≡ to express the report's "is equivalent to" replacement (productive grammar form). In that form we do not have a context-free Chomsky grammar rule: there are multiple symbols on the left. Can you say that <any sequence of characters not containing ;> is a symbol? No! We instead have a text description. It is a description like the $(-';' .any) that is used in CWIC to skip over unrecognized input text and comments. Why was it in table form? Why not just use BNF rules for specifying comments? The simplest explanation is that ::= did not fit with the "is defined as" definition. So in the report it appears as:

The sequence of basic symbols: is equivalent to
; comment <any sequence not containing ;>; ;
begin comment <any sequence not containing ;>; begin
end <any sequence not containing end or ; or else> end
By equivalence is here meant that any of the three structures shown in the left hand column may be replaced, in any occurrence outside of strings, by the symbol shown in the same line in the right hand column without any effect on the action of the program.

There we have a production rule. Or is it more a reduction? Maybe an elimination rule?

One should study the original ALGOL 60 documents.

editing point


So what is a reductive grammar?

I remembered mathematical reductions used in solving several types of problems. There is a common concept between mathematical reduction solutions and these reductive grammars: both simplify a complicated problem by reducing it into simpler parts, and reductions may be applied repeatedly in stages. That is how these grammars work. You start with a rule expressing the problem, in this case a program rule. The program rule expresses the top program structure: COBOL's divisions, for example, while C and C++ programs are a sequence of declarations.

An analytical reductive grammar describing context-sensitive constructs does not follow the Chomsky format. Chomsky grammar rules are of the form:

<pattern> → <replacement>

whereas reductive analytical rules are of the form:

<name> <defined as operator> <pattern>

<name> is descriptive, naming a language structure described by the <pattern>. The word formation may be talking about a pattern: a marching formation, an aircraft formation. A term in an arithmetic expression has a definition. These rules are similar to a phrase structure grammar. Formation may also refer to structure: formation, structure or pattern can mean the same thing, and a rock formation is an ambiguous phrase as it could refer to either. I do not dispute BNF being a Chomsky production grammar as defined and used today; it does not fit reductive grammar either. But it is more like a reductive language specification as used in the ALGOL report.

Rp, your long description was that the new syntax is syntactic sugar, which I have been snide about, and you claim that it does not add any expressive power, only more conciseness; that what is reductive is not the BNF rules themselves but the parsers automatically generated from them; and that this was the innovation brought about in 1962, apparently by META II: the automatic generation of parsers from grammars.

The example from the META II paper I went over above shows the above conclusions to be wrong. The point that you are missing is that the parser is programmed in the language. Because it is a programming language, left recursion is not allowed: an infinite loop would result. The solution is the zero-or-more sequence loop operator. Your syntactic sugar discussion did not include grouping. Why not? Grouping and the sequence operator accomplish the same parse. Your ambiguity problem is solved with grouping and the alternative operator you said to be syntactic sugar.

Another problem is ambiguity along the way: when doing top-down, left-to-right parsing, these rules

 A :== B C | D
 D :=  B E

Are simply bad code and should be written:

 A :== B (C | E)

No problem there. Syntactic sugar factoring to avoid ambiguities is the sweet solution to backtracking.

CWIC has three stacks: the call stack, the node stack, and the parse stack. The parse stack is commonly called the star stack, as *1, *2 etc. address parse stack entries. The *<number> notation is carried over from META II, which used the same notation to address recently parsed entities. The node stack holds node objects created by the :<node name string> operation.

expr = term (('+':ADD|'-':SUB) term!2)*;

In BNF expr might be written as:

<expr> :== <term> | <expr> <addop> <term>
<addop> :== + | -

The CWIC expr rule, having no left recursion, does what takes two BNF rules to express, and more, in getting to an abstract syntax tree.

end 4


They may even return through multiple levels in backtracking. A backtrack '\' alternative operator, on beginning its left goal, saves the parse state on the call stack. Multiple backtracks may exist, and a backtrack failure may return through multiple levels of rules.

expr = term1 $(('+':ADD / '-':SUB) term2!2);

In the above, if the term2 call fails a backtrack failure will occur. If no backtrack alternative has set a backtrack point, the outer backtrack block placed on the call stack before the driving rule is called will catch the backtrack failure and issue an internal compiler error.

Backtrack stack entries are chained backward to previous backtracks. An initial compile-fail backtrack catches a backtrack failure that has no backtrack alternative and reports a compiler failure. A backtrack failure occurs once a sub-goal is successful and a following sub-goal fails. Backtracking can return through any number of call levels. Backtracking is nested, the outermost backtrack being created before the driving program rule is called; this outermost backtrack does not actually backtrack to a saved state, it catches a programming error.

Like a PEG, these languages take the first alternative matched, not the longest. They are not packrat parsers; they only have one backtrack path.

edit 05


The following is part of a compiler-compiler's grammar. CWIC had separate compilers for each of its languages. SLIC was a single compiler but used language section declarations:

.SYNTAX
.GENERATOR
.PSEUDO
.MACHOP
.ISO

The sub-languages were divided into sections using the above language section headers. One could have multiple sections of a type. Most times sections were compiled separately and linked together.

The syntax below is similar, updated to modern character sets and with the language section declarations removed.

 program =  $declarations; // empty files produce no errors.

 declarations = "#" dirictive
              | comment
              | global	  	  DECLAR(*1)                 
              | pseudo_op	  PRODUCTION(*1) 
              | emitor_op	  MACHOP(*1)
 //           | optimizer	  ISO(*1)
              |(id (grammar	
                   |sequencer	  GENERATOR(*1))
                 \ ERRORX("!Error") garbol);
 
 grammar = (':' class  :CLASS   // character class define
          |".." token  :TOKEN   // token rule
          |"==" syntax :BCKTRAK // backtrack grammar rule
          |'='  syntax :SYNTAX  // grammar rule.
           )';'!2   PARSER(*1)  // Combine node, name, and
                               // rule tree. Call PARSER(*1)
                  	$(-break -"/*" .any) ;
 
 comment = "//" $(-break .any) 
         | "/*" $(-"*/" .any) "*/"
 
 sequencer =	 +[ transform $( '|' transform) ]+';'
 
 transform = '(' +[(unparse $(',' unparse)) | -- ]+ ')'  '->' action:TRNS!2;
 
 char_lit = ''' .any ''' | number;
 
 class = +[ char_lit $('|' char_lit) ]+:CLASS!1;
 
 string = '''.any ''' | '"' $(-'"'.any|"""""",'"') '"';
  
 token = char_sequence (gen_call|--) $('|' char_sequence (gen_call|--));
 
 char_sequence = (char_match $char_match 
                | '(' char_sequence ')') $('|' char_sequence);
 
 char_match = id {type==class} | string_lit;
 
 char_code = ''' .any ''' | number;
 
 bindgt:  '0'|'1';
 
 octdgt:  bindgt | '2'|'3'|'4'|'5'|'6'|'7';
 
 dgt:     octdgt |'8'|'9';
 
 hexdgt:  dgt |'a'|'b'|'c'|'d'|'e'|'f'
             |'A'|'B'|'C'|'D'|'E'|'F';
 
 number .. ("0H"|"0h") hexdgt $hexdgt MAKEHEX()
         |("0O"|"0o") octdgt $octdgt MAKEOCT()
         |("0B"|"0b") bindgt $bindgt MAKEBIN()
         | dgt $dgt MAKEINT();
 

The name of a syntax rule appears at the start of the rule followed by the = symbol. The right-hand side of a rule indicates which entities are to be recognised to achieve this goal. The recogniser achieves its main goal by looking for these smaller entities, from left to right, as individual subgoals. These subgoals are themselves literals or other named rules. The process of recognising a complete program becomes a matter of looking for progressively smaller entities, right down to the level at which basic tokens are recognised.

It starts with a top rule reducing complex language structures into simpler and simpler structures down to atomic elements. I think reductive or analytical may be explained by the above.

edit 06

edit

I had not studied the earlier metacompilers until recently. Around 1994-95 I found documents on the web. As far as I can find in various compiler-compiler documents, only the Schorre line of compiler-compilers define themselves as metacompilers. In META II, TREE-META, and CWIC a metacompiler is defined as a compiler whose input is a metalanguage program. These old compiler-compilers refer to their syntax language as a reductive grammar and thus should have a description other than the simplistic McGraw-Hill scientific terms dictionary one. I am more frustrated that this more advanced metacompiler technology developed at SDC has been ignored:

"The ideal compiler-compiler takes a description of a programming language and a target instruction set architecture, and automatically generates a usable compiler from them. In practice, the state of the art has yet to reach this degree of sophistication." compiler-compiler

CWIC has front and back ends that are compiled separately. They could be combined, if carefully designed, in different combinations of language parser and code generator. In the syntax language you program a parser of a language and its transformation to an abstract syntax tree.


SLIC is probably as close to the ideal as practical. SLIC translates a programming language into machine code in five stages (or phases) of transformation. Each phase is coded in a specialized programming language, or sublanguage, as SLIC is a single compiler program.

The first phases are the parser and code generator. These first two phases transform the input source code into sequential instructions. They are almost identical to CWIC's SYNTAX and GENERATOR languages, the major difference being the type of code generated. SLIC's generator language generates PSEUDO code lists where CWIC generated byte sequences into memory blocks. The pseudo instructions are optionally processed by in-sequence optimization transforms. PSEUDO instructions are coded in the PSEUDO language and output machine code using MACHOPs. MACHOPs in use appear to be assembly code; however, they are functions coded in the MACHOP language. One might mistake a PSEUDO instruction for an assembly macro.

The target machine instruction output is coded in the final two phases and can be separately compiled and linked with language parsing and sequencing phases.

In theory, using a common pseudo instruction set, we could for example have five separate language front ends (SYNTAX and GENERATOR) coded and compiled, and ten different processor instruction sets defined in the MACHOP language, with pseudo instructions for each target processor. Linking each language front end with each target processor back end you would get fifty compilers. Many optimizations can be, and are better, done in the generator language. The ISO language has not been implemented. To be fair, CWIC is able to accomplish the same separation, the division being between the syntax and generator languages. The advantage of SLIC is its readability and fewer actual code changes. One would expect to find multiple instances of a machine code sequence that could be coded as a single pseudo instruction. The function of a PSEUDO instruction is well defined. In theory a PSEUDO instruction's operation is specified and can be independently implemented. This enables them to be coded independently. Targeting different instruction sets is separated from the more complex sequentialization operations performed in the generator language.

  • Parser phase: The parser, coded in the SYNTAX language, is a set of grammar-rule-like test functions consisting of parsing (boolean) test expressions that return success or failure. The parser transforms input source code into abstract syntax trees using tree constructor node (:<node symbol>) and branch (!<branch count>) operators that manipulate pushdown node and parse stacks. Generator rule test functions are called from syntax rules, passing them parse stack entries such as abstract syntax trees. Lexical elements of the language (symbols: words, numbers, and strings) are recognized by token rule test functions. Token rules, if successful, convert the recognized character sequence into an object and push it on the parse stack. The parser starts with a top driving rule defining a program:
program = $((?(".END"/".end") .BREAK / declaration)\ERRORX["Syntax error"] $(-';' .any) ';') end_statm;

? and - are look-ahead operators. .BREAK breaks out of a loop. $ is the zero-or-more loop operator. The \ is the backtracking alternative (or) operator; / is the non-backtracking alternative (or) operator. They cannot be used together at the same level in an expression. The rule processes declaration(s) until it finds ".end" or ".END", breaks out of the loop, calls end_statm and returns. If a declaration failure occurs, the backtracking alternative's ERRORX is called to flag the error. $(-';' .any) ';' skips characters that are not a ';', matches the ';', and the declaration loop continues.
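
A rough Python sketch of that error-recovery step, assuming the parse state is just an index into the source text (the function name is mine, for illustration only):

 def skip_to_semicolon(text, pos):
     # $(-';' .any) ';'  -- skip characters that are not ';', then match the ';'
     while pos < len(text) and text[pos] != ';':
         pos += 1                      # -';' .any : consume one non-';' character
     if pos < len(text):
         pos += 1                      # match the ';' itself
     return pos                        # the declaration loop continues from here

 # skip_to_semicolon("int x  3; int y;", 0) -> 9, the index just past the first ';'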

edit 20

edit
  • GENERATOR: A generator consists of a set of unparse => action transform pairs that transform the abstract parse tree into PSEUDO code lists. TREE-META includes a primitive form of unparse action rules. CWIC and SLIC have named sets of unparse rules whose actions are coded in a LISP 2 dialect. Unparse rules may call generator functions to process sub-trees, assigning the returned objects to local variables.
Gen_example[ADD[x,y]] => action
[SUB[x,y]] => action
[MPY[x,y]] => action
[DIV[x,y]] => action
•••

The above illustrates tree matching rules. ADD and SUB nodes are matched in the unparse rules, assigning branches to the local variables x and y. The actions are performed with x assigned the left branch and y the right. The branches may also be passed to generators in the unparse rule and the return values assigned to local variables.

Gen[ADD[Gen[x],Gen[y]]] => action
[SUB[Gen[x],Gen[y]]] => action
[MPY[Gen[x],Gen[y]]] => action
[DIV[Gen[x],Gen[y]]] => action
•••

In the above, Gen is called with the tree branch where it resides, and the return value is assigned to the variable appearing to be its argument. In this way the unparse rule crawls down the tree branches. The returned x and y should contain evaluated sub-trees. In an interpreter they would be numeric values; otherwise they would be a symbol, a constant, or a (%T<number>) gensym symbol that would be assigned a target-machine-specific temporary during pseudo code expansion.
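
A loose Python analogy of how such a generator crawls the tree, assuming trees are lists whose first element is the node name and taking the interpreter view (numeric results; gensym temporaries omitted):

 def gen(tree):
     # An atomic object (number or symbol value) evaluates to itself.
     if not isinstance(tree, list):
         return tree
     node, left, right = tree              # e.g. ['ADD', x, y]
     x, y = gen(left), gen(right)          # crawl down both branches first
     if node == 'ADD': return x + y
     if node == 'SUB': return x - y
     if node == 'MPY': return x * y
     if node == 'DIV': return x / y
     raise ValueError('unknown node ' + node)

 # gen(['SUB', ['ADD', 2, 3], 4])  ->  1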

The vectored shortcut notation below illustrates vector usage, an original CWIC feature.

:Gen[#N[Gen[x],Gen[y]]] => #O;

::#N = ADD,SUB,MPY,DIV;
::#O = x+y,x-y,x*y,x/y;

A vector list #N of nodes and #O the corresponding operation vector.
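
In Python terms the #N/#O pair amounts to a table keyed by node name; the sketch below folds the two parallel vectors into one lookup (again only an analogy, not CWIC notation):

 import operator

 # #N and #O as one table: node name -> corresponding operation.
 OPS = {'ADD': operator.add, 'SUB': operator.sub,
        'MPY': operator.mul, 'DIV': operator.truediv}

 def gen_vectored(tree):
     if not isinstance(tree, list):
         return tree                       # atomic object
     node, left, right = tree
     return OPS[node](gen_vectored(left), gen_vectored(right))

 # gen_vectored(['MPY', 6, ['ADD', 1, 2]])  ->  18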

edit 30

edit
  • PSEUDO: The pseudo language is a subset of the generator procedural action language. PSEUDO instructions call MACHOPs to output code to an object code file. One might look at them as assembly macros; however, they are compiled to executable code and have considerably more ability, being able to access symbol attributes, section variables, and global variables. PSEUDO code generation appends a pseudo instruction to a section. Sections are declared. Besides the pseudo instruction list, a section contains several properties and declared section variables that can be used during pseudo expansion. ISOs are specific to a section. Sections are used to separate code types: one might declare separate code and data sections, and sometimes a constant section is useful; ISOs can easily combine equivalent constants. A PSEUDO list is executed when a section is flushed. The flush statement is part of the action language of a generator, as in the sketch below.
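
A very loose Python sketch of the append-then-flush idea (Section, plant and flush are my illustrative names, not SLIC syntax):

 class Section:
     def __init__(self, name):
         self.name = name
         self.code = []                      # the appended pseudo instruction list

 def plant(section, pseudo_fn, *args):
     section.code.append((pseudo_fn, args))  # append; do not execute yet

 def flush(section):
     for pseudo_fn, args in section.code:    # executing the list expands each pseudo
         pseudo_fn(*args)                    # instruction, which in turn calls MACHOPs
     section.code.clear()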

edit 40

edit
  • MACHOP: The target machine instructions are coded in the MACHOP language. A MACHOP defines a target machine instruction or a family of instructions. Assembly syntax is specified, and procedural code directs the production of the instruction as a sequence of bit fields in a bit-addressable memory space. The compiler list output may optionally include generated assembly code. The MACHOPs specify the numeric formatting of generated code in the assembly-listing output. Assembly output is sometimes useful to the programmer using the compiler; it is of great help in debugging the compiler. A rough sketch of the bit-field idea follows.
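
This is not the MACHOP notation, just a generic Python illustration of assembling an instruction from sized bit fields (field widths and values here are made up):

 def pack_fields(fields):
     # fields: list of (value, width_in_bits), most significant field first.
     word = 0
     for value, width in fields:
         assert 0 <= value < (1 << width)   # the value must fit its field
         word = (word << width) | value     # append the field's bits
     return word

 # e.g. an 8-bit opcode, a 4-bit register field and a 20-bit address field:
 # pack_fields([(0x3A, 8), (0x2, 4), (0x0100, 20)])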

edit 50

edit
  • ISO: The optional ISO (In Sequence Optimization) phase transforms PSEUDO instruction code to PSEUDO instruction code.

The ISO is an optional phase run on pseudo instructions before their expansion to machine code.

edit 60

edit

A total of five stages, each stage a phase in the translation process. A phase is not a pass. The phase switching is syntax directed, calling a generator function to process a SYNTAX-created tree. PSEUDO instructions are planted (CWIC terminology; in SLIC it means appending the instruction to a section's code list). The PSEUDO code list is executed when a flush <section name> statement is encountered while executing a generator function.

TREE-META goal

edit
A top-down method of syntax analysis is used in the TREE-META system. The main goal is to match the complete program with the Syntax Rule whose name follows .META at the head of the program. The name of the Syntax Rule appears at the start of the rule followed by the = symbol. The right-hand side of this Syntax Rule indicates which entities have to be recognised to achieve this goal. The recogniser achieves its main goal by looking for these smaller entities, from left to right, as individual subgoals. These subgoals are themselves defined in terms of other entities, and so the process of recognising a complete program becomes a matter of looking for progressively smaller entities, right down to the level at which basic symbols such as characters and numbers are recognised. TREE-META

The above applies to all Schorre-based metacompilers: META II, TREE-META, CWIC and many others.

edit 80

edit

A computer is a programmable data processor that executes instructions stored in memory. All instructions and data in a computer are numeric and held in numerically addressed locations. In order to utilize a computer one must be able to describe what functions it is to perform, a process known as programming, and a very laborious, time-consuming task in the numeric code languages of computers. Mnemonic instruction coding was the first step: each computer instruction's operation was assigned a mnemonic name while other instruction parameters remained numeric. Eventually assemblers were developed, followed by problem-oriented languages. FORTRAN was originally developed for scientific computing. FLOW-MATIC was the first data processing or business-oriented language. The IAL group was formed to define an International Algebraic Language, ALGOL 58, which led to the ALGOL 60 report, which used a new specification metalanguage called BNF. A number of compiler-compiler systems using BNF were described in ACM publications. META II, the first compiler-compiler to use meta in its name, defined itself as a computer program that takes as input a metalanguage. Successive metacompilers have stated that very same concept of taking a metalanguage as source code. The metalanguage was said to be similar to BNF; however, it was not BNF. The metalanguage is more algebraic, including grouping and alternative constructs. Looping is also a feature, using a Kleene-star-equivalent zero-or-more operator. The metacompiler metalanguages resemble a grammar. They can be interpreted as a grammar by ignoring the object-creating functionality and excluding the structure-transforming operations.

parse it

edit

Parsing#Computer languages#Parser

Parsing#Parser

Scannerless parsing

Since I brought up the question of reductive grammars I have tried to explain them in linguistic terms. The reason I brought it up is because the term is used in historical papers on metacompilers. Maybe reductive has more to do with how they are used. In the META II document the language is described as BNF-like. In subsequent Schorre metacompiler documents the language is described as a "reductive or analytical" grammar. The questions are:

  • What makes them reductive or analytical?
  • What did that mean in 1968?
  • Is it right to call them a grammar?

The syntax language or grammar in question is first a programming language used to program scannerless parsers. These languages are part of a compiler-compiler that actually produces complete, functioning, executable compilers.

A programming example, writing a set of parser rules that transform a simple mathematical expression into an abstract syntax tree, may help. The function of the parser is to analyze the input character stream for correctness and transform it into an internal structure, an abstract syntax tree, maintaining the original intent.

The input is best described as a character stream. Input is processed one character at a time. The stream file I/O supports marking positions, buffering the input back to the oldest active backtrack or backup point. A backup is used in token processing, where only the input state need be reset on failure. Backtracking is much more complicated, restoring a previous state and undoing a partial structure match that created objects which have to be undone.

The syntax language described here is based on the syntax language of the CWIC-360 compiler-compiler system with a few changes. The Kleene star * notation is used, replacing CWIC's $ zero-or-more operator. The alternative (or) operator / is replaced by |.

In researching, parsing expression grammars have come up. Checking them out I did find some common things of interest. Like PEGs, the metacompilers evaluate alternatives in order.

There are three types of grammar rules. Syntax rules analyze the language structures at the word level and above. Token rules analyze characters and create tokens. Token rules reference character classes defined by character class rules.

Character class rules define constants. They are pseudo rules that define a list of characters to be members of a character class. They are not functions, as syntax and token rules are. A character class test determines a character's membership in a class using its ordinal as an index into the class map table; bits at that ordinal location denote class membership. A character class rule creates a class mask that is tested against the class map table. A class having literal characters is allocated a bit, which is set in the class map entries of those characters. The class mask includes the bits of any included classes plus the allocated bit.

Token rules are simple. A token rule is used to match a character sequence. The string rule, for example, defines the lexing of a string constant and calls a built-in function that creates a string object. String constants used in rules are not normally kept; the string rule itself uses string constants.

string .. ('''.any''' | '"' (~'"'|"""""",'"')* '"') MAKESTR();

The above rule defines a single-character string enclosed in single quotes, or a general string bounded by double quotes. In case a " character is desired in the string, one may use two "" that will be replaced by one ". The """""" test in the string rule is an example of the rule being applied to itself. The insert operator ,'"' inserts the single character " into the token. The character string tests """""" and '"' are not copied to the token. .any matches any single character, failing only at the end of input. Token rules create objects. By default, when no interceding function is called, the token is cataloged in the dictionary or symbol table. An object has a handle or pointer. When a token rule starts, initial skip_class (white space) characters are ignored. The default skip_class contains all white space characters, including line breaks and other non-printing characters; it may be redefined, overriding the default. Token rules are limited to character matching (character class rules or literals) and may not call other rules, except that as their last operation a generator function may be called. The id token rule defines a symbol that will be cataloged in a symbol table.
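
A small Python sketch of the doubled-quote convention the string rule implements (the function name is mine; skip_class handling is omitted):

 def scan_string(text, pos):
     # Assumes text[pos] is the opening '"'. Returns (string_object, new_pos)
     # or None on failure. Two consecutive '"' inside become one '"'.
     assert text[pos] == '"'
     pos += 1
     out = []
     while pos < len(text):
         if text[pos] == '"':
             if pos + 1 < len(text) and text[pos + 1] == '"':
                 out.append('"')               # "" -> insert a single " (the ,'"' insert)
                 pos += 2
                 continue
             return ''.join(out), pos + 1      # closing quote: make the string object
         out.append(text[pos])                 # ordinary character copied to the token
         pos += 1
     return None                               # unterminated string: fail

 # scan_string('"ab""c"', 0) -> ('ab"c', 7)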

id .. ltr (alpha_num|'_')*;

An identifier id is a ltr followed by a sequence of alphanumeric or '_' characters. What if we did not wish an id to end with a '_' character?

id .. ltr (alpha_num|'_' ?alpha_num)*;

The ? look-ahead operator allows testing without consuming the input, although one could accomplish the same by:

id .. ltr ('_' alpha_num|alpha_num)*;

Were you paying attention when it was explained that string constants are not kept as part of tokens? The above will create a symbol with the _ characters removed. If _ is to be matched and kept in the symbol it needs to be preceded with a +.

id .. ltr (+'_' alpha_num|alpha_num)*;

Now in the rule above the _ would be copied to the token buffer.

Character class

bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dgt: oct|'8'|'9';
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'
     |'a'|'b'|'c'|'d'|'e'|'f';

lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'
    |'h'|'i'|'j'|'k'|'l'|'m'|'n'
    |'o'|'p'|'q'|'r'|'s'|'t'|'u'
    |'v'|'w'|'x'|'y'|'z';

upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'
    |'H'|'I'|'J'|'K'|'L'|'M'|'N'
    |'O'|'P'|'Q'|'R'|'S'|'T'|'U'
    |'V'|'W'|'X'|'Y'|'Z';

ltr:  upr|lwr;

alph_num: ltr|dgt;

symch: alph_num|'_';

One of the first things we learn about evaluating mathematical formulas is that operators have an evaluation hierarchy: multiplication and division before addition or subtraction (ignoring exponentiation for simplicity). We can describe a mathematical expression as a sequence of terms that are added or subtracted.

expr = term (('+'|'-') term)*;

The expr rule defines a mathematical expression as a term followed by zero or more '+' or '-' terms, using the Kleene star * zero-or-more notation. We can think of expr and term as named language constructs. They are actually programmed test functions that specify a series of tests to perform on input characters, returning success or failure. One could think of test as a boolean type having the possible values success (true) or failure (false). A string or character constant is a test: '+' compares the input stream characters to the '+' character string constant. A test performs a match of its pattern on the input stream. If successful, the input is advanced over the matched characters.
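
As a Python sketch, such a string test might look like this (illustrative only):

 def string_test(text, pos, constant):
     # A test: compare the input stream at pos against a string constant.
     # On success return the advanced position; on failure return None,
     # leaving the caller's position unchanged.
     if text.startswith(constant, pos):
         return pos + len(constant)
     return None

 # string_test("a+b", 1, '+') -> 2 ; string_test("a+b", 1, '-') -> None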

A rule not only parses an input character stream but is also programmed to construct an abstract syntax tree, or AST, that is an equivalent representation of its input. In programming the AST transformations, operators are used to form parts of the AST. A term element may be thought of as a branch of the tree we wish to build. In writing the expr rule the expectation is that term will create a term object. We write the expression rule to form tree objects of terms as:

expr = term (('+':ADD|'-':SUB) term!2)*;

The '+':ADD sequence creates an ADD node when a '+' is matched by the parser; likewise for the '-':SUB sequence. It also expects that a term will follow. Failure to recognize the term results in rule failure, a backtracking failure, as a partial recognition of the Kleene-starred group has occurred. The sequence

('+':ADD|'-':SUB) term!2

will build a list whose first element is the most recent ADD or SUB node: [NODE,term,term]

a+b-c -> [SUB,[ADD,a,b],c]

     SUB
    /   \
  ADD    c
 /   \
a     b

The above operation: expr first calls term and a is recognized as a term. The * loop begins, recognizing a + and creating an ADD node. b is recognized as a term and !2 creates the [ADD,a,b] list, and the * loop repeats. The - is recognized and a SUB node created, then c is recognized and !2 again creates a three-element list, [SUB,[ADD,a,b],c].

The loop operation avoids the left recursion used in grammars, which would be a programming error in the parsing language. No different than in C or C++, a function calling itself when no state has changed causes a recursive loop and, having no termination condition, usually results in a stack overflow. These are first and foremost programming languages; a function immediately calling itself on entry is a programming error. The grammar alone could be considered a functional programming language, and indeed it has many traits of one.

The rule expr is a programmed boolean function (returning success or failure) that parses a mathematical expression and transforms it into a tree. The parsing language is a stack programming language. The ':' and '!' are stack operators; they operate on the node and parse stacks. The :node sequence creates a node and pushes it on the node stack. The !num sequence creates a tree structure of num branches and pushes it on the parse stack. The tree is created as a list of num+1 elements, popping the node stack into the first list element; then num parse stack entries are popped into the list. The grammar language includes token rules that create and push token objects onto the parse stack. Tokens are symbols, strings, or numbers of the source language; symbols are entered into symbol tables.
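
In Python the two stack operators might be modeled like this (a sketch, with trees as plain lists and my own function names):

 node_stack, parse_stack = [], []

 def make_node(name):                       # the :<node> operator
     node_stack.append(name)

 def make_tree(num):                        # the !<num> operator
     branches = parse_stack[-num:]          # pop num parse stack entries
     del parse_stack[-num:]
     tree = [node_stack.pop()] + branches   # list of num+1 elements, node first
     parse_stack.append(tree)               # push the finished tree

 # a + b - c:
 parse_stack.append('a')
 make_node('ADD'); parse_stack.append('b'); make_tree(2)
 make_node('SUB'); parse_stack.append('c'); make_tree(2)
 # parse_stack == [['SUB', ['ADD', 'a', 'b'], 'c']]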

Token rules, like syntax rules, may (no, should) have descriptive names, though we tend toward laziness, using short names such as "id" for identifier. Token rules are used to parse a lexeme, creating a token object. A token rule tries to match a character pattern to the input character stream, skipping leading skip_class characters not part of the pattern:
id .. ltr (alphanum|'_')*;

The .. is used to identify a token rule function. Like syntax rules, a token rule returns success or failure. The id token function will skip characters from the input stream as long as they are in the skip_class and not a ltr. If the first non-skip character is not a ltr, the input stream is reset to the state at id's entry and failure is returned. A token rule failure is a backup of the input stream to the rule's entry state. Backtracking, by contrast, is undoing a partial tree, releasing created objects, etc. Backtracking and backup are very different. All token rules back up to their entry state on failure. Backup only deals with file positions; there is no complicated releasing of created objects. By default a recognized token is cataloged into a symbol table. Conversion functions may be used to convert it to a numeric or string object instead; MAKEINT(), for example, will convert a digit sequence into an integer object.
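
A sketch of the backup idea in Python: only the input position is saved and restored, nothing else needs undoing (skip_class handling omitted; names are illustrative):

 def token_rule(text, pos, first_class, rest_class):
     start = pos                              # remember the entry state (backup point)
     if pos < len(text) and text[pos] in first_class:
         pos += 1
         while pos < len(text) and text[pos] in rest_class:
             pos += 1
         return text[start:pos], pos          # success: token text and new position
     return None, start                       # failure: back up to the entry state

 # letters = "abcdefghijklmnopqrstuvwxyz"
 # token_rule("x1 +", 0, letters, letters + "0123456789_") -> ("x1", 2)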


The referenced ltr and alphanum are defined using character class rules:

bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dgt: oct|'8'|'9';
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'
     |'a'|'b'|'c'|'d'|'e'|'f';

lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'
    |'h'|'i'|'j'|'k'|'l'|'m'|'n'
    |'o'|'p'|'q'|'r'|'s'|'t'|'u'
    |'v'|'w'|'x'|'y'|'z';

upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'
    |'H'|'I'|'J'|'K'|'L'|'M'|'N'
    |'O'|'P'|'Q'|'R'|'S'|'T'|'U'
    |'V'|'W'|'X'|'Y'|'Z';

ltr:  upr|lwr;

alphanum: ltr|dgt;

symch: alphanum|'_';

Character class rules define named literal sets. The class ltr represents all characters a thru z and A thru Z.

The parser language is intentionally non-optimizing; code is explicitly translated as written. The id rule as written above, "ltr (alphanum|'_')*", translates to two separate tests, for alphanum or '_', in a loop. Written as below, using the symch character class, it results in a single test.

id ..     ltr symch*;

Of note, efficient code might depend on the number of character classes. A byte-addressable machine could use a byte character class array map when 8 or fewer character classes, counting skip_class, are defined.

The following token rule defines the most common integer constants used in modern languages.

number .. ("0b"|"0B")            bin $bin MAKEBIN()  // binary integer
          |("0o"|"0O")           oct $oct MAKEOCT()  // octal integer
          |("0x"|"0X"|"0h"|"0H") hex $hex MAKEHEX()  // hex integer
          |                      dgt $dgt MAKEINT(); // decimal integer;

Most parsers of character-level grammars are nondeterministic





Likewise a term is a product of factors:

 term = factor (('*'|'/') factor)*;



Maybe it can only be answered by understanding their function as programming languages. So in order to get at the distinguishing properties of reductive grammars their programming is explained here.

These metalanguages are very similar to Parsing Expression Grammars. Many attributes are the same, though a better descriptive name would be parser-expressing grammar, as the grammar expresses the detailed programming of its parser.

The language described here and used in examples is functionally equivalent to the CWIC 360 metacompiler's syntax language. The alternative separator (or operator) is changed from / to |. The prefix $ (zero or more) operator is changed to the Kleene star (zero or more)* form.


As a programming language in which a parser is programmed, the language must follow rules making it possible to specify the words and symbols of the language and their correct organizational structure. In fact these languages define the structure of a valid program module. They start from a top driving rule. A file containing a C program is simply a collection of declarations: one declares the functions and variables of the program in any order, the only requirement being that an item be declared before its use. So the driving program rule for C is:

program = declaration*;

A program is a sequence of zero or more declarations. Most programming languages have an organization to their source files. A COBOL program is divided into specific ordered divisions.


Characters are the smallest pieces of data making up a language's source code. Character class rules define named literal constants:

bin:    '0'|'1';

The bin character class above is defined to be the digit 0 or 1. Binary numbers may only have the digits 0 and 1, 110 being the sum of 1×4+1×2+0×1, or 6 decimal. The character class only defines the valid digits of a binary integer. We can, using the bin class, define a binary integer:

binary  .. bin bin*;

The above token rule defines a binary as a string of one or more bin characters. The Kleene star operator repeats the second test zero or more times. The rule, however, is incomplete. By default a token rule returns a symbol: the parsed token is looked up in a dictionary (symbol table), or an entry is created if none exists, and the symbol is pushed on the parse stack. The parse stack holds items created in parsing the input source code.

There are three stacks. The parse stack is where the transform is created. All data is contained in objects; there are many object types defined in CWIC-based languages. CWIC was an extended combination of Schorre's META language and LISP 2, developed at System Development Corporation. If we wish the parsed bin string to be converted to an integer object before pushing it on the parse stack, we may call a conversion function as the last operation of the token rule.

binary  .. bin bin* MAKEBIN();

MAKEBIN() will convert the characters parsed into an integer object. The string must contain only 0 and 1 characters. MAKEBIN() is one of several token interceding functions that may be used. Interceding conversion functions replace the normal dictionary processing on successful termination of a token rule. MAKEBIN, MAKEOCT, MAKEHEX, and MAKEINT all perform numerical conversions to integer objects. MAKESTR, MAKEFLOT and others serve similar functions. Token rules skip skip_class characters ahead of the first character matched; skipping terminates once the first token character is matched. Quoted strings in token rules are tested against the input character stream but are normally not kept as part of a token. A + preceding a string literal will copy the matched string to the token. A - preceding a string or character literal negates the test result: a successful match is negated to unsuccessful and vice versa. A ? preceding a string test never advances over the matched characters; success or failure is, however, unchanged.
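
The interceding conversions themselves are simple. In Python terms they amount to something like this (the names mirror the CWIC/SLIC ones, but the bodies are my own sketch):

 def MAKEBIN(token): return int(token, 2)    # '101'  -> 5
 def MAKEOCT(token): return int(token, 8)    # '17'   -> 15
 def MAKEHEX(token): return int(token, 16)   # '1f'   -> 31
 def MAKEINT(token): return int(token, 10)   # '42'   -> 42
 def MAKESTR(token): return str(token)       # keep the scanned characters as a string object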

edit 01

edit
Token rule operators

In each case the quoted string is compared to the input stream; skip_class characters may be skipped when no input has yet been consumed by the rule.

  • quoted string (e.g. "--" or "0b"): a matched string is advanced over and not kept.
  • + (e.g. +'='): a matched string is copied/appended to the token buffer.
  • - (e.g. -"=<"): on return the parse state is unchanged and the returned success/failure state is reversed.
  • ? (e.g. ?'='): the parse state is reset to the initial state and the result of the string test is returned.
  • ~ (e.g. ~'"'): the test result is reversed; matches any character not the given one.
bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dgt: oct|'8'|'9';
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'
     |'a'|'b'|'c'|'d'|'e'|'f';

lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'
    |'h'|'i'|'j'|'k'|'l'|'m'|'n'
    |'o'|'p'|'q'|'r'|'s'|'t'|'u'
    |'v'|'w'|'x'|'y'|'z';

upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'
    |'H'|'I'|'J'|'K'|'L'|'M'|'N'
    |'O'|'P'|'Q'|'R'|'S'|'T'|'U'
    |'V'|'W'|'X'|'Y'|'Z';

ltr:  upr|lwr;

alph_num: ltr|dgt;

symch: alph_num|'_';

id ..     ltr symch*;

number .. ("0b"|"0B")            bin $bin MAKEBIN()  // binary integer
          |("0o"|"0O")           oct $oct MAKEOCT()  // octal integer
          |("0x"|"0X"|"0h"|"0H") hex $hex MAKEHEX()  // hex integer
          |                      dgt $dgt MAKEINT(); // decimal integer;

A string constant may be compared to the input stream. String constants are bounded by a " character. A " character within a string is formed using two consecutive " characters; """" forms the string of a single " character. A single character constant may be specified using the ' character: '/' is the single character /.

char .. ''' .any ''';

The .any above matches any character; only at the end of input (no more characters to read) does .any fail. The char rule will take any single character as a char. In the syntax language a single character is a string. Only character numeric codes or single character constants may be used in defining character classes.

char  .. ''' .any ''' MAKESTR();
string .. '"' (-'"' .any |"""""",'"')* '"' MAKESTR();

The .. rule is a token-making rule. A token rule parses a character sequence that by default creates a token class object. A token object may be a string, a numeric, or a symbol. A symbol is automatically found or created, cataloged, and its object pushed onto the parse stack.

These are programming metalanguages used to program a parser. The Schorre metalanguages were said to be BNF-like. They were described as reductive analytical grammars in documentation originally published between 1965 and 1975. As programming languages they are analytical; programming operators add additional transformational and analysis features.


There are three levels of grammar structural rules: syntax, token, and character class. A character class can be considered a literal constant. Usually the implementations of character class rules are combined to generate a class map array (or table). Testing a class mask against the class map, indexed by the character, yields non-zero if the character is a class member.

bin:    '0'|'1';

Assuming ASCII character codes and that bin is the first class rule defined, the bin class mask would be 0b00000010. (The predefined skip_class is always 0b00000001. A skip_class can be defined, overriding the default.)

oct: bin|'2'|'3'|'4'|'5'|'6'|'7';

oct includes the bin class and has the additional characters '2'..'7'. It is allocated its own bit, 0b00000100, so its class mask is the bin mask ORed with that bit: 0b00000010 | 0b00000100 = 0b00000110. A class that is only a combination of other classes would simply be those classes' masks ORed together.
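
A small Python sketch of the class map and mask mechanics described above (ASCII assumed, bit assignments as in the example; the names are mine):

 SKIP_CLASS = 0b00000001          # predefined skip_class bit
 BIN_BIT    = 0b00000010          # bit allocated to bin's literal characters
 OCT_BIT    = 0b00000100          # bit allocated to oct's extra characters

 BIN_MASK = BIN_BIT               # bin: '0'|'1'
 OCT_MASK = BIN_MASK | OCT_BIT    # oct: bin|'2'..'7'  -> 0b00000110

 class_map = [0] * 128            # one entry per ASCII ordinal
 for ch in ' \t\n\r':  class_map[ord(ch)] |= SKIP_CLASS
 for ch in '01':       class_map[ord(ch)] |= BIN_BIT
 for ch in '234567':   class_map[ord(ch)] |= OCT_BIT

 def in_class(ch, mask):
     return (class_map[ord(ch)] & mask) != 0   # non-zero -> member of the class

 # in_class('1', OCT_MASK) -> True ; in_class('9', OCT_MASK) -> False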

The term parse as used here refers to the testing or analysis of a bounded sequence. The parser takes as input a character stream and directs the testing of each character. On success the input is advanced to the next character. On failing a test an alternative may be tried. When all alternatives have been tried with no success, the input may be backtracked and alternatives from that point attempted. When no alternatives remain the parse fails. Backtracking may be nested. Nothing really special as far as function; it's the specifics of the programming in the parsing language that are special. Note there is no attempt to find the longest parse: alternatives are attempted in the programmed order written.

Each rule is a named parse function, a subroutine returning success or failure, where a parse is a sequence of one or more tests matching a pattern of words, symbols and/or characters. A sequence grouped in parentheses ( ) may be referred to as a parse. A token rule is a parse that recognizes a character sequence and creates a token object. A token object could be a string, number or symbol. A symbol object is a dictionary entry. Symbols may have attributes and values assigned to them; even a symbol table object could be assigned to a symbol's attribute. The only way I can explain these grammar programming languages is by example.

Token rules recognize character sequences using only character classes and/or literal strings. Literal strings are not kept as part of a token unless overridden by a copy or insert operator. ,'%' inserts the character % into the token. A +'%' tests that % is the next character, advances over it, and copies it onto the token. A single character string may be bounded by ', including the ' character itself.

character .. ''' .any ''' MAKESTR();

.any matches any single character. It only returns failure at the end of the input stream. The character rule defines the character constants used in its own definition.

The example will follow the parsing of an expression. But first, looking at just the expr rule and expectations when writing it:

 expr = term (('+':ADD|'-':SUB) term!2)*;

Simply expr is to look for a term followed by any number of '+' or '-' terms.

expr = term (('+':ADD|'-':SUB) term!2)*;
term = factor (('*':MPY|'/':DIV) factor!2)*;
factor = '(' expr ')' | id | number;

bin: '0'|'1';
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';
dgt: oct|'8'|'9';
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'
     |'a'|'b'|'c'|'d'|'e'|'f';


id ..     let (alph_num|'_')*;
number .. ("0b"|"0B")            bin $bin MAKEBIN()  // binary integer
          |("0o"|"0O")           oct $oct MAKEOCT()  // octal integer
          |("0x"|"0X"|"0h"|"0H") hex $hex MAKEHEX()  // hex integer
          |                      dgt $dgt MAKEINT(); // decimal integer;

Parsing: a/(b+2)-5

Step 1)  Input stream: a/(b+2)-5

Starting with the expr rule:

 expr = term (('+':ADD|'-':SUB) term!2)*;

a sequence of two tests:

 term

and

 (('+':ADD|'-':SUB) term!2)*

Parentheses () form a group test. The second test, a parenthesized group, specifies a sequence, or loop, by the '*' following it. An expr is a term followed by zero or more '+' or '-' terms. Note the Kleene star and alternative operator were different in the original metacompilers: $ was used for zero or more and / as the alternative operator, i.e.

 expr = term $(('+':ADD/'-':SUB) term!2);

A \ was used for the backtracking alternative operator. The first term test is called. If successful, the second sequence test will be attempted. As there are no alternatives specified, if no first term is found expr will return failure. If a first term is found the grouped indefinite loop test is attempted. The indefinite loop failing on its first test, ('+'|'-'), on any iteration is successful (zero or more: success on zero matched). However, when the first test of a sequence is successful and a following parse fails, the loop fails. Unless backtracking is used to catch this failure the rule will fail.

Step 2)  Input stream: a/(b+2)-5

Called from the expr rule, the term rule

 term = factor (('*':MPY|'/':DIV) factor!2)*;

first calls factor. This rule is quite like the expression rule in form. One might ask why not combine them. The answer is that we are specifying mathematical hierarchy: multiplication and division before addition and subtraction. Also, using a loop, a left-to-right evaluation is being specified.

 a/b*c

is evaluated as

 ((a/b)*c)

The rule could be written to specify right to left:

 (a/(b/c))

Step 3)  Input stream: a/(b+2)-5

The rule:

 factor = '(' expr ')' | id | number;

has three alternatives. The first alternative, a sequence of tests

 '(' expr ')'

calls cmpstr '(' , a string compare test that returns failure. cmpstr automatically skips leading skip_class characters. The next alternative

 id | number;

calls id, a token rule

 id  .. alpha symchar*;

that matches and creates an "a" symbol, returning success with symbol "a" pushed on the parse stack. Token rules also skip leading skip_class characters. The last alternative

 number

is not attempted. Alternatives are attempted in rule-specified order.

Parse stack: a

Step 4)  Input stream: /(b+2)-5

Returning to

 term = factor (('*':MPY|'/':DIV) factor!2)*;

the initial factor being successful, we are at the grouped sequence

 (('*':MPY|'/':DIV) factor!2)*;

the first test of which is a grouped alternative

 ('*':MPY|'/':DIV)

The first '*' string test, calling cmpstr '*', returns failure and the alternative '/':DIV is attempted. The '/' is matched and :DIV creates a DIV node and pushes it on the node stack. Having successfully matched

 ('*':MPY|'/':DIV)

factor is called.

Parse stack: a   Node stack: DIV

Step 5)  Input stream: /(b+2)-5

 expr = term (('+':ADD|'-':SUB) term!2)*;
 term = factor (('*':MPY|'/':DIV) factor!2)*;
 factor = '(' expr ')' | id | number;

Given the expression a/(b+2)-5

Why?

edit

I wonder why many of the ideas pioneered and perfected by these early metacompilers are so negatively dismissed. I have worked on and written many compilers, the later ones written in the 80s using yacc-generated parsers and code generation written in PASCAL or C. From experience, the later compilers were the harder to maintain.

Take scannerless parsing as an example. It is said of a scannerless parser:

  • "Since the lexical scanning and syntactic parsing processing is combined, the resulting parser tends to be harder to understand and debug for more complex languages."

That is simply not true of these languages, and I would question it in any case. If anything the scannerless way is simpler. Using a common notation for expressing tokens and language constructs is easier to learn. Syntax-directed lexing eliminates problems: there is no confusion as to the type of lexeme expected.

edit area

edit

Terminology has changed over the years. In my attempts to explain Schorre metacompilers I find many terms no longer have the meaning they had in the 60's. Rp, you look at these languages as grammars that are translated into a parser. I cannot argue that to be wrong, but it really does not describe them. They are programming languages, actually functional, procedural languages. They are not just a parser but are programmed to produce code directly or an intermediate abstract syntax tree.


The problem is that what I am trying to do is put a programming language into a grammar class. Maybe that is just wrong. But these programming languages specify the grammar of programming languages and a lot more.

As a grammar they are very much like a parsing expression grammar. Like PEGs, the choice operator selects the first match. They have look-ahead operators as well. In CWIC, backtracking and look-ahead are programming constructs. A backtracking form of alternative (choice) is used to allow programmed control of parser backtracking. They all compile to a scannerless parser. Early versions had built-in rules that recognize numbers, string constants, and symbols. Later versions have token rules. A token rule parses a character sequence; its default action is to create a symbol object and catalog it in the dictionary or symbol table, i.e.

id .. let alpha_num*;

let and alpha_num are defined by character class rules. The original CWIC language used $alpha_num instead of alpha_num* to mean zero or more alpha_num; I am taking the liberty of using the current zero-or-more Kleene star notation. Syntax and token rules are boolean functions returning success or failure. There can be only one rule of a given name. It is in a way a functional programming language. A rule (function) takes as input a character sequence (string or stream), tests it as programmed, and returns success or failure. On success the input is advanced over the matched characters; on failure the input is left unchanged. The CWIC metacompiler being described here provided symbol tables and a procedural (LISP 2-like) generator language. The rules produce objects. I would say these are pushdown or stack automatons, but as described here on wiki those do not describe their operations. Maybe a pseudo stack machine.

As explained, a token rule creates an object. It also pushes the object (pointer or address) onto the parse stack. There are three stacks: the call stack, parse stack, and node stack. The parse stack is where the abstract syntax tree is constructed. In general the expectation is that when a rule is successful it leaves the parsed language construct, transformed into a tree, on the parse stack. That is not a hard and fast rule, as the transforming is programmed and multiple entries or no entries could result. The node stack is used to hold tree nodes. The :<node name> creates a node object and pushes it on the node stack. As already noted, token rules push token objects on the parse stack. The !<number> sequence creates a tree having the given <number> of branches. The branches are popped from the parse stack into the tree. The tree is simply a list whose first entry is a node; the tree branches follow in the order they were recognized. The tree is then pushed on the parse stack.

expr = term (('+':ADD|'-':SUB)term!2)*;

In the above rule we expect term to leave a single object on the parse stack, expected to be an atomic object (symbol or number) or a tree object. The grouped construct (('+':ADD|'-':SUB)term!2)* specifies that zero or more + or - terms may follow. If on any iteration a ('+'|'-') is not matched, the zero-or-more loop terminates successfully. However, if after matching a + or - the call to term returns failure, a backtrack failure occurs. Backtrack blocks are on the call stack: the most recent backtrack block address is loaded into the call stack pointer and failure is returned.

It is more a grammar than a syntax, as it includes token rules: no separate lexer pass. The fact that it is a parser writing/programming language makes it analytical; one has to code from an analytical point of view. I think reductive comes from the top-down analysis. We use the language to define a valid program, or at least a valid module of a program. The input is program code in some programming language.

Though "reductive grammars you describe are the technology that parser generators and similar pieces of software actually use when describing and recognizing languages; the generative grammars currently described in this article do not correspond with this technology." is true. The systems in whose documentation uses the term reductive grammar are true compiler compiler. The reductive grammar is actually a programming language in which a parser is programed. The have transformation operators or constructs that operate on parsed enoties. These metacompillers advanced considerably from the first META II, TREE-META, CWIC to SLIC. META II the first documented Schorre metacompiler was vary simplistic producing stack machine instructions. TREE-META brought abstract tree transforms and unparse rules to the language. CWIC combined LISP 2 with Schorre metalanguages, added token rules and symbol tables. SLIC extended CWIC adding machine code specification and pseudo code languages.

The grammar was called the syntax language in these compiler-compilers.

The reductive grammars in question are actually programming languages. I think there is a distinctive difference between a parser generator and programming a parser in a grammar-like language. The parser generators I am familiar with do not specify their input like a programming language, though there are similarities.


Specifically, the programming META languages developed by D. Val Schorre. They have precise semantic interpretation. They define their grammar and semantics in their own language. The significance of that statement has been greatly misunderstood. I was at early ACM L.A. SIGPLAN meetings during some of their presentations. The "defining themselves in their own language" was totally about the ease of extending their capabilities. Schorre was the first to call his compiler a metacompiler, defining it as a metalanguage compiler, META II being a metalanguage compiled by his META II compiler.

Rp, you said: However, your idea on the relationship between generative and reductive grammars is different from how I, Arthur Rubin and standard compiler technology textbooks see it.

That is not true. Older books from the 1950s and 60s explain things quite differently. History is important. I also believe we haven't always chosen the best path; some old ideas were missed for various reasons. The industry has changed over the years. When I was in college, computer science was offered only at a few universities; vocational data processing classes were the norm. Things always change.

I was unable to describe the differences between productive and reductive grammars in linguistic terms. I went about that all wrong. You have touched on the main difference, that being the relation to a parser. But to be more specific, the systems documenting themselves as reductive languages are programming languages in which one writes a parser. They are documented as programming languages. The languages have semantic specification. To be specific, each rule is a boolean function returning success or failure.

I found that a PEG (Parsing Expression Grammar) finds the first matching sequence as opposed to the longest sequence. That is also the case with these metalanguages.

Edit ref 1

edit

So what is this difference? According to you, Chomsky's generative grammars and the reductive grammars used by the parser generators you have seen are two different kinds of things; you grant that there is overlap in what they do, but they are fundamentally different in nature. This is because Chomsky grammars take the grammar's start symbol and generate the strings of a language from it, while reductive grammars reduce the strings of a language until the start symbol remains. This is, in fact, exactly how the textbooks on formal language theory (the mathematical foundation behind parsing technology) describe it: except that instead of reductive grammars, they introduce automata as the means to reduce strings of languages, and they don't call it reduction but recognition of the strings of a language.

To be precise: the textbooks prove:

Initially, Chomsky considered unrestricted grammars to be a good basis for the description and machine translation of natural languages. He changed his mind when it was discovered that they have unrestricted descriptive power (any language that can be formally described at all can be described by an unrestricted grammar) and an undecidable recognition problem (it is impossible to create an algorithm that given an arbitrary unrestricted grammar creates a Turing machine, or any other kind of program, that given a string decides whether it can be generated by that grammar).

For context-sensitive languages, recognition is decidable, but hard (PSPACE-complete in general).

So these two variants were never seriously considered as the basis for parsing languages.

Description of BNF - by Saul Rosen

edit

The syntax of a language is a set of rules by means of which it is possible to determine whether any given string is a valid string in the language.[3]

The syntax rules define classes [4] of admissible strings and, in a language of any degree of complexity, will provide rules by means of which more complex classes can be built up from (defined in terms of) simpler classes. The semantics of a language is a set of rules that assigns meaning to valid strings in the language.

In BNF a class name is delimited by angle brackets. i.e. <class name>. A class name defines the format of a valid string. Class names are unique and descriptions of their meaning were given in textual descriptions in the ALGOL 60 report.


The metasymbols used in BNF are ::=, |, <, >. There are only four symbols defined here. The , and . are part of the metalanguage, English, in which we are describing the Backus Normal Form.

We write:
<digit> ::= 0|1|2|3|4|5|6|7|8|9

The metasymbols < > are used as delimiters to enclose the name of a class. The metasymbol ::= may be read as "is defined as" or "consists of". The | is read as "or". The above phrase defines the class digit. There are ten alternatives which satisfy the definition, and these alternatives are listed explicitly and separated by |. We could define a class <digit pair> as follows.

<digit pair> ::= <digit>,<digit>

Here <digit pair> is defined in terms of the previously defined class <digit>. The class name <digit> appearing to the right of the metasymbol ::= is read as "any member of the class <digit>". The comma between the two appearances of <digit> stands for itself. In words, a <digit pair> consists of any member of the class <digit>, followed by a comma, followed by any member of the <digit> class. Thus 3,4 and 3,3 and 0,0 are all members of the <digit pair> class; 36 and 44 are not members of that class.

Note: the above describes a reductive grammar as described in the McGraw-Hill technical terms dictionary.[3]

BNF as used in the ALGOL 60 report

edit

Original description of BNF - Backus Normal Form from the ALGOL reports.

Source of BNF comes from ALGOL 60

edit

The following is from "Programming Systems and Languages" Edited by Saul Rosen Professor of Mathematics and Computer Science Purdue University McGraw-Hill 1967.

The book is a wealth of information on early computer language and operating system development. It contains a collection of early published papers by many authors, many from the ACM.

This article provides an insight into the function of BNF as used in the early ALGOL reports.

"The ALGOL Programming Language" by Saul Rosen Pages 48 - 78 and "Revised Report on Algorithmic Language--ALGOL 60" Pages 79 - 118.

BNF was created by John Backus to specify the ALGOL programming language syntax. He at one time said it came out of meeting notes he had taken; at another, that it was from a course by Martin Davis. I can relate to that. I have a highly analytical mind but a poor memory. I retain things I use a lot, but I repeat things when writing, having forgotten that I had already covered them. I forget what I have written. My spelling is atrocious. But what I do is know how things work; call it scientific intuition. Anyway, the point is Backus was a highly intelligent man and may not have remembered where he got the idea. In fact, even if he had gotten the idea from Chomsky or wherever, it was not a production grammar.

Anyway, Naur had little to do with inventing BNF. He used it in the ALGOL 60 report to describe language constructs. It was not explained what could have been changed. BNF only has four metasymbols: ::=, |, <, >. The commas are not a part of BNF but English punctuation marks; as Saul Rosen puts it, they are part of the metalanguage, English, being used to describe BNF.


While Saul Rosen was not listed in the ALGOL 60 report, according to his biography he was a member of the ACM working group. His description of BNF is much more interesting in light of his participation in the ACM ALGOL working group.

The ALGOL 60 report was written by J. W. Backus, F. L. Bauer, J. Green, C. Katz, J. McCarthy, A. J. Perlis, H. Rutishauser, K. Samelson, B. Vauquois, J. H. Wegstein, A. van Wijngaarden and M. Woodger, with Peter Naur as editor.

Where did the idea of BNF come from? The actual history is mostly lost. This book, I think, gives some insight into that question. Martin Davis was primarily a mathematician. Maybe, as rumored, it was a course by Davis. Saul Rosen, also a mathematician, describes it as an analytical language. The ALGOL 60 report uses BNF as a reductive grammar metalanguage. The term reductive came from somewhere; it is used in early metacompiler documents.

Saul Rosen became active in the Association for Computing Machinery (ACM) in 1947, first on the languages committee that eventually led to the ALGOL programming language, and then as first managing editor of the Communications of the ACM. He wrote extensively on practical systems programming and authored his major book, Programming Systems and Languages (McGraw-Hill, New York), in 1967.

Reading carefully the ALGOL 60 report and Saul Rosen's ALGOL article, it seems that BNF is a reductive grammar, reductive being the polar opposite of productive: basically the original term for what some call an analytical grammar.[3]

That is a bit vague, but maybe looking at Saul's explanation and the ALGOL 60 report it can be refined. BNF is used to describe ALGOL language constructs and to attach semantic meaning to them. A named language construct is defined by a BNF rule having the same name. Every BNF rule names a language construct.

<language construct name>::=<construct pattern>

i.e.

<arithmetic expression>::=<arithmetic term>|<arithmetic expression><adding operator><arithmetic term>

The ALGOL report uses descriptive names. The full <arithmetic expression> definition would be given, including <arithmetic term> etc. Following the BNF syntax rules would be examples and a semantic description describing its operation, valid value ranges, etc.

The rules were not only used to describe the syntax but also the allowable variable types, etc. An integer rule specifically defined how an integer should be implemented, hardware-dependently: the number of bits or digits matching the hardware implementation.

<arithmetic expression> had a semantic description describing conversion rules to higher types: integer to floating, upscaling, etc.

BNF as used in the ALGOL 60 report

edit

Original description of BNF - Backus Normal Form from the ALGOL reports.

Source of BNF comes from ALGOL 60

edit

The following is from "Programming Systems and Languages", edited by Saul Rosen, Professor of Mathematics and Computer Science, Purdue University (McGraw-Hill, 1967).

The book is a wealth of information on early computer language and operating system development. It contains a collection of early published papers by many authors, many from the ACM.

This article provides an insight into the function of BNF as used in the early ALGOL reports.

The relevant articles are "The ALGOL Programming Language" by Saul Rosen, pages 48 - 78, and "Revised Report on the Algorithmic Language ALGOL 60", pages 79 - 118.

BNF was created by John Backus to specify the syntax of the ALGOL programming language. At one time he said it came out of meeting notes he had taken; on another occasion, that it came from a course by Martin Davis. I can relate to that. I have a highly analytical mind but a poor memory. I retain things I use a lot, but I repeat things when I am writing, forgetting that I have already covered them. My spelling is atrocious. What I do know is how things work; call it scientific intuition. The point is that Backus was a highly intelligent man and may not have remembered where he got the idea. In any case, wherever he got it, whether from Chomsky or elsewhere, it was not a production grammar.

Anyway, Naur had little to do with inventing BNF. He used it in the ALGOL 60 report to describe language constructs. What, if anything, he changed is not explained. BNF has only four metasymbols: ::= | < > (the commas sometimes written between them are not part of BNF but English punctuation marks; as Saul Rosen puts it, they are part of the metalanguage, English, being used to describe BNF).


Saul Rosen was not listed as an author of the ALGOL 60 report, but according to his biography he was a member of the ACM working group. His description of BNF is all the more interesting in light of his participation in the ACM ALGOL working group.

The ALGOL 60 report was written by J. W. Backus, F. L. Bauer, J. Green, C. Katz, J. McCarthy, A. J. Perlis, H. Rutishauser, K. Samelson, B. Vauquois, J. H. Wegstein, A. van Wijngaarden and M. Woodger, with Peter Naur as editor.

Where did the idea of BNF come from? The actual history is mostly lost. This book, I think, gives some insight into that question. Martin Davis was primarily a mathematician; maybe, as rumored, the idea came from a course by Davis. Saul Rosen, also a mathematician, describes BNF as an analytical language. The ALGOL 60 report uses BNF as a reductive grammar metalanguage. The term reductive came from somewhere; it is used in early metacompiler documents.

Saul Rosen became active in the Association for Computing Machinery (ACM) in 1947, first on the languages committee that eventually led to the ALGOL programming language, and then as first managing editor of the Communications of the ACM. He wrote extensively on practical systems programming and authored his major book, Programming Systems and Languages (McGraw-Hill, New York), in 1967.

Reading the ALGOL 60 report and Saul Rosen's ALGOL article carefully, it seems that BNF is a reductive grammar, reductive being the polar opposite of productive. Basically it is the original term for what some call an analytical grammar.[3]

That is a bit vague, but looking at Saul Rosen's explanation and the ALGOL 60 report it can be refined. BNF is used to describe ALGOL language constructs and to attach semantic meaning to them. A named language construct is defined by a BNF rule having the same name. Every BNF rule names a language construct.

<language construct name>::=<construct pattern>

i.e.

<arithmetic expression>::=<arithmetic term>|<arithmetic expression><adding operator><arithmetic term>

The ALGOL report uses descriptive names. The full <arithmetic expression> definition would be given, including <arithmetic term> etc. Following the BNF syntax rules would be examples and a semantic description describing its operation, valid value ranges, and so on.

The rules were used not only to describe the syntax but also the allowable variable types. An integer rule specifically defined how an integer should be implemented, hardware dependently: the number of bits or digits matching the hardware implementation.

<arithmetic expression> had a semantic description describing conversion rules to higher types: integer to floating, upscaling, etc.

Edit ref 3 META II

edit

My claim is that what is reductive is not the BNF rules themselves, but the parsers that are automatically generated from them. This was the innovation brought about in 1962, apparently by META II: the automatic generation of parsers from grammars.

How to do this? This wasn't familiar territory in 1962. It was known that if you take a naive approach to parsing, allowing arbitrary context-free grammars will get you into trouble.

If you do top-down, left-to-right parsing, and you allow left recursion:

 A := A B | C

you get stuck in an infinite loop. All left recursion must be eliminated. (It was later discovered that this is impossible to do in general.)
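
To make the loop concrete, here is a minimal illustrative sketch in Python (my own toy example, not code from META II or any system discussed here), treating B and C as the terminals 'b' and 'c':

# Illustrative sketch only: why naive top-down parsing of  A := A B | C  loops.
# B and C are taken to be the terminals 'b' and 'c', so the language is c b*.

def parse_A_left(tokens, pos, depth=0):
    # Alternative 1 is  A B : it re-enters A without consuming any input,
    # so the recursion never bottoms out (cut short here after 20 levels).
    if depth > 20:
        raise RecursionError('left recursion never consumes input')
    ok, p = parse_A_left(tokens, pos, depth + 1)
    if ok and p < len(tokens) and tokens[p] == 'b':
        return True, p + 1
    # Alternative 2 is  C
    if pos < len(tokens) and tokens[pos] == 'c':
        return True, pos + 1
    return False, pos

def parse_A_rewritten(tokens, pos):
    # Standard left-recursion elimination:  A := C B*  (same language)
    if pos < len(tokens) and tokens[pos] == 'c':
        pos += 1
        while pos < len(tokens) and tokens[pos] == 'b':
            pos += 1
        return True, pos
    return False, pos

print(parse_A_rewritten(list('cbb'), 0))     # (True, 3)
try:
    parse_A_left(list('cbb'), 0)
except RecursionError as e:
    print('naive parser failed:', e)

The rewritten version accepts the same language but consumes input before every loop step, so it always terminates.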

Edit ref 4 ambiguity

edit

Another problem is ambiguity along the way: when doing top-down, left-to-right parsing, these rules

 A := B C | D
 D := B E

present a problem: you don't know whether to interpret A as B C or as B E as they both start with B. This can be solved by backtracking, but such a parser can take an exponential time to run.

A more radical approach is to simply disallow all combinations of rules that may produce such problems. For instance: require that all right hand sides start with terminals (i.e. tokens, not variables) and that all right hand sides of the same nonterminal (variable) start with different terminals. So

 A := a A
 A := b B a C B b

is fine, but

 A := a A
 A := a C

is not, and neither is

 A := B

This type of grammar is known as simple deterministic or LL(1). For these, parsing can be done top-down, and it takes no more than linear time: you can proceed at each token and you never need to backtrack.
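
As an illustration of that linear-time behaviour, here is a minimal Python sketch of an LL(1)-style choice (a toy grammar of my own, A := 'a' A | 'b', not one from the discussion): the next token alone selects the alternative, so no backtracking is ever needed.

# Illustrative LL(1) sketch for the toy grammar  A := 'a' A | 'b'.
# Each alternative starts with a distinct terminal, so one token of lookahead
# selects the alternative and the parser never backtracks.

def parse_A(tokens, pos):
    if pos < len(tokens) and tokens[pos] == 'a':   # alternative 1: 'a' A
        return parse_A(tokens, pos + 1)
    if pos < len(tokens) and tokens[pos] == 'b':   # alternative 2: 'b'
        return True, pos + 1
    return False, pos                              # neither FIRST token matches

print(parse_A(list('aaab'), 0))   # (True, 4)  -- one pass, linear time
print(parse_A(list('aaa'), 0))    # (False, 3) -- rejected without backtracking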

The drawback of this approach is that it reduces expressive power: these grammars cannot describe all context-free languages. They can only describe a subset (which, naturally, is known as the set of LL(1) languages).

This makes such grammars unsuitable for use in natural language processing: natural language contains non-LL(1) constructs. When designing a new artificial language, you are free to choose its syntax such that an LL(1) grammar (or similar) can describe it.

People have done this a lot. The resulting languages are fast and easy to parse, but often curbed in expressive power. For instance, when I encounter an XML-based language, I want to express its allowable syntax with DTDs or XML Schema; I have met cases where it couldn't be done because DTDs and XML Schema are required to be LL(1), while the syntax I needed to describe wasn't.

So do we have to choose between crippled languages and exponential parsing times? Not at all.

Instead of doing top-down parsing we can use bottom-up parsing (see LR parser and LALR parser), a smarter technique that supports more grammars than top-down parsing (but still not everything) and runs in linear time rather than exponential time. Parser generators such as Yacc are based on this.

Furthermore, we can parse arbitrary context-free grammars in cubic time, using some kind of extension of bottom-up parsing such as an Earley parser or GLR parser. Instead of backtracking, we can process all alternative parses at once, if our bookkeeping is smart enough. This essentially turns parsing into a variant of matrix multiplication.
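
For concreteness, here is a minimal cubic-time recognizer sketch in Python. It is CYK over a grammar in Chomsky normal form rather than the Earley or GLR parsers mentioned above, but it shows the same table-filling, all-alternatives-at-once idea (the toy grammar below, for a^n b^n, is my own):

# Minimal CYK recognizer: cubic-time recognition for any grammar in Chomsky
# normal form.  Not Earley or GLR, but the same idea of keeping every viable
# alternative in a table instead of backtracking.  Toy grammar for a^n b^n:
#   S -> A T | A B,  T -> S B,  A -> 'a',  B -> 'b'
unit = {'a': {'A'}, 'b': {'B'}}                              # terminal rules
binary = {('A', 'T'): {'S'}, ('A', 'B'): {'S'}, ('S', 'B'): {'T'}}

def cyk(s, start='S'):
    n = len(s)
    if n == 0:
        return False
    # table[i][l] = nonterminals deriving the substring of length l+1 at position i
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):
        table[i][0] |= unit.get(ch, set())
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for k in range(1, length):                       # split point
                for left in table[i][k - 1]:
                    for right in table[i + k][length - k - 1]:
                        table[i][length - 1] |= binary.get((left, right), set())
    return start in table[0][n - 1]

print(cyk('aabb'))   # True
print(cyk('aab'))    # False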

Where does this leave reductive grammars? I suspect that META II generates LL(1) parsers and therefore requires LL(1) grammars. So you might define reductive grammars as LL(1) grammars. Later parser generators use bottom-up parsers - are the grammars that these allow reductive, too? Even later parser generators support general context-free parsing; now, all context-free grammars have become "reductive". Hence my statement: the generated parser is reductive, not the grammar itself.


However, you mentioned another thing that I've ignored thus far: the reductive grammars support additional constructs in their rules that take their expressive power out of the context-free realm. You mentioned how this is necessary for COBOL, but C has a similar problem. If additional constructs in rules can indeed express such things, it is indeed a feature of grammars and worth being mentioned in this article. Furthermore, such features are probably going to be "reductive" in the sense that their interpretation assumes a particular parsing algorithm.

Furthermore, rules written for parser generators also have actions associated with them, code written in a programming language. Such actions are also "reductive", they also (probably) assume a particular parsing algorithm. If you include the action code in your notion of what a reductive grammar is, clearly a reductive grammar is far more than a generative grammar. The action code helps produce sensible syntax trees for the sentences being parsed, but it can just as well perform additional syntax checking, such as verifying whether every variable has been declared prior to use. So in practice, parsers do recognize non-context-free features in languages. This must certainly be explained in an article about parsing and translation, but to what extent does it belong in the present article? I'm not sure. I was more or less assuming it would be confined to the notion of grammar defined by Chomsky. But should it be?

There is much to improve in the article as it is now, but I'd like to see its scope defined more clearly before I attempt to make any changes.

Points of contention

edit

First, thank you Rp for the kind words and complete explanation you gave. I agree with you on most points, the main one being:

As far as I'm concerned, the discussion is not on whether to cover this technology in Wikipedia; the discussion is on how to cover it and where, and how to avoid describing the same things in different places by different names.-Rp

I have not been able to explain these grammars in linguistic terms. The grammar languages I have been talking about are from the Schorre metacompilers. The major thing I failed to get across is that these reductive grammar languages are actually programming languages in which one codes a parser. They are all META languages descended from META II. Starting with the hand-compiled META I used to compile META II, the languages META II, META 5, TREE-META, CWIC, SLIC and others all share a common subset of their grammar language:

1. All syntax rules are of the form:

<name> = <language structure pattern>;
i.e. rule = id '=' <language structure pattern>;
The <language structure pattern> is a logic expression and parsing specification.

2. The rule <name> uniquely identifies the rule. Multiple rules of the same name are not allowed.

3. A rule is an executable function, a programmed test, that returns success or failure, i.e. true or false. (A minimal sketch of this follows the list.)

4. The <language structure pattern> is a logic expression consisting of tests and operators. Logical AND is implied by a sequence of tests. The '/' is the alternative (logical OR) operator. Expressions may be grouped using parentheses, i.e. (<language structure pattern>).

5. They all compile to a scannerless parser.
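
Here is the minimal sketch promised above: an illustration in Python (not the code a Schorre metacompiler actually emits) of points 1-5, with each rule written as a boolean test function, sequence as an implied AND, and '/' as an ordered OR.

# Illustrative sketch only (Python, not generated metacompiler code):
# each rule is a boolean test function over an input cursor, a sequence of
# tests is an implied AND, '/' is an ordered OR, and $ is the loop operator.

class Input:
    def __init__(self, text):
        self.text, self.pos = text, 0
    def string(self, s):                    # string test: advances only on a match
        if self.text.startswith(s, self.pos):
            self.pos += len(s)
            return True
        return False

def term(inp):                              # toy token rule: term = a single digit
    if inp.pos < len(inp.text) and inp.text[inp.pos].isdigit():
        inp.pos += 1
        return True
    return False

def expr(inp):                              # expr = term $(('+' / '-') term);
    if not term(inp):                       # first test of the sequence (AND)
        return False
    while True:                             # $ loop: repeat while the group succeeds
        if not (inp.string('+') or inp.string('-')):   # ordered alternatives
            return True                     # no more operators: expr succeeds
        if not term(inp):                   # operator seen but no term follows
            return False

print(expr(Input('1+2-3')))   # True
print(expr(Input('+2')))      # False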

Rp claimed that Backus knew of Chomsky before introducing BNF to the IAL (International Algebraic Language) group. There is no proof that Backus knew of Chomsky or attended a lecture by him before BNF was introduced. An interesting ACM paper by a computer designer at IBM describes using boolean equations in circuit design. The syntax of those boolean equations is conceptually equivalent: the paper describes boolean algebraic equations used in designing logic circuits, with names corresponding to the circuit function. Backus, when asked about the origin of BNF, gave different answers; maybe the idea came from a proprietary project within IBM. Reading the ALGOL 60 report (Naur being the editor), you find that the < > delimit a metalinguistic variable, later also described as a class in ALGOL course notes by Naur. What you have within the < > is not a non-terminal symbol; it may be a description, as below from the ALGOL 60 report explaining comment syntax in table form:

The sequence of basic symbols:                       is equivalent to
; comment <any sequence not containing ;>;           ;
begin comment <any sequence not containing ;>;       begin
end <any sequence not containing end or ; or else>   end

In rule form:

; comment <any sequence not containing ;>;           ::= ;
begin comment <any sequence not containing ;>;       ::= begin
end <any sequence not containing end or ; or else>   ::= end

An interesting way of expressing negation, "<any sequence not containing ;>", wouldn't you say?

And hey, notice: they do not have a single non-terminal on the left. Would <any sequence not containing ;> be considered a non-terminal symbol? Instead of a symbol we have a pattern replacement rule. They were given in table form, but what those BNF rules define is the removal of comments, to be ignored as if they were not in the program. A pattern on the left makes it a context-sensitive rule according to the Chomsky paper I read; a context-free grammar has only a single non-terminal on the left of the replacement operator. However, the ::= operator, as explained in the ALGOL 60 report, literally means "is defined as". BNF as documented by Naur in the report describes a metalanguage. It was later formalized to fit into the Chomsky production grammars. Naur may have played a bigger part, however, as he edited the ALGOL report to use BNF.

Why the table instead of rules? Maybe because the ::= operator, meaning "is defined as", didn't fit its definition.


Rp said "However, your idea on the relationship between generative and reductive grammars is different from how I, Arthur Rubin and standard compiler technology textbooks see it."

That may be the case, except for the 2006 McGraw-Hill technical term dictionary. I wasn't concerned with what type of grammar they were until I started contributing here on wiki. I have been programming since 1965. I met some of the computing pioneers at ACM L.A. SIGPLAN meetings.

I lost most of my library due to water damage when our washer flooded the garage; my books were still in boxes on the garage floor after a move. Anyway, can you point out a compiler textbook that defines reductive grammar? I have looked for a definition of reductive grammar. The only one I found is from the 2006 McGraw-Hill technical term dictionary. The only other references I found were in old philosophy books. I really couldn't see enough of the philosophy books' scanned pages online to get their definition, just that they seemed to treat it as an analytical type of phrase structure grammar. The books were from the early 1900s, pre-Chomsky. The metacompiler documents do not explain reductive; apparently it was in common use in the 50's and 60's. I think I recall phrase structure grammars being mentioned at some of those early SIGPLAN meetings, but that was a long time ago.

In the case of metacompilers I think reductive has to do with the parser produced, so I half agree with it being based on the parser produced. However, the connection is much stronger in the metacompilers whose documents describe their language as reductive: you actually program the parser in their grammar metalanguage. Each rule is compiled into a boolean function, a test function returning success (TRUE) or failure (FALSE). Everything that defines a language construct is a test. A quoted string is a test.

expr = term $(('+':ADD / '-':SUB) term !2);

In the expr test function above, term, '+', and '-' are tests. The $ is a loop operator. We see nested grouping avoiding ambiguous language constructs and bounding the $ loop's scope. The loop will continue as long as there are more terms following a + or - string. The expr rule function is programmed to recognize an arithmetic expression and transform it into an abstract syntax tree.

The metalanguage has character class rules, one of which is skip_class, a predefined character class. String tests in syntax rules skip skip_class characters until they find a first-character match or a non-skip_class character that doesn't match. The :<node name> operator creates a node and pushes it on the node stack. The abstract parse tree is built on the parse stack. The term rule is expected to leave a single object on the parse stack, as does the expr rule function. The !<number> operator creates a tree whose node is popped from the node stack and whose branches are the specified number of entries popped from the parse stack. Trees are actually lists whose first element is a node. The expr rule creates a list of the form [ADD,(left)term,(right)term], producing a left-handed tree.

a - b + c

     ADD
    /   \  
  SUB    c
 /   \
a     b

If one instead uses right recursion:

expr = term (('+':ADD / '-':SUB) expr !2);

a right-handed tree is produced (not expressing the normal algebraic interpretation).

a - b + c

  SUB
 /   \
a    ADD
    /   \  
   b     c

It acts as if grouped:
(a - (b + c))

One might instead create lists:

expr = +[term $('+' term / '-' term:NEG!1)]+:EXP!1;

The sequence:
a - b + c
producing:
EXP[[a,NEG[b],c]]
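
A minimal Python sketch of the node-stack / parse-stack mechanics described above for expr = term $(('+':ADD / '-':SUB) term !2); this is my own illustration, not CWIC's generated code, with term reduced to a single letter and skip_class handling reduced to stripping blanks:

# Illustrative sketch of the node-stack / parse-stack tree building described
# above (my own Python, not CWIC output).  ':NAME' pushes a node on the node
# stack; '!2' pops that node and two parse-stack entries and pushes the tree
# back on the parse stack.

class Parser:
    def __init__(self, text):
        self.text, self.pos = text.replace(' ', ''), 0   # blanks stripped in place of skip_class
        self.parse_stack, self.node_stack = [], []

    def string(self, s):                   # string test: advances only on a match
        if self.text.startswith(s, self.pos):
            self.pos += len(s)
            return True
        return False

    def node(self, name):                  # :NAME
        self.node_stack.append(name)

    def tree(self, n):                     # !n -- build [node, branch1, ..., branchn]
        branches = self.parse_stack[-n:]
        del self.parse_stack[-n:]
        self.parse_stack.append([self.node_stack.pop()] + branches)

    def term(self):                        # toy term: a single letter
        if self.pos < len(self.text) and self.text[self.pos].isalpha():
            self.parse_stack.append(self.text[self.pos])
            self.pos += 1
            return True
        return False

    def expr(self):                        # expr = term $(('+':ADD / '-':SUB) term !2);
        if not self.term():
            return False
        while True:
            if self.string('+'):
                self.node('ADD')
            elif self.string('-'):
                self.node('SUB')
            else:
                return True
            if not self.term():
                return False
            self.tree(2)

p = Parser('a - b + c')
p.expr()
print(p.parse_stack[0])    # ['ADD', ['SUB', 'a', 'b'], 'c']  -- the left-handed tree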

The syntax language is directing the translation. The fact that they are programming languages in which you code a parser requires them to be analytical. Reductive describes the top-down descent, describing a language construct as a pattern of language constructs. CWIC has two alternative operators: the forward slash / is non-backtracking while the backslash \ is backtracking. The term backup refers to backing up only the input: for example, when a string test fails, the input state is restored to the point before the test; only the input state is affected. Backtracking restores the whole parser state, of which the input state is a part, releasing created objects and resetting the parse and node stacks. If code generation has already occurred, backtracking fails and an internal compiler error is reported.

Rp, your example:

The grammar is written in a reductive way because it is actually a program; a mistake in it is a program error. As you said, the parser is reductive or analytical, but so is the grammar, being so closely coupled with the generated parser code.

TREE-META was much the same as CWIC, only simpler, having neither backtracking nor a list-processing generator language. (A later version of TREE-META added backtracking, using "<-" to set a backtrack point.)

Rp, you describe the metalanguage extensions as syntactic sugar, but that is not the case. Some are required because it is specifically a programming language. And the alternative operator is very definitely not syntactic sugar: the alternatives are ordered, taking the first match. The parser tries them in order until a match is found or all alternatives are exhausted and failure results. A normal (non-backtracking) alternative is tried only on a first-test failure; that is, when a sequence is specified, the alternative is attempted only if the sequence's first test fails. When a recognition occurs and a subsequent test fails, a backtrack failure occurs: the parse state, input, stacks etc. are reset back to the backtrack point and failure is returned. A backtrack failure works partly like a C longjmp. The parser state is saved in a backtrack frame on the call stack and the call stack pointer is saved. A backtrack failure loads the call stack pointer with the saved backtrack stack pointer and returns to the backtrack frame, which restores the parser state saved there and returns failure; the previous backtrack frame pointer is restored. Backtrack stack frames are linked back to an initial, non-restoring backtrack stack frame.
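
A minimal sketch of that backtracking mechanism as I understand it, with a Python exception standing in for the longjmp-style unwinding and a saved snapshot standing in for the backtrack stack frame (illustrative only; the real implementation manipulates the call stack directly):

# Illustrative only: a CWIC-style backtracking alternative '\' modeled with a
# Python exception standing in for longjmp-style unwinding.  A backtrack point
# snapshots the parser state; a backtrack failure unwinds to it, the state is
# restored, and the next alternative is tried.

class BacktrackFail(Exception):
    pass

class Parser:
    def __init__(self, text):
        self.text, self.pos = text, 0
        self.parse_stack = []

    def string(self, s):                  # string test: advances only on a match
        if self.text.startswith(s, self.pos):
            self.pos += len(s)
            return True
        return False

    def save(self):                       # snapshot: input position and stack depth
        return self.pos, len(self.parse_stack)

    def restore(self, state):             # reset input and stack to the snapshot
        self.pos = state[0]
        del self.parse_stack[state[1]:]

    def backtrack_alt(self, *alternatives):
        # the '\' operator: try each alternative from a saved state; a
        # BacktrackFail raised anywhere below unwinds to here, not further.
        for alt in alternatives:
            state = self.save()
            try:
                if alt():
                    return True
            except BacktrackFail:
                pass
            self.restore(state)
        return False

def alt1(p):
    # recognizes 'begin', then a later test fails: a backtrack failure
    if not p.string('begin'):
        return False
    if not p.string(' stop'):
        raise BacktrackFail()
    return True

def alt2(p):
    return p.string('begin') and p.string(' end')

p = Parser('begin end')
print(p.backtrack_alt(lambda: alt1(p), lambda: alt2(p)), p.pos)   # True 9

The first alternative recognizes 'begin' and then fails deeper in; the unwind returns control to the backtrack point, the saved state is restored, and the second alternative succeeds.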

Rp: Your claim is that what is reductive is not the BNF rules themselves, but the parsers that are automatically generated from them.
This was the innovation brought about in 1962, apparently by META II: the automatic generation of parsers from grammars.

Yes and no. Parser generators, starting with yacc, work as you describe. However, META II does not take BNF. It is not a parser generator, as its output is a compiler: a compiler that compiles its input into assembly code. META II compiles the META II programming metalanguage. It has ordered, left-to-right evaluation of alternatives, taking the first match (not the longest; the first). PEGs are somewhat recognized as grammars, distinguished as analytical because of the ordered first-match feature. How can you have first match in a productive grammar? In META I, META II, TREE-META and CWIC alternatives are tested in the order they are written. I have already explained my position on the original BNF; it has changed since. But I would suggest that in order to use a parser generator, the grammar must usually be written from an analytical viewpoint.

For the record, my claim is that BNF as used in the ALGOL 60 report was reductive. I think one thing making a grammar reductive is using descriptive rule names. It is then a metalanguage that can be used in talking about the language; for example, a term rule describing or following standard algebraic rules. We know the algebraic hierarchy and can see it in the metalanguage expr, term and factor rules.

Rp gave this example:

Another problem is ambiguity along the way:
when doing top-down, left-to-right parsing, these rules
  A := B C | D
  D := B E
present a problem: you don't know whether to interpret A as B C or as B E as they both start with B.
This can be solved by backtracking, but such a parser can take an exponential time to run.

Bull puka. The languages have grouping, like algebraic or logic expressions. One simply wrote:

  A = B (C | E);

No backtracking involved. Really a poor example, and not as problematic as the expr example above. CWIC, though it had backtracking, seldom needed it; I do not remember ever using it other than for error recovery.

program     = $(-".end" declaration) ".end";
declaration = (assignment / if_st) gencode[*1]
              \ errorx["error"] $(-';' .any) ';';
assignment  = id '=' expr ';':asgn!2;
expr        = term $(('+':ADD / '-':SUB) term !2);
term        = factor $(('*':MPY / '/':DIV) factor !2);
factor      = id / number / '(' expr ')';

id          .. let $alphanum;
number      .. dgt $dgt makenum[];


Rp covered or alluded to the points I gave, apparently without being familiar with META II or the metacompilers based on META II that followed.

The point that Rp missed is that they are programming languages. They are not compiled to a specific type of parser. They basically operate top-down, left to right; one might say they are recursive descent, but they are not limited by that categorization as it is usually explained. The difference is the built-in programmed backtracking and extensive look-ahead features. They handle languages that have context-sensitive lexing: lexing rules are programmed functions called from syntax rules. In many cases this results in a compile-speed improvement, because the syntax is controlling the lexing. A lexing rule is relatively simple, and the syntax knows what types of tokens are valid. Lexing rules are only concerned with constants and variables of the language; operator strings are tested for using string matching. Only symbols and constants of the source are kept. Constants are released once used. Variables of the language are cataloged in symbol tables.

Because one programs the parser with awareness of the semantic transformation, all the pitfalls described can be avoided. With CWIC, released in 1970, most if not all limitations were gone. CWIC provided programmed backtracking; you are not going to get time-consuming backtracking you do not specify. The error recovery example is about the only use I have ever had for backtracking. Backtracking may be used in performing look-ahead, but then only looking ahead by parsing a structure. A string-literal look-ahead like -';' does not invoke backtracking; it does invoke backup. It can involve backtracking in a complex situation: -expr would have to process an expr to determine its truth. The ?, like the negate operator, can take any syntax expression as its test. An expression that creates objects requires backtracking.

META II parses no differently than most of the examples given, except that it has no backtracking features. In most cases backtracking is avoided by factoring so that only the next object is of interest. Such parsing has little or no error-recovery ability. It had no look-ahead. I think that LL(1) can handle the ambiguities presented if the rules are written factoring out alternatives so the left-to-right choice is made on the next symbol.

CWIC had many advantages over parser generators. It produced machine code that ran extremely fast, and had debug features that output the tree structures in symbolic form.

SLIC (System of Languages for Implementing Compilers) was based on CWIC. I wrote SLIC while at college; it was implemented on a DEC System-10 timesharing system. I implemented an interactive debugger that could break on entry to or exit from a syntax rule, or on lines of a generator action. One could examine the parse and node stacks, display generator variables, and single-step generator actions. The call stack could be examined, showing the rules or generators stacked up. SLIC added three additional languages, two of which separated binary machine-code definition from tree-crawling code generation; the third performed in-sequence optimization.

It is hard to explain these languages in today's terminology.


Success or Failure

edit

The terms success and failure refer to the status of a test. A test may be considered a type having a value of success or failure, but in most cases success and failure are the state of a bit in the processor's status register, most commonly the zero flag. After a test, a machine branch-on-condition instruction is used to test the returned status (success or failure):

     call   a_test_rule
     je     a_success_loc
     jne    a_failure_loc

Assuming a_test_rule returns to the instruction after the call, the je instruction can be used to transfer control to a_success_loc, or a jne to a_failure_loc. Note: the branch instructions above only illustrate the condition testing and are not actual code; normally only one or the other would be generated, and whatever follows would handle the opposite condition.

expr = term $(('+':ADD / '-':SUB) term !2);

This rule would generate roughly the following x86-style code sequence.

%S1:    ASCIIZ   "+"
%S2:    ASCIIZ   "-"
_ADD_N: ASCIIZ   "ADD"
_SUB_N: ASCIIZ   "SUB"

expr:   call term      // term
        je   %g1       // term matched?
        ret            // failure
%g1:    lea  si,%S1    // "+"
        call _cmp_str
        jne  %g2
        lea  si,_ADD_N // :ADD
        jmp  %g3
%g2:    lea  si,%S2    // "-"
        call _cmp_str
        jne  %g4
        lea  si,_SUB_N // :SUB
%g3:    call _node     // push node on node stack
        call term      // term
        jne  %g6       // backtrack failure
        mov  ecx,2     // !2
        call _tree_    // build two-branch tree
        jmp  %g1       // $ loop: look for another + or -
%g4:    cmp  al,al     // set zero flag: success
%g5:    ret            // return success
%g6:    jmp  _backtrk_ // long fail to backtrack handler

Parser Generator vs Parser Programming Language

edit

Thank you Rp for the kind words and complete explanation you gave on the reductive grammar subject. Where and how we should explain technology common to multiple disciplines is a good question.

You seem very knowledgeable on the subjects of grammars and parsers. However, you and I see the technology using reductive grammars differently. My understanding comes from developing SLIC and using it to develop a COBOL compiler and a TI990 assembler. The difference is in the details of how one programs a compiler using the technology. Explaining their programming, I hope, will help show the difference. They are similar to parsing expression grammars.

The terms compiler-compiler, metacompiler and parser generator have been confused and become mangled together over the years to the point that they have lost all distinctiveness. I learned about these systems in the 1960's, when metacompiler referred to a specific type of compiler-compiler, and a compiler-compiler was a real compiler compiler, producing complete running compilers. The description of a metacompiler then was a compiler taking a metalanguage program as input, or what today could be called a metasyntax parser programming language.

You mentioned META II as being a parser generator. While the general concept of a parser generator covers META II, the operational details of parser generators do not exactly fit META II or the other Schorre-based META languages. An interesting point: META II is described by Schorre as a language.

As with other programming languages, the compiler goes by the same name as the language.

Schorre-based META languages are basically recursive descent parser programming languages. The details of how they transform source code into object code are programmed in the language. Today the focus is on transforming BNF into a parser, usually a bottom-up parser.

I first ran across the term reductive grammar in a 1968 TREE-META document. I recently found this in a 1974 TREE-META document:

A top-down method of syntax analysis is used in the TREE-META system. The main goal is to match the complete program with the Syntax Rule whose name follows .META at the head of the program. The name of the Syntax Rule appears at the start of the rule followed by the = symbol. The right-hand side of this Syntax Rule indicates which entities have to be recognised to achieve this goal. The recogniser achieves its main goal by looking for these smaller entities, from left to right, as individual subgoals. These subgoals are themselves defined in terms of other entities, and so the process of recognising a complete program becomes a matter of looking for progressively smaller entities, right down to the level at which basic symbols such as characters and numbers are recognised. TREE-META

The above applies to all Schorre-based metacompiler languages: META II, TREE-META, CWIC and many others.

I think these old metacompilers developed in the 1960s are distinctly different from parser generators, but it takes an understanding of using them to program a compiler to discern that difference. They are actually parser programming languages in which a programming language is defined: they define the source language and its transformation into machine code.

There is a procedural element in programming a parser in these languages. We ask the question, what is X? And write the equation that defines X:

X = ....

X names a linguistic constituent test function defined by the boolean expression on the right. These are equations that express actions that test the input character stream.

  1. ^ "A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token." -- page 111, "Compilers Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman, as quoted in http://stackoverflow.com/questions/14954721/what-is-the-difference-between-token-and-lexeme
  2. ^ Revised ALGOL 60 report section. 1.1."ALGOL 60". Retrieved April 18, 2015.
  3. ^ a b c d
    Reductive grammar: (computer science) A set of syntactic rules for the analysis of strings to determine whether the strings exist in a language. "Sci-Tech Dictionary McGraw-Hill Dictionary of Scientific and Technical Terms, 6th edition, published by The McGraw-Hill Companies, Inc".
  4. ^ The meaning of syntactic formula may be futher explained by saying that words enclosed in the brackets < >, like <ab>, denote classses whose members are sequences of basic symbols. Class designations of this kind are found in any description of a language. For describing ordinary natural languages designation like word, verb, noun, are used. , Peter Naur (1961)."A COURSE ON ALGOL PROGRAMMING". p. 5 Note1. Retrieved 26 March 2015.