New section "Simple GLR Parsers".

Paul Eggert
2004-06-21 20:55:20 +00:00
parent 85f23fae45
commit 99a9344e77
2 changed files with 212 additions and 3 deletions

@@ -1,3 +1,7 @@
2004-06-21 Frank Heckenbach <frank@g-n-u.de>
* doc/bison.texinfo (Simple GLR Parsers): New section.
2004-06-21 Paul Eggert <eggert@cs.ucla.edu>
* NEWS, TODO, doc/bison.texinfo:

@@ -135,7 +135,8 @@ The Concepts of Bison
a semantic value (the value of an integer,
the name of an identifier, etc.).
* Semantic Actions:: Each rule can have an action containing C code.
* GLR Parsers:: Writing parsers for general context-free languages
* GLR Parsers:: Writing parsers for general context-free languages.
* Simple GLR Parsers:: Using GLR in its simplest form.
* Locations Overview:: Tracking Locations.
* Bison Parser:: What are Bison's input and output,
how is the output used?
@@ -381,7 +382,8 @@ use Bison or Yacc, we suggest you start by reading this chapter carefully.
a semantic value (the value of an integer,
the name of an identifier, etc.).
* Semantic Actions:: Each rule can have an action containing C code.
* GLR Parsers:: Writing parsers for general context-free languages
* GLR Parsers:: Writing parsers for general context-free languages.
* Simple GLR Parsers:: Using GLR in its simplest form.
* Locations Overview:: Tracking Locations.
* Bison Parser:: What are Bison's input and output,
how is the output used?
@@ -860,6 +862,208 @@ will suffice. Otherwise, we suggest
%@}
@end example
@node Simple GLR Parsers
@section Using @acronym{GLR} in its Simplest Form
@cindex @acronym{GLR} parsing, unambiguous grammars
@cindex generalized @acronym{LR} (@acronym{GLR}) parsing, unambiguous grammars
@findex %glr-parser
@findex %expect-rr
@cindex conflicts
@cindex reduce/reduce conflicts
The C++ example for @acronym{GLR} (@pxref{GLR Parsers}) explains how to use
the @acronym{GLR} parsing algorithm with some advanced features such as
@samp{%dprec} and @samp{%merge} to handle syntactically ambiguous
grammars. However, the @acronym{GLR} algorithm can also be used in a simpler
way to parse grammars that are unambiguous, but fail to be @acronym{LALR}(1).
Such grammars typically require more than one symbol of lookahead,
or (in rare cases) fall into the category of grammars in which the
@acronym{LALR}(1) algorithm throws away too much information (they are in
@acronym{LR}(1), but not @acronym{LALR}(1); @pxref{Mystery Conflicts}).
Here is an example of this situation, using a problem that
arises in the declaration of enumerated and subrange types in the
programming language Pascal. These declarations look like this:
@example
type subrange = lo .. hi;
type enum = (a, b, c);
@end example
@noindent
The original language standard allows only numeric
literals and constant identifiers for the subrange bounds (@samp{lo}
and @samp{hi}), but Extended Pascal (ISO/IEC 10206:1990) and many other
Pascal implementations allow arbitrary expressions there. This gives
rise to the following situation, containing a superfluous pair of
parentheses:
@example
type subrange = (a) .. b;
@end example
@noindent
Compare this to the following declaration of an enumerated
type with only one value:
@example
type enum = (a);
@end example
@noindent
(These declarations are contrived, but they are syntactically
valid, and more-complicated cases can come up in practical programs.)
These two declarations look identical until the @samp{..} token.
With normal @acronym{LALR}(1) one-token look-ahead it is not
possible to decide between the two forms when the identifier
@samp{a} is parsed. It is, however, desirable
for a parser to decide this, since in the latter case
@samp{a} must become a new identifier to represent the enumeration
value, while in the former case @samp{a} must be evaluated with its
current meaning, which may be a constant or even a function call.
You could parse @samp{(a)} as an ``unspecified identifier in parentheses'',
to be resolved later, but this typically requires substantial
contortions in both semantic actions and large parts of the
grammar, where the parentheses are nested in the recursive rules for
expressions.
You might think of using the lexer to distinguish between the two
forms by returning different tokens for currently defined and
undefined identifiers. But if these declarations occur in a local
scope, and @samp{a} is defined in an outer scope, then both forms
are possible---either locally redefining @samp{a}, or using the
value of @samp{a} from the outer scope. So this approach cannot
work.
A solution to this problem is to use a @acronym{GLR} parser in its simplest
form, i.e., without using special features such as @samp{%dprec} and
@samp{%merge}. When the @acronym{GLR} parser reaches the critical state, it
simply splits into two branches and pursues both syntax rules
simultaneously. Sooner or later, one of them runs into a parsing
error. If there is a @samp{..} token before the next
@samp{;}, the rule for enumerated types fails since it cannot
accept @samp{..} anywhere; otherwise, the subrange type rule
fails since it requires a @samp{..} token. So one of the branches
fails silently, and the other one continues normally, performing
all the intermediate actions that were postponed during the split.
If the input is syntactically incorrect, both branches fail and the parser
reports a syntax error as usual.
The effect of all this is that the parser seems to ``guess'' the
correct branch to take, or in other words, it seems to use more
look-ahead than the underlying @acronym{LALR}(1) algorithm actually allows
for. In this example, @acronym{LALR}(2) would suffice, but some cases
that are not @acronym{LALR}(@math{k}) for any @math{k} can also be handled this way.
Since there can be only two branches and at least one of them
must fail, you need not worry about merging the branches by
using dynamic precedence or @samp{%merge}.
Another potential problem of @acronym{GLR} does not arise here, either. In
general, a @acronym{GLR} parser can take quadratic or cubic worst-case time,
and the current Bison parser even takes exponential time and space
for some grammars. In practice, this rarely happens, and for many
grammars it is possible to prove that it cannot happen. In
the present example, there is only one conflict between two
rules, and the type-declaration context where the conflict
arises cannot be nested. So the number of
branches that can exist at any time is limited by the constant 2,
and the parsing time is still linear.
So here we have a case where we can use the benefits of @acronym{GLR}, almost
without disadvantages. There are two things to note, though.
First, one should carefully analyze the conflicts reported by
Bison to make sure that @acronym{GLR} splitting is done only where it is
intended to be. A @acronym{GLR} parser splitting inadvertently may cause
problems less obvious than an @acronym{LALR} parser statically choosing the
wrong alternative in a conflict.
Second, interactions with the lexer (@pxref{Semantic Tokens}) must
be considered with great care. Since a split parser consumes tokens
without performing any actions during the split, the lexer cannot
obtain information via parser actions. Some cases of
lexer interactions can simply be eliminated by using @acronym{GLR}, i.e.,
shifting the complications from the lexer to the parser. Remaining
cases have to be checked for safety.
In our example, it would be safe for the lexer to return tokens
based on their current meanings in some symbol table, because no new
symbols are defined in the middle of a type declaration. Though it
is possible for a parser to define the enumeration
constants as they are parsed, before the type declaration is
completed, it actually makes no difference since they cannot be used
within the same enumerated type declaration.
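One way to check the first point, that splitting happens only where it
is intended, is to ask Bison for a verbose report and look at the
states it marks as conflicting. The command below is a minimal sketch;
the file name @file{pascal.y} is hypothetical and stands for whatever
file holds your grammar.
@example
bison -v pascal.y
@end example
@noindent
With @option{-v}, Bison writes a description of the parser states to a
file with the extension @file{.output} (here, @file{pascal.output}).
If a conflict shows up in a state you did not expect, the parser may
split there as well.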
Here is a Bison grammar corresponding to the example above. It
parses a vastly simplified form of Pascal type declarations.
@example
%token TYPE DOTDOT ID
@group
%left '+' '-'
%left '*' '/'
@end group
%%
@group
type_decl:
TYPE ID '=' type ';'
;
@end group
@group
type: '(' id_list ')'
| expr DOTDOT expr
;
@end group
@group
id_list: ID
| id_list ',' ID
;
@end group
@group
expr: '(' expr ')'
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| ID
;
@end group
@end example
When this grammar is processed as a normal @acronym{LALR}(1) grammar, Bison correctly complains
about one reduce/reduce conflict. In the conflicting situation the
parser chooses one of the alternatives, arbitrarily the one
declared first. Therefore the following correct input is not
recognized:
@example
type t = (a) .. b;
@end example
The parser can be turned into a @acronym{GLR} parser, while also telling Bison
to be silent about the one known reduce/reduce conflict, simply by
adding these two declarations to the Bison input file:
@example
%glr-parser
%expect-rr 1
@end example
@noindent
No change in the grammar itself is required. Now the
parser recognizes all valid declarations, according to the
limited syntax above, transparently. In fact, the user does not even
notice when the parser splits.
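To try this out, the grammar and the two declarations can be combined
with a small driver into one input file. The following is only a
sketch under stated assumptions: the lexer, the 64-byte identifier
buffer, and reading a single declaration from standard input are
illustrative choices, not part of the example above. The rules have no
semantic actions, so the parser only accepts or rejects its input.
@example
%@{
  #include <ctype.h>
  #include <stdio.h>
  #include <string.h>
  int yylex (void);
  void yyerror (char const *);
%@}
%glr-parser
%expect-rr 1
%token TYPE DOTDOT ID
%left '+' '-'
%left '*' '/'
%%
type_decl:
  TYPE ID '=' type ';'
;
type: '(' id_list ')'
| expr DOTDOT expr
;
id_list: ID
| id_list ',' ID
;
expr: '(' expr ')'
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| ID
;
%%
/* Hypothetical lexer: "type" is the only keyword, ".." is DOTDOT,
   any other character is returned as itself.  */
int
yylex (void)
@{
  int c;
  /* Skip white space.  */
  while ((c = getchar ()) != EOF && isspace (c))
    continue;
  if (c == EOF)
    return 0;
  if (isalpha (c))
    @{
      /* Collect an identifier, truncating it if it is too long.  */
      char buf[64];
      size_t n = 0;
      do
        @{
          if (n < sizeof buf - 1)
            buf[n++] = c;
          c = getchar ();
        @}
      while (c != EOF && isalnum (c));
      ungetc (c, stdin);
      buf[n] = '\0';
      return strcmp (buf, "type") == 0 ? TYPE : ID;
    @}
  if (c == '.')
    @{
      /* Two dots form DOTDOT; a lone dot is returned as itself.  */
      int d = getchar ();
      if (d == '.')
        return DOTDOT;
      ungetc (d, stdin);
    @}
  return c;
@}
void
yyerror (char const *s)
@{
  fprintf (stderr, "%s\n", s);
@}
int
main (void)
@{
  /* yyparse returns 0 if the declaration was accepted.  */
  return yyparse ();
@}
@end example
@noindent
With this sketch, both @samp{type t = (a) .. b;} and @samp{type t = (a);}
should be accepted, since the branch that does not match the input
simply dies when it reaches the token it cannot handle.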
@node Locations Overview
@section Locations
@cindex location
@@ -1290,7 +1494,7 @@ not require it. You can add or change white space as much as you wish.
For example, this:
@example
exp : NUM | exp exp '+' @{$$ = $1 + $2; @} | @dots{}
exp : NUM | exp exp '+' @{$$ = $1 + $2; @} | @dots{} ;
@end example
@noindent
@@ -1300,6 +1504,7 @@ means the same thing as this:
exp: NUM
| exp exp '+' @{ $$ = $1 + $2; @}
| @dots{}
;
@end example
@noindent