mirror of
https://git.savannah.gnu.org/git/bison.git
synced 2026-03-20 01:33:03 +00:00
New section "Simple GLR Parsers".
This commit is contained in:
@@ -1,3 +1,7 @@
|
|||||||
|
2004-06-21 Frank Heckenbach <frank@g-n-u.de>
|
||||||
|
|
||||||
|
* doc/bison.texinfo (Simple GLR Parsers): New section.
|
||||||
|
|
||||||
2004-06-21 Paul Eggert <eggert@cs.ucla.edu>
|
2004-06-21 Paul Eggert <eggert@cs.ucla.edu>
|
||||||
|
|
||||||
* NEWS, TODO, doc/bison.texinfo:
|
* NEWS, TODO, doc/bison.texinfo:
|
||||||
|
|||||||
@@ -135,7 +135,8 @@ The Concepts of Bison
|
|||||||
a semantic value (the value of an integer,
|
a semantic value (the value of an integer,
|
||||||
the name of an identifier, etc.).
|
the name of an identifier, etc.).
|
||||||
* Semantic Actions:: Each rule can have an action containing C code.
|
* Semantic Actions:: Each rule can have an action containing C code.
|
||||||
* GLR Parsers:: Writing parsers for general context-free languages
|
* GLR Parsers:: Writing parsers for general context-free languages.
|
||||||
|
* Simple GLR Parsers:: Using GLR in its simplest form.
|
||||||
* Locations Overview:: Tracking Locations.
|
* Locations Overview:: Tracking Locations.
|
||||||
* Bison Parser:: What are Bison's input and output,
|
* Bison Parser:: What are Bison's input and output,
|
||||||
how is the output used?
|
how is the output used?
|
||||||
@@ -381,7 +382,8 @@ use Bison or Yacc, we suggest you start by reading this chapter carefully.
|
|||||||
a semantic value (the value of an integer,
|
a semantic value (the value of an integer,
|
||||||
the name of an identifier, etc.).
|
the name of an identifier, etc.).
|
||||||
* Semantic Actions:: Each rule can have an action containing C code.
|
* Semantic Actions:: Each rule can have an action containing C code.
|
||||||
* GLR Parsers:: Writing parsers for general context-free languages
|
* GLR Parsers:: Writing parsers for general context-free languages.
|
||||||
|
* Simple GLR Parsers:: Using GLR in its simplest form.
|
||||||
* Locations Overview:: Tracking Locations.
|
* Locations Overview:: Tracking Locations.
|
||||||
* Bison Parser:: What are Bison's input and output,
|
* Bison Parser:: What are Bison's input and output,
|
||||||
how is the output used?
|
how is the output used?
|
||||||
@@ -860,6 +862,208 @@ will suffice. Otherwise, we suggest
|
|||||||
%@}
|
%@}
|
||||||
@end example
|
@end example
|
||||||
|
|
||||||
|
@node Simple GLR Parsers
|
||||||
|
@section Using @acronym{GLR} in its Simplest Form
|
||||||
|
@cindex @acronym{GLR} parsing, unambiguous grammars
|
||||||
|
@cindex generalized @acronym{LR} (@acronym{GLR}) parsing, unambiguous grammars
|
||||||
|
@findex %glr-parser
|
||||||
|
@findex %expect-rr
|
||||||
|
@cindex conflicts
|
||||||
|
@cindex reduce/reduce conflicts
|
||||||
|
|
||||||
|
The C++ example for @acronym{GLR} (@pxref{GLR Parsers}) explains how to use
|
||||||
|
the @acronym{GLR} parsing algorithm with some advanced features such as
|
||||||
|
@samp{%dprec} and @samp{%merge} to handle syntactically ambiguous
|
||||||
|
grammars. However, the @acronym{GLR} algorithm can also be used in a simpler
|
||||||
|
way to parse grammars that are unambiguous, but fail to be @acronym{LALR}(1).
|
||||||
|
Such grammars typically require more than one symbol of lookahead,
|
||||||
|
or (in rare cases) fall into the category of grammars in which the
|
||||||
|
@acronym{LALR}(1) algorithm throws away too much information (they are in
|
||||||
|
@acronym{LR}(1), but not @acronym{LALR}(1), @ref{Mystery Conflicts}).
|
||||||
|
|
||||||
|
Here is an example of this situation, using a problem that
|
||||||
|
arises in the declaration of enumerated and subrange types in the
|
||||||
|
programming language Pascal. These declarations look like this:
|
||||||
|
|
||||||
|
@example
|
||||||
|
type subrange = lo .. hi;
|
||||||
|
type enum = (a, b, c);
|
||||||
|
@end example
|
||||||
|
|
||||||
|
@noindent
|
||||||
|
The original language standard allows only numeric
|
||||||
|
literals and constant identifiers for the subrange bounds (@samp{lo}
|
||||||
|
and @samp{hi}), but Extended Pascal (ISO/IEC 10206:1990) and many other
|
||||||
|
Pascal implementations allow arbitrary expressions there. This gives
|
||||||
|
rise to the following situation, containing a superfluous pair of
|
||||||
|
parentheses:
|
||||||
|
|
||||||
|
@example
|
||||||
|
type subrange = (a) .. b;
|
||||||
|
@end example
|
||||||
|
|
||||||
|
@noindent
|
||||||
|
Compare this to the following declaration of an enumerated
|
||||||
|
type with only one value:
|
||||||
|
|
||||||
|
@example
|
||||||
|
type enum = (a);
|
||||||
|
@end example
|
||||||
|
|
||||||
|
@noindent
|
||||||
|
(These declarations are contrived, but they are syntactically
|
||||||
|
valid, and more-complicated cases can come up in practical programs.)
|
||||||
|
|
||||||
|
These two declarations look identical until the @samp{..} token.
|
||||||
|
With normal @acronym{LALR}(1) one-token look-ahead it is not
|
||||||
|
possible to decide between the two forms when the identifier
|
||||||
|
@samp{a} is parsed. It is, however, desirable
|
||||||
|
for a parser to decide this, since in the latter case
|
||||||
|
@samp{a} must become a new identifier to represent the enumeration
|
||||||
|
value, while in the former case @samp{a} must be evaluated with its
|
||||||
|
current meaning, which may be a constant or even a function call.
|
||||||
|
|
||||||
|
You could parse @samp{(a)} as an ``unspecified identifier in parentheses'',
|
||||||
|
to be resolved later, but this typically requires substantial
|
||||||
|
contortions in both semantic actions and large parts of the
|
||||||
|
grammar, where the parentheses are nested in the recursive rules for
|
||||||
|
expressions.
|
||||||
|
|
||||||
|
You might think of using the lexer to distinguish between the two
|
||||||
|
forms by returning different tokens for currently defined and
|
||||||
|
undefined identifiers. But if these declarations occur in a local
|
||||||
|
scope, and @samp{a} is defined in an outer scope, then both forms
|
||||||
|
are possible---either locally redefining @samp{a}, or using the
|
||||||
|
value of @samp{a} from the outer scope. So this approach cannot
|
||||||
|
work.
|
||||||
|
|
||||||
|
A solution to this problem is to use a @acronym{GLR} parser in its simplest
|
||||||
|
form, i.e., without using special features such as @samp{%dprec} and
|
||||||
|
@samp{%merge}. When the @acronym{GLR} parser reaches the critical state, it
|
||||||
|
simply splits into two branches and pursues both syntax rules
|
||||||
|
simultaneously. Sooner or later, one of them runs into a parsing
|
||||||
|
error. If there is a @samp{..} token before the next
|
||||||
|
@samp{;}, the rule for enumerated types fails since it cannot
|
||||||
|
accept @samp{..} anywhere; otherwise, the subrange type rule
|
||||||
|
fails since it requires a @samp{..} token. So one of the branches
|
||||||
|
fails silently, and the other one continues normally, performing
|
||||||
|
all the intermediate actions that were postponed during the split.
|
||||||
|
|
||||||
|
If the input is syntactically incorrect, both branches fail and the parser
|
||||||
|
reports a syntax error as usual.
|
||||||
|
|
||||||
|
The effect of all this is that the parser seems to ``guess'' the
|
||||||
|
correct branch to take, or in other words, it seems to use more
|
||||||
|
look-ahead than the underlying @acronym{LALR}(1) algorithm actually allows
|
||||||
|
for. In this example, @acronym{LALR}(2) would suffice, but also some cases
|
||||||
|
that are not @acronym{LALR}(@math{k}) for any @math{k} can be handled this way.
|
||||||
|
|
||||||
|
Since there can be only two branches and at least one of them
|
||||||
|
must fail, you need not worry about merging the branches by
|
||||||
|
using dynamic precedence or @samp{%merge}.
|
||||||
|
|
||||||
|
Another potential problem of @acronym{GLR} does not arise here, either. In
|
||||||
|
general, a @acronym{GLR} parser can take quadratic or cubic worst-case time,
|
||||||
|
and the current Bison parser even takes exponential time and space
|
||||||
|
for some grammars. In practice, this rarely happens, and for many
|
||||||
|
grammars it is possible to prove that it cannot happen. In
|
||||||
|
in the present example, there is only one conflict between two
|
||||||
|
rules, and the type-declaration context where the conflict
|
||||||
|
arises cannot be nested. So the number of
|
||||||
|
branches that can exist at any time is limited by the constant 2,
|
||||||
|
and the parsing time is still linear.
|
||||||
|
|
||||||
|
So here we have a case where we can use the benefits of @acronym{GLR}, almost
|
||||||
|
without disadvantages. There are two things to note, though.
|
||||||
|
First, one should carefully analyze the conflicts reported by
|
||||||
|
Bison to make sure that @acronym{GLR} splitting is done only where it is
|
||||||
|
intended to be. A @acronym{GLR} parser splitting inadvertently may cause
|
||||||
|
problems less obvious than an @acronym{LALR} parser statically choosing the
|
||||||
|
wrong alternative in a conflict.
|
||||||
|
|
||||||
|
Second, interactions with the lexer (@pxref{Semantic Tokens}) must
|
||||||
|
be considered with great care. Since a split parser consumes tokens
|
||||||
|
without performing any actions during the split, the lexer cannot
|
||||||
|
obtain information via parser actions. Some cases of
|
||||||
|
lexer interactions can simply be eliminated by using @acronym{GLR}, i.e.,
|
||||||
|
shifting the complications from the lexer to the parser. Remaining
|
||||||
|
cases have to be checked for safety.
|
||||||
|
|
||||||
|
In our example, it would be safe for the lexer to return tokens
|
||||||
|
based on their current meanings in some symbol table, because no new
|
||||||
|
symbols are defined in the middle of a type declaration. Though it
|
||||||
|
is possible for a parser to define the enumeration
|
||||||
|
constants as they are parsed, before the type declaration is
|
||||||
|
completed, it actually makes no difference since they cannot be used
|
||||||
|
within the same enumerated type declaration.
|
||||||
|
|
||||||
|
Here is a Bison grammar corresponding to the example above. It
|
||||||
|
parses a vastly simplified form of Pascal type declarations.
|
||||||
|
|
||||||
|
@example
|
||||||
|
%token TYPE DOTDOT ID
|
||||||
|
|
||||||
|
@group
|
||||||
|
%left '+' '-'
|
||||||
|
%left '*' '/'
|
||||||
|
@end group
|
||||||
|
|
||||||
|
%%
|
||||||
|
|
||||||
|
@group
|
||||||
|
type_decl:
|
||||||
|
TYPE ID '=' type ';'
|
||||||
|
;
|
||||||
|
@end group
|
||||||
|
|
||||||
|
@group
|
||||||
|
type: '(' id_list ')'
|
||||||
|
| expr DOTDOT expr
|
||||||
|
;
|
||||||
|
@end group
|
||||||
|
|
||||||
|
@group
|
||||||
|
id_list: ID
|
||||||
|
| id_list ',' ID
|
||||||
|
;
|
||||||
|
@end group
|
||||||
|
|
||||||
|
@group
|
||||||
|
expr: '(' expr ')'
|
||||||
|
| expr '+' expr
|
||||||
|
| expr '-' expr
|
||||||
|
| expr '*' expr
|
||||||
|
| expr '/' expr
|
||||||
|
| ID
|
||||||
|
;
|
||||||
|
@end group
|
||||||
|
@end example
|
||||||
|
|
||||||
|
When used as a normal @acronym{LALR}(1) grammar, Bison correctly complains
|
||||||
|
about one reduce/reduce conflict. In the conflicting situation the
|
||||||
|
parser chooses one of the alternatives, arbitrarily the one
|
||||||
|
declared first. Therefore the following correct input is not
|
||||||
|
recognized:
|
||||||
|
|
||||||
|
@example
|
||||||
|
type t = (a) .. b;
|
||||||
|
@end example
|
||||||
|
|
||||||
|
The parser can be turned into a @acronym{GLR} parser, while also telling Bison
|
||||||
|
to be silent about the one known reduce/reduce conflict, simply by
|
||||||
|
adding these two declarations to the Bison input file:
|
||||||
|
|
||||||
|
@example
|
||||||
|
%glr-parser
|
||||||
|
%expect-rr 1
|
||||||
|
@end example
|
||||||
|
|
||||||
|
@noindent
|
||||||
|
No change in the grammar itself is required. Now the
|
||||||
|
parser recognizes all valid declarations, according to the
|
||||||
|
limited syntax above, transparently. In fact, the user does not even
|
||||||
|
notice when the parser splits.
|
||||||
|
|
||||||
@node Locations Overview
|
@node Locations Overview
|
||||||
@section Locations
|
@section Locations
|
||||||
@cindex location
|
@cindex location
|
||||||
@@ -1290,7 +1494,7 @@ not require it. You can add or change white space as much as you wish.
|
|||||||
For example, this:
|
For example, this:
|
||||||
|
|
||||||
@example
|
@example
|
||||||
exp : NUM | exp exp '+' @{$$ = $1 + $2; @} | @dots{}
|
exp : NUM | exp exp '+' @{$$ = $1 + $2; @} | @dots{} ;
|
||||||
@end example
|
@end example
|
||||||
|
|
||||||
@noindent
|
@noindent
|
||||||
@@ -1300,6 +1504,7 @@ means the same thing as this:
|
|||||||
exp: NUM
|
exp: NUM
|
||||||
| exp exp '+' @{ $$ = $1 + $2; @}
|
| exp exp '+' @{ $$ = $1 + $2; @}
|
||||||
| @dots{}
|
| @dots{}
|
||||||
|
;
|
||||||
@end example
|
@end example
|
||||||
|
|
||||||
@noindent
|
@noindent
|
||||||
|
|||||||
Reference in New Issue
Block a user