mirror of
https://git.savannah.gnu.org/git/bison.git
synced 2026-03-10 21:03:04 +00:00
Initial check-in introducing experimental GLR parsing. See entry in
ChangeLog dated 2002-06-27 from Paul Hilfinger for details.
@@ -282,6 +282,7 @@ The Bison Parser Algorithm
* Parser States:: The parser is a finite-state-machine with stack.
* Reduce/Reduce:: When two rules are applicable in the same situation.
* Mystery Conflicts:: Reduce/reduce conflicts that look unjustified.
* Generalized LR Parsing:: Parsing arbitrary context-free grammars.
* Stack Overflow:: What happens when stack gets full. How to avoid it.

Operator Precedence
@@ -388,6 +389,7 @@ use Bison or Yacc, we suggest you start by reading this chapter carefully.
a semantic value (the value of an integer,
the name of an identifier, etc.).
* Semantic Actions:: Each rule can have an action containing C code.
* GLR Parsers:: Writing parsers for general context-free languages
* Locations Overview:: Tracking Locations.
* Bison Parser:: What are Bison's input and output,
how is the output used?
@@ -418,8 +420,12 @@ specify the language Algol 60. Any grammar expressed in BNF is a
context-free grammar. The input to Bison is essentially machine-readable
BNF.

Not all context-free languages can be handled by Bison, only those
that are LALR(1). In brief, this means that it must be possible to
@cindex LALR(1) grammars
@cindex LR(1) grammars
There are various important subclasses of context-free grammar. Although it
can handle almost all context-free grammars, Bison is optimized for what
are called LALR(1) grammars.
In brief, in these grammars, it must be possible to
tell how to parse any portion of an input string with just a single
token of look-ahead. Strictly speaking, that is a description of an
LR(1) grammar, and LALR(1) involves additional restrictions that are
@@ -427,6 +433,24 @@ hard to explain simply; but it is rare in actual practice to find an
LR(1) grammar that fails to be LALR(1). @xref{Mystery Conflicts, ,
Mysterious Reduce/Reduce Conflicts}, for more information on this.

@cindex GLR parsing
@cindex generalized LR (GLR) parsing
@cindex ambiguous grammars
@cindex non-deterministic parsing
Parsers for LALR(1) grammars are @dfn{deterministic}, meaning roughly that
the next grammar rule to apply at any point in the input is uniquely
determined by the preceding input and a fixed, finite portion (called
a @dfn{look-ahead}) of the remaining input.
A context-free grammar can be @dfn{ambiguous}, meaning that
there are multiple ways to apply the grammar rules to get the same inputs.
Even unambiguous grammars can be @dfn{non-deterministic}, meaning that no
fixed look-ahead always suffices to determine the next grammar rule to apply.
With the proper declarations, Bison is also able to parse these more general
context-free grammars, using a technique known as GLR parsing (for
Generalized LR). Bison's GLR parsers are able to handle any context-free
grammar for which the number of possible parses of any given string
is finite.

@cindex symbols (abstract)
@cindex token
@cindex syntactic grouping
@@ -632,6 +656,180 @@ expr: expr '+' expr @{ $$ = $1 + $3; @}
The action says how to produce the semantic value of the sum expression
from the values of the two subexpressions.

@node GLR Parsers
@section Writing GLR Parsers
@cindex GLR parsing
@cindex generalized LR (GLR) parsing
@findex %glr-parser
@cindex conflicts
@cindex shift/reduce conflicts

In some grammars, there will be cases where Bison's standard LALR(1)
parsing algorithm cannot decide whether to apply a certain grammar rule
at a given point. That is, it may not be able to decide (on the basis
of the input read so far) which of two possible reductions (applications
of a grammar rule) applies, or whether to apply a reduction or read more
of the input and apply a reduction later in the input. These are known
respectively as @dfn{reduce/reduce} conflicts (@pxref{Reduce/Reduce}),
and @dfn{shift/reduce} conflicts (@pxref{Shift/Reduce}).

To use a grammar that is not easily modified to be LALR(1), a more
general parsing algorithm is sometimes necessary. If you include
@code{%glr-parser} among the Bison declarations in your file
(@pxref{Grammar Outline}), the result will be a Generalized LR (GLR)
parser. These parsers handle Bison grammars that contain no unresolved
conflicts (i.e., after applying precedence declarations) identically to
LALR(1) parsers. However, when faced with unresolved shift/reduce and
reduce/reduce conflicts, GLR parsers use the simple expedient of doing
both, effectively cloning the parser to follow both possibilities. Each
of the resulting parsers can again split, so that at any given time,
there can be any number of possible parses being explored. The parsers
proceed in lockstep; that is, all of them consume (shift) a given input
symbol before any of them proceed to the next. Each of the cloned
parsers eventually meets one of two possible fates: either it runs into
a parsing error, in which case it simply vanishes, or it merges with
another parser, because the two of them have reduced the input to an
identical set of symbols.

During the time that there are multiple parsers, semantic actions are
recorded, but not performed. When a parser disappears, its recorded
semantic actions disappear as well, and are never performed. When a
reduction makes two parsers identical, causing them to merge, Bison
records both sets of semantic actions. Whenever the last two parsers
merge, reverting to the single-parser case, Bison resolves all the
outstanding actions either by precedences given to the grammar rules
involved, or by performing both actions, and then calling a designated
user-defined function on the resulting values to produce an arbitrary
merged result.

Let's consider an example, vastly simplified from C++.

@example
%@{
  #define YYSTYPE const char*
%@}

%token TYPENAME ID

%right '='
%left '+'

%glr-parser

%%

prog :
     | prog stmt   @{ printf ("\n"); @}
     ;

stmt : expr ';'  %dprec 1
     | decl      %dprec 2
     ;

expr : ID               @{ printf ("%s ", $$); @}
     | TYPENAME '(' expr ')'
                        @{ printf ("%s <cast> ", $1); @}
     | expr '+' expr    @{ printf ("+ "); @}
     | expr '=' expr    @{ printf ("= "); @}
     ;

decl : TYPENAME declarator ';'
                        @{ printf ("%s <declare> ", $1); @}
     | TYPENAME declarator '=' expr ';'
                        @{ printf ("%s <init-declare> ", $1); @}
     ;

declarator : ID         @{ printf ("\"%s\" ", $1); @}
     | '(' declarator ')'
     ;
@end example

@noindent
This models a problematic part of the C++ grammar---the ambiguity between
certain declarations and statements. For example,

@example
T (x) = y+z;
@end example

@noindent
parses as either an @code{expr} or a @code{decl}
(assuming that @samp{T} is recognized as a TYPENAME and @samp{x} as an ID).
Bison detects this as a reduce/reduce conflict between the rules
@code{expr : ID} and @code{declarator : ID}, which it cannot resolve at the
time it encounters @code{x} in the example above. The two @code{%dprec}
declarations, however, give precedence to interpreting the example as a
@code{decl}, which implies that @code{x} is a declarator.
The parser therefore prints

@example
"x" y z + T <init-declare>
@end example
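
The example grammar leaves the scanner unspecified; it only assumes that
@samp{T} arrives as a TYPENAME and @samp{x} as an ID. The following is a
minimal sketch of one way to produce such tokens, not part of the grammar
above. The function name @code{example_lex}, the token codes, and the rule
that an identifier starting with an upper-case letter is a type name are all
assumptions made for illustration; a real scanner would be called
@code{yylex}, would use the token codes that Bison generates, and would
consult a symbol table.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; a real parser would take these from the
   definitions Bison generates for "%token TYPENAME ID".  */
enum example_token { EX_EOF = 0, EX_TYPENAME = 258, EX_ID = 259 };

/* Scan one token starting at *cursor and advance *cursor past it.
   Identifier text is copied into buf (NUL-terminated, truncated to
   fit).  Assumption of this sketch: an identifier that starts with an
   upper-case letter is a type name, any other identifier is an ID.
   Any other character is returned as its own character code.  */
static int
example_lex (const char **cursor, char *buf, size_t bufsize)
{
  const char *p = *cursor;
  size_t n = 0;

  assert (bufsize > 0);
  buf[0] = '\0';

  /* Skip white space between tokens.  */
  while (isspace ((unsigned char) *p))
    p++;

  if (*p == '\0')
    {
      *cursor = p;
      return EX_EOF;
    }

  if (isalpha ((unsigned char) *p))
    {
      int code = isupper ((unsigned char) *p) ? EX_TYPENAME : EX_ID;
      while (isalnum ((unsigned char) *p))
        {
          if (n + 1 < bufsize)
            buf[n++] = *p;
          p++;
        }
      buf[n] = '\0';
      *cursor = p;
      return code;
    }

  /* Punctuation such as '(' or ';' stands for itself.  */
  *cursor = p + 1;
  return (unsigned char) *p;
}
```

Called repeatedly on @samp{T (x) = y+z;}, this sketch yields a type name,
@samp{(}, an identifier, @samp{)}, @samp{=}, and so on. In a real parser
the copied text would be stored in @code{yylval} so that the actions can
print it.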

Consider a different input string for this parser:

@example
T (x) + y;
@end example

@noindent
Here, there is no ambiguity (this cannot be parsed as a declaration).
However, at the time the Bison parser encounters @code{x}, it does not
have enough information to resolve the reduce/reduce conflict (again,
between @code{x} as an @code{expr} or a @code{declarator}). In this
case, no precedence declaration is used. Instead, the parser splits
into two, one assuming that @code{x} is an @code{expr}, and the other
assuming @code{x} is a @code{declarator}. The second of these parsers
then vanishes when it sees @code{+}, and the parser prints

@example
x T <cast> y +
@end example

Suppose that instead of resolving the ambiguity, you wanted to see all
the possibilities. For this purpose, we must @dfn{merge} the semantic
actions of the two possible parsers, rather than choosing one over the
other. To do so, you could change the declaration of @code{stmt} as
follows:

@example
stmt : expr ';'  %merge <stmtMerge>
     | decl      %merge <stmtMerge>
     ;
@end example

@noindent

and define the @code{stmtMerge} function as:

@example
static YYSTYPE stmtMerge (YYSTYPE x0, YYSTYPE x1)
@{
  printf ("<OR> ");
  return "";
@}
@end example

@noindent
with an accompanying forward declaration
in the C declarations at the beginning of the file:

@example
%@{
  #define YYSTYPE const char*
  static YYSTYPE stmtMerge (YYSTYPE x0, YYSTYPE x1);
%@}
@end example

@noindent
With these declarations, the resulting parser will parse the first example
as both an @code{expr} and a @code{decl}, and print

@example
"x" y z + T <init-declare> x T <cast> y z + = <OR>
@end example
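
The @code{stmtMerge} function shown above discards both of its arguments.
A merge function can instead build a result from the two values. The
following is a sketch under stated assumptions: the name
@code{stmtMergeKeepBoth} is hypothetical, both arguments are assumed to be
valid non-null strings, and the returned string is heap-allocated with no
provision here for freeing it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define YYSTYPE const char *

/* Combine the semantic values of two competing parses into one string
   of the form "<OR>(x0, x1)".  The result is allocated with malloc;
   who frees it is left open in this sketch.  */
static YYSTYPE
stmtMergeKeepBoth (YYSTYPE x0, YYSTYPE x1)
{
  /* sizeof counts the eight fixed characters plus the final NUL.  */
  size_t len = strlen (x0) + strlen (x1) + sizeof "<OR>(, )";
  char *merged = malloc (len);

  assert (merged != NULL);
  snprintf (merged, len, "<OR>(%s, %s)", x0, x1);
  return merged;
}
```

To try it, name this function in both @code{%merge} declarations in place
of @code{stmtMerge} and forward-declare it in the C declarations section.
Which alternative arrives as the first argument is not something this
sketch relies on.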


@node Locations Overview
@section Locations
@cindex location
@@ -2913,7 +3111,7 @@ the location of the grouping (the result of the computation). The second one
is an array holding locations of all right hand side elements of the rule
being matched. The last one is the size of the right hand side rule.

By default, it is defined this way:
By default, it is defined this way for simple LALR(1) parsers:

@example
@group
@@ -2925,6 +3123,19 @@ By default, it is defined this way:
@end group
@end example

@noindent
and like this for GLR parsers:

@example
@group
#define YYLLOC_DEFAULT(Current, Rhs, N)                 \
   Current.first_line   = YYRHSLOC(Rhs,1).first_line;   \
   Current.first_column = YYRHSLOC(Rhs,1).first_column; \
   Current.last_line    = YYRHSLOC(Rhs,N).last_line;    \
   Current.last_column  = YYRHSLOC(Rhs,N).last_column;
@end group
@end example
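
The GLR form of the macro expands to several statements, so a user-supplied
replacement is safer when wrapped as a single statement. The sketch below
shows that common C idiom and exercises it outside a parser. The
@code{YYLTYPE} structure and the @code{YYRHSLOC} accessor defined here are
stand-ins assumed for illustration (a plain array indexed from 1); in a
generated GLR parser both come from Bison, and @code{span_of} is a
hypothetical helper.

```c
#include <assert.h>

/* Stand-in location type, assumed to have the four usual members.  */
typedef struct
{
  int first_line;
  int first_column;
  int last_line;
  int last_column;
} YYLTYPE;

/* Hypothetical accessor: Rhs is a plain array whose element K is the
   location of the Kth right-hand side symbol (element 0 is unused).  */
#define YYRHSLOC(Rhs, K) ((Rhs)[K])

/* Same computation as the default shown above, wrapped in
   do ... while (0) so that it behaves as one statement.  */
#define YYLLOC_DEFAULT(Current, Rhs, N)                          \
  do                                                             \
    {                                                            \
      (Current).first_line   = YYRHSLOC (Rhs, 1).first_line;     \
      (Current).first_column = YYRHSLOC (Rhs, 1).first_column;   \
      (Current).last_line    = YYRHSLOC (Rhs, N).last_line;      \
      (Current).last_column  = YYRHSLOC (Rhs, N).last_column;    \
    }                                                            \
  while (0)

/* Compute the location spanning right-hand side symbols 1..n.  */
static YYLTYPE
span_of (YYLTYPE *rhs, int n)
{
  YYLTYPE result;

  assert (n >= 1);
  YYLLOC_DEFAULT (result, rhs, n);
  return result;
}
```

The resulting location starts where the first symbol starts and ends where
the last symbol ends, which is the behavior described for the default
definitions.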

When defining @code{YYLLOC_DEFAULT}, you should consider that:

@itemize @bullet
@@ -3890,6 +4101,7 @@ Return immediately from @code{yyparse}, indicating success.
@findex YYBACKUP
Unshift a token. This macro is allowed only for rules that reduce
a single value, and only when there is no look-ahead token.
It is also disallowed in GLR parsers.
It installs a look-ahead token with token type @var{token} and
semantic value @var{value}; then it discards the value that was
going to be reduced by this rule.
@@ -4030,6 +4242,7 @@ This kind of parser is known in the literature as a bottom-up parser.
* Parser States:: The parser is a finite-state-machine with stack.
* Reduce/Reduce:: When two rules are applicable in the same situation.
* Mystery Conflicts:: Reduce/reduce conflicts that look unjustified.
* Generalized LR Parsing:: Parsing arbitrary context-free grammars.
* Stack Overflow:: What happens when stack gets full. How to avoid it.
@end menu
@@ -4624,6 +4837,82 @@ return_spec:
;
@end example

@node Generalized LR Parsing
@section Generalized LR (GLR) Parsing
@cindex GLR parsing
@cindex generalized LR (GLR) parsing
@cindex ambiguous grammars
@cindex non-deterministic parsing

Bison produces @emph{deterministic} parsers that choose uniquely
when to reduce and which reduction to apply
based on a summary of the preceding input and on one extra token of lookahead.
As a result, normal Bison handles a proper subset of the family of
context-free languages.
Ambiguous grammars, since they have strings with more than one possible
sequence of reductions, cannot have deterministic parsers in this sense.
The same is true of languages that require more than one symbol of
lookahead, since the parser lacks the information necessary to make a
decision at the point it must be made in a shift-reduce parser.
Finally, as previously mentioned (@pxref{Mystery Conflicts}),
there are languages where Bison's particular choice of how to
summarize the input seen so far loses necessary information.

When you use the @samp{%glr-parser} declaration in your grammar file,
Bison generates a parser that uses a different algorithm, called
Generalized LR (or GLR). A Bison GLR parser uses the same basic
algorithm for parsing as an ordinary Bison parser, but behaves
differently in cases where there is a shift-reduce conflict that has not
been resolved by precedence rules (@pxref{Precedence}) or a
reduce-reduce conflict. When a GLR parser encounters such a situation, it
effectively @emph{splits} into several parsers, one for each possible
shift or reduction. These parsers then proceed as usual, consuming
tokens in lock-step. Some of the stacks may encounter other conflicts
and split further, with the result that instead of a sequence of states,
a Bison GLR parsing stack is what is in effect a tree of states.

In effect, each stack represents a guess as to what the proper parse
is. Additional input may indicate that a guess was wrong, in which case
the appropriate stack silently disappears. Otherwise, the semantic
actions generated in each stack are saved, rather than being executed
immediately. When a stack disappears, its saved semantic actions never
get executed. When a reduction causes two stacks to become equivalent,
their sets of semantic actions are both saved with the state that
results from the reduction. We say that two stacks are equivalent
when they both represent the same sequence of states,
and each pair of corresponding states represents a
grammar symbol that produces the same segment of the input token
stream.

Whenever the parser makes a transition from having multiple
states to having one, it reverts to the normal LALR(1) parsing
algorithm, after resolving and executing the saved-up actions.
At this transition, some of the states on the stack will have semantic
values that are sets (actually multisets) of possible actions. The
parser tries to pick one of the actions by first finding one whose rule
has the highest dynamic precedence, as set by the @samp{%dprec}
declaration. Otherwise, if the alternative actions are not ordered by
precedence, but the same merging function is declared for both
rules by the @samp{%merge} declaration,
Bison resolves and evaluates both and then calls the merge function on
the result. Otherwise, it reports an ambiguity.
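
The order of these three outcomes can be written down as a small decision
function. This is a sketch of the rule as worded in the paragraph above,
not Bison's implementation. The structure, the enumeration, and the
function name are hypothetical, and treating a rule without @samp{%dprec}
(represented here as 0) as ``not ordered'' is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record describing one saved alternative.  */
struct alternative
{
  int dprec;            /* value given by %dprec, or 0 if none */
  const char *merger;   /* function named by %merge, or NULL if none */
};

enum resolution { PREFER_FIRST, PREFER_SECOND, MERGE_BOTH, AMBIGUOUS };

/* Decide what to do with two alternatives for the same input segment:
   1. if both carry different precedences, the higher one wins;
   2. otherwise, if both name the same merge function, evaluate both
      and merge the results;
   3. otherwise report an ambiguity.  */
static enum resolution
resolve (const struct alternative *a, const struct alternative *b)
{
  assert (a != NULL && b != NULL);

  if (a->dprec != 0 && b->dprec != 0 && a->dprec != b->dprec)
    return a->dprec > b->dprec ? PREFER_FIRST : PREFER_SECOND;

  if (a->merger != NULL && b->merger != NULL
      && strcmp (a->merger, b->merger) == 0)
    return MERGE_BOTH;

  return AMBIGUOUS;
}
```

Applied to the @code{stmt} rules of the earlier example, the
@samp{%dprec 1} and @samp{%dprec 2} declarations select the @code{decl}
alternative, while the two @samp{%merge <stmtMerge>} declarations select
merging.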

It is possible to use a data structure for the GLR parsing tree that
permits the processing of any LALR(1) grammar in linear time (in the
size of the input), any unambiguous (not necessarily LALR(1)) grammar in
quadratic worst-case time, and any general (possibly ambiguous)
context-free grammar in cubic worst-case time. However, Bison currently
uses a simpler data structure that requires time proportional to the
length of the input times the maximum number of stacks required for any
prefix of the input. Thus, really ambiguous or non-deterministic
grammars can require exponential time and space to process. Such badly
behaving examples, however, are not generally of practical interest.
Usually, non-determinism in a grammar is local---the parser is ``in
doubt'' only for a few tokens at a time. Therefore, the current data
structure should generally be adequate. On LALR(1) portions of a
grammar, in particular, it is only slightly slower than with the default
Bison parser.
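
To give a feel for how quickly a ``really ambiguous'' grammar grows, the
sketch below counts the distinct parse trees of a chain of operands under
the rules @code{expr : expr '+' expr | ID} when no precedence or
associativity is declared. The count follows the Catalan numbers. It
measures parse trees, not the number of stacks Bison keeps alive, which
depends on how often stacks merge; the function name
@code{count_parses} is hypothetical.

```c
#include <assert.h>

/* Number of distinct parse trees for N operands joined by '+' under
   expr : expr '+' expr | ID, with no precedence declarations.
   A tree over LEN operands splits at its top-level '+' into a left
   part of LEFT operands and a right part of LEN - LEFT operands.  */
static unsigned long long
count_parses (unsigned n)
{
  unsigned long long ways[32] = { 0 };

  /* The table bounds the input; results up to n = 31 fit in 64 bits.  */
  assert (n >= 1 && n < 32);
  ways[1] = 1;
  for (unsigned len = 2; len <= n; len++)
    for (unsigned left = 1; left < len; left++)
      ways[len] += ways[left] * ways[len - left];
  return ways[n];
}
```

Three operands already have two parses, ten have 4862, and twenty have
more than a billion. A @samp{%left '+'} declaration removes this ambiguity
entirely, which matches the remark that badly behaved cases are rarely of
practical interest.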

@node Stack Overflow
@section Stack Overflow, and How to Avoid It
@cindex stack overflow
@@ -5912,10 +6201,17 @@ Equip the parser for debugging. @xref{Decl Summary}.
Bison declaration to create a header file meant for the scanner.
@xref{Decl Summary}.

@item %dprec
Bison declaration to assign a precedence to a rule that is used at parse
time to resolve reduce/reduce conflicts. @xref{GLR Parsers}.

@item %file-prefix="@var{prefix}"
Bison declaration to set tge prefix of the output files. @xref{Decl
Bison declaration to set the prefix of the output files. @xref{Decl
Summary}.

@item %glr-parser
Bison declaration to produce a GLR parser. @xref{GLR Parsers}.

@c @item %source-extension
@c Bison declaration to specify the generated parser output file extension.
@c @xref{Decl Summary}.
@@ -5928,6 +6224,12 @@ Summary}.
Bison declaration to assign left associativity to token(s).
@xref{Precedence Decl, ,Operator Precedence}.

@item %merge
Bison declaration to assign a merging function to a rule. If there is a
reduce/reduce conflict with a rule having the same merging function, the
function is applied to the two semantic values to get a single result.
@xref{GLR Parsers}.

@item %name-prefix="@var{prefix}"
Bison declaration to rename the external symbols. @xref{Decl Summary}.
@@ -6040,6 +6342,13 @@ machine. In the case of the parser, the input is the language being
parsed, and the states correspond to various stages in the grammar
rules. @xref{Algorithm, ,The Bison Parser Algorithm }.

@item Generalized LR (GLR)
A parsing algorithm that can handle all context-free grammars, including those
that are not LALR(1). It resolves situations that Bison's usual LALR(1)
algorithm cannot by effectively splitting off multiple parsers, trying all
possible parsers, and discarding those that fail in the light of additional
right context. @xref{Generalized LR Parsing, ,Generalized LR Parsing}.

@item Grouping
A language construct that is (in general) grammatically divisible;
for example, `expression' or `declaration' in C.