New section "Simple GLR Parsers".

2026-03-20 01:33:03 +00:00 · 2004-06-21 20:55:20 +00:00
parent 85f23fae45
commit 99a9344e77
2 changed files with 212 additions and 3 deletions
--- a/4
+++ b/4
@@ -1,3 +1,7 @@
 2004-06-21  Frank Heckenbach  <frank@g-n-u.de>
 	* doc/bison.texinfo (Simple GLR Parsers): New section.
 2004-06-21  Paul Eggert  <eggert@cs.ucla.edu>
 	* NEWS, TODO, doc/bison.texinfo:
--- a/doc/bison.texinfo
+++ b/doc/bison.texinfo
@@ -135,7 +135,8 @@ The Concepts of Bison
                        a semantic value (the value of an integer,
                        the name of an identifier, etc.).
 * Semantic Actions::  Each rule can have an action containing C code.
-* GLR Parsers::       Writing parsers for general context-free languages
+* GLR Parsers::       Writing parsers for general context-free languages.
 * Simple GLR Parsers::    Using GLR in its simplest form.
 * Locations Overview::    Tracking Locations.
 * Bison Parser::      What are Bison's input and output,
                        how is the output used?
@@ -381,7 +382,8 @@ use Bison or Yacc, we suggest you start by reading this chapter carefully.
                        a semantic value (the value of an integer,
                        the name of an identifier, etc.).
 * Semantic Actions::  Each rule can have an action containing C code.
-* GLR Parsers::       Writing parsers for general context-free languages
+* GLR Parsers::       Writing parsers for general context-free languages.
 * Simple GLR Parsers::    Using GLR in its simplest form.
 * Locations Overview::    Tracking Locations.
 * Bison Parser::      What are Bison's input and output,
                        how is the output used?
@@ -860,6 +862,208 @@ will suffice.  Otherwise, we suggest
 %@}
@end example
@node Simple GLR Parsers
@section Using @acronym{GLR} in its Simplest Form
@cindex @acronym{GLR} parsing, unambiguous grammars
@cindex generalized @acronym{LR} (@acronym{GLR}) parsing, unambiguous grammars
@findex %glr-parser
@findex %expect-rr
@cindex conflicts
@cindex reduce/reduce conflicts
 The C++ example for @acronym{GLR} (@pxref{GLR Parsers}) explains how to use
 the @acronym{GLR} parsing algorithm with some advanced features such as
@samp{%dprec} and @samp{%merge} to handle syntactically ambiguous
 grammars.  However, the @acronym{GLR} algorithm can also be used in a simpler
 way to parse grammars that are unambiguous, but fail to be @acronym{LALR}(1).
 Such grammars typically require more than one symbol of lookahead,
 or (in rare cases) fall into the category of grammars in which the
@acronym{LALR}(1) algorithm throws away too much information (they are in
@acronym{LR}(1), but not @acronym{LALR}(1), @ref{Mystery Conflicts}).
 Here is an example of this situation, using a problem that
 arises in the declaration of enumerated and subrange types in the
 programming language Pascal.  These declarations look like this:
@example
 type subrange = lo .. hi;
 type enum = (a, b, c);
@end example
@noindent
 The original language standard allows only numeric
 literals and constant identifiers for the subrange bounds (@samp{lo}
 and @samp{hi}), but Extended Pascal (ISO/IEC 10206:1990) and many other
 Pascal implementations allow arbitrary expressions there.  This gives
 rise to the following situation, containing a superfluous pair of
 parentheses:
@example
 type subrange = (a) .. b;
@end example
@noindent
 Compare this to the following declaration of an enumerated
 type with only one value:
@example
 type enum = (a);
@end example
@noindent
 (These declarations are contrived, but they are syntactically
 valid, and more-complicated cases can come up in practical programs.)
 These two declarations look identical until the @samp{..} token.
 With normal @acronym{LALR}(1) one-token look-ahead it is not
 possible to decide between the two forms when the identifier
@samp{a} is parsed.  It is, however, desirable
 for a parser to decide this, since in the latter case
@samp{a} must become a new identifier to represent the enumeration
 value, while in the former case @samp{a} must be evaluated with its
 current meaning, which may be a constant or even a function call.
 You could parse @samp{(a)} as an ``unspecified identifier in parentheses'',
 to be resolved later, but this typically requires substantial
 contortions in both semantic actions and large parts of the
 grammar, where the parentheses are nested in the recursive rules for
 expressions.
 You might think of using the lexer to distinguish between the two
 forms by returning different tokens for currently defined and
 undefined identifiers.  But if these declarations occur in a local
 scope, and @samp{a} is defined in an outer scope, then both forms
 are possible---either locally redefining @samp{a}, or using the
 value of @samp{a} from the outer scope.  So this approach cannot
 work.
 A solution to this problem is to use a @acronym{GLR} parser in its simplest
 form, i.e., without using special features such as @samp{%dprec} and
@samp{%merge}.  When the @acronym{GLR} parser reaches the critical state, it
 simply splits into two branches and pursues both syntax rules
 simultaneously.  Sooner or later, one of them runs into a parsing
 error.  If there is a @samp{..} token before the next
@samp{;}, the rule for enumerated types fails since it cannot
 accept @samp{..} anywhere; otherwise, the subrange type rule
 fails since it requires a @samp{..} token.  So one of the branches
 fails silently, and the other one continues normally, performing
 all the intermediate actions that were postponed during the split.
 If the input is syntactically incorrect, both branches fail and the parser
 reports a syntax error as usual.
 The effect of all this is that the parser seems to ``guess'' the
 correct branch to take, or in other words, it seems to use more
 look-ahead than the underlying @acronym{LALR}(1) algorithm actually allows
 for.  In this example, @acronym{LALR}(2) would suffice, but also some cases
 that are not @acronym{LALR}(@math{k}) for any @math{k} can be handled this way.
 Since there can be only two branches and at least one of them
 must fail, you need not worry about merging the branches by
 using dynamic precedence or @samp{%merge}.
 Another potential problem of @acronym{GLR} does not arise here, either.  In
 general, a @acronym{GLR} parser can take quadratic or cubic worst-case time,
 and the current Bison parser even takes exponential time and space
 for some grammars.  In practice, this rarely happens, and for many
 grammars it is possible to prove that it cannot happen.  In
 in the present example, there is only one conflict between two
 rules, and the type-declaration context where the conflict
 arises cannot be nested.  So the number of
 branches that can exist at any time is limited by the constant 2,
 and the parsing time is still linear.
 So here we have a case where we can use the benefits of @acronym{GLR}, almost
 without disadvantages.  There are two things to note, though.
 First, one should carefully analyze the conflicts reported by
 Bison to make sure that @acronym{GLR} splitting is done only where it is
 intended to be.  A @acronym{GLR} parser splitting inadvertently may cause
 problems less obvious than an @acronym{LALR} parser statically choosing the
 wrong alternative in a conflict.
 Second, interactions with the lexer (@pxref{Semantic Tokens}) must
 be considered with great care.  Since a split parser consumes tokens
 without performing any actions during the split, the lexer cannot
 obtain information via parser actions.  Some cases of
 lexer interactions can simply be eliminated by using @acronym{GLR}, i.e.,
 shifting the complications from the lexer to the parser.  Remaining
 cases have to be checked for safety.
 In our example, it would be safe for the lexer to return tokens
 based on their current meanings in some symbol table, because no new
 symbols are defined in the middle of a type declaration.  Though it
 is possible for a parser to define the enumeration
 constants as they are parsed, before the type declaration is
 completed, it actually makes no difference since they cannot be used
 within the same enumerated type declaration.
 Here is a Bison grammar corresponding to the example above.  It
 parses a vastly simplified form of Pascal type declarations.
@example
 %token TYPE DOTDOT ID
@group
 %left '+' '-'
 %left '*' '/'
@end group
 %%
@group
 type_decl:
          TYPE ID '=' type ';'
 ;
@end group
@group
 type:     '(' id_list ')'
        | expr DOTDOT expr
 ;
@end group
@group
 id_list:  ID
        | id_list ',' ID
 ;
@end group
@group
 expr:     '(' expr ')'
        | expr '+' expr
        | expr '-' expr
        | expr '*' expr
        | expr '/' expr
        | ID
 ;
@end group
@end example
 When used as a normal @acronym{LALR}(1) grammar, Bison correctly complains
 about one reduce/reduce conflict.  In the conflicting situation the
 parser chooses one of the alternatives, arbitrarily the one
 declared first.  Therefore the following correct input is not
 recognized:
@example
 type t = (a) .. b;
@end example
 The parser can be turned into a @acronym{GLR} parser, while also telling Bison
 to be silent about the one known reduce/reduce conflict, simply by
 adding these two declarations to the Bison input file:
@example
 %glr-parser
 %expect-rr 1
@end example
@noindent
 No change in the grammar itself is required.  Now the
 parser recognizes all valid declarations, according to the
 limited syntax above, transparently.  In fact, the user does not even
 notice when the parser splits.
@node Locations Overview
@section Locations
@cindex location
@@ -1290,7 +1494,7 @@ not require it.  You can add or change white space as much as you wish.
 For example, this:
@example
-exp   : NUM | exp exp '+' @{$$ = $1 + $2; @} | @dots{}
+exp   : NUM | exp exp '+' @{$$ = $1 + $2; @} | @dots{} ;
@end example
@noindent
@@ -1300,6 +1504,7 @@ means the same thing as this:
 exp:      NUM
        | exp exp '+'    @{ $$ = $1 + $2; @}
        | @dots{}
 ;
@end example
@noindent