parse.lac: document.

* NEWS (2.5): Add entry for LAC, and mention LAC in entry for
other corrections to verbose syntax error messages.
* doc/bison.texinfo (Decl Summary): Rewrite entries for
lr.default-reductions and lr.type to be clearer, to mention
%nonassoc's effect on canonical LR, and to mention LAC.  Add entry
for parse.lac.
(Glossary): Add entry for LAC.
Joel E. Denny
2010-12-19 22:12:32 -05:00
parent 107844a3ee
commit fcf834f9ec
3 changed files with 214 additions and 59 deletions

@@ -1,3 +1,14 @@
2010-12-19 Joel E. Denny <jdenny@clemson.edu>
parse.lac: document.
* NEWS (2.5): Add entry for LAC, and mention LAC in entry for
other corrections to verbose syntax error messages.
* doc/bison.texinfo (Decl Summary): Rewrite entries for
lr.default-reductions and lr.type to be clearer, to mention
%nonassoc's effect on canonical LR, and to mention LAC. Add entry
for parse.lac.
(Glossary): Add entry for LAC.
2010-12-11 Joel E. Denny <jdenny@clemson.edu>
parse.lac: implement exploratory stack reallocations.

NEWS

@@ -117,6 +117,46 @@ Bison News
These features are experimental. More user feedback will help to
stabilize them.
** LAC (lookahead correction) for syntax error handling:
Canonical LR, IELR, and LALR can suffer from a couple of problems
upon encountering a syntax error. First, the parser might perform
additional parser stack reductions before discovering the syntax
error. Such reductions perform user semantic actions that are
unexpected because they are based on an invalid token, and they
cause error recovery to begin in a different syntactic context than
the one in which the invalid token was encountered. Second, when
verbose error messages are enabled (with %error-verbose or `#define
YYERROR_VERBOSE'), the expected token list in the syntax error
message can both contain invalid tokens and omit valid tokens.
The culprits for the above problems are %nonassoc, default
reductions in inconsistent states, and parser state merging. Thus,
IELR and LALR suffer the most. Canonical LR can suffer only if
%nonassoc is used or if default reductions are enabled for
inconsistent states.
LAC is a new mechanism within the parsing algorithm that completely
solves these problems for canonical LR, IELR, and LALR without
sacrificing %nonassoc, default reductions, or state merging. When
LAC is in use, canonical LR and IELR behave exactly the same for
both syntactically acceptable and syntactically unacceptable input.
While LALR still does not support the full language-recognition
power of canonical LR and IELR, LAC at least enables LALR's syntax
error handling to correctly reflect LALR's language-recognition
power.
Currently, LAC is only supported for deterministic parsers in C.
You can enable LAC with the following directive:
%define parse.lac full
See the documentation for `%define parse.lac' in the section `Bison
Declaration Summary' in the Bison manual for additional details.
LAC is an experimental feature. More user feedback will help to
stabilize it.
** Unrecognized %code qualifiers are now an error not a warning.
** %define improvements.
@@ -225,11 +265,11 @@ Bison News
** Verbose syntax error message fixes:
-When %error-verbose or `#define YYERROR_VERBOSE' is specified, syntax
-error messages produced by the generated parser include the unexpected
-token as well as a list of expected tokens. The effect of %nonassoc
-on these verbose messages has been corrected in two ways, but
-additional fixes are still being implemented:
+When %error-verbose or `#define YYERROR_VERBOSE' is specified,
+syntax error messages produced by the generated parser include the
+unexpected token as well as a list of expected tokens. The effect
+of %nonassoc on these verbose messages has been corrected in two
+ways, but a complete fix requires LAC, described above:
*** When %nonassoc is used, there can exist parser states that accept no
tokens, and so the parser does not always require a lookahead token
@@ -248,16 +288,18 @@ Bison News
tokens are now properly omitted from the list.
*** Expected token lists are still often wrong due to state merging
-(from LALR or IELR) and default reductions, which can both add and
-subtract valid tokens. Canonical LR almost completely fixes this
-problem by eliminating state merging and default reductions.
-However, there is one minor problem left even when using canonical
-LR and even after the fixes above. That is, if the resolution of a
-conflict with %nonassoc appears in a later parser state than the one
-at which some syntax error is discovered, the conflicted token is
-still erroneously included in the expected token list. We are
-currently working on a fix to eliminate this problem and to
-eliminate the need for canonical LR.
+(from LALR or IELR) and default reductions, which can both add
+invalid tokens and subtract valid tokens. Canonical LR almost
+completely fixes this problem by eliminating state merging and
+default reductions. However, there is one minor problem left even
+when using canonical LR and even after the fixes above. That is,
+if the resolution of a conflict with %nonassoc appears in a later
+parser state than the one at which some syntax error is
+discovered, the conflicted token is still erroneously included in
+the expected token list. Bison's new LAC implementation,
+described above, eliminates this problem and the need for
+canonical LR. However, LAC is still experimental and is disabled
+by default.
** Destructor calls fixed for lookaheads altered in semantic actions. ** Destructor calls fixed for lookaheads altered in semantic actions.
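
The NEWS entry above enables LAC with `%define parse.lac full` and ties it to verbose error messages and %nonassoc. As a minimal sketch of how those pieces might fit together in a grammar file, the following is illustrative only and is not part of this commit: the file name `lac-demo.y`, the toy expression grammar, and the hand-written `yylex`, `yyerror`, and `main` are all assumptions made for the example.

```yacc
/* lac-demo.y: hypothetical example, not taken from the commit.  */
%define parse.lac full   /* enable LAC; the documented default is "none" */
%error-verbose           /* verbose messages include an expected-token list */

%{
#include <ctype.h>
#include <stdio.h>
int yylex (void);
void yyerror (char const *msg);
%}

%token NUM
%nonassoc '<'            /* chaining, as in "1 < 2 < 3", is a syntax error */
%left '+'

%%
input:
  /* empty */
| input line
;

line:
  '\n'
| exp '\n'
| error '\n'   { yyerrok; }   /* resynchronize at end of line */
;

exp:
  NUM
| exp '+' exp
| exp '<' exp
;
%%

/* Toy scanner (an assumption for this sketch): digits form NUM,
   blanks are skipped, any other character is returned as itself.  */
int
yylex (void)
{
  int c;
  while ((c = getchar ()) == ' ' || c == '\t')
    continue;
  if (c == EOF)
    return 0;
  if (isdigit (c))
    {
      while (isdigit (c = getchar ()))
        continue;
      ungetc (c, stdin);
      return NUM;
    }
  return c;
}

void
yyerror (char const *msg)
{
  fprintf (stderr, "%s\n", msg);
}

int
main (void)
{
  return yyparse ();
}
```

Assuming the file is saved as `lac-demo.y`, something like `bison lac-demo.y && cc lac-demo.tab.c` should build it. Feeding it a line such as `1 < 2 < 3` triggers a syntax error at the second `<` because of %nonassoc; comparing the reported expected-token list with and without the `parse.lac` line is one way to observe the behavior the NEWS entry describes.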


@@ -5230,57 +5230,61 @@ Boolean.
@findex %define lr.default-reductions
@cindex delayed syntax errors
@cindex syntax errors delayed
+@cindex @acronym{LAC}
+@findex %nonassoc
@itemize @bullet
@item Language(s): all
-@item Purpose: Specifies the kind of states that are permitted to
+@item Purpose: Specify the kind of states that are permitted to
contain default reductions.
-That is, in such a state, Bison declares the reduction with the largest
-lookahead set to be the default reduction and then removes that
+That is, in such a state, Bison selects the reduction with the largest
+lookahead set to be the default parser action and then removes that
lookahead set.
-The advantages of default reductions are discussed below.
-The disadvantage is that, when the generated parser encounters a
-syntactically unacceptable token, the parser might then perform
-unnecessary default reductions before it can detect the syntax error.
-(This feature is experimental.
+(The ability to specify where default reductions should be used is
+experimental.
More user feedback will help to stabilize it.)
@item Accepted Values:
@itemize
@item @code{all}.
-For @acronym{LALR} and @acronym{IELR} parsers (@pxref{Decl
-Summary,,lr.type}) by default, all states are permitted to contain
-default reductions.
-The advantage is that parser table sizes can be significantly reduced.
-The reason Bison does not by default attempt to address the disadvantage
-of delayed syntax error detection is that this disadvantage is already
-inherent in @acronym{LALR} and @acronym{IELR} parser tables.
-That is, unlike in a canonical @acronym{LR} state, the lookahead sets of
-reductions in an @acronym{LALR} or @acronym{IELR} state can contain
-tokens that are syntactically incorrect for some left contexts.
+This is the traditional Bison behavior.
+The main advantage is a significant decrease in the size of the parser
+tables.
+The disadvantage is that, when the generated parser encounters a
+syntactically unacceptable token, the parser might then perform
+unnecessary default reductions before it can detect the syntax error.
+Such delayed syntax error detection is usually inherent in
+@acronym{LALR} and @acronym{IELR} parser tables anyway due to
+@acronym{LR} state merging (@pxref{Decl Summary,,lr.type}).
+Furthermore, the use of @code{%nonassoc} can contribute to delayed
+syntax error detection even in the case of canonical @acronym{LR}.
+As an experimental feature, delayed syntax error detection can be
+overcome in all cases by enabling @acronym{LAC} (@pxref{Decl
+Summary,,parse.lac}, for details, including a discussion of the effects
+of delayed syntax error detection).
@item @code{consistent}.
@cindex consistent states
A consistent state is a state that has only one possible action.
If that action is a reduction, then the parser does not need to request
a lookahead token from the scanner before performing that action.
-However, the parser only recognizes the ability to ignore the lookahead
-token when such a reduction is encoded as a default reduction.
-Thus, if default reductions are permitted in and only in consistent
-states, then a canonical @acronym{LR} parser reports a syntax error as
-soon as it @emph{needs} the syntactically unacceptable token from the
-scanner.
+However, the parser recognizes the ability to ignore the lookahead token
+in this way only when such a reduction is encoded as a default
+reduction.
+Thus, if default reductions are permitted only in consistent states,
+then a canonical @acronym{LR} parser that does not employ
+@code{%nonassoc} detects a syntax error as soon as it @emph{needs} the
+syntactically unacceptable token from the scanner.
@item @code{accepting}.
@cindex accepting state
-By default, the only default reduction permitted in a canonical
-@acronym{LR} parser is the accept action in the accepting state, which
-the parser reaches only after reading all tokens from the input.
-Thus, the default canonical @acronym{LR} parser reports a syntax error
-as soon as it @emph{reaches} the syntactically unacceptable token
-without performing any extra reductions.
+In the accepting state, the default reduction is actually the accept
+action.
+In this case, a canonical @acronym{LR} parser that does not employ
+@code{%nonassoc} detects a syntax error as soon as it @emph{reaches} the
+syntactically unacceptable token in the input.
+That is, it does not perform any extra reductions.
@end itemize
@item Default Value:
@@ -5400,17 +5404,23 @@ This can significantly reduce the complexity of developing of a grammar.
@item @code{canonical-lr}.
@cindex delayed syntax errors
@cindex syntax errors delayed
-The only advantage of canonical @acronym{LR} over @acronym{IELR} is
-that, for every left context of every canonical @acronym{LR} state, the
-set of tokens accepted by that state is the exact set of tokens that is
-syntactically acceptable in that left context.
-Thus, the only difference in parsing behavior is that the canonical
-@acronym{LR} parser can report a syntax error as soon as possible
-without performing any unnecessary reductions.
-@xref{Decl Summary,,lr.default-reductions}, for further details.
-Even when canonical @acronym{LR} behavior is ultimately desired,
-@acronym{IELR}'s elimination of duplicate conflicts should still
-facilitate the development of a grammar.
+@cindex @acronym{LAC}
+@findex %nonassoc
+While inefficient, canonical @acronym{LR} parser tables can be an
+interesting means to explore a grammar because they have a property that
+@acronym{IELR} and @acronym{LALR} tables do not.
+That is, if @code{%nonassoc} is not used and default reductions are left
+disabled (@pxref{Decl Summary,,lr.default-reductions}), then, for every
+left context of every canonical @acronym{LR} state, the set of tokens
+accepted by that state is guaranteed to be the exact set of tokens that
+is syntactically acceptable in that left context.
+It might then seem that an advantage of canonical @acronym{LR} parsers
+in production is that, under the above constraints, they are guaranteed
+to detect a syntax error as soon as possible without performing any
+unnecessary reductions.
+However, @acronym{IELR} parsers using @acronym{LAC} (@pxref{Decl
+Summary,,parse.lac}) are also able to achieve this behavior without
+sacrificing @code{%nonassoc} or default reductions.
@end itemize
@item Default Value: @code{lalr}
@@ -5448,7 +5458,7 @@ destroyed properly. This option checks these constraints.
@findex %define parse.error
@itemize
@item Languages(s):
-all.
+all
@item Purpose:
Control the kind of error messages passed to the error reporting
function. @xref{Error Reporting, ,The Error Reporting Function
@@ -5469,6 +5479,90 @@ ones.
@c parse.error
@c ================================================== parse.lac
@item parse.lac
@findex %define parse.lac
@cindex @acronym{LAC}
@cindex lookahead correction
@itemize
@item Language(s): C
@item Purpose: Enable @acronym{LAC} (lookahead correction) to improve
syntax error handling.
Canonical @acronym{LR}, @acronym{IELR}, and @acronym{LALR} can suffer
from a couple of problems upon encountering a syntax error. First, the
parser might perform additional parser stack reductions before
discovering the syntax error. Such reductions perform user semantic
actions that are unexpected because they are based on an invalid token,
and they cause error recovery to begin in a different syntactic context
than the one in which the invalid token was encountered. Second, when
verbose error messages are enabled (with @code{%error-verbose} or
@code{#define YYERROR_VERBOSE}), the expected token list in the syntax
error message can both contain invalid tokens and omit valid tokens.
The culprits for the above problems are @code{%nonassoc}, default
reductions in inconsistent states, and parser state merging. Thus,
@acronym{IELR} and @acronym{LALR} suffer the most. Canonical
@acronym{LR} can suffer only if @code{%nonassoc} is used or if default
reductions are enabled for inconsistent states.
@acronym{LAC} is a new mechanism within the parsing algorithm that
completely solves these problems for canonical @acronym{LR},
@acronym{IELR}, and @acronym{LALR} without sacrificing @code{%nonassoc},
default reductions, or state merging. Conceptually, the mechanism is
straightforward. Whenever the parser fetches a new token from the
scanner so that it can determine the next parser action, it immediately
suspends normal parsing and performs an exploratory parse using a
temporary copy of the normal parser state stack. During this
exploratory parse, the parser does not perform user semantic actions.
If the exploratory parse reaches a shift action, normal parsing then
resumes on the normal parser stacks. If the exploratory parse reaches
an error instead, the parser reports a syntax error. If verbose syntax
error messages are enabled, the parser must then discover the list of
expected tokens, so it performs a separate exploratory parse for each
token in the grammar.
There is one subtlety about the use of @acronym{LAC}. That is, when in
a consistent parser state with a default reduction, the parser will not
attempt to fetch a token from the scanner because no lookahead is needed
to determine the next parser action. Thus, whether default reductions
are enabled in consistent states (@pxref{Decl
Summary,,lr.default-reductions}) affects how soon the parser detects a
syntax error: when it @emph{reaches} an erroneous token or when it
eventually @emph{needs} that token as a lookahead. The latter behavior
is probably more intuitive, so Bison currently provides no way to
achieve the former behavior while default reductions are fully enabled.
Thus, when @acronym{LAC} is in use, for some fixed decision of whether
to enable default reductions in consistent states, canonical
@acronym{LR} and @acronym{IELR} behave exactly the same for both
syntactically acceptable and syntactically unacceptable input. While
@acronym{LALR} still does not support the full language-recognition
power of canonical @acronym{LR} and @acronym{IELR}, @acronym{LAC} at
least enables @acronym{LALR}'s syntax error handling to correctly
reflect @acronym{LALR}'s language-recognition power.
Because @acronym{LAC} requires many parse actions to be performed twice,
it can have a performance penalty. However, not all parse actions must
be performed twice. Specifically, during a series of default reductions
in consistent states and shift actions, the parser never has to initiate
an exploratory parse. Moreover, the most time-consuming tasks in a
parse are often the file I/O, the lexical analysis performed by the
scanner, and the user's semantic actions, but none of these are
performed during the exploratory parse. Finally, the base of the
temporary stack used during an exploratory parse is a pointer into the
normal parser state stack so that the stack is never physically copied.
In our experience, the performance penalty of @acronym{LAC} has proven
insignificant for practical grammars.
@item Accepted Values: @code{none}, @code{full}
@item Default Value: @code{none}
@end itemize
@c parse.lac
@c ================================================== parse.trace
@item parse.trace
@findex %define parse.trace
@@ -11241,6 +11335,14 @@ performs some operation.
@item Input stream
A continuous flow of data between devices or programs.
@item @acronym{LAC} (Lookahead Correction)
A parsing mechanism that fixes the problem of delayed syntax error
detection, which is caused by LR state merging, default reductions, and
the use of @code{%nonassoc}. Delayed syntax error detection results in
unexpected semantic actions, initiation of error recovery in the wrong
syntactic context, and an incorrect list of expected tokens in a verbose
syntax error message. @xref{Decl Summary,,parse.lac}.
@item Language construct
One of the typical usage schemas of the language. For example, one of
the constructs of the C language is the @code{if} statement.
@@ -11397,7 +11499,7 @@ grammatically indivisible. The piece of text it represents is a token.
@c LocalWords: hbox hss hfill tt ly yyin fopen fclose ofirst gcc ll lookahead
@c LocalWords: nbar yytext fst snd osplit ntwo strdup AST Troublereporting th
@c LocalWords: YYSTACK DVI fdl printindex IELR nondeterministic nonterminals ps
-@c LocalWords: subexpressions declarator nondeferred config libintl postfix
+@c LocalWords: subexpressions declarator nondeferred config libintl postfix LAC
@c LocalWords: preprocessor nonpositive unary nonnumeric typedef extern rhs
@c LocalWords: yytokentype filename destructor multicharacter nonnull EBCDIC
@c LocalWords: lvalue nonnegative XNUM CHR chr TAGLESS tagless stdout api TOK