Currently our scanner decodes all the escapes in the strings, and we
later reescape the strings when we emit them.
This is troublesome, as we do not respect the user's input. For
instance, when the user writes in UTF-8, we mangle the string when we
write it back. And this shows everywhere: in the reports we show the
escaped string instead of the actual alias:
0 $accept: . exp $end
1 exp: . exp "\342\212\225" exp
2 | . exp "+" exp
3 | . exp "+" exp
4 | . "number"
5 | . "\303\221\303\271\341\271\203\303\251\342\204\235\303\264"
"number" shift, and go to state 1
"\303\221\303\271\341\271\203\303\251\342\204\235\303\264" shift, and go to state 2
This commit preserves the user's exact spelling of the string aliases,
instead of interpreting the escapes and then reescaping. The report
now shows:
0 $accept: . exp $end
1 exp: . exp "⊕" exp
2 | . exp "+" exp
3 | . exp "+" exp
4 | . "number"
5 | . "Ñùṃéℝô"
"number" shift, and go to state 1
"Ñùṃéℝô" shift, and go to state 2
Likewise, the XML (and therefore HTML) outputs are fixed.
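As an illustration of "unquote where needed", here is a minimal sketch
of decoding the escapes of an alias only at the point where the decoded
bytes are required. The name unquote_sketch and its interface are
hypothetical and are not the code added to src/parse-gram.y; only a few
escapes are handled, and out-of-range octal values are not diagnosed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not Bison's unquote): decode the backslash
   escapes of SRC, the text between the double quotes as spelled by
   the user, into DST.  Handles \n, \t and octal escapes of one to
   three digits; any other escaped character (such as \\ or \") is
   copied as-is.  DST must have room for strlen (SRC) + 1 bytes.
   Returns the decoded length.  */
static size_t
unquote_sketch (char *dst, const char *src)
{
  assert (dst != NULL && src != NULL);
  size_t n = 0;
  while (*src)
    {
      if (*src != '\\' || src[1] == '\0')
        {
          /* Regular byte (or a trailing lone backslash): keep it.  */
          dst[n++] = *src++;
          continue;
        }
      ++src;                    /* Skip the backslash.  */
      if ('0' <= *src && *src <= '7')
        {
          /* Octal escape.  Values above 0377 are not diagnosed here.  */
          int c = 0;
          for (int i = 0; i < 3 && '0' <= *src && *src <= '7'; ++i)
            c = c * 8 + (*src++ - '0');
          dst[n++] = (char) (c & 0xFF);
        }
      else
        {
          char c = *src++;
          dst[n++] = c == 'n' ? '\n' : c == 't' ? '\t' : c;
        }
    }
  dst[n] = '\0';
  return n;
}
```

With such a helper the symbol table can keep the user's spelling, and
the three bytes of "\342\212\225" are only produced when a decoded
string is really needed.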
* src/scan-gram.l (STRING, TSTRING): Do not interpret the escapes in
the resulting string.
* src/parse-gram.y (unquote, parser_init, parser_free, unquote_free)
(handle_defines, handle_language, obstack_for_unquote): New.
Use them to unquote where needed.
* tests/regression.at, tests/report.at: Update.
* gnulib: Update.
* bootstrap.conf: Use attribute.
* src/system.h: Remove macros for attributes.
Adjust dependencies.
* src/scan-gram.l (DEPRECATED): Rename as...
(DEPRECATED_DIRECTIVE): this, to avoid the clash with the DEPRECATED macro.
On an invalid character literal such as "'\777'" we used to produce
two errors:
input.y:2.9-12: error: invalid number after \-escape: 777
input.y:2.8-13: error: empty character literal
Get rid of the second one.
* src/scan-gram.l (STRING_GROW_ESCAPE): New.
* tests/input.at: Adjust.
In addition to
%token NUM "number"
accept
%token NUM _("number")
in which case the token will be translated in error messages.
Do not use _() in the output if there are no translatable tokens.
* src/symtab.h, src/symtab.c (symbol): Add a 'translatable' member.
* src/parse-gram.y (TSTRING): New token.
(string_as_id.opt): Replace with...
(alias): this.
Use it.
* src/scan-gram.l (SC_ESCAPED_TSTRING): New start condition, to match
TSTRINGs.
* src/output.c (prepare_symbols): Define b4_translatable if there are
translatable strings.
* data/skeletons/glr.c, data/skeletons/lalr1.cc,
* data/skeletons/yacc.c (yytnamerr): Receive b4_translatable, and use it.
We have too many global variables; adding structure would help. For a
start, let's hide some of the variables closer to their usage.
* src/getargs.c, src/files.h (current_file): Move to...
* src/scan-gram.c: here.
* src/scan-gram.h (gram_in, gram__flex_debug): Remove, make them
private to the scanner.
* src/reader.h, src/reader.c (reader): Take a grammar file as argument.
Move the handling of scanner variables to...
* src/scan-gram.l (gram_scanner_open, gram_scanner_close): here.
(gram_scanner_initialize): Remove, replaced by gram_scanner_open.
* src/main.c: Adjust.
* src/scan-gram.l: Include errno.h, for errno.
(scan_integer, handle_syncline): Check for integer overflow.
* tests/input.at (too-large.y): Adjust to match new diagnostics.
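A minimal sketch of the kind of overflow check meant here, assuming a
conversion based on strtol and errno. The name scan_int_sketch and its
interface are hypothetical; this is not the code of scan_integer.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: convert TEXT, written in BASE, to an int.
   Returns false, leaving *RESULT untouched, when TEXT is not entirely
   a number or when its value does not fit in an int.  */
static bool
scan_int_sketch (const char *text, int base, int *result)
{
  assert (text != NULL && result != NULL);
  char *end;
  errno = 0;
  long num = strtol (text, &end, base);
  if (end == text || *end != '\0')
    return false;               /* No digits, or trailing garbage.  */
  if (errno == ERANGE || num < INT_MIN || INT_MAX < num)
    return false;               /* Overflow of long, or of int.  */
  *result = (int) num;
  return true;
}
```

errno must be reset before the call, since strtol only sets it on
failure and returns LONG_MAX or LONG_MIN in that case.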
This patch contains more fixes to prefer signed to unsigned
integer types, as modern tools like 'gcc -fsanitize=undefined'
can check for signed integer overflow but not unsigned overflow.
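To illustrate the signed style, here is a minimal sketch of a signed
size limit, the minimum of SIZE_MAX and PTRDIFF_MAX as this entry
describes for YYSIZEMAX, together with a capacity check written
against it. SIZEMAX_SKETCH and grow_capacity_sketch are hypothetical
names, not the skeleton's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Largest size representable both as a size_t and as a ptrdiff_t,
   expressed as a signed value.  */
#define SIZEMAX_SKETCH \
  ((ptrdiff_t) (PTRDIFF_MAX < SIZE_MAX ? PTRDIFF_MAX : SIZE_MAX))

/* Hypothetical helper: double *CAPACITY.  Returns false and leaves
   *CAPACITY untouched if the result would exceed SIZEMAX_SKETCH.
   The test is done before the multiplication, so no signed overflow
   can occur.  */
static bool
grow_capacity_sketch (ptrdiff_t *capacity)
{
  assert (capacity != NULL && 0 < *capacity);
  if (SIZEMAX_SKETCH / 2 < *capacity)
    return false;
  *capacity *= 2;
  return true;
}
```

Because the arithmetic is signed, a mistake in such a check is caught
by 'gcc -fsanitize=undefined' instead of silently wrapping around.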
* NEWS: Document the API change.
* bootstrap.conf (gnulib_modules): Add intprops.
* data/skeletons/glr.c: Include stddef.h and stdint.h,
since this skeleton can assume C99 or later.
(YYSIZEMAX): Now signed, and the minimum of SIZE_MAX and PTRDIFF_MAX.
(yybool) [!__cplusplus]: Now signed (which is how bool behaves).
(YYTRANSLATE): Avoid use of unsigned, and make the macro
safe even for values greater than UINT_MAX.
(yytnamerr, struct yyGLRState, struct yyGLRStateSet, struct yyGLRStack)
(yyaddDeferredAction, yyinitStateSet, yyinitGLRStack)
(yyexpandGLRStack, yymarkStackDeleted, yyremoveDeletes)
(yyglrShift, yyglrShiftDefer, yy_reduce_print, yydoAction)
(yyglrReduce, yysplitStack, yyreportTree, yycompressStack)
(yyprocessOneStack, yyreportSyntaxError, yyrecoverSyntaxError)
(yyparse, yy_yypstack, yypstack, yypdumpstack):
* tests/input.at (Torturing the Scanner):
Prefer ptrdiff_t to size_t.
* data/skeletons/c++.m4 (b4_yytranslate_define):
* src/AnnotationList.c (AnnotationList__computePredecessorAnnotations):
* src/AnnotationList.h (AnnotationIndex):
* src/InadequacyList.h (InadequacyListNodeCount):
* src/closure.c (closure_new):
* src/complain.c (error_message, complains, complain_indent)
(complain_args, duplicate_directive, duplicate_rule_directive):
* src/gram.c (nritems, ritem_print, grammar_dump):
* src/ielr.c (ielr_compute_ritem_sees_lookahead_set)
(ielr_item_has_lookahead, ielr_compute_annotation_lists)
(ielr_compute_lookaheads):
* src/location.c (columns, boundary_print, location_print):
* src/muscle-tab.c (muscle_percent_define_insert)
(muscle_percent_define_check_values):
* src/output.c (prepare_rules, prepare_actions):
* src/parse-gram.y (id, handle_require):
* src/reader.c (record_merge_function_type, packgram):
* src/reduce.c (nuseless_productions, nuseless_nonterminals)
(inaccessable_symbols):
* src/relation.c (relation_print):
* src/scan-code.l (variant, variant_table_size, variant_count)
(variant_add, get_at_spec, show_sub_message, show_sub_messages)
(parse_ref):
* src/scan-gram.l (<SC_ESCAPED_STRING,SC_ESCAPED_CHARACTER>)
(scan_integer, convert_ucn_to_byte, handle_syncline):
* src/scan-skel.l (at_complain):
* src/symtab.c (complain_symbol_redeclared)
(complain_semantic_type_redeclared, complain_class_redeclared)
(symbol_class_set, complain_user_token_number_redeclared):
* src/tables.c (conflict_tos, conflrow, conflict_table)
(conflict_list, save_row, pack_vector):
* tests/local.at (AT_YYLEX_DEFINE(c)):
Prefer signed to unsigned integer.
* data/skeletons/lalr1.cc (yy_lac_check_):
* tests/actions.at (_AT_CHECK_PRINTER_AND_DESTRUCTOR):
* tests/local.at (AT_YYLEX_DEFINE(c)):
Omit now-unnecessary casts.
* data/skeletons/location.cc (b4_location_define):
* doc/bison.texi (Mfcalc Lexer, C++ position, C++ location):
Prefer int to unsigned for line and column numbers.
Change example to abort explicitly on memory exhaustion,
and fix an off-by-one bug that led to undefined behavior.
* data/skeletons/stack.hh (stack::operator[]):
Also allow ptrdiff_t indexes.
(stack::pop, slice::slice, slice::operator[]):
Index arg is now ptrdiff_t, not int.
(stack::ssize): New method.
(slice::range_): Now ptrdiff_t, not int.
* data/skeletons/yacc.c (b4_state_num_type): Remove.
All uses replaced by b4_int_type.
(YY_CONVERT_INT_BEGIN, YY_CONVERT_INT_END): New macros.
(yylac, yyparse): Use them around conversions that -Wconversion
would give false alarms about. Omit unnecessary casts.
(yy_stack_print): Use int rather than unsigned, and omit
a cast that doesn’t seem to be needed here any more.
* examples/c++/variant.yy (yylex):
* examples/c++/variant-11.yy (yylex):
Omit no-longer-needed conversions to unsigned.
* src/InadequacyList.c (InadequacyList__new_conflict):
Don’t assume *node_count is unsigned.
* src/output.c (muscle_insert_unsigned_table):
Remove; no longer used.
We used to treat lone CRs (\r, aka ^M) as regular NLs (\n), probably
to please Classic MacOS. As of today, it makes more sense to treat \r
like a plain white space character.
https://lists.gnu.org/archive/html/bison-patches/2019-09/msg00027.html
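A minimal sketch of the new convention, assuming that an end-of-line is
"\n" or "\r\n" and that a lone "\r" advances the column like any other
character (the exact column accounting is an assumption). All names are
hypothetical; this is not the scanner's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified position: 1-based line and column.  */
struct pos_sketch
{
  int line;
  int column;
};

/* Length of the end-of-line sequence starting at P: 1 for "\n",
   2 for "\r\n", 0 otherwise (including for a lone "\r").  */
static int
eol_length_sketch (const char *p)
{
  if (p[0] == '\n')
    return 1;
  if (p[0] == '\r' && p[1] == '\n')
    return 2;
  return 0;
}

/* Advance POS over TEXT.  Only end-of-line sequences start a new
   line; a lone "\r" is counted as one column, like a space.  */
static struct pos_sketch
advance_sketch (struct pos_sketch pos, const char *text)
{
  assert (text != NULL);
  const char *p = text;
  while (*p)
    {
      int len = eol_length_sketch (p);
      if (len)
        {
          pos.line += 1;
          pos.column = 1;
          p += len;
        }
      else
        {
          pos.column += 1;
          p += 1;
        }
    }
  return pos;
}
```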
* src/scan-gram.l (no_cr_read): Remove. Instead, use...
(eol): this new abbreviation denoting end-of-line.
* src/location.c (caret_getc): New.
(location_caret): Use it.
* tests/diagnostics.at (Carriage return): Adjust expectations.
(CR NL): New.
The name fixed-output-files is pretty clear: generate y.tab.c, as Yacc
does. So let's detach this from %yacc which does more: it requires
POSIX Yacc behavior.
This directive has been obsolete since December 29th 2001
(8c9a50bee1). It does not appear in the documentation. I don't want
to spend more time on improving its diagnostics; as far as I'm
concerned, it could just as well be removed.
* src/scan-gram.l, src/parse-gram.y (%fixed-output-files): Detach from
%yacc.
Some members are called foo_location, others are foo_loc. Stick to
the latter.
* src/gram.h, src/location.h, src/location.c, src/output.c,
* src/parse-gram.y, src/reader.h, src/reader.c, src/reduce.c,
* src/scan-gram.l, src/symlist.h, src/symlist.c, src/symtab.h,
* src/symtab.c:
Use _loc consistently, not _location.
The "identifier and colon" of a rule is implemented as a single token
whose location is only that of the identifier (so that messages
about the lhs of a rule are accurate). When reducing empty rules, the
default location is the single point location on the end of the
previous symbol. As a consequence, when Bison parses a grammar, the
location of the right-hand side of an empty rule is based on the
lhs, *independently of the position of the colon*. And the colon can
be way farther, separated by comments, white spaces, including empty
lines.
As a result, some messages look really bad. For instance:
$ cat foo.y
%%
foo : /* empty */
bar
: /* empty */
gives
$ bison -Wall foo.y
foo.y:2.4: warning: empty rule without %empty [-Wempty-rule]
2 | foo : /* empty */
| ^
foo.y:3.4: warning: empty rule without %empty [-Wempty-rule]
3 | bar
| ^
The carets are not at the right column, not even the right line.
This commit passes the colon "again" after the "id colon" token, which
gives more accurate locations for these messages:
$ bison -Wall foo.y
foo.y:2.10: warning: empty rule without %empty [-Wempty-rule]
2 | foo : /* empty */
| ^
foo.y:4.2: warning: empty rule without %empty [-Wempty-rule]
4 | : /* empty */
| ^
* src/scan-gram.l (SC_AFTER_IDENTIFIER): Rollback the colon, so that
we scan it again afterwards.
(INITIAL): Scan colons.
* src/parse-gram.y (COLON): New.
(rules): Parse the colon after the rule's id_colon (and possible
named reference).
* tests/actions.at, tests/conflicts.at, tests/diagnostics.at,
* tests/existing.at: Adjust.
This is a pity: efforts were invested in correctly computing the
number of screen columns consumed by multibyte characters, but the
routines that do that were fed single-byte inputs...
As a consequence, Bison never displayed locations correctly when there
were multibyte characters.
* src/scan-gram.l (mbchar): New.
Use it instead of . in the catch-all clause.
* tests/diagnostics.at (Tabulations): Enhance into...
(Tabulations and multibyte characters): this.
Single point locations (equal boundaries) are troublesome, and we were
incorrectly ending the style in their case, which resulted in an abort
in libtextstyle.
There is also a confusion between columns as displayed on the
screen (which take into account multibyte characters and tabulations),
and the number of bytes. Counting the screen column
incrementally (character by character) is awkward (because of multibyte
characters), and I don't want to maintain a buffer of the current line
when displaying the diagnostic. So I believe the simplest solution is
to track the byte number in addition to the screen column.
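A minimal sketch of a boundary that tracks both counters. It assumes
valid UTF-8 input, one screen column per code point and tab stops every
8 columns; real code would use proper width functions. The names
boundary_sketch and boundary_advance_sketch are hypothetical, and the
file name member is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified boundary.  */
struct boundary_sketch
{
  int line;                     /* 1-based line number.  */
  int column;                   /* 1-based screen column.  */
  int byte;                     /* 1-based byte offset in the line.  */
};

/* Advance B over TEXT, updating the screen column and the byte
   offset independently.  */
static struct boundary_sketch
boundary_advance_sketch (struct boundary_sketch b, const char *text)
{
  assert (text != NULL);
  for (const unsigned char *p = (const unsigned char *) text; *p; ++p)
    {
      if (*p == '\n')
        {
          b.line += 1;
          b.column = 1;
          b.byte = 1;
          continue;
        }
      b.byte += 1;
      if (*p == '\t')
        /* Jump to the next tab stop (columns 9, 17, ...).  */
        b.column += 8 - (b.column - 1) % 8;
      else if ((*p & 0xC0) != 0x80)
        /* Count one column per code point: UTF-8 continuation
           bytes (10xxxxxx) do not advance the column.  */
        b.column += 1;
    }
  return b;
}
```

The byte member can then be used to find a position in the source line
when quoting it, while the column member is what is reported to the
user.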
* src/location.h, src/location.c (boundary): Add the byte-column.
Adjust dependencies.
* src/getargs.c, src/scan-gram.l: Adjust.
* tests/diagnostics.at: Check zero-width locations.
We should use -ffixit and --update to clean files with duplicate
directives. And we should complain only once about duplicate obsolete
directives: keep only the "duplicate" warning. Let's start with %yacc.
For instance on:
%fixed-output_files
%fixed-output-files
%yacc
%%
exp:
This run of bison:
$ bison /tmp/foo.y -u
foo.y:1.1-19: warning: deprecated directive, use '%fixed-output-files' [-Wdeprecated]
%fixed-output_files
^~~~~~~~~~~~~~~~~~~
foo.y:2.1-19: warning: duplicate directive [-Wother]
%fixed-output-files
^~~~~~~~~~~~~~~~~~~
foo.y:1.1-19: previous declaration
%fixed-output_files
^~~~~~~~~~~~~~~~~~~
foo.y:3.1-5: warning: duplicate directive [-Wother]
%yacc
^~~~~
foo.y:1.1-19: previous declaration
%fixed-output_files
^~~~~~~~~~~~~~~~~~~
bison: file 'foo.y' was updated (backup: 'foo.y~')
gives:
%fixed-output-files
%%
exp:
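A minimal sketch of why a location is more useful than a Boolean here:
the first occurrence is remembered, so that later occurrences can point
at the previous declaration. The location is reduced to a line number,
and all names are hypothetical; this is not the code of do_yacc.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified location: a line number, where 0 plays
   the role of the empty location.  */
typedef struct
{
  int line;
} loc_sketch;

static bool
loc_empty_sketch (loc_sketch loc)
{
  return loc.line == 0;
}

/* Where the directive was first seen; empty until then.  */
static loc_sketch yacc_seen_sketch;

/* Record a use of the directive at LOC.  Returns the line of the
   previous declaration if this use is a duplicate, 0 otherwise.  */
static int
do_yacc_sketch (loc_sketch loc)
{
  assert (!loc_empty_sketch (loc));
  if (!loc_empty_sketch (yacc_seen_sketch))
    return yacc_seen_sketch.line;       /* Duplicate.  */
  yacc_seen_sketch = loc;
  return 0;
}
```

On the example above, the uses on lines 2 and 3 are both reported as
duplicates of the declaration on line 1.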
* src/location.h, src/location.c (location_empty): New.
* src/complain.h, src/complain.c (duplicate_directive): New.
* src/getargs.h, src/getargs.c (yacc_flag): Instead of a Boolean, be
the location of the definition.
Update dependencies.
* src/scan-gram.l (%yacc, %fixed-output-files): Move the handling of
its warnings to...
* src/parse-gram.y (do_yacc): This new function.
* tests/input.at (Deprecated Directives): Adjust expectations.
Avoid duplicate warnings about %error-verbose, once for deprecation,
another for duplicate. Keep only the duplicate warning for the second
occurrence of %error-verbose.
This will help with removal fixits.
* src/scan-gram.l (%error-verbose): Return as a PERCENT_ERROR_VERBOSE
token.
* src/parse-gram.y (do_error_verbose): New.
Use it.
* src/muscle-tab.c (muscle_percent_variable_update): Handle pseudo
variables such as %error-verbose.
Currently the diagnostics for %name-prefix are not precise enough. In
particular, they do not show that braces must be used instead of
quotes.
Before:
foo.y:3.1-14: warning: deprecated directive, use '%define api.prefix' [-Wdeprecated]
%name-prefix = "foo"
^^^^^^^^^^^^^^
After:
foo.y:3.1-20: warning: deprecated directive, use '%define api.prefix {foo}' [-Wdeprecated]
%name-prefix = "foo"
^^^^^^^^^^^^^^^^^^^^
To do this we need the value passed to %name-prefix, so move the
warning from the scanner to the parser.
Accuracy will be very important for the forthcoming changes.
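A minimal sketch of building the suggested replacement once the value
is known, which is what moving the warning to the parser makes
possible. The name suggest_api_prefix_sketch is hypothetical; this is
not the code of do_name_prefix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format into BUF, of SIZE bytes, the directive
   suggested in place of '%name-prefix "PREFIX"'.  Returns the length
   of the suggestion, or -1 if BUF is too small.  */
static int
suggest_api_prefix_sketch (char *buf, size_t size, const char *prefix)
{
  assert (buf != NULL && prefix != NULL);
  int n = snprintf (buf, size, "%%define api.prefix {%s}", prefix);
  return 0 <= n && (size_t) n < size ? n : -1;
}
```

For the prefix "foo" this yields the text shown in the "After"
diagnostic, with braces rather than quotes.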
* src/parse-gram.y (do_name_prefix): New.
(PERCENT_NAME_PREFIX): Have a semantic value: the raw source, with
possibly underscores, equal sign, and spaces. This is used to provide
a more accurate message. It does not take comments into account,
but...
* src/scan-gram.l (%name-prefix): Delegate the warnings to the parser.
* tests/headers.at, tests/input.at: Adjust expectations.
After having spent quite some time on cleaning the handling of symbol
declarations in the grammar files, I believe we should keep %nterm.
It looks like it's a duplicate of %type, but it is not. While POSIX
Yacc requires %type to apply only to nonterminal symbols, it appears
that both byacc and bison accept it for tokens too. And some
experienced users do actually expect this feature to group
symbols (terminal or not) by type ("On the other hand, it is generally
more useful IMHO to group terminals and non-terminals with the same
type tag together",
http://lists.gnu.org/archive/html/bug-bison/2018-10/msg00000.html).
Even Bison's own parser does this today (see CHAR).
Basically reverts 7928c3e6fb.
* src/scan-gram.l (%nterm): Dedeprecate, but issue a Wyacc warning.
* tests/input.at: Adjust expectations.
(Yacc warnings on symbols): New.
* src/symtab.c (symbol_class_set): Fix error introduced in
20b0746793.
It is unfortunate that %error_verbose was properly diagnosed as
obsoleted by "%define parse.error verbose", but %error-verbose was
not.
* src/parse-gram.y (%error-verbose): Remove support.
* src/scan-gram.l: Do it here instead, with a warning.
* tests/input.at (Deprecated directives): Check it.
* src/parse-gram.y (api.value.type): Set to union.
Replace occurrences of %union with explicit %types.
* src/scan-gram.l: Adjust yylval's field names.
(RETURN_VALUE): No longer needs the Field argument.
Use it more.