.\" $OpenBSD: flex.1,v 1.47 2025/05/22 07:31:18 bentley Exp $
.\"
.\" Copyright (c) 1990 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Vern Paxson.
.\"
.\" The United States Government has rights in this work pursuant
.\" to contract no. DE-AC03-76SF00098 between the United States
.\" Department of Energy and the University of California.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\"
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
.\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE.
.\"
.Dd $Mdocdate: May 22 2025 $
.Dt FLEX 1
.Os
.Sh NAME
.Nm flex ,
.Nm flex++ ,
.Nm lex
.Nd fast lexical analyzer generator
.Sh SYNOPSIS
.Nm
.Bk -words
.Op Fl 78BbdFfhIiLlnpsTtVvw+?
.Op Fl C Ns Op Cm aeFfmr
.Op Fl Fl help
.Op Fl Fl version
.Op Fl o Ns Ar output
.Op Fl P Ns Ar prefix
.Op Fl S Ns Ar skeleton
.Op Ar
.Ek
.Sh DESCRIPTION
.Nm
is a tool for generating
.Em scanners :
programs which recognize lexical patterns in text.
.Nm
reads the given input files, or its standard input if no file names are given,
for a description of a scanner to generate.
The description is in the form of pairs of regular expressions and C code,
called
.Em rules .
.Nm
generates as output a C source file,
.Pa lex.yy.c ,
which defines a routine
.Fn yylex .
This file is compiled and linked with the
.Fl lfl
library to produce an executable.
When the executable is run, it analyzes its input for occurrences
of the regular expressions.
Whenever it finds one, it executes the corresponding C code.
.Pp
.Nm lex
is a synonym for
.Nm flex .
.Nm flex++
is a synonym for
.Nm
.Fl + .
.Pp
The manual includes both tutorial and reference sections:
.Bl -ohang
.It Sy Some Simple Examples
.It Sy Format of the Input File
.It Sy Patterns
The extended regular expressions used by
.Nm .
.It Sy How the Input is Matched
The rules for determining what has been matched.
.It Sy Actions
How to specify what to do when a pattern is matched.
.It Sy The Generated Scanner
Details regarding the scanner that
.Nm
produces;
how to control the input source.
.It Sy Start Conditions
Introducing context into scanners, and managing
.Qq mini-scanners .
.It Sy Multiple Input Buffers
How to manipulate multiple input sources;
how to scan from strings instead of files.
.It Sy End-of-File Rules
Special rules for matching the end of the input.
.It Sy Miscellaneous Macros
A summary of macros available to the actions.
.It Sy Values Available to the User
A summary of values available to the actions.
.It Sy Interfacing with Yacc
Connecting flex scanners together with
.Xr yacc 1
parsers.
.It Sy Options
.Nm
command-line options, and the
.Dq %option
directive.
.It Sy Performance Considerations
How to make scanners go as fast as possible.
.It Sy Generating C++ Scanners
The
.Pq experimental
facility for generating C++ scanner classes.
.It Sy Incompatibilities with Lex and POSIX
How
.Nm
differs from
.At
.Nm lex
and the
.Tn POSIX
.Nm lex
standard.
.It Sy Files
Files used by
.Nm .
.It Sy Diagnostics
Those error messages produced by
.Nm
.Pq or scanners it generates
whose meanings might not be apparent.
.It Sy See Also
Other documentation, related tools.
.It Sy Authors
Includes contact information.
.It Sy Bugs
Known problems with
.Nm .
.El
.Sh SOME SIMPLE EXAMPLES
First some simple examples to get the flavor of how one uses
.Nm .
The following
.Nm
input specifies a scanner which, whenever it encounters the string
.Qq username ,
replaces it with the user's login name:
.Bd -literal -offset indent
%%
username    printf("%s", getlogin());
.Ed
.Pp
By default, any text not matched by a
.Nm
scanner is copied to the output, so the net effect of this scanner is
to copy its input file to its output with each occurrence of
.Qq username
expanded.
In this input, there is just one rule.
.Qq username
is the
.Em pattern
and the
.Qq printf
is the
.Em action .
The
.Qq %%
marks the beginning of the rules.
.Pp
Here's another simple example:
.Bd -literal -offset indent
%{
int num_lines = 0, num_chars = 0;
%}

%%
\en      ++num_lines; ++num_chars;
\&.       ++num_chars;

%%
int
main(void)
{
    yylex();
    printf("# of lines = %d, # of chars = %d\en",
        num_lines, num_chars);
    return 0;
}
.Ed
.Pp
This scanner counts the number of characters and the number
of lines in its input
(it produces no output other than the final report on the counts).
The first line declares two globals,
.Qq num_lines
and
.Qq num_chars ,
which are accessible both inside
.Fn yylex
and in the
.Fn main
routine declared after the second
.Qq %% .
There are two rules, one which matches a newline
.Pq \&"\en\&"
and increments both the line count and the character count,
and one which matches any character other than a newline
(indicated by the
.Qq \&.
regular expression).
.Pp
A somewhat more complicated example:
.Bd -literal -offset indent
/* scanner for a toy Pascal-like language */

DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%

{DIGIT}+ {
    printf("An integer: %s\en", yytext);
}

{DIGIT}+"."{DIGIT}* {
    printf("A float: %s\en", yytext);
}

if|then|begin|end|procedure|function {
    printf("A keyword: %s\en", yytext);
}

{ID} printf("An identifier: %s\en", yytext);

"+"|"-"|"*"|"/" printf("An operator: %s\en", yytext);

"{"[^}\en]*"}" /* eat up one-line comments */

[ \et\en]+ /* eat up whitespace */

\&. printf("Unrecognized character: %s\en", yytext);

%%

int
main(int argc, char *argv[])
{
    ++argv; --argc; /* skip over program name */
    if (argc > 0)
        yyin = fopen(argv[0], "r");
    else
        yyin = stdin;

    yylex();
}
.Ed
.Pp
This is the beginning of a simple scanner for a language like Pascal.
It identifies different types of
.Em tokens
and reports on what it has seen.
.Pp
The details of this example will be explained in the following sections.
.Sh FORMAT OF THE INPUT FILE
The
.Nm
input file consists of three sections, separated by a line with just
.Qq %%
in it:
.Bd -unfilled -offset indent
definitions
%%
rules
%%
user code
.Ed
.Pp
The
.Em definitions
section contains declarations of simple
.Em name
definitions to simplify the scanner specification, and declarations of
.Em start conditions ,
which are explained in a later section.
.Pp
Name definitions have the form:
.Pp
.D1 name definition
.Pp
The
.Qq name
is a word beginning with a letter or an underscore
.Pq Sq _
followed by zero or more letters, digits,
.Sq _ ,
or
.Sq -
.Pq dash .
The definition is taken to begin at the first non-whitespace character
following the name and continuing to the end of the line.
The definition can subsequently be referred to using
.Qq {name} ,
which will expand to
.Qq (definition) .
For example:
.Bd -literal -offset indent
DIGIT    [0-9]
ID       [a-z][a-z0-9]*
.Ed
.Pp
This defines
.Qq DIGIT
to be a regular expression which matches a single digit, and
.Qq ID
to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.
A subsequent reference to
.Pp
.Dl {DIGIT}+"."{DIGIT}*
.Pp
is identical to
.Pp
.Dl ([0-9])+"."([0-9])*
.Pp
and matches one-or-more digits followed by a
.Sq .\&
followed by zero-or-more digits.
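.Pp
The substitution just described can be pictured with a small sketch in plain C.
It is not how
.Nm
itself is implemented; the helper name
.Fn expand_names
and the fixed two-entry table are hypothetical and exist only for illustration.
Each
.Qq {name}
reference found in the table is replaced by its definition wrapped in
parentheses:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical definition table, mirroring the DIGIT and ID example. */
struct def { const char *name; const char *text; };
static const struct def defs[] = {
    { "DIGIT", "[0-9]" },
    { "ID",    "[a-z][a-z0-9]*" },
};

/*
 * Copy pattern into out, replacing every {NAME} that appears in the
 * table with (definition).  Unknown names and unmatched braces are
 * copied through unchanged.  Returns 0 on success, -1 if out is too small.
 */
static int
expand_names(const char *pattern, char *out, size_t outsz)
{
    size_t used = 0;

    if (outsz == 0)
        return -1;
    while (*pattern != '\0') {
        const char *close = NULL;
        const char *repl = NULL;
        size_t i, n;

        if (*pattern == '{' && (close = strchr(pattern, '}')) != NULL) {
            n = (size_t)(close - pattern - 1);
            for (i = 0; i < sizeof(defs) / sizeof(defs[0]); i++) {
                if (strlen(defs[i].name) == n &&
                    strncmp(defs[i].name, pattern + 1, n) == 0)
                    repl = defs[i].text;
            }
        }
        if (repl != NULL) {
            int w = snprintf(out + used, outsz - used, "(%s)", repl);

            if (w < 0 || (size_t)w >= outsz - used)
                return -1;
            used += (size_t)w;
            pattern = close + 1;
        } else {
            if (used + 1 >= outsz)
                return -1;
            out[used++] = *pattern++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

The added parentheses are what keep a trailing operator such as
.Sq +
applied to the whole definition rather than to its last element.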
.Pp
The
.Em rules
section of the
.Nm
input contains a series of rules of the form:
.Pp
.Dl pattern action
.Pp
The pattern must be unindented and the action must begin
on the same line.
.Pp
See below for a further description of patterns and actions.
.Pp
Finally, the user code section is simply copied to
.Pa lex.yy.c
verbatim.
It is used for companion routines which call or are called by the scanner.
The presence of this section is optional;
if it is missing, the second
.Qq %%
in the input file may be skipped too.
.Pp
In the definitions and rules sections, any indented text or text enclosed in
.Sq %{
and
.Sq %}
is copied verbatim to the output
.Pq with the %{}'s removed .
The %{}'s must appear unindented on lines by themselves.
.Pp
In the rules section,
any indented or %{} text appearing before the first rule may be used to
declare variables which are local to the scanning routine and
.Pq after the declarations
code which is to be executed whenever the scanning routine is entered.
Other indented or %{} text in the rule section is still copied to the output,
but its meaning is not well-defined and it may well cause compile-time
errors (this feature is present for
.Tn POSIX
compliance; see below for other such features).
.Pp
In the definitions section
.Pq but not in the rules section ,
an unindented comment
(i.e., a line beginning with
.Qq /* )
is also copied verbatim to the output up to the next
.Qq */ .
.Sh PATTERNS
The patterns in the input are written using an extended set of regular
expressions.
These are:
.Bl -tag -width "XXXXXXXX"
.It x
Match the character
.Sq x .
.It .\&
Any character
.Pq byte
except newline.
.It [xyz]
A
.Qq character class ;
in this case, the pattern matches either an
.Sq x ,
a
.Sq y ,
or a
.Sq z .
.It [abj-oZ]
A
.Qq character class
with a range in it; matches an
.Sq a ,
a
.Sq b ,
any letter from
.Sq j
through
.Sq o ,
or a
.Sq Z .
.It [^A-Z]
A
.Qq negated character class ,
i.e., any character but those in the class.
In this case, any character EXCEPT an uppercase letter.
.It [^A-Z\en]
Any character EXCEPT an uppercase letter or a newline.
.It r*
Zero or more r's, where
.Sq r
is any regular expression.
.It r+
One or more r's.
.It r?
Zero or one r's (that is,
.Qq an optional r ) .
.It r{2,5}
Anywhere from two to five r's.
.It r{2,}
Two or more r's.
.It r{4}
Exactly 4 r's.
.It {name}
The expansion of the
.Qq name
definition
.Pq see above .
.It \&"[xyz]\e\&"foo\&"
The literal string: [xyz]"foo.
.It \eX
If
.Sq X
is an
.Sq a ,
.Sq b ,
.Sq f ,
.Sq n ,
.Sq r ,
.Sq t ,
or
.Sq v ,
then the ANSI-C interpretation of
.Sq \eX .
Otherwise, a literal
.Sq X
(used to escape operators such as
.Sq * ) .
.It \e0
A NUL character
.Pq ASCII code 0 .
.It \e123
The character with octal value 123.
.It \ex2a
The character with hexadecimal value 2a.
.It (r)
Match an
.Sq r ;
parentheses are used to override precedence
.Pq see below .
.It rs
The regular expression
.Sq r
followed by the regular expression
.Sq s ;
called
.Qq concatenation .
.It r|s
Either an
.Sq r
or an
.Sq s .
.It r/s
An
.Sq r ,
but only if it is followed by an
.Sq s .
The text matched by
.Sq s
is included when determining whether this rule is the
.Qq longest match ,
but is then returned to the input before the action is executed.
So the action only sees the text matched by
.Sq r .
This type of pattern is called
.Qq trailing context .
(There are some combinations of r/s that
.Nm
cannot match correctly; see notes in the
.Sx BUGS
section below regarding
.Qq dangerous trailing context . )
.It ^r
An
.Sq r ,
but only at the beginning of a line
(i.e., just starting to scan, or right after a newline has been scanned).
.It r$
An
.Sq r ,
but only at the end of a line
.Pq i.e., just before a newline .
Equivalent to
.Qq r/\en .
.Pp
Note that
.Nm flex Ns 's
notion of
.Qq newline
is exactly whatever the C compiler used to compile
.Nm
interprets
.Sq \en
as.
.\" In particular, on some DOS systems you must either filter out \er's in the
.\" input yourself, or explicitly use r/\er\en for
.\" .Qq r$ .
.It <s>r
An
.Sq r ,
but only in start condition
.Sq s
.Pq see below for discussion of start conditions .
.It <s1,s2,s3>r
The same, but in any of start conditions s1, s2, or s3.
.It <*>r
An
.Sq r
in any start condition, even an exclusive one.
.It <<EOF>>
An end-of-file.
.It <s1,s2><<EOF>>
An end-of-file when in start condition s1 or s2.
.El
.Pp
Note that inside of a character class, all regular expression operators
lose their special meaning except escape
.Pq Sq \e
and the character class operators,
.Sq - ,
.Sq ]\& ,
and, at the beginning of the class,
.Sq ^ .
.Pp
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence.
For example,
.Pp
.D1 foo|bar*
.Pp
is the same as
.Pp
.D1 (foo)|(ba(r*))
.Pp
since the
.Sq *
operator has higher precedence than concatenation,
and concatenation higher than alternation
.Pq Sq |\& .
This pattern therefore matches
.Em either
the string
.Qq foo
.Em or
the string
.Qq ba
followed by zero-or-more r's.
To match
.Qq foo
or zero-or-more "bar"'s,
use:
.Pp
.D1 foo|(bar)*
.Pp
and to match zero-or-more "foo"'s-or-"bar"'s:
.Pp
.D1 (foo|bar)*
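.Pp
The three groupings above can be checked outside of
.Nm
with POSIX extended regular expressions, which rank
.Sq * ,
concatenation, and
.Sq |\&
in the same order.
The following is only a sketch using
.Xr regcomp 3 ;
it does not exercise
.Nm
itself, whose pattern language differs in other respects, and the helper name
.Fn matches_whole
is hypothetical.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Return 1 if the whole of text matches the POSIX extended regular
 * expression ere, 0 if it does not, and -1 if ere fails to compile.
 * The pattern is wrapped as ^(ere)$ so that only full matches count.
 */
static int
matches_whole(const char *ere, const char *text)
{
    char anchored[256];
    regex_t re;
    int rc;

    if (snprintf(anchored, sizeof(anchored), "^(%s)$", ere) >=
        (int)sizeof(anchored))
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper,
.Qq foo|bar*
accepts
.Qq barrr
but not
.Qq barbar ,
.Qq foo|(bar)*
accepts
.Qq barbar
but not
.Qq foobar ,
and
.Qq (foo|bar)*
accepts
.Qq foobar .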
.Pp
In addition to characters and ranges of characters, character classes
can also contain character class
.Em expressions .
These are expressions enclosed inside
.Sq [:
and
.Sq :]
delimiters (which themselves must appear between the
.Sq \&[
and
.Sq ]\&
of the
character class; other elements may occur inside the character class, too).
The valid expressions are:
.Bd -unfilled -offset indent
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
.Ed
.Pp
These expressions all designate a set of characters equivalent to
the corresponding standard C
.Fn isXXX
function.
For example, [:alnum:] designates those characters for which
.Xr isalnum 3
returns true \- i.e., any alphabetic or numeric character.
Some systems don't provide
.Xr isblank 3 ,
so
.Nm
defines [:blank:] as a blank or a tab.
.Pp
For example, the following character classes are all equivalent:
.Bd -unfilled -offset indent
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:]0-9]
[a-zA-Z0-9]
.Ed
.Pp
If the scanner is case-insensitive (the
.Fl i
flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
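.Pp
Since the class expressions are defined in terms of the
.Fn isXXX
functions, the equivalence of the four classes listed above can be
checked directly in C.
This is a sketch that assumes the default
.Qq C
locale and an ASCII character set; the function name
.Fn classes_agree
is hypothetical.

```c
#include <ctype.h>

/*
 * Return 1 if, for every 7-bit ASCII code, the four equivalent classes
 * agree on membership; return 0 at the first disagreement.
 * Assumes the default "C" locale and an ASCII execution character set.
 */
static int
classes_agree(void)
{
    int c;

    for (c = 0; c < 128; c++) {
        /* [[:alnum:]] */
        int alnum = isalnum(c) != 0;
        /* [[:alpha:][:digit:]] */
        int alpha_digit = isalpha(c) || isdigit(c);
        /* [[:alpha:]0-9] */
        int alpha_range = isalpha(c) || (c >= '0' && c <= '9');
        /* [a-zA-Z0-9] */
        int ranges = (c >= 'a' && c <= 'z') ||
            (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9');

        if (alnum != alpha_digit || alnum != alpha_range ||
            alnum != ranges)
            return 0;
    }
    return 1;
}
```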
.Pp
Some notes on patterns:
.Bl -dash
.It
A negated character class such as the example
.Qq [^A-Z]
above will match a newline unless "\en"
.Pq or an equivalent escape sequence
is one of the characters explicitly present in the negated character class
(e.g.,
.Qq [^A-Z\en] ) .
This is unlike how many other regular expression tools treat negated character
classes, but unfortunately the inconsistency is historically entrenched.
Matching newlines means that a pattern like
.Qq [^"]*
can match the entire input unless there's another quote in the input.
.It
A rule can have at most one instance of trailing context
(the
.Sq /
operator or the
.Sq $
operator).
The start condition,
.Sq ^ ,
and
.Qq <<EOF>>
patterns can only occur at the beginning of a pattern and, as well as with
.Sq /
and
.Sq $ ,
cannot be grouped inside parentheses.
A
.Sq ^
which does not occur at the beginning of a rule or a
.Sq $
which does not occur at the end of a rule loses its special properties
and is treated as a normal character.
.It
The following are illegal:
.Bd -unfilled -offset indent
foo/bar$
<sc1>foo<sc2>bar
.Ed
.Pp
Note that the first of these can be written
.Qq foo/bar\en .
.It
The following will result in
.Sq $
or
.Sq ^
being treated as a normal character:
.Bd -unfilled -offset indent
foo|(bar$)
foo|^bar
.Ed
.Pp
If what's wanted is a
.Qq foo
or a bar-followed-by-a-newline, the following could be used
(the special
.Sq |\&
action is explained below):
.Bd -unfilled -offset indent
foo |
bar$ /* action goes here */
.Ed
.Pp
A similar trick will work for matching a foo or a
bar-at-the-beginning-of-a-line.
.El
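.Pp
The first note above, about negated character classes matching newlines,
can be illustrated with
.Xr strcspn 3 ,
which measures the same span of text that a negated class followed by
.Sq *
would consume.
This is only a model of the matching rule, not code taken from
.Nm ;
the wrapper name
.Fn negated_span
is hypothetical.

```c
#include <string.h>

/*
 * Length of the longest prefix of s made up only of characters that are
 * NOT in excluded: the text a negated class followed by '*' would match.
 */
static size_t
negated_span(const char *s, const char *excluded)
{
    return strcspn(s, excluded);
}
```

Given the text
.Qq one ,
a newline,
.Qq two ,
and then a double quote, excluding only the quote spans both lines
(seven characters), while also excluding the newline stops after
.Qq one .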
.Sh HOW THE INPUT IS MATCHED
When the generated scanner is run,
it analyzes its input looking for strings which match any of its patterns.
If it finds more than one match,
it takes the one matching the most text
(for trailing context rules, this includes the length of the trailing part,
even though it will then be returned to the input).
If it finds two or more matches of the same length,
the rule listed first in the
.Nm
input file is chosen.
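.Pp
The selection rule above (longest match, with ties going to the earlier rule)
can be sketched in C for the simplified case where every pattern is a
literal string.
Real scanners match regular expressions using generated tables, so this is
an illustration of the rule only; the function name
.Fn choose_rule
is hypothetical.

```c
#include <string.h>

/*
 * Given an array of literal patterns in rule order, return the index of
 * the rule that would be chosen at the start of input: the one matching
 * the most text, with ties going to the earlier rule.  Returns -1 if no
 * pattern matches.  Only literal strings are handled here.
 */
static int
choose_rule(const char *const rules[], int nrules, const char *input)
{
    size_t best_len = 0;
    int best = -1;
    int i;

    for (i = 0; i < nrules; i++) {
        size_t len = strlen(rules[i]);

        if (len == 0 || strncmp(input, rules[i], len) != 0)
            continue;
        /* Strictly longer wins; an equal length keeps the earlier rule. */
        if (best == -1 || len > best_len) {
            best = i;
            best_len = len;
        }
    }
    return best;
}
```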
.Pp
Once the match is determined, the text corresponding to the match
(called the
.Em token )
is made available in the global character pointer
.Fa yytext ,
and its length in the global integer
.Fa yyleng .
The
.Em action
corresponding to the matched pattern is then executed
.Pq a more detailed description of actions follows ,
and then the remaining input is scanned for another match.
.Pp
If no match is found, then the default rule is executed:
the next character in the input is considered matched and
copied to the standard output.
Thus, the simplest legal
.Nm
input is:
.Pp
.D1 %%
.Pp
which generates a scanner that simply copies its input
.Pq one character at a time
to its output.
.Pp
Note that
.Fa yytext
can be defined in two different ways:
either as a character pointer or as a character array.
Which definition
.Nm
uses can be controlled by including one of the special directives
.Dq %pointer
or
.Dq %array
in the first
.Pq definitions
section of flex input.
The default is
.Dq %pointer ,
unless the
.Fl l
.Nm lex
compatibility option is used, in which case
.Fa yytext
will be an array.
The advantage of using
.Dq %pointer
is substantially faster scanning and no buffer overflow when matching
very large tokens
.Pq unless not enough dynamic memory is available .
The disadvantage is that actions are restricted in how they can modify
.Fa yytext
.Pq see the next section ,
and calls to the
.Fn unput
function destroy the present contents of
.Fa yytext ,
which can be a considerable porting headache when moving between different
.Nm lex
versions.
.Pp
The advantage of
.Dq %array
is that
.Fa yytext
can be modified as much as wanted, and calls to
.Fn unput
do not destroy
.Fa yytext
.Pq see below .
Furthermore, existing
.Nm lex
programs sometimes access
.Fa yytext
externally using declarations of the form:
.Pp
.D1 extern char yytext[];
.Pp
This definition is erroneous when used with
.Dq %pointer ,
but correct for
.Dq %array .
.Pp
.Dq %array
defines
.Fa yytext
to be an array of
.Dv YYLMAX
characters, which defaults to a fairly large value.
The size can be changed by simply #define'ing
.Dv YYLMAX
to a different value in the first section of
.Nm
input.
As mentioned above, with
.Dq %pointer
.Fa yytext
grows dynamically to accommodate large tokens.
While this means a
.Dq %pointer
scanner can accommodate very large tokens
.Pq such as matching entire blocks of comments ,
bear in mind that each time the scanner must resize
.Fa yytext
it also must rescan the entire token from the beginning, so matching such
tokens can prove slow.
.Fa yytext
presently does not dynamically grow if a call to
.Fn unput
results in too much text being pushed back; instead, a run-time error results.
.Pp
Also note that
.Dq %array
cannot be used with C++ scanner classes
.Pq the c++ option; see below .
.Sh ACTIONS
Each pattern in a rule has a corresponding action,
which can be any arbitrary C statement.
The pattern ends at the first non-escaped whitespace character;
the remainder of the line is its action.
If the action is empty,
then when the pattern is matched the input token is simply discarded.
For example, here is the specification for a program
which deletes all occurrences of
.Qq zap me
from its input:
.Bd -literal -offset indent
%%
"zap me"
.Ed
.Pp
(It will copy all other characters in the input to the output since
they will be matched by the default rule.)
.Pp
Here is a program which compresses multiple blanks and tabs down to
a single blank, and throws away whitespace found at the end of a line:
.Bd -literal -offset indent
%%
[ \et]+     putchar(' ');
[ \et]+$    /* ignore this token */
.Ed
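.Pp
For comparison, the effect of the two rules above can be written by hand
in C.
This is a sketch of the behaviour only, not of the code that
.Nm
generates; the function name
.Fn squeeze_blanks
is hypothetical, and it works on a string rather than on
.Em yyin .
A run of blanks and tabs that reaches the very end of the text without a
newline is kept as one blank, since
.Sq $
requires a following newline.

```c
/*
 * Rewrite in into out (which must hold at least strlen(in) + 1 bytes):
 * a run of blanks and tabs directly before a newline is dropped, and
 * any other run becomes a single blank.
 */
static void
squeeze_blanks(const char *in, char *out)
{
    while (*in != '\0') {
        if (*in == ' ' || *in == '\t') {
            /* Consume the whole run, like the [ \t]+ pattern. */
            while (*in == ' ' || *in == '\t')
                in++;
            if (*in != '\n')    /* not at end of line: keep one blank */
                *out++ = ' ';
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}
```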
.Pp
If the action contains a
.Sq { ,
then the action spans till the balancing
.Sq }
is found, and the action may cross multiple lines.
.Nm
knows about C strings and comments and won't be fooled by braces found
within them, but also allows actions to begin with
.Sq %{
and will consider the action to be all the text up to the next
.Sq %}
.Pq regardless of ordinary braces inside the action .
.Pp
An action consisting solely of a vertical bar
.Pq Sq |\&
means
.Qq same as the action for the next rule .
See below for an illustration.
.Pp
Actions can include arbitrary C code,
including return statements to return a value to whatever routine called
.Fn yylex .
Each time
.Fn yylex
is called, it continues processing tokens from where it last left off
until it either reaches the end of the file or executes a return.
.Pp
Actions are free to modify
.Fa yytext
except for lengthening it
(adding characters to its end \- these will overwrite later characters in the
input stream).
This, however, does not apply when using
.Dq %array
.Pq see above ;
in that case,
.Fa yytext
may be freely modified in any way.
.Pp
Actions are free to modify
.Fa yyleng
except they should not do so if the action also includes use of
.Fn yymore
.Pq see below .
.Pp
There are a number of special directives which can be included within
an action:
.Bl -tag -width Ds
.It ECHO
Copies
.Fa yytext
to the scanner's output.
.It BEGIN
Followed by the name of a start condition, places the scanner in the
corresponding start condition
.Pq see below .
.It REJECT
Directs the scanner to proceed on to the
.Qq second best
rule which matched the input
.Pq or a prefix of the input .
The rule is chosen as described above in
.Sx HOW THE INPUT IS MATCHED ,
and
.Fa yytext
and
.Fa yyleng
set up appropriately.
It may either be one which matched as much text
as the originally chosen rule but came later in the
.Nm
input file, or one which matched less text.
For example, the following will both count the
words in the input and call the routine
.Fn special
whenever
.Qq frob
is seen:
.Bd -literal -offset indent
    int word_count = 0;
%%

frob        special(); REJECT;
[^ \et\en]+  ++word_count;
.Ed
.Pp
Without the
.Em REJECT ,
any "frob"'s in the input would not be counted as words,
since the scanner normally executes only one action per token.
Multiple
.Em REJECT Ns 's
are allowed,
each one finding the next best choice to the currently active rule.
For example, when the following scanner scans the token
.Qq abcd ,
it will write
.Qq abcdabcaba
to the output:
.Bd -literal -offset indent
%%
a       |
ab      |
abc     |
abcd    ECHO; REJECT;
\&.|\en    /* eat up any unmatched character */
.Ed
.Pp
(The first three rules share the fourth's action since they use
the special
.Sq |\&
action.)
.Em REJECT
is a particularly expensive feature in terms of scanner performance;
if it is used in any of the scanner's actions it will slow down
all of the scanner's matching.
Furthermore,
.Em REJECT
cannot be used with the
.Fl Cf
or
.Fl CF
options
.Pq see below .
.Pp
Note also that unlike the other special actions,
.Em REJECT
is a
.Em branch ;
code immediately following it in the action will not be executed.
.It yymore()
Tells the scanner that the next time it matches a rule, the corresponding
token should be appended onto the current value of
.Fa yytext
rather than replacing it.
For example, given the input
.Qq mega-kludge
the following will write
.Qq mega-mega-kludge
to the output:
.Bd -literal -offset indent
%%
mega-    ECHO; yymore();
kludge   ECHO;
.Ed
.Pp
First
.Qq mega-
is matched and echoed to the output.
Then
.Qq kludge
is matched, but the previous
.Qq mega-
is still hanging around at the beginning of
.Fa yytext
so the
.Em ECHO
for the
.Qq kludge
rule will actually write
.Qq mega-kludge .
.Pp
Two notes regarding use of
.Fn yymore :
First,
.Fn yymore
depends on the value of
.Fa yyleng
correctly reflecting the size of the current token, so
.Fa yyleng
must not be modified when using
.Fn yymore .
Second, the presence of
.Fn yymore
in the scanner's action entails a minor performance penalty in the
scanner's matching speed.
.It yyless(n)
Returns all but the first
.Ar n
characters of the current token back to the input stream, where they
will be rescanned when the scanner looks for the next match.
.Fa yytext
and
.Fa yyleng
are adjusted appropriately (e.g.,
.Fa yyleng
will now be equal to
.Ar n ) .
For example, on the input
.Qq foobar
the following will write out
.Qq foobarbar :
.Bd -literal -offset indent
%%
foobar    ECHO; yyless(3);
[a-z]+    ECHO;
.Ed
.Pp
An argument of 0 to
.Fn yyless
will cause the entire current input string to be scanned again.
Unless the way the scanner will subsequently process its input has been changed
(using
.Em BEGIN ,
for example),
this will result in an endless loop.
.Pp
Note that
.Fn yyless
is a macro and can only be used in the
.Nm
input file, not from other source files.
.It unput(c)
Puts the character
.Ar c
back into the input stream.
It will be the next character scanned.
The following action will take the current token and cause it
to be rescanned enclosed in parentheses.
.Bd -literal -offset indent
{
    int i;
    char *yycopy;

    /* Copy yytext because unput() trashes yytext */
    if ((yycopy = strdup(yytext)) == NULL)
        err(1, NULL);
    unput(')');
    for (i = yyleng - 1; i >= 0; --i)
        unput(yycopy[i]);
    unput('(');
    free(yycopy);
}
.Ed
.Pp
Note that since each
.Fn unput
puts the given character back at the beginning of the input stream,
pushing back strings must be done back-to-front.
.Pp
An important potential problem when using
.Fn unput
is that if using
.Dq %pointer
.Pq the default ,
a call to
.Fn unput
destroys the contents of
.Fa yytext ,
starting with its rightmost character and devouring one character to
the left with each call.
If the value of
.Fa yytext
should be preserved after a call to
.Fn unput
.Pq as in the above example ,
it must either first be copied elsewhere, or the scanner must be built using
.Dq %array
instead (see
.Sx HOW THE INPUT IS MATCHED ) .
.Pp
Finally, note that EOF cannot be put back
to attempt to mark the input stream with an end-of-file.
.It input()
Reads the next character from the input stream.
For example, the following is one way to eat up C comments:
.Bd -literal -offset indent
%%
"/*" {
    int c;

    for (;;) {
        while ((c = input()) != '*' && c != EOF)
            ; /* eat up text of comment */

        if (c == '*') {
            while ((c = input()) == '*')
                ;
            if (c == '/')
                break; /* found the end */
        }

        if (c == EOF) {
            errx(1, "EOF in comment");
            break;
        }
    }
}
.Ed
.Pp
(Note that if the scanner is compiled using C++, then
.Fn input
is instead referred to as
.Fn yyinput ,
in order to avoid a name clash with the C++ stream by the name of input.)
.It YY_FLUSH_BUFFER
Flushes the scanner's internal buffer
so that the next time the scanner attempts to match a token,
it will first refill the buffer using
.Dv YY_INPUT
(see
.Sx THE GENERATED SCANNER ,
below).
This action is a special case of the more general
.Fn yy_flush_buffer
function, described below in the section
.Sx MULTIPLE INPUT BUFFERS .
.It yyterminate()
Can be used in lieu of a return statement in an action.
It terminates the scanner and returns a 0 to the scanner's caller, indicating
.Qq all done .
By default,
.Fn yyterminate
is also called when an end-of-file is encountered.
It is a macro and may be redefined.
.El
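.Pp
The back-to-front requirement noted under
.Fn unput
in the list above follows from push-back behaving like a stack:
the character pushed last is the one read first.
The sketch below models this with a small hypothetical structure;
.Vt struct pushback ,
.Fn pb_unput ,
.Fn pb_input ,
and
.Fn pb_unput_parenthesized
are illustrative names and are not part of the scanner that
.Nm
generates.

```c
#include <string.h>

/* Hypothetical push-back stack standing in for the scanner's input. */
struct pushback {
    char buf[64];
    size_t len;     /* number of characters currently pushed back */
};

/* Push c so that it becomes the next character read; -1 if full. */
static int
pb_unput(struct pushback *pb, char c)
{
    if (pb->len >= sizeof(pb->buf))
        return -1;
    pb->buf[pb->len++] = c;
    return 0;
}

/* Read the next pushed-back character, or -1 if none remain. */
static int
pb_input(struct pushback *pb)
{
    if (pb->len == 0)
        return -1;
    return (unsigned char)pb->buf[--pb->len];
}

/* Push token back enclosed in parentheses, last character first. */
static int
pb_unput_parenthesized(struct pushback *pb, const char *token)
{
    size_t i;

    if (pb_unput(pb, ')') == -1)
        return -1;
    for (i = strlen(token); i > 0; i--)
        if (pb_unput(pb, token[i - 1]) == -1)
            return -1;
    return pb_unput(pb, '(');
}
```

Reading the characters back with
.Fn pb_input
after pushing the token
.Qq abc
yields them in forward order, with the opening parenthesis first.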
.Sh THE GENERATED SCANNER
The output of
.Nm
is the file
.Pa lex.yy.c ,
which contains the scanning routine
.Fn yylex ,
a number of tables used by it for matching tokens,
and a number of auxiliary routines and macros.
By default,
.Fn yylex
is declared as follows:
.Bd -unfilled -offset indent
int yylex()
{
    ... various definitions and the actions in here ...
}
.Ed
.Pp
(If the environment supports function prototypes, then it will
be "int yylex(void)".)
This definition may be changed by defining the
.Dv YY_DECL
macro.
For example:
.Bd -literal -offset indent
#define YY_DECL float lexscan(a, b) float a, b;
.Ed
.Pp
would give the scanning routine the name
.Em lexscan ,
returning a float, and taking two floats as arguments.
Note that if arguments are given to the scanning routine using a
K&R-style/non-prototyped function declaration,
the definition must be terminated with a semi-colon
.Pq Sq ;\& .
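.Pp
For comparison, a prototype-style
.Dv YY_DECL
needs no trailing semicolon.
The sketch below assumes, as the K&R example above implies, that the
generated file places the body of the scanning routine directly after
.Dv YY_DECL ;
the body shown is a stand-in for illustration and is not a real scanner.

```c
/*
 * Sketch only: the scanner body is assumed to follow YY_DECL directly,
 * so a prototype-style definition needs no trailing semicolon.
 * The body below is a placeholder, not generated scanning code.
 */
#define YY_DECL float lexscan(float a, float b)

YY_DECL
{
    return a + b;   /* placeholder for the scanning code */
}
```

Callers would then invoke
.Fn lexscan
with two float arguments in place of
.Fn yylex .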
1224.Pp
1225Whenever
1226.Fn yylex
1227is called, it scans tokens from the global input file
1228.Pa yyin
1229.Pq which defaults to stdin .
1230It continues until it either reaches an end-of-file
1231.Pq at which point it returns the value 0
1232or one of its actions executes a
1233.Em return
1234statement.
1235.Pp
1236If the scanner reaches an end-of-file, subsequent calls are undefined
1237unless either
1238.Em yyin
1239is pointed at a new input file
1240.Pq in which case scanning continues from that file ,
1241or
1242.Fn yyrestart
1243is called.
1244.Fn yyrestart
1245takes one argument, a
1246.Fa FILE *
1247pointer (which can be nil, if
1248.Dv YY_INPUT
1249has been set up to scan from a source other than
1250.Em yyin ) ,
1251and initializes
1252.Em yyin
1253for scanning from that file.
1254Essentially there is no difference between just assigning
1255.Em yyin
to a new input file and using
1257.Fn yyrestart
1258to do so; the latter is available for compatibility with previous versions of
1259.Nm ,
1260and because it can be used to switch input files in the middle of scanning.
1261It can also be used to throw away the current input buffer,
1262by calling it with an argument of
1263.Em yyin ;
but it is better to use
1265.Dv YY_FLUSH_BUFFER
1266.Pq see above .
1267Note that
1268.Fn yyrestart
1269does not reset the start condition to
1270.Em INITIAL
1271(see
1272.Sx START CONDITIONS ,
1273below).
1274.Pp
1275If
1276.Fn yylex
1277stops scanning due to executing a
1278.Em return
1279statement in one of the actions, the scanner may then be called again and it
1280will resume scanning where it left off.
1281.Pp
1282By default
1283.Pq and for purposes of efficiency ,
1284the scanner uses block-reads rather than simple
1285.Xr getc 3
1286calls to read characters from
1287.Em yyin .
1288The nature of how it gets its input can be controlled by defining the
1289.Dv YY_INPUT
1290macro.
1291.Dv YY_INPUT Ns 's
1292calling sequence is
1293.Qq YY_INPUT(buf,result,max_size) .
1294Its action is to place up to
1295.Dv max_size
1296characters in the character array
1297.Em buf
1298and return in the integer variable
1299.Em result
1300either the number of characters read or the constant
1301.Dv YY_NULL
1302(0 on
1303.Ux
1304systems)
1305to indicate
1306.Dv EOF .
1307The default
1308.Dv YY_INPUT
1309reads from the global file-pointer
1310.Qq yyin .
1311.Pp
1312A sample definition of
1313.Dv YY_INPUT
1314.Pq in the definitions section of the input file :
1315.Bd -unfilled -offset indent
1316%{
1317#define YY_INPUT(buf,result,max_size) \e
1318{ \e
1319 int c = getchar(); \e
1320 result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
1321}
1322%}
1323.Ed
1324.Pp
1325This definition will change the input processing to occur
1326one character at a time.
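.Pp
Another reason to redefine
.Dv YY_INPUT
is to scan text that is already in memory.
The sketch below is one way this might look; the names
.Fn set_source
and
.Fn read_source
are hypothetical helpers, not part of
.Nm .
.Fn read_source
follows the contract described above: it stores at most
.Fa max_size
characters and reports 0
.Pq the value of Dv YY_NULL
at end of input.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory source; not part of flex. */
static const char *src_text = "";
static size_t src_off;

static void
set_source(const char *text)
{
	assert(text != NULL);
	src_text = text;
	src_off = 0;
}

/* Copy up to max_size bytes into buf; 0 (YY_NULL) means end-of-file. */
static int
read_source(char *buf, int max_size)
{
	size_t left, n;

	assert(buf != NULL && max_size >= 0);
	left = strlen(src_text) - src_off;
	n = left < (size_t)max_size ? left : (size_t)max_size;
	memcpy(buf, src_text + src_off, n);
	src_off += n;
	return (int)n;
}

#define YY_INPUT(buf, result, max_size) (result) = read_source((buf), (max_size))
```
.Ed
.Pp
As with the sample definition above, this code would be placed between
.Sq %{
and
.Sq %}
in the definitions section.
The
.Fn yy_scan_string
routine described in
.Sx MULTIPLE INPUT BUFFERS
is often the simpler choice for the same job.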
1327.Pp
1328When the scanner receives an end-of-file indication from
1329.Dv YY_INPUT ,
1330it then checks the
1331.Fn yywrap
1332function.
1333If
1334.Fn yywrap
1335returns false
1336.Pq zero ,
1337then it is assumed that the function has gone ahead and set up
1338.Em yyin
1339to point to another input file, and scanning continues.
1340If it returns true
1341.Pq non-zero ,
1342then the scanner terminates, returning 0 to its caller.
1343Note that in either case, the start condition remains unchanged;
1344it does not revert to
1345.Em INITIAL .
1346.Pp
1347If you do not supply your own version of
1348.Fn yywrap ,
1349then you must either use
1350.Dq %option noyywrap
1351(in which case the scanner behaves as though
1352.Fn yywrap
1353returned 1), or you must link with
1354.Fl lfl
1355to obtain the default version of the routine, which always returns 1.
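.Pp
As an illustration of supplying
.Fn yywrap ,
the following sketch steps through a list of file names.
The
.Va more_files
list and the
.Fn set_inputs
helper are hypothetical, files that cannot be opened are silently skipped,
and the definition of
.Va yyin
is only a stand-in so that the fragment compiles outside a scanner:
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stdio.h>

FILE *yyin;	/* stand-in; the generated scanner defines the real one */

/* Hypothetical NULL-terminated list of input files still to be read. */
static char **more_files;

static void
set_inputs(char **files)
{
	assert(files != NULL);
	more_files = files;
}

int
yywrap(void)
{
	while (more_files != NULL && *more_files != NULL) {
		FILE *fp = fopen(*more_files++, "r");

		if (fp != NULL) {
			if (yyin != NULL && yyin != stdin)
				fclose(yyin);
			yyin = fp;
			return 0;	/* more input: scanning continues */
		}
	}
	return 1;	/* nothing left: the scanner returns 0 */
}
```
.Ed
.Pp
Inside a real scanner the stand-in definition of
.Va yyin
would be dropped, since the generated code already provides it.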
1356.Pp
1357Three routines are available for scanning from in-memory buffers rather
1358than files:
1359.Fn yy_scan_string ,
1360.Fn yy_scan_bytes ,
1361and
1362.Fn yy_scan_buffer .
1363See the discussion of them below in the section
1364.Sx MULTIPLE INPUT BUFFERS .
1365.Pp
1366The scanner writes its
1367.Em ECHO
1368output to the
1369.Em yyout
1370global
1371.Pq default, stdout ,
1372which may be redefined by the user simply by assigning it to some other
1373.Va FILE
1374pointer.
1375.Sh START CONDITIONS
1376.Nm
1377provides a mechanism for conditionally activating rules.
1378Any rule whose pattern is prefixed with
1379.Qq <sc>
1380will only be active when the scanner is in the start condition named
1381.Qq sc .
1382For example,
1383.Bd -literal -offset indent
1384<STRING>[^"]* { /* eat up the string body ... */
1385 ...
1386}
1387.Ed
1388.Pp
1389will be active only when the scanner is in the
1390.Qq STRING
1391start condition, and
1392.Bd -literal -offset indent
1393<INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
1394 ...
1395}
1396.Ed
1397.Pp
1398will be active only when the current start condition is either
1399.Qq INITIAL ,
1400.Qq STRING ,
1401or
1402.Qq QUOTE .
1403.Pp
1404Start conditions are declared in the definitions
1405.Pq first
1406section of the input using unindented lines beginning with either
1407.Sq %s
1408or
1409.Sq %x
1410followed by a list of names.
1411The former declares
1412.Em inclusive
1413start conditions, the latter
1414.Em exclusive
1415start conditions.
1416A start condition is activated using the
1417.Em BEGIN
1418action.
1419Until the next
1420.Em BEGIN
1421action is executed, rules with the given start condition will be active and
1422rules with other start conditions will be inactive.
1423If the start condition is inclusive,
1424then rules with no start conditions at all will also be active.
1425If it is exclusive,
1426then only rules qualified with the start condition will be active.
1427A set of rules contingent on the same exclusive start condition
describes a scanner which is independent of any of the other rules in the
1429.Nm
1430input.
1431Because of this, exclusive start conditions make it easy to specify
1432.Qq mini-scanners
1433which scan portions of the input that are syntactically different
1434from the rest
1435.Pq e.g., comments .
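.Pp
The activation rule can be restated as a small C function.
This is only a toy model for illustration, not code from
.Nm ;
the function
.Fn rule_active
and the
.Dv EXAMPLE
condition are hypothetical names, and the
.Sq <*>
specifier is not modelled:
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/* Start conditions are small integers; INITIAL is always 0. */
enum { INITIAL = 0, EXAMPLE = 1 };

/*
 * Is a rule active while the scanner is in start condition cur?
 * scs lists the rule's <...> qualifiers (n is 0 if it has none);
 * cur_is_exclusive says whether cur was declared with %x.
 */
static int
rule_active(int cur, int cur_is_exclusive, const int *scs, size_t n)
{
	size_t i;

	assert(n == 0 || scs != NULL);
	if (n == 0)		/* unqualified rule */
		return !cur_is_exclusive;
	for (i = 0; i < n; i++)
		if (scs[i] == cur)
			return 1;
	return 0;
}
```
.Ed
.Pp
.Em INITIAL
counts as an inclusive condition here, which is why unqualified rules
are active in it.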
1436.Pp
1437If the distinction between inclusive and exclusive start conditions
1438is still a little vague, here's a simple example illustrating the
1439connection between the two.
1440The set of rules:
1441.Bd -literal -offset indent
1442%s example
1443%%
1444
1445<example>foo do_something();
1446
1447bar something_else();
1448.Ed
1449.Pp
1450is equivalent to
1451.Bd -literal -offset indent
1452%x example
1453%%
1454
1455<example>foo do_something();
1456
1457<INITIAL,example>bar something_else();
1458.Ed
1459.Pp
1460Without the <INITIAL,example> qualifier, the
1461.Dq bar
1462pattern in the second example wouldn't be active
1463.Pq i.e., couldn't match
1464when in start condition
1465.Dq example .
1466If we just used <example> to qualify
1467.Dq bar ,
1468though, then it would only be active in
1469.Dq example
1470and not in
1471.Em INITIAL ,
1472while in the first example it's active in both,
1473because in the first example the
1474.Dq example
1475start condition is an inclusive
1476.Pq Sq %s
1477start condition.
1478.Pp
1479Also note that the special start-condition specifier
1480.Sq <*>
1481matches every start condition.
1482Thus, the above example could also have been written:
1483.Bd -literal -offset indent
1484%x example
1485%%
1486
1487<example>foo do_something();
1488
1489<*>bar something_else();
1490.Ed
1491.Pp
1492The default rule (to
1493.Em ECHO
1494any unmatched character) remains active in start conditions.
1495It is equivalent to:
1496.Bd -literal -offset indent
1497<*>.|\en ECHO;
1498.Ed
1499.Pp
1500.Dq BEGIN(0)
1501returns to the original state where only the rules with
1502no start conditions are active.
1503This state can also be referred to as the start-condition
1504.Em INITIAL ,
1505so
1506.Dq BEGIN(INITIAL)
1507is equivalent to
1508.Dq BEGIN(0) .
1509(The parentheses around the start condition name are not required but
1510are considered good style.)
1511.Pp
1512.Em BEGIN
1513actions can also be given as indented code at the beginning
1514of the rules section.
1515For example, the following will cause the scanner to enter the
1516.Qq SPECIAL
1517start condition whenever
1518.Fn yylex
1519is called and the global variable
1520.Fa enter_special
1521is true:
1522.Bd -literal -offset indent
	int enter_special;
1524
1525%x SPECIAL
1526%%
1527 if (enter_special)
1528 BEGIN(SPECIAL);
1529
1530<SPECIAL>blahblahblah
1531\&...more rules follow...
1532.Ed
1533.Pp
1534To illustrate the uses of start conditions,
1535here is a scanner which provides two different interpretations
1536of a string like
1537.Qq 123.456 .
1538By default it will treat it as three tokens: the integer
1539.Qq 123 ,
1540a dot
1541.Pq Sq .\& ,
1542and the integer
1543.Qq 456 .
1544But if the string is preceded earlier in the line by the string
1545.Qq expect-floats
1546it will treat it as a single token, the floating-point number 123.456:
1547.Bd -literal -offset indent
1548%{
1549#include <math.h>
1550%}
1551%s expect
1552
1553%%
1554expect-floats BEGIN(expect);
1555
1556<expect>[0-9]+"."[0-9]+ {
1557 printf("found a float, = %s\en", yytext);
1558}
1559<expect>\en {
1560 /*
1561 * That's the end of the line, so
	 * we need another "expect-floats"
1563 * before we'll recognize any more
1564 * numbers.
1565 */
1566 BEGIN(INITIAL);
1567}
1568
1569[0-9]+ {
1570 printf("found an integer, = %s\en", yytext);
1571}
1572
1573"." printf("found a dot\en");
1574.Ed
1575.Pp
1576Here is a scanner which recognizes
1577.Pq and discards
1578C comments while maintaining a count of the current input line:
1579.Bd -literal -offset indent
1580%x comment
1581%%
	int line_num = 1;
1583
1584"/*" BEGIN(comment);
1585
1586<comment>[^*\en]* /* eat anything that's not a '*' */
1587<comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
1588<comment>\en ++line_num;
1589<comment>"*"+"/" BEGIN(INITIAL);
1590.Ed
1591.Pp
1592This scanner goes to a bit of trouble to match as much
1593text as possible with each rule.
1594In general, when attempting to write a high-speed scanner
1595try to match as much as possible in each rule, as it's a big win.
1596.Pp
1597Note that start-condition names are really integer values and
1598can be stored as such.
1599Thus, the above could be extended in the following fashion:
1600.Bd -literal -offset indent
1601%x comment foo
1602%%
	int line_num = 1;
	int comment_caller;
1605
1606"/*" {
1607 comment_caller = INITIAL;
1608 BEGIN(comment);
1609}
1610
1611\&...
1612
1613<foo>"/*" {
1614 comment_caller = foo;
1615 BEGIN(comment);
1616}
1617
1618<comment>[^*\en]* /* eat anything that's not a '*' */
1619<comment>"*"+[^*/\en]* /* eat up '*'s not followed by '/'s */
1620<comment>\en ++line_num;
1621<comment>"*"+"/" BEGIN(comment_caller);
1622.Ed
1623.Pp
1624Furthermore, the current start condition can be accessed by using
1625the integer-valued
1626.Dv YY_START
1627macro.
1628For example, the above assignments to
1629.Em comment_caller
1630could instead be written
1631.Pp
1632.Dl comment_caller = YY_START;
1633.Pp
.Nm
provides
1635.Dv YYSTATE
1636as an alias for
1637.Dv YY_START
1638(since that is what's used by
1639.At
1640.Nm lex ) .
1641.Pp
1642Note that start conditions do not have their own name-space;
1643%s's and %x's declare names in the same fashion as #define's.
1644.Pp
1645Finally, here's an example of how to match C-style quoted strings using
1646exclusive start conditions, including expanded escape sequences
1647(but not including checking for a string that's too long):
1648.Bd -literal -offset indent
1649%x str
1650
1651%%
	#define MAX_STR_CONST 1024
	char string_buf[MAX_STR_CONST];
	char *string_buf_ptr;
1655
1656\e" string_buf_ptr = string_buf; BEGIN(str);
1657
1658<str>\e" { /* saw closing quote - all done */
1659 BEGIN(INITIAL);
1660 *string_buf_ptr = '\e0';
1661 /*
1662 * return string constant token type and
1663 * value to parser
1664 */
1665}
1666
1667<str>\en {
1668 /* error - unterminated string constant */
1669 /* generate error message */
1670}
1671
1672<str>\e\e[0-7]{1,3} {
1673 /* octal escape sequence */
	unsigned int result;	/* %o stores into an unsigned int */
1675
1676 (void) sscanf(yytext + 1, "%o", &result);
1677
1678 if (result > 0xff) {
1679 /* error, constant is out-of-bounds */
1680 } else
1681 *string_buf_ptr++ = result;
1682}
1683
1684<str>\e\e[0-9]+ {
1685 /*
1686 * generate error - bad escape sequence; something
1687 * like '\e48' or '\e0777777'
1688 */
1689}
1690
1691<str>\e\en *string_buf_ptr++ = '\en';
1692<str>\e\et *string_buf_ptr++ = '\et';
1693<str>\e\er *string_buf_ptr++ = '\er';
1694<str>\e\eb *string_buf_ptr++ = '\eb';
1695<str>\e\ef *string_buf_ptr++ = '\ef';
1696
1697<str>\e\e(.|\en) *string_buf_ptr++ = yytext[1];
1698
1699<str>[^\e\e\en\e"]+ {
1700 char *yptr = yytext;
1701
1702 while (*yptr)
1703 *string_buf_ptr++ = *yptr++;
1704}
1705.Ed
1706.Pp
1707Often, such as in some of the examples above,
1708a whole bunch of rules are all preceded by the same start condition(s).
1709.Nm
1710makes this a little easier and cleaner by introducing a notion of
1711start condition
1712.Em scope .
1713A start condition scope is begun with:
1714.Pp
1715.Dl <SCs>{
1716.Pp
1717where
1718.Dq SCs
1719is a list of one or more start conditions.
1720Inside the start condition scope, every rule automatically has the prefix <SCs>
1721applied to it, until a
1722.Sq }
1723which matches the initial
1724.Sq { .
1725So, for example,
1726.Bd -literal -offset indent
1727<ESC>{
1728 "\e\en" return '\en';
1729 "\e\er" return '\er';
1730 "\e\ef" return '\ef';
1731 "\e\e0" return '\e0';
1732}
1733.Ed
1734.Pp
1735is equivalent to:
1736.Bd -literal -offset indent
1737<ESC>"\e\en" return '\en';
1738<ESC>"\e\er" return '\er';
1739<ESC>"\e\ef" return '\ef';
1740<ESC>"\e\e0" return '\e0';
1741.Ed
1742.Pp
1743Start condition scopes may be nested.
1744.Pp
1745Three routines are available for manipulating stacks of start conditions:
1746.Bl -tag -width Ds
1747.It void yy_push_state(int new_state)
1748Pushes the current start condition onto the top of the start condition
1749stack and switches to
1750.Fa new_state
1751as though
1752.Dq BEGIN new_state
1753had been used
1754.Pq recall that start condition names are also integers .
1755.It void yy_pop_state()
1756Pops the top of the stack and switches to it via
1757.Em BEGIN .
1758.It int yy_top_state()
1759Returns the top of the stack without altering the stack's contents.
1760.El
1761.Pp
1762The start condition stack grows dynamically and so has no built-in
1763size limitation.
1764If memory is exhausted, program execution aborts.
1765.Pp
1766To use start condition stacks, scanners must include a
1767.Dq %option stack
1768directive (see
1769.Sx OPTIONS
1770below).
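.Pp
The stack behaviour described above can be modelled in a few lines of C.
This is a sketch of the documented semantics only, not the code that
.Nm
generates; the names beginning with
.Dq model_
are hypothetical, and
.Va cur_state
plays the role of
.Dv YY_START :
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stdlib.h>

static int cur_state;		/* plays the role of YY_START */
static int *stack;		/* saved start conditions */
static size_t depth, cap;

static void
model_push_state(int new_state)
{
	if (depth == cap) {
		size_t ncap = cap ? cap * 2 : 8;
		int *p = realloc(stack, ncap * sizeof(*p));

		if (p == NULL)
			abort();	/* memory exhausted: abort */
		stack = p;
		cap = ncap;
	}
	stack[depth++] = cur_state;	/* save the current condition */
	cur_state = new_state;		/* as though BEGIN new_state */
}

static void
model_pop_state(void)
{
	assert(depth > 0);
	cur_state = stack[--depth];	/* switch to the popped value */
}

static int
model_top_state(void)
{
	assert(depth > 0);
	return stack[depth - 1];	/* the stack is left unchanged */
}
```
.Ed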
1771.Sh MULTIPLE INPUT BUFFERS
1772Some scanners
1773(such as those which support
1774.Qq include
1775files)
1776require reading from several input streams.
1777As
1778.Nm
1779scanners do a large amount of buffering, one cannot control
1780where the next input will be read from by simply writing a
1781.Dv YY_INPUT
1782which is sensitive to the scanning context.
1783.Dv YY_INPUT
1784is only called when the scanner reaches the end of its buffer, which
1785may be a long time after scanning a statement such as an
1786.Qq include
1787which requires switching the input source.
1788.Pp
1789To negotiate these sorts of problems,
1790.Nm
1791provides a mechanism for creating and switching between multiple
1792input buffers.
1793An input buffer is created by using:
1794.Pp
1795.D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
1796.Pp
1797which takes a
1798.Fa FILE
1799pointer and a
1800.Fa size
1801and creates a buffer associated with the given file and large enough to hold
1802.Fa size
1803characters (when in doubt, use
1804.Dv YY_BUF_SIZE
1805for the size).
1806It returns a
1807.Dv YY_BUFFER_STATE
1808handle, which may then be passed to other routines
1809.Pq see below .
1810The
1811.Dv YY_BUFFER_STATE
1812type is a pointer to an opaque
1813.Dq struct yy_buffer_state
1814structure, so
1815.Dv YY_BUFFER_STATE
1816variables may be safely initialized to
1817.Dq ((YY_BUFFER_STATE) 0)
1818if desired, and the opaque structure can also be referred to in order to
1819correctly declare input buffers in source files other than that of scanners.
1820Note that the
1821.Fa FILE
1822pointer in the call to
1823.Fn yy_create_buffer
1824is only used as the value of
1825.Fa yyin
1826seen by
1827.Dv YY_INPUT ;
1828if
1829.Dv YY_INPUT
1830is redefined so that it no longer uses
1831.Fa yyin ,
1832then a nil
1833.Fa FILE
1834pointer can safely be passed to
1835.Fn yy_create_buffer .
1836To select a particular buffer to scan:
1837.Pp
1838.D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
1839.Pp
1840It switches the scanner's input buffer so subsequent tokens will
1841come from
1842.Fa new_buffer .
1843Note that
1844.Fn yy_switch_to_buffer
1845may be used by
1846.Fn yywrap
1847to set things up for continued scanning,
1848instead of opening a new file and pointing
1849.Fa yyin
1850at it.
1851Note also that switching input sources via either
1852.Fn yy_switch_to_buffer
1853or
1854.Fn yywrap
1855does not change the start condition.
1856.Pp
1857.D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
1858.Pp
1859is used to reclaim the storage associated with a buffer.
1860.Pf ( Fa buffer
1861can be nil, in which case the routine does nothing.)
1862To clear the current contents of a buffer:
1863.Pp
1864.D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
1865.Pp
1866This function discards the buffer's contents,
1867so the next time the scanner attempts to match a token from the buffer,
1868it will first fill the buffer anew using
1869.Dv YY_INPUT .
1870.Pp
1871.Fn yy_new_buffer
1872is an alias for
1873.Fn yy_create_buffer ,
1874provided for compatibility with the C++ use of
1875.Em new
1876and
1877.Em delete
1878for creating and destroying dynamic objects.
1879.Pp
1880Finally, the
1881.Dv YY_CURRENT_BUFFER
1882macro returns a
1883.Dv YY_BUFFER_STATE
1884handle to the current buffer.
1885.Pp
1886Here is an example of using these features for writing a scanner
1887which expands include files (the <<EOF>> feature is discussed below):
1888.Bd -literal -offset indent
1889/*
1890 * the "incl" state is used for picking up the name
1891 * of an include file
1892 */
1893%x incl
1894
1895%{
#include <err.h>	/* err() and errx() are used below */

#define MAX_INCLUDE_DEPTH 10
1897YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1898int include_stack_ptr = 0;
1899%}
1900
1901%%
1902include BEGIN(incl);
1903
1904[a-z]+ ECHO;
1905[^a-z\en]*\en? ECHO;
1906
1907<incl>[ \et]* /* eat the whitespace */
1908<incl>[^ \et\en]+ { /* got the include file name */
1909 if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
1910 errx(1, "Includes nested too deeply");
1911
1912 include_stack[include_stack_ptr++] =
1913 YY_CURRENT_BUFFER;
1914
1915 yyin = fopen(yytext, "r");
1916
1917 if (yyin == NULL)
1918 err(1, NULL);
1919
1920 yy_switch_to_buffer(
1921 yy_create_buffer(yyin, YY_BUF_SIZE));
1922
1923 BEGIN(INITIAL);
1924}
1925
1926<<EOF>> {
1927 if (--include_stack_ptr < 0)
1928 yyterminate();
1929 else {
1930 yy_delete_buffer(YY_CURRENT_BUFFER);
1931 yy_switch_to_buffer(
1932 include_stack[include_stack_ptr]);
1933 }
1934}
1935.Ed
1936.Pp
1937Three routines are available for setting up input buffers for
1938scanning in-memory strings instead of files.
1939All of them create a new input buffer for scanning the string,
1940and return a corresponding
1941.Dv YY_BUFFER_STATE
1942handle (which should be deleted afterwards using
1943.Fn yy_delete_buffer ) .
1944They also switch to the new buffer using
1945.Fn yy_switch_to_buffer ,
1946so the next call to
1947.Fn yylex
1948will start scanning the string.
1949.Bl -tag -width Ds
1950.It yy_scan_string(const char *str)
1951Scans a NUL-terminated string.
1952.It yy_scan_bytes(const char *bytes, int len)
1953Scans
1954.Fa len
1955bytes
1956.Pq including possibly NUL's
1957starting at location
1958.Fa bytes .
1959.El
1960.Pp
1961Note that both of these functions create and scan a copy
1962of the string or bytes.
1963(This may be desirable, since
1964.Fn yylex
1965modifies the contents of the buffer it is scanning.)
1966The copy can be avoided by using:
1967.Bl -tag -width Ds
1968.It yy_scan_buffer(char *base, yy_size_t size)
1969Which scans the buffer starting at
1970.Fa base ,
1971consisting of
1972.Fa size
1973bytes, the last two bytes of which must be
1974.Dv YY_END_OF_BUFFER_CHAR
1975.Pq ASCII NUL .
These last two bytes are not scanned; thus, scanning consists of
base[0] through base[size-3], inclusive.
1978.Pp
1979If
1980.Fa base
1981is not set up in this manner
1982(i.e., forget the final two
1983.Dv YY_END_OF_BUFFER_CHAR
1984bytes), then
1985.Fn yy_scan_buffer
1986returns a nil pointer instead of creating a new input buffer.
1987.Pp
1988The type
1989.Fa yy_size_t
is an integral type to which an integer expression
reflecting the size of the buffer can be cast.
1992.El
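.Pp
Preparing a buffer in the required layout might look like the following sketch.
.Fn make_scan_buffer
is a hypothetical helper, not a
.Nm
routine; it copies a NUL-terminated string and appends the two
.Dv YY_END_OF_BUFFER_CHAR
bytes:
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy text into a new buffer ending in the two NUL bytes that
 * yy_scan_buffer() requires.  The total size, including those
 * two bytes, is stored in *size.  Returns NULL if out of memory.
 */
static char *
make_scan_buffer(const char *text, size_t *size)
{
	size_t len;
	char *buf;

	assert(text != NULL && size != NULL);
	len = strlen(text);
	if ((buf = malloc(len + 2)) == NULL)
		return NULL;
	memcpy(buf, text, len);
	buf[len] = buf[len + 1] = 0;
	*size = len + 2;
	return buf;
}
```
.Ed
.Pp
The returned buffer and size would then be passed to
.Fn yy_scan_buffer .
Since that routine scans the memory in place rather than copying it,
the caller remains responsible for freeing the buffer once scanning is done.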
1993.Sh END-OF-FILE RULES
1994The special rule
1995.Qq <<EOF>>
1996indicates actions which are to be taken when an end-of-file is encountered and
1997.Fn yywrap
1998returns non-zero
1999.Pq i.e., indicates no further files to process .
2000The action must finish by doing one of four things:
2001.Bl -dash
2002.It
2003Assigning
2004.Em yyin
2005to a new input file
2006(in previous versions of
2007.Nm ,
2008after doing the assignment, it was necessary to call the special action
2009.Dv YY_NEW_FILE ;
2010this is no longer necessary).
2011.It
2012Executing a
2013.Em return
2014statement.
2015.It
2016Executing the special
2017.Fn yyterminate
2018action.
2019.It
2020Switching to a new buffer using
2021.Fn yy_switch_to_buffer
2022as shown in the example above.
2023.El
2024.Pp
2025<<EOF>> rules may not be used with other patterns;
2026they may only be qualified with a list of start conditions.
2027If an unqualified <<EOF>> rule is given, it applies to all start conditions
2028which do not already have <<EOF>> actions.
2029To specify an <<EOF>> rule for only the initial start condition, use
2030.Pp
2031.Dl <INITIAL><<EOF>>
2032.Pp
2033These rules are useful for catching things like unclosed comments.
2034An example:
2035.Bd -literal -offset indent
2036%x quote
2037%%
2038
2039\&...other rules for dealing with quotes...
2040
2041<quote><<EOF>> {
2042 error("unterminated quote");
2043 yyterminate();
2044}
2045<<EOF>> {
2046 if (*++filelist)
2047 yyin = fopen(*filelist, "r");
2048 else
2049 yyterminate();
2050}
2051.Ed
2052.Sh MISCELLANEOUS MACROS
2053The macro
2054.Dv YY_USER_ACTION
2055can be defined to provide an action
2056which is always executed prior to the matched rule's action.
2057For example,
2058it could be #define'd to call a routine to convert yytext to lower-case.
2059When
2060.Dv YY_USER_ACTION
2061is invoked, the variable
2062.Fa yy_act
2063gives the number of the matched rule
2064.Pq rules are numbered starting with 1 .
2065For example, to profile how often each rule is matched,
2066the following would do the trick:
2067.Pp
2068.Dl #define YY_USER_ACTION ++ctr[yy_act]
2069.Pp
2070where
2071.Fa ctr
2072is an array to hold the counts for the different rules.
2073Note that the macro
2074.Dv YY_NUM_RULES
2075gives the total number of rules
2076(including the default rule, even if
2077.Fl s
2078is used),
2079so a correct declaration for
2080.Fa ctr
2081is:
2082.Pp
2083.Dl int ctr[YY_NUM_RULES];
2084.Pp
2085The macro
2086.Dv YY_USER_INIT
2087may be defined to provide an action which is always executed before
2088the first scan
2089.Pq and before the scanner's internal initializations are done .
2090For example, it could be used to call a routine to read
2091in a data table or open a logging file.
2092.Pp
2093The macro
2094.Dv yy_set_interactive(is_interactive)
2095can be used to control whether the current buffer is considered
2096.Em interactive .
2097An interactive buffer is processed more slowly,
2098but must be used when the scanner's input source is indeed
2099interactive to avoid problems due to waiting to fill buffers
2100(see the discussion of the
2101.Fl I
2102flag below).
2103A non-zero value in the macro invocation marks the buffer as interactive,
2104a zero value as non-interactive.
2105Note that use of this macro overrides
2106.Dq %option always-interactive
2107or
2108.Dq %option never-interactive
2109(see
2110.Sx OPTIONS
2111below).
2112.Fn yy_set_interactive
2113must be invoked prior to beginning to scan the buffer that is
2114.Pq or is not
2115to be considered interactive.
2116.Pp
2117The macro
2118.Dv yy_set_bol(at_bol)
2119can be used to control whether the current buffer's scanning
2120context for the next token match is done as though at the
2121beginning of a line.
2122A non-zero macro argument makes rules anchored with
2123.Sq ^
2124active, while a zero argument makes
2125.Sq ^
2126rules inactive.
2127.Pp
2128The macro
2129.Dv YY_AT_BOL
2130returns true if the next token scanned from the current buffer will have
2131.Sq ^
2132rules active, false otherwise.
2133.Pp
2134In the generated scanner, the actions are all gathered in one large
2135switch statement and separated using
2136.Dv YY_BREAK ,
2137which may be redefined.
2138By default, it is simply a
2139.Qq break ,
2140to separate each rule's action from the following rules.
2141Redefining
2142.Dv YY_BREAK
2143allows, for example, C++ users to
2144.Dq #define YY_BREAK
2145to do nothing
2146(while being very careful that every rule ends with a
2147.Qq break
2148or a
2149.Qq return ! )
to avoid unreachable statement warnings where, because a rule's
action ends with
2152.Dq return ,
2153the
2154.Dv YY_BREAK
2155is inaccessible.
2156.Sh VALUES AVAILABLE TO THE USER
2157This section summarizes the various values available to the user
2158in the rule actions.
2159.Bl -tag -width Ds
2160.It char *yytext
2161Holds the text of the current token.
2162It may be modified but not lengthened
2163.Pq characters cannot be appended to the end .
2164.Pp
2165If the special directive
2166.Dq %array
2167appears in the first section of the scanner description, then
2168.Fa yytext
2169is instead declared
2170.Dq char yytext[YYLMAX] ,
2171where
2172.Dv YYLMAX
2173is a macro definition that can be redefined in the first section
2174to change the default value
2175.Pq generally 8KB .
2176Using
2177.Dq %array
2178results in somewhat slower scanners, but the value of
2179.Fa yytext
2180becomes immune to calls to
2181.Fn input
2182and
2183.Fn unput ,
2184which potentially destroy its value when
2185.Fa yytext
2186is a character pointer.
2187The opposite of
2188.Dq %array
2189is
2190.Dq %pointer ,
2191which is the default.
2192.Pp
2193.Dq %array
2194cannot be used when generating C++ scanner classes
2195(the
2196.Fl +
2197flag).
2198.It int yyleng
2199Holds the length of the current token.
2200.It FILE *yyin
2201Is the file which by default
2202.Nm
2203reads from.
2204It may be redefined, but doing so only makes sense before
2205scanning begins or after an
2206.Dv EOF
2207has been encountered.
2208Changing it in the midst of scanning will have unexpected results since
2209.Nm
2210buffers its input; use
2211.Fn yyrestart
2212instead.
2213Once scanning terminates because an end-of-file
2214has been seen,
2215.Fa yyin
2216can be assigned as the new input file
2217and the scanner can be called again to continue scanning.
2218.It void yyrestart(FILE *new_file)
2219May be called to point
2220.Fa yyin
2221at the new input file.
2222The switch-over to the new file is immediate
2223.Pq any previously buffered-up input is lost .
2224Note that calling
2225.Fn yyrestart
2226with
2227.Fa yyin
2228as an argument thus throws away the current input buffer and continues
2229scanning the same input file.
2230.It FILE *yyout
2231Is the file to which
2232.Em ECHO
2233actions are done.
2234It can be reassigned by the user.
2235.It YY_CURRENT_BUFFER
2236Returns a
2237.Dv YY_BUFFER_STATE
2238handle to the current buffer.
2239.It YY_START
2240Returns an integer value corresponding to the current start condition.
2241This value can subsequently be used with
2242.Em BEGIN
2243to return to that start condition.
2244.El
2245.Sh INTERFACING WITH YACC
2246One of the main uses of
2247.Nm
2248is as a companion to the
2249.Xr yacc 1
2250parser-generator.
2251yacc parsers expect to call a routine named
2252.Fn yylex
2253to find the next input token.
2254The routine is supposed to return the type of the next token
2255as well as putting any associated value in the global
2256.Fa yylval ,
2257which is defined externally,
2258and can be a union or any other complex data structure.
2259To use
2260.Nm
2261with yacc, one specifies the
2262.Fl d
2263option to yacc to instruct it to generate the file
2264.Pa y.tab.h
2265containing definitions of all the
2266.Dq %tokens
2267appearing in the yacc input.
2268This file is then included in the
2269.Nm
2270scanner.
2271For example, part of the scanner might look like:
2272.Bd -literal -offset indent
2273%{
2274#include "y.tab.h"
2275%}
2276
2277%%
2278
2279if return TOK_IF;
2280then return TOK_THEN;
2281begin return TOK_BEGIN;
2282end return TOK_END;
2283.Ed
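.Pp
From the parser's side, the contract amounts to the following.
The sketch is a hand-written stand-in for a generated
.Fn yylex ,
shown only to illustrate the calling convention; the token values, the
.Dv TOK_NUMBER
token, and the
.Fn set_input
helper are assumptions, and
.Fa yylval
is a plain int here for brevity:
.Bd -literal -offset indent
```c
#include <assert.h>
#include <string.h>

/* Stand-ins for y.tab.h; the numeric values are made up. */
enum { TOK_IF = 257, TOK_THEN, TOK_NUMBER };
int yylval;

static const char *input = "";	/* hypothetical input cursor */

static void
set_input(const char *text)
{
	assert(text != NULL);
	input = text;
}

/* Return the token type, leave its value in yylval, 0 at the end. */
int
yylex(void)
{
	while (*input == ' ')
		input++;
	if (strncmp(input, "if", 2) == 0) {
		input += 2;
		return TOK_IF;
	}
	if (strncmp(input, "then", 4) == 0) {
		input += 4;
		return TOK_THEN;
	}
	if (*input >= '0' && *input <= '9') {
		yylval = 0;
		while (*input >= '0' && *input <= '9')
			yylval = yylval * 10 + (*input++ - '0');
		return TOK_NUMBER;
	}
	return 0;	/* end of input, or unknown text in this toy */
}
```
.Ed
.Pp
A parser would call this routine repeatedly until it returns 0,
reading
.Fa yylval
after each token that carries a value.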
2284.Sh OPTIONS
2285.Nm
2286has the following options:
2287.Bl -tag -width Ds
2288.It Fl 7
2289Instructs
2290.Nm
2291to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
2292characters in its input.
2293The advantage of using
2294.Fl 7
2295is that the scanner's tables can be up to half the size of those generated
2296using the
2297.Fl 8
2298option
2299.Pq see below .
2300The disadvantage is that such scanners often hang
2301or crash if their input contains an 8-bit character.
2302.Pp
2303Note, however, that unless generating a scanner using the
2304.Fl Cf
2305or
2306.Fl CF
2307table compression options, use of
2308.Fl 7
2309will save only a small amount of table space,
2310and make the scanner considerably less portable.
2311.Nm flex Ns 's
2312default behavior is to generate an 8-bit scanner unless
2313.Fl Cf
2314or
2315.Fl CF
2316is specified, in which case
2317.Nm
2318defaults to generating 7-bit scanners unless it was
2319configured to generate 8-bit scanners
2320(as will often be the case with non-USA sites).
It is possible to tell whether
2322.Nm
2323generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
2324.Fl v
2325output as described below.
2326.Pp
2327Note that if
2328.Fl Cfe
2329or
2330.Fl CFe
2331are used
2332(the table compression options, but also using equivalence classes as
2333discussed below),
2334.Nm
2335still defaults to generating an 8-bit scanner,
2336since usually with these compression options full 8-bit tables
2337are not much more expensive than 7-bit tables.
2338.It Fl 8
2339Instructs
2340.Nm
2341to generate an 8-bit scanner, i.e., one which can recognize 8-bit
2342characters.
2343This flag is only needed for scanners generated using
2344.Fl Cf
2345or
2346.Fl CF ,
2347as otherwise
2348.Nm
2349defaults to generating an 8-bit scanner anyway.
2350.Pp
2351See the discussion of
2352.Fl 7
2353above for
2354.Nm flex Ns 's
2355default behavior and the tradeoffs between 7-bit and 8-bit scanners.
2356.It Fl B
2357Instructs
2358.Nm
2359to generate a
2360.Em batch
2361scanner, the opposite of
2362.Em interactive
2363scanners generated by
2364.Fl I
2365.Pq see below .
2366In general,
2367.Fl B
2368is used when the scanner will never be used interactively,
2369and you want to squeeze a little more performance out of it.
2370If the aim is instead to squeeze out a lot more performance,
2371use the
2372.Fl Cf
2373or
2374.Fl CF
2375options
2376.Pq discussed below ,
2377which turn on
2378.Fl B
2379automatically anyway.
2380.It Fl b
2381Generate backing-up information to
2382.Pa lex.backup .
2383This is a list of scanner states which require backing up
2384and the input characters on which they do so.
2385By adding rules one can remove backing-up states.
2386If all backing-up states are eliminated and
2387.Fl Cf
2388or
2389.Fl CF
2390is used, the generated scanner will run faster (see the
2391.Fl p
2392flag).
2393Only users who wish to squeeze every last cycle out of their
2394scanners need worry about this option.
2395(See the section on
2396.Sx PERFORMANCE CONSIDERATIONS
2397below.)
2398.It Fl C Ns Op Cm aeFfmr
2399Controls the degree of table compression and, more generally, trade-offs
2400between small scanners and fast scanners.
2401.Bl -tag -width Ds
2402.It Fl Ca
2403Instructs
2404.Nm
2405to trade off larger tables in the generated scanner for faster performance
2406because the elements of the tables are better aligned for memory access
2407and computation.
2408On some
2409.Tn RISC
2410architectures, fetching and manipulating longwords is more efficient
2411than with smaller-sized units such as shortwords.
2412This option can double the size of the tables used by the scanner.
2413.It Fl Ce
2414Directs
2415.Nm
2416to construct
2417.Em equivalence classes ,
2418i.e., sets of characters which have identical lexical properties
2419(for example, if the only appearance of digits in the
2420.Nm
2421input is in the character class
2422.Qq [0-9]
2423then the digits
2424.Sq 0 ,
2425.Sq 1 ,
2426.Sq ... ,
2427.Sq 9
2428will all be put in the same equivalence class).
2429Equivalence classes usually give dramatic reductions in the final
2430table/object file sizes
2431.Pq typically a factor of 2\-5
2432and are pretty cheap performance-wise
2433.Pq one array look-up per character scanned .
2434.It Fl CF
2435Specifies that the alternate fast scanner representation
2436(described below under the
2437.Fl F
2438option)
2439should be used.
2440This option cannot be used with
2441.Fl + .
2442.It Fl Cf
2443Specifies that the
2444.Em full
2445scanner tables should be generated \-
2446.Nm
2447should not compress the tables by taking advantage of
2448similar transition functions for different states.
2449.It Fl \&Cm
2450Directs
2451.Nm
2452to construct
2453.Em meta-equivalence classes ,
2454which are sets of equivalence classes
2455(or characters, if equivalence classes are not being used)
2456that are commonly used together.
2457Meta-equivalence classes are often a big win when using compressed tables,
2458but they have a moderate performance impact
2459(one or two
2460.Qq if
2461tests and one array look-up per character scanned).
2462.It Fl Cr
2463Causes the generated scanner to
2464.Em bypass
2465use of the standard I/O library
2466.Pq stdio
2467for input.
2468Instead of calling
2469.Xr fread 3
2470or
2471.Xr getc 3 ,
2472the scanner will use the
2473.Xr read 2
2474system call,
2475resulting in a performance gain which varies from system to system,
2476but in general is probably negligible unless
2477.Fl Cf
2478or
2479.Fl CF
2480are being used.
2481Using
2482.Fl Cr
2483can cause strange behavior if, for example, the program reads from
2484.Fa yyin
2485using stdio prior to calling the scanner
2486(because the scanner will miss whatever text previous reads left
2487in the stdio input buffer).
2488.Pp
2489.Fl Cr
2490has no effect if
2491.Dv YY_INPUT
2492is defined
2493(see
2494.Sx THE GENERATED SCANNER
2495above).
2496.El
2497.Pp
2498A lone
2499.Fl C
2500specifies that the scanner tables should be compressed but neither
2501equivalence classes nor meta-equivalence classes should be used.
2502.Pp
2503The options
2504.Fl Cf
2505or
2506.Fl CF
2507and
2508.Fl \&Cm
2509do not make sense together \- there is no opportunity for meta-equivalence
2510classes if the table is not being compressed.
2511Otherwise the options may be freely mixed, and are cumulative.
2512.Pp
2513The default setting is
2514.Fl Cem
2515which specifies that
2516.Nm
2517should generate equivalence classes and meta-equivalence classes.
2518This setting provides the highest degree of table compression.
2519It is possible to trade off faster-executing scanners at the cost of
2520larger tables with the following generally being true:
2521.Bd -unfilled -offset indent
2522slowest & smallest
2523 -Cem
2524 -Cm
2525 -Ce
2526 -C
2527 -C{f,F}e
2528 -C{f,F}
2529 -C{f,F}a
2530fastest & largest
2531.Ed
2532.Pp
2533Note that scanners with the smallest tables are usually generated and
2534compiled the quickest,
2535so during development the default,
2536maximal compression, is usually best.
2537.Pp
2538.Fl Cfe
2539is often a good compromise between speed and size for production scanners.
2540.It Fl d
2541Makes the generated scanner run in debug mode.
2542Whenever a pattern is recognized and the global
2543.Fa yy_flex_debug
2544is non-zero
2545.Pq which is the default ,
2546the scanner will write to stderr a line of the form:
2547.Pp
2548.D1 --accepting rule at line 53 ("the matched text")
2549.Pp
2550The line number refers to the location of the rule in the file
2551defining the scanner
2552(i.e., the file that was fed to
2553.Nm ) .
2554Messages are also generated when the scanner backs up,
2555accepts the default rule,
2556reaches the end of its input buffer
2557(or encounters a NUL;
2558at this point, the two look the same as far as the scanner's concerned),
2559or reaches an end-of-file.
2560.It Fl F
2561Specifies that the fast scanner table representation should be used
2562.Pq and stdio bypassed .
2563This representation is about as fast as the full table representation
2564.Pq Fl f ,
2565and for some sets of patterns will be considerably smaller
2566.Pq and for others, larger .
2567In general, if the pattern set contains both
2568.Qq keywords
2569and a catch-all,
2570.Qq identifier
2571rule, such as in the set:
2572.Bd -unfilled -offset indent
2573"case" return TOK_CASE;
2574"switch" return TOK_SWITCH;
2575\&...
2576"default" return TOK_DEFAULT;
2577[a-z]+ return TOK_ID;
2578.Ed
2579.Pp
2580then it's better to use the full table representation.
2581If only the
2582.Qq identifier
2583rule is present and a hash table or some such is used to detect the keywords,
2584it's better to use
2585.Fl F .
2586.Pp
2587This option is equivalent to
2588.Fl CFr
2589.Pq see above .
2590It cannot be used with
2591.Fl + .
2592.It Fl f
2593Specifies
2594.Em fast scanner .
2595No table compression is done and stdio is bypassed.
2596The result is large but fast.
2597This option is equivalent to
2598.Fl Cfr
2599.Pq see above .
2600.It Fl h
2601Generates a help summary of
2602.Nm flex Ns 's
2603options to stdout and then exits.
2604.Fl ?\&
2605and
2606.Fl Fl help
2607are synonyms for
2608.Fl h .
2609.It Fl I
2610Instructs
2611.Nm
2612to generate an
2613.Em interactive
2614scanner.
2615An interactive scanner is one that only looks ahead to decide
2616what token has been matched if it absolutely must.
2617It turns out that always looking one extra character ahead,
2618even if the scanner has already seen enough text
2619to disambiguate the current token, is a bit faster than
2620only looking ahead when necessary.
2621But scanners that always look ahead give dreadful interactive performance;
2622for example, when a user types a newline,
2623it is not recognized as a newline token until they enter
2624.Em another
2625token, which often means typing in another whole line.
2626.Pp
2627.Nm
2628scanners default to
2629.Em interactive
2630unless
2631.Fl Cf
2632or
2633.Fl CF
2634table-compression options are specified
2635.Pq see above .
2636That's because if high performance is most important,
2637one of these options should be used,
2638so if they weren't,
2639.Nm
2640assumes it is preferable to trade off a bit of run-time performance for
2641intuitive interactive behavior.
2642Note also that
2643.Fl I
2644cannot be used in conjunction with
2645.Fl Cf
2646or
2647.Fl CF .
2648Thus, this option is not really needed; it is on by default for all those
2649cases in which it is allowed.
2650.Pp
2651A scanner can be forced to not be interactive by using
2652.Fl B
2653.Pq see above .
2654.It Fl i
2655Instructs
2656.Nm
2657to generate a case-insensitive scanner.
2658The case of letters given in the
2659.Nm
2660input patterns will be ignored,
2661and tokens in the input will be matched regardless of case.
2662The matched text given in
2663.Fa yytext
2664will have its case preserved
2665.Pq i.e., it will not be folded .
2666.It Fl L
2667Instructs
2668.Nm
2669not to generate
2670.Dq #line
2671directives.
2672Without this option,
2673.Nm
2674peppers the generated scanner with #line directives so error messages
2675in the actions will be correctly located with respect to either the original
2676.Nm
2677input file
2678(if the errors are due to code in the input file),
2679or
2680.Pa lex.yy.c
2681(if the errors are
2682.Nm flex Ns 's
2683fault \- these sorts of errors should be reported to the email address
2684given below).
2685.It Fl l
2686Turns on maximum compatibility with the original
2687.At
2688.Nm lex
2689implementation.
2690Note that this does not mean full compatibility.
2691Use of this option costs a considerable amount of performance,
2692and it cannot be used with the
2693.Fl + , f , F , Cf ,
2694or
2695.Fl CF
2696options.
2697For details on the compatibilities it provides, see the section
2698.Sx INCOMPATIBILITIES WITH LEX AND POSIX
2699below.
2700This option also results in the name
2701.Dv YY_FLEX_LEX_COMPAT
2702being #define'd in the generated scanner.
2703.It Fl n
2704Another do-nothing, deprecated option included only for
2705.Tn POSIX
2706compliance.
2707.It Fl o Ns Ar output
2708Directs
2709.Nm
2710to write the scanner to the file
2711.Ar output
2712instead of
2713.Pa lex.yy.c .
2714If
2715.Fl o
2716is combined with the
2717.Fl t
2718option, then the scanner is written to stdout but its
2719.Dq #line
2720directives
2721(see the
2722.Fl L
2723option above)
2724refer to the file
2725.Ar output .
2726.It Fl P Ns Ar prefix
2727Changes the default
2728.Qq yy
2729prefix used by
2730.Nm
2731for all globally visible variable and function names to instead be
2732.Ar prefix .
2733For example,
2734.Fl P Ns Ar foo
2735changes the name of
2736.Fa yytext
2737to
2738.Fa footext .
2739It also changes the name of the default output file from
2740.Pa lex.yy.c
2741to
2742.Pa lex.foo.c .
2743Here are all of the names affected:
2744.Bd -unfilled -offset indent
2745yy_create_buffer
2746yy_delete_buffer
2747yy_flex_debug
2748yy_init_buffer
2749yy_flush_buffer
2750yy_load_buffer_state
2751yy_switch_to_buffer
2752yyin
2753yyleng
2754yylex
2755yylineno
2756yyout
2757yyrestart
2758yytext
2759yywrap
2760.Ed
2761.Pp
2762(If using a C++ scanner, then only
2763.Fa yywrap
2764and
2765.Fa yyFlexLexer
2766are affected.)
2767Within the scanner itself, it is still possible to refer to the global variables
2768and functions using either version of their name; but externally, they
2769have the modified name.
2770.Pp
2771This option allows multiple
2772.Nm
2773programs to be easily linked together into the same executable.
2774Note, though, that using this option also renames
2775.Fn yywrap ,
2776so now either an
2777.Pq appropriately named
2778version of the routine for the scanner must be supplied, or
2779.Dq %option noyywrap
2780must be used, as linking with
2781.Fl lfl
2782no longer provides one by default.
2783.It Fl p
2784Generates a performance report to stderr.
2785The report consists of comments regarding features of the
2786.Nm
2787input file which will cause a serious loss of performance in the resulting
2788scanner.
2789If the flag is specified twice,
2790comments regarding features that lead to minor performance losses
2791will also be reported.
2792.Pp
2793Note that the use of
2794.Em REJECT ,
2795.Dq %option yylineno ,
2796and variable trailing context
2797(see the
2798.Sx BUGS
2799section below)
2800entails a substantial performance penalty; use of
2801.Fn yymore ,
2802the
2803.Sq ^
2804operator, and the
2805.Fl I
2806flag entail minor performance penalties.
2807.It Fl S Ns Ar skeleton
2808Overrides the default skeleton file from which
2809.Nm
2810constructs its scanners.
2811This option is needed only for
2812.Nm
2813maintenance or development.
2814.It Fl s
2815Causes the default rule
2816.Pq that unmatched scanner input is echoed to stdout
2817to be suppressed.
2818If the scanner encounters input that does not
2819match any of its rules, it aborts with an error.
2820This option is useful for finding holes in a scanner's rule set.
2821.It Fl T
2822Makes
2823.Nm
2824run in
2825.Em trace
2826mode.
2827It will generate a lot of messages to stderr concerning
2828the form of the input and the resultant non-deterministic and deterministic
2829finite automata.
2830This option is mostly for use in maintaining
2831.Nm .
2832.It Fl t
2833Instructs
2834.Nm
2835to write the scanner it generates to standard output instead of
2836.Pa lex.yy.c .
2837.It Fl V
2838Prints the version number to stdout and exits.
2839.Fl Fl version
2840is a synonym for
2841.Fl V .
2842.It Fl v
2843Specifies that
2844.Nm
2845should write to stderr
2846a summary of statistics regarding the scanner it generates.
2847Most of the statistics are meaningless to the casual
2848.Nm
2849user, but the first line identifies the version of
2850.Nm
2851(same as reported by
2852.Fl V ) ,
2853and the next line the flags used when generating the scanner,
2854including those that are on by default.
2855.It Fl w
2856Suppresses warning messages.
2857.It Fl +
2858Specifies that
2859.Nm
2860should generate a C++ scanner class.
2861See the section on
2862.Sx GENERATING C++ SCANNERS
2863below for details.
2864.El
2865.Pp
2866.Nm
2867also provides a mechanism for controlling options within the
2868scanner specification itself, rather than from the
2869.Nm
2870command line.
2871This is done by including
2872.Dq %option
2873directives in the first section of the scanner specification.
2874Multiple options can be specified with a single
2875.Dq %option
2876directive, and multiple directives may appear in the first section of the
2877.Nm
2878input file.
2879.Pp
2880Most options are given simply as names, optionally preceded by the word
2881.Qq no
2882.Pq with no intervening whitespace
2883to negate their meaning.
2884A number are equivalent to
2885.Nm
2886flags or their negation:
2887.Bd -unfilled -offset indent
28887bit -7 option
28898bit -8 option
2890align -Ca option
2891backup -b option
2892batch -B option
2893c++ -+ option
2894
2895caseful or
2896case-sensitive opposite of -i (default)
2897
2898case-insensitive or
2899caseless -i option
2900
2901debug -d option
2902default opposite of -s option
2903ecs -Ce option
2904fast -F option
2905full -f option
2906interactive -I option
2907lex-compat -l option
2908meta-ecs -Cm option
2909perf-report -p option
2910read -Cr option
2911stdout -t option
2912verbose -v option
2913warn opposite of -w option
2914 (use "%option nowarn" for -w)
2915
2916array equivalent to "%array"
2917pointer equivalent to "%pointer" (default)
2918.Ed
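.Pp
As a minimal sketch of how such directives combine,
the following specification requests a case-insensitive scanner
with the default rule suppressed.
The patterns and messages are illustrative only and are not taken from any
real scanner;
.Dq %option main
(described below) is assumed here so that the example needs no hand-written
.Fn main :
.Bd -literal -offset indent
%option case-insensitive nodefault
%option main
%%
begin|end    printf("keyword: %s\en", yytext);
[a-z]+       printf("word: %s\en", yytext);
[ \et\en]+     /* skip whitespace */
\&.            printf("other: %s\en", yytext);
.Ed
.Pp
Because
.Dq nodefault
corresponds to
.Fl s ,
the final two rules are there so that every input character is matched by
some rule.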
2919.Pp
2920Some %option's provide features otherwise not available:
2921.Bl -tag -width Ds
2922.It always-interactive
2923Instructs
2924.Nm
2925to generate a scanner which always considers its input
2926.Qq interactive .
2927Normally, on each new input file the scanner calls
2928.Fn isatty
2929in an attempt to determine whether the scanner's input source is interactive
2930and thus should be read a character at a time.
2931When this option is used, however, no such call is made.
2932.It main
2933Directs
2934.Nm
2935to provide a default
2936.Fn main
2937program for the scanner, which simply calls
2938.Fn yylex .
2939This option implies
2940.Dq noyywrap
2941.Pq see below .
2942.It never-interactive
2943Instructs
2944.Nm
2945to generate a scanner which never considers its input
2946.Qq interactive
2947(again, no call made to
2948.Fn isatty ) .
2949This is the opposite of
2950.Dq always-interactive .
2951.It stack
2952Enables the use of start condition stacks
2953(see
2954.Sx START CONDITIONS
2955above).
2956.It stdinit
2957If set (i.e.,
2958.Dq %option stdinit ) ,
2959initializes
2960.Fa yyin
2961and
2962.Fa yyout
2963to stdin and stdout, instead of the default of
2964.Dq nil .
2965Some existing
2966.Nm lex
2967programs depend on this behavior, even though it is not compliant with ANSI C,
2968which does not require stdin and stdout to be compile-time constant.
2969.It yylineno
2970Directs
2971.Nm
2972to generate a scanner that maintains the number of the current line
2973read from its input in the global variable
2974.Fa yylineno .
2975This option is implied by
2976.Dq %option lex-compat .
2977.It yywrap
2978If unset (i.e.,
2979.Dq %option noyywrap ) ,
2980makes the scanner not call
2981.Fn yywrap
2982upon an end-of-file, but simply assume that there are no more files to scan
2983(until the user points
2984.Fa yyin
2985at a new file and calls
2986.Fn yylex
2987again).
2988.El
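.Pp
A short, hedged illustration of
.Dq %option yylineno
together with
.Dq %option main :
the specification below reports the line on which each occurrence of the
(arbitrarily chosen) word
.Qq TODO
appears.
.Bd -literal -offset indent
%option yylineno main
%%
TODO    printf("line %d: TODO\en", yylineno);
\en      /* nothing to do: the scanner itself advances yylineno */
\&.       /* ignore everything else */
.Ed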
2989.Pp
2990.Nm
2991scans rule actions to determine whether the
2992.Em REJECT
2993or
2994.Fn yymore
2995features are being used.
2996The
2997.Dq reject
2998and
2999.Dq yymore
3000options are available to override its decision as to whether to use the
3001options, either by setting them (e.g.,
3002.Dq %option reject )
3003to indicate the feature is indeed used,
3004or unsetting them to indicate it actually is not used
3005(e.g.,
3006.Dq %option noyymore ) .
3007.Pp
3008Three options take string-delimited values, offset with
3009.Sq = :
3010.Pp
3011.D1 %option outfile="ABC"
3012.Pp
3013is equivalent to
3014.Fl o Ns Ar ABC ,
3015and
3016.Pp
3017.D1 %option prefix="XYZ"
3018.Pp
3019is equivalent to
3020.Fl P Ns Ar XYZ .
3021Finally,
3022.Pp
3023.D1 %option yyclass="foo"
3024.Pp
3025only applies when generating a C++ scanner
3026.Pf ( Fl +
3027option).
3028It informs
3029.Nm
3030that
3031.Dq foo
3032has been derived as a subclass of yyFlexLexer, so
3033.Nm
3034will place actions in the member function
3035.Dq foo::yylex()
3036instead of
3037.Dq yyFlexLexer::yylex() .
3038It also generates a
3039.Dq yyFlexLexer::yylex()
3040member function that emits a run-time error (by invoking
3041.Dq yyFlexLexer::LexerError() )
3042if called.
3043See
3044.Sx GENERATING C++ SCANNERS ,
3045below, for additional information.
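.Pp
The following sketch shows one way
.Dq %option yyclass
might be used.
The class name
.Dq WordCounter
and its
.Fa words
member are hypothetical, and the example assumes that the generated scanner
has already included the
.Nm
C++ header by the point at which the first-section code is compiled:
.Bd -literal -offset indent
%option c++ noyywrap yyclass="WordCounter"
%{
#include <iostream>

/* Hypothetical subclass; flex places the rule actions in its yylex(). */
class WordCounter : public yyFlexLexer {
public:
    WordCounter() : words(0) {}
    int yylex();
    int words;
};
%}
%%
[a-z]+    ++words;
\&.|\en     /* ignore everything else */
%%
int
main()
{
    WordCounter counter;

    counter.yylex();        /* returns 0 at end of input */
    std::cout << counter.words << std::endl;
    return 0;
}
.Ed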
3046.Pp
3047A number of options are available for
3048lint
3049purists who want to suppress the appearance of unneeded routines
3050in the generated scanner.
3051Each of the following, if unset
3052(e.g.,
3053.Dq %option nounput ) ,
3054results in the corresponding routine not appearing in the generated scanner:
3055.Bd -unfilled -offset indent
3056input, unput
3057yy_push_state, yy_pop_state, yy_top_state
3058yy_scan_buffer, yy_scan_bytes, yy_scan_string
3059.Ed
3060.Pp
3061(though
3062.Fn yy_push_state
3063and friends won't appear anyway unless
3064.Dq %option stack
3065is being used).
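.Pp
For instance, a specification that never calls these routines might begin as
follows.
This is only a sketch; the single rule is an arbitrary placeholder.
.Bd -literal -offset indent
%option nounput noinput
%option noyy_scan_buffer noyy_scan_bytes noyy_scan_string
%option main
%%
[ \et]+    putchar(' ');    /* squeeze runs of blanks */
.Ed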
3066.Sh PERFORMANCE CONSIDERATIONS
3067The main design goal of
3068.Nm
3069is that it generate high-performance scanners.
3070It has been optimized for dealing well with large sets of rules.
3071Aside from the effects on scanner speed of the table compression
3072.Fl C
3073options outlined above,
3074there are a number of options/actions which degrade performance.
3075These are, from most expensive to least:
3076.Bd -unfilled -offset indent
3077REJECT
3078%option yylineno
3079arbitrary trailing context
3080
3081pattern sets that require backing up
3082%array
3083%option interactive
3084%option always-interactive
3085
3086\&'^' beginning-of-line operator
3087yymore()
3088.Ed
3089.Pp
3090with the first three all being quite expensive
3091and the last two being quite cheap.
3092Note also that
3093.Fn unput
3094is implemented as a routine call that potentially does quite a bit of work,
3095while
3096.Fn yyless
3097is a quite-cheap macro; so if just putting back some excess text,
3098use
3099.Fn yyless .
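.Pp
As a hedged sketch of the cheaper approach, the rule below matches a word
ending in
.Qq ing
and then uses
.Fn yyless
to return that suffix to the input so that it is scanned again.
The patterns and output format are invented for illustration.
.Bd -literal -offset indent
%option main
%%
[a-z]+ing    {
        yyless(yyleng - 3);     /* keep the stem, rescan "ing" */
        printf("stem: %s\en", yytext);
    }
[a-z]+       printf("word: %s\en", yytext);
\&.|\en        /* ignore */
.Ed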
3100.Pp
3101.Em REJECT
3102should be avoided at all costs when performance is important.
3103It is a particularly expensive option.
3104.Pp
3105Getting rid of backing up is messy and often may be an enormous
3106amount of work for a complicated scanner.
3107In principle, one begins by using the
3108.Fl b
3109flag to generate a
3110.Pa lex.backup
3111file.
3112For example, on the input
3113.Bd -literal -offset indent
3114%%
3115foo return TOK_KEYWORD;
3116foobar return TOK_KEYWORD;
3117.Ed
3118.Pp
3119the file looks like:
3120.Bd -literal -offset indent
3121State #6 is non-accepting -
3122 associated rule line numbers:
3123 2 3
3124 out-transitions: [ o ]
3125 jam-transitions: EOF [ \e001-n p-\e177 ]
3126
3127State #8 is non-accepting -
3128 associated rule line numbers:
3129 3
3130 out-transitions: [ a ]
3131 jam-transitions: EOF [ \e001-` b-\e177 ]
3132
3133State #9 is non-accepting -
3134 associated rule line numbers:
3135 3
3136 out-transitions: [ r ]
3137 jam-transitions: EOF [ \e001-q s-\e177 ]
3138
3139Compressed tables always back up.
3140.Ed
3141.Pp
3142The first few lines tell us that there's a scanner state in
3143which it can make a transition on an
3144.Sq o
3145but not on any other character,
3146and that in that state the currently scanned text does not match any rule.
3147The state occurs when trying to match the rules found
3148at lines 2 and 3 in the input file.
3149If the scanner is in that state and then reads something other than an
3150.Sq o ,
3151it will have to back up to find a rule which is matched.
3152With a bit of headscratching one can see that this must be the
3153state it's in when it has seen
3154.Sq fo .
3155When this has happened, if anything other than another
3156.Sq o
3157is seen, the scanner will have to back up to simply match the
3158.Sq f
3159.Pq by the default rule .
3160.Pp
3161The comment regarding State #8 indicates there's a problem when
3162.Qq foob
3163has been scanned.
3164Indeed, on any character other than an
3165.Sq a ,
3166the scanner will have to back up to accept
3167.Qq foo .
3168Similarly, the comment for State #9 concerns when
3169.Qq fooba
3170has been scanned and an
3171.Sq r
3172does not follow.
3173.Pp
3174The final comment reminds us that there's no point going to
3175all the trouble of removing backing up from the rules unless we're using
3176.Fl Cf
3177or
3178.Fl CF ,
3179since there's no performance gain doing so with compressed scanners.
3180.Pp
3181The way to remove the backing up is to add
3182.Qq error
3183rules:
3184.Bd -literal -offset indent
3185%%
3186foo return TOK_KEYWORD;
3187foobar return TOK_KEYWORD;
3188
3189fooba |
3190foob |
3191fo {
3192 /* false alarm, not really a keyword */
3193 return TOK_ID;
3194}
3195.Ed
3196.Pp
3197Eliminating backing up among a list of keywords can also be done using a
3198.Qq catch-all
3199rule:
3200.Bd -literal -offset indent
3201%%
3202foo return TOK_KEYWORD;
3203foobar return TOK_KEYWORD;
3204
3205[a-z]+ return TOK_ID;
3206.Ed
3207.Pp
3208This is usually the best solution when appropriate.
3209.Pp
3210Backing up messages tend to cascade.
3211With a complicated set of rules it's not uncommon to get hundreds of messages.
3212If one can decipher them, though,
3213it often only takes a dozen or so rules to eliminate the backing up
3214(though it's easy to make a mistake and have an error rule accidentally match
3215a valid token; a possible future
3216.Nm
3217feature will be to automatically add rules to eliminate backing up).
3218.Pp
3219It's important to keep in mind that the benefits of eliminating
3220backing up are gained only if
3221.Em every
3222instance of backing up is eliminated.
3223Leaving just one gains nothing.
3224.Pp
3225.Em Variable
3226trailing context
3227(where both the leading and trailing parts do not have a fixed length)
3228entails almost the same performance loss as
3229.Em REJECT
3230.Pq i.e., substantial .
3231So when possible a rule like:
3232.Bd -literal -offset indent
3233%%
3234mouse|rat/(cat|dog) run();
3235.Ed
3236.Pp
3237is better written:
3238.Bd -literal -offset indent
3239%%
3240mouse/cat|dog run();
3241rat/cat|dog run();
3242.Ed
3243.Pp
3244or as
3245.Bd -literal -offset indent
3246%%
3247mouse|rat/cat run();
3248mouse|rat/dog run();
3249.Ed
3250.Pp
3251Note that here the special
3252.Sq |\&
3253action does not provide any savings, and can even make things worse (see
3254.Sx BUGS
3255below).
3256.Pp
3257Another area where the user can increase a scanner's performance
3258.Pq and one that's easier to implement
3259arises from the fact that the longer the tokens matched,
3260the faster the scanner will run.
3261This is because with long tokens the processing of most input
3262characters takes place in the
3263.Pq short
3264inner scanning loop, and does not often have to go through the additional work
3265of setting up the scanning environment (e.g.,
3266.Fa yytext )
3267for the action.
3268Recall the scanner for C comments:
3269.Bd -literal -offset indent
3270%x comment
3271%%
3272int line_num = 1;
3273
3274"/*" BEGIN(comment);
3275
3276<comment>[^*\en]*
3277<comment>"*"+[^*/\en]*
3278<comment>\en ++line_num;
3279<comment>"*"+"/" BEGIN(INITIAL);
3280.Ed
3281.Pp
3282This could be sped up by writing it as:
3283.Bd -literal -offset indent
3284%x comment
3285%%
3286int line_num = 1;
3287
3288"/*" BEGIN(comment);
3289
3290<comment>[^*\en]*
3291<comment>[^*\en]*\en ++line_num;
3292<comment>"*"+[^*/\en]*
3293<comment>"*"+[^*/\en]*\en ++line_num;
3294<comment>"*"+"/" BEGIN(INITIAL);
3295.Ed
3296.Pp
3297Now instead of each newline requiring the processing of another action,
3298recognizing the newlines is
3299.Qq distributed
3300over the other rules to keep the matched text as long as possible.
3301Note that adding rules does
3302.Em not
3303slow down the scanner!
3304The speed of the scanner is independent of the number of rules or
3305(modulo the considerations given at the beginning of this section)
3306how complicated the rules are with regard to operators such as
3307.Sq *
3308and
3309.Sq |\& .
3310.Pp
3311A final example in speeding up a scanner:
3312scan through a file containing identifiers and keywords, one per line
3313and with no other extraneous characters, and recognize all the keywords.
3314A natural first approach is:
3315.Bd -literal -offset indent
3316%%
3317asm |
3318auto |
3319break |
3320\&... etc ...
3321volatile |
3322while /* it's a keyword */
3323
3324\&.|\en /* it's not a keyword */
3325.Ed
3326.Pp
3327To eliminate the back-tracking, introduce a catch-all rule:
3328.Bd -literal -offset indent
3329%%
3330asm |
3331auto |
3332break |
3333\&... etc ...
3334volatile |
3335while /* it's a keyword */
3336
3337[a-z]+ |
3338\&.|\en /* it's not a keyword */
3339.Ed
3340.Pp
3341Now, if it's guaranteed that there's exactly one word per line,
3342then we can reduce the total number of matches by a half by
3343merging in the recognition of newlines with that of the other tokens:
3344.Bd -literal -offset indent
3345%%
3346asm\en |
3347auto\en |
3348break\en |
3349\&... etc ...
3350volatile\en |
3351while\en /* it's a keyword */
3352
3353[a-z]+\en |
3354\&.|\en /* it's not a keyword */
3355.Ed
3356.Pp
3357One has to be careful here,
3358as we have now reintroduced backing up into the scanner.
3359In particular, while we know that there will never be any characters
3360in the input stream other than letters or newlines,
3361.Nm
3362can't figure this out, and it will plan for possibly needing to back up
3363when it has scanned a token like
3364.Qq auto
3365and then the next character is something other than a newline or a letter.
3366Previously it would then just match the
3367.Qq auto
3368rule and be done, but now it has no
3369.Qq auto
3370rule, only an
3371.Qq auto\en
3372rule.
3373To eliminate the possibility of backing up,
3374we could either duplicate all rules but without final newlines or,
3375since we never expect to encounter such an input and therefore don't
3376care how it's classified, we can introduce one more catch-all rule,
3377this one which doesn't include a newline:
3378.Bd -literal -offset indent
3379%%
3380asm\en |
3381auto\en |
3382break\en |
3383\&... etc ...
3384volatile\en |
3385while\en /* it's a keyword */
3386
3387[a-z]+\en |
3388[a-z]+ |
3389\&.|\en /* it's not a keyword */
3390.Ed
3391.Pp
3392Compiled with
3393.Fl Cf ,
3394this is about as fast as one can get a
3395.Nm
3396scanner to go for this particular problem.
3397.Pp
3398A final note:
3399.Nm
3400is slow when matching NUL's,
3401particularly when a token contains multiple NUL's.
3402It's best to write rules which match short
3403amounts of text if it's anticipated that the text will often include NUL's.
3404.Pp
3405Another final note regarding performance: as mentioned above in the section
3406.Sx HOW THE INPUT IS MATCHED ,
3407dynamically resizing
3408.Fa yytext
3409to accommodate huge tokens is a slow process because it presently requires that
3410the
3411.Pq huge
3412token be rescanned from the beginning.
3413Thus if performance is vital, it is better to attempt to match
3414.Qq large
3415quantities of text but not
3416.Qq huge
3417quantities, where the cutoff between the two is at about 8K characters/token.
3418.Sh GENERATING C++ SCANNERS
3419.Nm
3420provides two different ways to generate scanners for use with C++.
3421The first way is to simply compile a scanner generated by
3422.Nm
3423using a C++ compiler instead of a C compiler.
3424This should not generate any compilation errors
3425(please report any found to the email address given in the
3426.Sx AUTHORS
3427section below).
3428C++ code can then be used in rule actions instead of C code.
3429Note that the default input source for scanners remains
3430.Fa yyin ,
3431and default echoing is still done to
3432.Fa yyout .
3433Both of these remain
3434.Fa FILE *
3435variables and not C++ streams.
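.Pp
A minimal sketch of this first approach follows.
It assumes the generated
.Pa lex.yy.c
is then compiled with a C++ compiler; the rule and its output text are
illustrative only.
.Bd -literal -offset indent
%option noyywrap
%{
#include <iostream>
#include <string>
%}
%%
[a-z]+    std::cout << "word: " << std::string(yytext, yyleng) << std::endl;
\&.|\en     /* ignore */
%%
int
main()
{
    return yylex();     /* input still comes from the FILE * yyin */
}
.Ed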
3436.Pp
3437.Nm
3438can also be used to generate a C++ scanner class, using the
3439.Fl +
3440option (or, equivalently,
3441.Dq %option c++ ) ,
3442which is automatically specified if the name of the flex executable ends in a
3443.Sq + ,
3444such as
3445.Nm flex++ .
3446When using this option,
3447.Nm
3448defaults to generating the scanner to the file
3449.Pa lex.yy.cc
3450instead of
3451.Pa lex.yy.c .
3452The generated scanner includes the header file
3453.In g++/FlexLexer.h ,
3454which defines the interface to two C++ classes.
3455.Pp
3456The first class,
3457.Em FlexLexer ,
3458provides an abstract base class defining the general scanner class interface.
3459It provides the following member functions:
3460.Bl -tag -width Ds
3461.It const char* YYText()
3462Returns the text of the most recently matched token, the equivalent of
3463.Fa yytext .
3464.It int YYLeng()
3465Returns the length of the most recently matched token, the equivalent of
3466.Fa yyleng .
3467.It int lineno() const
3468Returns the current input line number
3469(see
3470.Dq %option yylineno ) ,
3471or 1 if
3472.Dq %option yylineno
3473was not used.
3474.It void set_debug(int flag)
3475Sets the debugging flag for the scanner, equivalent to assigning to
3476.Fa yy_flex_debug
3477(see the
3478.Sx OPTIONS
3479section above).
3480Note that the scanner must be built using
3481.Dq %option debug
3482to include debugging information in it.
3483.It int debug() const
3484Returns the current setting of the debugging flag.
3485.El
3486.Pp
3487Also provided are member functions equivalent to
3488.Fn yy_switch_to_buffer ,
3489.Fn yy_create_buffer
3490(though the first argument is an
3491.Fa std::istream*
3492object pointer and not a
3493.Fa FILE* ) ,
3494.Fn yy_flush_buffer ,
3495.Fn yy_delete_buffer ,
3496and
3497.Fn yyrestart
3498(again, the first argument is an
3499.Fa std::istream*
3500object pointer).
.Pp
The second class defined in
.In g++/FlexLexer.h
is
.Fa yyFlexLexer ,
which is derived from
.Fa FlexLexer .
It defines the following additional member functions:
.Bl -tag -width Ds
.It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)"
Constructs a
.Fa yyFlexLexer
object using the given streams for input and output.
If not specified, the streams default to
.Fa cin
and
.Fa cout ,
respectively.
.It virtual int yylex()
Performs the same role as
.Fn yylex
does for ordinary flex scanners: it scans the input stream, consuming
tokens, until a rule's action returns a value.
If subclass
.Sq S
is derived from
.Fa yyFlexLexer ,
in order to access the member functions and variables of
.Sq S
inside
.Fn yylex ,
use
.Dq %option yyclass="S"
to inform
.Nm
that the
.Sq S
subclass will be used instead of
.Fa yyFlexLexer .
In this case, rather than generating
.Dq yyFlexLexer::yylex() ,
.Nm
generates
.Dq S::yylex()
(and also generates a dummy
.Dq yyFlexLexer::yylex()
that calls
.Dq yyFlexLexer::LexerError()
if called).
.It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)"
Reassigns
.Fa yyin
to
.Fa new_in
.Pq if non-nil
and
.Fa yyout
to
.Fa new_out
.Pq ditto ,
deleting the previous input buffer if
.Fa yyin
is reassigned.
.It int yylex(std::istream* new_in, std::ostream* new_out = 0)
First switches the input streams via
.Dq switch_streams(new_in, new_out)
and then returns the value of
.Fn yylex .
.El
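.Pp
As a minimal sketch of the
.Dq %option yyclass
mechanism described above, the following specification derives a
hypothetical class
.Fa WordCounter
from
.Fa yyFlexLexer
and updates one of its members from within a rule action.
The class name, its
.Fa words
member, and the
.Fn main
routine are illustrative assumptions, not part of
.Nm .
The sketch also assumes that the generated scanner has already included
.In g++/FlexLexer.h
before the code in the definitions section, and uses
.Dq %option noyywrap
so that no
.Fn yywrap
routine has to be supplied:
.Bd -literal -offset indent
%{
// Hypothetical subclass; yylex() is only declared here.
// With %option yyclass, flex generates WordCounter::yylex().
class WordCounter : public yyFlexLexer {
public:
        WordCounter() : words(0) {}
        int yylex();
        int words;
};
%}

%option noyywrap
%option yyclass="WordCounter"

%%

[A-Za-z]+   ++words;    /* a member of WordCounter */
\&.|\en        /* ignore everything else */

%%

int main()
{
        WordCounter lexer;

        while (lexer.yylex() != 0)
                ;
        std::cout << lexer.words << '\en';
        return 0;
}
.Ed
.Pp
Because the rule actions become part of
.Dq WordCounter::yylex() ,
they can refer to
.Fa words
directly, without any qualification.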
.Pp
In addition,
.Fa yyFlexLexer
defines the following protected virtual functions which can be redefined
in derived classes to tailor the scanner:
.Bl -tag -width Ds
.It virtual int LexerInput(char* buf, int max_size)
Reads up to
.Fa max_size
characters into
.Fa buf
and returns the number of characters read.
To indicate end-of-input, return 0 characters.
Note that
.Qq interactive
scanners (see the
.Fl B
and
.Fl I
flags) define the macro
.Dv YY_INTERACTIVE .
If
.Fn LexerInput
has been redefined, and it's necessary to take different actions depending on
whether or not the scanner might be scanning an interactive input source,
it's possible to test for the presence of this name via
.Dq #ifdef .
.It virtual void LexerOutput(const char* buf, int size)
Writes out
.Fa size
characters from the buffer
.Fa buf ,
which, while NUL-terminated, may also contain
.Qq internal
NUL's if the scanner's rules can match text with NUL's in them.
.It virtual void LexerError(const char* msg)
Reports a fatal error message.
The default version of this function writes the message to the stream
.Fa cerr
and exits.
.El
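.Pp
As an illustration of redefining
.Fn LexerInput ,
the following sketch feeds the scanner from a string held in memory
rather than from a stream.
The class
.Fa StringLexer ,
its members, and the sample input are hypothetical names chosen for this
example.
The sketch assumes that the generated scanner has already included
.In g++/FlexLexer.h
before the code in the definitions section, uses
.Dq %option noyywrap
so that no
.Fn yywrap
routine has to be supplied, and ignores the
.Dv YY_INTERACTIVE
distinction described above:
.Bd -literal -offset indent
%{
#include <string>

// Hypothetical subclass: scans a std::string, not a stream.
class StringLexer : public yyFlexLexer {
public:
        StringLexer(const std::string &s) : text(s), pos(0) {}
protected:
        virtual int LexerInput(char *buf, int max_size);
private:
        std::string text;
        std::string::size_type pos;
};
%}

%option noyywrap

%%

[0-9]+      std::cout << "number " << YYText() << '\en';
\&.|\en        /* skip everything else */

%%

// Copy at most max_size characters into buf.
// Returning 0 tells the scanner that the input is exhausted.
int StringLexer::LexerInput(char *buf, int max_size)
{
        int n = 0;

        while (n < max_size && pos < text.size())
                buf[n++] = text[pos++];
        return n;
}

int main()
{
        StringLexer lexer("12 abc 345");

        while (lexer.yylex() != 0)
                ;
        return 0;
}
.Ed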
.Pp
Note that a
.Fa yyFlexLexer
object contains its entire scanning state.
Thus such objects can be used to create reentrant scanners.
Multiple instances of the same
.Fa yyFlexLexer
class can be instantiated, and multiple C++ scanner classes can be combined
in the same program using the
.Fl P
option discussed above.
.Pp
Finally, note that the
.Dq %array
feature is not available to C++ scanner classes;
.Dq %pointer
must be used
.Pq the default .
.Pp
Here is an example of a simple C++ scanner:
.Bd -literal -offset indent
    // An example of using the flex C++ scanner class.

%{
#include <errno.h>
int mylineno = 0;
%}

string  \e"[^\en"]+\e"

ws      [ \et]+

alpha   [A-Za-z]
dig     [0-9]
name    ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
num1    [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
num2    [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
number  {num1}|{num2}

%%

{ws}    /* skip blanks and tabs */

"/*" {
        int c;

        /* skip a C-style comment, counting the lines it spans */
        while ((c = yyinput()) != 0) {
                if(c == '\en')
                        ++mylineno;
                else if(c == '*') {
                        if ((c = yyinput()) == '/')
                                break;
                        else
                                unput(c);
                }
        }
}

{number}  std::cout << "number " << YYText() << '\en';

\en        mylineno++;

{name}    std::cout << "name " << YYText() << '\en';

{string}  std::cout << "string " << YYText() << '\en';

%%

int main(int /* argc */, char** /* argv */)
{
        FlexLexer* lexer = new yyFlexLexer;
        while(lexer->yylex() != 0)
                ;
        return 0;
}
.Ed
.Pp
To create multiple
.Pq different
lexer classes, use the
.Fl P
flag
(or the
.Dq prefix=
option)
to rename each
.Fa yyFlexLexer
to some other
.Fa xxFlexLexer .
.In g++/FlexLexer.h
can then be included in other sources once per lexer class, first renaming
.Fa yyFlexLexer
as follows:
.Bd -literal -offset indent
#undef yyFlexLexer
#define yyFlexLexer xxFlexLexer
#include <g++/FlexLexer.h>

#undef yyFlexLexer
#define yyFlexLexer zzFlexLexer
#include <g++/FlexLexer.h>
.Ed
.Pp
This example assumes that
.Dq %option prefix="xx"
is used for one scanner and
.Dq %option prefix="zz"
is used for the other.
.Pp
.Sy IMPORTANT :
the present form of the scanning class is experimental
and may change considerably between major releases.
.Sh INCOMPATIBILITIES WITH LEX AND POSIX
.Nm
is a rewrite of the
.At
.Nm lex
tool
(the two implementations do not share any code, though),
with some extensions and incompatibilities, both of which are of concern
to those who wish to write scanners acceptable to either implementation.
.Nm
is fully compliant with the
.Tn POSIX
.Nm lex
specification, except that when using
.Dq %pointer
.Pq the default ,
a call to
.Fn unput
destroys the contents of
.Fa yytext ,
which is counter to the
.Tn POSIX
specification.
.Pp
In this section we discuss all of the known areas of incompatibility between
.Nm ,
.At
.Nm lex ,
and the
.Tn POSIX
specification.
.Pp
.Nm flex Ns 's
.Fl l
option turns on maximum compatibility with the original
.At
.Nm lex
implementation, at the cost of a major loss in the generated scanner's
performance.
We note below which incompatibilities can be overcome using the
.Fl l
option.
.Pp
.Nm
is fully compatible with
.Nm lex
with the following exceptions:
.Bl -dash
.It
The undocumented
.Nm lex
scanner internal variable
.Fa yylineno
is not supported unless
.Fl l
or
.Dq %option yylineno
is used.
.Pp
.Fa yylineno
should be maintained on a per-buffer basis, rather than a per-scanner
.Pq single global variable
basis.
.Pp
.Fa yylineno
is not part of the
.Tn POSIX
specification.
.It
The
.Fn input
routine is not redefinable, though it may be called to read characters
following whatever has been matched by a rule.
If
.Fn input
encounters an end-of-file, the normal
.Fn yywrap
processing is done.
A
.Dq real
end-of-file is returned by
.Fn input
as
.Dv EOF .
.Pp
Input is instead controlled by defining the
.Dv YY_INPUT
macro.
.Pp
The
.Nm
restriction that
.Fn input
cannot be redefined is in accordance with the
.Tn POSIX
specification, which simply does not specify any way of controlling the
scanner's input other than by making an initial assignment to
.Fa yyin .
.It
The
.Fn unput
routine is not redefinable.
This restriction is in accordance with
.Tn POSIX .
.It
.Nm
scanners are not as reentrant as
.Nm lex
scanners.
In particular, if a scanner is interactive and
an interrupt handler long-jumps out of the scanner,
and the scanner is subsequently called again,
the following error message may be displayed:
.Pp
.D1 fatal flex scanner internal error--end of buffer missed
.Pp
To reenter the scanner, first use
.Pp
.Dl yyrestart(yyin);
.Pp
Note that this call will throw away any buffered input;
usually this isn't a problem with an interactive scanner.
.Pp
Also note that flex C++ scanner classes are reentrant,
so if using C++ is an option, they should be used instead.
See
.Sx GENERATING C++ SCANNERS
above for details.
.It
.Fn output
is not supported.
Output from the
.Em ECHO
macro is done to the file-pointer
.Fa yyout
.Pq default stdout .
.Pp
.Fn output
is not part of the
.Tn POSIX
specification.
.It
.Nm lex
does not support exclusive start conditions
.Pq %x ,
though they are in the
.Tn POSIX
specification.
.It
When definitions are expanded,
.Nm
encloses them in parentheses.
With
.Nm lex ,
the following:
.Bd -literal -offset indent
NAME    [A-Z][A-Z0-9]*
%%
foo{NAME}?      printf("Found it\en");
%%
.Ed
.Pp
will not match the string
.Qq foo
because when the macro is expanded the rule is equivalent to
.Qq foo[A-Z][A-Z0-9]*?
and the precedence is such that the
.Sq ?\&
is associated with
.Qq [A-Z0-9]* .
With
.Nm ,
the rule will be expanded to
.Qq foo([A-Z][A-Z0-9]*)?
and so the string
.Qq foo
will match.
.Pp
Note that if the definition begins with
.Sq ^
or ends with
.Sq $
then it is not expanded with parentheses, to allow these operators to appear in
definitions without losing their special meanings.
But the
.Sq <s> ,
.Sq / ,
and
.Sq <<EOF>>
operators cannot be used in a
.Nm
definition.
.Pp
Using
.Fl l
results in the
.Nm lex
behavior of no parentheses around the definition.
.Pp
The
.Tn POSIX
specification is that the definition be enclosed in parentheses.
.It
Some implementations of
.Nm lex
allow a rule's action to begin on a separate line,
if the rule's pattern has trailing whitespace:
.Bd -literal -offset indent
%%
foo|bar<space here>
  { foobar_action(); }
.Ed
.Pp
.Nm
does not support this feature.
.It
The
.Nm lex
.Sq %r
.Pq generate a Ratfor scanner
option is not supported.
It is not part of the
.Tn POSIX
specification.
.It
After a call to
.Fn unput ,
.Fa yytext
is undefined until the next token is matched,
unless the scanner was built using
.Dq %array .
This is not the case with
.Nm lex
or the
.Tn POSIX
specification.
The
.Fl l
option does away with this incompatibility.
.It
The precedence of the
.Sq {}
.Pq numeric range
operator is different.
.Nm lex
interprets
.Qq abc{1,3}
as match one, two, or three occurrences of
.Sq abc ,
whereas
.Nm
interprets it as match
.Sq ab
followed by one, two, or three occurrences of
.Sq c .
The latter is in agreement with the
.Tn POSIX
specification.
.It
The precedence of the
.Sq ^
operator is different.
.Nm lex
interprets
.Qq ^foo|bar
as match either
.Sq foo
at the beginning of a line, or
.Sq bar
anywhere, whereas
.Nm
interprets it as match either
.Sq foo
or
.Sq bar
if they come at the beginning of a line.
The latter is in agreement with the
.Tn POSIX
specification.
.It
The special table-size declarations such as
.Sq %a
supported by
.Nm lex
are not required by
.Nm
scanners;
.Nm
ignores them.
.It
The name
.Dv FLEX_SCANNER
is #define'd so scanners may be written for use with either
.Nm
or
.Nm lex .
Scanners also include
.Dv YY_FLEX_MAJOR_VERSION
and
.Dv YY_FLEX_MINOR_VERSION
indicating which version of
.Nm
generated the scanner
(for example, for the 2.5 release, these defines would be 2 and 5,
respectively).
.El
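.Pp
As a small sketch of how the
.Dv FLEX_SCANNER
name from the last item can be used, the following fragment selects a
label at compile time depending on which tool generated the scanner.
The macro
.Dv SCANNER_NAME
is a hypothetical name introduced only for this example:
.Bd -literal -offset indent
%{
#include <stdio.h>

/* FLEX_SCANNER is defined only in flex-generated scanners. */
#ifdef FLEX_SCANNER
#define SCANNER_NAME "flex"
#else
#define SCANNER_NAME "lex"
#endif
%}

%%

[0-9]+  printf("%s: number %s\en", SCANNER_NAME, yytext);
.Ed
.Pp
The same test can guard the use of any
.Nm
extension listed in this section, leaving a portable fallback for
.Nm lex .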
.Pp
The following
.Nm
features are not included in
.Nm lex
or the
.Tn POSIX
specification:
.Bd -unfilled -offset indent
C++ scanners
%option
start condition scopes
start condition stacks
interactive/non-interactive scanners
yy_scan_string() and friends
yyterminate()
yy_set_interactive()
yy_set_bol()
YY_AT_BOL()
<<EOF>>
<*>
YY_DECL
YY_START
YY_USER_ACTION
YY_USER_INIT
#line directives
%{}'s around actions
multiple actions on a line
.Ed
.Pp
plus almost all of the
.Nm
flags.
The last feature in the list refers to the fact that with
.Nm
multiple actions can be placed on the same line,
separated with semi-colons, while with
.Nm lex ,
the following
.Pp
.Dl foo handle_foo(); ++num_foos_seen;
.Pp
is
.Pq rather surprisingly
truncated to
.Pp
.Dl foo handle_foo();
.Pp
.Nm
does not truncate the action.
Actions that are not enclosed in braces
are simply terminated at the end of the line.
.Sh FILES
.Bl -tag -width "<g++/FlexLexer.h>"
.It Pa flex.skl
Skeleton scanner.
This file is only used when building flex, not when
.Nm
executes.
.It Pa lex.backup
Backing-up information for the
.Fl b
flag (called
.Pa lex.bck
on some systems).
.It Pa lex.yy.c
Generated scanner
(called
.Pa lexyy.c
on some systems).
.It Pa lex.yy.cc
Generated C++ scanner class, when using
.Fl + .
.It In g++/FlexLexer.h
Header file defining the C++ scanner base class,
.Fa FlexLexer ,
and its derived class,
.Fa yyFlexLexer .
.It Pa /usr/lib/libl.*
.Nm
libraries.
The
.Pa /usr/lib/libfl.*\&
libraries are links to these.
Scanners must be linked using either
.Fl \&ll
or
.Fl lfl .
.El
.Sh EXIT STATUS
.Ex -std flex
.Sh DIAGNOSTICS
.Bl -diag
.It warning, rule cannot be matched
Indicates that the given rule cannot be matched because it follows other rules
that will always match the same text as it.
For example, in the following
.Dq foo
cannot be matched because it comes after an identifier
.Qq catch-all
rule:
.Bd -literal -offset indent
[a-z]+    got_identifier();
foo       got_foo();
.Ed
.Pp
Using
.Em REJECT
in a scanner suppresses this warning.
.It "warning, \-s option given but default rule can be matched"
Means that it is possible
.Pq perhaps only in a particular start condition
that the default rule
.Pq match any single character
is the only one that will match a particular input.
Since
.Fl s
was given, presumably this is not intended.
.It reject_used_but_not_detected undefined
.It yymore_used_but_not_detected undefined
These errors can occur at compile time.
They indicate that the scanner uses
.Em REJECT
or
.Fn yymore
but that
.Nm
failed to notice the fact, meaning that
.Nm
scanned the first two sections looking for occurrences of these actions
and failed to find any, but somehow they snuck in
.Pq via an #include file, for example .
Use
.Dq %option reject
or
.Dq %option yymore
to indicate to
.Nm
that these features are really needed.
.It flex scanner jammed
A scanner compiled with
.Fl s
has encountered an input string which wasn't matched by any of its rules.
This error can also occur due to internal problems.
.It token too large, exceeds YYLMAX
The scanner uses
.Dq %array
and one of its rules matched a string longer than the
.Dv YYLMAX
constant
.Pq 8K bytes by default .
The value can be increased by #define'ing
.Dv YYLMAX
in the definitions section of
.Nm
input.
.It "scanner requires \-8 flag to use the character 'x'"
The scanner specification includes recognizing the 8-bit character
.Sq x
and the
.Fl 8
flag was not specified, and defaulted to 7-bit because the
.Fl Cf
or
.Fl CF
table compression options were used.
See the discussion of the
.Fl 7
flag for details.
.It flex scanner push-back overflow
.Fn unput
was used to push back so much text that the scanner's buffer
could not hold both the pushed-back text and the current token in
.Fa yytext .
Ideally the scanner should dynamically resize the buffer in this case,
but at present it does not.
.It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
The scanner was working on matching an extremely large token and needed
to expand the input buffer.
This doesn't work with scanners that use
.Em REJECT .
.It "fatal flex scanner internal error--end of buffer missed"
This can occur in a scanner which is reentered after a long-jump
has jumped out
.Pq or over
the scanner's activation frame.
Before reentering the scanner, use:
.Pp
.Dl yyrestart(yyin);
.Pp
or, as noted above, switch to using the C++ scanner class.
.It "too many start conditions in <> construct!"
More start conditions than exist were listed in a <> construct
(so at least one of them must have been listed twice).
.El
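.Pp
As an illustration of the
.Qq rule cannot be matched
warning at the head of this list, one common remedy is to place the more
specific rule first.
When two rules match the same amount of text, the one listed earlier in
the input is chosen, so in the following reordering of the earlier
fragment both rules can be matched:
.Bd -literal -offset indent
foo       got_foo();
[a-z]+    got_identifier();
.Ed
.Pp
Here
.Qq foo
is handled by the first rule, while longer words such as
.Qq foobar
still go to the identifier rule, because the longest match takes
precedence over rule order.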
.Sh SEE ALSO
.Xr awk 1 ,
.Xr sed 1 ,
.Xr yacc 1
.Rs
.\" 4.4BSD PSD:16
.%A M. E. Lesk
.%T Lex \(em Lexical Analyzer Generator
.%I AT&T Bell Laboratories
.%R Computing Science Technical Report
.%N 39
.%D October 1975
.Re
.Rs
.%A John Levine
.%A Tony Mason
.%A Doug Brown
.%B Lex & Yacc
.%I O'Reilly and Associates
.%N 2nd edition
.Re
.Rs
.%A Alfred Aho
.%A Ravi Sethi
.%A Jeffrey Ullman
.%B Compilers: Principles, Techniques and Tools
.%I Addison-Wesley
.%D 1986
.%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
.Re
.Sh STANDARDS
The
.Nm lex
utility is compliant with the
.St -p1003.1-2008
specification,
though its presence is optional.
.Pp
The flags
.Op Fl 78BbCdFfhIiLloPpSsTVw+? ,
.Op Fl -help ,
and
.Op Fl -version
are extensions to that specification.
.Pp
See also the
.Sx INCOMPATIBILITIES WITH LEX AND POSIX
section, above.
.Sh AUTHORS
Vern Paxson, with the help of many ideas and much inspiration from
Van Jacobson.
Original version by Jef Poskanzer.
The fast table representation is a partial implementation of a design done by
Van Jacobson.
The implementation was done by Kevin Gong and Vern Paxson.
.Pp
Thanks to the many
.Nm
beta-testers, feedbackers, and contributors, especially Francois Pinard,
Casey Leedom,
Robert Abramovitz,
Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
Neal Becker, Nelson H.F. Beebe,
.Mt benson@odi.com ,
Karl Berry, Peter A. Bigot, Simon Blanchard,
Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
Brian Clapper, J.T. Conklin,
Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
Daniels, Chris G. Demetriou, Theo de Raadt,
Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
Jan Hajic, Charles Hemphill, NORO Hideo,
Jarkko Hietaniemi, Scott Hofmann,
Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
Amir Katz,
.Mt ken@ken.hilco.com ,
Kevin B. Kenny,
Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
David Loffredo, Mike Long,
Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
Bengt Martensson, Chris Metcalf,
Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
Richard Ohnemus, Karsten Pahnke,
Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
Frederic Raimbault, Pat Rankin, Rick Richardson,
Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
Andreas Scherer, Darrell Schiebel, Raf Schietekat,
Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Str\(:omquist,
Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
Chris Thewalt, Richard M. Timoney, Jodi Tsai,
Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
and those whose names have slipped my marginal mail-archiving skills
but whose contributions are appreciated all the
same.
.Pp
Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
distribution headaches.
.Pp
Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
to Benson Margulies and Fred Burke for C++ support;
to Kent Williams and Tom Epperly for C++ class support;
to Ove Ewerlid for support of NUL's;
and to Eric Hughes for support of multiple buffers.
.Pp
This work was primarily done when I was with the Real Time Systems Group
at the Lawrence Berkeley Laboratory in Berkeley, CA.
Many thanks to all there for the support I received.
.Pp
Send comments to
.Aq Mt vern@ee.lbl.gov .
.Sh BUGS
Some trailing context patterns cannot be properly matched and generate
warning messages
.Pq "dangerous trailing context" .
These are patterns where the ending of the first part of the rule
matches the beginning of the second part, such as
.Qq zx*/xy* ,
where the
.Sq x*
matches the
.Sq x
at the beginning of the trailing context.
(Note that the POSIX draft states that the text matched by such patterns
is undefined.)
.Pp
For some trailing context rules, parts which are actually fixed-length are
not recognized as such, leading to the above-mentioned performance loss.
In particular, parts using
.Sq |\&
or
.Sq {n}
(such as
.Qq foo{3} )
are always considered variable-length.
.Pp
Combining trailing context with the special
.Sq |\&
action can result in fixed trailing context being turned into
the more expensive variable trailing context.
This happens, for example, in the following:
.Bd -literal -offset indent
%%
abc      |
xyz/def
.Ed
.Pp
Use of
.Fn unput
invalidates
.Fa yytext
and
.Fa yyleng ,
unless the
.Dq %array
directive
or the
.Fl l
option has been used.
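.Pp
One way to live with this limitation, sketched below, is to copy whatever
is still needed out of
.Fa yytext
and
.Fa yyleng
before the first call to
.Fn unput .
The pattern and the variable names
.Fa copy
and
.Fa len
are illustrative assumptions; the action prints a word without its
trailing
.Sq !\&
and pushes a
.Sq .\&
back onto the input to be rescanned:
.Bd -literal -offset indent
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
%}

%%

[a-z]+!     {
        /* Save the token before unput() can clobber it. */
        char *copy = strdup(yytext);
        int len = yyleng;

        if (copy != NULL) {
                unput('.');     /* yytext, yyleng now invalid */
                printf("%.*s\en", len - 1, copy);
                free(copy);
        }
}
\&.|\en        /* skip everything else, including the '.' */
.Ed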
.Pp
Pattern-matching of NUL's is substantially slower than matching other
characters.
.Pp
Dynamic resizing of the input buffer is slow, as it entails rescanning
all the text matched so far by the current
.Pq generally huge
token.
.Pp
Due to both buffering of input and read-ahead,
it is not possible to intermix calls to
.In stdio.h
routines, such as
.Fn getchar ,
with
.Nm
rules and expect it to work.
Call
.Fn input
instead.
.Pp
The total number of table entries listed by the
.Fl v
flag excludes the table entries needed to determine
which rule has been matched.
The number of these excluded entries is equal to the number of DFA states
if the scanner does not use
.Em REJECT ,
and somewhat greater than the number of states if it does.
.Pp
.Em REJECT
cannot be used with the
.Fl f
or
.Fl F
options.
.Pp
The
.Nm
internal algorithms need documentation.