.\" $OpenBSD: flex.1,v 1.47 2025/05/22 07:31:18 bentley Exp $
.\"
.\" Copyright (c) 1990 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Vern Paxson.
.\"
.\" The United States Government has rights in this work pursuant
.\" to contract no. DE-AC03-76SF00098 between the United States
.\" Department of Energy and the University of California.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\"
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
.\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE.
.\"
.Dd $Mdocdate: May 22 2025 $
.Dt FLEX 1
.Os
.Sh NAME
.Nm flex ,
.Nm flex++ ,
.Nm lex
.Nd fast lexical analyzer generator
.Sh SYNOPSIS
.Nm
.Bk -words
.Op Fl 78BbdFfhIiLlnpsTtVvw+?
.Op Fl C Ns Op Cm aeFfmr
.Op Fl Fl help
.Op Fl Fl version
.Op Fl o Ns Ar output
.Op Fl P Ns Ar prefix
.Op Fl S Ns Ar skeleton
.Op Ar
.Ek
.Sh DESCRIPTION
.Nm
is a tool for generating
.Em scanners :
programs which recognize lexical patterns in text.
.Nm
reads the given input files, or its standard input if no file names are given,
for a description of a scanner to generate.
The description is in the form of pairs of regular expressions and C code,
called
.Em rules .
.Nm
generates as output a C source file,
.Pa lex.yy.c ,
which defines a routine
.Fn yylex .
This file is compiled and linked with the
.Fl lfl
library to produce an executable.
When the executable is run, it analyzes its input for occurrences
of the regular expressions.
Whenever it finds one, it executes the corresponding C code.
.Pp
.Nm lex
is a synonym for
.Nm flex .
.Nm flex++
is a synonym for
.Nm
.Fl + .
.Pp
The manual includes both tutorial and reference sections:
.Bl -ohang
.It Sy Some Simple Examples
.It Sy Format of the Input File
.It Sy Patterns
The extended regular expressions used by
.Nm .
.It Sy How the Input is Matched
The rules for determining what has been matched.
.It Sy Actions
How to specify what to do when a pattern is matched.
.It Sy The Generated Scanner
Details regarding the scanner that
.Nm
produces;
how to control the input source.
.It Sy Start Conditions
Introducing context into scanners, and managing
.Qq mini-scanners .
.It Sy Multiple Input Buffers
How to manipulate multiple input sources;
how to scan from strings instead of files.
.It Sy End-of-File Rules
Special rules for matching the end of the input.
.It Sy Miscellaneous Macros
A summary of macros available to the actions.
.It Sy Values Available to the User
A summary of values available to the actions.
.It Sy Interfacing with Yacc
Connecting flex scanners together with
.Xr yacc 1
parsers.
.It Sy Options
.Nm
command-line options, and the
.Dq %option
directive.
.It Sy Performance Considerations
How to make scanners go as fast as possible.
.It Sy Generating C++ Scanners
The
.Pq experimental
facility for generating C++ scanner classes.
.It Sy Incompatibilities with Lex and POSIX
How
.Nm
differs from
.At
.Nm lex
and the
.Tn POSIX
.Nm lex
standard.
.It Sy Files
Files used by
.Nm .
.It Sy Diagnostics
Those error messages produced by
.Nm
.Pq or scanners it generates
whose meanings might not be apparent.
.It Sy See Also
Other documentation, related tools.
.It Sy Authors
Includes contact information.
.It Sy Bugs
Known problems with
.Nm .
.El
.Sh SOME SIMPLE EXAMPLES
First some simple examples to get the flavor of how one uses
.Nm .
The following
.Nm
input specifies a scanner which whenever it encounters the string
.Qq username
will replace it with the user's login name:
.Bd -literal -offset indent
%%
username    printf("%s", getlogin());
.Ed
.Pp
By default, any text not matched by a
.Nm
scanner is copied to the output, so the net effect of this scanner is
to copy its input file to its output with each occurrence of
.Qq username
expanded.
In this input, there is just one rule.
.Qq username
is the
.Em pattern
and the
.Qq printf
is the
.Em action .
The
.Qq %%
marks the beginning of the rules.
.Pp
Here's another simple example:
.Bd -literal -offset indent
%{
int num_lines = 0, num_chars = 0;
%}

%%
\en    ++num_lines; ++num_chars;
\&.     ++num_chars;

%%
int
main(void)
{
	yylex();
	printf("# of lines = %d, # of chars = %d\en",
	    num_lines, num_chars);
}
.Ed
.Pp
This scanner counts the number of characters and the number
of lines in its input
(it produces no output other than the final report on the counts).
The first line declares two globals,
.Qq num_lines
and
.Qq num_chars ,
which are accessible both inside
.Fn yylex
and in the
.Fn main
routine declared after the second
.Qq %% .
There are two rules, one which matches a newline
.Pq \&"\en\&"
and increments both the line count and the character count,
and one which matches any character other than a newline
(indicated by the
.Qq \&.
regular expression).
.Pp
A somewhat more complicated example:
.Bd -literal -offset indent
/* scanner for a toy Pascal-like language */

DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%

{DIGIT}+ {
	printf("An integer: %s\en", yytext);
}

{DIGIT}+"."{DIGIT}* {
	printf("A float: %s\en", yytext);
}

if|then|begin|end|procedure|function {
	printf("A keyword: %s\en", yytext);
}

{ID}    printf("An identifier: %s\en", yytext);

"+"|"-"|"*"|"/"    printf("An operator: %s\en", yytext);

"{"[^}\en]*"}"    /* eat up one-line comments */

[ \et\en]+    /* eat up whitespace */

\&.    printf("Unrecognized character: %s\en", yytext);

%%

int
main(int argc, char *argv[])
{
	++argv; --argc;    /* skip over program name */
	if (argc > 0)
		yyin = fopen(argv[0], "r");
	else
		yyin = stdin;

	yylex();
}
.Ed
.Pp
This is the beginning of a simple scanner for a language like Pascal.
It identifies different types of
.Em tokens
and reports on what it has seen.
.Pp
The details of this example will be explained in the following sections.
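.Pp
To make the two rules of the counting example above concrete, the following
is a hand-written C sketch of the same bookkeeping.
It is an illustration only and not what flex generates; the names
struct counts and count_text are hypothetical and do not appear in flex.

```c
#include <stddef.h>

/* Totals for one pass, mirroring num_lines and num_chars above. */
struct counts {
	int lines;
	int chars;
};

/*
 * Hypothetical helper (not part of flex): apply the two rules of the
 * counting example to the first len bytes of buf.
 */
struct counts
count_text(const char *buf, size_t len)
{
	struct counts c = { 0, 0 };
	size_t i;

	for (i = 0; i < len; i++) {
		if (buf[i] == '\n')
			++c.lines;	/* newline rule: counts a line ... */
		++c.chars;		/* ... and every byte counts as a character */
	}
	return c;
}
```

Each byte is handled by exactly one of the two rules, which is why the
character count is incremented on every iteration while the line count
is incremented only for newlines.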
.Sh FORMAT OF THE INPUT FILE
The
.Nm
input file consists of three sections, separated by a line with just
.Qq %%
in it:
.Bd -unfilled -offset indent
definitions
%%
rules
%%
user code
.Ed
.Pp
The
.Em definitions
section contains declarations of simple
.Em name
definitions to simplify the scanner specification, and declarations of
.Em start conditions ,
which are explained in a later section.
.Pp
Name definitions have the form:
.Pp
.D1 name definition
.Pp
The
.Qq name
is a word beginning with a letter or an underscore
.Pq Sq _
followed by zero or more letters, digits,
.Sq _ ,
or
.Sq -
.Pq dash .
The definition is taken to begin at the first non-whitespace character
following the name and continuing to the end of the line.
The definition can subsequently be referred to using
.Qq {name} ,
which will expand to
.Qq (definition) .
For example:
.Bd -literal -offset indent
DIGIT    [0-9]
ID       [a-z][a-z0-9]*
.Ed
.Pp
This defines
.Qq DIGIT
to be a regular expression which matches a single digit, and
.Qq ID
to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.
A subsequent reference to
.Pp
.Dl {DIGIT}+"."{DIGIT}*
.Pp
is identical to
.Pp
.Dl ([0-9])+"."([0-9])*
.Pp
and matches one-or-more digits followed by a
.Sq .\&
followed by zero-or-more digits.
.Pp
The
.Em rules
section of the
.Nm
input contains a series of rules of the form:
.Pp
.Dl pattern action
.Pp
The pattern must be unindented and the action must begin
on the same line.
.Pp
See below for a further description of patterns and actions.
.Pp
Finally, the user code section is simply copied to
.Pa lex.yy.c
verbatim.
It is used for companion routines which call or are called by the scanner.
The presence of this section is optional;
if it is missing, the second
.Qq %%
in the input file may be skipped too.
.Pp
In the definitions and rules sections, any indented text or text enclosed in
.Sq %{
and
.Sq %}
is copied verbatim to the output
.Pq with the %{}'s removed .
The %{}'s must appear unindented on lines by themselves.
.Pp
In the rules section,
any indented or %{} text appearing before the first rule may be used to
declare variables which are local to the scanning routine and
.Pq after the declarations
code which is to be executed whenever the scanning routine is entered.
Other indented or %{} text in the rule section is still copied to the output,
but its meaning is not well-defined and it may well cause compile-time
errors (this feature is present for
.Tn POSIX
compliance; see below for other such features).
.Pp
In the definitions section
.Pq but not in the rules section ,
an unindented comment
(i.e., a line beginning with
.Qq /* )
is also copied verbatim to the output up to the next
.Qq */ .
.Sh PATTERNS
The patterns in the input are written using an extended set of regular
expressions.
These are:
.Bl -tag -width "XXXXXXXX"
.It x
Match the character
.Sq x .
.It .\&
Any character
.Pq byte
except newline.
.It [xyz]
A
.Qq character class ;
in this case, the pattern matches either an
.Sq x ,
a
.Sq y ,
or a
.Sq z .
.It [abj-oZ]
A
.Qq character class
with a range in it; matches an
.Sq a ,
a
.Sq b ,
any letter from
.Sq j
through
.Sq o ,
or a
.Sq Z .
.It [^A-Z]
A
.Qq negated character class ,
i.e., any character but those in the class.
In this case, any character EXCEPT an uppercase letter.
.It [^A-Z\en]
Any character EXCEPT an uppercase letter or a newline.
.It r*
Zero or more r's, where
.Sq r
is any regular expression.
.It r+
One or more r's.
.It r?
Zero or one r's (that is,
.Qq an optional r ) .
.It r{2,5}
Anywhere from two to five r's.
.It r{2,}
Two or more r's.
.It r{4}
Exactly 4 r's.
.It {name}
The expansion of the
.Qq name
definition
.Pq see above .
.It \&"[xyz]\e\&"foo\&"
The literal string: [xyz]"foo.
.It \eX
If
.Sq X
is an
.Sq a ,
.Sq b ,
.Sq f ,
.Sq n ,
.Sq r ,
.Sq t ,
or
.Sq v ,
then the ANSI-C interpretation of
.Sq \eX .
Otherwise, a literal
.Sq X
(used to escape operators such as
.Sq * ) .
.It \e0
A NUL character
.Pq ASCII code 0 .
.It \e123
The character with octal value 123.
.It \ex2a
The character with hexadecimal value 2a.
.It (r)
Match an
.Sq r ;
parentheses are used to override precedence
.Pq see below .
.It rs
The regular expression
.Sq r
followed by the regular expression
.Sq s ;
called
.Qq concatenation .
.It r|s
Either an
.Sq r
or an
.Sq s .
.It r/s
An
.Sq r ,
but only if it is followed by an
.Sq s .
The text matched by
.Sq s
is included when determining whether this rule is the
.Qq longest match ,
but is then returned to the input before the action is executed.
So the action only sees the text matched by
.Sq r .
This type of pattern is called
.Qq trailing context .
(There are some combinations of r/s that
.Nm
cannot match correctly; see notes in the
.Sx BUGS
section below regarding
.Qq dangerous trailing context . )
.It ^r
An
.Sq r ,
but only at the beginning of a line
(i.e., just starting to scan, or right after a newline has been scanned).
.It r$
An
.Sq r ,
but only at the end of a line
.Pq i.e., just before a newline .
Equivalent to
.Qq r/\en .
.Pp
Note that
.Nm flex Ns 's
notion of
.Qq newline
is exactly whatever the C compiler used to compile
.Nm
interprets
.Sq \en
as.
.\" In particular, on some DOS systems you must either filter out \er's in the
.\" input yourself, or explicitly use r/\er\en for
.\" .Qq r$ .
.It <s>r
An
.Sq r ,
but only in start condition
.Sq s
.Pq see below for discussion of start conditions .
.It <s1,s2,s3>r
The same, but in any of start conditions s1, s2, or s3.
.It <*>r
An
.Sq r
in any start condition, even an exclusive one.
.It <<EOF>>
An end-of-file.
.It <s1,s2><<EOF>>
An end-of-file when in start condition s1 or s2.
.El
.Pp
Note that inside of a character class, all regular expression operators
lose their special meaning except escape
.Pq Sq \e
and the character class operators,
.Sq - ,
.Sq ]\& ,
and, at the beginning of the class,
.Sq ^ .
.Pp
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence.
For example,
.Pp
.D1 foo|bar*
.Pp
is the same as
.Pp
.D1 (foo)|(ba(r*))
.Pp
since the
.Sq *
operator has higher precedence than concatenation,
and concatenation higher than alternation
.Pq Sq |\& .
This pattern therefore matches
.Em either
the string
.Qq foo
.Em or
the string
.Qq ba
followed by zero-or-more r's.
To match
.Qq foo
or zero-or-more "bar"'s,
use:
.Pp
.D1 foo|(bar)*
.Pp
and to match zero-or-more "foo"'s-or-"bar"'s:
.Pp
.D1 (foo|bar)*
.Pp
In addition to characters and ranges of characters, character classes
can also contain character class
.Em expressions .
These are expressions enclosed inside
.Sq [:
and
.Sq :]
delimiters (which themselves must appear between the
.Sq \&[
and
.Sq ]\&
of the
character class; other elements may occur inside the character class, too).
The valid expressions are:
.Bd -unfilled -offset indent
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
.Ed
.Pp
These expressions all designate a set of characters equivalent to
the corresponding standard C
.Fn isXXX
function.
For example, [:alnum:] designates those characters for which
.Xr isalnum 3
returns true \- i.e., any alphabetic or numeric.
Some systems don't provide
.Xr isblank 3 ,
so
.Nm
defines [:blank:] as a blank or a tab.
.Pp
For example, the following character classes are all equivalent:
.Bd -unfilled -offset indent
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:]0-9]
[a-zA-Z0-9]
.Ed
.Pp
If the scanner is case-insensitive (the
.Fl i
flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
.Pp
Some notes on patterns:
.Bl -dash
.It
A negated character class such as the example
.Qq [^A-Z]
above will match a newline unless "\en"
.Pq or an equivalent escape sequence
is one of the characters explicitly present in the negated character class
(e.g.,
.Qq [^A-Z\en] ) .
This is unlike how many other regular expression tools treat negated character
classes, but unfortunately the inconsistency is historically entrenched.
Matching newlines means that a pattern like
.Qq [^"]*
can match the entire input unless there's another quote in the input.
.It
A rule can have at most one instance of trailing context
(the
.Sq /
operator or the
.Sq $
operator).
The start condition,
.Sq ^ ,
and
.Qq <<EOF>>
patterns can only occur at the beginning of a pattern and, as well as with
.Sq /
and
.Sq $ ,
cannot be grouped inside parentheses.
A
.Sq ^
which does not occur at the beginning of a rule or a
.Sq $
which does not occur at the end of a rule loses its special properties
and is treated as a normal character.
.It
The following are illegal:
.Bd -unfilled -offset indent
foo/bar$
<sc1>foo<sc2>bar
.Ed
.Pp
Note that the first of these can be written
.Qq foo/bar\en .
.It
The following will result in
.Sq $
or
.Sq ^
being treated as a normal character:
.Bd -unfilled -offset indent
foo|(bar$)
foo|^bar
.Ed
.Pp
If what's wanted is a
.Qq foo
or a bar-followed-by-a-newline, the following could be used
(the special
.Sq |\&
action is explained below):
.Bd -unfilled -offset indent
foo    |
bar$   /* action goes here */
.Ed
.Pp
A similar trick will work for matching a foo or a
bar-at-the-beginning-of-a-line.
.El
.Sh HOW THE INPUT IS MATCHED
When the generated scanner is run,
it analyzes its input looking for strings which match any of its patterns.
If it finds more than one match,
it takes the one matching the most text
(for trailing context rules, this includes the length of the trailing part,
even though it will then be returned to the input).
If it finds two or more matches of the same length,
the rule listed first in the
.Nm
input file is chosen.
.Pp
Once the match is determined, the text corresponding to the match
(called the
.Em token )
is made available in the global character pointer
.Fa yytext ,
and its length in the global integer
.Fa yyleng .
The
.Em action
corresponding to the matched pattern is then executed
.Pq a more detailed description of actions follows ,
and then the remaining input is scanned for another match.
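.Pp
The selection rule described above (longest match first, earliest rule on
a tie) can be sketched in C as follows.
This is only a model of the rule, not how the generated scanner is
implemented; the function name pick_rule and its array argument are
hypothetical, and treating a length of 0 as "no match" is an assumption
made for the sketch.

```c
#include <stddef.h>

/*
 * Hypothetical helper: match_len[i] is the number of characters that
 * rule i (numbered in input-file order) can match at the current
 * position, with 0 meaning the rule does not match.
 * Returns the index of the winning rule, or -1 if no rule matches and
 * only the default rule applies.
 */
int
pick_rule(const size_t *match_len, int nrules)
{
	int best = -1;
	int i;

	for (i = 0; i < nrules; i++) {
		if (match_len[i] == 0)
			continue;
		/* Strictly longer wins; on a tie the earlier rule is kept. */
		if (best == -1 || match_len[i] > match_len[best])
			best = i;
	}
	return best;
}
```

Applied to the toy Pascal scanner in SOME SIMPLE EXAMPLES, the keyword rule
and the {ID} rule both match two characters of the input "if", so the
keyword rule wins because it is listed first; on the input "iffy" the {ID}
rule matches four characters and wins on length.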
.Pp
If no match is found, then the default rule is executed:
the next character in the input is considered matched and
copied to the standard output.
Thus, the simplest legal
.Nm
input is:
.Pp
.D1 %%
.Pp
which generates a scanner that simply copies its input
.Pq one character at a time
to its output.
.Pp
Note that
.Fa yytext
can be defined in two different ways:
either as a character pointer or as a character array.
Which definition
.Nm
uses can be controlled by including one of the special directives
.Dq %pointer
or
.Dq %array
in the first
.Pq definitions
section of flex input.
The default is
.Dq %pointer ,
unless the
.Fl l
.Nm lex
compatibility option is used, in which case
.Fa yytext
will be an array.
The advantage of using
.Dq %pointer
is substantially faster scanning and no buffer overflow when matching
very large tokens
.Pq unless not enough dynamic memory is available .
The disadvantage is that actions are restricted in how they can modify
.Fa yytext
.Pq see the next section ,
and calls to the
.Fn unput
function destroy the present contents of
.Fa yytext ,
which can be a considerable porting headache when moving between different
.Nm lex
versions.
.Pp
The advantage of
.Dq %array
is that
.Fa yytext
can be modified as much as wanted, and calls to
.Fn unput
do not destroy
.Fa yytext
.Pq see below .
Furthermore, existing
.Nm lex
programs sometimes access
.Fa yytext
externally using declarations of the form:
.Pp
.D1 extern char yytext[];
.Pp
This definition is erroneous when used with
.Dq %pointer ,
but correct for
.Dq %array .
.Pp
.Dq %array
defines
.Fa yytext
to be an array of
.Dv YYLMAX
characters, which defaults to a fairly large value.
The size can be changed by simply #define'ing
.Dv YYLMAX
to a different value in the first section of
.Nm
input.
As mentioned above, with
.Dq %pointer
yytext grows dynamically to accommodate large tokens.
While this means a
.Dq %pointer
scanner can accommodate very large tokens
.Pq such as matching entire blocks of comments ,
bear in mind that each time the scanner must resize
.Fa yytext
it also must rescan the entire token from the beginning, so matching such
tokens can prove slow.
.Fa yytext
presently does not dynamically grow if a call to
.Fn unput
results in too much text being pushed back; instead, a run-time error results.
.Pp
Also note that
.Dq %array
cannot be used with C++ scanner classes
.Pq the c++ option; see below .
.Sh ACTIONS
Each pattern in a rule has a corresponding action,
which can be any arbitrary C statement.
The pattern ends at the first non-escaped whitespace character;
the remainder of the line is its action.
If the action is empty,
then when the pattern is matched the input token is simply discarded.
For example, here is the specification for a program
which deletes all occurrences of
.Qq zap me
from its input:
.Bd -literal -offset indent
%%
"zap me"
.Ed
.Pp
(It will copy all other characters in the input to the output since
they will be matched by the default rule.)
.Pp
Here is a program which compresses multiple blanks and tabs down to
a single blank, and throws away whitespace found at the end of a line:
.Bd -literal -offset indent
%%
[ \et]+     putchar(' ');
[ \et]+$    /* ignore this token */
.Ed
.Pp
If the action contains a
.Sq { ,
then the action spans till the balancing
.Sq }
is found, and the action may cross multiple lines.
.Nm
knows about C strings and comments and won't be fooled by braces found
within them, but also allows actions to begin with
.Sq %{
and will consider the action to be all the text up to the next
.Sq %}
.Pq regardless of ordinary braces inside the action .
.Pp
An action consisting solely of a vertical bar
.Pq Sq |\&
means
.Qq same as the action for the next rule .
See below for an illustration.
.Pp
Actions can include arbitrary C code,
including return statements to return a value to whatever routine called
.Fn yylex .
Each time
.Fn yylex
is called, it continues processing tokens from where it last left off
until it either reaches the end of the file or executes a return.
.Pp
Actions are free to modify
.Fa yytext
except for lengthening it
(adding characters to its end \- these will overwrite later characters in the
input stream).
This, however, does not apply when using
.Dq %array
.Pq see above ;
in that case,
.Fa yytext
may be freely modified in any way.
.Pp
Actions are free to modify
.Fa yyleng
except they should not do so if the action also includes use of
.Fn yymore
.Pq see below .
.Pp
There are a number of special directives which can be included within
an action:
.Bl -tag -width Ds
.It ECHO
Copies
.Fa yytext
to the scanner's output.
.It BEGIN
Followed by the name of a start condition, places the scanner in the
corresponding start condition
.Pq see below .
.It REJECT
Directs the scanner to proceed on to the
.Qq second best
rule which matched the input
.Pq or a prefix of the input .
The rule is chosen as described above in
.Sx HOW THE INPUT IS MATCHED ,
and
.Fa yytext
and
.Fa yyleng
set up appropriately.
It may either be one which matched as much text
as the originally chosen rule but came later in the
.Nm
input file, or one which matched less text.
For example, the following will both count the
words in the input and call the routine
.Fn special
whenever
.Qq frob
is seen:
.Bd -literal -offset indent
	int word_count = 0;
%%

frob         special(); REJECT;
[^ \et\en]+    ++word_count;
.Ed
.Pp
Without the
.Em REJECT ,
any "frob"'s in the input would not be counted as words,
since the scanner normally executes only one action per token.
Multiple
.Em REJECT Ns 's
are allowed,
each one finding the next best choice to the currently active rule.
For example, when the following scanner scans the token
.Qq abcd ,
it will write
.Qq abcdabcaba
to the output:
.Bd -literal -offset indent
%%
a        |
ab       |
abc      |
abcd     ECHO; REJECT;
\&.|\en     /* eat up any unmatched character */
.Ed
.Pp
(The first three rules share the fourth's action since they use
the special
.Sq |\&
action.)
.Em REJECT
is a particularly expensive feature in terms of scanner performance;
if it is used in any of the scanner's actions it will slow down
all of the scanner's matching.
Furthermore,
.Em REJECT
cannot be used with the
.Fl Cf
or
.Fl CF
options
.Pq see below .
.Pp
Note also that unlike the other special actions,
.Em REJECT
is a
.Em branch ;
code immediately following it in the action will not be executed.
.It yymore()
Tells the scanner that the next time it matches a rule, the corresponding
token should be appended onto the current value of
.Fa yytext
rather than replacing it.
For example, given the input
.Qq mega-kludge
the following will write
.Qq mega-mega-kludge
to the output:
.Bd -literal -offset indent
%%
mega-    ECHO; yymore();
kludge   ECHO;
.Ed
.Pp
First
.Qq mega-
is matched and echoed to the output.
Then
.Qq kludge
is matched, but the previous
.Qq mega-
is still hanging around at the beginning of
.Fa yytext
so the
.Em ECHO
for the
.Qq kludge
rule will actually write
.Qq mega-kludge .
.Pp
Two notes regarding use of
.Fn yymore :
First,
.Fn yymore
depends on the value of
.Fa yyleng
correctly reflecting the size of the current token, so
.Fa yyleng
must not be modified when using
.Fn yymore .
Second, the presence of
.Fn yymore
in the scanner's action entails a minor performance penalty in the
scanner's matching speed.
.It yyless(n)
Returns all but the first
.Ar n
characters of the current token back to the input stream, where they
will be rescanned when the scanner looks for the next match.
.Fa yytext
and
.Fa yyleng
are adjusted appropriately (e.g.,
.Fa yyleng
will now be equal to
.Ar n ) .
For example, on the input
.Qq foobar
the following will write out
.Qq foobarbar :
.Bd -literal -offset indent
%%
foobar    ECHO; yyless(3);
[a-z]+    ECHO;
.Ed
.Pp
An argument of 0 to
.Fa yyless
will cause the entire current input string to be scanned again.
Unless how the scanner will subsequently process its input has been changed
(using
.Em BEGIN ,
for example),
this will result in an endless loop.
.Pp
Note that
.Fa yyless
is a macro and can only be used in the
.Nm
input file, not from other source files.
.It unput(c)
Puts the character
.Ar c
back into the input stream.
It will be the next character scanned.
The following action will take the current token and cause it
to be rescanned enclosed in parentheses.
.Bd -literal -offset indent
{
	int i;
	char *yycopy;

	/* Copy yytext because unput() trashes yytext */
	if ((yycopy = strdup(yytext)) == NULL)
		err(1, NULL);
	unput(')');
	for (i = yyleng - 1; i >= 0; --i)
		unput(yycopy[i]);
	unput('(');
	free(yycopy);
}
.Ed
.Pp
Note that since each
.Fn unput
puts the given character back at the beginning of the input stream,
pushing back strings must be done back-to-front.
.Pp
An important potential problem when using
.Fn unput
is that if using
.Dq %pointer
.Pq the default ,
a call to
.Fn unput
destroys the contents of
.Fa yytext ,
starting with its rightmost character and devouring one character to
the left with each call.
If the value of
.Fa yytext
should be preserved after a call to
.Fn unput
.Pq as in the above example ,
it must either first be copied elsewhere, or the scanner must be built using
.Dq %array
instead (see
.Sx HOW THE INPUT IS MATCHED ) .
.Pp
Finally, note that EOF cannot be put back
to attempt to mark the input stream with an end-of-file.
.It input()
Reads the next character from the input stream.
For example, the following is one way to eat up C comments:
.Bd -literal -offset indent
%%
"/*" {
	int c;

	for (;;) {
		while ((c = input()) != '*' && c != EOF)
			; /* eat up text of comment */

		if (c == '*') {
			while ((c = input()) == '*')
				;
			if (c == '/')
				break; /* found the end */
		}

		if (c == EOF) {
			errx(1, "EOF in comment");
			break;
		}
	}
}
.Ed
.Pp
(Note that if the scanner is compiled using C++, then
.Fn input
is instead referred to as
.Fn yyinput ,
in order to avoid a name clash with the C++ stream by the name of input.)
.It YY_FLUSH_BUFFER
Flushes the scanner's internal buffer
so that the next time the scanner attempts to match a token,
it will first refill the buffer using
.Dv YY_INPUT
(see
.Sx THE GENERATED SCANNER ,
below).
This action is a special case of the more general
.Fn yy_flush_buffer
function, described below in the section
.Sx MULTIPLE INPUT BUFFERS .
.It yyterminate()
Can be used in lieu of a return statement in an action.
It terminates the scanner and returns a 0 to the scanner's caller, indicating
.Qq all done .
By default,
.Fn yyterminate
is also called when an end-of-file is encountered.
It is a macro and may be redefined.
.El
.Sh THE GENERATED SCANNER
The output of
.Nm
is the file
.Pa lex.yy.c ,
which contains the scanning routine
.Fn yylex ,
a number of tables used by it for matching tokens,
and a number of auxiliary routines and macros.
By default,
.Fn yylex
is declared as follows:
.Bd -unfilled -offset indent
int yylex()
{
    ... various definitions and the actions in here ...
}
.Ed
.Pp
(If the environment supports function prototypes, then it will
be "int yylex(void)".)
This definition may be changed by defining the
.Dv YY_DECL
macro.
For example:
.Bd -literal -offset indent
#define YY_DECL float lexscan(a, b) float a, b;
.Ed
.Pp
would give the scanning routine the name
.Em lexscan ,
returning a float, and taking two floats as arguments.
Note that if arguments are given to the scanning routine using a
K&R-style/non-prototyped function declaration,
the definition must be terminated with a semi-colon
.Pq Sq ;\& .
.Pp
Whenever
.Fn yylex
is called, it scans tokens from the global input file
.Pa yyin
.Pq which defaults to stdin .
It continues until it either reaches an end-of-file
.Pq at which point it returns the value 0
or one of its actions executes a
.Em return
statement.
.Pp
If the scanner reaches an end-of-file, subsequent calls are undefined
unless either
.Em yyin
is pointed at a new input file
.Pq in which case scanning continues from that file ,
or
.Fn yyrestart
is called.
.Fn yyrestart
takes one argument, a
.Fa FILE *
pointer (which can be nil, if
.Dv YY_INPUT
has been set up to scan from a source other than
.Em yyin ) ,
and initializes
.Em yyin
for scanning from that file.
Essentially there is no difference between just assigning
.Em yyin
to a new input file or using
.Fn yyrestart
to do so; the latter is available for compatibility with previous versions of
.Nm ,
and because it can be used to switch input files in the middle of scanning.
It can also be used to throw away the current input buffer,
by calling it with an argument of
.Em yyin ;
but better is to use
.Dv YY_FLUSH_BUFFER
.Pq see above .
Note that
.Fn yyrestart
does not reset the start condition to
.Em INITIAL
(see
.Sx START CONDITIONS ,
below).
.Pp
If
.Fn yylex
stops scanning due to executing a
.Em return
statement in one of the actions, the scanner may then be called again and it
will resume scanning where it left off.
.Pp
By default
.Pq and for purposes of efficiency ,
the scanner uses block-reads rather than simple
.Xr getc 3
calls to read characters from
.Em yyin .
The nature of how it gets its input can be controlled by defining the
.Dv YY_INPUT
macro.
.Dv YY_INPUT Ns 's
calling sequence is
.Qq YY_INPUT(buf,result,max_size) .
Its action is to place up to
.Dv max_size
characters in the character array
.Em buf
and return in the integer variable
.Em result
either the number of characters read or the constant
.Dv YY_NULL
(0 on
.Ux
systems)
to indicate
.Dv EOF .
The default
.Dv YY_INPUT
reads from the global file-pointer
.Qq yyin .
.Pp
A sample definition of
.Dv YY_INPUT
.Pq in the definitions section of the input file :
.Bd -unfilled -offset indent
%{
#define YY_INPUT(buf,result,max_size) \e
{ \e
	int c = getchar(); \e
	result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
}
%}
.Ed
.Pp
This definition will change the input processing to occur
one character at a time.
.Pp
When the scanner receives an end-of-file indication from
.Dv YY_INPUT ,
it then checks the
.Fn yywrap
function.
If
.Fn yywrap
returns false
.Pq zero ,
then it is assumed that the function has gone ahead and set up
.Em yyin
to point to another input file, and scanning continues.
If it returns true
.Pq non-zero ,
then the scanner terminates, returning 0 to its caller.
Note that in either case, the start condition remains unchanged;
it does not revert to
.Em INITIAL .
.Pp
If you do not supply your own version of
.Fn yywrap ,
then you must either use
.Dq %option noyywrap
(in which case the scanner behaves as though
.Fn yywrap
returned 1), or you must link with
.Fl lfl
to obtain the default version of the routine, which always returns 1.
.Pp
Three routines are available for scanning from in-memory buffers rather
than files:
.Fn yy_scan_string ,
.Fn yy_scan_bytes ,
and
.Fn yy_scan_buffer .
See the discussion of them below in the section
.Sx MULTIPLE INPUT BUFFERS .
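.Pp
As a minimal sketch of the
.Fn yywrap
protocol described above, the following scanner chains together the
files named on its command line.
The
.Va filelist
variable and the
.Fn main
routine are hypothetical names introduced for this illustration;
they are not part of
.Nm .
.Bd -literal -offset indent
%{
#include <err.h>
#include <stdio.h>

/* Hypothetical: NULL-terminated list of files still to be scanned */
static char **filelist;
%}

%%
[^\en]+	ECHO;
\en	ECHO;
%%

int
yywrap(void)
{
	if (*filelist == NULL)
		return 1;	/* no more input; yylex() returns 0 */
	if (yyin != stdin)
		fclose(yyin);
	if ((yyin = fopen(*filelist, "r")) == NULL)
		err(1, "%s", *filelist);
	filelist++;
	return 0;		/* yyin now points at the next file */
}

int
main(int argc, char *argv[])
{
	filelist = argv + 1;	/* argv is NULL-terminated */
	if (*filelist != NULL) {
		if ((yyin = fopen(*filelist, "r")) == NULL)
			err(1, "%s", *filelist);
		filelist++;
	}
	while (yylex() != 0)
		;
	return 0;
}
.Ed
.Pp
Because this sketch defines its own
.Fn yywrap
and
.Fn main ,
it should not need to be linked with
.Fl lfl .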
.Pp
The scanner writes its
.Em ECHO
output to the
.Em yyout
global
.Pq default, stdout ,
which may be redefined by the user simply by assigning it to some other
.Va FILE
pointer.
.Sh START CONDITIONS
.Nm
provides a mechanism for conditionally activating rules.
Any rule whose pattern is prefixed with
.Qq <sc>
will only be active when the scanner is in the start condition named
.Qq sc .
For example,
.Bd -literal -offset indent
<STRING>[^"]* { /* eat up the string body ... */
	...
}
.Ed
.Pp
will be active only when the scanner is in the
.Qq STRING
start condition, and
.Bd -literal -offset indent
<INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
	...
}
.Ed
.Pp
will be active only when the current start condition is either
.Qq INITIAL ,
.Qq STRING ,
or
.Qq QUOTE .
.Pp
Start conditions are declared in the definitions
.Pq first
section of the input using unindented lines beginning with either
.Sq %s
or
.Sq %x
followed by a list of names.
The former declares
.Em inclusive
start conditions, the latter
.Em exclusive
start conditions.
A start condition is activated using the
.Em BEGIN
action.
Until the next
.Em BEGIN
action is executed, rules with the given start condition will be active and
rules with other start conditions will be inactive.
If the start condition is inclusive,
then rules with no start conditions at all will also be active.
If it is exclusive,
then only rules qualified with the start condition will be active.
A set of rules contingent on the same exclusive start condition
describes a scanner which is independent of any of the other rules in the
.Nm
input.
Because of this, exclusive start conditions make it easy to specify
.Qq mini-scanners
which scan portions of the input that are syntactically different
from the rest
.Pq e.g., comments .
.Pp
If the distinction between inclusive and exclusive start conditions
is still a little vague, here's a simple example illustrating the
connection between the two.
The set of rules:
.Bd -literal -offset indent
%s example
%%

<example>foo   do_something();

bar            something_else();
.Ed
.Pp
is equivalent to
.Bd -literal -offset indent
%x example
%%

<example>foo   do_something();

<INITIAL,example>bar    something_else();
.Ed
.Pp
Without the <INITIAL,example> qualifier, the
.Dq bar
pattern in the second example wouldn't be active
.Pq i.e., couldn't match
when in start condition
.Dq example .
If we just used <example> to qualify
.Dq bar ,
though, then it would only be active in
.Dq example
and not in
.Em INITIAL ,
while in the first example it's active in both,
because in the first example the
.Dq example
start condition is an inclusive
.Pq Sq %s
start condition.
.Pp
Also note that the special start-condition specifier
.Sq <*>
matches every start condition.
Thus, the above example could also have been written:
.Bd -literal -offset indent
%x example
%%

<example>foo   do_something();

<*>bar         something_else();
.Ed
.Pp
The default rule (to
.Em ECHO
any unmatched character) remains active in start conditions.
It is equivalent to:
.Bd -literal -offset indent
<*>.|\en    ECHO;
.Ed
.Pp
.Dq BEGIN(0)
returns to the original state where only the rules with
no start conditions are active.
This state can also be referred to as the start-condition
.Em INITIAL ,
so
.Dq BEGIN(INITIAL)
is equivalent to
.Dq BEGIN(0) .
(The parentheses around the start condition name are not required but
are considered good style.)
.Pp
.Em BEGIN
actions can also be given as indented code at the beginning
of the rules section.
For example, the following will cause the scanner to enter the
.Qq SPECIAL
start condition whenever
.Fn yylex
is called and the global variable
.Fa enter_special
is true:
.Bd -literal -offset indent
	int enter_special;

%x SPECIAL
%%
	if (enter_special)
		BEGIN(SPECIAL);

<SPECIAL>blahblahblah
\&...more rules follow...
.Ed
.Pp
To illustrate the uses of start conditions,
here is a scanner which provides two different interpretations
of a string like
.Qq 123.456 .
By default it will treat it as three tokens: the integer
.Qq 123 ,
a dot
.Pq Sq .\& ,
and the integer
.Qq 456 .
But if the string is preceded earlier in the line by the string
.Qq expect-floats
it will treat it as a single token, the floating-point number 123.456:
.Bd -literal -offset indent
%{
#include <math.h>
%}
%s expect

%%
expect-floats        BEGIN(expect);

<expect>[0-9]+"."[0-9]+ {
	printf("found a float, = %s\en", yytext);
}
<expect>\en {
	/*
	 * That's the end of the line, so
	 * we need another "expect-floats"
	 * before we'll recognize any more
	 * numbers.
	 */
	BEGIN(INITIAL);
}

[0-9]+ {
	printf("found an integer, = %s\en", yytext);
}

"."     printf("found a dot\en");
.Ed
.Pp
Here is a scanner which recognizes
.Pq and discards
C comments while maintaining a count of the current input line:
.Bd -literal -offset indent
%x comment
%%
	int line_num = 1;

"/*"                    BEGIN(comment);

<comment>[^*\en]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
<comment>\en             ++line_num;
<comment>"*"+"/"        BEGIN(INITIAL);
.Ed
.Pp
This scanner goes to a bit of trouble to match as much
text as possible with each rule.
In general, when attempting to write a high-speed scanner
try to match as much as possible in each rule, as it's a big win.
.Pp
Note that start-condition names are really integer values and
can be stored as such.
Thus, the above could be extended in the following fashion:
.Bd -literal -offset indent
%x comment foo
%%
	int line_num = 1;
	int comment_caller;

"/*" {
	comment_caller = INITIAL;
	BEGIN(comment);
}

\&...

<foo>"/*" {
	comment_caller = foo;
	BEGIN(comment);
}

<comment>[^*\en]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
<comment>\en             ++line_num;
<comment>"*"+"/"        BEGIN(comment_caller);
.Ed
.Pp
Furthermore, the current start condition can be accessed by using
the integer-valued
.Dv YY_START
macro.
For example, the above assignments to
.Em comment_caller
could instead be written
.Pp
.Dl comment_caller = YY_START;
.Pp
Flex provides
.Dv YYSTATE
as an alias for
.Dv YY_START
(since that is what's used by
.At
.Nm lex ) .
.Pp
Note that start conditions do not have their own name-space;
%s's and %x's declare names in the same fashion as #define's.
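.Pp
As a sketch of the
.Dv YY_START
idiom mentioned above, the two
.Qq /*
rules of the extended comment scanner can be folded into a single rule
that records whichever start condition was active.
The start condition
.Em foo
is the same placeholder name used in the earlier example.
.Bd -literal -offset indent
%x comment foo
%%
	int line_num = 1;
	int comment_caller;

<INITIAL,foo>"/*" {
	/* remember which start condition the comment began in */
	comment_caller = YY_START;
	BEGIN(comment);
}

<comment>[^*\en]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
<comment>\en             ++line_num;
<comment>"*"+"/"        BEGIN(comment_caller);
.Ed
.Pp
Since start-condition names are plain integers, the saved value can be
handed straight back to
.Em BEGIN .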
.Pp
Finally, here's an example of how to match C-style quoted strings using
exclusive start conditions, including expanded escape sequences
(but not including checking for a string that's too long):
.Bd -literal -offset indent
%x str

%%
	#define MAX_STR_CONST 1024
	char string_buf[MAX_STR_CONST];
	char *string_buf_ptr;

\e"      string_buf_ptr = string_buf; BEGIN(str);

<str>\e" { /* saw closing quote - all done */
	BEGIN(INITIAL);
	*string_buf_ptr = '\e0';
	/*
	 * return string constant token type and
	 * value to parser
	 */
}

<str>\en {
	/* error - unterminated string constant */
	/* generate error message */
}

<str>\e\e[0-7]{1,3} {
	/* octal escape sequence */
	unsigned int result;	/* %o stores into an unsigned int */

	(void) sscanf(yytext + 1, "%o", &result);

	if (result > 0xff) {
		/* error, constant is out-of-bounds */
	} else
		*string_buf_ptr++ = result;
}

<str>\e\e[0-9]+ {
	/*
	 * generate error - bad escape sequence; something
	 * like '\e48' or '\e0777777'
	 */
}

<str>\e\en  *string_buf_ptr++ = '\en';
<str>\e\et  *string_buf_ptr++ = '\et';
<str>\e\er  *string_buf_ptr++ = '\er';
<str>\e\eb  *string_buf_ptr++ = '\eb';
<str>\e\ef  *string_buf_ptr++ = '\ef';

<str>\e\e(.|\en)  *string_buf_ptr++ = yytext[1];

<str>[^\e\e\en\e"]+ {
	char *yptr = yytext;

	while (*yptr)
		*string_buf_ptr++ = *yptr++;
}
.Ed
.Pp
Often, such as in some of the examples above,
a whole bunch of rules are all preceded by the same start condition(s).
.Nm
makes this a little easier and cleaner by introducing a notion of
start condition
.Em scope .
A start condition scope is begun with:
.Pp
.Dl <SCs>{
.Pp
where
.Dq SCs
is a list of one or more start conditions.
Inside the start condition scope, every rule automatically has the prefix <SCs>
applied to it, until a
.Sq }
which matches the initial
.Sq { .
So, for example,
.Bd -literal -offset indent
<ESC>{
	"\e\en"	return '\en';
	"\e\er"	return '\er';
	"\e\ef"	return '\ef';
	"\e\e0"	return '\e0';
}
.Ed
.Pp
is equivalent to:
.Bd -literal -offset indent
<ESC>"\e\en"  return '\en';
<ESC>"\e\er"  return '\er';
<ESC>"\e\ef"  return '\ef';
<ESC>"\e\e0"  return '\e0';
.Ed
.Pp
Start condition scopes may be nested.
.Pp
Three routines are available for manipulating stacks of start conditions:
.Bl -tag -width Ds
.It void yy_push_state(int new_state)
Pushes the current start condition onto the top of the start condition
stack and switches to
.Fa new_state
as though
.Dq BEGIN new_state
had been used
.Pq recall that start condition names are also integers .
.It void yy_pop_state()
Pops the top of the stack and switches to it via
.Em BEGIN .
.It int yy_top_state()
Returns the top of the stack without altering the stack's contents.
.El
.Pp
The start condition stack grows dynamically and so has no built-in
size limitation.
If memory is exhausted, program execution aborts.
.Pp
To use start condition stacks, scanners must include a
.Dq %option stack
directive (see
.Sx OPTIONS
below).
.Sh MULTIPLE INPUT BUFFERS
Some scanners
(such as those which support
.Qq include
files)
require reading from several input streams.
As
.Nm
scanners do a large amount of buffering, one cannot control
where the next input will be read from by simply writing a
.Dv YY_INPUT
which is sensitive to the scanning context.
.Dv YY_INPUT
is only called when the scanner reaches the end of its buffer, which
may be a long time after scanning a statement such as an
.Qq include
which requires switching the input source.
.Pp
To negotiate these sorts of problems,
.Nm
provides a mechanism for creating and switching between multiple
input buffers.
An input buffer is created by using:
.Pp
.D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
.Pp
which takes a
.Fa FILE
pointer and a
.Fa size
and creates a buffer associated with the given file and large enough to hold
.Fa size
characters (when in doubt, use
.Dv YY_BUF_SIZE
for the size).
It returns a
.Dv YY_BUFFER_STATE
handle, which may then be passed to other routines
.Pq see below .
The
.Dv YY_BUFFER_STATE
type is a pointer to an opaque
.Dq struct yy_buffer_state
structure, so
.Dv YY_BUFFER_STATE
variables may be safely initialized to
.Dq ((YY_BUFFER_STATE) 0)
if desired, and the opaque structure can also be referred to in order to
correctly declare input buffers in source files other than that of scanners.
Note that the
.Fa FILE
pointer in the call to
.Fn yy_create_buffer
is only used as the value of
.Fa yyin
seen by
.Dv YY_INPUT ;
if
.Dv YY_INPUT
is redefined so that it no longer uses
.Fa yyin ,
then a nil
.Fa FILE
pointer can safely be passed to
.Fn yy_create_buffer .
To select a particular buffer to scan:
.Pp
.D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
.Pp
It switches the scanner's input buffer so subsequent tokens will
come from
.Fa new_buffer .
Note that
.Fn yy_switch_to_buffer
may be used by
.Fn yywrap
to set things up for continued scanning,
instead of opening a new file and pointing
.Fa yyin
at it.
Note also that switching input sources via either
.Fn yy_switch_to_buffer
or
.Fn yywrap
does not change the start condition.
.Pp
.D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
.Pp
is used to reclaim the storage associated with a buffer.
.Pf ( Fa buffer
can be nil, in which case the routine does nothing.)
To clear the current contents of a buffer:
.Pp
.D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
.Pp
This function discards the buffer's contents,
so the next time the scanner attempts to match a token from the buffer,
it will first fill the buffer anew using
.Dv YY_INPUT .
.Pp
.Fn yy_new_buffer
is an alias for
.Fn yy_create_buffer ,
provided for compatibility with the C++ use of
.Em new
and
.Em delete
for creating and destroying dynamic objects.
.Pp
Finally, the
.Dv YY_CURRENT_BUFFER
macro returns a
.Dv YY_BUFFER_STATE
handle to the current buffer.
.Pp
Here is an example of using these features for writing a scanner
which expands include files (the <<EOF>> feature is discussed below):
.Bd -literal -offset indent
/*
 * the "incl" state is used for picking up the name
 * of an include file
 */
%x incl

%{
#define MAX_INCLUDE_DEPTH 10
YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
int include_stack_ptr = 0;
%}

%%
include             BEGIN(incl);

[a-z]+              ECHO;
[^a-z\en]*\en?        ECHO;

<incl>[ \et]*        /* eat the whitespace */
<incl>[^ \et\en]+ {   /* got the include file name */
	if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
		errx(1, "Includes nested too deeply");

	include_stack[include_stack_ptr++] =
	    YY_CURRENT_BUFFER;

	yyin = fopen(yytext, "r");

	if (yyin == NULL)
		err(1, NULL);

	yy_switch_to_buffer(
	    yy_create_buffer(yyin, YY_BUF_SIZE));

	BEGIN(INITIAL);
}

<<EOF>> {
	if (--include_stack_ptr < 0)
		yyterminate();
	else {
		yy_delete_buffer(YY_CURRENT_BUFFER);
		yy_switch_to_buffer(
		    include_stack[include_stack_ptr]);
	}
}
.Ed
.Pp
Three routines are available for setting up input buffers for
scanning in-memory strings instead of files.
All of them create a new input buffer for scanning the string,
and return a corresponding
.Dv YY_BUFFER_STATE
handle (which should be deleted afterwards using
.Fn yy_delete_buffer ) .
They also switch to the new buffer using
.Fn yy_switch_to_buffer ,
so the next call to
.Fn yylex
will start scanning the string.
.Bl -tag -width Ds
.It yy_scan_string(const char *str)
Scans a NUL-terminated string.
.It yy_scan_bytes(const char *bytes, int len)
Scans
.Fa len
bytes
.Pq including possibly NUL's
starting at location
.Fa bytes .
.El
.Pp
Note that both of these functions create and scan a copy
of the string or bytes.
(This may be desirable, since
.Fn yylex
modifies the contents of the buffer it is scanning.)
The copy can be avoided by using:
.Bl -tag -width Ds
.It yy_scan_buffer(char *base, yy_size_t size)
Which scans the buffer starting at
.Fa base ,
consisting of
.Fa size
bytes, the last two bytes of which must be
.Dv YY_END_OF_BUFFER_CHAR
.Pq ASCII NUL .
These last two bytes are not scanned; thus, scanning consists of
base[0] through base[size-2], inclusive.
.Pp
If
.Fa base
is not set up in this manner
(i.e., the final two
.Dv YY_END_OF_BUFFER_CHAR
bytes are missing), then
.Fn yy_scan_buffer
returns a nil pointer instead of creating a new input buffer.
.Pp
The type
.Fa yy_size_t
is an integral type to which an integer expression
reflecting the size of the buffer can be cast.
.El
.Sh END-OF-FILE RULES
The special rule
.Qq <<EOF>>
indicates actions which are to be taken when an end-of-file is encountered and
.Fn yywrap
returns non-zero
.Pq i.e., indicates no further files to process .
The action must finish by doing one of four things:
.Bl -dash
.It
Assigning
.Em yyin
to a new input file
(in previous versions of
.Nm ,
after doing the assignment, it was necessary to call the special action
.Dv YY_NEW_FILE ;
this is no longer necessary).
.It
Executing a
.Em return
statement.
.It
Executing the special
.Fn yyterminate
action.
.It
Switching to a new buffer using
.Fn yy_switch_to_buffer
as shown in the example above.
.El
.Pp
<<EOF>> rules may not be used with other patterns;
they may only be qualified with a list of start conditions.
If an unqualified <<EOF>> rule is given, it applies to all start conditions
which do not already have <<EOF>> actions.
To specify an <<EOF>> rule for only the initial start condition, use
.Pp
.Dl <INITIAL><<EOF>>
.Pp
These rules are useful for catching things like unclosed comments.
An example:
.Bd -literal -offset indent
%x quote
%%

\&...other rules for dealing with quotes...

<quote><<EOF>> {
	error("unterminated quote");
	yyterminate();
}
<<EOF>> {
	if (*++filelist)
		yyin = fopen(*filelist, "r");
	else
		yyterminate();
}
.Ed
.Sh MISCELLANEOUS MACROS
The macro
.Dv YY_USER_ACTION
can be defined to provide an action
which is always executed prior to the matched rule's action.
For example,
it could be #define'd to call a routine to convert yytext to lower-case.
When
.Dv YY_USER_ACTION
is invoked, the variable
.Fa yy_act
gives the number of the matched rule
.Pq rules are numbered starting with 1 .
For example, to profile how often each rule is matched,
the following would do the trick:
.Pp
.Dl #define YY_USER_ACTION ++ctr[yy_act]
.Pp
where
.Fa ctr
is an array to hold the counts for the different rules.
Note that the macro
.Dv YY_NUM_RULES
gives the total number of rules
(including the default rule, even if
.Fl s
is used),
so a correct declaration for
.Fa ctr
is:
.Pp
.Dl int ctr[YY_NUM_RULES];
.Pp
The macro
.Dv YY_USER_INIT
may be defined to provide an action which is always executed before
the first scan
.Pq and before the scanner's internal initializations are done .
For example, it could be used to call a routine to read
in a data table or open a logging file.
.Pp
The macro
.Dv yy_set_interactive(is_interactive)
can be used to control whether the current buffer is considered
.Em interactive .
An interactive buffer is processed more slowly,
but must be used when the scanner's input source is indeed
interactive to avoid problems due to waiting to fill buffers
(see the discussion of the
.Fl I
flag below).
A non-zero value in the macro invocation marks the buffer as interactive,
a zero value as non-interactive.
Note that use of this macro overrides
.Dq %option always-interactive
or
.Dq %option never-interactive
(see
.Sx OPTIONS
below).
.Fn yy_set_interactive
must be invoked prior to beginning to scan the buffer that is
.Pq or is not
to be considered interactive.
.Pp
The macro
.Dv yy_set_bol(at_bol)
can be used to control whether the current buffer's scanning
context for the next token match is done as though at the
beginning of a line.
A non-zero macro argument makes rules anchored with
.Sq ^
active, while a zero argument makes
.Sq ^
rules inactive.
.Pp
The macro
.Dv YY_AT_BOL
returns true if the next token scanned from the current buffer will have
.Sq ^
rules active, false otherwise.
.Pp
In the generated scanner, the actions are all gathered in one large
switch statement and separated using
.Dv YY_BREAK ,
which may be redefined.
By default, it is simply a
.Qq break ,
to separate each rule's action from the following rules.
Redefining
.Dv YY_BREAK
allows, for example, C++ users to
.Dq #define YY_BREAK
to do nothing
(while being very careful that every rule ends with a
.Qq break
or a
.Qq return ! )
to avoid unreachable statement warnings, which arise when a rule's
action ends with
.Dq return
and the
.Dv YY_BREAK
that follows it can therefore never be reached.
.Sh VALUES AVAILABLE TO THE USER
This section summarizes the various values available to the user
in the rule actions.
.Bl -tag -width Ds
.It char *yytext
Holds the text of the current token.
It may be modified but not lengthened
.Pq characters cannot be appended to the end .
.Pp
If the special directive
.Dq %array
appears in the first section of the scanner description, then
.Fa yytext
is instead declared
.Dq char yytext[YYLMAX] ,
where
.Dv YYLMAX
is a macro definition that can be redefined in the first section
to change the default value
.Pq generally 8KB .
Using
.Dq %array
results in somewhat slower scanners, but the value of
.Fa yytext
becomes immune to calls to
.Fn input
and
.Fn unput ,
which potentially destroy its value when
.Fa yytext
is a character pointer.
The opposite of
.Dq %array
is
.Dq %pointer ,
which is the default.
.Pp
.Dq %array
cannot be used when generating C++ scanner classes
(the
.Fl +
flag).
.It int yyleng
Holds the length of the current token.
.It FILE *yyin
Is the file which by default
.Nm
reads from.
It may be redefined, but doing so only makes sense before
scanning begins or after an
.Dv EOF
has been encountered.
Changing it in the midst of scanning will have unexpected results since
.Nm
buffers its input; use
.Fn yyrestart
instead.
Once scanning terminates because an end-of-file
has been seen,
.Fa yyin
can be assigned as the new input file
and the scanner can be called again to continue scanning.
.It void yyrestart(FILE *new_file)
May be called to point
.Fa yyin
at the new input file.
The switch-over to the new file is immediate
.Pq any previously buffered-up input is lost .
Note that calling
.Fn yyrestart
with
.Fa yyin
as an argument thus throws away the current input buffer and continues
scanning the same input file.
.It FILE *yyout
Is the file to which
.Em ECHO
actions are done.
It can be reassigned by the user.
.It YY_CURRENT_BUFFER
Returns a
.Dv YY_BUFFER_STATE
handle to the current buffer.
.It YY_START
Returns an integer value corresponding to the current start condition.
This value can subsequently be used with
.Em BEGIN
to return to that start condition.
.El
.Sh INTERFACING WITH YACC
One of the main uses of
.Nm
is as a companion to the
.Xr yacc 1
parser-generator.
yacc parsers expect to call a routine named
.Fn yylex
to find the next input token.
The routine is supposed to return the type of the next token
as well as putting any associated value in the global
.Fa yylval ,
which is defined externally,
and can be a union or any other complex data structure.
To use
.Nm
with yacc, one specifies the
.Fl d
option to yacc to instruct it to generate the file
.Pa y.tab.h
containing definitions of all the
.Dq %tokens
appearing in the yacc input.
This file is then included in the
.Nm
scanner.
For example, part of the scanner might look like:
.Bd -literal -offset indent
%{
#include "y.tab.h"
%}

%%

if            return TOK_IF;
then          return TOK_THEN;
begin         return TOK_BEGIN;
end           return TOK_END;
.Ed
.Sh OPTIONS
.Nm
has the following options:
.Bl -tag -width Ds
.It Fl 7
Instructs
.Nm
to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
characters in its input.
The advantage of using
.Fl 7
is that the scanner's tables can be up to half the size of those generated
using the
.Fl 8
option
.Pq see below .
The disadvantage is that such scanners often hang
or crash if their input contains an 8-bit character.
.Pp
Note, however, that unless generating a scanner using the
.Fl Cf
or
.Fl CF
table compression options, use of
.Fl 7
will save only a small amount of table space,
and make the scanner considerably less portable.
.Nm flex Ns 's
default behavior is to generate an 8-bit scanner unless
.Fl Cf
or
.Fl CF
is specified, in which case
.Nm
defaults to generating 7-bit scanners unless it was
configured to generate 8-bit scanners
(as will often be the case with non-USA sites).
It is possible to tell whether
.Nm
generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
.Fl v
output as described below.
.Pp
Note that if
.Fl Cfe
or
.Fl CFe
are used
(the table compression options, but also using equivalence classes as
discussed below),
.Nm
still defaults to generating an 8-bit scanner,
since usually with these compression options full 8-bit tables
are not much more expensive than 7-bit tables.
.It Fl 8
Instructs
.Nm
to generate an 8-bit scanner, i.e., one which can recognize 8-bit
characters.
This flag is only needed for scanners generated using
.Fl Cf
or
.Fl CF ,
as otherwise
.Nm
defaults to generating an 8-bit scanner anyway.
.Pp
See the discussion of
.Fl 7
above for
.Nm flex Ns 's
default behavior and the tradeoffs between 7-bit and 8-bit scanners.
.It Fl B
Instructs
.Nm
to generate a
.Em batch
scanner, the opposite of
.Em interactive
scanners generated by
.Fl I
.Pq see below .
In general,
.Fl B
is used when the scanner will never be used interactively,
and you want to squeeze a little more performance out of it.
If the aim is instead to squeeze out a lot more performance,
use the
.Fl Cf
or
.Fl CF
options
.Pq discussed below ,
which turn on
.Fl B
automatically anyway.
.It Fl b
Generate backing-up information to
.Pa lex.backup .
This is a list of scanner states which require backing up
and the input characters on which they do so.
By adding rules one can remove backing-up states.
If all backing-up states are eliminated and
.Fl Cf
or
.Fl CF
is used, the generated scanner will run faster (see the
.Fl p
flag).
Only users who wish to squeeze every last cycle out of their
scanners need worry about this option.
(See the section on
.Sx PERFORMANCE CONSIDERATIONS
below.)
.It Fl C Ns Op Cm aeFfmr
Controls the degree of table compression and, more generally, trade-offs
between small scanners and fast scanners.
.Bl -tag -width Ds
.It Fl Ca
Instructs
.Nm
to trade off larger tables in the generated scanner for faster performance
because the elements of the tables are better aligned for memory access
and computation.
On some
.Tn RISC
architectures, fetching and manipulating longwords is more efficient
than with smaller-sized units such as shortwords.
This option can double the size of the tables used by the scanner.
.It Fl Ce
Directs
.Nm
to construct
.Em equivalence classes ,
i.e., sets of characters which have identical lexical properties
(for example, if the only appearance of digits in the
.Nm
input is in the character class
.Qq [0-9]
then the digits
.Sq 0 ,
.Sq 1 ,
.Sq ... ,
.Sq 9
will all be put in the same equivalence class).
Equivalence classes usually give dramatic reductions in the final
table/object file sizes
.Pq typically a factor of 2\-5
and are pretty cheap performance-wise
.Pq one array look-up per character scanned .
.It Fl CF
Specifies that the alternate fast scanner representation
(described below under the
.Fl F
option)
should be used.
This option cannot be used with
.Fl + .
2442.It Fl Cf 2443Specifies that the 2444.Em full 2445scanner tables should be generated \- 2446.Nm 2447should not compress the tables by taking advantage of 2448similar transition functions for different states. 2449.It Fl \&Cm 2450Directs 2451.Nm 2452to construct 2453.Em meta-equivalence classes , 2454which are sets of equivalence classes 2455(or characters, if equivalence classes are not being used) 2456that are commonly used together. 2457Meta-equivalence classes are often a big win when using compressed tables, 2458but they have a moderate performance impact 2459(one or two 2460.Qq if 2461tests and one array look-up per character scanned). 2462.It Fl Cr 2463Causes the generated scanner to 2464.Em bypass 2465use of the standard I/O library 2466.Pq stdio 2467for input. 2468Instead of calling 2469.Xr fread 3 2470or 2471.Xr getc 3 , 2472the scanner will use the 2473.Xr read 2 2474system call, 2475resulting in a performance gain which varies from system to system, 2476but in general is probably negligible unless 2477.Fl Cf 2478or 2479.Fl CF 2480are being used. 2481Using 2482.Fl Cr 2483can cause strange behavior if, for example, reading from 2484.Fa yyin 2485using stdio prior to calling the scanner 2486(because the scanner will miss whatever text previous reads left 2487in the stdio input buffer). 2488.Pp 2489.Fl Cr 2490has no effect if 2491.Dv YY_INPUT 2492is defined 2493(see 2494.Sx THE GENERATED SCANNER 2495above). 2496.El 2497.Pp 2498A lone 2499.Fl C 2500specifies that the scanner tables should be compressed but neither 2501equivalence classes nor meta-equivalence classes should be used. 2502.Pp 2503The options 2504.Fl Cf 2505or 2506.Fl CF 2507and 2508.Fl \&Cm 2509do not make sense together \- there is no opportunity for meta-equivalence 2510classes if the table is not being compressed. 2511Otherwise the options may be freely mixed, and are cumulative. 
2512.Pp 2513The default setting is 2514.Fl Cem 2515which specifies that 2516.Nm 2517should generate equivalence classes and meta-equivalence classes. 2518This setting provides the highest degree of table compression. 2519It is possible to trade off faster-executing scanners at the cost of 2520larger tables with the following generally being true: 2521.Bd -unfilled -offset indent 2522slowest & smallest 2523 -Cem 2524 -Cm 2525 -Ce 2526 -C 2527 -C{f,F}e 2528 -C{f,F} 2529 -C{f,F}a 2530fastest & largest 2531.Ed 2532.Pp 2533Note that scanners with the smallest tables are usually generated and 2534compiled the quickest, 2535so during development the default is usually best, 2536maximal compression. 2537.Pp 2538.Fl Cfe 2539is often a good compromise between speed and size for production scanners. 2540.It Fl d 2541Makes the generated scanner run in debug mode. 2542Whenever a pattern is recognized and the global 2543.Fa yy_flex_debug 2544is non-zero 2545.Pq which is the default , 2546the scanner will write to stderr a line of the form: 2547.Pp 2548.D1 --accepting rule at line 53 ("the matched text") 2549.Pp 2550The line number refers to the location of the rule in the file 2551defining the scanner 2552(i.e., the file that was fed to 2553.Nm ) . 2554Messages are also generated when the scanner backs up, 2555accepts the default rule, 2556reaches the end of its input buffer 2557(or encounters a NUL; 2558at this point, the two look the same as far as the scanner's concerned), 2559or reaches an end-of-file. 2560.It Fl F 2561Specifies that the fast scanner table representation should be used 2562.Pq and stdio bypassed . 2563This representation is about as fast as the full table representation 2564.Pq Fl f , 2565and for some sets of patterns will be considerably smaller 2566.Pq and for others, larger . 
2567In general, if the pattern set contains both 2568.Qq keywords 2569and a catch-all, 2570.Qq identifier 2571rule, such as in the set: 2572.Bd -unfilled -offset indent 2573"case" return TOK_CASE; 2574"switch" return TOK_SWITCH; 2575\&... 2576"default" return TOK_DEFAULT; 2577[a-z]+ return TOK_ID; 2578.Ed 2579.Pp 2580then it's better to use the full table representation. 2581If only the 2582.Qq identifier 2583rule is present and a hash table or some such is used to detect the keywords, 2584it's better to use 2585.Fl F . 2586.Pp 2587This option is equivalent to 2588.Fl CFr 2589.Pq see above . 2590It cannot be used with 2591.Fl + . 2592.It Fl f 2593Specifies 2594.Em fast scanner . 2595No table compression is done and stdio is bypassed. 2596The result is large but fast. 2597This option is equivalent to 2598.Fl Cfr 2599.Pq see above . 2600.It Fl h 2601Generates a help summary of 2602.Nm flex Ns 's 2603options to stdout and then exits. 2604.Fl ?\& 2605and 2606.Fl Fl help 2607are synonyms for 2608.Fl h . 2609.It Fl I 2610Instructs 2611.Nm 2612to generate an 2613.Em interactive 2614scanner. 2615An interactive scanner is one that only looks ahead to decide 2616what token has been matched if it absolutely must. 2617It turns out that always looking one extra character ahead, 2618even if the scanner has already seen enough text 2619to disambiguate the current token, is a bit faster than 2620only looking ahead when necessary. 2621But scanners that always look ahead give dreadful interactive performance; 2622for example, when a user types a newline, 2623it is not recognized as a newline token until they enter 2624.Em another 2625token, which often means typing in another whole line. 2626.Pp 2627.Nm 2628scanners default to 2629.Em interactive 2630unless 2631.Fl Cf 2632or 2633.Fl CF 2634table-compression options are specified 2635.Pq see above . 
2636That's because if high-performance is most important, 2637one of these options should be used, 2638so if they weren't, 2639.Nm 2640assumes it is preferable to trade off a bit of run-time performance for 2641intuitive interactive behavior. 2642Note also that 2643.Fl I 2644cannot be used in conjunction with 2645.Fl Cf 2646or 2647.Fl CF . 2648Thus, this option is not really needed; it is on by default for all those 2649cases in which it is allowed. 2650.Pp 2651A scanner can be forced to not be interactive by using 2652.Fl B 2653.Pq see above . 2654.It Fl i 2655Instructs 2656.Nm 2657to generate a case-insensitive scanner. 2658The case of letters given in the 2659.Nm 2660input patterns will be ignored, 2661and tokens in the input will be matched regardless of case. 2662The matched text given in 2663.Fa yytext 2664will have the preserved case 2665.Pq i.e., it will not be folded . 2666.It Fl L 2667Instructs 2668.Nm 2669not to generate 2670.Dq #line 2671directives. 2672Without this option, 2673.Nm 2674peppers the generated scanner with #line directives so error messages 2675in the actions will be correctly located with respect to either the original 2676.Nm 2677input file 2678(if the errors are due to code in the input file), 2679or 2680.Pa lex.yy.c 2681(if the errors are 2682.Nm flex Ns 's 2683fault \- these sorts of errors should be reported to the email address 2684given below). 2685.It Fl l 2686Turns on maximum compatibility with the original 2687.At 2688.Nm lex 2689implementation. 2690Note that this does not mean full compatibility. 2691Use of this option costs a considerable amount of performance, 2692and it cannot be used with the 2693.Fl + , f , F , Cf , 2694or 2695.Fl CF 2696options. 2697For details on the compatibilities it provides, see the section 2698.Sx INCOMPATIBILITIES WITH LEX AND POSIX 2699below. 2700This option also results in the name 2701.Dv YY_FLEX_LEX_COMPAT 2702being #define'd in the generated scanner. 
2703.It Fl n 2704Another do-nothing, deprecated option included only for 2705.Tn POSIX 2706compliance. 2707.It Fl o Ns Ar output 2708Directs 2709.Nm 2710to write the scanner to the file 2711.Ar output 2712instead of 2713.Pa lex.yy.c . 2714If 2715.Fl o 2716is combined with the 2717.Fl t 2718option, then the scanner is written to stdout but its 2719.Dq #line 2720directives 2721(see the 2722.Fl L 2723option above) 2724refer to the file 2725.Ar output . 2726.It Fl P Ns Ar prefix 2727Changes the default 2728.Qq yy 2729prefix used by 2730.Nm 2731for all globally visible variable and function names to instead be 2732.Ar prefix . 2733For example, 2734.Fl P Ns Ar foo 2735changes the name of 2736.Fa yytext 2737to 2738.Fa footext . 2739It also changes the name of the default output file from 2740.Pa lex.yy.c 2741to 2742.Pa lex.foo.c . 2743Here are all of the names affected: 2744.Bd -unfilled -offset indent 2745yy_create_buffer 2746yy_delete_buffer 2747yy_flex_debug 2748yy_init_buffer 2749yy_flush_buffer 2750yy_load_buffer_state 2751yy_switch_to_buffer 2752yyin 2753yyleng 2754yylex 2755yylineno 2756yyout 2757yyrestart 2758yytext 2759yywrap 2760.Ed 2761.Pp 2762(If using a C++ scanner, then only 2763.Fa yywrap 2764and 2765.Fa yyFlexLexer 2766are affected.) 2767Within the scanner itself, it is still possible to refer to the global variables 2768and functions using either version of their name; but externally, they 2769have the modified name. 2770.Pp 2771This option allows multiple 2772.Nm 2773programs to be easily linked together into the same executable. 2774Note, though, that using this option also renames 2775.Fn yywrap , 2776so now either an 2777.Pq appropriately named 2778version of the routine for the scanner must be supplied, or 2779.Dq %option noyywrap 2780must be used, as linking with 2781.Fl lfl 2782no longer provides one by default. 2783.It Fl p 2784Generates a performance report to stderr. 
2785The report consists of comments regarding features of the 2786.Nm 2787input file which will cause a serious loss of performance in the resulting 2788scanner. 2789If the flag is specified twice, 2790comments regarding features that lead to minor performance losses 2791will also be reported. 2792.Pp 2793Note that the use of 2794.Em REJECT , 2795.Dq %option yylineno , 2796and variable trailing context 2797(see the 2798.Sx BUGS 2799section below) 2800entails a substantial performance penalty; use of 2801.Fn yymore , 2802the 2803.Sq ^ 2804operator, and the 2805.Fl I 2806flag entail minor performance penalties. 2807.It Fl S Ns Ar skeleton 2808Overrides the default skeleton file from which 2809.Nm 2810constructs its scanners. 2811This option is needed only for 2812.Nm 2813maintenance or development. 2814.It Fl s 2815Causes the default rule 2816.Pq that unmatched scanner input is echoed to stdout 2817to be suppressed. 2818If the scanner encounters input that does not 2819match any of its rules, it aborts with an error. 2820This option is useful for finding holes in a scanner's rule set. 2821.It Fl T 2822Makes 2823.Nm 2824run in 2825.Em trace 2826mode. 2827It will generate a lot of messages to stderr concerning 2828the form of the input and the resultant non-deterministic and deterministic 2829finite automata. 2830This option is mostly for use in maintaining 2831.Nm . 2832.It Fl t 2833Instructs 2834.Nm 2835to write the scanner it generates to standard output instead of 2836.Pa lex.yy.c . 2837.It Fl V 2838Prints the version number to stdout and exits. 2839.Fl Fl version 2840is a synonym for 2841.Fl V . 2842.It Fl v 2843Specifies that 2844.Nm 2845should write to stderr 2846a summary of statistics regarding the scanner it generates.
2847Most of the statistics are meaningless to the casual 2848.Nm 2849user, but the first line identifies the version of 2850.Nm 2851(same as reported by 2852.Fl V ) , 2853and the next line the flags used when generating the scanner, 2854including those that are on by default. 2855.It Fl w 2856Suppresses warning messages. 2857.It Fl + 2858Specifies that 2859.Nm 2860should generate a C++ scanner class. 2861See the section on 2862.Sx GENERATING C++ SCANNERS 2863below for details. 2864.El 2865.Pp 2866.Nm 2867also provides a mechanism for controlling options within the 2868scanner specification itself, rather than from the 2869.Nm 2870command line. 2871This is done by including 2872.Dq %option 2873directives in the first section of the scanner specification. 2874Multiple options can be specified with a single 2875.Dq %option 2876directive, and multiple directives in the first section of the 2877.Nm 2878input file. 2879.Pp 2880Most options are given simply as names, optionally preceded by the word 2881.Qq no 2882.Pq with no intervening whitespace 2883to negate their meaning. 
2884A number are equivalent to 2885.Nm 2886flags or their negation: 2887.Bd -unfilled -offset indent 28887bit -7 option 28898bit -8 option 2890align -Ca option 2891backup -b option 2892batch -B option 2893c++ -+ option 2894 2895caseful or 2896case-sensitive opposite of -i (default) 2897 2898case-insensitive or 2899caseless -i option 2900 2901debug -d option 2902default opposite of -s option 2903ecs -Ce option 2904fast -F option 2905full -f option 2906interactive -I option 2907lex-compat -l option 2908meta-ecs -Cm option 2909perf-report -p option 2910read -Cr option 2911stdout -t option 2912verbose -v option 2913warn opposite of -w option 2914 (use "%option nowarn" for -w) 2915 2916array equivalent to "%array" 2917pointer equivalent to "%pointer" (default) 2918.Ed 2919.Pp 2920Some %option's provide features otherwise not available: 2921.Bl -tag -width Ds 2922.It always-interactive 2923Instructs 2924.Nm 2925to generate a scanner which always considers its input 2926.Qq interactive . 2927Normally, on each new input file the scanner calls 2928.Fn isatty 2929in an attempt to determine whether the scanner's input source is interactive 2930and thus should be read a character at a time. 2931When this option is used, however, no such call is made. 2932.It main 2933Directs 2934.Nm 2935to provide a default 2936.Fn main 2937program for the scanner, which simply calls 2938.Fn yylex . 2939This option implies 2940.Dq noyywrap 2941.Pq see below . 2942.It never-interactive 2943Instructs 2944.Nm 2945to generate a scanner which never considers its input 2946.Qq interactive 2947(again, no call made to 2948.Fn isatty ) . 2949This is the opposite of 2950.Dq always-interactive . 2951.It stack 2952Enables the use of start condition stacks 2953(see 2954.Sx START CONDITIONS 2955above). 2956.It stdinit 2957If set (i.e., 2958.Dq %option stdinit ) , 2959initializes 2960.Fa yyin 2961and 2962.Fa yyout 2963to stdin and stdout, instead of the default of 2964.Dq nil . 
2965Some existing 2966.Nm lex 2967programs depend on this behavior, even though it is not compliant with ANSI C, 2968which does not require stdin and stdout to be compile-time constant. 2969.It yylineno 2970Directs 2971.Nm 2972to generate a scanner that maintains the number of the current line 2973read from its input in the global variable 2974.Fa yylineno . 2975This option is implied by 2976.Dq %option lex-compat . 2977.It yywrap 2978If unset (i.e., 2979.Dq %option noyywrap ) , 2980makes the scanner not call 2981.Fn yywrap 2982upon an end-of-file, but simply assume that there are no more files to scan 2983(until the user points 2984.Fa yyin 2985at a new file and calls 2986.Fn yylex 2987again). 2988.El 2989.Pp 2990.Nm 2991scans rule actions to determine whether the 2992.Em REJECT 2993or 2994.Fn yymore 2995features are being used. 2996The 2997.Dq reject 2998and 2999.Dq yymore 3000options are available to override its decision as to whether to use the 3001options, either by setting them (e.g., 3002.Dq %option reject ) 3003to indicate the feature is indeed used, 3004or unsetting them to indicate it actually is not used 3005(e.g., 3006.Dq %option noyymore ) . 3007.Pp 3008Three options take string-delimited values, offset with 3009.Sq = : 3010.Pp 3011.D1 %option outfile="ABC" 3012.Pp 3013is equivalent to 3014.Fl o Ns Ar ABC , 3015and 3016.Pp 3017.D1 %option prefix="XYZ" 3018.Pp 3019is equivalent to 3020.Fl P Ns Ar XYZ . 3021Finally, 3022.Pp 3023.D1 %option yyclass="foo" 3024.Pp 3025only applies when generating a C++ scanner 3026.Pf ( Fl + 3027option). 3028It informs 3029.Nm 3030that 3031.Dq foo 3032has been derived as a subclass of yyFlexLexer, so 3033.Nm 3034will place actions in the member function 3035.Dq foo::yylex() 3036instead of 3037.Dq yyFlexLexer::yylex() . 3038It also generates a 3039.Dq yyFlexLexer::yylex() 3040member function that emits a run-time error (by invoking 3041.Dq yyFlexLexer::LexerError() ) 3042if called. 
3043See 3044.Sx GENERATING C++ SCANNERS , 3045below, for additional information. 3046.Pp 3047A number of options are available for 3048lint 3049purists who want to suppress the appearance of unneeded routines 3050in the generated scanner. 3051Each of the following, if unset 3052(e.g., 3053.Dq %option nounput ) , 3054results in the corresponding routine not appearing in the generated scanner: 3055.Bd -unfilled -offset indent 3056input, unput 3057yy_push_state, yy_pop_state, yy_top_state 3058yy_scan_buffer, yy_scan_bytes, yy_scan_string 3059.Ed 3060.Pp 3061(though 3062.Fn yy_push_state 3063and friends won't appear anyway unless 3064.Dq %option stack 3065is being used). 3066.Sh PERFORMANCE CONSIDERATIONS 3067The main design goal of 3068.Nm 3069is that it generate high-performance scanners. 3070It has been optimized for dealing well with large sets of rules. 3071Aside from the effects on scanner speed of the table compression 3072.Fl C 3073options outlined above, 3074there are a number of options/actions which degrade performance. 3075These are, from most expensive to least: 3076.Bd -unfilled -offset indent 3077REJECT 3078%option yylineno 3079arbitrary trailing context 3080 3081pattern sets that require backing up 3082%array 3083%option interactive 3084%option always-interactive 3085 3086\&'^' beginning-of-line operator 3087yymore() 3088.Ed 3089.Pp 3090with the first three all being quite expensive 3091and the last two being quite cheap. 3092Note also that 3093.Fn unput 3094is implemented as a routine call that potentially does quite a bit of work, 3095while 3096.Fn yyless 3097is a quite-cheap macro; so if just putting back some excess text, 3098use 3099.Fn yyless . 3100.Pp 3101.Em REJECT 3102should be avoided at all costs when performance is important. 3103It is a particularly expensive option. 3104.Pp 3105Getting rid of backing up is messy and often may be an enormous 3106amount of work for a complicated scanner. 
3107In principle, one begins by using the 3108.Fl b 3109flag to generate a 3110.Pa lex.backup 3111file. 3112For example, on the input 3113.Bd -literal -offset indent 3114%% 3115foo return TOK_KEYWORD; 3116foobar return TOK_KEYWORD; 3117.Ed 3118.Pp 3119the file looks like: 3120.Bd -literal -offset indent 3121State #6 is non-accepting - 3122 associated rule line numbers: 3123 2 3 3124 out-transitions: [ o ] 3125 jam-transitions: EOF [ \e001-n p-\e177 ] 3126 3127State #8 is non-accepting - 3128 associated rule line numbers: 3129 3 3130 out-transitions: [ a ] 3131 jam-transitions: EOF [ \e001-` b-\e177 ] 3132 3133State #9 is non-accepting - 3134 associated rule line numbers: 3135 3 3136 out-transitions: [ r ] 3137 jam-transitions: EOF [ \e001-q s-\e177 ] 3138 3139Compressed tables always back up. 3140.Ed 3141.Pp 3142The first few lines tell us that there's a scanner state in 3143which it can make a transition on an 3144.Sq o 3145but not on any other character, 3146and that in that state the currently scanned text does not match any rule. 3147The state occurs when trying to match the rules found 3148at lines 2 and 3 in the input file. 3149If the scanner is in that state and then reads something other than an 3150.Sq o , 3151it will have to back up to find a rule which is matched. 3152With a bit of headscratching one can see that this must be the 3153state it's in when it has seen 3154.Sq fo . 3155When this has happened, if anything other than another 3156.Sq o 3157is seen, the scanner will have to back up to simply match the 3158.Sq f 3159.Pq by the default rule . 3160.Pp 3161The comment regarding State #8 indicates there's a problem when 3162.Qq foob 3163has been scanned. 3164Indeed, on any character other than an 3165.Sq a , 3166the scanner will have to back up to accept 3167.Qq foo . 3168Similarly, the comment for State #9 concerns when 3169.Qq fooba 3170has been scanned and an 3171.Sq r 3172does not follow.
3173.Pp 3174The final comment reminds us that there's no point going to 3175all the trouble of removing backing up from the rules unless we're using 3176.Fl Cf 3177or 3178.Fl CF , 3179since there's no performance gain doing so with compressed scanners. 3180.Pp 3181The way to remove the backing up is to add 3182.Qq error 3183rules: 3184.Bd -literal -offset indent 3185%% 3186foo return TOK_KEYWORD; 3187foobar return TOK_KEYWORD; 3188 3189fooba | 3190foob | 3191fo { 3192 /* false alarm, not really a keyword */ 3193 return TOK_ID; 3194} 3195.Ed 3196.Pp 3197Eliminating backing up among a list of keywords can also be done using a 3198.Qq catch-all 3199rule: 3200.Bd -literal -offset indent 3201%% 3202foo return TOK_KEYWORD; 3203foobar return TOK_KEYWORD; 3204 3205[a-z]+ return TOK_ID; 3206.Ed 3207.Pp 3208This is usually the best solution when appropriate. 3209.Pp 3210Backing up messages tend to cascade. 3211With a complicated set of rules it's not uncommon to get hundreds of messages. 3212If one can decipher them, though, 3213it often only takes a dozen or so rules to eliminate the backing up 3214(though it's easy to make a mistake and have an error rule accidentally match 3215a valid token; a possible future 3216.Nm 3217feature will be to automatically add rules to eliminate backing up). 3218.Pp 3219It's important to keep in mind that the benefits of eliminating 3220backing up are gained only if 3221.Em every 3222instance of backing up is eliminated. 3223Leaving just one gains nothing. 3224.Pp 3225.Em Variable 3226trailing context 3227(where both the leading and trailing parts do not have a fixed length) 3228entails almost the same performance loss as 3229.Em REJECT 3230.Pq i.e., substantial . 
3231So when possible a rule like: 3232.Bd -literal -offset indent 3233%% 3234mouse|rat/(cat|dog) run(); 3235.Ed 3236.Pp 3237is better written: 3238.Bd -literal -offset indent 3239%% 3240mouse/cat|dog run(); 3241rat/cat|dog run(); 3242.Ed 3243.Pp 3244or as 3245.Bd -literal -offset indent 3246%% 3247mouse|rat/cat run(); 3248mouse|rat/dog run(); 3249.Ed 3250.Pp 3251Note that here the special 3252.Sq |\& 3253action does not provide any savings, and can even make things worse (see 3254.Sx BUGS 3255below). 3256.Pp 3257Another area where the user can increase a scanner's performance 3258.Pq and one that's easier to implement 3259arises from the fact that the longer the tokens matched, 3260the faster the scanner will run. 3261This is because with long tokens the processing of most input 3262characters takes place in the 3263.Pq short 3264inner scanning loop, and does not often have to go through the additional work 3265of setting up the scanning environment (e.g., 3266.Fa yytext ) 3267for the action. 3268Recall the scanner for C comments: 3269.Bd -literal -offset indent 3270%x comment 3271%% 3272int line_num = 1; 3273 3274"/*" BEGIN(comment); 3275 3276<comment>[^*\en]* 3277<comment>"*"+[^*/\en]* 3278<comment>\en ++line_num; 3279<comment>"*"+"/" BEGIN(INITIAL); 3280.Ed 3281.Pp 3282This could be sped up by writing it as: 3283.Bd -literal -offset indent 3284%x comment 3285%% 3286int line_num = 1; 3287 3288"/*" BEGIN(comment); 3289 3290<comment>[^*\en]* 3291<comment>[^*\en]*\en ++line_num; 3292<comment>"*"+[^*/\en]* 3293<comment>"*"+[^*/\en]*\en ++line_num; 3294<comment>"*"+"/" BEGIN(INITIAL); 3295.Ed 3296.Pp 3297Now instead of each newline requiring the processing of another action, 3298recognizing the newlines is 3299.Qq distributed 3300over the other rules to keep the matched text as long as possible. 3301Note that adding rules does 3302.Em not 3303slow down the scanner! 
3304The speed of the scanner is independent of the number of rules or 3305(modulo the considerations given at the beginning of this section) 3306how complicated the rules are with regard to operators such as 3307.Sq * 3308and 3309.Sq |\& . 3310.Pp 3311A final example in speeding up a scanner: 3312scan through a file containing identifiers and keywords, one per line 3313and with no other extraneous characters, and recognize all the keywords. 3314A natural first approach is: 3315.Bd -literal -offset indent 3316%% 3317asm | 3318auto | 3319break | 3320\&... etc ... 3321volatile | 3322while /* it's a keyword */ 3323 3324\&.|\en /* it's not a keyword */ 3325.Ed 3326.Pp 3327To eliminate the back-tracking, introduce a catch-all rule: 3328.Bd -literal -offset indent 3329%% 3330asm | 3331auto | 3332break | 3333\&... etc ... 3334volatile | 3335while /* it's a keyword */ 3336 3337[a-z]+ | 3338\&.|\en /* it's not a keyword */ 3339.Ed 3340.Pp 3341Now, if it's guaranteed that there's exactly one word per line, 3342then we can reduce the total number of matches by a half by 3343merging in the recognition of newlines with that of the other tokens: 3344.Bd -literal -offset indent 3345%% 3346asm\en | 3347auto\en | 3348break\en | 3349\&... etc ... 3350volatile\en | 3351while\en /* it's a keyword */ 3352 3353[a-z]+\en | 3354\&.|\en /* it's not a keyword */ 3355.Ed 3356.Pp 3357One has to be careful here, 3358as we have now reintroduced backing up into the scanner. 3359In particular, while we know that there will never be any characters 3360in the input stream other than letters or newlines, 3361.Nm 3362can't figure this out, and it will plan for possibly needing to back up 3363when it has scanned a token like 3364.Qq auto 3365and then the next character is something other than a newline or a letter. 3366Previously it would then just match the 3367.Qq auto 3368rule and be done, but now it has no 3369.Qq auto 3370rule, only an 3371.Qq auto\en 3372rule. 
3373To eliminate the possibility of backing up, 3374we could either duplicate all rules but without final newlines or, 3375since we never expect to encounter such an input and therefore don't care 3376how it's classified, we can introduce one more catch-all rule, 3377this one which doesn't include a newline: 3378.Bd -literal -offset indent 3379%% 3380asm\en | 3381auto\en | 3382break\en | 3383\&... etc ... 3384volatile\en | 3385while\en /* it's a keyword */ 3386 3387[a-z]+\en | 3388[a-z]+ | 3389\&.|\en /* it's not a keyword */ 3390.Ed 3391.Pp 3392Compiled with 3393.Fl Cf , 3394this is about as fast as one can get a 3395.Nm 3396scanner to go for this particular problem. 3397.Pp 3398A final note: 3399.Nm 3400is slow when matching NUL's, 3401particularly when a token contains multiple NUL's. 3402It's best to write rules which match short 3403amounts of text if it's anticipated that the text will often include NUL's. 3404.Pp 3405Another final note regarding performance: as mentioned above in the section 3406.Sx HOW THE INPUT IS MATCHED , 3407dynamically resizing 3408.Fa yytext 3409to accommodate huge tokens is a slow process because it presently requires that 3410the 3411.Pq huge 3412token be rescanned from the beginning. 3413Thus if performance is vital, it is better to attempt to match 3414.Qq large 3415quantities of text but not 3416.Qq huge 3417quantities, where the cutoff between the two is at about 8K characters/token. 3418.Sh GENERATING C++ SCANNERS 3419.Nm 3420provides two different ways to generate scanners for use with C++. 3421The first way is to simply compile a scanner generated by 3422.Nm 3423using a C++ compiler instead of a C compiler. 3424This should not generate any compilation errors 3425(please report any found to the email address given in the 3426.Sx AUTHORS 3427section below). 3428C++ code can then be used in rule actions instead of C code.
3429Note that the default input source for scanners remains 3430.Fa yyin , 3431and default echoing is still done to 3432.Fa yyout . 3433Both of these remain 3434.Fa FILE * 3435variables and not C++ streams. 3436.Pp 3437.Nm 3438can also be used to generate a C++ scanner class, using the 3439.Fl + 3440option (or, equivalently, 3441.Dq %option c++ ) , 3442which is automatically specified if the name of the flex executable ends in a 3443.Sq + , 3444such as 3445.Nm flex++ . 3446When using this option, 3447.Nm 3448defaults to generating the scanner to the file 3449.Pa lex.yy.cc 3450instead of 3451.Pa lex.yy.c . 3452The generated scanner includes the header file 3453.In g++/FlexLexer.h , 3454which defines the interface to two C++ classes. 3455.Pp 3456The first class, 3457.Em FlexLexer , 3458provides an abstract base class defining the general scanner class interface. 3459It provides the following member functions: 3460.Bl -tag -width Ds 3461.It const char* YYText() 3462Returns the text of the most recently matched token, the equivalent of 3463.Fa yytext . 3464.It int YYLeng() 3465Returns the length of the most recently matched token, the equivalent of 3466.Fa yyleng . 3467.It int lineno() const 3468Returns the current input line number 3469(see 3470.Dq %option yylineno ) , 3471or 1 if 3472.Dq %option yylineno 3473was not used. 3474.It void set_debug(int flag) 3475Sets the debugging flag for the scanner, equivalent to assigning to 3476.Fa yy_flex_debug 3477(see the 3478.Sx OPTIONS 3479section above). 3480Note that the scanner must be built using 3481.Dq %option debug 3482to include debugging information in it. 3483.It int debug() const 3484Returns the current setting of the debugging flag. 
3485.El 3486.Pp 3487Also provided are member functions equivalent to 3488.Fn yy_switch_to_buffer , 3489.Fn yy_create_buffer 3490(though the first argument is an 3491.Fa std::istream* 3492object pointer and not a 3493.Fa FILE* ) , 3494.Fn yy_flush_buffer , 3495.Fn yy_delete_buffer , 3496and 3497.Fn yyrestart 3498(again, the first argument is an 3499.Fa std::istream* 3500object pointer). 3501.Pp 3502The second class defined in 3503.In g++/FlexLexer.h 3504is 3505.Fa yyFlexLexer , 3506which is derived from 3507.Fa FlexLexer . 3508It defines the following additional member functions: 3509.Bl -tag -width Ds 3510.It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)" 3511Constructs a 3512.Fa yyFlexLexer 3513object using the given streams for input and output. 3514If not specified, the streams default to 3515.Fa cin 3516and 3517.Fa cout , 3518respectively. 3519.It virtual int yylex() 3520Performs the same role as 3521.Fn yylex 3522does for ordinary flex scanners: it scans the input stream, consuming 3523tokens, until a rule's action returns a value. 3524If subclass 3525.Sq S 3526is derived from 3527.Fa yyFlexLexer , 3528in order to access the member functions and variables of 3529.Sq S 3530inside 3531.Fn yylex , 3532use 3533.Dq %option yyclass="S" 3534to inform 3535.Nm 3536that the 3537.Sq S 3538subclass will be used instead of 3539.Fa yyFlexLexer . 3540In this case, rather than generating 3541.Dq yyFlexLexer::yylex() , 3542.Nm 3543generates 3544.Dq S::yylex() 3545(and also generates a dummy 3546.Dq yyFlexLexer::yylex() 3547that calls 3548.Dq yyFlexLexer::LexerError() 3549if called). 3550.It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)" 3551Reassigns 3552.Fa yyin 3553to 3554.Fa new_in 3555.Pq if non-nil 3556and 3557.Fa yyout 3558to 3559.Fa new_out 3560.Pq ditto , 3561deleting the previous input buffer if 3562.Fa yyin 3563is reassigned. 
3564.It int yylex(std::istream* new_in, std::ostream* new_out = 0) 3565First switches the input streams via 3566.Dq switch_streams(new_in, new_out) 3567and then returns the value of 3568.Fn yylex . 3569.El 3570.Pp 3571In addition, 3572.Fa yyFlexLexer 3573defines the following protected virtual functions which can be redefined 3574in derived classes to tailor the scanner: 3575.Bl -tag -width Ds 3576.It virtual int LexerInput(char* buf, int max_size) 3577Reads up to 3578.Fa max_size 3579characters into 3580.Fa buf 3581and returns the number of characters read. 3582To indicate end-of-input, return 0 characters. 3583Note that 3584.Qq interactive 3585scanners (see the 3586.Fl B 3587and 3588.Fl I 3589flags) define the macro 3590.Dv YY_INTERACTIVE . 3591If 3592.Fn LexerInput 3593has been redefined, and it's necessary to take different actions depending on 3594whether or not the scanner might be scanning an interactive input source, 3595it's possible to test for the presence of this name via 3596.Dq #ifdef . 3597.It virtual void LexerOutput(const char* buf, int size) 3598Writes out 3599.Fa size 3600characters from the buffer 3601.Fa buf , 3602which, while NUL-terminated, may also contain 3603.Qq internal 3604NUL's if the scanner's rules can match text with NUL's in them. 3605.It virtual void LexerError(const char* msg) 3606Reports a fatal error message. 3607The default version of this function writes the message to the stream 3608.Fa cerr 3609and exits. 3610.El 3611.Pp 3612Note that a 3613.Fa yyFlexLexer 3614object contains its entire scanning state. 3615Thus such objects can be used to create reentrant scanners. 3616Multiple instances of the same 3617.Fa yyFlexLexer 3618class can be instantiated, and multiple C++ scanner classes can be combined 3619in the same program using the 3620.Fl P 3621option discussed above. 3622.Pp 3623Finally, note that the 3624.Dq %array 3625feature is not available to C++ scanner classes; 3626.Dq %pointer 3627must be used 3628.Pq the default . 
.Pp
Here is an example of a simple C++ scanner:
.Bd -literal -offset indent
    // An example of using the flex C++ scanner class.

%{
#include <iostream>
using namespace std;
int mylineno = 0;
%}

string  \e"[^\en"]+\e"

ws      [ \et]+

alpha   [A-Za-z]
dig     [0-9]
name    ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
num1    [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
num2    [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
number  {num1}|{num2}

%%

{ws}    /* skip blanks and tabs */

"/*" {
        int c;

        while ((c = yyinput()) != 0) {
                if(c == '\en')
                        ++mylineno;
                else if(c == '*') {
                        if ((c = yyinput()) == '/')
                                break;
                        else
                                unput(c);
                }
        }
}

{number}  cout << "number " << YYText() << '\en';

\en        mylineno++;

{name}    cout << "name " << YYText() << '\en';

{string}  cout << "string " << YYText() << '\en';

%%

int main(int /* argc */, char** /* argv */)
{
        FlexLexer* lexer = new yyFlexLexer;
        while(lexer->yylex() != 0)
                ;
        return 0;
}
.Ed
.Pp
To create multiple
.Pq different
lexer classes, use the
.Fl P
flag
(or the
.Dq prefix=
option)
to rename each
.Fa yyFlexLexer
to some other
.Fa xxFlexLexer .
.In g++/FlexLexer.h
can then be included in other sources once per lexer class, first renaming
.Fa yyFlexLexer
as follows:
.Bd -literal -offset indent
#undef yyFlexLexer
#define yyFlexLexer xxFlexLexer
#include <g++/FlexLexer.h>

#undef yyFlexLexer
#define yyFlexLexer zzFlexLexer
#include <g++/FlexLexer.h>
.Ed
.Pp
This example assumes that
.Dq %option prefix="xx"
is used for one scanner and
.Dq %option prefix="zz"
is used for the other.
.Pp
.Sy IMPORTANT :
the present form of the scanning class is experimental
and may change considerably between major releases.
.Sh INCOMPATIBILITIES WITH LEX AND POSIX
.Nm
is a rewrite of the
.At
.Nm lex
tool
(the two implementations do not share any code, though),
with some extensions and incompatibilities, both of which are of concern
to those who wish to write scanners acceptable to either implementation.
.Nm
is fully compliant with the
.Tn POSIX
.Nm lex
specification, except that when using
.Dq %pointer
.Pq the default ,
a call to
.Fn unput
destroys the contents of
.Fa yytext ,
which is counter to the
.Tn POSIX
specification.
.Pp
In this section we discuss all of the known areas of incompatibility between
.Nm ,
.At
.Nm lex ,
and the
.Tn POSIX
specification.
.Pp
.Nm flex Ns 's
.Fl l
option turns on maximum compatibility with the original
.At
.Nm lex
implementation, at the cost of a major loss in the generated scanner's
performance.
We note below which incompatibilities can be overcome using the
.Fl l
option.
.Pp
.Nm
is fully compatible with
.Nm lex
with the following exceptions:
.Bl -dash
.It
The undocumented
.Nm lex
scanner internal variable
.Fa yylineno
is not supported unless
.Fl l
or
.Dq %option yylineno
is used.
.Pp
.Fa yylineno
should be maintained on a per-buffer basis, rather than a per-scanner
.Pq single global variable
basis.
.Pp
.Fa yylineno
is not part of the
.Tn POSIX
specification.
.It
The
.Fn input
routine is not redefinable, though it may be called to read characters
following whatever has been matched by a rule.
If
.Fn input
encounters an end-of-file, the normal
.Fn yywrap
processing is done.
A
.Dq real
end-of-file is returned by
.Fn input
as
.Dv EOF .
.Pp
Input is instead controlled by defining the
.Dv YY_INPUT
macro.
.Pp
The
.Nm
restriction that
.Fn input
cannot be redefined is in accordance with the
.Tn POSIX
specification, which simply does not specify any way of controlling the
scanner's input other than by making an initial assignment to
.Fa yyin .
.It
The
.Fn unput
routine is not redefinable.
This restriction is in accordance with
.Tn POSIX .
.It
.Nm
scanners are not as reentrant as
.Nm lex
scanners.
In particular, if a scanner is interactive and
an interrupt handler long-jumps out of the scanner,
and the scanner is subsequently called again,
the following error message may be displayed:
.Pp
.D1 fatal flex scanner internal error--end of buffer missed
.Pp
To reenter the scanner, first use
.Pp
.Dl yyrestart(yyin);
.Pp
Note that this call will throw away any buffered input;
usually this isn't a problem with an interactive scanner.
.Pp
Also note that flex C++ scanner classes are reentrant,
so if using C++ is an option, they should be used instead.
See
.Sx GENERATING C++ SCANNERS
above for details.
.It
.Fn output
is not supported.
Output from the
.Em ECHO
macro is done to the file-pointer
.Fa yyout
.Pq default stdout .
.Pp
.Fn output
is not part of the
.Tn POSIX
specification.
.It
.Nm lex
does not support exclusive start conditions
.Pq %x ,
though they are in the
.Tn POSIX
specification.
.It
When definitions are expanded,
.Nm
encloses them in parentheses.
With
.Nm lex ,
the following:
.Bd -literal -offset indent
NAME    [A-Z][A-Z0-9]*
%%
foo{NAME}?      printf("Found it\en");
%%
.Ed
.Pp
will not match the string
.Qq foo
because when the macro is expanded the rule is equivalent to
.Qq foo[A-Z][A-Z0-9]*?
and the precedence is such that the
.Sq ?\&
is associated with
.Qq [A-Z0-9]* .
With
.Nm ,
the rule will be expanded to
.Qq foo([A-Z][A-Z0-9]*)?
and so the string
.Qq foo
will match.
.Pp
Note that if the definition begins with
.Sq ^
or ends with
.Sq $
then it is not expanded with parentheses, to allow these operators to appear in
definitions without losing their special meanings.
But the
.Sq <s> ,
.Sq / ,
and
.Sq <<EOF>>
operators cannot be used in a
.Nm
definition.
.Pp
Using
.Fl l
results in the
.Nm lex
behavior of no parentheses around the definition.
.Pp
The
.Tn POSIX
specification is that the definition be enclosed in parentheses.
.It
Some implementations of
.Nm lex
allow a rule's action to begin on a separate line,
if the rule's pattern has trailing whitespace:
.Bd -literal -offset indent
%%
foo|bar<space here>
  { foobar_action(); }
.Ed
.Pp
.Nm
does not support this feature.
.It
The
.Nm lex
.Sq %r
.Pq generate a Ratfor scanner
option is not supported.
It is not part of the
.Tn POSIX
specification.
.It
After a call to
.Fn unput ,
.Fa yytext
is undefined until the next token is matched,
unless the scanner was built using
.Dq %array .
This is not the case with
.Nm lex
or the
.Tn POSIX
specification.
The
.Fl l
option does away with this incompatibility.
.It
The precedence of the
.Sq {}
.Pq numeric range
operator is different.
.Nm lex
interprets
.Qq abc{1,3}
as match one, two, or three occurrences of
.Sq abc ,
whereas
.Nm
interprets it as match
.Sq ab
followed by one, two, or three occurrences of
.Sq c .
The latter is in agreement with the
.Tn POSIX
specification.
.It
The precedence of the
.Sq ^
operator is different.
.Nm lex
interprets
.Qq ^foo|bar
as match either
.Sq foo
at the beginning of a line, or
.Sq bar
anywhere, whereas
.Nm
interprets it as match either
.Sq foo
or
.Sq bar
if they come at the beginning of a line.
The latter is in agreement with the
.Tn POSIX
specification.
.It
The special table-size declarations such as
.Sq %a
supported by
.Nm lex
are not required by
.Nm
scanners;
.Nm
ignores them.
.It
The name
.Dv FLEX_SCANNER
is #define'd so scanners may be written for use with either
.Nm
or
.Nm lex .
Scanners also include
.Dv YY_FLEX_MAJOR_VERSION
and
.Dv YY_FLEX_MINOR_VERSION
indicating which version of
.Nm
generated the scanner
(for example, for the 2.5 release, these defines would be 2 and 5,
respectively).
.El
.Pp
The following
.Nm
features are not included in
.Nm lex
or the
.Tn POSIX
specification:
.Bd -unfilled -offset indent
C++ scanners
%option
start condition scopes
start condition stacks
interactive/non-interactive scanners
yy_scan_string() and friends
yyterminate()
yy_set_interactive()
yy_set_bol()
YY_AT_BOL()
<<EOF>>
<*>
YY_DECL
YY_START
YY_USER_ACTION
YY_USER_INIT
#line directives
%{}'s around actions
multiple actions on a line
.Ed
.Pp
plus almost all of the
.Nm
flags.
The last feature in the list refers to the fact that with
.Nm
multiple actions can be placed on the same line,
separated with semi-colons, while with
.Nm lex ,
the following
.Pp
.Dl foo handle_foo(); ++num_foos_seen;
.Pp
is
.Pq rather surprisingly
truncated to
.Pp
.Dl foo handle_foo();
.Pp
.Nm
does not truncate the action.
Actions that are not enclosed in braces
are simply terminated at the end of the line.
.Sh FILES
.Bl -tag -width "<g++/FlexLexer.h>"
.It Pa flex.skl
Skeleton scanner.
This file is only used when building flex, not when
.Nm
executes.
.It Pa lex.backup
Backing-up information for the
.Fl b
flag (called
.Pa lex.bck
on some systems).
.It Pa lex.yy.c
Generated scanner
(called
.Pa lexyy.c
on some systems).
.It Pa lex.yy.cc
Generated C++ scanner class, when using
.Fl + .
.It In g++/FlexLexer.h
Header file defining the C++ scanner base class,
.Fa FlexLexer ,
and its derived class,
.Fa yyFlexLexer .
.It Pa /usr/lib/libl.*
.Nm
libraries.
The
.Pa /usr/lib/libfl.*\&
libraries are links to these.
Scanners must be linked using either
.Fl \&ll
or
.Fl lfl .
.El
.Sh EXIT STATUS
.Ex -std flex
.Sh DIAGNOSTICS
.Bl -diag
.It warning, rule cannot be matched
Indicates that the given rule cannot be matched because it follows other rules
that will always match the same text as it.
For example, in the following
.Dq foo
cannot be matched because it comes after an identifier
.Qq catch-all
rule:
.Bd -literal -offset indent
[a-z]+    got_identifier();
foo       got_foo();
.Ed
.Pp
Using
.Em REJECT
in a scanner suppresses this warning.
.It "warning, \-s option given but default rule can be matched"
Means that it is possible
.Pq perhaps only in a particular start condition
that the default rule
.Pq match any single character
is the only one that will match a particular input.
Since
.Fl s
was given, presumably this is not intended.
.It reject_used_but_not_detected undefined
.It yymore_used_but_not_detected undefined
These errors can occur at compile time.
They indicate that the scanner uses
.Em REJECT
or
.Fn yymore
but that
.Nm
failed to notice the fact, meaning that
.Nm
scanned the first two sections looking for occurrences of these actions
and failed to find any, but somehow they snuck in
.Pq via an #include file, for example .
Use
.Dq %option reject
or
.Dq %option yymore
to indicate to
.Nm
that these features are really needed.
.It flex scanner jammed
A scanner compiled with
.Fl s
has encountered an input string which wasn't matched by any of its rules.
This error can also occur due to internal problems.
.It token too large, exceeds YYLMAX
The scanner uses
.Dq %array
and one of its rules matched a string longer than the
.Dv YYLMAX
constant
.Pq 8K bytes by default .
The value can be increased by #define'ing
.Dv YYLMAX
in the definitions section of
.Nm
input.
.It "scanner requires \-8 flag to use the character 'x'"
The scanner specification includes recognizing the 8-bit character
.Sq x ,
but the
.Fl 8
flag was not specified and the scanner defaulted to 7-bit because the
.Fl Cf
or
.Fl CF
table compression options were used.
See the discussion of the
.Fl 7
flag for details.
.It flex scanner push-back overflow
.Fn unput
was used to push back so much text that the scanner's buffer
could not hold both the pushed-back text and the current token in
.Fa yytext .
Ideally the scanner should dynamically resize the buffer in this case,
but at present it does not.
.It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
The scanner was working on matching an extremely large token and needed
to expand the input buffer.
This doesn't work with scanners that use
.Em REJECT .
.It "fatal flex scanner internal error--end of buffer missed"
This can occur in a scanner which is reentered after a long-jump
has jumped out
.Pq or over
the scanner's activation frame.
Before reentering the scanner, use:
.Pp
.Dl yyrestart(yyin);
.Pp
or, as noted above, switch to using the C++ scanner class.
.It "too many start conditions in <> construct!"
More start conditions than exist were listed in a <> construct
(so at least one of them must have been listed twice).
.El
.Sh SEE ALSO
.Xr awk 1 ,
.Xr sed 1 ,
.Xr yacc 1
.Rs
.\" 4.4BSD PSD:16
.%A M. E. Lesk
.%T Lex \(em Lexical Analyzer Generator
.%I AT&T Bell Laboratories
.%R Computing Science Technical Report
.%N 39
.%D October 1975
.Re
.Rs
.%A John Levine
.%A Tony Mason
.%A Doug Brown
.%B Lex & Yacc
.%I O'Reilly and Associates
.%N 2nd edition
.Re
.Rs
.%A Alfred Aho
.%A Ravi Sethi
.%A Jeffrey Ullman
.%B Compilers: Principles, Techniques and Tools
.%I Addison-Wesley
.%D 1986
.%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
.Re
.Sh STANDARDS
The
.Nm lex
utility is compliant with the
.St -p1003.1-2008
specification,
though its presence is optional.
.Pp
The flags
.Op Fl 78BbCdFfhIiLloPpSsTVw+? ,
.Op Fl -help ,
and
.Op Fl -version
are extensions to that specification.
.Pp
See also the
.Sx INCOMPATIBILITIES WITH LEX AND POSIX
section, above.
.Sh AUTHORS
Vern Paxson, with the help of many ideas and much inspiration from
Van Jacobson.
Original version by Jef Poskanzer.
The fast table representation is a partial implementation of a design done by
Van Jacobson.
The implementation was done by Kevin Gong and Vern Paxson.
.Pp
Thanks to the many
.Nm
beta-testers, feedbackers, and contributors, especially Francois Pinard,
Casey Leedom,
Robert Abramovitz,
Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
Neal Becker, Nelson H.F. Beebe,
.Mt benson@odi.com ,
Karl Berry, Peter A. Bigot, Simon Blanchard,
Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
Brian Clapper, J.T. Conklin,
Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
Daniels, Chris G. Demetriou, Theo de Raadt,
Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
Jan Hajic, Charles Hemphill, NORO Hideo,
Jarkko Hietaniemi, Scott Hofmann,
Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
Amir Katz,
.Mt ken@ken.hilco.com ,
Kevin B. Kenny,
Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
David Loffredo, Mike Long,
Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
Bengt Martensson, Chris Metcalf,
Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
G.T.
Nicol, Landon Noll, James Nordby, Marc Nozell,
Richard Ohnemus, Karsten Pahnke,
Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
Frederic Raimbault, Pat Rankin, Rick Richardson,
Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
Andreas Scherer, Darrell Schiebel, Raf Schietekat,
Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strvmquist,
Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
Chris Thewalt, Richard M. Timoney, Jodi Tsai,
Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
and those whose names have slipped my marginal mail-archiving skills
but whose contributions are appreciated all the
same.
.Pp
Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
distribution headaches.
.Pp
Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
to Benson Margulies and Fred Burke for C++ support;
to Kent Williams and Tom Epperly for C++ class support;
to Ove Ewerlid for support of NUL's;
and to Eric Hughes for support of multiple buffers.
.Pp
This work was primarily done when I was with the Real Time Systems Group
at the Lawrence Berkeley Laboratory in Berkeley, CA.
Many thanks to all there for the support I received.
.Pp
Send comments to
.Aq Mt vern@ee.lbl.gov .
.Sh BUGS
Some trailing context patterns cannot be properly matched and generate
warning messages
.Pq "dangerous trailing context" .
These are patterns where the ending of the first part of the rule
matches the beginning of the second part, such as
.Qq zx*/xy* ,
where the
.Sq x*
matches the
.Sq x
at the beginning of the trailing context.
(Note that the POSIX draft states that the text matched by such patterns
is undefined.)
.Pp
For some trailing context rules, parts which are actually fixed-length are
not recognized as such, leading to the above-mentioned performance loss.
In particular, parts using
.Sq |\&
or
.Sq {n}
(such as
.Qq foo{3} )
are always considered variable-length.
.Pp
Combining trailing context with the special
.Sq |\&
action can result in fixed trailing context being turned into
the more expensive variable trailing context.
This happens, for example, in the following:
.Bd -literal -offset indent
%%
abc      |
xyz/def
.Ed
.Pp
Use of
.Fn unput
invalidates
.Fa yytext
and
.Fa yyleng ,
unless the
.Dq %array
directive
or the
.Fl l
option has been used.
.Pp
Pattern-matching of NUL's is substantially slower than matching other
characters.
.Pp
Dynamic resizing of the input buffer is slow, as it entails rescanning
all the text matched so far by the current
.Pq generally huge
token.
.Pp
Due to both buffering of input and read-ahead,
it is not possible to intermix calls to
.In stdio.h
routines, such as
.Fn getchar ,
with
.Nm
rules and expect it to work.
Call
.Fn input
instead.
.Pp
The total number of table entries listed by the
.Fl v
flag excludes the number of table entries needed to determine
what rule has been matched.
The number of such entries is equal to the number of DFA states
if the scanner does not use
.Em REJECT ,
and somewhat greater than the number of states if it does.
.Pp
.Em REJECT
cannot be used with the
.Fl f
or
.Fl F
options.
.Pp
The
.Nm
internal algorithms need documentation.