Markdown parser fork with extended syntax for personal use.
//! HTML (flow) occurs in the [flow][] content type.
//!
//! ## Grammar
//!
//! HTML (flow) forms with the following BNF
//! (<small>see [construct][crate::construct] for character groups</small>):
//!
//! ```bnf
//! html_flow ::= raw | comment | instruction | declaration | cdata | basic | complete
//!
//! ; Note: closing tag name does not need to match opening tag name.
//! raw ::= '<' raw_tag_name [[space_or_tab *line | '>' *line] eol] *(*line eol) ['</' raw_tag_name *line]
//! comment ::= '<!--' [*'-' '>' *line | *line *(eol *line) ['-->' *line]]
//! instruction ::= '<?' ['>' *line | *line *(eol *line) ['?>' *line]]
//! declaration ::= '<!' ascii_alphabetic *line *(eol *line) ['>' *line]
//! cdata ::= '<![CDATA[' *line *(eol *line) [']]>' *line]
//! basic ::= '<' ['/'] basic_tag_name [['/'] '>' *line *(eol 1*line)]
//! complete ::= (opening_tag | closing_tag) [*space_or_tab *(eol 1*line)]
//!
//! raw_tag_name ::= 'pre' | 'script' | 'style' | 'textarea' ; Note: case-insensitive.
//! basic_tag_name ::= 'address' | 'article' | 'aside' | ... ; See `constants.rs`, and note: case-insensitive.
//! opening_tag ::= '<' tag_name *(1*space_or_tab attribute) [*space_or_tab '/'] *space_or_tab '>'
//! closing_tag ::= '</' tag_name *space_or_tab '>'
//! tag_name ::= ascii_alphabetic *('-' | ascii_alphanumeric)
//! attribute ::= attribute_name [*space_or_tab '=' *space_or_tab attribute_value]
//! attribute_name ::= (':' | '_' | ascii_alphabetic) *('-' | '.' | ':' | '_' | ascii_alphanumeric)
//! attribute_value ::= '"' *(line - '"') '"' | "'" *(line - "'")  "'" | 1*(text - '"' - "'" - '/' - '<' - '=' - '>' - '`')
//! ```
//!
//! As this construct occurs in flow, like all flow constructs, it must be
//! followed by an eol (line ending) or eof (end of file).
//!
//! The grammar for HTML in markdown does not follow the rules of parsing
//! HTML according to the [*§ 13.2 Parsing HTML documents* in the HTML
//! spec][html_parsing].
//! As such, HTML in markdown *resembles* HTML, but is instead a (naïve?)
//! attempt to parse an XML-like language.
//! By extension, another notable property of the grammar is that it can
//! result in invalid HTML, in that it allows things that wouldn’t work or
//! wouldn’t work well in HTML, such as mismatched tags.
//!
//! Interestingly, most of the productions above have a clear opening and
//! closing condition (raw, comment, instruction, declaration, cdata), but the
//! closing condition does not need to be satisfied.
//! In this case, the parser never has to backtrack.
//!
//! Because the **basic** and **complete** productions in the grammar form with
//! a tag, followed by more stuff, and stop at a blank line, it is possible to
//! interleave (a word for switching between languages) markdown and HTML
//! together, by placing the opening and closing tags on their own lines,
//! with blank lines between them and markdown.
//! For example:
//!
//! ```markdown
//! <div>This is <code>code</code> but this is not *emphasis*.</div>
//!
//! <div>
//!
//! This is a paragraph in a `div` and with `code` and *emphasis*.
//!
//! </div>
//! ```
//!
//! The **complete** production of HTML (flow) is not allowed to interrupt
//! content.
//! That means that a blank line is needed between a [paragraph][] and it.
//! However, [HTML (text)][html_text] has a similar production, which will
//! typically kick in instead.
//!
//! The list of tag names allowed in the **raw** production is defined in
//! [`HTML_RAW_NAMES`][].
//! This production exists because there are a few cases where markdown
//! *inside* some elements, and hence interleaving, does not make sense.
//!
//! The list of tag names allowed in the **basic** production is defined in
//! [`HTML_BLOCK_NAMES`][].
//! This production exists because there are a few cases where we can decide
//! early that something is going to be a flow (block) element instead of a
//! phrasing (inline) element.
//! We *can* interrupt and don’t have to care too much about it being
//! well-formed.
//!
//! ## Tokens
//!
//! * [`HtmlFlow`][Name::HtmlFlow]
//! * [`HtmlFlowData`][Name::HtmlFlowData]
//! * [`LineEnding`][Name::LineEnding]
//!
//! ## References
//!
//! * [`html-flow.js` in `micromark`](https://github.com/micromark/micromark/blob/main/packages/micromark-core-commonmark/dev/lib/html-flow.js)
//! * [*§ 4.6 HTML blocks* in `CommonMark`](https://spec.commonmark.org/0.31/#html-blocks)
//!
//! [flow]: crate::construct::flow
//! [html_text]: crate::construct::html_text
//! [paragraph]: crate::construct::paragraph
//! [html_raw_names]: crate::util::constant::HTML_RAW_NAMES
//! [html_block_names]: crate::util::constant::HTML_BLOCK_NAMES
//! [html_parsing]: https://html.spec.whatwg.org/multipage/parsing.html#parsing

use crate::construct::partial_space_or_tab::{
    space_or_tab_with_options, Options as SpaceOrTabOptions,
};
use crate::event::Name;
use crate::state::{Name as StateName, State};
use crate::tokenizer::Tokenizer;
use crate::util::{
    constant::{HTML_BLOCK_NAMES, HTML_CDATA_PREFIX, HTML_RAW_NAMES, HTML_RAW_SIZE_MAX, TAB_SIZE},
    slice::Slice,
};

/// Symbol for `<script>` (condition 1).
const RAW: u8 = 1;
/// Symbol for `<!---->` (condition 2).
const COMMENT: u8 = 2;
/// Symbol for `<?php?>` (condition 3).
const INSTRUCTION: u8 = 3;
/// Symbol for `<!doctype>` (condition 4).
const DECLARATION: u8 = 4;
/// Symbol for `<![CDATA[]]>` (condition 5).
const CDATA: u8 = 5;
/// Symbol for `<div` (condition 6).
const BASIC: u8 = 6;
/// Symbol for `<x>` (condition 7).
const COMPLETE: u8 = 7;

/// Start of HTML (flow).
///
/// ```markdown
/// > | <x />
///     ^
/// ```
pub fn start(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.parse_state.options.constructs.html_flow {
        tokenizer.enter(Name::HtmlFlow);

        if matches!(tokenizer.current, Some(b'\t' | b' ')) {
            tokenizer.attempt(State::Next(StateName::HtmlFlowBefore), State::Nok);
            State::Retry(space_or_tab_with_options(
                tokenizer,
                SpaceOrTabOptions {
                    kind: Name::HtmlFlowData,
                    min: 0,
                    max: if tokenizer.parse_state.options.constructs.code_indented {
                        TAB_SIZE - 1
                    } else {
                        usize::MAX
                    },
                    connect: false,
                    content: None,
                },
            ))
        } else {
            State::Retry(StateName::HtmlFlowBefore)
        }
    } else {
        State::Nok
    }
}

/// At `<`, after optional whitespace.
///
/// ```markdown
/// > | <x />
///     ^
/// ```
pub fn before(tokenizer: &mut Tokenizer) -> State {
    if Some(b'<') == tokenizer.current {
        tokenizer.enter(Name::HtmlFlowData);
        tokenizer.consume();
        State::Next(StateName::HtmlFlowOpen)
    } else {
        State::Nok
    }
}

/// After `<`, at tag name or other stuff.
///
/// ```markdown
/// > | <x />
///      ^
/// > | <!doctype>
///      ^
/// > | <!--xxx-->
///      ^
/// ```
pub fn open(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'!') => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowDeclarationOpen)
        }
        Some(b'/') => {
            tokenizer.consume();
            tokenizer.tokenize_state.seen = true;
            tokenizer.tokenize_state.start = tokenizer.point.index;
            State::Next(StateName::HtmlFlowTagCloseStart)
        }
        Some(b'?') => {
            tokenizer.consume();
            tokenizer.tokenize_state.marker = INSTRUCTION;
            // Do not form containers.
            tokenizer.concrete = true;
            // While we’re in an instruction instead of a declaration, we’re on a `?`
            // right now, so we do need to search for `>`, similar to declarations.
            State::Next(StateName::HtmlFlowContinuationDeclarationInside)
        }
        // ASCII alphabetical.
        Some(b'A'..=b'Z' | b'a'..=b'z') => {
            tokenizer.tokenize_state.start = tokenizer.point.index;
            State::Retry(StateName::HtmlFlowTagName)
        }
        _ => State::Nok,
    }
}

/// After `<!`, at declaration, comment, or CDATA.
///
/// ```markdown
/// > | <!doctype>
///       ^
/// > | <!--xxx-->
///       ^
/// > | <![CDATA[>&<]]>
///       ^
/// ```
pub fn declaration_open(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'-') => {
            tokenizer.consume();
            tokenizer.tokenize_state.marker = COMMENT;
            State::Next(StateName::HtmlFlowCommentOpenInside)
        }
        Some(b'A'..=b'Z' | b'a'..=b'z') => {
            tokenizer.consume();
            tokenizer.tokenize_state.marker = DECLARATION;
            // Do not form containers.
            tokenizer.concrete = true;
            State::Next(StateName::HtmlFlowContinuationDeclarationInside)
        }
        Some(b'[') => {
            tokenizer.consume();
            tokenizer.tokenize_state.marker = CDATA;
            State::Next(StateName::HtmlFlowCdataOpenInside)
        }
        _ => State::Nok,
    }
}

/// After `<!-`, inside a comment, at another `-`.
///
/// ```markdown
/// > | <!--xxx-->
///        ^
/// ```
pub fn comment_open_inside(tokenizer: &mut Tokenizer) -> State {
    if let Some(b'-') = tokenizer.current {
        tokenizer.consume();
        // Do not form containers.
        tokenizer.concrete = true;
        State::Next(StateName::HtmlFlowContinuationDeclarationInside)
    } else {
        tokenizer.tokenize_state.marker = 0;
        State::Nok
    }
}

/// After `<![`, inside CDATA, expecting `CDATA[`.
///
/// ```markdown
/// > | <![CDATA[>&<]]>
///        ^^^^^^
/// ```
pub fn cdata_open_inside(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.current == Some(HTML_CDATA_PREFIX[tokenizer.tokenize_state.size]) {
        tokenizer.consume();
        tokenizer.tokenize_state.size += 1;

        if tokenizer.tokenize_state.size == HTML_CDATA_PREFIX.len() {
            tokenizer.tokenize_state.size = 0;
            // Do not form containers.
            tokenizer.concrete = true;
            State::Next(StateName::HtmlFlowContinuation)
        } else {
            State::Next(StateName::HtmlFlowCdataOpenInside)
        }
    } else {
        tokenizer.tokenize_state.marker = 0;
        tokenizer.tokenize_state.size = 0;
        State::Nok
    }
}

/// After `</`, in closing tag, at tag name.
///
/// ```markdown
/// > | </x>
///       ^
/// ```
pub fn tag_close_start(tokenizer: &mut Tokenizer) -> State {
    if let Some(b'A'..=b'Z' | b'a'..=b'z') = tokenizer.current {
        tokenizer.consume();
        State::Next(StateName::HtmlFlowTagName)
    } else {
        tokenizer.tokenize_state.seen = false;
        tokenizer.tokenize_state.start = 0;
        State::Nok
    }
}

/// In tag name.
///
/// ```markdown
/// > | <ab>
///      ^^
/// > | </ab>
///       ^^
/// ```
pub fn tag_name(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        None | Some(b'\t' | b'\n' | b' ' | b'/' | b'>') => {
            let closing_tag = tokenizer.tokenize_state.seen;
            let slash = matches!(tokenizer.current, Some(b'/'));
            // Guaranteed to be valid ASCII bytes.
            let slice = Slice::from_indices(
                tokenizer.parse_state.bytes,
                tokenizer.tokenize_state.start,
                tokenizer.point.index,
            );
            let name = slice
                .as_str()
                // The line ending case might result in a `\r` that is already accounted for.
                .trim()
                .to_ascii_lowercase();
            tokenizer.tokenize_state.seen = false;
            tokenizer.tokenize_state.start = 0;

            if !slash && !closing_tag && HTML_RAW_NAMES.contains(&name.as_str()) {
                tokenizer.tokenize_state.marker = RAW;
                // Do not form containers.
                tokenizer.concrete = true;
                State::Retry(StateName::HtmlFlowContinuation)
            } else if HTML_BLOCK_NAMES.contains(&name.as_str()) {
                tokenizer.tokenize_state.marker = BASIC;

                if slash {
                    tokenizer.consume();
                    State::Next(StateName::HtmlFlowBasicSelfClosing)
                } else {
                    // Do not form containers.
                    tokenizer.concrete = true;
                    State::Retry(StateName::HtmlFlowContinuation)
                }
            } else {
                tokenizer.tokenize_state.marker = COMPLETE;

                // Do not support complete HTML when interrupting.
                if tokenizer.interrupt && !tokenizer.lazy {
                    tokenizer.tokenize_state.marker = 0;
                    State::Nok
                } else if closing_tag {
                    State::Retry(StateName::HtmlFlowCompleteClosingTagAfter)
                } else {
                    State::Retry(StateName::HtmlFlowCompleteAttributeNameBefore)
                }
            }
        }
        // ASCII alphanumerical and `-`.
        Some(b'-' | b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z') => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowTagName)
        }
        Some(_) => {
            tokenizer.tokenize_state.seen = false;
            State::Nok
        }
    }
}

/// After closing slash of a basic tag name.
///
/// ```markdown
/// > | <div/>
///          ^
/// ```
pub fn basic_self_closing(tokenizer: &mut Tokenizer) -> State {
    if let Some(b'>') = tokenizer.current {
        tokenizer.consume();
        // Do not form containers.
        tokenizer.concrete = true;
        State::Next(StateName::HtmlFlowContinuation)
    } else {
        tokenizer.tokenize_state.marker = 0;
        State::Nok
    }
}

/// After closing slash of a complete tag name.
///
/// ```markdown
/// > | <x/>
///        ^
/// ```
pub fn complete_closing_tag_after(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'\t' | b' ') => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowCompleteClosingTagAfter)
        }
        _ => State::Retry(StateName::HtmlFlowCompleteEnd),
    }
}

/// At an attribute name.
///
/// At first, this state is used after a complete tag name, after whitespace,
/// where it expects optional attributes or the end of the tag.
/// It is also reused after attributes, when expecting more optional
/// attributes.
///
/// ```markdown
/// > | <a />
///        ^
/// > | <a :b>
///        ^
/// > | <a _b>
///        ^
/// > | <a b>
///        ^
/// > | <a >
///        ^
/// ```
pub fn complete_attribute_name_before(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'\t' | b' ') => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowCompleteAttributeNameBefore)
        }
        Some(b'/') => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowCompleteEnd)
        }
        // ASCII alphanumerical and `:` and `_`.
        Some(b'0'..=b'9' | b':' | b'A'..=b'Z' | b'_' | b'a'..=b'z') => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowCompleteAttributeName)
        }
        _ => State::Retry(StateName::HtmlFlowCompleteEnd),
    }
}

/// In attribute name.
///
/// ```markdown
/// > | <a :b>
///         ^
/// > | <a _b>
///         ^
/// > | <a b>
///         ^
/// ```
pub fn complete_attribute_name(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // ASCII alphanumerical and `-`, `.`, `:`, and `_`.
        Some(b'-' | b'.' | b'0'..=b'9' | b':' | b'A'..=b'Z' | b'_' | b'a'..=b'z') => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowCompleteAttributeName)
        }
        _ => State::Retry(StateName::HtmlFlowCompleteAttributeNameAfter),
    }
}

/// After attribute name, at an optional initializer, the end of the tag, or
/// whitespace.
///
/// ```markdown
/// > | <a b>
///         ^
/// > | <a b=c>
///         ^
/// ```
pub fn complete_attribute_name_after(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'\t' | b' ') => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowCompleteAttributeNameAfter)
        }
        Some(b'=') => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowCompleteAttributeValueBefore)
        }
        _ => State::Retry(StateName::HtmlFlowCompleteAttributeNameBefore),
    }
}

/// Before unquoted, double quoted, or single quoted attribute value, allowing
/// whitespace.
///
/// ```markdown
/// > | <a b=c>
///          ^
/// > | <a b="c">
///          ^
/// ```
pub fn complete_attribute_value_before(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        None | Some(b'<' | b'=' | b'>' | b'`') => {
            tokenizer.tokenize_state.marker = 0;
            State::Nok
        }
        Some(b'\t' | b' ') => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowCompleteAttributeValueBefore)
        }
        Some(b'"' | b'\'') => {
            tokenizer.tokenize_state.marker_b = tokenizer.current.unwrap();
            tokenizer.consume();
            State::Next(StateName::HtmlFlowCompleteAttributeValueQuoted)
        }
        _ => State::Retry(StateName::HtmlFlowCompleteAttributeValueUnquoted),
    }
}

/// In double or single quoted attribute value.
///
/// ```markdown
/// > | <a b="c">
///           ^
/// > | <a b='c'>
///           ^
/// ```
pub fn complete_attribute_value_quoted(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.current == Some(tokenizer.tokenize_state.marker_b) {
        tokenizer.consume();
        tokenizer.tokenize_state.marker_b = 0;
        State::Next(StateName::HtmlFlowCompleteAttributeValueQuotedAfter)
    } else if matches!(tokenizer.current, None | Some(b'\n')) {
        tokenizer.tokenize_state.marker = 0;
        tokenizer.tokenize_state.marker_b = 0;
        State::Nok
    } else {
        tokenizer.consume();
        State::Next(StateName::HtmlFlowCompleteAttributeValueQuoted)
    }
}

/// In unquoted attribute value.
///
/// ```markdown
/// > | <a b=c>
///          ^
/// ```
pub fn complete_attribute_value_unquoted(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        None | Some(b'\t' | b'\n' | b' ' | b'"' | b'\'' | b'/' | b'<' | b'=' | b'>' | b'`') => {
            State::Retry(StateName::HtmlFlowCompleteAttributeNameAfter)
        }
        Some(_) => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowCompleteAttributeValueUnquoted)
        }
    }
}

/// After double or single quoted attribute value, before whitespace or the
/// end of the tag.
///
/// ```markdown
/// > | <a b="c">
///             ^
/// ```
pub fn complete_attribute_value_quoted_after(tokenizer: &mut Tokenizer) -> State {
    if let Some(b'\t' | b' ' | b'/' | b'>') = tokenizer.current {
        State::Retry(StateName::HtmlFlowCompleteAttributeNameBefore)
    } else {
        tokenizer.tokenize_state.marker = 0;
        State::Nok
    }
}

/// In certain circumstances of a complete tag where only an `>` is allowed.
///
/// ```markdown
/// > | <a b="c">
///             ^
/// ```
pub fn complete_end(tokenizer: &mut Tokenizer) -> State {
    if let Some(b'>') = tokenizer.current {
        tokenizer.consume();
        State::Next(StateName::HtmlFlowCompleteAfter)
    } else {
        tokenizer.tokenize_state.marker = 0;
        State::Nok
    }
}

/// After `>` in a complete tag.
///
/// ```markdown
/// > | <x>
///        ^
/// ```
pub fn complete_after(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        None | Some(b'\n') => {
            // Do not form containers.
            tokenizer.concrete = true;
            State::Retry(StateName::HtmlFlowContinuation)
        }
        Some(b'\t' | b' ') => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowCompleteAfter)
        }
        Some(_) => {
            tokenizer.tokenize_state.marker = 0;
            State::Nok
        }
    }
}

/// In continuation of any HTML kind.
///
/// ```markdown
/// > | <!--xxx-->
///         ^
/// ```
pub fn continuation(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.tokenize_state.marker == COMMENT && tokenizer.current == Some(b'-') {
        tokenizer.consume();
        State::Next(StateName::HtmlFlowContinuationCommentInside)
    } else if tokenizer.tokenize_state.marker == RAW && tokenizer.current == Some(b'<') {
        tokenizer.consume();
        State::Next(StateName::HtmlFlowContinuationRawTagOpen)
    } else if tokenizer.tokenize_state.marker == DECLARATION && tokenizer.current == Some(b'>') {
        tokenizer.consume();
        State::Next(StateName::HtmlFlowContinuationClose)
    } else if tokenizer.tokenize_state.marker == INSTRUCTION && tokenizer.current == Some(b'?') {
        tokenizer.consume();
        State::Next(StateName::HtmlFlowContinuationDeclarationInside)
    } else if tokenizer.tokenize_state.marker == CDATA && tokenizer.current == Some(b']') {
        tokenizer.consume();
        State::Next(StateName::HtmlFlowContinuationCdataInside)
    } else if matches!(tokenizer.tokenize_state.marker, BASIC | COMPLETE)
        && tokenizer.current == Some(b'\n')
    {
        tokenizer.exit(Name::HtmlFlowData);
        tokenizer.check(
            State::Next(StateName::HtmlFlowContinuationAfter),
            State::Next(StateName::HtmlFlowContinuationStart),
        );
        State::Retry(StateName::HtmlFlowBlankLineBefore)
    } else if matches!(tokenizer.current, None | Some(b'\n')) {
        tokenizer.exit(Name::HtmlFlowData);
        State::Retry(StateName::HtmlFlowContinuationStart)
    } else {
        tokenizer.consume();
        State::Next(StateName::HtmlFlowContinuation)
    }
}

/// In continuation, at eol.
///
/// ```markdown
/// > | <x>
///        ^
///   | asd
/// ```
pub fn continuation_start(tokenizer: &mut Tokenizer) -> State {
    tokenizer.check(
        State::Next(StateName::HtmlFlowContinuationStartNonLazy),
        State::Next(StateName::HtmlFlowContinuationAfter),
    );
    State::Retry(StateName::NonLazyContinuationStart)
}

/// In continuation, at eol, before non-lazy content.
///
/// ```markdown
/// > | <x>
///        ^
///   | asd
/// ```
pub fn continuation_start_non_lazy(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'\n') => {
            tokenizer.enter(Name::LineEnding);
            tokenizer.consume();
            tokenizer.exit(Name::LineEnding);
            State::Next(StateName::HtmlFlowContinuationBefore)
        }
        _ => unreachable!("expected eol"),
    }
}

/// In continuation, before non-lazy content.
///
/// ```markdown
///   | <x>
/// > | asd
///     ^
/// ```
pub fn continuation_before(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        None | Some(b'\n') => State::Retry(StateName::HtmlFlowContinuationStart),
        _ => {
            tokenizer.enter(Name::HtmlFlowData);
            State::Retry(StateName::HtmlFlowContinuation)
        }
    }
}

/// In comment continuation, after one `-`, expecting another.
///
/// ```markdown
/// > | <!--xxx-->
///             ^
/// ```
pub fn continuation_comment_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'-') => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowContinuationDeclarationInside)
        }
        _ => State::Retry(StateName::HtmlFlowContinuation),
    }
}

/// In raw continuation, after `<`, at `/`.
///
/// ```markdown
/// > | <script>console.log(1)</script>
///                            ^
/// ```
pub fn continuation_raw_tag_open(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'/') => {
            tokenizer.consume();
            tokenizer.tokenize_state.start = tokenizer.point.index;
            State::Next(StateName::HtmlFlowContinuationRawEndTag)
        }
        _ => State::Retry(StateName::HtmlFlowContinuation),
    }
}

/// In raw continuation, after `</`, in a raw tag name.
///
/// ```markdown
/// > | <script>console.log(1)</script>
///                             ^^^^^^
/// ```
pub fn continuation_raw_end_tag(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'>') => {
            // Guaranteed to be valid ASCII bytes.
            let slice = Slice::from_indices(
                tokenizer.parse_state.bytes,
                tokenizer.tokenize_state.start,
                tokenizer.point.index,
            );
            let name = slice.as_str().to_ascii_lowercase();

            tokenizer.tokenize_state.start = 0;

            if HTML_RAW_NAMES.contains(&name.as_str()) {
                tokenizer.consume();
                State::Next(StateName::HtmlFlowContinuationClose)
            } else {
                State::Retry(StateName::HtmlFlowContinuation)
            }
        }
        Some(b'A'..=b'Z' | b'a'..=b'z')
            if tokenizer.point.index - tokenizer.tokenize_state.start < HTML_RAW_SIZE_MAX =>
        {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowContinuationRawEndTag)
        }
        _ => {
            tokenizer.tokenize_state.start = 0;
            State::Retry(StateName::HtmlFlowContinuation)
        }
    }
}

/// In cdata continuation, after `]`, expecting `]>`.
///
/// ```markdown
/// > | <![CDATA[>&<]]>
///                  ^
/// ```
pub fn continuation_cdata_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b']') => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowContinuationDeclarationInside)
        }
        _ => State::Retry(StateName::HtmlFlowContinuation),
    }
}

/// In declaration or instruction continuation, at `>`.
///
/// ```markdown
/// > | <!-->
///         ^
/// > | <?>
///       ^
/// > | <!q>
///        ^
/// > | <!--ab-->
///             ^
/// > | <![CDATA[>&<]]>
///                   ^
/// ```
pub fn continuation_declaration_inside(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.tokenize_state.marker == COMMENT && tokenizer.current == Some(b'-') {
        tokenizer.consume();
        State::Next(StateName::HtmlFlowContinuationDeclarationInside)
    } else if tokenizer.current == Some(b'>') {
        tokenizer.consume();
        State::Next(StateName::HtmlFlowContinuationClose)
    } else {
        State::Retry(StateName::HtmlFlowContinuation)
    }
}

/// In closed continuation: everything we get until the eol/eof is part of it.
///
/// ```markdown
/// > | <!doctype>
///               ^
/// ```
pub fn continuation_close(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        None | Some(b'\n') => {
            tokenizer.exit(Name::HtmlFlowData);
            State::Retry(StateName::HtmlFlowContinuationAfter)
        }
        _ => {
            tokenizer.consume();
            State::Next(StateName::HtmlFlowContinuationClose)
        }
    }
}

/// Done.
///
/// ```markdown
/// > | <!doctype>
///               ^
/// ```
pub fn continuation_after(tokenizer: &mut Tokenizer) -> State {
    tokenizer.exit(Name::HtmlFlow);
    tokenizer.tokenize_state.marker = 0;
    // Feel free to interrupt.
    tokenizer.interrupt = false;
    // No longer concrete.
    tokenizer.concrete = false;
    State::Ok
}

/// Before eol, expecting blank line.
///
/// ```markdown
/// > | <div>
///          ^
///   |
/// ```
pub fn blank_line_before(tokenizer: &mut Tokenizer) -> State {
    tokenizer.enter(Name::LineEnding);
    tokenizer.consume();
    tokenizer.exit(Name::LineEnding);
    State::Next(StateName::BlankLineStart)
}