Markdown parser fork with extended syntax for personal use.
//! GFM: Footnote definition occurs in the [document][] content type.
//!
//! ## Grammar
//!
//! Footnote definitions form with the following BNF
//! (<small>see [construct][crate::construct] for character groups</small>):
//!
//! ```bnf
//! ; Restriction: `label` must start with `^` (and not be empty after it).
//! ; See the `label` construct for the BNF of that part.
//! gfm_footnote_definition_start ::= label ':' *space_or_tab
//!
//! ; Restriction: blank line allowed.
//! gfm_footnote_definition_cont ::= 4(space_or_tab)
//! ```
//!
//! Further lines that are not prefixed with `gfm_footnote_definition_cont`
//! cause the footnote definition to be exited, except when those lines are
//! lazy continuation or blank.
//! Like so many things in markdown, footnote definitions too are complex.
//! See [*§ Phase 1: block structure* in `CommonMark`][commonmark_block] for
//! more on parsing details.
//!
//! See [`label`][label] for grammar, notes, and recommendations on that part.
//!
//! The `label` part is interpreted as the [string][] content type.
//! That means that [character escapes][character_escape] and
//! [character references][character_reference] are allowed.
//!
//! Definitions match to calls through identifiers.
//! To match, both labels must be equal after normalizing with
//! [`normalize_identifier`][].
//! One definition can match to multiple calls.
//! Multiple definitions with the same, normalized, identifier are ignored: the
//! first definition is preferred.
//! To illustrate, the definition with the content of `x` wins:
//!
//! ```markdown
//! [^a]: x
//! [^a]: y
//!
//! [^a]
//! ```
//!
//! Importantly, while labels *can* include [string][] content (character
//! escapes and character references), these are not considered when matching.
//! To illustrate, neither definition matches the call:
//!
//! ```markdown
//! [^a&amp;b]: x
//! [^a\&b]: y
//!
//! [^a&b]
//! ```
//!
//! Because footnote definitions are containers (like block quotes and list
//! items), they can contain more footnote definitions, and they can include
//! calls to themselves.
//!
//! ## HTML
//!
//! GFM footnote definitions do not, on their own, relate to anything in HTML.
//! When matched with a [label end][label_end], which in turn matches to a
//! [GFM label start (footnote)][gfm_label_start_footnote], the definition
//! relates to several elements in HTML.
//!
//! When one or more definitions are called, a footnote section is generated
//! at the end of the document, using `<section>`, `<h2>`, and `<ol>` elements:
//!
//! ```html
//! <section data-footnotes="" class="footnotes"><h2 id="footnote-label" class="sr-only">Footnotes</h2>
//! <ol>…</ol>
//! </section>
//! ```
//!
//! Each definition is generated as a `<li>` in the `<ol>`, in the order they
//! were first called:
//!
//! ```html
//! <li id="user-content-fn-1">…</li>
//! ```
//!
//! Backreferences are injected at the end of the first paragraph, or, when
//! there is no paragraph, at the end of the definition.
//! When a definition is called multiple times, multiple backreferences are
//! generated.
//! Further backreferences use an extra counter in the `href` attribute and
//! visually in a `<sup>` after `↩`.
//!
//! ```html
//! <a href="#user-content-fnref-1" data-footnote-backref="" class="data-footnote-backref" aria-label="Back to content">↩</a> <a href="#user-content-fnref-1-2" data-footnote-backref="" class="data-footnote-backref" aria-label="Back to content">↩<sup>2</sup></a>
//! ```
//!
//! See
//! [*§ 4.5.1 The `a` element*][html_a],
//! [*§ 4.3.6 The `h1`, `h2`, `h3`, `h4`, `h5`, and `h6` elements*][html_h],
//! [*§ 4.4.8 The `li` element*][html_li],
//! [*§ 4.4.5 The `ol` element*][html_ol],
//! [*§ 4.4.1 The `p` element*][html_p],
//! [*§ 4.3.3 The `section` element*][html_section], and
//! [*§ 4.5.19 The `sub` and `sup` elements*][html_sup]
//! in the HTML spec for more info.
//!
//! ## Recommendation
//!
//! When authoring markdown with footnotes, it’s recommended to use words
//! instead of numbers (or letters or anything with an order) as calls.
//! That makes it easier to reuse and reorder footnotes.
//!
//! It’s recommended to place footnote definitions at the bottom of the document.
//!
//! ## Bugs
//!
//! GitHub’s own algorithm to parse footnote definitions contains several bugs.
//! These are not present in this project.
//! The issues relating to footnote definitions are:
//!
//! * [Footnote reference call identifiers are trimmed, but definition identifiers aren’t](https://github.com/github/cmark-gfm/issues/237)\
//!   — initial and final whitespace in labels causes them not to match
//! * [Footnotes are matched case-insensitive, but links keep their casing, breaking them](https://github.com/github/cmark-gfm/issues/239)\
//!   — using uppercase (or any character that will be percent encoded) in identifiers breaks links
//! * [Colons in footnotes generate links w/o `href`](https://github.com/github/cmark-gfm/issues/250)\
//!   — colons in identifiers generate broken links
//! * [Character escape of `]` does not work in footnote identifiers](https://github.com/github/cmark-gfm/issues/240)\
//!   — some character escapes don’t work
//! * [Footnotes in links are broken](https://github.com/github/cmark-gfm/issues/249)\
//!   — while `CommonMark` prevents links in links, GitHub does not prevent footnotes (which turn into links) in links
//! * [Footnote-like brackets around image, break that image](https://github.com/github/cmark-gfm/issues/275)\
//!   — images can’t be used in what looks like a footnote call
//! * [GFM footnotes: line ending in footnote definition label causes text to disappear](https://github.com/github/cmark-gfm/issues/282)\
//!   — line endings in footnote definitions cause text to disappear
//!
//! ## Tokens
//!
//! * [`DefinitionMarker`][Name::DefinitionMarker]
//! * [`GfmFootnoteDefinition`][Name::GfmFootnoteDefinition]
//! * [`GfmFootnoteDefinitionLabel`][Name::GfmFootnoteDefinitionLabel]
//! * [`GfmFootnoteDefinitionLabelMarker`][Name::GfmFootnoteDefinitionLabelMarker]
//! * [`GfmFootnoteDefinitionLabelString`][Name::GfmFootnoteDefinitionLabelString]
//! * [`GfmFootnoteDefinitionMarker`][Name::GfmFootnoteDefinitionMarker]
//! * [`GfmFootnoteDefinitionPrefix`][Name::GfmFootnoteDefinitionPrefix]
//! * [`SpaceOrTab`][Name::SpaceOrTab]
//!
//! ## References
//!
//! * [`micromark-extension-gfm-footnote`](https://github.com/micromark/micromark-extension-gfm-footnote)
//!
//! > 👉 **Note**: Footnotes are not specified in GFM yet.
//! > See [`github/cmark-gfm#270`](https://github.com/github/cmark-gfm/issues/270)
//! > for the related issue.
//!
//! [document]: crate::construct::document
//! [string]: crate::construct::string
//! [character_reference]: crate::construct::character_reference
//! [character_escape]: crate::construct::character_escape
//! [label]: crate::construct::partial_label
//! [label_end]: crate::construct::label_end
//! [gfm_label_start_footnote]: crate::construct::gfm_label_start_footnote
//! [commonmark_block]: https://spec.commonmark.org/0.31/#phase-1-block-structure
//! [html_a]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-a-element
//! [html_h]: https://html.spec.whatwg.org/multipage/sections.html#the-h1,-h2,-h3,-h4,-h5,-and-h6-elements
//! [html_li]: https://html.spec.whatwg.org/multipage/grouping-content.html#the-li-element
//! [html_ol]: https://html.spec.whatwg.org/multipage/grouping-content.html#the-ol-element
//! [html_p]: https://html.spec.whatwg.org/multipage/grouping-content.html#the-p-element
//! [html_section]: https://html.spec.whatwg.org/multipage/sections.html#the-section-element
//! [html_sup]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-sub-and-sup-elements

use crate::construct::partial_space_or_tab::space_or_tab_min_max;
use crate::event::{Content, Link, Name};
use crate::state::{Name as StateName, State};
use crate::tokenizer::Tokenizer;
use crate::util::{
    constant::{LINK_REFERENCE_SIZE_MAX, TAB_SIZE},
    normalize_identifier::normalize_identifier,
    skip,
    slice::{Position, Slice},
};

/// Start of GFM footnote definition.
///
/// ```markdown
/// > | [^a]: b
///     ^
/// ```
pub fn start(tokenizer: &mut Tokenizer) -> State {
    if tokenizer
        .parse_state
        .options
        .constructs
        .gfm_footnote_definition
    {
        tokenizer.enter(Name::GfmFootnoteDefinition);

        if matches!(tokenizer.current, Some(b'\t' | b' ')) {
            tokenizer.attempt(
                State::Next(StateName::GfmFootnoteDefinitionLabelBefore),
                State::Nok,
            );
            State::Retry(space_or_tab_min_max(
                tokenizer,
                1,
                if tokenizer.parse_state.options.constructs.code_indented {
                    TAB_SIZE - 1
                } else {
                    usize::MAX
                },
            ))
        } else {
            State::Retry(StateName::GfmFootnoteDefinitionLabelBefore)
        }
    } else {
        State::Nok
    }
}

/// Before definition label (after optional whitespace).
///
/// ```markdown
/// > | [^a]: b
///     ^
/// ```
pub fn label_before(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'[') => {
            tokenizer.enter(Name::GfmFootnoteDefinitionPrefix);
            tokenizer.enter(Name::GfmFootnoteDefinitionLabel);
            tokenizer.enter(Name::GfmFootnoteDefinitionLabelMarker);
            tokenizer.consume();
            tokenizer.exit(Name::GfmFootnoteDefinitionLabelMarker);
            State::Next(StateName::GfmFootnoteDefinitionLabelAtMarker)
        }
        _ => State::Nok,
    }
}

/// In label, at caret.
///
/// ```markdown
/// > | [^a]: b
///      ^
/// ```
pub fn label_at_marker(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.current == Some(b'^') {
        tokenizer.enter(Name::GfmFootnoteDefinitionMarker);
        tokenizer.consume();
        tokenizer.exit(Name::GfmFootnoteDefinitionMarker);
        tokenizer.enter(Name::GfmFootnoteDefinitionLabelString);
        tokenizer.enter_link(
            Name::Data,
            Link {
                previous: None,
                next: None,
                content: Content::String,
            },
        );
        State::Next(StateName::GfmFootnoteDefinitionLabelInside)
    } else {
        State::Nok
    }
}

/// In label.
///
/// > 👉 **Note**: `cmark-gfm` prevents whitespace from occurring in footnote
/// > definition labels.
///
/// ```markdown
/// > | [^a]: b
///       ^
/// ```
pub fn label_inside(tokenizer: &mut Tokenizer) -> State {
    // Too long.
    if tokenizer.tokenize_state.size > LINK_REFERENCE_SIZE_MAX
        // Space or tab is not supported by GFM for some reason (`\n` and
        // `[` make sense).
        || matches!(tokenizer.current, None | Some(b'\t' | b'\n' | b' ' | b'['))
        // Closing bracket with nothing.
        || (matches!(tokenizer.current, Some(b']')) && tokenizer.tokenize_state.size == 0)
    {
        tokenizer.tokenize_state.size = 0;
        State::Nok
    } else if matches!(tokenizer.current, Some(b']')) {
        tokenizer.tokenize_state.size = 0;
        tokenizer.exit(Name::Data);
        tokenizer.exit(Name::GfmFootnoteDefinitionLabelString);
        tokenizer.enter(Name::GfmFootnoteDefinitionLabelMarker);
        tokenizer.consume();
        tokenizer.exit(Name::GfmFootnoteDefinitionLabelMarker);
        tokenizer.exit(Name::GfmFootnoteDefinitionLabel);
        State::Next(StateName::GfmFootnoteDefinitionLabelAfter)
    } else {
        let next = if matches!(tokenizer.current.unwrap(), b'\\') {
            StateName::GfmFootnoteDefinitionLabelEscape
        } else {
            StateName::GfmFootnoteDefinitionLabelInside
        };
        tokenizer.consume();
        tokenizer.tokenize_state.size += 1;
        State::Next(next)
    }
}

/// After `\`, at a special character.
///
/// > 👉 **Note**: `cmark-gfm` currently does not support escaped brackets:
/// > <https://github.com/github/cmark-gfm/issues/240>
///
/// ```markdown
/// > | [^a\*b]: c
///        ^
/// ```
pub fn label_escape(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'[' | b'\\' | b']') => {
            tokenizer.tokenize_state.size += 1;
            tokenizer.consume();
            State::Next(StateName::GfmFootnoteDefinitionLabelInside)
        }
        _ => State::Retry(StateName::GfmFootnoteDefinitionLabelInside),
    }
}

/// After definition label.
///
/// ```markdown
/// > | [^a]: b
///         ^
/// ```
pub fn label_after(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b':') => {
            let end = skip::to_back(
                &tokenizer.events,
                tokenizer.events.len() - 1,
                &[Name::GfmFootnoteDefinitionLabelString],
            );

            // Note: we don’t care about virtual spaces, so `as_str` is fine.
            let id = normalize_identifier(
                Slice::from_position(
                    tokenizer.parse_state.bytes,
                    &Position::from_exit_event(&tokenizer.events, end),
                )
                .as_str(),
            );

            // Note: we don’t check for uniqueness here.
            // Duplicate identifiers are likely rare, so checking for them
            // would more likely waste time than save it.
            tokenizer.tokenize_state.gfm_footnote_definitions.push(id);

            tokenizer.enter(Name::DefinitionMarker);
            tokenizer.consume();
            tokenizer.exit(Name::DefinitionMarker);
            tokenizer.attempt(
                State::Next(StateName::GfmFootnoteDefinitionWhitespaceAfter),
                State::Nok,
            );
            // Any whitespace after the marker is eaten, so forming indented
            // code is not possible.
            // No space is also fine, just like a block quote marker.
            State::Next(space_or_tab_min_max(tokenizer, 0, usize::MAX))
        }
        _ => State::Nok,
    }
}

/// After definition prefix.
///
/// ```markdown
/// > | [^a]: b
///           ^
/// ```
pub fn whitespace_after(tokenizer: &mut Tokenizer) -> State {
    tokenizer.exit(Name::GfmFootnoteDefinitionPrefix);
    State::Ok
}

/// Start of footnote definition continuation.
///
/// ```markdown
///   | [^a]: b
/// > |     c
///     ^
/// ```
pub fn cont_start(tokenizer: &mut Tokenizer) -> State {
    tokenizer.check(
        State::Next(StateName::GfmFootnoteDefinitionContBlank),
        State::Next(StateName::GfmFootnoteDefinitionContFilled),
    );
    State::Retry(StateName::BlankLineStart)
}

/// Start of footnote definition continuation, at a blank line.
///
/// ```markdown
///   | [^a]: b
/// > | ␠␠␊
///     ^
/// ```
pub fn cont_blank(tokenizer: &mut Tokenizer) -> State {
    if matches!(tokenizer.current, Some(b'\t' | b' ')) {
        State::Retry(space_or_tab_min_max(tokenizer, 0, TAB_SIZE))
    } else {
        State::Ok
    }
}

/// Start of footnote definition continuation, at a filled line.
///
/// ```markdown
///   | [^a]: b
/// > |     c
///     ^
/// ```
pub fn cont_filled(tokenizer: &mut Tokenizer) -> State {
    if matches!(tokenizer.current, Some(b'\t' | b' ')) {
        // Consume exactly `TAB_SIZE`.
        State::Retry(space_or_tab_min_max(tokenizer, TAB_SIZE, TAB_SIZE))
    } else {
        State::Nok
    }
}
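
The matching rules described in the module documentation (labels are compared after normalization, the first definition wins, and character references or escapes are not decoded before matching) can be sketched in isolation. This is a minimal standalone illustration, not the crate's code: `normalize` is a simplified stand-in for `normalize_identifier` that assumes normalization amounts to collapsing whitespace and case-folding, and `collect_definitions` and `resolve` are hypothetical helpers introduced only for this example.

```rust
use std::collections::HashMap;

/// Simplified stand-in for `normalize_identifier` (assumption: collapse
/// whitespace runs, trim, and case-fold; the real function may differ).
fn normalize(label: &str) -> String {
    label
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_lowercase()
        .to_uppercase()
}

/// Hypothetical helper: index definitions by normalized label, keeping only
/// the first definition seen for each identifier.
fn collect_definitions<'a>(definitions: &[(&'a str, &'a str)]) -> HashMap<String, &'a str> {
    let mut map = HashMap::new();
    for (label, content) in definitions {
        // `or_insert` leaves an existing entry untouched, so later
        // duplicates are ignored.
        map.entry(normalize(label)).or_insert(*content);
    }
    map
}

/// Hypothetical helper: look up the definition matching a call label.
fn resolve<'a>(map: &HashMap<String, &'a str>, call: &str) -> Option<&'a str> {
    map.get(&normalize(call)).copied()
}

fn main() {
    // Labels are given without the surrounding `[^` and `]`.
    let definitions = [("a", "x"), ("a", "y"), ("a&amp;b", "z"), ("a\\&b", "w")];
    let map = collect_definitions(&definitions);

    // The first definition of `a` is preferred, and matching ignores case.
    assert_eq!(resolve(&map, "a"), Some("x"));
    assert_eq!(resolve(&map, "A"), Some("x"));

    // The raw label text is compared, so neither the character reference
    // nor the character escape makes a definition match `a&b`.
    assert_eq!(resolve(&map, "a&b"), None);
}
```

Keeping only the first entry per identifier at lookup time mirrors the note in `label_after` that duplicates are pushed without a uniqueness check; how the crate itself resolves duplicates later is not shown in this file.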