Markdown parser fork with extended syntax for personal use.
//! Attention (emphasis, strong, optionally GFM strikethrough) occurs in the
//! [text][] content type.
//!
//! ## Grammar
//!
//! Attention sequences form with the following BNF
//! (<small>see [construct][crate::construct] for character groups</small>):
//!
//! ```bnf
//! attention_sequence ::= 1*'*' | 1*'_'
//! gfm_attention_sequence ::= 1*'~'
//! ```
//!
//! Sequences are matched together to form attention based on which character
//! they contain, how long they are, and what character occurs before and after
//! each sequence.
//! Otherwise they are turned into data.
//!
//! ## HTML
//!
//! When asterisk/underscore sequences match, and two markers can be “taken”
//! from them, they together relate to the `<strong>` element in HTML.
//! When one marker can be taken, they relate to the `<em>` element.
//! See [*§ 4.5.2 The `em` element*][html-em] and
//! [*§ 4.5.3 The `strong` element*][html-strong] in the HTML spec for more
//! info.
//!
//! When tilde sequences match, they together relate to the `<del>` element in
//! HTML.
//! See [*§ 4.7.2 The `del` element*][html-del] in the HTML spec for more info.
//!
//! ## Recommendation
//!
//! It is recommended to use asterisks for emphasis/strong attention when
//! writing markdown.
//!
//! There are some small differences in whether sequences can open and/or close
//! based on whether they are formed with asterisks or underscores.
//! Because underscores also frequently occur in natural language inside words,
//! while asterisks typically never do, `CommonMark` prohibits underscore
//! sequences from opening or closing when *inside* a word.
//!
//! Because asterisks can be used to form most markdown constructs, using
//! them has the added benefit of making it easier to gloss over markdown: you
//! can look for asterisks to find syntax while not worrying about other
//! characters.
//!
//! For strikethrough attention, it is recommended to use two markers.
//! While `github.com` also allows single tildes, its spec technically
//! prohibits them.
//!
//! ## Tokens
//!
//! * [`Emphasis`][Name::Emphasis]
//! * [`EmphasisSequence`][Name::EmphasisSequence]
//! * [`EmphasisText`][Name::EmphasisText]
//! * [`GfmStrikethrough`][Name::GfmStrikethrough]
//! * [`GfmStrikethroughSequence`][Name::GfmStrikethroughSequence]
//! * [`GfmStrikethroughText`][Name::GfmStrikethroughText]
//! * [`Strong`][Name::Strong]
//! * [`StrongSequence`][Name::StrongSequence]
//! * [`StrongText`][Name::StrongText]
//!
//! > 👉 **Note**: while parsing, [`AttentionSequence`][Name::AttentionSequence]
//! > is used, which is later compiled away.
//!
//! ## References
//!
//! * [`attention.js` in `micromark`](https://github.com/micromark/micromark/blob/main/packages/micromark-core-commonmark/dev/lib/attention.js)
//! * [`micromark-extension-gfm-strikethrough`](https://github.com/micromark/micromark-extension-gfm-strikethrough)
//! * [*§ 6.2 Emphasis and strong emphasis* in `CommonMark`](https://spec.commonmark.org/0.31/#emphasis-and-strong-emphasis)
//! * [*§ 6.5 Strikethrough (extension)* in `GFM`](https://github.github.com/gfm/#strikethrough-extension-)
//!
//! [text]: crate::construct::text
//! [html-em]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-em-element
//! [html-strong]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-strong-element
//! [html-del]: https://html.spec.whatwg.org/multipage/edits.html#the-del-element

use crate::event::{Event, Kind, Name, Point};
use crate::resolve::Name as ResolveName;
use crate::state::{Name as StateName, State};
use crate::subtokenize::Subresult;
use crate::tokenizer::Tokenizer;
use crate::util::char::{
    after_index as char_after_index, before_index as char_before_index, classify_opt,
    Kind as CharacterKind,
};
use alloc::{vec, vec::Vec};

/// Attention sequence that we can take markers from.
#[derive(Debug)]
struct Sequence {
    /// Marker as a byte (`u8`) used in this sequence.
    marker: u8,
    /// We track whether sequences are in balanced events, and where those
    /// events start, so that one attention doesn’t start in, say, one link,
    /// and end in another.
    stack: Vec<usize>,
    /// The index into events where this sequence’s `Enter` currently resides.
    index: usize,
    /// The (shifted) point where this sequence starts.
    start_point: Point,
    /// The (shifted) point where this sequence ends.
    end_point: Point,
    /// The number of markers we can still use.
    size: usize,
    /// Whether this sequence can open attention.
    open: bool,
    /// Whether this sequence can close attention.
    close: bool,
}

/// At start of attention.
///
/// ```markdown
/// > | **
///     ^
/// ```
pub fn start(tokenizer: &mut Tokenizer) -> State {
    // Emphasis/strong:
    if (tokenizer.parse_state.options.constructs.attention
        && matches!(tokenizer.current, Some(b'*' | b'_')))
        // GFM strikethrough:
        || (tokenizer.parse_state.options.constructs.gfm_strikethrough
            && tokenizer.current == Some(b'~'))
    {
        tokenizer.tokenize_state.marker = tokenizer.current.unwrap();
        tokenizer.enter(Name::AttentionSequence);
        State::Retry(StateName::AttentionInside)
    } else {
        State::Nok
    }
}

/// In sequence.
///
/// ```markdown
/// > | **
///     ^^
/// ```
pub fn inside(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.current == Some(tokenizer.tokenize_state.marker) {
        tokenizer.consume();
        State::Next(StateName::AttentionInside)
    } else {
        tokenizer.exit(Name::AttentionSequence);
        tokenizer.register_resolver(ResolveName::Attention);
        tokenizer.tokenize_state.marker = 0;
        State::Ok
    }
}

/// Resolve sequences.
pub fn resolve(tokenizer: &mut Tokenizer) -> Option<Subresult> {
    // Find all sequences, gather info about them.
    let mut sequences = get_sequences(tokenizer);

    // Now walk through them and match them.
    let mut close = 0;

    while close < sequences.len() {
        let sequence_close = &sequences[close];
        let mut next_index = close + 1;

        // Find a sequence that can close.
        if sequence_close.close {
            let mut open = close;

            // Now walk back to find an opener.
            while open > 0 {
                open -= 1;

                let sequence_open = &sequences[open];

                // An opener matching our closer:
                if sequence_open.open
                    && sequence_close.marker == sequence_open.marker
                    && sequence_close.stack == sequence_open.stack
                {
                    // If the opening can close or the closing can open,
                    // and the close size *is not* a multiple of three,
                    // but the sum of the opening and closing size *is* a
                    // multiple of three, then **don’t** match.
                    if (sequence_open.close || sequence_close.open)
                        && sequence_close.size % 3 != 0
                        && (sequence_open.size + sequence_close.size) % 3 == 0
                    {
                        continue;
                    }

                    // For GFM strikethrough:
                    // * both sequences must have the same size
                    // * more than 2 markers don’t work
                    // * one marker is prohibited by the spec, but supported by GH
                    if sequence_close.marker == b'~'
                        && (sequence_close.size != sequence_open.size
                            || sequence_close.size > 2
                            || sequence_close.size == 1
                                && !tokenizer.parse_state.options.gfm_strikethrough_single_tilde)
                    {
                        continue;
                    }

                    // We found a match!
                    next_index = match_sequences(tokenizer, &mut sequences, open, close);

                    break;
                }
            }
        }

        close = next_index;
    }

    // Mark remaining sequences as data.
    let mut index = 0;
    while index < sequences.len() {
        let sequence = &sequences[index];
        tokenizer.events[sequence.index].name = Name::Data;
        tokenizer.events[sequence.index + 1].name = Name::Data;
        index += 1;
    }

    tokenizer.map.consume(&mut tokenizer.events);
    None
}

/// Get sequences.
fn get_sequences(tokenizer: &mut Tokenizer) -> Vec<Sequence> {
    let mut index = 0;
    let mut stack = vec![];
    let mut sequences = vec![];

    while index < tokenizer.events.len() {
        let enter = &tokenizer.events[index];

        if enter.name == Name::AttentionSequence {
            if enter.kind == Kind::Enter {
                let end = index + 1;
                let exit = &tokenizer.events[end];

                let marker = tokenizer.parse_state.bytes[enter.point.index];
                let before_char = char_before_index(tokenizer.parse_state.bytes, enter.point.index);
                let before = classify_opt(before_char);
                let after_char = char_after_index(tokenizer.parse_state.bytes, exit.point.index);
                let after = classify_opt(after_char);
                let open = after == CharacterKind::Other
                    || (after == CharacterKind::Punctuation && before != CharacterKind::Other)
                    // For regular attention markers (not strikethrough), the
                    // other attention markers can be used around them.
                    || (marker != b'~' && matches!(after_char, Some('*' | '_')))
                    || (marker != b'~'
                        && tokenizer.parse_state.options.constructs.gfm_strikethrough
                        && matches!(after_char, Some('~')));
                let close = before == CharacterKind::Other
                    || (before == CharacterKind::Punctuation && after != CharacterKind::Other)
                    || (marker != b'~' && matches!(before_char, Some('*' | '_')))
                    || (marker != b'~'
                        && tokenizer.parse_state.options.constructs.gfm_strikethrough
                        && matches!(before_char, Some('~')));

                sequences.push(Sequence {
                    index,
                    stack: stack.clone(),
                    start_point: enter.point.clone(),
                    end_point: exit.point.clone(),
                    size: exit.point.index - enter.point.index,
                    open: if marker == b'_' {
                        open && (before != CharacterKind::Other || !close)
                    } else {
                        open
                    },
                    close: if marker == b'_' {
                        close && (after != CharacterKind::Other || !open)
                    } else {
                        close
                    },
                    marker,
                });
            }
        } else if enter.kind == Kind::Enter {
            stack.push(index);
        } else {
            stack.pop();
        }

        index += 1;
    }

    sequences
}

/// Match two sequences.
#[allow(clippy::too_many_lines)]
fn match_sequences(
    tokenizer: &mut Tokenizer,
    sequences: &mut Vec<Sequence>,
    open: usize,
    close: usize,
) -> usize {
    // Where to move to next.
    // Stay on this closing sequence for the next iteration: it
    // might close more things.
    // It’s changed if sequences are removed.
    let mut next = close;

    // Number of markers to use from the sequence.
    let take = if sequences[open].size > 1 && sequences[close].size > 1 {
        2
    } else {
        1
    };

    // We’re *on* a closing sequence, with a matching opening
    // sequence.
    // Now we make sure that we can’t have misnested attention:
    //
    // ```html
    // <em>a <strong>b</em> c</strong>
    // ```
    //
    // Do that by marking everything between it as no longer
    // possible to open anything.
    // Theoretically we should mark as `close: false` too, but
    // we don’t look for closers backwards, so it’s not needed.
    let mut between = open + 1;

    while between < close {
        sequences[between].open = false;
        between += 1;
    }

    let (group_name, seq_name, text_name) = if sequences[open].marker == b'~' {
        (
            Name::GfmStrikethrough,
            Name::GfmStrikethroughSequence,
            Name::GfmStrikethroughText,
        )
    } else if take == 1 {
        (Name::Emphasis, Name::EmphasisSequence, Name::EmphasisText)
    } else {
        (Name::Strong, Name::StrongSequence, Name::StrongText)
    };
    let open_index = sequences[open].index;
    let close_index = sequences[close].index;
    let open_exit = sequences[open].end_point.clone();
    let close_enter = sequences[close].start_point.clone();

    // No need to worry about `VS`, because sequences are only actual characters.
    sequences[open].size -= take;
    sequences[close].size -= take;
    sequences[open].end_point.column -= take;
    sequences[open].end_point.index -= take;
    sequences[close].start_point.column += take;
    sequences[close].start_point.index += take;

    // Opening.
    tokenizer.map.add_before(
        // Add after the current sequence (it might remain).
        open_index + 2,
        0,
        vec![
            Event {
                kind: Kind::Enter,
                name: group_name.clone(),
                point: sequences[open].end_point.clone(),
                link: None,
            },
            Event {
                kind: Kind::Enter,
                name: seq_name.clone(),
                point: sequences[open].end_point.clone(),
                link: None,
            },
            Event {
                kind: Kind::Exit,
                name: seq_name.clone(),
                point: open_exit.clone(),
                link: None,
            },
            Event {
                kind: Kind::Enter,
                name: text_name.clone(),
                point: open_exit,
                link: None,
            },
        ],
    );
    // Closing.
    tokenizer.map.add(
        close_index,
        0,
        vec![
            Event {
                kind: Kind::Exit,
                name: text_name,
                point: close_enter.clone(),
                link: None,
            },
            Event {
                kind: Kind::Enter,
                name: seq_name.clone(),
                point: close_enter,
                link: None,
            },
            Event {
                kind: Kind::Exit,
                name: seq_name,
                point: sequences[close].start_point.clone(),
                link: None,
            },
            Event {
                kind: Kind::Exit,
                name: group_name,
                point: sequences[close].start_point.clone(),
                link: None,
            },
        ],
    );

    // Remove closing sequence if fully used.
    if sequences[close].size == 0 {
        sequences.remove(close);
        tokenizer.map.add(close_index, 2, vec![]);
    } else {
        // Shift remaining closing sequence forward.
        // Do it here because a sequence can open and close different
        // other sequences, and the remainder can be on any side or
        // somewhere in the middle.
        tokenizer.events[close_index].point = sequences[close].start_point.clone();
    }

    if sequences[open].size == 0 {
        sequences.remove(open);
        tokenizer.map.add(open_index, 2, vec![]);
        // Everything shifts one to the left, account for it in next iteration.
        next -= 1;
    } else {
        tokenizer.events[open_index + 1].point = sequences[open].end_point.clone();
    }

    next
}
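The two `continue` guards in `resolve` can be illustrated on their own, without the tokenizer. The sketch below (the helper names `can_match` and `tilde_can_match` are hypothetical, not part of this crate) mirrors the CommonMark "rule of three" condition and the GFM strikethrough size checks exactly as they appear in the resolver:

```rust
/// Sketch of the CommonMark "rule of three": when the opener can also
/// close or the closer can also open, the pair must NOT match if the
/// close size is not a multiple of three while the combined size is.
/// Hypothetical helper, mirroring the first `continue` guard above.
fn can_match(open_size: usize, close_size: usize, open_can_close: bool, close_can_open: bool) -> bool {
    !((open_can_close || close_can_open)
        && close_size % 3 != 0
        && (open_size + close_size) % 3 == 0)
}

/// Sketch of the GFM strikethrough checks: both sequences must have the
/// same size, at most two markers work, and a single tilde only matches
/// when the single-tilde option is enabled.
fn tilde_can_match(open_size: usize, close_size: usize, allow_single_tilde: bool) -> bool {
    close_size == open_size && close_size <= 2 && (close_size != 1 || allow_single_tilde)
}

fn main() {
    // Sizes 1 and 2 with an ambiguous (intraword) sequence: the close
    // size is not a multiple of three, but 1 + 2 is, so no match.
    assert!(!can_match(1, 2, true, true));
    // Plain `**strong**`: neither sequence is ambiguous, so the rule
    // of three does not apply and the pair matches.
    assert!(can_match(2, 2, false, false));
    // `~~del~~` matches; `~del~` only with the single-tilde option.
    assert!(tilde_can_match(2, 2, false));
    assert!(!tilde_can_match(1, 1, false));
    assert!(tilde_can_match(1, 1, true));
}
```

Keeping the condition in this negated "don't match" form matches how the resolver is written: it walks back from each closer and skips candidate openers that fail either guard.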