Markdown parser fork with extended syntax for personal use.
//! Attention (emphasis, strong, optionally GFM strikethrough) occurs in the
//! [text][] content type.
//!
//! ## Grammar
//!
//! Attention sequences form with the following BNF
//! (<small>see [construct][crate::construct] for character groups</small>):
//!
//! ```bnf
//! attention_sequence ::= 1*'*' | 1*'_'
//! gfm_attention_sequence ::= 1*'~'
//! ```
//!
//! Sequences are matched together to form attention based on which character
//! they contain, how long they are, and what character occurs before and after
//! each sequence.
//! Otherwise they are turned into data.
//!
//! ## HTML
//!
//! When asterisk/underscore sequences match, and two markers can be “taken”
//! from them, they together relate to the `<strong>` element in HTML.
//! When one marker can be taken, they relate to the `<em>` element.
//! See [*§ 4.5.2 The `em` element*][html-em] and
//! [*§ 4.5.3 The `strong` element*][html-strong] in the HTML spec for more
//! info.
//!
//! When tilde sequences match, they together relate to the `<del>` element in
//! HTML.
//! See [*§ 4.7.2 The `del` element*][html-del] in the HTML spec for more info.
//!
//! ## Recommendation
//!
//! It is recommended to use asterisks for emphasis/strong attention when
//! writing markdown.
//!
//! There are some small differences in whether sequences can open and/or close
//! based on whether they are formed with asterisks or underscores.
//! Because underscores also frequently occur in natural language inside words,
//! while asterisks typically never do, `CommonMark` prohibits underscore
//! sequences from opening or closing when *inside* a word.
//!
//! Because asterisks can be used to form most markdown constructs, using
//! them has the added benefit of making it easier to gloss over markdown: you
//! can look for asterisks to find syntax while not worrying about other
//! characters.
//!
//! For strikethrough attention, it is recommended to use two markers.
//! While `github.com` also allows single tildes, its spec technically
//! prohibits them.
//!
//! ## Tokens
//!
//! * [`Emphasis`][Name::Emphasis]
//! * [`EmphasisSequence`][Name::EmphasisSequence]
//! * [`EmphasisText`][Name::EmphasisText]
//! * [`GfmStrikethrough`][Name::GfmStrikethrough]
//! * [`GfmStrikethroughSequence`][Name::GfmStrikethroughSequence]
//! * [`GfmStrikethroughText`][Name::GfmStrikethroughText]
//! * [`Strong`][Name::Strong]
//! * [`StrongSequence`][Name::StrongSequence]
//! * [`StrongText`][Name::StrongText]
//!
//! > 👉 **Note**: while parsing, [`AttentionSequence`][Name::AttentionSequence]
//! > is used, which is later compiled away.
//!
//! ## References
//!
//! * [`attention.js` in `micromark`](https://github.com/micromark/micromark/blob/main/packages/micromark-core-commonmark/dev/lib/attention.js)
//! * [`micromark-extension-gfm-strikethrough`](https://github.com/micromark/micromark-extension-gfm-strikethrough)
//! * [*§ 6.2 Emphasis and strong emphasis* in `CommonMark`](https://spec.commonmark.org/0.31/#emphasis-and-strong-emphasis)
//! * [*§ 6.5 Strikethrough (extension)* in `GFM`](https://github.github.com/gfm/#strikethrough-extension-)
//!
//! [text]: crate::construct::text
//! [html-em]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-em-element
//! [html-strong]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-strong-element
//! [html-del]: https://html.spec.whatwg.org/multipage/edits.html#the-del-element

use crate::event::{Event, Kind, Name, Point};
use crate::resolve::Name as ResolveName;
use crate::state::{Name as StateName, State};
use crate::subtokenize::Subresult;
use crate::tokenizer::Tokenizer;
use crate::util::char::{
    after_index as char_after_index, before_index as char_before_index, classify_opt,
    Kind as CharacterKind,
};
use alloc::{vec, vec::Vec};

/// Attention sequence that we can take markers from.
#[derive(Debug)]
struct Sequence {
    /// Marker as a byte (`u8`) used in this sequence.
    marker: u8,
    /// We track whether sequences are in balanced events, and where those
    /// events start, so that one attention doesn’t start in, say, one link,
    /// and end in another.
    stack: Vec<usize>,
    /// The index into events where this sequence’s `Enter` currently resides.
    index: usize,
    /// The (shifted) point where this sequence starts.
    start_point: Point,
    /// The (shifted) point where this sequence ends.
    end_point: Point,
    /// The number of markers we can still use.
    size: usize,
    /// Whether this sequence can open attention.
    open: bool,
    /// Whether this sequence can close attention.
    close: bool,
}

/// At start of attention.
///
/// ```markdown
/// > | **
///     ^
/// ```
pub fn start(tokenizer: &mut Tokenizer) -> State {
    // Emphasis/strong:
    if (tokenizer.parse_state.options.constructs.attention
        && matches!(tokenizer.current, Some(b'*' | b'_')))
        // GFM strikethrough:
        || (tokenizer.parse_state.options.constructs.gfm_strikethrough
            && tokenizer.current == Some(b'~'))
    {
        tokenizer.tokenize_state.marker = tokenizer.current.unwrap();
        tokenizer.enter(Name::AttentionSequence);
        State::Retry(StateName::AttentionInside)
    } else {
        State::Nok
    }
}

/// In sequence.
///
/// ```markdown
/// > | **
///     ^^
/// ```
pub fn inside(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.current == Some(tokenizer.tokenize_state.marker) {
        tokenizer.consume();
        State::Next(StateName::AttentionInside)
    } else {
        tokenizer.exit(Name::AttentionSequence);
        tokenizer.register_resolver(ResolveName::Attention);
        tokenizer.tokenize_state.marker = 0;
        State::Ok
    }
}

/// Resolve sequences.
pub fn resolve(tokenizer: &mut Tokenizer) -> Option<Subresult> {
    // Find all sequences, gather info about them.
    let mut sequences = get_sequences(tokenizer);

    // Now walk through them and match them.
    let mut close = 0;

    while close < sequences.len() {
        let sequence_close = &sequences[close];
        let mut next_index = close + 1;

        // Find a sequence that can close.
        if sequence_close.close {
            let mut open = close;

            // Now walk back to find an opener.
            while open > 0 {
                open -= 1;

                let sequence_open = &sequences[open];

                // An opener matching our closer:
                if sequence_open.open
                    && sequence_close.marker == sequence_open.marker
                    && sequence_close.stack == sequence_open.stack
                {
                    // If the opening can close or the closing can open,
                    // and the close size *is not* a multiple of three,
                    // but the sum of the opening and closing size *is* a
                    // multiple of three, then **don’t** match.
                    if (sequence_open.close || sequence_close.open)
                        && sequence_close.size % 3 != 0
                        && (sequence_open.size + sequence_close.size) % 3 == 0
                    {
                        continue;
                    }

                    // For GFM strikethrough:
                    // * both sequences must have the same size
                    // * more than 2 markers don’t work
                    // * one marker is prohibited by the spec, but supported by GH
                    if sequence_close.marker == b'~'
                        && (sequence_close.size != sequence_open.size
                            || sequence_close.size > 2
                            || sequence_close.size == 1
                                && !tokenizer.parse_state.options.gfm_strikethrough_single_tilde)
                    {
                        continue;
                    }

                    // We found a match!
                    next_index = match_sequences(tokenizer, &mut sequences, open, close);

                    break;
                }
            }
        }

        close = next_index;
    }

    // Mark remaining sequences as data.
    let mut index = 0;
    while index < sequences.len() {
        let sequence = &sequences[index];
        tokenizer.events[sequence.index].name = Name::Data;
        tokenizer.events[sequence.index + 1].name = Name::Data;
        index += 1;
    }

    tokenizer.map.consume(&mut tokenizer.events);
    None
}

/// Get sequences.
fn get_sequences(tokenizer: &mut Tokenizer) -> Vec<Sequence> {
    let mut index = 0;
    let mut stack = vec![];
    let mut sequences = vec![];

    while index < tokenizer.events.len() {
        let enter = &tokenizer.events[index];

        if enter.name == Name::AttentionSequence {
            if enter.kind == Kind::Enter {
                let end = index + 1;
                let exit = &tokenizer.events[end];

                let marker = tokenizer.parse_state.bytes[enter.point.index];
                let before_char = char_before_index(tokenizer.parse_state.bytes, enter.point.index);
                let before = classify_opt(before_char);
                let after_char = char_after_index(tokenizer.parse_state.bytes, exit.point.index);
                let after = classify_opt(after_char);
                let open = after == CharacterKind::Other
                    || (after == CharacterKind::Punctuation && before != CharacterKind::Other)
                    // For regular attention markers (not strikethrough), the
                    // other attention markers can be used around them.
                    || (marker != b'~' && matches!(after_char, Some('*' | '_')))
                    || (marker != b'~'
                        && tokenizer.parse_state.options.constructs.gfm_strikethrough
                        && matches!(after_char, Some('~')));
                let close = before == CharacterKind::Other
                    || (before == CharacterKind::Punctuation && after != CharacterKind::Other)
                    || (marker != b'~' && matches!(before_char, Some('*' | '_')))
                    || (marker != b'~'
                        && tokenizer.parse_state.options.constructs.gfm_strikethrough
                        && matches!(before_char, Some('~')));

                sequences.push(Sequence {
                    index,
                    stack: stack.clone(),
                    start_point: enter.point.clone(),
                    end_point: exit.point.clone(),
                    size: exit.point.index - enter.point.index,
                    open: if marker == b'_' {
                        open && (before != CharacterKind::Other || !close)
                    } else {
                        open
                    },
                    close: if marker == b'_' {
                        close && (after != CharacterKind::Other || !open)
                    } else {
                        close
                    },
                    marker,
                });
            }
        } else if enter.kind == Kind::Enter {
            stack.push(index);
        } else {
            stack.pop();
        }

        index += 1;
    }

    sequences
}

/// Match two sequences.
#[allow(clippy::too_many_lines)]
fn match_sequences(
    tokenizer: &mut Tokenizer,
    sequences: &mut Vec<Sequence>,
    open: usize,
    close: usize,
) -> usize {
    // Where to move to next.
    // Stay on this closing sequence for the next iteration: it
    // might close more things.
    // It’s changed if sequences are removed.
    let mut next = close;

    // Number of markers to use from the sequence.
    let take = if sequences[open].size > 1 && sequences[close].size > 1 {
        2
    } else {
        1
    };

    // We’re *on* a closing sequence, with a matching opening
    // sequence.
    // Now we make sure that we can’t have misnested attention:
    //
    // ```html
    // <em>a <strong>b</em> c</strong>
    // ```
    //
    // Do that by marking everything between it as no longer
    // possible to open anything.
    // Theoretically we should mark as `close: false` too, but
    // we don’t look for closers backwards, so it’s not needed.
    let mut between = open + 1;

    while between < close {
        sequences[between].open = false;
        between += 1;
    }

    let (group_name, seq_name, text_name) = if sequences[open].marker == b'~' {
        (
            Name::GfmStrikethrough,
            Name::GfmStrikethroughSequence,
            Name::GfmStrikethroughText,
        )
    } else if take == 1 {
        (Name::Emphasis, Name::EmphasisSequence, Name::EmphasisText)
    } else {
        (Name::Strong, Name::StrongSequence, Name::StrongText)
    };
    let open_index = sequences[open].index;
    let close_index = sequences[close].index;
    let open_exit = sequences[open].end_point.clone();
    let close_enter = sequences[close].start_point.clone();

    // No need to worry about `VS`, because sequences are only actual characters.
    sequences[open].size -= take;
    sequences[close].size -= take;
    sequences[open].end_point.column -= take;
    sequences[open].end_point.index -= take;
    sequences[close].start_point.column += take;
    sequences[close].start_point.index += take;

    // Opening.
    tokenizer.map.add_before(
        // Add after the current sequence (it might remain).
        open_index + 2,
        0,
        vec![
            Event {
                kind: Kind::Enter,
                name: group_name.clone(),
                point: sequences[open].end_point.clone(),
                link: None,
            },
            Event {
                kind: Kind::Enter,
                name: seq_name.clone(),
                point: sequences[open].end_point.clone(),
                link: None,
            },
            Event {
                kind: Kind::Exit,
                name: seq_name.clone(),
                point: open_exit.clone(),
                link: None,
            },
            Event {
                kind: Kind::Enter,
                name: text_name.clone(),
                point: open_exit,
                link: None,
            },
        ],
    );
    // Closing.
    tokenizer.map.add(
        close_index,
        0,
        vec![
            Event {
                kind: Kind::Exit,
                name: text_name,
                point: close_enter.clone(),
                link: None,
            },
            Event {
                kind: Kind::Enter,
                name: seq_name.clone(),
                point: close_enter,
                link: None,
            },
            Event {
                kind: Kind::Exit,
                name: seq_name,
                point: sequences[close].start_point.clone(),
                link: None,
            },
            Event {
                kind: Kind::Exit,
                name: group_name,
                point: sequences[close].start_point.clone(),
                link: None,
            },
        ],
    );

    // Remove closing sequence if fully used.
    if sequences[close].size == 0 {
        sequences.remove(close);
        tokenizer.map.add(close_index, 2, vec![]);
    } else {
        // Shift remaining closing sequence forward.
        // Do it here because a sequence can open and close different
        // other sequences, and the remainder can be on any side or
        // somewhere in the middle.
        tokenizer.events[close_index].point = sequences[close].start_point.clone();
    }

    if sequences[open].size == 0 {
        sequences.remove(open);
        tokenizer.map.add(open_index, 2, vec![]);
        // Everything shifts one to the left, account for it in next iteration.
        next -= 1;
    } else {
        tokenizer.events[open_index + 1].point = sequences[open].end_point.clone();
    }

    next
}
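The two `continue` guards in `resolve` can be illustrated on their own, without the tokenizer. The sketch below (the helper names `can_match` and `tilde_can_match` are hypothetical, not part of this crate) mirrors the CommonMark "rule of three" condition and the GFM strikethrough size checks exactly as they appear in the resolver:

```rust
/// Sketch of the CommonMark "rule of three": when the opener can also
/// close or the closer can also open, the pair must NOT match if the
/// close size is not a multiple of three while the combined size is.
/// Hypothetical helper, mirroring the first `continue` guard above.
fn can_match(open_size: usize, close_size: usize, open_can_close: bool, close_can_open: bool) -> bool {
    !((open_can_close || close_can_open)
        && close_size % 3 != 0
        && (open_size + close_size) % 3 == 0)
}

/// Sketch of the GFM strikethrough checks: both sequences must have the
/// same size, at most two markers work, and a single tilde only matches
/// when the single-tilde option is enabled.
fn tilde_can_match(open_size: usize, close_size: usize, allow_single_tilde: bool) -> bool {
    close_size == open_size && close_size <= 2 && (close_size != 1 || allow_single_tilde)
}

fn main() {
    // Sizes 1 and 2 with an ambiguous (intraword) sequence: the close
    // size is not a multiple of three, but 1 + 2 is, so no match.
    assert!(!can_match(1, 2, true, true));
    // Plain `**strong**`: neither sequence is ambiguous, so the rule
    // of three does not apply and the pair matches.
    assert!(can_match(2, 2, false, false));
    // `~~del~~` matches; `~del~` only with the single-tilde option.
    assert!(tilde_can_match(2, 2, false));
    assert!(!tilde_can_match(1, 1, false));
    assert!(tilde_can_match(1, 1, true));
}
```

Keeping the condition in this negated "don't match" form matches how the resolver is written: it walks back from each closer and skips candidate openers that fail either guard.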