//! Attention (emphasis, strong, optionally GFM strikethrough) occurs in the
//! [text][] content type.
//!
//! ## Grammar
//!
//! Attention sequences form with the following BNF
//! (<small>see [construct][crate::construct] for character groups</small>):
//!
//! ```bnf
//! attention_sequence ::= 1*'*' | 1*'_'
//! gfm_attention_sequence ::= 1*'~'
//! ```
//!
//! Sequences are matched together to form attention based on which character
//! they contain, how long they are, and which characters occur before and
//! after each sequence.
//! Sequences that cannot be matched are turned into data.
//!
//! ## HTML
//!
//! When asterisk/underscore sequences match, and two markers can be “taken”
//! from them, they together relate to the `<strong>` element in HTML.
//! When one marker can be taken, they relate to the `<em>` element.
//! See [*§ 4.5.2 The `em` element*][html-em] and
//! [*§ 4.5.3 The `strong` element*][html-strong] in the HTML spec for more
//! info.
//!
//! When tilde sequences match, they together relate to the `<del>` element in
//! HTML.
//! See [*§ 4.7.2 The `del` element*][html-del] in the HTML spec for more info.
//!
//! ## Recommendation
//!
//! It is recommended to use asterisks for emphasis/strong attention when
//! writing markdown.
//!
//! There are some small differences in whether sequences can open and/or
//! close based on whether they are formed with asterisks or underscores.
//! Because underscores frequently occur inside words in natural language,
//! while asterisks typically do not, `CommonMark` prohibits underscore
//! sequences from opening or closing when *inside* a word.
//!
//! Because asterisks can be used to form the most markdown constructs, using
//! them has the added benefit of making it easier to gloss over markdown: you
//! can look for asterisks to find syntax while not worrying about other
//! characters.
//!
//! For strikethrough attention, it is recommended to use two markers.
//! While `github.com` also accepts single tildes, their spec technically
//! prohibits them.
//!
//! ## Tokens
//!
//! * [`Emphasis`][Name::Emphasis]
//! * [`EmphasisSequence`][Name::EmphasisSequence]
//! * [`EmphasisText`][Name::EmphasisText]
//! * [`GfmStrikethrough`][Name::GfmStrikethrough]
//! * [`GfmStrikethroughSequence`][Name::GfmStrikethroughSequence]
//! * [`GfmStrikethroughText`][Name::GfmStrikethroughText]
//! * [`Strong`][Name::Strong]
//! * [`StrongSequence`][Name::StrongSequence]
//! * [`StrongText`][Name::StrongText]
//!
//! > 👉 **Note**: while parsing, [`AttentionSequence`][Name::AttentionSequence]
//! > is used, which is later compiled away.
//!
//! ## References
//!
//! * [`attention.js` in `micromark`](https://github.com/micromark/micromark/blob/main/packages/micromark-core-commonmark/dev/lib/attention.js)
//! * [`micromark-extension-gfm-strikethrough`](https://github.com/micromark/micromark-extension-gfm-strikethrough)
//! * [*§ 6.2 Emphasis and strong emphasis* in `CommonMark`](https://spec.commonmark.org/0.31/#emphasis-and-strong-emphasis)
//! * [*§ 6.5 Strikethrough (extension)* in `GFM`](https://github.github.com/gfm/#strikethrough-extension-)
//!
//! [text]: crate::construct::text
//! [html-em]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-em-element
//! [html-strong]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-strong-element
//! [html-del]: https://html.spec.whatwg.org/multipage/edits.html#the-del-element

use crate::event::{Event, Kind, Name, Point};
use crate::resolve::Name as ResolveName;
use crate::state::{Name as StateName, State};
use crate::subtokenize::Subresult;
use crate::tokenizer::Tokenizer;
use crate::util::char::{
    after_index as char_after_index, before_index as char_before_index, classify_opt,
    Kind as CharacterKind,
};
use alloc::{vec, vec::Vec};

/// Attention sequence that we can take markers from.
#[derive(Debug)]
struct Sequence {
    /// Marker as a byte (`u8`) used in this sequence.
    marker: u8,
    /// We track whether sequences are in balanced events, and where those
    /// events start, so that one attention doesn’t start in, say, one link,
    /// and end in another.
    stack: Vec<usize>,
    /// The index into events where this sequence’s `Enter` currently resides.
    index: usize,
    /// The (shifted) point where this sequence starts.
    start_point: Point,
    /// The (shifted) point where this sequence ends.
    end_point: Point,
    /// The number of markers we can still use.
    size: usize,
    /// Whether this sequence can open attention.
    open: bool,
    /// Whether this sequence can close attention.
    close: bool,
}

/// At start of attention.
///
/// ```markdown
/// > | **
///     ^
/// ```
pub fn start(tokenizer: &mut Tokenizer) -> State {
    // Emphasis/strong:
    if (tokenizer.parse_state.options.constructs.attention
        && matches!(tokenizer.current, Some(b'*' | b'_')))
        // GFM strikethrough:
        || (tokenizer.parse_state.options.constructs.gfm_strikethrough
            && tokenizer.current == Some(b'~'))
    {
        tokenizer.tokenize_state.marker = tokenizer.current.unwrap();
        tokenizer.enter(Name::AttentionSequence);
        State::Retry(StateName::AttentionInside)
    } else {
        State::Nok
    }
}

/// In sequence.
///
/// ```markdown
/// > | **
///     ^^
/// ```
pub fn inside(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.current == Some(tokenizer.tokenize_state.marker) {
        tokenizer.consume();
        State::Next(StateName::AttentionInside)
    } else {
        tokenizer.exit(Name::AttentionSequence);
        tokenizer.register_resolver(ResolveName::Attention);
        tokenizer.tokenize_state.marker = 0;
        State::Ok
    }
}
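Taken together, `start` and `inside` simply consume a run of a single marker byte. A self-contained sketch of that byte-level behavior (`sequence_len` is a hypothetical helper for illustration, not part of this crate; the real code emits an `AttentionSequence` event instead of returning a length):

```rust
// Count the run of one marker byte starting at `start`, mirroring what the
// `start`/`inside` state functions consume for an attention sequence.
fn sequence_len(bytes: &[u8], start: usize) -> usize {
    let marker = bytes[start];
    let mut index = start;
    while index < bytes.len() && bytes[index] == marker {
        index += 1;
    }
    index - start
}

fn main() {
    // `**bold**`: two asterisks before other content.
    assert_eq!(sequence_len(b"**bold**", 0), 2);
    // `~~~x`: a run of three tildes.
    assert_eq!(sequence_len(b"~~~x", 0), 3);
}
```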

/// Resolve sequences.
pub fn resolve(tokenizer: &mut Tokenizer) -> Option<Subresult> {
    // Find all sequences, gather info about them.
    let mut sequences = get_sequences(tokenizer);

    // Now walk through them and match them.
    let mut close = 0;

    while close < sequences.len() {
        let sequence_close = &sequences[close];
        let mut next_index = close + 1;

        // Find a sequence that can close.
        if sequence_close.close {
            let mut open = close;

            // Now walk back to find an opener.
            while open > 0 {
                open -= 1;

                let sequence_open = &sequences[open];

                // An opener matching our closer:
                if sequence_open.open
                    && sequence_close.marker == sequence_open.marker
                    && sequence_close.stack == sequence_open.stack
                {
                    // If the opening can close or the closing can open,
                    // and the close size *is not* a multiple of three,
                    // but the sum of the opening and closing size *is* a
                    // multiple of three, then **don’t** match.
                    if (sequence_open.close || sequence_close.open)
                        && sequence_close.size % 3 != 0
                        && (sequence_open.size + sequence_close.size) % 3 == 0
                    {
                        continue;
                    }

                    // For GFM strikethrough:
                    // * both sequences must have the same size
                    // * more than 2 markers don’t work
                    // * one marker is prohibited by the spec, but supported by GH
                    if sequence_close.marker == b'~'
                        && (sequence_close.size != sequence_open.size
                            || sequence_close.size > 2
                            || sequence_close.size == 1
                                && !tokenizer.parse_state.options.gfm_strikethrough_single_tilde)
                    {
                        continue;
                    }

                    // We found a match!
                    next_index = match_sequences(tokenizer, &mut sequences, open, close);

                    break;
                }
            }
        }

        close = next_index;
    }

    // Mark remaining sequences as data.
    let mut index = 0;
    while index < sequences.len() {
        let sequence = &sequences[index];
        tokenizer.events[sequence.index].name = Name::Data;
        tokenizer.events[sequence.index + 1].name = Name::Data;
        index += 1;
    }

    tokenizer.map.consume(&mut tokenizer.events);
    None
}
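The “multiple of three” check inside `resolve` can be isolated into a standalone predicate. A minimal sketch (`multiple_of_three_forbids` is a hypothetical name, not part of this crate) that mirrors the exact condition used above:

```rust
// CommonMark’s “rule of three”: when either sequence could play both roles
// (open and close), a closer whose size is not a multiple of three must not
// pair with an opener when their combined size *is* a multiple of three.
fn multiple_of_three_forbids(
    open_size: usize,
    open_can_close: bool,
    close_size: usize,
    close_can_open: bool,
) -> bool {
    (open_can_close || close_can_open)
        && close_size % 3 != 0
        && (open_size + close_size) % 3 == 0
}

fn main() {
    // A dual-role 2-marker opener and a dual-role 1-marker closer:
    // 2 + 1 = 3, so this pair is rejected.
    assert!(multiple_of_three_forbids(2, true, 1, true));
    // Plain `*a*`: neither side is dual-role, so the pair is allowed.
    assert!(!multiple_of_three_forbids(1, false, 1, false));
    // `***a***`: the closer’s size is itself a multiple of three,
    // so the rule never applies.
    assert!(!multiple_of_three_forbids(3, true, 3, true));
}
```

This is why `*a**b*` does not become `<em>a<em></em>b</em>`: the inner sequences that could both open and close are prevented from pairing across the rule-of-three boundary.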

/// Get sequences.
fn get_sequences(tokenizer: &mut Tokenizer) -> Vec<Sequence> {
    let mut index = 0;
    let mut stack = vec![];
    let mut sequences = vec![];

    while index < tokenizer.events.len() {
        let enter = &tokenizer.events[index];

        if enter.name == Name::AttentionSequence {
            if enter.kind == Kind::Enter {
                let end = index + 1;
                let exit = &tokenizer.events[end];

                let marker = tokenizer.parse_state.bytes[enter.point.index];
                let before_char = char_before_index(tokenizer.parse_state.bytes, enter.point.index);
                let before = classify_opt(before_char);
                let after_char = char_after_index(tokenizer.parse_state.bytes, exit.point.index);
                let after = classify_opt(after_char);
                let open = after == CharacterKind::Other
                    || (after == CharacterKind::Punctuation && before != CharacterKind::Other)
                    // For regular attention markers (not strikethrough), the
                    // other attention markers can be used around them.
                    || (marker != b'~' && matches!(after_char, Some('*' | '_')))
                    || (marker != b'~'
                        && tokenizer.parse_state.options.constructs.gfm_strikethrough
                        && matches!(after_char, Some('~')));
                let close = before == CharacterKind::Other
                    || (before == CharacterKind::Punctuation && after != CharacterKind::Other)
                    || (marker != b'~' && matches!(before_char, Some('*' | '_')))
                    || (marker != b'~'
                        && tokenizer.parse_state.options.constructs.gfm_strikethrough
                        && matches!(before_char, Some('~')));

                sequences.push(Sequence {
                    index,
                    stack: stack.clone(),
                    start_point: enter.point.clone(),
                    end_point: exit.point.clone(),
                    size: exit.point.index - enter.point.index,
                    open: if marker == b'_' {
                        open && (before != CharacterKind::Other || !close)
                    } else {
                        open
                    },
                    close: if marker == b'_' {
                        close && (after != CharacterKind::Other || !open)
                    } else {
                        close
                    },
                    marker,
                });
            }
        } else if enter.kind == Kind::Enter {
            stack.push(index);
        } else {
            stack.pop();
        }

        index += 1;
    }

    sequences
}
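The core of the `open`/`close` computation in `get_sequences` is CommonMark’s flanking check over a three-way character classification. A simplified self-contained sketch (hypothetical `classify` and `open_close` helpers; the real `classify_opt` also handles Unicode punctuation tables, and the marker-specific extras for `*`, `_`, and `~` above are omitted):

```rust
// Three-way classification of the character before/after a sequence.
// `None` (the edge of the input) counts as whitespace.
#[derive(Clone, Copy, PartialEq)]
enum Kind {
    Whitespace,
    Punctuation,
    Other,
}

fn classify(char: Option<char>) -> Kind {
    match char {
        None => Kind::Whitespace,
        Some(c) if c.is_whitespace() => Kind::Whitespace,
        Some(c) if c.is_ascii_punctuation() => Kind::Punctuation,
        _ => Kind::Other,
    }
}

// Whether a sequence with this before/after context can open or close,
// mirroring the first two clauses of the conditions in `get_sequences`.
fn open_close(before: Kind, after: Kind) -> (bool, bool) {
    let open = after == Kind::Other || (after == Kind::Punctuation && before != Kind::Other);
    let close = before == Kind::Other || (before == Kind::Punctuation && after != Kind::Other);
    (open, close)
}

fn main() {
    // `*a` at the start of input: a word follows, so it can open, not close.
    assert_eq!(open_close(classify(None), classify(Some('a'))), (true, false));
    // `a*` at the end of input: a word precedes, so it can close, not open.
    assert_eq!(open_close(classify(Some('a')), classify(None)), (false, true));
    // `a*b`: surrounded by words, it can do both (until the `_` rule vetoes it).
    assert_eq!(open_close(classify(Some('a')), classify(Some('b'))), (true, true));
}
```

The final `if marker == b'_'` adjustment above is what turns that last dual-role case into “neither” for underscores, implementing the intraword restriction described in the module docs.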

/// Match two sequences.
#[allow(clippy::too_many_lines)]
fn match_sequences(
    tokenizer: &mut Tokenizer,
    sequences: &mut Vec<Sequence>,
    open: usize,
    close: usize,
) -> usize {
    // Where to move to next.
    // Stay on this closing sequence for the next iteration: it
    // might close more things.
    // It’s changed if sequences are removed.
    let mut next = close;

    // Number of markers to use from the sequence.
    let take = if sequences[open].size > 1 && sequences[close].size > 1 {
        2
    } else {
        1
    };

    // We’re *on* a closing sequence, with a matching opening
    // sequence.
    // Now we make sure that we can’t have misnested attention:
    //
    // ```html
    // <em>a <strong>b</em> c</strong>
    // ```
    //
    // Do that by marking everything between it as no longer
    // possible to open anything.
    // Theoretically we should mark as `close: false` too, but
    // we don’t look for closers backwards, so it’s not needed.
    let mut between = open + 1;

    while between < close {
        sequences[between].open = false;
        between += 1;
    }

    let (group_name, seq_name, text_name) = if sequences[open].marker == b'~' {
        (
            Name::GfmStrikethrough,
            Name::GfmStrikethroughSequence,
            Name::GfmStrikethroughText,
        )
    } else if take == 1 {
        (Name::Emphasis, Name::EmphasisSequence, Name::EmphasisText)
    } else {
        (Name::Strong, Name::StrongSequence, Name::StrongText)
    };
    let open_index = sequences[open].index;
    let close_index = sequences[close].index;
    let open_exit = sequences[open].end_point.clone();
    let close_enter = sequences[close].start_point.clone();

    // No need to worry about `VS`, because sequences are only actual characters.
    sequences[open].size -= take;
    sequences[close].size -= take;
    sequences[open].end_point.column -= take;
    sequences[open].end_point.index -= take;
    sequences[close].start_point.column += take;
    sequences[close].start_point.index += take;

    // Opening.
    tokenizer.map.add_before(
        // Add after the current sequence (it might remain).
        open_index + 2,
        0,
        vec![
            Event {
                kind: Kind::Enter,
                name: group_name.clone(),
                point: sequences[open].end_point.clone(),
                link: None,
            },
            Event {
                kind: Kind::Enter,
                name: seq_name.clone(),
                point: sequences[open].end_point.clone(),
                link: None,
            },
            Event {
                kind: Kind::Exit,
                name: seq_name.clone(),
                point: open_exit.clone(),
                link: None,
            },
            Event {
                kind: Kind::Enter,
                name: text_name.clone(),
                point: open_exit,
                link: None,
            },
        ],
    );
    // Closing.
    tokenizer.map.add(
        close_index,
        0,
        vec![
            Event {
                kind: Kind::Exit,
                name: text_name,
                point: close_enter.clone(),
                link: None,
            },
            Event {
                kind: Kind::Enter,
                name: seq_name.clone(),
                point: close_enter,
                link: None,
            },
            Event {
                kind: Kind::Exit,
                name: seq_name,
                point: sequences[close].start_point.clone(),
                link: None,
            },
            Event {
                kind: Kind::Exit,
                name: group_name,
                point: sequences[close].start_point.clone(),
                link: None,
            },
        ],
    );

    // Remove closing sequence if fully used.
    if sequences[close].size == 0 {
        sequences.remove(close);
        tokenizer.map.add(close_index, 2, vec![]);
    } else {
        // Shift remaining closing sequence forward.
        // Do it here because a sequence can open and close different
        // other sequences, and the remainder can be on any side or
        // somewhere in the middle.
        tokenizer.events[close_index].point = sequences[close].start_point.clone();
    }

    if sequences[open].size == 0 {
        sequences.remove(open);
        tokenizer.map.add(open_index, 2, vec![]);
        // Everything shifts one to the left, account for it in the next iteration.
        next -= 1;
    } else {
        tokenizer.events[open_index + 1].point = sequences[open].end_point.clone();
    }

    next
}
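The `take` decision at the top of `match_sequences`, applied repeatedly as `resolve` revisits partially used sequences, is what turns `***a***` into strong wrapping emphasis. A toy sketch of that interaction (hypothetical `take` helper with the same condition as above; sizes stand in for the real `Sequence` bookkeeping):

```rust
// Take two markers when both sequences still hold at least two (strong),
// otherwise take one (emphasis).
fn take(open_size: usize, close_size: usize) -> usize {
    if open_size > 1 && close_size > 1 {
        2
    } else {
        1
    }
}

fn main() {
    // Simulate matching a 3-marker opener against a 3-marker closer,
    // as for `***a***`.
    let (mut open, mut close) = (3usize, 3usize);
    let mut names = vec![];
    while open > 0 && close > 0 {
        let t = take(open, close);
        names.push(if t == 2 { "Strong" } else { "Emphasis" });
        open -= t;
        close -= t;
    }
    // First pass takes two markers (strong), the leftovers take one (emphasis).
    assert_eq!(names, ["Strong", "Emphasis"]);
}
```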