//! Raw (flow) occurs in the [flow][] content type.
//! It forms code (fenced) and math (flow).
//!
//! ## Grammar
//!
//! Code (fenced) forms with the following BNF
//! (<small>see [construct][crate::construct] for character groups</small>):
//!
//! ```bnf
//! raw_flow ::= fence_open *( eol *byte ) [ eol fence_close ]
//!
//! ; Restriction: math (flow) does not support the `info` part.
//! fence_open ::= sequence [*space_or_tab info [1*space_or_tab meta]] *space_or_tab
//! ; Restriction: the number of markers in the closing fence sequence must be
//! ; equal to or greater than the number of markers in the opening fence
//! ; sequence.
//! ; Restriction: the marker in the closing fence sequence must match the
//! ; marker in the opening fence sequence.
//! fence_close ::= sequence *space_or_tab
//! sequence ::= 3*'`' | 3*'~' | 2*'$'
//! ; Restriction: the marker cannot occur in `info` if it is the `$` or `` ` `` character.
//! info ::= 1*text
//! ; Restriction: the marker cannot occur in `meta` if it is the `$` or `` ` `` character.
//! meta ::= 1*text *(*space_or_tab 1*text)
//! ```
//!
//! As this construct occurs in flow, like all flow constructs, it must be
//! followed by an eol (line ending) or eof (end of file).
//!
//! The above grammar does not show how indentation (with `space_or_tab`) of
//! each line is handled.
//! To parse raw (flow), let `x` be the number of `space_or_tab` characters
//! before the opening fence sequence.
//! Each line of text is then allowed (not required) to be indented with up
//! to `x` spaces or tabs, which are then ignored as an indent instead of being
//! considered as part of the content.
//! This indent does not affect the closing fence.
//! It can itself be indented up to 3 spaces or tabs.
//! A bigger indent makes it part of the content instead of a fence.
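//!
//! For example (per `CommonMark`: the opening fence below is indented with
//! 2 spaces, so `x` is 2, and up to 2 spaces of indent are stripped from each
//! content line):
//!
//! ```markdown
//!   ~~~js
//!       console.log(1)
//!  ~~~
//! ```
//!
//! Yields:
//!
//! ```html
//! <pre><code class="language-js">    console.log(1)
//! </code></pre>
//! ```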
//!
//! The `info` and `meta` parts are interpreted as the [string][] content type.
//! That means that [character escapes][character_escape] and
//! [character references][character_reference] are allowed.
//! Math (flow) does not support `info`.
//!
//! The optional `meta` part is ignored: it is not used when parsing or
//! rendering.
//!
//! The optional `info` part is used and is expected to specify the programming
//! language that the content is in.
//! Which value it holds depends on what your syntax highlighter supports, if
//! one is used.
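//!
//! For example, with `js` as `info` and a (hypothetical) `eval` value as
//! `meta`, the `meta` part is dropped while the `info` part becomes a class:
//!
//! ```markdown
//! ~~~js eval
//! console.log(1)
//! ~~~
//! ```
//!
//! Yields:
//!
//! ```html
//! <pre><code class="language-js">console.log(1)
//! </code></pre>
//! ```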
//!
//! In markdown, it is also possible to use [raw (text)][raw_text] in the
//! [text][] content type.
//! It is also possible to create code with the
//! [code (indented)][code_indented] construct.
//!
//! ## HTML
//!
//! Code (fenced) relates to both the `<pre>` and the `<code>` elements in
//! HTML.
//! See [*§ 4.4.3 The `pre` element*][html_pre] and [*§ 4.5.15 The `code`
//! element*][html_code] in the HTML spec for more info.
//!
//! Math (flow) does not relate to HTML elements.
//! `MathML`, which is sort of like SVG but for math, exists but it doesn’t work
//! well and isn’t widely supported.
//! Instead, it is recommended to use client side JavaScript with something like
//! `KaTeX` or `MathJax` to process the math.
//! For that, the math is compiled as a `<pre>`, and a `<code>` element with two
//! classes: `language-math` and `math-display`.
//! Client side JavaScript can look for these classes to process them further.
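//!
//! For example:
//!
//! ```markdown
//! $$
//! L = \frac{1}{2} \rho v^2 S C_L
//! $$
//! ```
//!
//! Yields (the same shape that `micromark-extension-math` produces):
//!
//! ```html
//! <pre><code class="language-math math-display">L = \frac{1}{2} \rho v^2 S C_L
//! </code></pre>
//! ```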
//!
//! The `info` is, when rendering to HTML, typically exposed as a class.
//! This behavior stems from the HTML spec ([*§ 4.5.15 The `code`
//! element*][html_code]).
//! For example:
//!
//! ```markdown
//! ~~~css
//! * { color: tomato }
//! ~~~
//! ```
//!
//! Yields:
//!
//! ```html
//! <pre><code class="language-css">* { color: tomato }
//! </code></pre>
//! ```
//!
//! ## Recommendation
//!
//! It is recommended to use code (fenced) instead of code (indented).
//! Code (fenced) is more explicit, similar to code (text), and has support
//! for specifying the programming language.
//!
//! When authoring markdown with math, keep in mind that math doesn’t work in
//! most places.
//! Notably, GitHub currently has a really weird crappy client-side regex-based
//! thing.
//! But on your own (math-heavy?) site it can be great!
//! You can use code (fenced) with an info string of `math` to improve this, as
//! that works in many places.
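//!
//! For example, code (fenced) with `math` as its info string:
//!
//! ```markdown
//! ~~~math
//! L = \frac{1}{2} \rho v^2 S C_L
//! ~~~
//! ```
//!
//! Yields plain code with a `language-math` class that such sites can hook
//! into:
//!
//! ```html
//! <pre><code class="language-math">L = \frac{1}{2} \rho v^2 S C_L
//! </code></pre>
//! ```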
//!
//! ## Tokens
//!
//! * [`CodeFenced`][Name::CodeFenced]
//! * [`CodeFencedFence`][Name::CodeFencedFence]
//! * [`CodeFencedFenceInfo`][Name::CodeFencedFenceInfo]
//! * [`CodeFencedFenceMeta`][Name::CodeFencedFenceMeta]
//! * [`CodeFencedFenceSequence`][Name::CodeFencedFenceSequence]
//! * [`CodeFlowChunk`][Name::CodeFlowChunk]
//! * [`LineEnding`][Name::LineEnding]
//! * [`MathFlow`][Name::MathFlow]
//! * [`MathFlowFence`][Name::MathFlowFence]
//! * [`MathFlowFenceMeta`][Name::MathFlowFenceMeta]
//! * [`MathFlowFenceSequence`][Name::MathFlowFenceSequence]
//! * [`MathFlowChunk`][Name::MathFlowChunk]
//! * [`SpaceOrTab`][Name::SpaceOrTab]
//!
//! ## References
//!
//! * [`code-fenced.js` in `micromark`](https://github.com/micromark/micromark/blob/main/packages/micromark-core-commonmark/dev/lib/code-fenced.js)
//! * [`micromark-extension-math`](https://github.com/micromark/micromark-extension-math)
//! * [*§ 4.5 Fenced code blocks* in `CommonMark`](https://spec.commonmark.org/0.31/#fenced-code-blocks)
//!
//! > 👉 **Note**: math is not specified anywhere.
//!
//! [flow]: crate::construct::flow
//! [string]: crate::construct::string
//! [text]: crate::construct::text
//! [character_escape]: crate::construct::character_escape
//! [character_reference]: crate::construct::character_reference
//! [code_indented]: crate::construct::code_indented
//! [raw_text]: crate::construct::raw_text
//! [html_code]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-code-element
//! [html_pre]: https://html.spec.whatwg.org/multipage/grouping-content.html#the-pre-element

use crate::construct::partial_space_or_tab::{space_or_tab, space_or_tab_min_max};
use crate::event::{Content, Link, Name};
use crate::state::{Name as StateName, State};
use crate::tokenizer::Tokenizer;
use crate::util::{
    constant::{CODE_FENCED_SEQUENCE_SIZE_MIN, MATH_FLOW_SEQUENCE_SIZE_MIN, TAB_SIZE},
    slice::{Position, Slice},
};

/// Start of raw.
///
/// ```markdown
/// > | ~~~js
///     ^
///   | console.log(1)
///   | ~~~
/// ```
pub fn start(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.parse_state.options.constructs.code_fenced
        || tokenizer.parse_state.options.constructs.math_flow
    {
        if matches!(tokenizer.current, Some(b'\t' | b' ')) {
            tokenizer.attempt(
                State::Next(StateName::RawFlowBeforeSequenceOpen),
                State::Nok,
            );
            return State::Retry(space_or_tab_min_max(
                tokenizer,
                0,
                if tokenizer.parse_state.options.constructs.code_indented {
                    TAB_SIZE - 1
                } else {
                    usize::MAX
                },
            ));
        }

        if matches!(tokenizer.current, Some(b'$' | b'`' | b'~')) {
            return State::Retry(StateName::RawFlowBeforeSequenceOpen);
        }
    }

    State::Nok
}

/// In opening fence, after prefix, at sequence.
///
/// ```markdown
/// > | ~~~js
///     ^
///   | console.log(1)
///   | ~~~
/// ```
pub fn before_sequence_open(tokenizer: &mut Tokenizer) -> State {
    let tail = tokenizer.events.last();
    let mut prefix = 0;

    if let Some(event) = tail {
        if event.name == Name::SpaceOrTab {
            prefix = Slice::from_position(
                tokenizer.parse_state.bytes,
                &Position::from_exit_event(&tokenizer.events, tokenizer.events.len() - 1),
            )
            .len();
        }
    }

    // Code (fenced).
    if (tokenizer.parse_state.options.constructs.code_fenced
        && matches!(tokenizer.current, Some(b'`' | b'~')))
        // Math (flow).
        || (tokenizer.parse_state.options.constructs.math_flow && tokenizer.current == Some(b'$'))
    {
        tokenizer.tokenize_state.marker = tokenizer.current.unwrap();
        tokenizer.tokenize_state.size_c = prefix;
        if tokenizer.tokenize_state.marker == b'$' {
            tokenizer.tokenize_state.token_1 = Name::MathFlow;
            tokenizer.tokenize_state.token_2 = Name::MathFlowFence;
            tokenizer.tokenize_state.token_3 = Name::MathFlowFenceSequence;
            // Math (flow) does not support an `info` part: everything after the
            // opening sequence is the `meta` part.
            tokenizer.tokenize_state.token_5 = Name::MathFlowFenceMeta;
            tokenizer.tokenize_state.token_6 = Name::MathFlowChunk;
        } else {
            tokenizer.tokenize_state.token_1 = Name::CodeFenced;
            tokenizer.tokenize_state.token_2 = Name::CodeFencedFence;
            tokenizer.tokenize_state.token_3 = Name::CodeFencedFenceSequence;
            tokenizer.tokenize_state.token_4 = Name::CodeFencedFenceInfo;
            tokenizer.tokenize_state.token_5 = Name::CodeFencedFenceMeta;
            tokenizer.tokenize_state.token_6 = Name::CodeFlowChunk;
        }

        tokenizer.enter(tokenizer.tokenize_state.token_1.clone());
        tokenizer.enter(tokenizer.tokenize_state.token_2.clone());
        tokenizer.enter(tokenizer.tokenize_state.token_3.clone());
        State::Retry(StateName::RawFlowSequenceOpen)
    } else {
        State::Nok
    }
}

/// In opening fence sequence.
///
/// ```markdown
/// > | ~~~js
///     ^
///   | console.log(1)
///   | ~~~
/// ```
pub fn sequence_open(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.current == Some(tokenizer.tokenize_state.marker) {
        tokenizer.tokenize_state.size += 1;
        tokenizer.consume();
        State::Next(StateName::RawFlowSequenceOpen)
    } else if tokenizer.tokenize_state.size
        < (if tokenizer.tokenize_state.marker == b'$' {
            MATH_FLOW_SEQUENCE_SIZE_MIN
        } else {
            CODE_FENCED_SEQUENCE_SIZE_MIN
        })
    {
        tokenizer.tokenize_state.marker = 0;
        tokenizer.tokenize_state.size_c = 0;
        tokenizer.tokenize_state.size = 0;
        tokenizer.tokenize_state.token_1 = Name::Data;
        tokenizer.tokenize_state.token_2 = Name::Data;
        tokenizer.tokenize_state.token_3 = Name::Data;
        tokenizer.tokenize_state.token_4 = Name::Data;
        tokenizer.tokenize_state.token_5 = Name::Data;
        tokenizer.tokenize_state.token_6 = Name::Data;
        State::Nok
    } else {
        // Math (flow) does not support an `info` part: everything after the
        // opening sequence is the `meta` part.
        let next = if tokenizer.tokenize_state.marker == b'$' {
            StateName::RawFlowMetaBefore
        } else {
            StateName::RawFlowInfoBefore
        };

        if matches!(tokenizer.current, Some(b'\t' | b' ')) {
            tokenizer.exit(tokenizer.tokenize_state.token_3.clone());
            tokenizer.attempt(State::Next(next), State::Nok);
            State::Retry(space_or_tab(tokenizer))
        } else {
            tokenizer.exit(tokenizer.tokenize_state.token_3.clone());
            State::Retry(next)
        }
    }
}
291
292/// In opening fence, after the sequence (and optional whitespace), before info.
293///
294/// ```markdown
295/// > | ~~~js
296/// ^
297/// | console.log(1)
298/// | ~~~
299/// ```
300pub fn info_before(tokenizer: &mut Tokenizer) -> State {
301 match tokenizer.current {
302 None | Some(b'\n') => {
303 tokenizer.exit(tokenizer.tokenize_state.token_2.clone());
304 // Do not form containers.
305 tokenizer.concrete = true;
306 tokenizer.check(
307 State::Next(StateName::RawFlowAtNonLazyBreak),
308 State::Next(StateName::RawFlowAfter),
309 );
310 State::Retry(StateName::NonLazyContinuationStart)
311 }
312 _ => {
313 tokenizer.enter(tokenizer.tokenize_state.token_4.clone());
314 tokenizer.enter_link(
315 Name::Data,
316 Link {
317 previous: None,
318 next: None,
319 content: Content::String,
320 },
321 );
322 State::Retry(StateName::RawFlowInfo)
323 }
324 }
325}
326
327/// In info.
328///
329/// ```markdown
330/// > | ~~~js
331/// ^
332/// | console.log(1)
333/// | ~~~
334/// ```
335pub fn info(tokenizer: &mut Tokenizer) -> State {
336 match tokenizer.current {
337 None | Some(b'\n') => {
338 tokenizer.exit(Name::Data);
339 tokenizer.exit(tokenizer.tokenize_state.token_4.clone());
340 State::Retry(StateName::RawFlowInfoBefore)
341 }
342 Some(b'\t' | b' ') => {
343 tokenizer.exit(Name::Data);
344 tokenizer.exit(tokenizer.tokenize_state.token_4.clone());
345 tokenizer.attempt(State::Next(StateName::RawFlowMetaBefore), State::Nok);
346 State::Retry(space_or_tab(tokenizer))
347 }
348 Some(byte) => {
349 // This looks like code (text) / math (text).
350 // Note: no reason to check for `~`, because 3 of them can‘t be
351 // used as strikethrough in text.
352 if tokenizer.tokenize_state.marker == byte && matches!(byte, b'$' | b'`') {
353 tokenizer.concrete = false;
354 tokenizer.tokenize_state.marker = 0;
355 tokenizer.tokenize_state.size_c = 0;
356 tokenizer.tokenize_state.size = 0;
357 tokenizer.tokenize_state.token_1 = Name::Data;
358 tokenizer.tokenize_state.token_2 = Name::Data;
359 tokenizer.tokenize_state.token_3 = Name::Data;
360 tokenizer.tokenize_state.token_4 = Name::Data;
361 tokenizer.tokenize_state.token_5 = Name::Data;
362 tokenizer.tokenize_state.token_6 = Name::Data;
363 State::Nok
364 } else {
365 tokenizer.consume();
366 State::Next(StateName::RawFlowInfo)
367 }
368 }
369 }
370}
371
372/// In opening fence, after info and whitespace, before meta.
373///
374/// ```markdown
375/// > | ~~~js eval
376/// ^
377/// | console.log(1)
378/// | ~~~
379/// ```
380pub fn meta_before(tokenizer: &mut Tokenizer) -> State {
381 match tokenizer.current {
382 None | Some(b'\n') => State::Retry(StateName::RawFlowInfoBefore),
383 _ => {
384 tokenizer.enter(tokenizer.tokenize_state.token_5.clone());
385 tokenizer.enter_link(
386 Name::Data,
387 Link {
388 previous: None,
389 next: None,
390 content: Content::String,
391 },
392 );
393 State::Retry(StateName::RawFlowMeta)
394 }
395 }
396}
397
398/// In meta.
399///
400/// ```markdown
401/// > | ~~~js eval
402/// ^
403/// | console.log(1)
404/// | ~~~
405/// ```
406pub fn meta(tokenizer: &mut Tokenizer) -> State {
407 match tokenizer.current {
408 None | Some(b'\n') => {
409 tokenizer.exit(Name::Data);
410 tokenizer.exit(tokenizer.tokenize_state.token_5.clone());
411 State::Retry(StateName::RawFlowInfoBefore)
412 }
413 Some(byte) => {
414 // This looks like code (text) / math (text).
415 // Note: no reason to check for `~`, because 3 of them can‘t be
416 // used as strikethrough in text.
417 if tokenizer.tokenize_state.marker == byte && matches!(byte, b'$' | b'`') {
418 tokenizer.concrete = false;
419 tokenizer.tokenize_state.marker = 0;
420 tokenizer.tokenize_state.size_c = 0;
421 tokenizer.tokenize_state.size = 0;
422 tokenizer.tokenize_state.token_1 = Name::Data;
423 tokenizer.tokenize_state.token_2 = Name::Data;
424 tokenizer.tokenize_state.token_3 = Name::Data;
425 tokenizer.tokenize_state.token_4 = Name::Data;
426 tokenizer.tokenize_state.token_5 = Name::Data;
427 tokenizer.tokenize_state.token_6 = Name::Data;
428 State::Nok
429 } else {
430 tokenizer.consume();
431 State::Next(StateName::RawFlowMeta)
432 }
433 }
434 }
435}
436
437/// At eol/eof in raw, before a non-lazy closing fence or content.
438///
439/// ```markdown
440/// > | ~~~js
441/// ^
442/// > | console.log(1)
443/// ^
444/// | ~~~
445/// ```
446pub fn at_non_lazy_break(tokenizer: &mut Tokenizer) -> State {
447 tokenizer.attempt(
448 State::Next(StateName::RawFlowAfter),
449 State::Next(StateName::RawFlowContentBefore),
450 );
451 tokenizer.enter(Name::LineEnding);
452 tokenizer.consume();
453 tokenizer.exit(Name::LineEnding);
454 State::Next(StateName::RawFlowCloseStart)
455}
456
457/// Before closing fence, at optional whitespace.
458///
459/// ```markdown
460/// | ~~~js
461/// | console.log(1)
462/// > | ~~~
463/// ^
464/// ```
465pub fn close_start(tokenizer: &mut Tokenizer) -> State {
466 tokenizer.enter(tokenizer.tokenize_state.token_2.clone());
467
468 if matches!(tokenizer.current, Some(b'\t' | b' ')) {
469 tokenizer.attempt(
470 State::Next(StateName::RawFlowBeforeSequenceClose),
471 State::Nok,
472 );
473
474 State::Retry(space_or_tab_min_max(
475 tokenizer,
476 0,
477 if tokenizer.parse_state.options.constructs.code_indented {
478 TAB_SIZE - 1
479 } else {
480 usize::MAX
481 },
482 ))
483 } else {
484 State::Retry(StateName::RawFlowBeforeSequenceClose)
485 }
486}
487
488/// In closing fence, after optional whitespace, at sequence.
489///
490/// ```markdown
491/// | ~~~js
492/// | console.log(1)
493/// > | ~~~
494/// ^
495/// ```
496pub fn before_sequence_close(tokenizer: &mut Tokenizer) -> State {
497 if tokenizer.current == Some(tokenizer.tokenize_state.marker) {
498 tokenizer.enter(tokenizer.tokenize_state.token_3.clone());
499 State::Retry(StateName::RawFlowSequenceClose)
500 } else {
501 State::Nok
502 }
503}
504
505/// In closing fence sequence.
506///
507/// ```markdown
508/// | ~~~js
509/// | console.log(1)
510/// > | ~~~
511/// ^
512/// ```
513pub fn sequence_close(tokenizer: &mut Tokenizer) -> State {
514 if tokenizer.current == Some(tokenizer.tokenize_state.marker) {
515 tokenizer.tokenize_state.size_b += 1;
516 tokenizer.consume();
517 State::Next(StateName::RawFlowSequenceClose)
518 } else if tokenizer.tokenize_state.size_b >= tokenizer.tokenize_state.size {
519 tokenizer.tokenize_state.size_b = 0;
520 tokenizer.exit(tokenizer.tokenize_state.token_3.clone());
521
522 if matches!(tokenizer.current, Some(b'\t' | b' ')) {
523 tokenizer.attempt(
524 State::Next(StateName::RawFlowAfterSequenceClose),
525 State::Nok,
526 );
527 State::Retry(space_or_tab(tokenizer))
528 } else {
529 State::Retry(StateName::RawFlowAfterSequenceClose)
530 }
531 } else {
532 tokenizer.tokenize_state.size_b = 0;
533 State::Nok
534 }
535}
536
537/// After closing fence sequence, after optional whitespace.
538///
539/// ```markdown
540/// | ~~~js
541/// | console.log(1)
542/// > | ~~~
543/// ^
544/// ```
545pub fn sequence_close_after(tokenizer: &mut Tokenizer) -> State {
546 match tokenizer.current {
547 None | Some(b'\n') => {
548 tokenizer.exit(tokenizer.tokenize_state.token_2.clone());
549 State::Ok
550 }
551 _ => State::Nok,
552 }
553}
554
555/// Before raw content, not a closing fence, at eol.
556///
557/// ```markdown
558/// | ~~~js
559/// > | console.log(1)
560/// ^
561/// | ~~~
562/// ```
563pub fn content_before(tokenizer: &mut Tokenizer) -> State {
564 tokenizer.enter(Name::LineEnding);
565 tokenizer.consume();
566 tokenizer.exit(Name::LineEnding);
567 State::Next(StateName::RawFlowContentStart)
568}
569
570/// Before raw content, not a closing fence.
571///
572/// ```markdown
573/// | ~~~js
574/// > | console.log(1)
575/// ^
576/// | ~~~
577/// ```
578pub fn content_start(tokenizer: &mut Tokenizer) -> State {
579 if matches!(tokenizer.current, Some(b'\t' | b' ')) {
580 tokenizer.attempt(
581 State::Next(StateName::RawFlowBeforeContentChunk),
582 State::Nok,
583 );
584 State::Retry(space_or_tab_min_max(
585 tokenizer,
586 0,
587 tokenizer.tokenize_state.size_c,
588 ))
589 } else {
590 State::Retry(StateName::RawFlowBeforeContentChunk)
591 }
592}
593
594/// Before raw content, after optional prefix.
595///
596/// ```markdown
597/// | ~~~js
598/// > | console.log(1)
599/// ^
600/// | ~~~
601/// ```
602pub fn before_content_chunk(tokenizer: &mut Tokenizer) -> State {
603 match tokenizer.current {
604 None | Some(b'\n') => {
605 tokenizer.check(
606 State::Next(StateName::RawFlowAtNonLazyBreak),
607 State::Next(StateName::RawFlowAfter),
608 );
609 State::Retry(StateName::NonLazyContinuationStart)
610 }
611 _ => {
612 tokenizer.enter(tokenizer.tokenize_state.token_6.clone());
613 State::Retry(StateName::RawFlowContentChunk)
614 }
615 }
616}
617
618/// In raw content.
619///
620/// ```markdown
621/// | ~~~js
622/// > | console.log(1)
623/// ^^^^^^^^^^^^^^
624/// | ~~~
625/// ```
626pub fn content_chunk(tokenizer: &mut Tokenizer) -> State {
627 match tokenizer.current {
628 None | Some(b'\n') => {
629 tokenizer.exit(tokenizer.tokenize_state.token_6.clone());
630 State::Retry(StateName::RawFlowBeforeContentChunk)
631 }
632 _ => {
633 tokenizer.consume();
634 State::Next(StateName::RawFlowContentChunk)
635 }
636 }
637}
638
639/// After raw.
640///
641/// ```markdown
642/// | ~~~js
643/// | console.log(1)
644/// > | ~~~
645/// ^
646/// ```
647pub fn after(tokenizer: &mut Tokenizer) -> State {
648 tokenizer.exit(tokenizer.tokenize_state.token_1.clone());
649 tokenizer.tokenize_state.marker = 0;
650 tokenizer.tokenize_state.size_c = 0;
651 tokenizer.tokenize_state.size = 0;
652 tokenizer.tokenize_state.token_1 = Name::Data;
653 tokenizer.tokenize_state.token_2 = Name::Data;
654 tokenizer.tokenize_state.token_3 = Name::Data;
655 tokenizer.tokenize_state.token_4 = Name::Data;
656 tokenizer.tokenize_state.token_5 = Name::Data;
657 tokenizer.tokenize_state.token_6 = Name::Data;
658 // Feel free to interrupt.
659 tokenizer.interrupt = false;
660 // No longer concrete.
661 tokenizer.concrete = false;
662 State::Ok
663}