src/construct/raw_text.rs at hack · crashkeys.dev/markdown-rs

Markdown parser fork with extended syntax for personal use.
Titus Wormer Refactor docs 11mo ago
e0ca3f6c
//! Raw (text) occurs in the [text][] content type.
//! It forms code (text) and math (text).
//!
//! ## Grammar
//!
//! Raw (text) forms with the following BNF
//! (<small>see [construct][crate::construct] for character groups</small>):
//!
//! ```bnf
//! ; Restriction: the number of markers in the closing sequence must be equal
//! ; to the number of markers in the opening sequence.
//! raw_text ::= sequence 1*byte sequence
//!
//! ; Restriction: not preceded or followed by the same marker.
//! sequence ::= 1*'`' | 1*'$'
//! ```
//!
//! The above grammar shows that it is not possible to create empty raw (text).
//! It is possible to include the sequence marker (grave accent for code,
//! dollar for math) in raw (text), by wrapping it in bigger or smaller
//! sequences:
//!
//! ```markdown
//! Include more: `a``b` or include less: ``a`b``.
//! ```
//!
//! It is also possible to include just one marker:
//!
//! ```markdown
//! Include just one: `` ` ``.
//! ```
//!
//! Sequences are “greedy”, in that they cannot be preceded or followed by
//! more markers.
//! To illustrate:
//!
//! ```markdown
//! Not code: ``x`.
//!
//! Not code: `x``.
//!
//! Escapes work, this is code: \``x`.
//!
//! Escapes work, this is code: `x`\`.
//! ```
//!
//! Yields:
//!
//! ```html
//! <p>Not code: ``x`.</p>
//! <p>Not code: `x``.</p>
//! <p>Escapes work, this is code: `<code>x</code>.</p>
//! <p>Escapes work, this is code: <code>x</code>`.</p>
//! ```
//!
//! In the earlier “just one marker” example, the padding spaces do not end up
//! in the output: when turning markdown into HTML, the first and last space,
//! if both exist and there is also a non-space in the code, are removed.
//! Line endings, at that stage, are considered as spaces.
//!
//! In markdown, it is possible to create code or math with the
//! [raw (flow)][raw_flow] (or [code (indented)][code_indented]) constructs
//! in the [flow][] content type.
//!
//! ## HTML
//!
//! Code (text) relates to the `<code>` element in HTML.
//! See [*§ 4.5.15 The `code` element*][html_code] in the HTML spec for more
//! info.
//!
//! Math (text) does not relate to HTML elements.
//! `MathML`, which is sort of like SVG but for math, exists, but it doesn’t
//! work well and isn’t widely supported.
//! Instead, it is recommended to use client side JavaScript with something like
//! `KaTeX` or `MathJax` to process the math.
//! For that, the math is compiled as a `<code>` element with two classes:
//! `language-math` and `math-inline`.
//! Client side JavaScript can look for these classes to process them further.
//!
//! When turning markdown into HTML, each line ending in raw (text) is turned
//! into a space.
//!
//! ## Recommendations
//!
//! When authoring markdown with math, keep in mind that math doesn’t work in
//! most places.
//! Notably, GitHub currently processes math with a client-side, regex-based
//! approach that is unreliable.
//! But on your own (math-heavy?) site it can be great!
//! You can set [`parse_options.math_text_single_dollar: false`][parse_options]
//! to reduce false positives: it prevents a single dollar from opening math,
//! and thus prevents normal dollars in text from being seen as math.
//!
//! ## Tokens
//!
//! * [`CodeText`][Name::CodeText]
//! * [`CodeTextData`][Name::CodeTextData]
//! * [`CodeTextSequence`][Name::CodeTextSequence]
//! * [`MathText`][Name::MathText]
//! * [`MathTextData`][Name::MathTextData]
//! * [`MathTextSequence`][Name::MathTextSequence]
//! * [`LineEnding`][Name::LineEnding]
//!
//! ## References
//!
//! * [`code-text.js` in `micromark`](https://github.com/micromark/micromark/blob/main/packages/micromark-core-commonmark/dev/lib/code-text.js)
//! * [`micromark-extension-math`](https://github.com/micromark/micromark-extension-math)
//! * [*§ 6.1 Code spans* in `CommonMark`](https://spec.commonmark.org/0.31/#code-spans)
//!
//! > 👉 **Note**: math is not specified anywhere.
//!
//! [flow]: crate::construct::flow
//! [text]: crate::construct::text
//! [code_indented]: crate::construct::code_indented
//! [raw_flow]: crate::construct::raw_flow
//! [html_code]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-code-element
//! [parse_options]: crate::ParseOptions

use crate::event::Name;
use crate::state::{Name as StateName, State};
use crate::tokenizer::Tokenizer;

/// Start of raw (text).
///
/// ```markdown
/// > | `a`
///     ^
/// > | \`a`
///      ^
/// ```
pub fn start(tokenizer: &mut Tokenizer) -> State {
    // Code (text):
    if ((tokenizer.parse_state.options.constructs.code_text && tokenizer.current == Some(b'`'))
        // Math (text):
        || (tokenizer.parse_state.options.constructs.math_text && tokenizer.current == Some(b'$')))
        // Not the same marker (except when escaped).
        && (tokenizer.previous != tokenizer.current
            || (!tokenizer.events.is_empty()
                && tokenizer.events[tokenizer.events.len() - 1].name == Name::CharacterEscape))
    {
        let marker = tokenizer.current.unwrap();
        if marker == b'`' {
            tokenizer.tokenize_state.token_1 = Name::CodeText;
            tokenizer.tokenize_state.token_2 = Name::CodeTextSequence;
            tokenizer.tokenize_state.token_3 = Name::CodeTextData;
        } else {
            tokenizer.tokenize_state.token_1 = Name::MathText;
            tokenizer.tokenize_state.token_2 = Name::MathTextSequence;
            tokenizer.tokenize_state.token_3 = Name::MathTextData;
        }
        tokenizer.tokenize_state.marker = marker;
        tokenizer.enter(tokenizer.tokenize_state.token_1.clone());
        tokenizer.enter(tokenizer.tokenize_state.token_2.clone());
        State::Retry(StateName::RawTextSequenceOpen)
    } else {
        State::Nok
    }
}

/// In opening sequence.
///
/// ```markdown
/// > | `a`
///     ^
/// ```
pub fn sequence_open(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.current == Some(tokenizer.tokenize_state.marker) {
        tokenizer.tokenize_state.size += 1;
        tokenizer.consume();
        State::Next(StateName::RawTextSequenceOpen)
    }
    // Not enough markers in the sequence.
    else if tokenizer.tokenize_state.marker == b'$'
        && tokenizer.tokenize_state.size == 1
        && !tokenizer.parse_state.options.math_text_single_dollar
    {
        tokenizer.tokenize_state.marker = 0;
        tokenizer.tokenize_state.size = 0;
        tokenizer.tokenize_state.token_1 = Name::Data;
        tokenizer.tokenize_state.token_2 = Name::Data;
        tokenizer.tokenize_state.token_3 = Name::Data;
        State::Nok
    } else {
        tokenizer.exit(tokenizer.tokenize_state.token_2.clone());
        State::Retry(StateName::RawTextBetween)
    }
}

/// Between sequences, data, and line endings.
///
/// ```markdown
/// > | `a`
///      ^^
/// ```
pub fn between(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        None => {
            tokenizer.tokenize_state.marker = 0;
            tokenizer.tokenize_state.size = 0;
            tokenizer.tokenize_state.token_1 = Name::Data;
            tokenizer.tokenize_state.token_2 = Name::Data;
            tokenizer.tokenize_state.token_3 = Name::Data;
            State::Nok
        }
        Some(b'\n') => {
            tokenizer.enter(Name::LineEnding);
            tokenizer.consume();
            tokenizer.exit(Name::LineEnding);
            State::Next(StateName::RawTextBetween)
        }
        _ => {
            if tokenizer.current == Some(tokenizer.tokenize_state.marker) {
                tokenizer.enter(tokenizer.tokenize_state.token_2.clone());
                State::Retry(StateName::RawTextSequenceClose)
            } else {
                tokenizer.enter(tokenizer.tokenize_state.token_3.clone());
                State::Retry(StateName::RawTextData)
            }
        }
    }
}

/// In data.
///
/// ```markdown
/// > | `a`
///      ^
/// ```
pub fn data(tokenizer: &mut Tokenizer) -> State {
    if matches!(tokenizer.current, None | Some(b'\n'))
        || tokenizer.current == Some(tokenizer.tokenize_state.marker)
    {
        tokenizer.exit(tokenizer.tokenize_state.token_3.clone());
        State::Retry(StateName::RawTextBetween)
    } else {
        tokenizer.consume();
        State::Next(StateName::RawTextData)
    }
}

/// In closing sequence.
///
/// ```markdown
/// > | `a`
///       ^
/// ```
pub fn sequence_close(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.current == Some(tokenizer.tokenize_state.marker) {
        tokenizer.tokenize_state.size_b += 1;
        tokenizer.consume();
        State::Next(StateName::RawTextSequenceClose)
    } else {
        tokenizer.exit(tokenizer.tokenize_state.token_2.clone());
        if tokenizer.tokenize_state.size == tokenizer.tokenize_state.size_b {
            tokenizer.exit(tokenizer.tokenize_state.token_1.clone());
            tokenizer.tokenize_state.marker = 0;
            tokenizer.tokenize_state.size = 0;
            tokenizer.tokenize_state.size_b = 0;
            tokenizer.tokenize_state.token_1 = Name::Data;
            tokenizer.tokenize_state.token_2 = Name::Data;
            tokenizer.tokenize_state.token_3 = Name::Data;
            State::Ok
        } else {
            // More or fewer markers than in the opening sequence: mark as data.
            let len = tokenizer.events.len();
            tokenizer.events[len - 2].name = tokenizer.tokenize_state.token_3.clone();
            tokenizer.events[len - 1].name = tokenizer.tokenize_state.token_3.clone();
            tokenizer.tokenize_state.size_b = 0;
            State::Retry(StateName::RawTextBetween)
        }
    }
}
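
The matching rules documented in this file (greedy sequences, a closing sequence of exactly the same size, and the space trimming applied when compiling to HTML) can be tried in isolation. The following is a minimal sketch, not the state machine above: `find_raw_text` and `trim_raw_text` are hypothetical helpers that do not exist in `markdown-rs`, and the sketch assumes a single line of input with no character escapes.

```rust
/// Hypothetical helper (not part of markdown-rs): find the first raw (text)
/// span that uses `marker` and return the byte range of its content.
/// Assumption: character escapes are ignored in this sketch.
fn find_raw_text(input: &[u8], marker: u8) -> Option<(usize, usize)> {
    let mut index = 0;
    while index < input.len() {
        if input[index] != marker {
            index += 1;
            continue;
        }
        // Sequences are greedy: measure the whole run of markers.
        let open_start = index;
        while index < input.len() && input[index] == marker {
            index += 1;
        }
        let size = index - open_start;
        let content_start = index;
        // Scan for a closing run with exactly `size` markers.
        let mut scan = content_start;
        while scan < input.len() {
            if input[scan] != marker {
                scan += 1;
                continue;
            }
            let close_start = scan;
            while scan < input.len() && input[scan] == marker {
                scan += 1;
            }
            if scan - close_start == size {
                return Some((content_start, close_start));
            }
            // A run of a different size is data: keep scanning.
        }
        // No closing sequence: the opening run is plain text.
        // `index` already points after it, so the outer loop moves on.
    }
    None
}

/// Hypothetical helper: apply the trimming rule described in the module docs.
/// Line endings count as spaces; one leading and one trailing space are
/// removed if both exist and the content has a non-space.
fn trim_raw_text(content: &str) -> String {
    let spaced = content.replace('\n', " ");
    let padded = spaced.len() >= 2 && spaced.starts_with(' ') && spaced.ends_with(' ');
    let has_non_space = spaced.bytes().any(|byte| byte != b' ');
    if padded && has_non_space {
        spaced[1..spaced.len() - 1].to_string()
    } else {
        spaced
    }
}

fn main() {
    let input = "Include just one: `` ` ``.";
    // The markers are ASCII, so the byte offsets are valid `str` boundaries.
    if let Some((start, end)) = find_raw_text(input.as_bytes(), b'`') {
        println!("{:?}", trim_raw_text(&input[start..end]));
    }
}
```

Unlike the tokenizer, which rewrites a mismatched closing sequence into data events and continues, this sketch simply keeps scanning; the accepted spans should be the same for the examples in the module docs, but that equivalence is an assumption rather than something verified against the crate.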