Markdown parser fork with extended syntax for personal use.
//! Autolink occurs in the [text][] content type.
//!
//! ## Grammar
//!
//! Autolink forms with the following BNF
//! (<small>see [construct][crate::construct] for character groups</small>):
//!
//! ```bnf
//! autolink ::= '<' (url | email) '>'
//!
//! url ::= protocol *url_byte
//! protocol ::= ascii_alphabetic 0*31(protocol_byte) ':'
//! protocol_byte ::= '+' | '-' | '.' | ascii_alphanumeric
//! url_byte ::= byte - ascii_control - ' '
//!
//! email ::= 1*ascii_atext '@' email_domain *('.' email_domain)
//! ; Restriction: up to (including) 63 characters are allowed in each domain.
//! email_domain ::= ascii_alphanumeric *(ascii_alphanumeric | '-' ascii_alphanumeric)
//!
//! ascii_atext ::= ascii_alphanumeric | '!' | '"' | '#' | '$' | '%' | '&' | '\'' | '*' | '+' | '-' | '/' | '=' | '?' | '^' | '_' | '`' | '{' | '|' | '}' | '~'
//! ```
//!
//! The maximum allowed size of a scheme is `31` (inclusive), which is defined
//! in [`AUTOLINK_SCHEME_SIZE_MAX`][].
//! The maximum allowed size of a domain is `63` (inclusive), which is defined
//! in [`AUTOLINK_DOMAIN_SIZE_MAX`][].
//!
//! The grammar for autolinks is quite strict and prohibits the use of ASCII control
//! characters or spaces.
//! To use non-ascii characters and otherwise impossible characters in URLs,
//! you can use percent encoding:
//!
//! ```markdown
//! <https://example.com/alpha%20bravo>
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="https://example.com/alpha%20bravo">https://example.com/alpha%20bravo</a></p>
//! ```
//!
//! There are several cases where incorrect encoding of URLs would, in other
//! languages, result in a parse error.
//! In markdown, there are no errors, and URLs are normalized.
//! In addition, many characters are percent encoded
//! ([`sanitize_uri`][sanitize_uri]).
//! For example:
//!
//! ```markdown
//! <https://a👍b%>
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="https://a%F0%9F%91%8Db%25">https://a👍b%</a></p>
//! ```
//!
//! Interestingly, there are a couple of things that are valid autolinks in
//! markdown but in HTML would be valid tags, such as `<svg:rect>` and
//! `<xml:lang/>`.
//! However, because `CommonMark` employs a naïve HTML parsing algorithm, those
//! are not considered HTML.
//!
//! While `CommonMark` restricts links from occurring in other links in the
//! case of labels (see [label end][label_end]), this restriction is not in
//! place for autolinks inside labels:
//!
//! ```markdown
//! [<https://example.com>](#)
//! ```
//!
//! Yields:
//!
//! ```html
//! <p><a href="#"><a href="https://example.com">https://example.com</a></a></p>
//! ```
//!
//! The generated output, in this case, is invalid according to HTML.
//! When a browser sees that markup, it will instead parse it as:
//!
//! ```html
//! <p><a href="#"></a><a href="https://example.com">https://example.com</a></p>
//! ```
//!
//! ## HTML
//!
//! Autolinks relate to the `<a>` element in HTML.
//! See [*§ 4.5.1 The `a` element*][html_a] in the HTML spec for more info.
//! When an email autolink is used (so, without a protocol), the string
//! `mailto:` is prepended before the email, when generating the `href`
//! attribute of the hyperlink.
//!
//! ## Recommendation
//!
//! It is recommended to use labels ([label start link][label_start_link],
//! [label end][label_end]), either with a resource or a definition
//! ([definition][]), instead of autolinks, as those allow more characters in
//! URLs, and allow relative URLs and `www.` URLs.
//! They also allow for descriptive text to explain the URL in prose.
//!
//! ## Tokens
//!
//! * [`Autolink`][Name::Autolink]
//! * [`AutolinkEmail`][Name::AutolinkEmail]
//! * [`AutolinkMarker`][Name::AutolinkMarker]
//! * [`AutolinkProtocol`][Name::AutolinkProtocol]
//!
//! ## References
//!
//! * [`autolink.js` in `micromark`](https://github.com/micromark/micromark/blob/main/packages/micromark-core-commonmark/dev/lib/autolink.js)
//! * [*§ 6.4 Autolinks* in `CommonMark`](https://spec.commonmark.org/0.31/#autolinks)
//!
//! [text]: crate::construct::text
//! [definition]: crate::construct::definition
//! [label_start_link]: crate::construct::label_start_link
//! [label_end]: crate::construct::label_end
//! [autolink_scheme_size_max]: crate::util::constant::AUTOLINK_SCHEME_SIZE_MAX
//! [autolink_domain_size_max]: crate::util::constant::AUTOLINK_DOMAIN_SIZE_MAX
//! [sanitize_uri]: crate::util::sanitize_uri
//! [html_a]: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-a-element

use crate::event::Name;
use crate::state::{Name as StateName, State};
use crate::tokenizer::Tokenizer;
use crate::util::constant::{AUTOLINK_DOMAIN_SIZE_MAX, AUTOLINK_SCHEME_SIZE_MAX};

/// Start of an autolink.
///
/// ```markdown
/// > | a<https://example.com>b
///      ^
/// > | a<user@example.com>b
///      ^
/// ```
pub fn start(tokenizer: &mut Tokenizer) -> State {
    if tokenizer.parse_state.options.constructs.autolink && tokenizer.current == Some(b'<') {
        tokenizer.enter(Name::Autolink);
        tokenizer.enter(Name::AutolinkMarker);
        tokenizer.consume();
        tokenizer.exit(Name::AutolinkMarker);
        tokenizer.enter(Name::AutolinkProtocol);
        State::Next(StateName::AutolinkOpen)
    } else {
        State::Nok
    }
}

/// After `<`, at protocol or atext.
///
/// ```markdown
/// > | a<https://example.com>b
///       ^
/// > | a<user@example.com>b
///       ^
/// ```
pub fn open(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // ASCII alphabetic.
        Some(b'A'..=b'Z' | b'a'..=b'z') => {
            tokenizer.consume();
            State::Next(StateName::AutolinkSchemeOrEmailAtext)
        }
        Some(b'@') => State::Nok,
        _ => State::Retry(StateName::AutolinkEmailAtext),
    }
}

/// At second byte of protocol or atext.
///
/// ```markdown
/// > | a<https://example.com>b
///        ^
/// > | a<user@example.com>b
///        ^
/// ```
pub fn scheme_or_email_atext(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // ASCII alphanumeric and `+`, `-`, and `.`.
        Some(b'+' | b'-' | b'.' | b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z') => {
            // Count the previous alphabetical from `open` too.
            tokenizer.tokenize_state.size = 1;
            State::Retry(StateName::AutolinkSchemeInsideOrEmailAtext)
        }
        _ => State::Retry(StateName::AutolinkEmailAtext),
    }
}

/// In ambiguous protocol or atext.
///
/// ```markdown
/// > | a<https://example.com>b
///        ^
/// > | a<user@example.com>b
///        ^
/// ```
pub fn scheme_inside_or_email_atext(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b':') => {
            tokenizer.consume();
            tokenizer.tokenize_state.size = 0;
            State::Next(StateName::AutolinkUrlInside)
        }
        // ASCII alphanumeric and `+`, `-`, and `.`.
        Some(b'+' | b'-' | b'.' | b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z')
            if tokenizer.tokenize_state.size < AUTOLINK_SCHEME_SIZE_MAX =>
        {
            tokenizer.consume();
            tokenizer.tokenize_state.size += 1;
            State::Next(StateName::AutolinkSchemeInsideOrEmailAtext)
        }
        _ => {
            tokenizer.tokenize_state.size = 0;
            State::Retry(StateName::AutolinkEmailAtext)
        }
    }
}

/// After protocol, in URL.
///
/// ```markdown
/// > | a<https://example.com>b
///             ^
/// ```
pub fn url_inside(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'>') => {
            tokenizer.exit(Name::AutolinkProtocol);
            tokenizer.enter(Name::AutolinkMarker);
            tokenizer.consume();
            tokenizer.exit(Name::AutolinkMarker);
            tokenizer.exit(Name::Autolink);
            State::Ok
        }
        // ASCII control, space, or `<`.
        None | Some(b'\0'..=0x1F | b' ' | b'<' | 0x7F) => State::Nok,
        Some(_) => {
            tokenizer.consume();
            State::Next(StateName::AutolinkUrlInside)
        }
    }
}

/// In email atext.
///
/// ```markdown
/// > | a<user.name@example.com>b
///           ^
/// ```
pub fn email_atext(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'@') => {
            tokenizer.consume();
            State::Next(StateName::AutolinkEmailAtSignOrDot)
        }
        // ASCII atext.
        //
        // atext is an ASCII alphanumeric (see [`is_ascii_alphanumeric`][]), or
        // a byte in the inclusive ranges U+0023 NUMBER SIGN (`#`) to U+0027
        // APOSTROPHE (`'`), U+002A ASTERISK (`*`), U+002B PLUS SIGN (`+`),
        // U+002D DASH (`-`), U+002F SLASH (`/`), U+003D EQUALS TO (`=`),
        // U+003F QUESTION MARK (`?`), U+005E CARET (`^`) to U+0060 GRAVE
        // ACCENT (`` ` ``), or U+007B LEFT CURLY BRACE (`{`) to U+007E TILDE
        // (`~`).
        //
        // See:
        // **\[RFC5322]**:
        // [Internet Message Format](https://tools.ietf.org/html/rfc5322).
        // P. Resnick.
        // IETF.
        //
        // [`is_ascii_alphanumeric`]: char::is_ascii_alphanumeric
        Some(
            b'#'..=b'\'' | b'*' | b'+' | b'-'..=b'9' | b'=' | b'?' | b'A'..=b'Z' | b'^'..=b'~',
        ) => {
            tokenizer.consume();
            State::Next(StateName::AutolinkEmailAtext)
        }
        _ => State::Nok,
    }
}

/// In label, after at-sign or dot.
///
/// ```markdown
/// > | a<user.name@example.com>b
///                 ^       ^
/// ```
pub fn email_at_sign_or_dot(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // ASCII alphanumeric.
        Some(b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z') => {
            State::Retry(StateName::AutolinkEmailValue)
        }
        _ => State::Nok,
    }
}

/// In label, where `.` and `>` are allowed.
///
/// ```markdown
/// > | a<user.name@example.com>b
///                  ^
/// ```
pub fn email_label(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        Some(b'.') => {
            tokenizer.consume();
            tokenizer.tokenize_state.size = 0;
            State::Next(StateName::AutolinkEmailAtSignOrDot)
        }
        Some(b'>') => {
            let index = tokenizer.events.len();
            tokenizer.exit(Name::AutolinkProtocol);
            // Change the event name.
            tokenizer.events[index - 1].name = Name::AutolinkEmail;
            tokenizer.events[index].name = Name::AutolinkEmail;
            tokenizer.enter(Name::AutolinkMarker);
            tokenizer.consume();
            tokenizer.exit(Name::AutolinkMarker);
            tokenizer.exit(Name::Autolink);
            tokenizer.tokenize_state.size = 0;
            State::Ok
        }
        _ => State::Retry(StateName::AutolinkEmailValue),
    }
}

/// In label, where `.` and `>` are *not* allowed.
///
/// Though, this is also used in `email_label` to parse other values.
///
/// ```markdown
/// > | a<user.name@ex-ample.com>b
///                    ^
/// ```
pub fn email_value(tokenizer: &mut Tokenizer) -> State {
    match tokenizer.current {
        // ASCII alphanumeric or `-`.
        Some(b'-' | b'0'..=b'9' | b'A'..=b'Z' | b'a'..=b'z')
            if tokenizer.tokenize_state.size < AUTOLINK_DOMAIN_SIZE_MAX =>
        {
            let name = if matches!(tokenizer.current, Some(b'-')) {
                StateName::AutolinkEmailValue
            } else {
                StateName::AutolinkEmailLabel
            };
            tokenizer.tokenize_state.size += 1;
            tokenizer.consume();
            State::Next(name)
        }
        _ => {
            tokenizer.tokenize_state.size = 0;
            State::Nok
        }
    }
}
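
// Illustration only, not part of this crate: the state machine above can be
// hard to follow as a whole, so here is a minimal standalone sketch of the
// same acceptance rules applied to a complete `<...>` string.
// `Autolink`, `classify`, `is_url`, `is_email`, `is_atext`, and
// `is_domain_label` are hypothetical names. `SCHEME_MAX` (2 to 32 bytes, as
// in the CommonMark spec) and `DOMAIN_LABEL_MAX` (63, as documented above)
// are assumptions, not the values of the crate's constants. The atext set
// copies the byte ranges matched in `email_atext`.

```rust
/// Which kind of autolink a string is (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum Autolink {
    Url,
    Email,
}

/// Assumed limits; see the note above about where they come from.
const SCHEME_MAX: usize = 32;
const DOMAIN_LABEL_MAX: usize = 63;

/// Classify a whole string such as `<https://example.com>`.
/// Returns `None` if the string is not exactly one autolink.
pub fn classify(input: &str) -> Option<Autolink> {
    let inner = input.strip_prefix('<')?.strip_suffix('>')?;
    let bytes = inner.as_bytes();
    if is_url(bytes) {
        Some(Autolink::Url)
    } else if is_email(bytes) {
        Some(Autolink::Email)
    } else {
        None
    }
}

/// Scheme, then `:`, then any bytes except ASCII control, space, `<`, `>`.
fn is_url(bytes: &[u8]) -> bool {
    let Some(colon) = bytes.iter().position(|&b| b == b':') else {
        return false;
    };
    let (scheme, rest) = (&bytes[..colon], &bytes[colon + 1..]);
    (2..=SCHEME_MAX).contains(&scheme.len())
        && scheme[0].is_ascii_alphabetic()
        && scheme[1..]
            .iter()
            .all(|&b| b.is_ascii_alphanumeric() || matches!(b, b'+' | b'-' | b'.'))
        && rest
            .iter()
            .all(|&b| !b.is_ascii_control() && !matches!(b, b' ' | b'<' | b'>'))
}

/// One or more atext bytes, `@`, then dot-separated domain labels.
fn is_email(bytes: &[u8]) -> bool {
    let Some(at) = bytes.iter().position(|&b| b == b'@') else {
        return false;
    };
    let (local, domain) = (&bytes[..at], &bytes[at + 1..]);
    !local.is_empty()
        && local.iter().all(|&b| is_atext(b))
        && domain.split(|&b| b == b'.').all(|label| is_domain_label(label))
}

/// Same byte ranges as the match arm in `email_atext`.
fn is_atext(b: u8) -> bool {
    matches!(
        b,
        b'#'..=b'\'' | b'*' | b'+' | b'-'..=b'9' | b'=' | b'?' | b'A'..=b'Z' | b'^'..=b'~'
    )
}

/// A label starts and ends with an alphanumeric and may contain `-` inside.
fn is_domain_label(label: &[u8]) -> bool {
    (1..=DOMAIN_LABEL_MAX).contains(&label.len())
        && label[0].is_ascii_alphanumeric()
        && label[label.len() - 1].is_ascii_alphanumeric()
        && label.iter().all(|&b| b.is_ascii_alphanumeric() || b == b'-')
}
```

// Under these assumptions, `classify("<svg:rect>")` is a URL autolink and
// `classify("<a:b>")` is rejected, because a one-letter scheme falls through
// to the email path, where `:` is not atext.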