OCaml HTML5 parser/serialiser based on Python's JustHTML

ocamldoc-to-spec

Changed files
+1627 -287
lib
+530 -87
lib/dom/node.mli
···

 (** HTML5 DOM Node Types and Operations

-    This module provides the DOM node representation used by the HTML5 parser.
-    Nodes form a tree structure representing HTML documents. The type follows
-    the WHATWG HTML5 specification for document structure.
+    This module provides the DOM (Document Object Model) node representation
+    used by the HTML5 parser. The DOM is a programming interface that
+    represents an HTML document as a tree of nodes, where each node represents
+    part of the document (an element, text content, comment, etc.).
+
+    {2 What is the DOM?}
+
+    When an HTML parser processes markup like [<p>Hello <b>world</b></p>], it
+    doesn't store the text directly. Instead, it builds a tree structure in
+    memory:
+
+    {v
+    Document
+    └── html
+        └── body
+            └── p
+                ├── #text "Hello "
+                └── b
+                    └── #text "world"
+    v}
+
+    This tree is the DOM. Each box in the tree is a {i node}. Programs can
+    traverse and modify this tree to read or change the document.
+
+    @see <https://html.spec.whatwg.org/multipage/dom.html>
+      WHATWG: The elements of HTML (DOM chapter)

     {2 Node Types}

     The HTML5 DOM includes several node types, all represented by the same
     record type with different field usage:

-    - {b Element nodes}: Regular HTML elements like [<div>], [<p>], [<span>]
-    - {b Text nodes}: Text content within elements
-    - {b Comment nodes}: HTML comments [<!-- comment -->]
-    - {b Document nodes}: The root node representing the entire document
-    - {b Document fragment nodes}: A lightweight container (used for templates)
-    - {b Doctype nodes}: The [<!DOCTYPE html>] declaration
+    - {b Element nodes}: HTML elements like [<div>], [<p>], [<a href="...">].
+      Elements are the building blocks of HTML documents. They can have
+      attributes and contain other nodes.
+
+    - {b Text nodes}: The actual text content within elements. For example,
+      in [<p>Hello</p>], "Hello" is a text node that is a child of the [<p>]
+      element.
+
+    - {b Comment nodes}: HTML comments written as [<!-- comment text -->].
+      Comments are preserved in the DOM but not rendered.
+
+    - {b Document nodes}: The root of the entire document tree. Every HTML
+      document has exactly one Document node at the top.
+
+    - {b Document fragment nodes}: Lightweight containers that hold a
+      collection of nodes without a parent. Used for efficient batch DOM
+      operations and [<template>] element contents.
+
+    - {b Doctype nodes}: The [<!DOCTYPE html>] declaration at the start of
+      HTML5 documents. This declaration tells browsers to render the page
+      in standards mode.
+
+    @see <https://html.spec.whatwg.org/multipage/dom.html#kinds-of-content>
+      WHATWG: Kinds of content

     {2 Namespaces}

-    Elements can belong to different namespaces:
-    - [None] or [Some "html"]: HTML namespace (default)
-    - [Some "svg"]: SVG namespace for embedded SVG content
-    - [Some "mathml"]: MathML namespace for mathematical notation
+    HTML5 can embed content from other XML vocabularies. Elements belong to
+    one of three {i namespaces}:
+
+    - {b HTML namespace} ([None] or implicit): Standard HTML elements like
+      [<div>], [<p>], [<table>]. This is the default for all elements.
+
+    - {b SVG namespace} ([Some "svg"]): Scalable Vector Graphics for drawing.
+      When the parser encounters an [<svg>] tag, all elements inside it
+      (like [<rect>], [<circle>], [<path>]) are placed in the SVG namespace.
+
+    - {b MathML namespace} ([Some "mathml"]): Mathematical Markup Language
+      for equations. When the parser encounters a [<math>] tag, elements
+      inside it are placed in the MathML namespace.
+
+    The parser automatically switches namespaces when entering and leaving
+    these foreign content islands.

-    The parser automatically switches namespaces when encountering [<svg>]
-    or [<math>] elements, as specified by the HTML5 algorithm.
+    @see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inforeign>
+      WHATWG: Parsing foreign content

     {2 Tree Structure}

     Nodes form a bidirectional tree: each node has a list of children and
-    an optional parent reference. Modification functions maintain these
-    references automatically.
+    an optional parent reference. Modification functions in this module
+    maintain these references automatically.
+
+    The tree is always well-formed: a node can only have one parent, and
+    circular references are not possible.
 *)

 (** {1 Types} *)

 (** Information associated with a DOCTYPE node.

-    In HTML5, the DOCTYPE is primarily used for quirks mode detection.
-    Most modern HTML5 documents use [<!DOCTYPE html>] which results in
-    all fields being [None] or the name being [Some "html"].
+    The {i document type declaration} (DOCTYPE) tells browsers what version
+    of HTML the document uses. In HTML5, the standard declaration is simply:
+
+    {v <!DOCTYPE html> v}
+
+    This minimal DOCTYPE triggers {i standards mode} (no quirks). The DOCTYPE
+    can optionally include a public identifier and system identifier for
+    legacy compatibility with SGML-based tools, but these are rarely used
+    in modern HTML5 documents.
+
+    {b Historical context:} In HTML4 and XHTML, DOCTYPEs were verbose and
+    referenced DTD files. For example:
+    {v <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
+       "http://www.w3.org/TR/html4/strict.dtd"> v}
+
+    HTML5 simplified this to just [<!DOCTYPE html>] because:
+    - Browsers never actually fetched or validated against DTDs
+    - The DOCTYPE's only real purpose is triggering standards mode
+    - A minimal DOCTYPE achieves this goal
+
+    {b Field meanings:}
+    - [name]: The document type name, almost always ["html"] for HTML documents
+    - [public_id]: A public identifier (legacy); [None] for HTML5
+    - [system_id]: A system identifier/URL (legacy); [None] for HTML5

+    @see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
+      WHATWG: The DOCTYPE
     @see <https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode>
-      The WHATWG specification for DOCTYPE handling
+      WHATWG: DOCTYPE handling during parsing
 *)
 type doctype_data = {
   name : string option; (** The DOCTYPE name, e.g., "html" *)
···

 (** Quirks mode setting for the document.

-    Quirks mode affects CSS layout behavior for backwards compatibility with
-    old web content. The HTML5 parser determines quirks mode based on the
-    DOCTYPE declaration.
+    {i Quirks mode} is a browser rendering mode that emulates bugs and
+    non-standard behaviors from older browsers (primarily Internet Explorer 5).
+    Modern HTML5 documents should always render in {i standards mode}
+    (no quirks) for consistent, predictable behavior.
+
+    The HTML5 parser determines quirks mode based on the DOCTYPE declaration:
+
+    - {b No_quirks} (Standards mode): The document renders according to modern
+      HTML5 and CSS specifications. This is triggered by [<!DOCTYPE html>].
+      CSS box model, table layout, and other features work as specified.

-    - [No_quirks]: Standards mode - full HTML5/CSS3 behavior
-    - [Quirks]: Full quirks mode - emulates legacy browser behavior
-    - [Limited_quirks]: Almost standards mode - limited quirks for specific cases
+    - {b Quirks} (Full quirks mode): The document renders with legacy browser
+      bugs emulated. This happens when:
+      {ul
+      {- DOCTYPE is missing entirely}
+      {- DOCTYPE has certain legacy public identifiers}
+      {- DOCTYPE has the wrong format}}
+
+      In quirks mode, many CSS properties behave differently:
+      {ul
+      {- Tables don't inherit font properties}
+      {- Box model uses non-standard width calculations}
+      {- Certain CSS selectors don't work correctly}}
+
+    - {b Limited_quirks} (Almost standards mode): A middle ground that applies
+      only a few specific quirks, primarily affecting table cell vertical
+      sizing. Triggered by XHTML DOCTYPEs and certain HTML4 DOCTYPEs.
+
+    {b Recommendation:} Always use [<!DOCTYPE html>] at the start of HTML5
+    documents to ensure {b No_quirks} mode.

-    @see <https://quirks.spec.whatwg.org/> The Quirks Mode specification
+    @see <https://quirks.spec.whatwg.org/>
+      Quirks Mode Standard - detailed specification
+    @see <https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode>
+      WHATWG: How the parser determines quirks mode
 *)
 type quirks_mode = No_quirks | Quirks | Limited_quirks

···

     All node types use the same record structure. The [name] field determines
     the node type:
-    - Element: the tag name (e.g., "div", "p")
+    - Element: the tag name (e.g., "div", "p", "span")
     - Text: "#text"
     - Comment: "#comment"
     - Document: "#document"
     - Document fragment: "#document-fragment"
     - Doctype: "!doctype"

-    {3 Field Usage by Node Type}
+    {3 Understanding Node Fields}
+
+    Different node types use different combinations of fields:

     {v
     Node Type         | name             | namespace | attrs | data | template_content | doctype
···
     Document Fragment | "#document-frag" | No        | No    | No   | No               | No
     Doctype           | "!doctype"       | No        | No    | No   | No               | Yes
     v}
+
+    {3 Element Tag Names}
+
+    For element nodes, the [name] field contains the lowercase tag name.
+    HTML5 defines many elements with specific meanings:
+
+    {b Structural elements:} [html], [head], [body], [header], [footer],
+    [main], [nav], [article], [section], [aside]
+
+    {b Text content:} [p], [div], [span], [h1]-[h6], [pre], [blockquote]
+
+    {b Lists:} [ul], [ol], [li], [dl], [dt], [dd]
+
+    {b Tables:} [table], [tr], [td], [th], [thead], [tbody], [tfoot]
+
+    {b Forms:} [form], [input], [button], [select], [textarea], [label]
+
+    {b Media:} [img], [audio], [video], [canvas], [svg]
+
+    @see <https://html.spec.whatwg.org/multipage/indices.html#elements-3>
+      WHATWG: Index of HTML elements
+
+    {3 Void Elements}
+
+    Some elements are {i void elements} - they cannot have children and have
+    no end tag. These include: [area], [base], [br], [col], [embed], [hr],
+    [img], [input], [link], [meta], [source], [track], [wbr].
+
+    @see <https://html.spec.whatwg.org/multipage/syntax.html#void-elements>
+      WHATWG: Void elements
+
+    {3 The Template Element}
+
+    The [<template>] element is special: its children are not rendered
+    directly but stored in a separate document fragment accessible via
+    the [template_content] field. Templates are used for client-side
+    templating where content is cloned and inserted via JavaScript.
+
+    @see <https://html.spec.whatwg.org/multipage/scripting.html#the-template-element>
+      WHATWG: The template element
 *)
 type node = {
   mutable name : string;
-  (** Tag name for elements, or special name for other node types *)
+  (** Tag name for elements, or special name for other node types.
+
+      For elements, this is the lowercase tag name (e.g., "div", "span").
+      For other node types, use the constants {!document_name},
+      {!text_name}, {!comment_name}, etc. *)

   mutable namespace : string option;
-  (** Element namespace: [None] for HTML, [Some "svg"], [Some "mathml"] *)
+  (** Element namespace: [None] for HTML, [Some "svg"], [Some "mathml"].
+
+      Most elements are in the HTML namespace ([None]). The SVG and MathML
+      namespaces are only used when content appears inside [<svg>] or
+      [<math>] elements respectively.
+
+      @see <https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom>
+        WHATWG: Elements in the DOM *)

   mutable attrs : (string * string) list;
-  (** Element attributes as (name, value) pairs *)
+  (** Element attributes as (name, value) pairs.
+
+      Attributes provide additional information about elements. Common
+      global attributes include:
+      - [id]: Unique identifier for the element
+      - [class]: Space-separated list of CSS class names
+      - [style]: Inline CSS styles
+      - [title]: Advisory text (shown as tooltip)
+      - [lang]: Language of the element's content
+      - [hidden]: Whether the element should be hidden
+
+      Element-specific attributes include:
+      - [href] on [<a>]: The link destination URL
+      - [src] on [<img>]: The image source URL
+      - [type] on [<input>]: The input control type
+      - [disabled] on form controls: Whether the control is disabled
+
+      In HTML5, attribute names are case-insensitive and are normalized
+      to lowercase by the parser.
+
+      @see <https://html.spec.whatwg.org/multipage/dom.html#global-attributes>
+        WHATWG: Global attributes
+      @see <https://html.spec.whatwg.org/multipage/indices.html#attributes-3>
+        WHATWG: Index of attributes *)

   mutable children : node list;
-  (** Child nodes in document order *)
+  (** Child nodes in document order.
+
+      For most elements, this list contains the nested elements and text.
+      For void elements (like [<br>], [<img>]), this is always empty.
+      For [<template>] elements, the actual content is in
+      [template_content], not here. *)

   mutable parent : node option;
-  (** Parent node, [None] for root nodes *)
+  (** Parent node, [None] for root nodes.
+
+      Every node except the Document node has a parent. This back-reference
+      enables traversing up the tree. *)

   mutable data : string;
-  (** Text content for text and comment nodes *)
+  (** Text content for text and comment nodes.
+
+      For text nodes, this contains the actual text. For comment nodes,
+      this contains the comment text (without the [<!--] and [-->]
+      delimiters). For other node types, this field is empty. *)

   mutable template_content : node option;
-  (** Document fragment for [<template>] element contents *)
+  (** Document fragment for [<template>] element contents.
+
+      The [<template>] element holds "inert" content that is not
+      rendered but can be cloned and inserted elsewhere. This field
+      contains a document fragment with the template's content.
+
+      For non-template elements, this is [None].
+
+      @see <https://html.spec.whatwg.org/multipage/scripting.html#the-template-element>
+        WHATWG: The template element *)

   mutable doctype : doctype_data option;
-  (** DOCTYPE information for doctype nodes *)
+  (** DOCTYPE information for doctype nodes.
+
+      Only doctype nodes use this field; for all other nodes it is [None]. *)
 }

 (** {1 Node Name Constants}
···
 *)

 val document_name : string
-(** ["#document"] - name for document nodes *)
+(** ["#document"] - name for document nodes.
+
+    The Document node is the root of every HTML document tree. It represents
+    the entire document and is the parent of the [<html>] element.
+
+    @see <https://html.spec.whatwg.org/multipage/dom.html#document>
+      WHATWG: The Document object *)

 val document_fragment_name : string
-(** ["#document-fragment"] - name for document fragment nodes *)
+(** ["#document-fragment"] - name for document fragment nodes.
+
+    Document fragments are lightweight container nodes used to hold a
+    collection of nodes without a parent document. They are used:
+    - To hold [<template>] element contents
+    - As results of fragment parsing (innerHTML)
+    - For efficient batch DOM operations
+
+    @see <https://dom.spec.whatwg.org/#documentfragment>
+      DOM Standard: DocumentFragment *)

 val text_name : string
-(** ["#text"] - name for text nodes *)
+(** ["#text"] - name for text nodes.
+
+    Text nodes contain the character data within elements. When the
+    parser encounters text between tags like [<p>Hello world</p>],
+    it creates a text node with data ["Hello world"] as a child of
+    the [<p>] element.
+
+    Adjacent text nodes are automatically merged by the parser. *)

 val comment_name : string
-(** ["#comment"] - name for comment nodes *)
+(** ["#comment"] - name for comment nodes.
+
+    Comment nodes represent HTML comments: [<!-- comment text -->].
+    Comments are preserved in the DOM but not rendered to users.
+    They're useful for development notes or conditional content. *)

 val doctype_name : string
-(** ["!doctype"] - name for doctype nodes *)
+(** ["!doctype"] - name for doctype nodes.
+
+    The DOCTYPE node represents the [<!DOCTYPE html>] declaration.
+    It is always the first child of the Document node (if present).
+
+    @see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
+      WHATWG: The DOCTYPE *)

 (** {1 Constructors}

     Functions to create new DOM nodes. All nodes start with no parent and
-    no children.
+    no children. Use {!append_child} or {!insert_before} to build a tree.
 *)

 val create_element : string -> ?namespace:string option ->
   ?attrs:(string * string) list -> unit -> node
 (** Create an element node.

-    @param name The tag name (e.g., "div", "p", "span")
-    @param namespace Element namespace: [None] for HTML, [Some "svg"], [Some "mathml"]
-    @param attrs Initial attributes as (name, value) pairs
+    Elements are the primary building blocks of HTML documents. Each
+    element represents a component of the document with semantic meaning.
+
+    @param name The tag name (e.g., "div", "p", "span"). Tag names are
+      case-insensitive in HTML; by convention, use lowercase.
+    @param namespace Element namespace:
+      - [None] (default): HTML namespace for standard elements
+      - [Some "svg"]: SVG namespace for graphics elements
+      - [Some "mathml"]: MathML namespace for mathematical notation
+    @param attrs Initial attributes as [(name, value)] pairs

+    {b Examples:}
     {[
+      (* Simple HTML element *)
       let div = create_element "div" ()
-      let svg = create_element "rect" ~namespace:(Some "svg") ()
-      let link = create_element "a" ~attrs:[("href", "/")] ()
+
+      (* Element with attributes *)
+      let link = create_element "a"
+        ~attrs:[("href", "https://example.com"); ("class", "external")]
+        ()
+
+      (* SVG element *)
+      let rect = create_element "rect"
+        ~namespace:(Some "svg")
+        ~attrs:[("width", "100"); ("height", "50"); ("fill", "blue")]
+        ()
     ]}
+
+    @see <https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom>
+      WHATWG: Elements in the DOM
 *)

 val create_text : string -> node
 (** Create a text node with the given content.

+    Text nodes contain the readable content of HTML documents. They
+    appear as children of elements and represent the characters that
+    users see.
+
+    {b Note:} Text content is stored as-is. Character references like
+    [&amp;] should already be decoded to their character values.
+
+    {b Example:}
     {[
       let text = create_text "Hello, world!"
+      (* To put text in a paragraph: *)
+      let p = create_element "p" ()
+      let () = append_child p text
     ]}
 *)

 val create_comment : string -> node
 (** Create a comment node with the given content.

-    The content should not include the comment delimiters.
+    Comments are human-readable notes in HTML that don't appear in
+    the rendered output. They're written as [<!-- comment -->] in HTML.

+    @param data The comment text (without the [<!--] and [-->] delimiters)
+
+    {b Example:}
     {[
-      let comment = create_comment " This is a comment "
-      (* Represents: <!-- This is a comment --> *)
+      let comment = create_comment " TODO: Add navigation "
+      (* Represents: <!-- TODO: Add navigation --> *)
     ]}
+
+    @see <https://html.spec.whatwg.org/multipage/syntax.html#comments>
+      WHATWG: HTML comments
 *)

 val create_document : unit -> node
 (** Create an empty document node.

-    Document nodes are the root of a complete HTML document tree.
+    The Document node is the root of an HTML document tree. It represents
+    the entire document and serves as the parent for the DOCTYPE (if any)
+    and the root [<html>] element.
+
+    In a complete HTML document, the structure is:
+    {v
+    #document
+    ├── !doctype
+    └── html
+        ├── head
+        └── body
+    v}
+
+    @see <https://html.spec.whatwg.org/multipage/dom.html#document>
+      WHATWG: The Document object
 *)

 val create_document_fragment : unit -> node
 (** Create an empty document fragment.

-    Document fragments are lightweight containers used for:
-    - Template contents
-    - Fragment parsing results
-    - Efficient batch DOM operations
+    Document fragments are lightweight containers that can hold multiple
+    nodes without being part of the main document tree. They're useful for:
+
+    - {b Template contents:} The [<template>] element stores its children
+      in a document fragment, keeping them inert until cloned
+
+    - {b Fragment parsing:} When parsing HTML fragments (like innerHTML),
+      the result is placed in a document fragment
+
+    - {b Batch operations:} Build a subtree in a fragment, then insert it
+      into the document in one operation for better performance
+
+    @see <https://dom.spec.whatwg.org/#documentfragment>
+      DOM Standard: DocumentFragment
 *)

 val create_doctype : ?name:string -> ?public_id:string ->
   ?system_id:string -> unit -> node
 (** Create a DOCTYPE node.

-    For HTML5, use [create_doctype ~name:"html" ()] which produces
-    [<!DOCTYPE html>].
+    The DOCTYPE declaration tells browsers to use standards mode for
+    rendering. For HTML5 documents, use:
+
+    {[
+      let doctype = create_doctype ~name:"html" ()
+      (* Represents: <!DOCTYPE html> *)
+    ]}
+
+    @param name DOCTYPE name (usually ["html"] for HTML documents)
+    @param public_id Public identifier (legacy, rarely needed)
+    @param system_id System identifier (legacy, rarely needed)
+
+    {b Legacy example:}
+    {[
+      (* HTML 4.01 Strict DOCTYPE - not recommended for new documents *)
+      let legacy = create_doctype
+        ~name:"HTML"
+        ~public_id:"-//W3C//DTD HTML 4.01//EN"
+        ~system_id:"http://www.w3.org/TR/html4/strict.dtd"
+        ()
+    ]}

-    @param name DOCTYPE name (usually "html")
-    @param public_id Public identifier (legacy)
-    @param system_id System identifier (legacy)
+    @see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
+      WHATWG: The DOCTYPE
 *)

 val create_template : ?namespace:string option ->
   ?attrs:(string * string) list -> unit -> node
 (** Create a [<template>] element with its content document fragment.

-    Template elements have special semantics: their children are not rendered
-    directly but stored in a separate document fragment accessible via
-    [template_content].
+    The [<template>] element holds inert HTML content that is not
+    rendered directly. The content is stored in a separate document
+    fragment and can be:
+    - Cloned and inserted into the document via JavaScript
+    - Used as a stamping template for repeated content
+    - Pre-parsed without affecting the page
+
+    {b How templates work:}
+
+    Unlike normal elements, a [<template>]'s children are not rendered.
+    Instead, they're stored in the [template_content] field. This means:
+    - Images inside won't load
+    - Scripts inside won't execute
+    - The content is "inert" until explicitly activated
+
+    {b Example:}
+    {[
+      let template = create_template () in
+      let div = create_element "div" () in
+      let text = create_text "Template content" in
+      append_child div text;
+      (* Add to template's content fragment, not children *)
+      match template.template_content with
+      | Some fragment -> append_child fragment div
+      | None -> ()
+    ]}

     @see <https://html.spec.whatwg.org/multipage/scripting.html#the-template-element>
-      The HTML5 template element specification
+      WHATWG: The template element
 *)

 (** {1 Node Type Predicates}

-    Functions to test what type of node you have.
+    Functions to test what type of node you have. Since all nodes use the
+    same record type, these predicates check the [name] field to determine
+    the actual node type.
 *)

 val is_element : node -> bool
 (** [is_element node] returns [true] if the node is an element node.

-    Elements are nodes with HTML tags like [<div>], [<p>], etc.
+    Elements are HTML tags like [<div>], [<p>], [<a>]. They are
+    identified by having a tag name that doesn't match any of the
+    special node name constants.
 *)

 val is_text : node -> bool
-(** [is_text node] returns [true] if the node is a text node. *)
+(** [is_text node] returns [true] if the node is a text node.
+
+    Text nodes contain the character content within elements.
+    They have [name = "#text"]. *)

 val is_comment : node -> bool
-(** [is_comment node] returns [true] if the node is a comment node. *)
+(** [is_comment node] returns [true] if the node is a comment node.
+
+    Comment nodes represent HTML comments [<!-- ... -->].
+    They have [name = "#comment"]. *)

 val is_document : node -> bool
-(** [is_document node] returns [true] if the node is a document node. *)
+(** [is_document node] returns [true] if the node is a document node.
+
+    The document node is the root of the DOM tree.
+    It has [name = "#document"]. *)

 val is_document_fragment : node -> bool
-(** [is_document_fragment node] returns [true] if the node is a document fragment. *)
+(** [is_document_fragment node] returns [true] if the node is a document fragment.
+
+    Document fragments are lightweight containers.
+    They have [name = "#document-fragment"]. *)

 val is_doctype : node -> bool
-(** [is_doctype node] returns [true] if the node is a DOCTYPE node. *)
+(** [is_doctype node] returns [true] if the node is a DOCTYPE node.
+
+    DOCTYPE nodes represent the [<!DOCTYPE>] declaration.
+    They have [name = "!doctype"]. *)

 val has_children : node -> bool
-(** [has_children node] returns [true] if the node has any children. *)
+(** [has_children node] returns [true] if the node has any children.
+
+    Note: For [<template>] elements, this checks the direct children list,
+    not the template content fragment. *)

 (** {1 Tree Manipulation}

     Functions to modify the DOM tree structure. These functions automatically
-    maintain parent/child references.
+    maintain parent/child references, ensuring the tree remains consistent.
 *)

 val append_child : node -> node -> unit
 (** [append_child parent child] adds [child] as the last child of [parent].

     The child's parent reference is updated to point to [parent].
+    If the child already has a parent, it is first removed from that parent.
+
+    {b Example:}
+    {[
+      let body = create_element "body" () in
+      let p = create_element "p" () in
+      let text = create_text "Hello!" in
+      append_child p text;
+      append_child body p
+      (* Result:
+         body
+         └── p
+             └── #text "Hello!"
+      *)
+    ]}
 *)

 val insert_before : node -> node -> node -> unit
 (** [insert_before parent new_child ref_child] inserts [new_child] before
     [ref_child] in [parent]'s children.

-    @raise Not_found if [ref_child] is not a child of [parent]
+    @param parent The parent node
+    @param new_child The node to insert
+    @param ref_child The existing child to insert before
+
+    Raises [Not_found] if [ref_child] is not a child of [parent].
+
+    {b Example:}
+    {[
+      let ul = create_element "ul" () in
+      let li1 = create_element "li" () in
+      let li3 = create_element "li" () in
+      append_child ul li1;
+      append_child ul li3;
+      let li2 = create_element "li" () in
+      insert_before ul li2 li3
+      (* Result: ul contains li1, li2, li3 in that order *)
+    ]}
 *)

 val remove_child : node -> node -> unit
 (** [remove_child parent child] removes [child] from [parent]'s children.

     The child's parent reference is set to [None].
+
+    Raises [Not_found] if [child] is not a child of [parent].
 *)

 val insert_text_at : node -> string -> node option -> unit
 (** [insert_text_at parent text before_node] inserts text content.

     If [before_node] is [None], appends at the end. If the previous sibling
-    is a text node, the text is merged into it. Otherwise, a new text node
-    is created.
+    is a text node, the text is merged into it (text nodes are coalesced).
+    Otherwise, a new text node is created.

     This implements the HTML5 parser's text insertion algorithm which
-    coalesces adjacent text nodes.
+    ensures adjacent text nodes are always merged, matching browser behavior.
+
+    @see <https://html.spec.whatwg.org/multipage/parsing.html#appropriate-place-for-inserting-a-node>
+      WHATWG: Inserting text in the DOM
 *)

 (** {1 Attribute Operations}

-    Functions to read and modify element attributes.
+    Functions to read and modify element attributes. Attributes are
+    name-value pairs that provide additional information about elements.
+
+    In HTML5, attribute names are case-insensitive and normalized to
+    lowercase by the parser.
+
+    @see <https://html.spec.whatwg.org/multipage/dom.html#attributes>
+      WHATWG: Attributes
 *)

 val get_attr : node -> string -> string option
-(** [get_attr node name] returns the value of attribute [name], or [None]. *)
+(** [get_attr node name] returns the value of attribute [name], or [None]
+    if the attribute doesn't exist.
+
+    Attribute lookup is case-sensitive on the stored (lowercase) names.
+*)

 val set_attr : node -> string -> string -> unit
 (** [set_attr node name value] sets attribute [name] to [value].

     If the attribute already exists, it is replaced.
+    If it doesn't exist, it is added.
 *)

 val has_attr : node -> string -> bool
···
 (** [descendants node] returns all descendant nodes in document order.

     This performs a depth-first traversal, returning children before
-    siblings at each level.
+    siblings at each level. The node itself is not included.
+
+    {b Document order} is the order nodes appear in the HTML source:
+    parent before children, earlier siblings before later ones.
727 + 728 + {b Example:} 729 + {[ 730 + (* For tree: div > (p > "hello", span > "world") *) 731 + descendants div 732 + (* Returns: [p; text("hello"); span; text("world")] *) 733 + ]} 314 734 *) 315 735 316 736 val ancestors : node -> node list 317 737 (** [ancestors node] returns all ancestor nodes from parent to root. 318 738 319 - The first element is the immediate parent, the last is the root. 739 + The first element is the immediate parent, the last is the root 740 + (usually the Document node). 741 + 742 + {b Example:} 743 + {[ 744 + (* For a text node inside: html > body > p > text *) 745 + ancestors text_node 746 + (* Returns: [p; body; html; #document] *) 747 + ]} 320 748 *) 321 749 322 750 val get_text_content : node -> string 323 751 (** [get_text_content node] returns the concatenated text content. 324 752 325 - For text nodes, returns the text data. For elements, recursively 326 - concatenates all descendant text content. 753 + For text nodes, returns the text data directly. 754 + For elements, recursively concatenates all descendant text content. 755 + For other node types, returns an empty string. 756 + 757 + {b Example:} 758 + {[ 759 + (* For: <p>Hello <b>world</b>!</p> *) 760 + get_text_content p_element 761 + (* Returns: "Hello world!" *) 762 + ]} 327 763 *) 328 764 329 765 (** {1 Cloning} *) ··· 333 769 334 770 @param deep If [true], recursively clone all descendants (default: [false]) 335 771 336 - The cloned node has no parent. Attribute lists are copied by reference 337 - (the list itself is new, but attribute strings are shared). 772 + The cloned node has no parent. With [deep:false], only the node itself 773 + is copied (with its attributes, but not its children). 774 + 775 + {b Example:} 776 + {[ 777 + let original = create_element "div" ~attrs:[("class", "box")] () in 778 + let shallow = clone original in 779 + let deep = clone ~deep:true original 780 + ]} 338 781 *)
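The shallow versus deep behaviour documented for [clone] above can be sketched with a toy node type. This is a minimal illustration, not the library's code: the record below and its helper functions are hypothetical stand-ins for Html5rw's real node representation, which this diff does not show.

```ocaml
(* Toy model only: a hypothetical stand-in for the library's node type. *)
type node = {
  name : string;
  attrs : (string * string) list;
  mutable children : node list;
  mutable parent : node option;
}

let create_element ?(attrs = []) name =
  { name; attrs; children = []; parent = None }

(* Append [child] as the last child of [parent] and set its parent link. *)
let append_child parent child =
  child.parent <- Some parent;
  parent.children <- parent.children @ [ child ]

(* A shallow clone copies the node and its attributes only.
   A deep clone also copies every descendant. The copy never has a parent. *)
let rec clone ?(deep = false) n =
  let copy = { name = n.name; attrs = n.attrs; children = []; parent = None } in
  if deep then
    List.iter (fun c -> append_child copy (clone ~deep:true c)) n.children;
  copy

let () =
  let original = create_element ~attrs:[ ("class", "box") ] "div" in
  append_child original (create_element "p");
  let shallow = clone original in
  let deep = clone ~deep:true original in
  (* prints "0 1": the shallow copy has no children, the deep copy has one *)
  Printf.printf "%d %d\n"
    (List.length shallow.children)
    (List.length deep.children)
```

In this sketch both copies keep the [class] attribute and neither has a parent, which matches the documented contract; whether the real implementation shares or copies attribute strings is not something the sketch settles.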
+666 -107
lib/html5rw/html5rw.mli
··· 5 5 6 6 (** Html5rw - Pure OCaml HTML5 Parser 7 7 8 - This module provides a complete HTML5 parsing solution following the 9 - WHATWG specification. It uses bytesrw for streaming input/output. 8 + This library provides a complete HTML5 parsing solution that implements the 9 + {{:https://html.spec.whatwg.org/multipage/parsing.html} WHATWG HTML5 10 + parsing specification}. It can parse any HTML document - well-formed or not - 11 + and produce a DOM (Document Object Model) tree that matches browser behavior. 12 + 13 + {2 What is HTML?} 14 + 15 + HTML (HyperText Markup Language) is the standard markup language for creating 16 + web pages. An HTML document consists of nested {i elements} that describe 17 + the structure and content of the page: 18 + 19 + {v 20 + <!DOCTYPE html> 21 + <html> 22 + <head> 23 + <title>My Page</title> 24 + </head> 25 + <body> 26 + <h1>Welcome</h1> 27 + <p>Hello, <b>world</b>!</p> 28 + </body> 29 + </html> 30 + v} 31 + 32 + Each element is written with a {i start tag} (like [<p>]), content, and an 33 + {i end tag} (like [</p>]). Elements can have {i attributes} that provide 34 + additional information: [<a href="https://example.com">]. 35 + 36 + @see <https://html.spec.whatwg.org/multipage/introduction.html> 37 + WHATWG: Introduction to HTML 38 + 39 + {2 The DOM} 40 + 41 + When this parser processes HTML, it doesn't just store the text. Instead, 42 + it builds a tree structure called the DOM (Document Object Model). Each 43 + element, text fragment, and comment becomes a {i node} in this tree: 44 + 45 + {v 46 + Document 47 + └── html 48 + ├── head 49 + │ └── title 50 + │ └── #text "My Page" 51 + └── body 52 + ├── h1 53 + │ └── #text "Welcome" 54 + └── p 55 + ├── #text "Hello, " 56 + ├── b 57 + │ └── #text "world" 58 + └── #text "!" 59 + v} 60 + 61 + This tree can be traversed, searched, and modified. The {!Dom} module 62 + provides types and functions for working with DOM nodes. 
63 + 64 + @see <https://html.spec.whatwg.org/multipage/dom.html> 65 + WHATWG: The elements of HTML (DOM chapter) 10 66 11 67 {2 Quick Start} 12 68 13 - Parse HTML from a reader: 69 + Parse HTML from a string: 14 70 {[ 15 71 open Bytesrw 16 72 let reader = Bytes.Reader.of_string "<p>Hello, world!</p>" in ··· 32 88 let result = Html5rw.parse reader in 33 89 let divs = Html5rw.query result "div.content" 34 90 ]} 91 + 92 + {2 Error Handling} 93 + 94 + Unlike many parsers, HTML5 parsing {b never fails}. The WHATWG specification 95 + defines error recovery rules for every possible malformed input, ensuring 96 + all HTML documents produce a valid DOM tree (just as browsers do). 97 + 98 + For example, parsing [<p>Hello<p>World] produces two paragraphs, not an 99 + error, because [<p>] implicitly closes the previous [<p>]. 100 + 101 + If you need to detect malformed HTML (e.g., for validation), enable error 102 + collection with [~collect_errors:true]. Errors are advisory - the parsing 103 + still succeeds. 104 + 105 + @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors> 106 + WHATWG: Parse errors 107 + 108 + {2 HTML vs XHTML} 109 + 110 + This parser implements {b HTML5 parsing}, not XHTML parsing. Key differences: 111 + 112 + - Tag and attribute names are case-insensitive ([<DIV>] equals [<div>]) 113 + - Some end tags are optional ([<p>Hello] is valid) 114 + - Void elements have no end tag ([<br>]; [<br/>] is also accepted, with the slash ignored) 115 + - Boolean attributes need no value ([<input disabled>]) 116 + 117 + XHTML uses stricter XML rules. If you need XHTML parsing, use an XML parser. 118 + 119 + @see <https://html.spec.whatwg.org/multipage/syntax.html> 120 + WHATWG: The HTML syntax 35 121 *) 36 122 37 123 (** {1 Sub-modules} *) 38 124 39 - (** DOM types and manipulation functions *) 125 + (** DOM types and manipulation functions. 126 + 127 + This module provides the core types for representing HTML documents as 128 + DOM trees.
It includes: 129 + - The {!Dom.node} type representing all kinds of DOM nodes 130 + - Functions to create, modify, and traverse nodes 131 + - Serialization functions to convert DOM back to HTML 132 + 133 + @see <https://html.spec.whatwg.org/multipage/dom.html> 134 + WHATWG: The elements of HTML *) 40 135 module Dom = Html5rw_dom 41 136 42 - (** HTML5 tokenizer *) 137 + (** HTML5 tokenizer. 138 + 139 + The tokenizer is the first stage of HTML5 parsing. It converts a stream 140 + of characters into a stream of {i tokens}: start tags, end tags, text, 141 + comments, and DOCTYPEs. 142 + 143 + Most users don't need to use the tokenizer directly - the {!parse} 144 + function handles everything. The tokenizer is exposed for advanced use 145 + cases like syntax highlighting or partial parsing. 146 + 147 + @see <https://html.spec.whatwg.org/multipage/parsing.html#tokenization> 148 + WHATWG: Tokenization *) 43 149 module Tokenizer = Html5rw_tokenizer 44 150 45 - (** Encoding detection and decoding *) 151 + (** Encoding detection and decoding. 152 + 153 + HTML documents can use various character encodings (UTF-8, ISO-8859-1, 154 + etc.). This module implements the WHATWG encoding sniffing algorithm 155 + that browsers use to detect the encoding of a document: 156 + 157 + 1. Check for a BOM (Byte Order Mark) 158 + 2. Look for a [<meta charset>] declaration 159 + 3. Use HTTP Content-Type header hint (if available) 160 + 4. Fall back to UTF-8 161 + 162 + @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding> 163 + WHATWG: Determining the character encoding 164 + @see <https://encoding.spec.whatwg.org/> 165 + WHATWG Encoding Standard *) 46 166 module Encoding = Html5rw_encoding 47 167 48 - (** CSS selector engine *) 168 + (** CSS selector engine. 169 + 170 + This module provides CSS selector support for querying the DOM tree. 
171 + CSS selectors are patterns used to select HTML elements based on their 172 + tag names, attributes, classes, IDs, and position in the document. 173 + 174 + Example selectors: 175 + - [div] - all [<div>] elements 176 + - [#header] - element with [id="header"] 177 + - [.warning] - elements with [class="warning"] 178 + - [div > p] - [<p>] elements that are direct children of [<div>] 179 + - [[href]] - elements with an [href] attribute 180 + 181 + @see <https://www.w3.org/TR/selectors-4/> 182 + W3C Selectors Level 4 specification *) 49 183 module Selector = Html5rw_selector 50 184 51 - (** HTML entity decoding *) 185 + (** HTML entity decoding. 186 + 187 + HTML uses {i character references} to represent characters that are 188 + hard to type or have special meaning: 189 + 190 + - Named references: [&amp;] (ampersand), [&lt;] (less than), [&nbsp;] (non-breaking space) 191 + - Decimal references: [&#60;] (less than as decimal 60) 192 + - Hexadecimal references: [&#x3C;] (less than as hex 3C) 193 + 194 + This module decodes all 2,231 named character references defined in 195 + the WHATWG specification, plus numeric references. 196 + 197 + @see <https://html.spec.whatwg.org/multipage/named-characters.html> 198 + WHATWG: Named character references *) 52 199 module Entities = Html5rw_entities 53 200 54 - (** Low-level parser access *) 201 + (** Low-level parser access. 202 + 203 + This module exposes the internals of the HTML5 parser for advanced use. 204 + Most users should use the top-level {!parse} function instead. 205 + 206 + The parser exposes: 207 + - Insertion modes for the tree construction algorithm 208 + - The tree builder state machine 209 + - Lower-level parsing functions 210 + 211 + @see <https://html.spec.whatwg.org/multipage/parsing.html#tree-construction> 212 + WHATWG: Tree construction *) 55 213 module Parser = Html5rw_parser 56 214 57 215 (** {1 Core Types} *) 58 216 59 - (** DOM node type. See {!Dom} for manipulation functions. 
*) 217 + (** DOM node type. 218 + 219 + A node represents one part of an HTML document. Nodes form a tree 220 + structure with parent/child relationships. There are several kinds: 221 + 222 + - {b Element nodes}: HTML tags like [<div>], [<p>], [<a>] 223 + - {b Text nodes}: Text content within elements 224 + - {b Comment nodes}: HTML comments [<!-- ... -->] 225 + - {b Document nodes}: The root of a document tree 226 + - {b Document fragment nodes}: Lightweight containers 227 + - {b Doctype nodes}: The [<!DOCTYPE html>] declaration 228 + 229 + See {!Dom} for manipulation functions. 230 + 231 + @see <https://html.spec.whatwg.org/multipage/dom.html> 232 + WHATWG: The DOM *) 60 233 type node = Dom.node 61 234 62 - (** Doctype information *) 235 + (** DOCTYPE information. 236 + 237 + The DOCTYPE declaration ([<!DOCTYPE html>]) appears at the start of HTML 238 + documents. It tells browsers to use standards mode for rendering. 239 + 240 + In HTML5, the DOCTYPE is minimal - just [<!DOCTYPE html>] with no public 241 + or system identifiers. Legacy DOCTYPEs may have additional fields. 242 + 243 + @see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype> 244 + WHATWG: The DOCTYPE *) 63 245 type doctype_data = Dom.doctype_data = { 64 246 name : string option; 247 + (** DOCTYPE name, typically ["html"] *) 248 + 65 249 public_id : string option; 250 + (** Public identifier for legacy DOCTYPEs (e.g., XHTML, HTML4) *) 251 + 66 252 system_id : string option; 253 + (** System identifier (URL) for legacy DOCTYPEs *) 67 254 } 68 255 69 - (** Quirks mode as determined during parsing *) 256 + (** Quirks mode as determined during parsing. 257 + 258 + {i Quirks mode} controls how browsers render CSS and compute layouts. 259 + It exists for backwards compatibility with old web pages that relied 260 + on browser bugs. 261 + 262 + - {b No_quirks}: Standards mode. The document is rendered according to 263 + modern HTML5 and CSS specifications. Triggered by [<!DOCTYPE html>]. 
264 + 265 + - {b Quirks}: Full quirks mode. The browser emulates bugs from older 266 + browsers (primarily IE5). Triggered by missing or malformed DOCTYPEs. 267 + Affects CSS box model, table layout, font inheritance, and more. 268 + 269 + - {b Limited_quirks}: Almost standards mode. Only a few specific quirks 270 + are applied, mainly affecting table cell vertical alignment. 271 + 272 + {b Recommendation:} Always use [<!DOCTYPE html>] to ensure standards mode. 273 + 274 + @see <https://quirks.spec.whatwg.org/> 275 + Quirks Mode Standard 276 + @see <https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode> 277 + WHATWG: How quirks mode is determined *) 70 278 type quirks_mode = Dom.quirks_mode = No_quirks | Quirks | Limited_quirks 71 279 72 - (** Character encoding detected or specified *) 280 + (** Character encoding detected or specified. 281 + 282 + HTML documents are sequences of bytes that must be decoded into characters. 283 + Different encodings interpret the same bytes differently. For example: 284 + 285 + - UTF-8: The modern standard, supporting all Unicode characters 286 + - Windows-1252: Common on older Western European web pages 287 + - ISO-8859-2: Used for Central European languages 288 + - UTF-16: Used by some Windows applications 289 + 290 + The parser detects encoding automatically when using {!parse_bytes}. 291 + The detected encoding is available via {!val-encoding}. 
292 + 293 + @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding> 294 + WHATWG: Determining the character encoding 295 + @see <https://encoding.spec.whatwg.org/> 296 + WHATWG Encoding Standard *) 73 297 type encoding = Encoding.encoding = 74 298 | Utf8 299 + (** UTF-8: The dominant encoding for the web, supporting all Unicode *) 300 + 75 301 | Utf16le 302 + (** UTF-16 Little-Endian: 16-bit encoding, used by Windows *) 303 + 76 304 | Utf16be 305 + (** UTF-16 Big-Endian: 16-bit encoding, network byte order *) 306 + 77 307 | Windows_1252 308 + (** Windows-1252 (CP-1252): Western European, superset of ISO-8859-1 *) 309 + 78 310 | Iso_8859_2 311 + (** ISO-8859-2: Central European (Polish, Czech, Hungarian, etc.) *) 312 + 79 313 | Euc_jp 314 + (** EUC-JP: Extended Unix Code for Japanese *) 80 315 81 316 (** A parse error encountered during HTML5 parsing. 82 317 83 - HTML5 parsing never fails - the specification defines error recovery 84 - for all malformed input. However, conformance checkers can report 85 - these errors. Enable error collection with [~collect_errors:true]. 318 + HTML5 parsing {b never fails} - the specification defines error recovery 319 + for all malformed input. However, conformance checkers can report these 320 + errors. Enable error collection with [~collect_errors:true] if you want 321 + to detect malformed HTML. 322 + 323 + {b Common parse errors:} 324 + 325 + - ["unexpected-null-character"]: Null byte in the input 326 + - ["eof-before-tag-name"]: File ended while reading a tag 327 + - ["unexpected-character-in-attribute-name"]: Invalid attribute syntax 328 + - ["missing-doctype"]: Document started without [<!DOCTYPE>] 329 + - ["duplicate-attribute"]: Same attribute appears twice on an element 330 + 331 + The full list of parse error codes is defined in the WHATWG specification. 
86 332 87 333 @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors> 88 - WHATWG parse error definitions 89 - *) 334 + WHATWG: Complete list of parse errors *) 90 335 type parse_error = Parser.parse_error 91 336 92 - (** Get the error code (e.g., "unexpected-null-character"). *) 337 + (** Get the error code string. 338 + 339 + Error codes are lowercase with hyphens, matching the WHATWG specification 340 + names. Examples: ["unexpected-null-character"], ["eof-in-tag"], 341 + ["missing-end-tag-name"]. 342 + 343 + @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors> 344 + WHATWG: Parse error codes *) 93 345 val error_code : parse_error -> string 94 346 95 - (** Get the line number where the error occurred (1-indexed). *) 347 + (** Get the line number where the error occurred (1-indexed). 348 + 349 + Line numbers count from 1 and increment at each newline character. *) 96 350 val error_line : parse_error -> int 97 351 98 - (** Get the column number where the error occurred (1-indexed). *) 352 + (** Get the column number where the error occurred (1-indexed). 353 + 354 + Column numbers count from 1 and reset at each newline. *) 99 355 val error_column : parse_error -> int 100 356 101 357 (** Context element for HTML fragment parsing (innerHTML). 102 358 103 - When parsing HTML fragments, you must specify what element would 104 - contain the fragment. This affects how certain elements are handled. 359 + When parsing HTML fragments (like the [innerHTML] of an element), you 360 + must specify what element would contain the fragment. This affects how 361 + the parser handles certain elements. 362 + 363 + {b Why context matters:} 364 + 365 + HTML parsing rules depend on where content appears. 
For example: 366 + - [<td>] is kept inside a [<tr>] context but ignored (as a parse error) inside a [<div>] context 367 + - [<li>] belongs inside [<ul>] or [<ol>]; elsewhere it is still inserted, and no implied list is created 368 + - Content inside [<table>] has special parsing rules 369 + 370 + {b Example:} 371 + {[ 372 + (* Parse as if content were inside a <ul> *) 373 + let ctx = make_fragment_context ~tag_name:"ul" () in 374 + let result = parse ~fragment_context:ctx reader 375 + (* Now <li> elements are parsed correctly *) 376 + ]} 105 377 106 378 @see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments> 107 - The fragment parsing algorithm 108 - *) 379 + WHATWG: The fragment parsing algorithm *) 109 380 type fragment_context = Parser.fragment_context 110 381 111 382 (** Create a fragment parsing context. 112 383 113 - @param tag_name Tag name of the context element (e.g., "div", "tr") 114 - @param namespace Namespace: [None] for HTML, [Some "svg"], [Some "mathml"] 384 + The context element determines how the parser interprets the fragment. 385 + Choose a context that matches where the fragment would be inserted. 115 386 387 + @param tag_name Tag name of the context element (e.g., ["div"], ["tr"], 388 + ["ul"]). This is the element that would contain the fragment.
389 + @param namespace Namespace of the context element: 390 + - [None] (default): HTML namespace 391 + - [Some "svg"]: SVG namespace 392 + - [Some "mathml"]: MathML namespace 393 + 394 + {b Examples:} 116 395 {[ 117 - (* Parse as innerHTML of a <ul> *) 118 - let ctx = Html5rw.make_fragment_context ~tag_name:"ul" () 396 + (* Parse as innerHTML of a <div> (most common case) *) 397 + let ctx = make_fragment_context ~tag_name:"div" () 398 + 399 + (* Parse as innerHTML of a <ul> - <li> elements work correctly *) 400 + let ctx = make_fragment_context ~tag_name:"ul" () 119 401 120 402 (* Parse as innerHTML of an SVG <g> element *) 121 - let ctx = Html5rw.make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") () 403 + let ctx = make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") () 404 + 405 + (* Parse as innerHTML of a <table> - table-specific rules apply *) 406 + let ctx = make_fragment_context ~tag_name:"table" () 122 407 ]} 123 - *) 408 + 409 + @see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments> 410 + WHATWG: Fragment parsing algorithm *) 124 411 val make_fragment_context : tag_name:string -> ?namespace:string option -> 125 412 unit -> fragment_context 126 413 ··· 132 419 133 420 (** Result of parsing an HTML document. 134 421 135 - Contains the parsed DOM tree, any errors encountered, and the 136 - detected encoding (when parsing from bytes). 422 + This record contains everything produced by parsing: 423 + - The DOM tree (accessible via {!val-root}) 424 + - Any parse errors (accessible via {!val-errors}) 425 + - The detected encoding (accessible via {!val-encoding}) 137 426 *) 138 427 type t = { 139 428 root : node; 429 + (** Root node of the parsed document tree. 430 + 431 + For full document parsing, this is a Document node containing the 432 + DOCTYPE (if any) and [<html>] element. 433 + 434 + For fragment parsing, this is a Document Fragment containing the 435 + parsed elements. 
*) 436 + 140 437 errors : parse_error list; 438 + (** Parse errors encountered during parsing. 439 + 440 + This list is empty unless [~collect_errors:true] was passed to the 441 + parse function. Errors are in the order they were encountered. 442 + 443 + @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors> 444 + WHATWG: Parse errors *) 445 + 141 446 encoding : encoding option; 447 + (** Character encoding detected during parsing. 448 + 449 + This is [Some encoding] when using {!parse_bytes} with automatic 450 + encoding detection, and [None] when using {!parse} (which expects 451 + pre-decoded UTF-8 input). *) 142 452 } 143 453 144 454 (** {1 Parsing Functions} *) 145 455 146 456 (** Parse HTML from a [Bytes.Reader.t]. 147 457 148 - This is the primary parsing function. Create a reader from any source: 149 - - [Bytes.Reader.of_string s] for strings 150 - - [Bytes.Reader.of_in_channel ic] for files 151 - - [Bytes.Reader.of_bytes b] for byte buffers 458 + This is the primary parsing function. It reads bytes from the provided 459 + reader and returns a DOM tree. The input should be valid UTF-8. 
152 460 461 + {b Creating readers:} 153 462 {[ 154 463 open Bytesrw 155 - let reader = Bytes.Reader.of_string "<html><body>Hello</body></html>" in 464 + 465 + (* From a string *) 466 + let reader = Bytes.Reader.of_string html_string 467 + 468 + (* From a file, opened in binary mode *) 469 + let ic = open_in_bin "page.html" in 470 + let reader = Bytes.Reader.of_in_channel ic 471 + 472 + (* From a byte buffer *) 473 + let reader = Bytes.Reader.of_bytes b 474 + ]} 475 + 476 + {b Parsing a complete document:} 477 + {[ 156 478 let result = Html5rw.parse reader 479 + let doc = Html5rw.root result 157 480 ]} 158 481 159 - @param collect_errors If true, collect parse errors (default: false) 160 - @param fragment_context Context element for fragment parsing 161 - *) 162 - val parse : ?collect_errors:bool -> ?fragment_context:fragment_context -> Bytesrw.Bytes.Reader.t -> t 482 + {b Parsing a fragment:} 483 + {[ 484 + let ctx = Html5rw.make_fragment_context ~tag_name:"div" () in 485 + let result = Html5rw.parse ~fragment_context:ctx reader 486 + ]} 487 + 488 + @param collect_errors If [true], collect parse errors. Default: [false]. 489 + Error collection has some performance overhead. 490 + @param fragment_context Context element for fragment parsing. If provided, 491 + the input is parsed as a fragment (like innerHTML) rather than 492 + a complete document. 493 + 494 + @see <https://html.spec.whatwg.org/multipage/parsing.html> 495 + WHATWG: HTML parsing algorithm *) 496 + val parse : ?collect_errors:bool -> ?fragment_context:fragment_context -> 497 + Bytesrw.Bytes.Reader.t -> t 163 498 164 499 (** Parse raw bytes with automatic encoding detection. 165 500 166 - This function implements the WHATWG encoding sniffing algorithm: 167 - 1. Check for BOM (Byte Order Mark) 168 - 2. Prescan for <meta charset> 169 - 3. Fall back to UTF-8 501 + This function is useful when you have raw bytes and don't know the 502 + character encoding. It implements the WHATWG encoding sniffing algorithm: 503 + 504 + 1.
{b BOM detection}: Check for UTF-8, UTF-16LE, or UTF-16BE BOM 505 + 2. {b Prescan}: Look for [<meta charset="...">] in the first 1024 bytes 506 + 3. {b Transport hint}: Use the provided [transport_encoding] if any 507 + 4. {b Fallback}: Use UTF-8 (the modern web default) 508 + 509 + The detected encoding is stored in the result's [encoding] field. 510 + 511 + {b Example:} 512 + {[ 513 + let bytes = Bytes.of_string (really_input_string ic (in_channel_length ic)) in 514 + let result = Html5rw.parse_bytes bytes in 515 + match Html5rw.encoding result with 516 + | Some Utf8 -> print_endline "UTF-8 detected" 517 + | Some Windows_1252 -> print_endline "Windows-1252 detected" 518 + | _ -> () 519 + ]} 520 171 - @param collect_errors If true, collect parse errors (default: false) 172 - @param transport_encoding Encoding from HTTP Content-Type header 173 - @param fragment_context Context element for fragment parsing 174 - *) 175 - val parse_bytes : ?collect_errors:bool -> ?transport_encoding:string -> ?fragment_context:fragment_context -> bytes -> t 521 + @param collect_errors If [true], collect parse errors. Default: [false]. 522 + @param transport_encoding Encoding hint from HTTP Content-Type header. 523 + For example, if the server sends [Content-Type: text/html; charset=utf-8], 524 + pass [~transport_encoding:"utf-8"]. 525 + @param fragment_context Context element for fragment parsing. 526 + 527 + @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding> 528 + WHATWG: Determining the character encoding *) 529 + val parse_bytes : ?collect_errors:bool -> ?transport_encoding:string -> 530 + ?fragment_context:fragment_context -> bytes -> t 176 531 177 532 (** {1 Querying} *) 178 533 179 534 (** Query the DOM tree with a CSS selector.
180 535 181 - Supported selectors: 182 - - Tag: [div], [p], [span] 183 - - ID: [#myid] 184 - - Class: [.myclass] 185 - - Universal: [*] 186 - - Attribute: [[attr]], [[attr="value"]], [[attr~="value"]], [[attr|="value"]] 187 - - Pseudo-classes: [:first-child], [:last-child], [:nth-child(n)] 188 - - Combinators: descendant (space), child (>), adjacent sibling (+), general sibling (~) 536 + CSS selectors are patterns used to select elements in HTML documents. 537 + This function returns all nodes matching the selector, in document order. 538 + 539 + {b Supported selectors:} 540 + 541 + {i Type selectors:} 542 + - [div], [p], [span] - elements by tag name 543 + 544 + {i Class and ID selectors:} 545 + - [#myid] - element with [id="myid"] 546 + - [.myclass] - elements with class containing "myclass" 547 + 548 + {i Attribute selectors:} 549 + - [[attr]] - elements with the [attr] attribute 550 + - [[attr="value"]] - attribute equals value 551 + - [[attr~="value"]] - attribute contains word 552 + - [[attr|="value"]] - attribute starts with value or value- 553 + - [[attr^="value"]] - attribute starts with value 554 + - [[attr$="value"]] - attribute ends with value 555 + - [[attr*="value"]] - attribute contains value 556 + 557 + {i Pseudo-classes:} 558 + - [:first-child], [:last-child] - first/last child of parent 559 + - [:nth-child(n)] - nth child (1-indexed) 560 + - [:only-child] - only child of parent 561 + - [:empty] - elements with no children 562 + - [:not(selector)] - elements not matching selector 189 563 564 + {i Combinators:} 565 + - [A B] - B descendants of A (any depth) 566 + - [A > B] - B direct children of A 567 + - [A + B] - B immediately after A (adjacent sibling) 568 + - [A ~ B] - B after A (general sibling) 569 + 570 + {i Universal:} 571 + - [*] - all elements 572 + 573 + {b Examples:} 190 574 {[ 191 - let divs = Html5rw.query result "div.content > p" 575 + (* All paragraphs *) 576 + let ps = query result "p" 577 + 578 + (* Elements with class "warning" 
inside a div *) 579 + let warnings = query result "div .warning" 580 + 581 + (* Direct children of nav that are links *) 582 + let nav_links = query result "nav > a" 583 + 584 + (* Complex selector *) 585 + let items = query result "ul.menu > li:first-child a[href]" 192 586 ]} 193 587 194 - @raise Selector.Selector_error if the selector is invalid 195 - *) 588 + @raise Selector.Selector_error if the selector syntax is invalid 589 + 590 + @see <https://www.w3.org/TR/selectors-4/> 591 + W3C: Selectors Level 4 *) 196 592 val query : t -> string -> node list 197 593 198 - (** Check if a node matches a CSS selector. *) 594 + (** Check if a node matches a CSS selector. 595 + 596 + This is useful for filtering nodes or implementing custom traversals. 597 + 598 + {b Example:} 599 + {[ 600 + let is_external_link node = 601 + matches node "a[href^='http']" 602 + ]} 603 + 604 + @raise Selector.Selector_error if the selector syntax is invalid *) 199 605 val matches : node -> string -> bool 200 606 201 607 (** {1 Serialization} *) 202 608 203 609 (** Write the DOM tree to a [Bytes.Writer.t]. 204 610 611 + This serializes the DOM back to HTML. The output is valid HTML5 that 612 + can be parsed to produce an equivalent DOM tree. 613 + 614 + {b Example:} 205 615 {[ 206 616 open Bytesrw 207 617 let buf = Buffer.create 1024 in ··· 211 621 let html = Buffer.contents buf 212 622 ]} 213 623 214 - @param pretty If true, format with indentation (default: true) 215 - @param indent_size Number of spaces per indent level (default: 2) 216 - *) 217 - val to_writer : ?pretty:bool -> ?indent_size:int -> t -> Bytesrw.Bytes.Writer.t -> unit 624 + @param pretty If [true] (default), add indentation for readability. 625 + If [false], output compact HTML with no added whitespace. 626 + @param indent_size Spaces per indentation level (default: 2). 627 + Only used when [pretty] is [true]. 
628 + 629 + @see <https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments> 630 + WHATWG: Serialising HTML fragments *) 631 + val to_writer : ?pretty:bool -> ?indent_size:int -> t -> 632 + Bytesrw.Bytes.Writer.t -> unit 218 633 219 634 (** Serialize the DOM tree to a string. 220 635 221 - Convenience function when the output fits in memory. 636 + Convenience function that serializes to a string instead of a writer. 637 + Use {!to_writer} for large documents to avoid holding the entire output in memory. 222 638 223 - @param pretty If true, format with indentation (default: true) 224 - @param indent_size Number of spaces per indent level (default: 2) 225 - *) 639 + @param pretty If [true] (default), add indentation for readability. 640 + @param indent_size Spaces per indentation level (default: 2). *) 226 641 val to_string : ?pretty:bool -> ?indent_size:int -> t -> string 227 642 228 643 (** Extract text content from the DOM tree. 229 644 230 - @param separator String to insert between text nodes (default: " ") 231 - @param strip If true, trim whitespace (default: true) 232 - *) 645 + This concatenates all text nodes in the document, producing a string 646 + with just the readable text (no HTML tags). 647 + 648 + {b Example:} 649 + {[ 650 + (* For document: <div><p>Hello</p><p>World</p></div> *) 651 + let text = to_text result 652 + (* Returns: "Hello World" *) 653 + ]} 654 + 655 + @param separator String to insert between text nodes (default: [" "]) 656 + @param strip If [true] (default), trim leading/trailing whitespace *) 233 657 val to_text : ?separator:string -> ?strip:bool -> t -> string 234 658 235 - (** Serialize to html5lib test format (for testing). *) 659 + (** Serialize to html5lib test format. 660 + 661 + This produces the tree format used by the 662 + {{:https://github.com/html5lib/html5lib-tests} html5lib-tests} suite. 663 + Mainly useful for testing the parser against the reference tests.
*) 236 664 val to_test_format : t -> string 237 665 238 666 (** {1 Result Accessors} *) 239 667 240 - (** Get the root node of the parsed document. *) 668 + (** Get the root node of the parsed document. 669 + 670 + For full document parsing, this returns a Document node. The structure is: 671 + {v 672 + #document 673 + ├── !doctype (if present) 674 + └── html 675 + ├── head 676 + └── body 677 + v} 678 + 679 + For fragment parsing, this returns a Document Fragment node containing 680 + the parsed elements directly. *) 241 681 val root : t -> node 242 682 243 - (** Get parse errors (if error collection was enabled). *) 683 + (** Get parse errors (if error collection was enabled). 684 + 685 + Returns an empty list if [~collect_errors:true] was not passed to the 686 + parse function, or if the document was well-formed. 687 + 688 + Errors are returned in the order they were encountered during parsing. 689 + 690 + @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors> 691 + WHATWG: Parse errors *) 244 692 val errors : t -> parse_error list 245 693 246 - (** Get the detected encoding (if parsed from bytes). *) 694 + (** Get the detected encoding (if parsed from bytes). 695 + 696 + Returns [Some encoding] when {!parse_bytes} was used, indicating which 697 + encoding was detected or specified. Returns [None] when {!parse} was 698 + used, since it expects pre-decoded UTF-8 input. 699 + 700 + @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding> 701 + WHATWG: Determining the character encoding *) 247 702 val encoding : t -> encoding option 248 703 249 704 (** {1 DOM Utilities} 250 705 251 - Common DOM operations are available directly. For the full API, 252 - see the {!Dom} module. 706 + Common DOM operations are available directly on this module. For the 707 + full API including more advanced operations, see the {!Dom} module. 
708 + 709 + @see <https://html.spec.whatwg.org/multipage/dom.html> 710 + WHATWG: The elements of HTML 253 711 *) 254 712 255 713 (** Create an element node. 256 - @param namespace None for HTML, Some "svg" or Some "mathml" for foreign content 257 - @param attrs List of (name, value) attribute pairs 258 - *) 259 - val create_element : string -> ?namespace:string option -> ?attrs:(string * string) list -> unit -> node 714 + 715 + Elements are the building blocks of HTML documents. They represent tags 716 + like [<div>], [<p>], [<a>], etc. 717 + 718 + @param name Tag name (e.g., ["div"], ["p"], ["span"]) 719 + @param namespace Element namespace: 720 + - [None] (default): HTML namespace 721 + - [Some "svg"]: SVG namespace for graphics 722 + - [Some "mathml"]: MathML namespace for math notation 723 + @param attrs Initial attributes as [(name, value)] pairs 724 + 725 + {b Example:} 726 + {[ 727 + (* Simple element *) 728 + let div = create_element "div" () 260 729 261 - (** Create a text node. *) 730 + (* Element with attributes *) 731 + let link = create_element "a" 732 + ~attrs:[("href", "/about"); ("class", "nav-link")] 733 + () 734 + ]} 735 + 736 + @see <https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom> 737 + WHATWG: Elements in the DOM *) 738 + val create_element : string -> ?namespace:string option -> 739 + ?attrs:(string * string) list -> unit -> node 740 + 741 + (** Create a text node. 742 + 743 + Text nodes contain the readable text content of HTML documents. 744 + 745 + {b Example:} 746 + {[ 747 + let text = create_text "Hello, world!" 748 + ]} *) 262 749 val create_text : string -> node 263 750 264 - (** Create a comment node. *) 751 + (** Create a comment node. 752 + 753 + Comments are preserved in the DOM but not rendered. They're written 754 + as [<!-- text -->] in HTML. 
755 + 756 + @see <https://html.spec.whatwg.org/multipage/syntax.html#comments> 757 + WHATWG: Comments *) 265 758 val create_comment : string -> node 266 759 267 - (** Create an empty document node. *) 760 + (** Create an empty document node. 761 + 762 + The Document node is the root of an HTML document tree. 763 + 764 + @see <https://html.spec.whatwg.org/multipage/dom.html#document> 765 + WHATWG: The Document object *) 268 766 val create_document : unit -> node 269 767 270 - (** Create a document fragment node. *) 768 + (** Create a document fragment node. 769 + 770 + Document fragments are lightweight containers for holding nodes 771 + without a parent document. Used for template contents and fragment 772 + parsing results. 773 + 774 + @see <https://dom.spec.whatwg.org/#documentfragment> 775 + DOM Standard: DocumentFragment *) 271 776 val create_document_fragment : unit -> node 272 777 273 - (** Create a doctype node. *) 274 - val create_doctype : ?name:string -> ?public_id:string -> ?system_id:string -> unit -> node 778 + (** Create a doctype node. 275 779 276 - (** Append a child node to a parent. *) 780 + For HTML5 documents, use [create_doctype ~name:"html" ()]. 781 + 782 + @param name DOCTYPE name (usually ["html"]) 783 + @param public_id Public identifier (legacy) 784 + @param system_id System identifier (legacy) 785 + 786 + @see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype> 787 + WHATWG: The DOCTYPE *) 788 + val create_doctype : ?name:string -> ?public_id:string -> 789 + ?system_id:string -> unit -> node 790 + 791 + (** Append a child node to a parent. 792 + 793 + The child is added as the last child of the parent. If the child 794 + already has a parent, it is first removed from that parent. *) 277 795 val append_child : node -> node -> unit 278 796 279 - (** Insert a node before a reference node. *) 797 + (** Insert a node before a reference node. 
798 + 799 + @param parent The parent node 800 + @param new_child The node to insert 801 + @param ref_child The existing child to insert before 802 + 803 + Raises [Not_found] if [ref_child] is not a child of [parent]. *) 280 804 val insert_before : node -> node -> node -> unit 281 805 282 - (** Remove a child node from its parent. *) 806 + (** Remove a child node from its parent. 807 + 808 + Raises [Not_found] if [child] is not a child of [parent]. *) 283 809 val remove_child : node -> node -> unit 284 810 285 - (** Get an attribute value. *) 811 + (** Get an attribute value. 812 + 813 + Returns [Some value] if the attribute exists, [None] otherwise. 814 + Attribute names are case-sensitive (but were lowercased during parsing). 815 + 816 + @see <https://html.spec.whatwg.org/multipage/dom.html#attributes> 817 + WHATWG: Attributes *) 286 818 val get_attr : node -> string -> string option 287 819 288 - (** Set an attribute value. *) 820 + (** Set an attribute value. 821 + 822 + If the attribute exists, it is replaced. If not, it is added. *) 289 823 val set_attr : node -> string -> string -> unit 290 824 291 825 (** Check if a node has an attribute. *) 292 826 val has_attr : node -> string -> bool 293 827 294 - (** Get all descendant nodes. *) 828 + (** Get all descendant nodes in document order. 829 + 830 + Returns all nodes below this node in the tree, in the order they 831 + appear in the HTML source (depth-first). *) 295 832 val descendants : node -> node list 296 833 297 - (** Get all ancestor nodes (from parent to root). *) 834 + (** Get all ancestor nodes from parent to root. 835 + 836 + Returns the chain of parent nodes, starting with the immediate parent 837 + and ending with the Document node. *) 298 838 val ancestors : node -> node list 299 839 300 - (** Get text content of a node and its descendants. *) 840 + (** Get text content of a node and its descendants. 841 + 842 + For text nodes, returns the text directly. 
For elements, recursively 843 + concatenates all descendant text content. *) 301 844 val get_text_content : node -> string 302 845 303 846 (** Clone a node. 304 - @param deep If true, also clone descendants (default: false) 305 - *) 847 + 848 + @param deep If [true], recursively clone all descendants. 849 + If [false] (default), only clone the node itself. *) 306 850 val clone : ?deep:bool -> node -> node 307 851 308 - (** {1 Node Predicates} *) 852 + (** {1 Node Predicates} 309 853 310 - (** Test if a node is an element. *) 854 + Functions to test what type of node you have. 855 + *) 856 + 857 + (** Test if a node is an element. 858 + 859 + Elements are HTML tags like [<div>], [<p>], [<a>]. *) 311 860 val is_element : node -> bool 312 861 313 - (** Test if a node is a text node. *) 862 + (** Test if a node is a text node. 863 + 864 + Text nodes contain character content within elements. *) 314 865 val is_text : node -> bool 315 866 316 - (** Test if a node is a comment node. *) 867 + (** Test if a node is a comment node. 868 + 869 + Comment nodes represent HTML comments [<!-- ... -->]. *) 317 870 val is_comment : node -> bool 318 871 319 - (** Test if a node is a document node. *) 872 + (** Test if a node is a document node. 873 + 874 + The document node is the root of a complete HTML document tree. *) 320 875 val is_document : node -> bool 321 876 322 - (** Test if a node is a document fragment. *) 877 + (** Test if a node is a document fragment. 878 + 879 + Document fragments are lightweight containers for nodes. *) 323 880 val is_document_fragment : node -> bool 324 881 325 - (** Test if a node is a doctype node. *) 882 + (** Test if a node is a doctype node. 883 + 884 + Doctype nodes represent the [<!DOCTYPE>] declaration. *) 326 885 val is_doctype : node -> bool 327 886 328 887 (** Test if a node has children. *)
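The DOM utilities documented in this hunk can be exercised with a short sketch. It assumes the functions are exposed by the top-level `Html5rw` module with exactly the signatures shown in the diff, and that `append_child` takes the parent first (the diff documents that argument order only for `insert_before`). The tag names and attribute values are illustrative only.

```ocaml
(* Sketch only. Assumptions: the signatures above are reachable as
   Html5rw.<name>, and append_child is called as [append_child parent child]. *)
let () =
  (* Build <ul class="nav"><li>Home</li></ul> by hand. *)
  let nav = Html5rw.create_element "ul" ~attrs:[ ("class", "nav") ] () in
  let item = Html5rw.create_element "li" () in
  Html5rw.append_child item (Html5rw.create_text "Home");
  Html5rw.append_child nav item;

  (* set_attr adds the attribute when it does not exist yet. *)
  Html5rw.set_attr nav "id" "main-nav";
  assert (Html5rw.get_attr nav "id" = Some "main-nav");
  assert (Html5rw.has_attr nav "class");
  assert (not (Html5rw.has_attr item "class"));

  (* descendants returns every node below [nav]; keep only the elements. *)
  let elements = List.filter Html5rw.is_element (Html5rw.descendants nav) in
  Printf.printf "%d element descendant(s), text content: %S\n"
    (List.length elements)
    (Html5rw.get_text_content nav)
```

The asserts only check behaviour the doc comments state explicitly (`set_attr` adds or replaces, `get_attr` returns `Some value` for an existing attribute).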
+431 -93
lib/parser/html5rw_parser.mli
··· 3 3 SPDX-License-Identifier: MIT 4 4 ---------------------------------------------------------------------------*) 5 5 6 - (** HTML5 Parser 6 + (** HTML5 Parser - Low-Level API 7 7 8 8 This module provides the core HTML5 parsing functionality implementing 9 - the WHATWG parsing specification. It handles tokenization, tree construction, 9 + the {{:https://html.spec.whatwg.org/multipage/parsing.html} WHATWG 10 + HTML5 parsing specification}. It handles tokenization, tree construction, 10 11 error recovery, and produces a DOM tree. 11 12 12 - For most uses, prefer the top-level {!Html5rw} module which re-exports 13 - these functions with a simpler interface. 13 + For most uses, prefer the top-level {!Html5rw} module which provides 14 + a simpler interface. This module is for advanced use cases that need 15 + access to parser internals. 16 + 17 + {2 How HTML5 Parsing Works} 18 + 19 + The HTML5 parsing algorithm is unusual compared to most parsers. It was 20 + reverse-engineered from browser behavior rather than designed from a 21 + formal grammar. This ensures the parser handles malformed HTML exactly 22 + like web browsers do. 23 + 24 + The algorithm has three main phases: 25 + 26 + {3 1. Encoding Detection} 27 + 28 + Before parsing begins, the character encoding must be determined. The 29 + WHATWG specification defines a "sniffing" algorithm: 30 + 31 + 1. Check for a BOM (Byte Order Mark) at the start 32 + 2. Look for [<meta charset="...">] in the first 1024 bytes 33 + 3. Use HTTP Content-Type header hint if available 34 + 4. Fall back to UTF-8 35 + 36 + @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding> 37 + WHATWG: Determining the character encoding 38 + 39 + {3 2. Tokenization} 40 + 41 + The tokenizer converts the input stream into a sequence of tokens. 
42 + It implements a state machine with over 80 states to handle: 43 + 44 + - Data (text content) 45 + - Tags (start tags, end tags, self-closing tags) 46 + - Comments 47 + - DOCTYPEs 48 + - Character references ([&amp;], [&#60;], [&#x3C;]) 49 + - CDATA sections (in SVG/MathML) 50 + 51 + The tokenizer has special handling for: 52 + - {b Raw text elements}: [<script>], [<style>] - no markup parsing inside 53 + - {b Escapable raw text elements}: [<textarea>], [<title>] - limited parsing 54 + - {b RCDATA}: Content where only character references are parsed 55 + 56 + @see <https://html.spec.whatwg.org/multipage/parsing.html#tokenization> 57 + WHATWG: Tokenization 58 + 59 + {3 3. Tree Construction} 60 + 61 + The tree builder receives tokens from the tokenizer and builds the DOM 62 + tree. It uses {i insertion modes} - a state machine that determines how 63 + each token should be processed based on the current document context. 64 + 65 + {b Insertion modes} include: 66 + - [initial]: Before the DOCTYPE 67 + - [before_html]: Before the [<html>] element 68 + - [before_head]: Before the [<head>] element 69 + - [in_head]: Inside [<head>] 70 + - [in_body]: Inside [<body>] (the most complex mode) 71 + - [in_table]: Inside [<table>] (special handling) 72 + - [in_template]: Inside [<template>] 73 + - And many more... 74 + 75 + The tree builder maintains: 76 + - {b Stack of open elements}: Elements that have been opened but not closed 77 + - {b List of active formatting elements}: For handling nested formatting 78 + - {b The template insertion mode stack}: For [<template>] elements 79 + 80 + @see <https://html.spec.whatwg.org/multipage/parsing.html#tree-construction> 81 + WHATWG: Tree construction 82 + 83 + {2 Error Recovery} 84 + 85 + A key feature of HTML5 parsing is that it {b never fails}. The specification 86 + defines error recovery for every possible malformed input. 
For example: 87 + 88 + - Missing end tags are implicitly closed 89 + - Misnested tags are handled via the "adoption agency algorithm" 90 + - Invalid characters are replaced with U+FFFD 91 + - Unexpected elements are either ignored or moved to valid positions 14 92 15 - {2 Parsing Algorithm} 93 + This ensures every HTML document produces a valid DOM tree. 16 94 17 - The HTML5 parsing algorithm is defined by the WHATWG specification and 18 - consists of several phases: 95 + @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors> 96 + WHATWG: Parse errors 19 97 20 - 1. {b Encoding sniffing}: Detect character encoding from BOM, meta tags, 21 - or transport layer hints 22 - 2. {b Tokenization}: Convert the input stream into a sequence of tokens 23 - (start tags, end tags, character data, comments, etc.) 24 - 3. {b Tree construction}: Build the DOM tree using a state machine with 25 - multiple insertion modes 98 + {2 The Adoption Agency Algorithm} 99 + 100 + One of the most complex parts of HTML5 parsing is handling misnested 101 + formatting elements. For example: 102 + 103 + {v <p>Hello <b>world</p> <p>more</b> text</p> v} 26 104 27 - The algorithm includes extensive error recovery to handle malformed HTML 28 - in a consistent way across browsers. 105 + Browsers don't just error out - they use the "adoption agency algorithm" 106 + to produce sensible results. This algorithm: 107 + 1. Identifies formatting elements that span across other elements 108 + 2. Reconstructs the tree to properly nest elements 109 + 3. Moves nodes between parents as needed 29 110 30 - @see <https://html.spec.whatwg.org/multipage/parsing.html> 31 - The WHATWG HTML Parsing specification 111 + @see <https://html.spec.whatwg.org/multipage/parsing.html#adoption-agency-algorithm> 112 + WHATWG: The adoption agency algorithm 32 113 *) 33 114 34 115 (** {1 Sub-modules} *) 35 116 117 + (** DOM types and manipulation. *) 36 118 module Dom = Html5rw_dom 119 + 120 + (** HTML5 tokenizer. 
121 + 122 + The tokenizer implements the first stage of HTML5 parsing, converting 123 + an input byte stream into a sequence of tokens (start tags, end tags, 124 + text, comments, DOCTYPEs). 125 + 126 + @see <https://html.spec.whatwg.org/multipage/parsing.html#tokenization> 127 + WHATWG: Tokenization *) 37 128 module Tokenizer = Html5rw_tokenizer 129 + 130 + (** Character encoding detection and conversion. 131 + 132 + @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding> 133 + WHATWG: Determining the character encoding *) 38 134 module Encoding = Html5rw_encoding 135 + 136 + (** HTML element constants and categories. 137 + 138 + This module provides lists of element names that have special handling 139 + in the HTML5 parser: 140 + 141 + - {b Void elements}: Elements that cannot have children and have no end 142 + tag ([area], [base], [br], [col], [embed], [hr], [img], [input], 143 + [link], [meta], [source], [track], [wbr]) 144 + 145 + - {b Formatting elements}: Elements tracked in the list of active 146 + formatting elements for the adoption agency algorithm ([a], [b], [big], 147 + [code], [em], [font], [i], [nobr], [s], [small], [strike], [strong], 148 + [tt], [u]) 149 + 150 + - {b Special elements}: Elements with special parsing rules that affect 151 + scope and formatting reconstruction 152 + 153 + @see <https://html.spec.whatwg.org/multipage/syntax.html#void-elements> 154 + WHATWG: Void elements 155 + @see <https://html.spec.whatwg.org/multipage/parsing.html#formatting> 156 + WHATWG: Formatting elements *) 39 157 module Constants : sig 40 158 val void_elements : string list 159 + (** Elements that cannot have children: [area], [base], [br], [col], 160 + [embed], [hr], [img], [input], [link], [meta], [source], [track], [wbr]. 
161 + 162 + @see <https://html.spec.whatwg.org/multipage/syntax.html#void-elements> 163 + WHATWG: Void elements *) 164 + 41 165 val formatting_elements : string list 166 + (** Elements tracked for the adoption agency algorithm: [a], [b], [big], 167 + [code], [em], [font], [i], [nobr], [s], [small], [strike], [strong], 168 + [tt], [u]. 169 + 170 + @see <https://html.spec.whatwg.org/multipage/parsing.html#formatting> 171 + WHATWG: Formatting elements *) 172 + 42 173 val special_elements : string list 174 + (** Elements with special parsing behavior that affect scope checking. 175 + 176 + @see <https://html.spec.whatwg.org/multipage/parsing.html#special> 177 + WHATWG: Special elements *) 43 178 end 179 + 180 + (** Parser insertion modes. 181 + 182 + Insertion modes are the states of the tree construction state machine. 183 + They determine how each token from the tokenizer should be processed 184 + based on the current document context. 185 + 186 + For example, a [<td>] tag is handled differently depending on whether 187 + the parser is currently in a table context or in the body. 188 + 189 + @see <https://html.spec.whatwg.org/multipage/parsing.html#insertion-mode> 190 + WHATWG: Insertion mode *) 44 191 module Insertion_mode : sig 45 192 type t 193 + (** The insertion mode type. Values include modes like [initial], 194 + [before_html], [in_head], [in_body], [in_table], etc. *) 46 195 end 196 + 197 + (** Tree builder state. 198 + 199 + The tree builder maintains the state needed for tree construction: 200 + - Stack of open elements 201 + - List of active formatting elements 202 + - Template insertion mode stack 203 + - Current insertion mode 204 + - Foster parenting flag 205 + 206 + @see <https://html.spec.whatwg.org/multipage/parsing.html#tree-construction> 207 + WHATWG: Tree construction *) 47 208 module Tree_builder : sig 48 209 type t 210 + (** The tree builder state. 
*) 49 211 end 50 212 51 213 (** {1 Types} *) 52 214 53 215 (** A parse error encountered during parsing. 54 216 55 - HTML5 parsing never fails - it always produces a DOM tree. However, 56 - the specification defines many error conditions that conformance 57 - checkers should report. Error collection is optional and disabled 58 - by default for performance. 217 + HTML5 parsing {b never fails} - it always produces a DOM tree. However, 218 + the WHATWG specification defines 92 specific error conditions that 219 + conformance checkers should report. These errors indicate malformed 220 + HTML that browsers will still render (with error recovery). 221 + 222 + {b Error categories:} 223 + 224 + {i Tokenizer errors} (detected during tokenization): 225 + - [abrupt-closing-of-empty-comment]: Comment closed with [-->] without content 226 + - [abrupt-doctype-public-identifier]: DOCTYPE public ID ended unexpectedly 227 + - [eof-before-tag-name]: End of file while reading a tag name 228 + - [eof-in-tag]: End of file inside a tag 229 + - [missing-attribute-value]: Attribute has [=] but no value 230 + - [unexpected-null-character]: Null byte in the input 231 + - [unexpected-question-mark-instead-of-tag-name]: [<?] used instead of [<!] 232 + 233 + {i Tree construction errors} (detected during tree building): 234 + - [missing-doctype]: No DOCTYPE before first element 235 + - [unexpected-token-*]: Token appeared in wrong context 236 + - [foster-parenting]: Content moved outside table due to invalid position 59 237 60 - Error codes follow the WHATWG specification naming convention, 61 - e.g., "unexpected-null-character", "eof-in-tag". 238 + Enable error collection with [~collect_errors:true]. Error collection 239 + has some performance overhead, so it's disabled by default. 
62 240 63 241 @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors> 64 - The list of HTML5 parse errors 65 - *) 242 + WHATWG: Complete list of parse errors *) 66 243 type parse_error 67 244 68 245 (** Get the error code string. 69 246 70 - Error codes are lowercase with hyphens, matching the WHATWG spec names 71 - like "unexpected-null-character" or "eof-before-tag-name". 72 - *) 247 + Error codes are lowercase with hyphens, exactly matching the WHATWG 248 + specification naming. Examples: 249 + - ["unexpected-null-character"] 250 + - ["eof-before-tag-name"] 251 + - ["missing-end-tag-name"] 252 + - ["duplicate-attribute"] 253 + - ["missing-doctype"] 254 + 255 + @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors> 256 + WHATWG: Parse error codes *) 73 257 val error_code : parse_error -> string 74 258 75 - (** Get the line number where the error occurred (1-indexed). *) 259 + (** Get the line number where the error occurred. 260 + 261 + Line numbers are 1-indexed (first line is 1). Line breaks are 262 + detected at LF (U+000A), CR (U+000D), and CR+LF sequences. *) 76 263 val error_line : parse_error -> int 77 264 78 - (** Get the column number where the error occurred (1-indexed). *) 265 + (** Get the column number where the error occurred. 266 + 267 + Column numbers are 1-indexed (first column is 1). Columns reset 268 + to 1 after each line break. Column counting uses code points, 269 + not bytes or grapheme clusters. *) 79 270 val error_column : parse_error -> int 80 271 81 272 (** Context element for HTML fragment parsing. 82 273 83 - When parsing an HTML fragment (innerHTML), you need to specify the 84 - context element that would contain the fragment. This affects how 85 - the parser handles certain elements. 274 + When parsing HTML fragments (the content that would be assigned to 275 + an element's [innerHTML]), the parser needs to know what element 276 + would contain the fragment. 
This affects parsing in several ways: 86 277 87 - For example, parsing [<td>] as a fragment of a [<tr>] works differently 88 - than parsing it as a fragment of a [<div>]. 278 + {b Parser state initialization:} 279 + - For [<title>] or [<textarea>]: Tokenizer starts in RCDATA state 280 + - For [<style>], [<xmp>], [<iframe>], [<noembed>], [<noframes>]: 281 + Tokenizer starts in RAWTEXT state 282 + - For [<script>]: Tokenizer starts in script data state 283 + - For [<noscript>]: Tokenizer starts in RAWTEXT state (if scripting enabled) 284 + - For [<plaintext>]: Tokenizer starts in PLAINTEXT state 285 + - Otherwise: Tokenizer starts in data state 286 + 287 + {b Insertion mode:} 288 + The initial insertion mode depends on the context element: 289 + - [<template>]: "in template" mode 290 + - [<html>]: "before head" mode 291 + - [<head>]: "in head" mode 292 + - [<body>], [<div>], etc.: "in body" mode 293 + - [<table>]: "in table" mode 294 + - And so on... 89 295 90 296 @see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments> 91 - The HTML fragment parsing algorithm 92 - *) 297 + WHATWG: The fragment parsing algorithm *) 93 298 type fragment_context 94 299 95 300 (** Create a fragment parsing context. 96 301 97 - @param tag_name The tag name of the context element (e.g., "div", "tr") 98 - @param namespace Namespace: [None] for HTML, [Some "svg"], [Some "mathml"] 302 + @param tag_name Tag name of the context element. This should be the 303 + tag name of the element that would contain the fragment. 
304 + Common choices: 305 + - ["div"]: General-purpose (most common) 306 + - ["body"]: For full body content 307 + - ["tr"]: For table row content ([<td>] elements) 308 + - ["ul"], ["ol"]: For list content ([<li>] elements) 309 + - ["select"]: For [<option>] elements 99 310 311 + @param namespace Element namespace: 312 + - [None]: HTML namespace (default) 313 + - [Some "svg"]: SVG namespace 314 + - [Some "mathml"]: MathML namespace 315 + 316 + {b Examples:} 100 317 {[ 101 - (* Parse as innerHTML of a table row *) 318 + (* Parse innerHTML of a table row - <td> works correctly *) 102 319 let ctx = make_fragment_context ~tag_name:"tr" () 103 320 104 - (* Parse as innerHTML of an SVG element *) 321 + (* Parse innerHTML of an SVG group element *) 105 322 let ctx = make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") () 323 + 324 + (* Parse innerHTML of a select element - <option> works correctly *) 325 + let ctx = make_fragment_context ~tag_name:"select" () 106 326 ]} 107 - *) 327 + 328 + @see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments> 329 + WHATWG: Fragment parsing algorithm *) 108 330 val make_fragment_context : tag_name:string -> ?namespace:string option -> 109 331 unit -> fragment_context 110 332 111 333 (** Get the tag name of a fragment context. *) 112 334 val fragment_context_tag : fragment_context -> string 113 335 114 - (** Get the namespace of a fragment context. *) 336 + (** Get the namespace of a fragment context ([None] for HTML). *) 115 337 val fragment_context_namespace : fragment_context -> string option 116 338 117 339 (** Result of parsing an HTML document or fragment. 118 340 119 - Contains the parsed DOM tree, any errors encountered (if error 120 - collection was enabled), and the detected encoding (for byte input). 
341 + This opaque type contains: 342 + - The DOM tree (access via {!root}) 343 + - Parse errors if collection was enabled (access via {!errors}) 344 + - Detected encoding for byte input (access via {!encoding}) 121 345 *) 122 346 type t 123 347 124 348 (** {1 Parsing Functions} *) 125 349 350 + (** Parse HTML from a byte stream reader. 351 + 352 + This function implements the complete HTML5 parsing algorithm: 353 + 354 + 1. Reads bytes from the provided reader 355 + 2. Tokenizes the input into HTML tokens 356 + 3. Constructs a DOM tree using the tree construction algorithm 357 + 4. Returns the parsed result 358 + 359 + The input should be valid UTF-8. For automatic encoding detection 360 + from raw bytes, use {!parse_bytes} instead. 361 + 362 + {b Parser behavior:} 363 + 364 + For {b full document parsing} (no fragment context), the parser: 365 + - Creates a Document node as the root 366 + - Processes any DOCTYPE declaration 367 + - Creates [<html>], [<head>], and [<body>] elements as needed 368 + - Builds the full document tree 369 + 370 + For {b fragment parsing} (with fragment context), the parser: 371 + - Creates a Document Fragment as the root 372 + - Initializes tokenizer state based on context element 373 + - Initializes insertion mode based on context element 374 + - Does not create implicit [<html>], [<head>], [<body>] 375 + 376 + @param collect_errors If [true], collect parse errors in the result. 377 + Default: [false]. Enabling error collection adds overhead. 378 + @param fragment_context Context for fragment parsing. If provided, 379 + the input is parsed as fragment content (like innerHTML). 380 + 381 + @see <https://html.spec.whatwg.org/multipage/parsing.html> 382 + WHATWG: HTML parsing *) 126 383 val parse : ?collect_errors:bool -> ?fragment_context:fragment_context -> 127 384 Bytesrw.Bytes.Reader.t -> t 128 - (** Parse HTML from a byte stream reader. 385 + 386 + (** Parse HTML bytes with automatic encoding detection. 
387 + 388 + This function wraps {!parse} with encoding detection, implementing the 389 + WHATWG encoding sniffing algorithm: 390 + 391 + {b Detection order:} 392 + 1. {b BOM}: Check first 2-3 bytes for UTF-8, UTF-16LE, or UTF-16BE BOM 393 + 2. {b Prescan}: Look for [<meta charset="...">] or 394 + [<meta http-equiv="Content-Type" content="...charset=...">] 395 + in the first 1024 bytes 396 + 3. {b Transport hint}: Use [transport_encoding] if provided 397 + 4. {b Fallback}: Use UTF-8 398 + 399 + The detected encoding is stored in the result (access via {!encoding}). 129 400 130 - This is the primary parsing function. The input must be valid UTF-8 131 - (or will be converted from detected encoding when using {!parse_bytes}). 401 + {b Prescan details:} 132 402 133 - @param collect_errors If [true], collect parse errors (default: [false]) 134 - @param fragment_context Context for fragment parsing (innerHTML) 403 + The prescan algorithm parses just enough of the document to find a 404 + charset declaration. It handles: 405 + - [<meta charset="utf-8">] 406 + - [<meta http-equiv="Content-Type" content="text/html; charset=utf-8">] 407 + - Comments and other markup are skipped 408 + - Parsing stops after 1024 bytes 135 409 136 - {[ 137 - open Bytesrw 138 - let reader = Bytes.Reader.of_string "<p>Hello</p>" in 139 - let result = parse reader 140 - ]} 141 - *) 410 + @param collect_errors If [true], collect parse errors. Default: [false]. 411 + @param transport_encoding Encoding hint from HTTP Content-Type header. 412 + For example: ["utf-8"], ["iso-8859-1"], ["windows-1252"]. 413 + @param fragment_context Context for fragment parsing. 
142 414 415 + @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding> 416 + WHATWG: Determining the character encoding 417 + @see <https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding> 418 + WHATWG: Prescan algorithm *) 143 419 val parse_bytes : ?collect_errors:bool -> ?transport_encoding:string -> 144 420 ?fragment_context:fragment_context -> bytes -> t 145 - (** Parse HTML bytes with automatic encoding detection. 146 421 147 - Implements the WHATWG encoding sniffing algorithm: 148 - 1. Check for BOM (UTF-8, UTF-16LE, UTF-16BE) 149 - 2. Prescan for [<meta charset>] declaration 150 - 3. Use transport encoding hint if provided 151 - 4. Fall back to UTF-8 422 + (** {1 Result Accessors} *) 423 + 424 + (** Get the root node of the parsed document. 152 425 153 - @param collect_errors If [true], collect parse errors (default: [false]) 154 - @param transport_encoding Encoding from HTTP Content-Type header 155 - @param fragment_context Context for fragment parsing (innerHTML) 156 - *) 426 + For full document parsing, returns a Document node with structure: 427 + {v 428 + #document 429 + ├── !doctype (if DOCTYPE was present) 430 + └── html 431 + ├── head 432 + │ └── ... (title, meta, link, script, style) 433 + └── body 434 + └── ... (page content) 435 + v} 157 436 158 - (** {1 Result Accessors} *) 437 + For fragment parsing, returns a Document Fragment node containing 438 + the parsed elements directly (no implicit html/head/body). 159 439 440 + @see <https://html.spec.whatwg.org/multipage/dom.html#document> 441 + WHATWG: The Document object *) 160 442 val root : t -> Dom.node 161 - (** Get the root node of the parsed document. 162 443 163 - For full document parsing, this is a document node. 164 - For fragment parsing, this is a document fragment node. 165 - *) 444 + (** Get parse errors collected during parsing. 
445 + 446 + Returns an empty list if error collection was not enabled 447 + ([collect_errors:false] or omitted) or if the document was well-formed. 448 + 449 + Errors are returned in the order they were encountered. 166 450 451 + {b Example:} 452 + {[ 453 + let result = parse ~collect_errors:true reader in 454 + List.iter (fun e -> 455 + Printf.printf "Line %d, col %d: %s\n" 456 + (error_line e) (error_column e) (error_code e) 457 + ) (errors result) 458 + ]} 459 + 460 + @see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors> 461 + WHATWG: Parse errors *) 167 462 val errors : t -> parse_error list 168 - (** Get parse errors (empty if error collection was disabled). *) 463 + 464 + (** Get the detected encoding. 465 + 466 + Returns [Some encoding] when {!parse_bytes} was used, indicating which 467 + encoding was detected or specified. 468 + 469 + Returns [None] when {!parse} was used (it expects pre-decoded UTF-8). 169 470 471 + @see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding> 472 + WHATWG: Determining the character encoding *) 170 473 val encoding : t -> Encoding.encoding option 171 - (** Get the detected encoding (only set when using {!parse_bytes}). *) 172 474 173 475 (** {1 Querying} *) 174 476 175 - val query : t -> string -> Dom.node list 176 477 (** Query the DOM with a CSS selector. 177 478 479 + Returns all elements matching the selector in document order. 480 + 481 + {b Supported selectors:} 482 + 483 + See {!Html5rw_selector} for the complete list. Key selectors include: 484 + - Type: [div], [p], [a] 485 + - ID: [#myid] 486 + - Class: [.myclass] 487 + - Attribute: [[href]], [[type="text"]] 488 + - Pseudo-class: [:first-child], [:nth-child(2)] 489 + - Combinators: [div p] (descendant), [div > p] (child) 490 + 178 491 @raise Html5rw_selector.Selector_error if the selector is invalid 179 492 180 - See {!Html5rw_selector} for supported selector syntax. 
181 - *) 493 + @see <https://www.w3.org/TR/selectors-4/> 494 + W3C: Selectors Level 4 *) 495 + val query : t -> string -> Dom.node list 182 496 183 497 (** {1 Serialization} *) 184 498 499 + (** Serialize the DOM tree to a byte writer. 500 + 501 + Outputs valid HTML5 that can be parsed to produce an equivalent DOM tree. 502 + The output follows the WHATWG serialization algorithm. 503 + 504 + {b Serialization rules:} 505 + - Void elements are written without end tags 506 + - Attributes are quoted with double quotes 507 + - Special characters in text/attributes are escaped 508 + - Comments preserve their content 509 + - DOCTYPE is serialized as [<!DOCTYPE html>] 510 + 511 + @param pretty If [true] (default), add indentation for readability. 512 + @param indent_size Spaces per indent level (default: 2). 513 + 514 + @see <https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments> 515 + WHATWG: Serialising HTML fragments *) 185 516 val to_writer : ?pretty:bool -> ?indent_size:int -> t -> 186 517 Bytesrw.Bytes.Writer.t -> unit 187 - (** Serialize the DOM tree to a byte stream writer. 188 518 189 - @param pretty If [true], format with indentation (default: [true]) 190 - @param indent_size Spaces per indent level (default: [2]) 191 - *) 519 + (** Serialize the DOM tree to a string. 192 520 521 + Convenience wrapper around {!to_writer} that returns a string. 522 + 523 + @param pretty If [true] (default), add indentation for readability. 524 + @param indent_size Spaces per indent level (default: 2). *) 193 525 val to_string : ?pretty:bool -> ?indent_size:int -> t -> string 194 - (** Serialize the DOM tree to a string. 195 526 196 - @param pretty If [true], format with indentation (default: [true]) 197 - @param indent_size Spaces per indent level (default: [2]) 198 - *) 199 - 200 - val to_text : ?separator:string -> ?strip:bool -> t -> string 201 527 (** Extract text content from the DOM tree. 
202 528 203 - @param separator String between text nodes (default: [" "]) 204 - @param strip If [true], trim whitespace (default: [true]) 205 - *) 529 + Returns the concatenation of all text node content in document order, 530 + with no HTML markup. 206 531 207 - val to_test_format : t -> string 532 + @param separator String to insert between text nodes (default: [" "]) 533 + @param strip If [true] (default), trim leading/trailing whitespace *) 534 + val to_text : ?separator:string -> ?strip:bool -> t -> string 535 + 208 536 (** Serialize to html5lib test format. 209 537 210 - This format is used by the html5lib test suite and shows the tree 211 - structure with indentation and node type prefixes. 212 - *) 538 + This produces the tree representation format used by the 539 + {{:https://github.com/html5lib/html5lib-tests} html5lib-tests} suite. 540 + 541 + The format shows the tree structure with: 542 + - Indentation indicating depth (2 spaces per level) 543 + - Prefixes indicating node type: 544 + - [<!DOCTYPE ...>] for DOCTYPE 545 + - [<tagname>] for elements (with attributes on same line) 546 + - ["text"] for text nodes 547 + - [<!-- comment -->] for comments 548 + 549 + Mainly useful for testing the parser against the reference test suite. *) 550 + val to_test_format : t -> string
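A minimal end-to-end sketch of the parser interface documented above. It assumes the interface is reachable as `Html5rw_parser` (inferred from the file name `lib/parser/html5rw_parser.mli`) and that the `bytesrw` library is installed. The input strings are illustrative, and no particular error list or serialized output is claimed.

```ocaml
(* Sketch only. Assumption: the .mli above is exposed as Html5rw_parser. *)
let () =
  (* Full-document parse of misnested markup, with error collection on. *)
  let html = "<p>Hello <b>world</p> <p>more</b> text</p>" in
  let reader = Bytesrw.Bytes.Reader.of_string html in
  let result = Html5rw_parser.parse ~collect_errors:true reader in
  List.iter
    (fun e ->
      Printf.printf "Line %d, col %d: %s\n"
        (Html5rw_parser.error_line e)
        (Html5rw_parser.error_column e)
        (Html5rw_parser.error_code e))
    (Html5rw_parser.errors result);

  (* Query with a type selector, then serialize without indentation. *)
  let paragraphs = Html5rw_parser.query result "p" in
  Printf.printf "%d <p> element(s)\n" (List.length paragraphs);
  print_endline (Html5rw_parser.to_string ~pretty:false result);

  (* Fragment parse: <td> content parsed as the innerHTML of a <tr>. *)
  let ctx = Html5rw_parser.make_fragment_context ~tag_name:"tr" () in
  let cells = Bytesrw.Bytes.Reader.of_string "<td>a</td><td>b</td>" in
  let fragment = Html5rw_parser.parse ~fragment_context:ctx cells in
  print_endline (Html5rw_parser.to_text fragment);

  (* parse_bytes records the detected encoding; parse does not. *)
  let raw = Bytes.of_string "<meta charset=\"utf-8\"><p>caf\xc3\xa9</p>" in
  let sniffed = Html5rw_parser.parse_bytes raw in
  match Html5rw_parser.encoding sniffed with
  | Some _ -> print_endline "encoding detected"
  | None -> print_endline "no encoding recorded"
```

Error collection is enabled only for the first parse because the doc comment notes it adds overhead and is off by default.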