+530
-87
lib/dom/node.mli
···
5
5
6
6
(** HTML5 DOM Node Types and Operations
7
7
8
-
This module provides the DOM node representation used by the HTML5 parser.
9
-
Nodes form a tree structure representing HTML documents. The type follows
10
-
the WHATWG HTML5 specification for document structure.
8
+
This module provides the DOM (Document Object Model) node representation
9
+
used by the HTML5 parser. The DOM is a programming interface that
10
+
represents an HTML document as a tree of nodes, where each node represents
11
+
part of the document (an element, text content, comment, etc.).
12
+
13
+
{2 What is the DOM?}
14
+
15
+
When an HTML parser processes markup like [<p>Hello <b>world</b></p>], it
16
+
doesn't store the text directly. Instead, it builds a tree structure in
17
+
memory:
18
+
19
+
{v
20
+
Document
21
+
└── html
22
+
└── body
23
+
└── p
24
+
├── #text "Hello "
25
+
└── b
26
+
└── #text "world"
27
+
v}
28
+
29
+
This tree is the DOM. Each entry in the tree is a {i node}. Programs can
30
+
traverse and modify this tree to read or change the document.
31
+
32
+
@see <https://html.spec.whatwg.org/multipage/dom.html>
33
+
WHATWG: Semantics, structure, and APIs of HTML documents
11
34
12
35
{2 Node Types}
13
36
14
37
The HTML5 DOM includes several node types, all represented by the same
15
38
record type with different field usage:
16
39
17
-
- {b Element nodes}: Regular HTML elements like [<div>], [<p>], [<span>]
18
-
- {b Text nodes}: Text content within elements
19
-
- {b Comment nodes}: HTML comments [<!-- comment -->]
20
-
- {b Document nodes}: The root node representing the entire document
21
-
- {b Document fragment nodes}: A lightweight container (used for templates)
22
-
- {b Doctype nodes}: The [<!DOCTYPE html>] declaration
40
+
- {b Element nodes}: HTML elements like [<div>], [<p>], [<a href="...">].
41
+
Elements are the building blocks of HTML documents. They can have
42
+
attributes and contain other nodes.
43
+
44
+
- {b Text nodes}: The actual text content within elements. For example,
45
+
in [<p>Hello</p>], "Hello" is a text node that is a child of the [<p>]
46
+
element.
47
+
48
+
- {b Comment nodes}: HTML comments written as [<!-- comment text -->].
49
+
Comments are preserved in the DOM but not rendered.
50
+
51
+
- {b Document nodes}: The root of the entire document tree. Every HTML
52
+
document has exactly one Document node at the top.
53
+
54
+
- {b Document fragment nodes}: Lightweight containers that hold a
55
+
collection of nodes without a parent. Used for efficient batch DOM
56
+
operations and [<template>] element contents.
57
+
58
+
- {b Doctype nodes}: The [<!DOCTYPE html>] declaration at the start of
59
+
HTML5 documents. This declaration tells browsers to render the page
60
+
in standards mode.
61
+
62
+
@see <https://html.spec.whatwg.org/multipage/dom.html#kinds-of-content>
63
+
WHATWG: Kinds of content
23
64
24
65
{2 Namespaces}
25
66
26
-
Elements can belong to different namespaces:
27
-
- [None] or [Some "html"]: HTML namespace (default)
28
-
- [Some "svg"]: SVG namespace for embedded SVG content
29
-
- [Some "mathml"]: MathML namespace for mathematical notation
67
+
HTML5 can embed content from other XML vocabularies. Elements belong to
68
+
one of three {i namespaces}:
69
+
70
+
- {b HTML namespace} ([None] or implicit): Standard HTML elements like
71
+
[<div>], [<p>], [<table>]. This is the default for all elements.
72
+
73
+
- {b SVG namespace} ([Some "svg"]): Scalable Vector Graphics for drawing.
74
+
When the parser encounters an [<svg>] tag, it and the elements inside it
75
+
(like [<rect>], [<circle>], [<path>]) are placed in the SVG namespace, except inside HTML integration points such as [<foreignObject>].
76
+
77
+
- {b MathML namespace} ([Some "mathml"]): Mathematical Markup Language
78
+
for equations. When the parser encounters a [<math>] tag, elements
79
+
inside it are placed in the MathML namespace.
80
+
81
+
The parser automatically switches namespaces when entering and leaving
82
+
these foreign content islands.
30
83
31
-
The parser automatically switches namespaces when encountering [<svg>]
32
-
or [<math>] elements, as specified by the HTML5 algorithm.
84
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inforeign>
85
+
WHATWG: Parsing foreign content
33
86
34
87
{2 Tree Structure}
35
88
36
89
Nodes form a bidirectional tree: each node has a list of children and
37
-
an optional parent reference. Modification functions maintain these
38
-
references automatically.
90
+
an optional parent reference. Modification functions in this module
91
+
maintain these references automatically.
92
+
93
+
When modified only through these functions, the tree stays well-formed: a
94
+
node has at most one parent and no circular references arise. Mutating the record fields directly bypasses this bookkeeping.
39
95
*)
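The tree described above can be sketched with a cut-down record. This is a minimal illustration, not the module's implementation: the `node` type here is a hypothetical stand-in with only four fields, and `make`, `add`, `outline_at` and `outline` are helper names invented for the example.

```ocaml
(* Hypothetical stand-in for the module's node record: just enough
   fields to show the links between parents and children. *)
type node = {
  name : string;
  data : string;
  mutable children : node list;
  mutable parent : node option;
}

let make ?(data = "") name = { name; data; children = []; parent = None }

(* Link [child] under [parent], keeping both directions in sync. *)
let add parent child =
  child.parent <- Some parent;
  parent.children <- parent.children @ [ child ]

(* Render a subtree as an indented outline, one node per line. *)
let rec outline_at depth node =
  let label =
    if node.name = "#text" then Printf.sprintf "#text %S" node.data
    else node.name
  in
  let line = String.make (2 * depth) ' ' ^ label in
  String.concat "\n" (line :: List.map (outline_at (depth + 1)) node.children)

let outline node = outline_at 0 node

(* Build <p>Hello <b>world</b></p> by hand. *)
let p = make "p"
let b = make "b"
let hello = make ~data:"Hello " "#text"
let world = make ~data:"world" "#text"

let () =
  add p hello;
  add p b;
  add b world;
  print_endline (outline p)
```

Because parents and children point at each other, the sketch compares nodes with physical equality (`==`) where needed; structural equality (`=`) on such cyclic values may not terminate.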
40
96
41
97
(** {1 Types} *)
42
98
43
99
(** Information associated with a DOCTYPE node.
44
100
45
-
In HTML5, the DOCTYPE is primarily used for quirks mode detection.
46
-
Most modern HTML5 documents use [<!DOCTYPE html>] which results in
47
-
all fields being [None] or the name being [Some "html"].
101
+
The {i document type declaration} (DOCTYPE) historically declared which
102
+
version of HTML a document uses. In HTML5, the standard declaration is simply:
103
+
104
+
{v <!DOCTYPE html> v}
105
+
106
+
This minimal DOCTYPE triggers {i standards mode} (no quirks). The DOCTYPE
107
+
can optionally include a public identifier and system identifier for
108
+
legacy compatibility with SGML-based tools, but these are rarely used
109
+
in modern HTML5 documents.
110
+
111
+
{b Historical context:} In HTML4 and XHTML, DOCTYPEs were verbose and
112
+
referenced DTD files. For example:
113
+
{v <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
114
+
"http://www.w3.org/TR/html4/strict.dtd"> v}
115
+
116
+
HTML5 simplified this to just [<!DOCTYPE html>] because:
117
+
- Browsers never actually fetched or validated against DTDs
118
+
- The DOCTYPE's only real purpose is triggering standards mode
119
+
- A minimal DOCTYPE achieves this goal
120
+
121
+
{b Field meanings:}
122
+
- [name]: The document type name, almost always ["html"] for HTML documents
123
+
- [public_id]: A public identifier (legacy); [None] for HTML5
124
+
- [system_id]: A system identifier/URL (legacy); [None] for HTML5
48
125
126
+
@see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
127
+
WHATWG: The DOCTYPE
49
128
@see <https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode>
50
-
The WHATWG specification for DOCTYPE handling
129
+
WHATWG: DOCTYPE handling during parsing
51
130
*)
52
131
type doctype_data = {
53
132
name : string option; (** The DOCTYPE name, e.g., "html" *)
···
57
136
58
137
(** Quirks mode setting for the document.
59
138
60
-
Quirks mode affects CSS layout behavior for backwards compatibility with
61
-
old web content. The HTML5 parser determines quirks mode based on the
62
-
DOCTYPE declaration.
139
+
{i Quirks mode} is a browser rendering mode that emulates bugs and
140
+
non-standard behaviors from older browsers (such as Netscape Navigator 4 and Internet Explorer 5).
141
+
Modern HTML5 documents should always render in {i standards mode}
142
+
(no quirks) for consistent, predictable behavior.
143
+
144
+
The HTML5 parser determines quirks mode based on the DOCTYPE declaration:
145
+
146
+
- {b No_quirks} (Standards mode): The document renders according to modern
147
+
HTML5 and CSS specifications. This is triggered by [<!DOCTYPE html>].
148
+
CSS box model, table layout, and other features work as specified.
63
149
64
-
- [No_quirks]: Standards mode - full HTML5/CSS3 behavior
65
-
- [Quirks]: Full quirks mode - emulates legacy browser behavior
66
-
- [Limited_quirks]: Almost standards mode - limited quirks for specific cases
150
+
- {b Quirks} (Full quirks mode): The document renders with legacy browser
151
+
bugs emulated. This happens when:
152
+
{ul
153
+
{- DOCTYPE is missing entirely}
154
+
{- DOCTYPE has certain legacy public identifiers}
155
+
{- DOCTYPE has the wrong format}}
156
+
157
+
In quirks mode, many CSS properties behave differently:
158
+
{ul
159
+
{- Tables don't inherit font properties}
160
+
{- Box model uses non-standard width calculations}
161
+
{- Class and ID selectors match case-insensitively}}
162
+
163
+
- {b Limited_quirks} (Almost standards mode): A middle ground that applies
164
+
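only a few specific quirks, primarily how line heights are computed (most
165
+
visible for images in table cells). Triggered by XHTML 1.0 Transitional/Frameset DOCTYPEs and by HTML 4.01 Transitional/Frameset DOCTYPEs that carry a system identifier.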
166
+
167
+
{b Recommendation:} Always use [<!DOCTYPE html>] at the start of HTML5
168
+
documents to ensure {b No_quirks} mode.
67
169
68
-
@see <https://quirks.spec.whatwg.org/> The Quirks Mode specification
170
+
@see <https://quirks.spec.whatwg.org/>
171
+
Quirks Mode Standard - detailed specification
172
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode>
173
+
WHATWG: How the parser determines quirks mode
69
174
*)
70
175
type quirks_mode = No_quirks | Quirks | Limited_quirks
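A rough sketch of how a DOCTYPE could be mapped to a quirks mode follows. It is deliberately simplified: the specification lists many more legacy public identifiers that force quirks mode, and the parser's own logic is not shown in this file. The function names (`quirks_of_doctype`, `starts_with_ci`, `has_prefix`) are invented for the example, and `public_id` and `system_id` are assumed to be `string option` like `name`.

```ocaml
(* Assumed shape of the DOCTYPE record; only [name]'s type is visible
   in the source, the other two fields are assumed to match it. *)
type doctype_data = {
  name : string option;
  public_id : string option;
  system_id : string option;
}

type quirks_mode = No_quirks | Quirks | Limited_quirks

(* Case-insensitive prefix test for public identifiers. *)
let starts_with_ci ~prefix s =
  let n = String.length prefix in
  String.length s >= n
  && String.lowercase_ascii (String.sub s 0 n) = String.lowercase_ascii prefix

let has_prefix prefixes = function
  | None -> false
  | Some id -> List.exists (fun prefix -> starts_with_ci ~prefix id) prefixes

(* Simplified decision: a missing DOCTYPE or a non-"html" name gives
   full quirks; a few "loose" legacy identifiers give limited quirks. *)
let quirks_of_doctype (doctype : doctype_data option) =
  match doctype with
  | None -> Quirks
  | Some { name; _ }
    when Option.map String.lowercase_ascii name <> Some "html" -> Quirks
  | Some { public_id; system_id; _ } ->
    let html4_loose =
      has_prefix
        [ "-//W3C//DTD HTML 4.01 Transitional//";
          "-//W3C//DTD HTML 4.01 Frameset//" ]
        public_id
    in
    let xhtml_loose =
      has_prefix
        [ "-//W3C//DTD XHTML 1.0 Transitional//";
          "-//W3C//DTD XHTML 1.0 Frameset//" ]
        public_id
    in
    if html4_loose && system_id = None then Quirks
    else if html4_loose || xhtml_loose then Limited_quirks
    else No_quirks

let html5 = { name = Some "html"; public_id = None; system_id = None }
```

With these assumptions, `quirks_of_doctype (Some html5)` yields `No_quirks` and `quirks_of_doctype None` yields `Quirks`.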
71
176
···
73
178
74
179
All node types use the same record structure. The [name] field determines
75
180
the node type:
76
-
- Element: the tag name (e.g., "div", "p")
181
+
- Element: the tag name (e.g., "div", "p", "span")
77
182
- Text: "#text"
78
183
- Comment: "#comment"
79
184
- Document: "#document"
80
185
- Document fragment: "#document-fragment"
81
186
- Doctype: "!doctype"
82
187
83
-
{3 Field Usage by Node Type}
188
+
{3 Understanding Node Fields}
189
+
190
+
Different node types use different combinations of fields:
84
191
85
192
{v
86
193
Node Type | name | namespace | attrs | data | template_content | doctype
···
92
199
Document Fragment | "#document-fragment" | No | No | No | No | No
93
200
Doctype | "!doctype" | No | No | No | No | Yes
94
201
v}
202
+
203
+
{3 Element Tag Names}
204
+
205
+
For element nodes, the [name] field contains the lowercase tag name.
206
+
HTML5 defines many elements with specific meanings:
207
+
208
+
{b Structural elements:} [html], [head], [body], [header], [footer],
209
+
[main], [nav], [article], [section], [aside]
210
+
211
+
{b Text content:} [p], [div], [span], [h1]-[h6], [pre], [blockquote]
212
+
213
+
{b Lists:} [ul], [ol], [li], [dl], [dt], [dd]
214
+
215
+
{b Tables:} [table], [tr], [td], [th], [thead], [tbody], [tfoot]
216
+
217
+
{b Forms:} [form], [input], [button], [select], [textarea], [label]
218
+
219
+
{b Media:} [img], [audio], [video], [canvas], [svg]
220
+
221
+
@see <https://html.spec.whatwg.org/multipage/indices.html#elements-3>
222
+
WHATWG: Index of HTML elements
223
+
224
+
{3 Void Elements}
225
+
226
+
Some elements are {i void elements} - they cannot have children and have
227
+
no end tag. These include: [area], [base], [br], [col], [embed], [hr],
228
+
[img], [input], [link], [meta], [source], [track], [wbr].
229
+
230
+
@see <https://html.spec.whatwg.org/multipage/syntax.html#void-elements>
231
+
WHATWG: Void elements
232
+
233
+
{3 The Template Element}
234
+
235
+
The [<template>] element is special: its children are not rendered
236
+
directly but stored in a separate document fragment accessible via
237
+
the [template_content] field. Templates are used for client-side
238
+
templating where content is cloned and inserted via JavaScript.
239
+
240
+
@see <https://html.spec.whatwg.org/multipage/scripting.html#the-template-element>
241
+
WHATWG: The template element
95
242
*)
96
243
type node = {
97
244
mutable name : string;
98
-
(** Tag name for elements, or special name for other node types *)
245
+
(** Tag name for elements, or special name for other node types.
246
+
247
+
For elements, this is the lowercase tag name (e.g., "div", "span").
248
+
For other node types, use the constants {!document_name},
249
+
{!text_name}, {!comment_name}, etc. *)
99
250
100
251
mutable namespace : string option;
101
-
(** Element namespace: [None] for HTML, [Some "svg"], [Some "mathml"] *)
252
+
(** Element namespace: [None] for HTML, [Some "svg"], [Some "mathml"].
253
+
254
+
Most elements are in the HTML namespace ([None]). The SVG and MathML
255
+
namespaces are only used when content appears inside [<svg>] or
256
+
[<math>] elements respectively.
257
+
258
+
@see <https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom>
259
+
WHATWG: Elements in the DOM *)
102
260
103
261
mutable attrs : (string * string) list;
104
-
(** Element attributes as (name, value) pairs *)
262
+
(** Element attributes as (name, value) pairs.
263
+
264
+
Attributes provide additional information about elements. Common
265
+
global attributes include:
266
+
- [id]: Unique identifier for the element
267
+
- [class]: Space-separated list of CSS class names
268
+
- [style]: Inline CSS styles
269
+
- [title]: Advisory text (shown as tooltip)
270
+
- [lang]: Language of the element's content
271
+
- [hidden]: Whether the element should be hidden
272
+
273
+
Element-specific attributes include:
274
+
- [href] on [<a>]: The link destination URL
275
+
- [src] on [<img>]: The image source URL
276
+
- [type] on [<input>]: The input control type
277
+
- [disabled] on form controls: Whether the control is disabled
278
+
279
+
In HTML5, attribute names are case-insensitive and are normalized
280
+
to lowercase by the parser.
281
+
282
+
@see <https://html.spec.whatwg.org/multipage/dom.html#global-attributes>
283
+
WHATWG: Global attributes
284
+
@see <https://html.spec.whatwg.org/multipage/indices.html#attributes-3>
285
+
WHATWG: Index of attributes *)
105
286
106
287
mutable children : node list;
107
-
(** Child nodes in document order *)
288
+
(** Child nodes in document order.
289
+
290
+
For most elements, this list contains the nested elements and text.
291
+
For void elements (like [<br>], [<img>]), this is always empty.
292
+
For [<template>] elements, the actual content is in
293
+
[template_content], not here. *)
108
294
109
295
mutable parent : node option;
110
-
(** Parent node, [None] for root nodes *)
296
+
(** Parent node, [None] for root nodes.
297
+
298
+
In a parsed document, every node except the Document node has a parent;
299
+
newly created and detached nodes have [None]. This back-reference enables traversing up the tree. *)
111
300
112
301
mutable data : string;
113
-
(** Text content for text and comment nodes *)
302
+
(** Text content for text and comment nodes.
303
+
304
+
For text nodes, this contains the actual text. For comment nodes,
305
+
this contains the comment text (without the [<!--] and [-->]
306
+
delimiters). For other node types, this field is empty. *)
114
307
115
308
mutable template_content : node option;
116
-
(** Document fragment for [<template>] element contents *)
309
+
(** Document fragment for [<template>] element contents.
310
+
311
+
The [<template>] element holds "inert" content that is not
312
+
rendered but can be cloned and inserted elsewhere. This field
313
+
contains a document fragment with the template's content.
314
+
315
+
For non-template elements, this is [None].
316
+
317
+
@see <https://html.spec.whatwg.org/multipage/scripting.html#the-template-element>
318
+
WHATWG: The template element *)
117
319
118
320
mutable doctype : doctype_data option;
119
-
(** DOCTYPE information for doctype nodes *)
321
+
(** DOCTYPE information for doctype nodes.
322
+
323
+
Only doctype nodes use this field; for all other nodes it is [None]. *)
120
324
}
121
325
122
326
(** {1 Node Name Constants}
···
126
330
*)
127
331
128
332
val document_name : string
129
-
(** ["#document"] - name for document nodes *)
333
+
(** ["#document"] - name for document nodes.
334
+
335
+
The Document node is the root of every HTML document tree. It represents
336
+
the entire document and is the parent of the [<html>] element.
337
+
338
+
@see <https://html.spec.whatwg.org/multipage/dom.html#document>
339
+
WHATWG: The Document object *)
130
340
131
341
val document_fragment_name : string
132
-
(** ["#document-fragment"] - name for document fragment nodes *)
342
+
(** ["#document-fragment"] - name for document fragment nodes.
343
+
344
+
Document fragments are lightweight container nodes used to hold a
345
+
collection of nodes without a parent document. They are used:
346
+
- To hold [<template>] element contents
347
+
- As results of fragment parsing (innerHTML)
348
+
- For efficient batch DOM operations
349
+
350
+
@see <https://dom.spec.whatwg.org/#documentfragment>
351
+
DOM Standard: DocumentFragment *)
133
352
134
353
val text_name : string
135
-
(** ["#text"] - name for text nodes *)
354
+
(** ["#text"] - name for text nodes.
355
+
356
+
Text nodes contain the character data within elements. When the
357
+
parser encounters text between tags like [<p>Hello world</p>],
358
+
it creates a text node with data ["Hello world"] as a child of
359
+
the [<p>] element.
360
+
361
+
Adjacent text nodes are automatically merged by the parser. *)
136
362
137
363
val comment_name : string
138
-
(** ["#comment"] - name for comment nodes *)
364
+
(** ["#comment"] - name for comment nodes.
365
+
366
+
Comment nodes represent HTML comments: [<!-- comment text -->].
367
+
Comments are preserved in the DOM but not rendered to users.
368
+
They're useful for development notes or conditional content. *)
139
369
140
370
val doctype_name : string
141
-
(** ["!doctype"] - name for doctype nodes *)
371
+
(** ["!doctype"] - name for doctype nodes.
372
+
373
+
The DOCTYPE node represents the [<!DOCTYPE html>] declaration.
374
+
If present, it is a child of the Document node and precedes the [<html>] element; only comments may come before it.
375
+
376
+
@see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
377
+
WHATWG: The DOCTYPE *)
142
378
143
379
(** {1 Constructors}
144
380
145
381
Functions to create new DOM nodes. All nodes start with no parent and
146
-
no children.
382
+
no children. Use {!append_child} or {!insert_before} to build a tree.
147
383
*)
148
384
149
385
val create_element : string -> ?namespace:string option ->
150
386
?attrs:(string * string) list -> unit -> node
151
387
(** Create an element node.
152
388
153
-
@param name The tag name (e.g., "div", "p", "span")
154
-
@param namespace Element namespace: [None] for HTML, [Some "svg"], [Some "mathml"]
155
-
@param attrs Initial attributes as (name, value) pairs
389
+
Elements are the primary building blocks of HTML documents. Each
390
+
element represents a component of the document with semantic meaning.
391
+
392
+
@param name The tag name (e.g., "div", "p", "span"). Tag names are
393
+
case-insensitive in HTML; by convention, use lowercase.
394
+
@param namespace Element namespace:
395
+
- [None] (default): HTML namespace for standard elements
396
+
- [Some "svg"]: SVG namespace for graphics elements
397
+
- [Some "mathml"]: MathML namespace for mathematical notation
398
+
@param attrs Initial attributes as [(name, value)] pairs
156
399
400
+
{b Examples:}
157
401
{[
402
+
(* Simple HTML element *)
158
403
let div = create_element "div" ()
159
-
let svg = create_element "rect" ~namespace:(Some "svg") ()
160
-
let link = create_element "a" ~attrs:[("href", "/")] ()
404
+
405
+
(* Element with attributes *)
406
+
let link = create_element "a"
407
+
~attrs:[("href", "https://example.com"); ("class", "external")]
408
+
()
409
+
410
+
(* SVG element *)
411
+
let rect = create_element "rect"
412
+
~namespace:(Some "svg")
413
+
~attrs:[("width", "100"); ("height", "50"); ("fill", "blue")]
414
+
()
161
415
]}
416
+
417
+
@see <https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom>
418
+
WHATWG: Elements in the DOM
162
419
*)
163
420
164
421
val create_text : string -> node
165
422
(** Create a text node with the given content.
166
423
424
+
Text nodes contain the readable content of HTML documents. They
425
+
appear as children of elements and represent the characters that
426
+
users see.
427
+
428
+
{b Note:} Text content is stored as-is. Character references like
429
+
[&amp;] should already be decoded to their character values (here, [&]).
430
+
431
+
{b Example:}
167
432
{[
168
433
let text = create_text "Hello, world!"
434
+
(* To put text in a paragraph: *)
435
+
let p = create_element "p" ()
436
+
let () = append_child p text
169
437
]}
170
438
*)
171
439
172
440
val create_comment : string -> node
173
441
(** Create a comment node with the given content.
174
442
175
-
The content should not include the comment delimiters.
443
+
Comments are human-readable notes in HTML that don't appear in
444
+
the rendered output. They're written as [<!-- comment -->] in HTML.
176
445
446
+
@param data The comment text (without the [<!--] and [-->] delimiters)
447
+
448
+
{b Example:}
177
449
{[
178
-
let comment = create_comment " This is a comment "
179
-
(* Represents: <!-- This is a comment --> *)
450
+
let comment = create_comment " TODO: Add navigation "
451
+
(* Represents: <!-- TODO: Add navigation --> *)
180
452
]}
453
+
454
+
@see <https://html.spec.whatwg.org/multipage/syntax.html#comments>
455
+
WHATWG: HTML comments
181
456
*)
182
457
183
458
val create_document : unit -> node
184
459
(** Create an empty document node.
185
460
186
-
Document nodes are the root of a complete HTML document tree.
461
+
The Document node is the root of an HTML document tree. It represents
462
+
the entire document and serves as the parent for the DOCTYPE (if any)
463
+
and the root [<html>] element.
464
+
465
+
In a complete HTML document, the structure is:
466
+
{v
467
+
#document
468
+
├── !doctype
469
+
└── html
470
+
├── head
471
+
└── body
472
+
v}
473
+
474
+
@see <https://html.spec.whatwg.org/multipage/dom.html#document>
475
+
WHATWG: The Document object
187
476
*)
188
477
189
478
val create_document_fragment : unit -> node
190
479
(** Create an empty document fragment.
191
480
192
-
Document fragments are lightweight containers used for:
193
-
- Template contents
194
-
- Fragment parsing results
195
-
- Efficient batch DOM operations
481
+
Document fragments are lightweight containers that can hold multiple
482
+
nodes without being part of the main document tree. They're useful for:
483
+
484
+
- {b Template contents:} The [<template>] element stores its children
485
+
in a document fragment, keeping them inert until cloned
486
+
487
+
- {b Fragment parsing:} When parsing HTML fragments (like innerHTML),
488
+
the result is placed in a document fragment
489
+
490
+
- {b Batch operations:} Build a subtree in a fragment, then insert it
491
+
into the document in one operation for better performance
492
+
493
+
@see <https://dom.spec.whatwg.org/#documentfragment>
494
+
DOM Standard: DocumentFragment
196
495
*)
197
496
198
497
val create_doctype : ?name:string -> ?public_id:string ->
199
498
?system_id:string -> unit -> node
200
499
(** Create a DOCTYPE node.
201
500
202
-
For HTML5, use [create_doctype ~name:"html" ()] which produces
203
-
[<!DOCTYPE html>].
501
+
The DOCTYPE declaration tells browsers to use standards mode for
502
+
rendering. For HTML5 documents, use:
503
+
504
+
{[
505
+
let doctype = create_doctype ~name:"html" ()
506
+
(* Represents: <!DOCTYPE html> *)
507
+
]}
508
+
509
+
@param name DOCTYPE name (usually ["html"] for HTML documents)
510
+
@param public_id Public identifier (legacy, rarely needed)
511
+
@param system_id System identifier (legacy, rarely needed)
512
+
513
+
{b Legacy example:}
514
+
{[
515
+
(* HTML 4.01 Strict DOCTYPE - not recommended for new documents *)
516
+
let legacy = create_doctype
517
+
~name:"HTML"
518
+
~public_id:"-//W3C//DTD HTML 4.01//EN"
519
+
~system_id:"http://www.w3.org/TR/html4/strict.dtd"
520
+
()
521
+
]}
204
522
205
-
@param name DOCTYPE name (usually "html")
206
-
@param public_id Public identifier (legacy)
207
-
@param system_id System identifier (legacy)
523
+
@see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
524
+
WHATWG: The DOCTYPE
208
525
*)
209
526
210
527
val create_template : ?namespace:string option ->
211
528
?attrs:(string * string) list -> unit -> node
212
529
(** Create a [<template>] element with its content document fragment.
213
530
214
-
Template elements have special semantics: their children are not rendered
215
-
directly but stored in a separate document fragment accessible via
216
-
[template_content].
531
+
The [<template>] element holds inert HTML content that is not
532
+
rendered directly. The content is stored in a separate document
533
+
fragment and can be:
534
+
- Cloned and inserted into the document via JavaScript
535
+
- Used as a stamping template for repeated content
536
+
- Pre-parsed without affecting the page
537
+
538
+
{b How templates work:}
539
+
540
+
Unlike normal elements, a [<template>]'s children are not rendered.
541
+
Instead, they're stored in the [template_content] field. This means:
542
+
- Images inside won't load
543
+
- Scripts inside won't execute
544
+
- The content is "inert" until explicitly activated
545
+
546
+
{b Example:}
547
+
{[
548
+
let template = create_template () in
549
+
let div = create_element "div" () in
550
+
let text = create_text "Template content" in
551
+
append_child div text;
552
+
(* Add to template's content fragment, not children *)
553
+
match template.template_content with
554
+
| Some fragment -> append_child fragment div
555
+
| None -> ()
556
+
]}
217
557
218
558
@see <https://html.spec.whatwg.org/multipage/scripting.html#the-template-element>
219
-
The HTML5 template element specification
559
+
WHATWG: The template element
220
560
*)
221
561
222
562
(** {1 Node Type Predicates}
223
563
224
-
Functions to test what type of node you have.
564
+
Functions to test what type of node you have. Since all nodes use the
565
+
same record type, these predicates check the [name] field to determine
566
+
the actual node type.
225
567
*)
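Name-based predicates of this kind can be sketched as plain string comparisons. The `node` type below is a hypothetical one-field stand-in, and the bodies are an assumption about how such checks might look, not the module's code; the constant values are the ones documented above.

```ocaml
(* Hypothetical stand-in: only the [name] field matters here. *)
type node = { name : string }

let document_name = "#document"
let document_fragment_name = "#document-fragment"
let text_name = "#text"
let comment_name = "#comment"
let doctype_name = "!doctype"

let special_names =
  [ document_name; document_fragment_name; text_name; comment_name;
    doctype_name ]

let is_text node = node.name = text_name
let is_comment node = node.name = comment_name
let is_document node = node.name = document_name
let is_document_fragment node = node.name = document_fragment_name
let is_doctype node = node.name = doctype_name

(* An element is whatever is left over once the special names are ruled out. *)
let is_element node = not (List.mem node.name special_names)

let () =
  List.iter
    (fun name -> Printf.printf "%-20s element=%b\n" name (is_element { name }))
    [ "div"; "#text"; "!doctype" ]
```

One consequence of this design is that an element whose tag name collides with a special name cannot be represented, which is harmless in practice because `#` and `!` never begin a parsed tag name.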
226
568
227
569
val is_element : node -> bool
228
570
(** [is_element node] returns [true] if the node is an element node.
229
571
230
-
Elements are nodes with HTML tags like [<div>], [<p>], etc.
572
+
Elements are HTML tags like [<div>], [<p>], [<a>]. They are
573
+
identified by having a tag name that doesn't match any of the
574
+
special node name constants.
231
575
*)
232
576
233
577
val is_text : node -> bool
234
-
(** [is_text node] returns [true] if the node is a text node. *)
578
+
(** [is_text node] returns [true] if the node is a text node.
579
+
580
+
Text nodes contain the character content within elements.
581
+
They have [name = "#text"]. *)
235
582
236
583
val is_comment : node -> bool
237
-
(** [is_comment node] returns [true] if the node is a comment node. *)
584
+
(** [is_comment node] returns [true] if the node is a comment node.
585
+
586
+
Comment nodes represent HTML comments [<!-- ... -->].
587
+
They have [name = "#comment"]. *)
238
588
239
589
val is_document : node -> bool
240
-
(** [is_document node] returns [true] if the node is a document node. *)
590
+
(** [is_document node] returns [true] if the node is a document node.
591
+
592
+
The document node is the root of the DOM tree.
593
+
It has [name = "#document"]. *)
241
594
242
595
val is_document_fragment : node -> bool
243
-
(** [is_document_fragment node] returns [true] if the node is a document fragment. *)
596
+
(** [is_document_fragment node] returns [true] if the node is a document fragment.
597
+
598
+
Document fragments are lightweight containers.
599
+
They have [name = "#document-fragment"]. *)
244
600
245
601
val is_doctype : node -> bool
246
-
(** [is_doctype node] returns [true] if the node is a DOCTYPE node. *)
602
+
(** [is_doctype node] returns [true] if the node is a DOCTYPE node.
603
+
604
+
DOCTYPE nodes represent the [<!DOCTYPE>] declaration.
605
+
They have [name = "!doctype"]. *)
247
606
248
607
val has_children : node -> bool
249
-
(** [has_children node] returns [true] if the node has any children. *)
608
+
(** [has_children node] returns [true] if the node has any children.
609
+
610
+
Note: For [<template>] elements, this checks the direct children list,
611
+
not the template content fragment. *)
250
612
251
613
(** {1 Tree Manipulation}
252
614
253
615
Functions to modify the DOM tree structure. These functions automatically
254
-
maintain parent/child references.
616
+
maintain parent/child references, ensuring the tree remains consistent.
255
617
*)
256
618
257
619
val append_child : node -> node -> unit
258
620
(** [append_child parent child] adds [child] as the last child of [parent].
259
621
260
622
The child's parent reference is updated to point to [parent].
623
+
If the child already has a parent, it is first removed from that parent.
624
+
625
+
{b Example:}
626
+
{[
627
+
let body = create_element "body" () in
628
+
let p = create_element "p" () in
629
+
let text = create_text "Hello!" in
630
+
append_child p text;
631
+
append_child body p
632
+
(* Result:
633
+
body
634
+
└── p
635
+
└── #text "Hello!"
636
+
*)
637
+
]}
261
638
*)
262
639
263
640
val insert_before : node -> node -> node -> unit
264
641
(** [insert_before parent new_child ref_child] inserts [new_child] before
265
642
[ref_child] in [parent]'s children.
266
643
267
-
@raise Not_found if [ref_child] is not a child of [parent]
644
+
@param parent The parent node
645
+
@param new_child The node to insert
646
+
@param ref_child The existing child to insert before
647
+
648
+
@raise Not_found if [ref_child] is not a child of [parent].
649
+
650
+
{b Example:}
651
+
{[
652
+
let ul = create_element "ul" () in
653
+
let li1 = create_element "li" () in
654
+
let li3 = create_element "li" () in
655
+
append_child ul li1;
656
+
append_child ul li3;
657
+
let li2 = create_element "li" () in
658
+
insert_before ul li2 li3
659
+
(* Result: ul contains li1, li2, li3 in that order *)
660
+
]}
268
661
*)
269
662
270
663
val remove_child : node -> node -> unit
271
664
(** [remove_child parent child] removes [child] from [parent]'s children.
272
665
273
666
The child's parent reference is set to [None].
667
+
668
+
Raises [Not_found] if [child] is not a child of [parent].
274
669
*)
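The reference bookkeeping described for `append_child` and `remove_child` can be sketched as follows. This is an illustration under assumptions, not the module's code: the record is a reduced stand-in, `make` and `names` are invented helpers, and children are matched by physical identity.

```ocaml
(* Hypothetical stand-in for the node record. *)
type node = {
  name : string;
  mutable children : node list;
  mutable parent : node option;
}

let make name = { name; children = []; parent = None }

(* Detach [child] from [parent]; Not_found if it is not one of its children.
   Physical equality identifies the node itself, not a look-alike. *)
let remove_child parent child =
  if not (List.memq child parent.children) then raise Not_found;
  parent.children <- List.filter (fun c -> c != child) parent.children;
  child.parent <- None

(* Move [child] to the end of [parent]'s children, detaching it from any
   previous parent first so it never appears in two child lists. *)
let append_child parent child =
  (match child.parent with
   | Some old_parent -> remove_child old_parent child
   | None -> ());
  parent.children <- parent.children @ [ child ];
  child.parent <- Some parent

let names node = List.map (fun c -> c.name) node.children

let first = make "ul"
let second = make "ol"
let item = make "li"

let () =
  append_child first item;
  append_child second item (* reparents: item leaves [first] *)
```

After the two calls, `names first` is empty and `names second` is `["li"]`, which is the single-parent invariant the section describes.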
275
670
276
671
val insert_text_at : node -> string -> node option -> unit
277
672
(** [insert_text_at parent text before_node] inserts text content.
278
673
279
674
If [before_node] is [None], appends at the end. If the previous sibling
280
-
is a text node, the text is merged into it. Otherwise, a new text node
281
-
is created.
675
+
is a text node, the text is merged into it (text nodes are coalesced).
676
+
Otherwise, a new text node is created.
282
677
283
678
This implements the HTML5 parser's text insertion algorithm which
284
-
coalesces adjacent text nodes.
679
+
ensures adjacent text nodes are always merged, matching browser behavior.
680
+
681
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#appropriate-place-for-inserting-a-node>
682
+
WHATWG: Appropriate place for inserting a node
285
683
*)
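The coalescing rule can be sketched like this. It is a simplified illustration: the record is a reduced stand-in, `make` and `split_at` are invented helpers, and raising `Not_found` when `before_node` is not a child is an assumption, since the source does not say what happens in that case.

```ocaml
(* Hypothetical stand-in for the node record. *)
type node = {
  name : string;
  mutable data : string;
  mutable children : node list;
  mutable parent : node option;
}

let make ?(data = "") name = { name; data; children = []; parent = None }
let is_text n = n.name = "#text"

(* Split a child list into the nodes before [before] and the rest.
   Assumption: Not_found when [before] is not in the list. *)
let rec split_at before = function
  | [] -> raise Not_found
  | c :: rest when c == before -> ([], c :: rest)
  | c :: rest ->
    let prefix, suffix = split_at before rest in
    (c :: prefix, suffix)

let insert_text_at parent text before_node =
  let prefix, suffix =
    match before_node with
    | None -> (parent.children, [])
    | Some b -> split_at b parent.children
  in
  match List.rev prefix with
  | last :: _ when is_text last ->
    (* Previous sibling is text: merge instead of adding a node. *)
    last.data <- last.data ^ text
  | _ ->
    let t = make ~data:text "#text" in
    t.parent <- Some parent;
    parent.children <- prefix @ (t :: suffix)

let p = make "p"
let b = make "b"

let () =
  insert_text_at p "Hello" None;
  insert_text_at p ", world" None; (* merged into the first text node *)
  b.parent <- Some p;
  p.children <- p.children @ [ b ];
  insert_text_at p "!" None (* new node: the previous sibling is <b> *)
```

At this point `p` has three children: one text node holding `"Hello, world"`, the `b` element, and a second text node holding `"!"`.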
286
684
287
685
(** {1 Attribute Operations}
288
686
289
-
Functions to read and modify element attributes.
687
+
Functions to read and modify element attributes. Attributes are
688
+
name-value pairs that provide additional information about elements.
689
+
690
+
In HTML5, attribute names are case-insensitive and normalized to
691
+
lowercase by the parser.
692
+
693
+
@see <https://html.spec.whatwg.org/multipage/dom.html#attributes>
694
+
WHATWG: Attributes
290
695
*)
291
696
292
697
val get_attr : node -> string -> string option
293
-
(** [get_attr node name] returns the value of attribute [name], or [None]. *)
698
+
(** [get_attr node name] returns the value of attribute [name], or [None]
699
+
if the attribute doesn't exist.
700
+
701
+
Attribute lookup is case-sensitive on the stored (lowercase) names.
702
+
*)
294
703
295
704
val set_attr : node -> string -> string -> unit
296
705
(** [set_attr node name value] sets attribute [name] to [value].
297
706
298
707
If the attribute already exists, it is replaced.
708
+
If it doesn't exist, it is added.
299
709
*)
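Since attributes are stored as an association list, these operations can be sketched with the standard `List` functions. The record below is a reduced stand-in, and keeping the original position when replacing a value is an assumption made for the example, not documented behavior.

```ocaml
(* Hypothetical stand-in: an element with only a name and attributes. *)
type node = { name : string; mutable attrs : (string * string) list }

let get_attr node name = List.assoc_opt name node.attrs

let has_attr node name = List.mem_assoc name node.attrs

(* Replace the value in place when the attribute exists (assumed to keep
   its position); otherwise append a new pair at the end. *)
let set_attr node name value =
  if has_attr node name then
    node.attrs <-
      List.map (fun (k, v) -> if k = name then (k, value) else (k, v)) node.attrs
  else node.attrs <- node.attrs @ [ (name, value) ]

let link = { name = "a"; attrs = [ ("href", "/") ] }

let () =
  set_attr link "class" "external";
  set_attr link "href" "/home"
```

After these calls `link.attrs` is `[("href", "/home"); ("class", "external")]`, and `get_attr link "id"` is `None`.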
300
710
301
711
val has_attr : node -> string -> bool
···
310
720
(** [descendants node] returns all descendant nodes in document order.
311
721
312
722
This performs a depth-first traversal, returning children before
313
-
siblings at each level.
723
+
siblings at each level. The node itself is not included.
724
+
725
+
{b Document order} is the order nodes appear in the HTML source:
726
+
parent before children, earlier siblings before later ones.
727
+
728
+
{b Example:}
729
+
{[
730
+
(* For tree: div > (p > "hello", span > "world") *)
731
+
descendants div
732
+
(* Returns: [p; text("hello"); span; text("world")] *)
733
+
]}
314
734
*)
315
735
316
736
val ancestors : node -> node list
317
737
(** [ancestors node] returns all ancestor nodes from parent to root.
318
738
319
-
The first element is the immediate parent, the last is the root.
739
+
The first element is the immediate parent, the last is the root
740
+
(usually the Document node).
741
+
742
+
{b Example:}
743
+
{[
744
+
(* For a text node inside: html > body > p > text *)
745
+
ancestors text_node
746
+
(* Returns: [p; body; html; #document] *)
747
+
]}
320
748
*)
321
749
322
750
val get_text_content : node -> string
323
751
(** [get_text_content node] returns the concatenated text content.
324
752
325
-
For text nodes, returns the text data. For elements, recursively
326
-
concatenates all descendant text content.
753
+
For text nodes, returns the text data directly.
754
+
For elements, recursively concatenates all descendant text content.
755
+
For other node types, returns an empty string.
756
+
757
+
{b Example:}
758
+
{[
759
+
(* For: <p>Hello <b>world</b>!</p> *)
760
+
get_text_content p_element
761
+
(* Returns: "Hello world!" *)
762
+
]}
327
763
*)
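The traversal order and text concatenation described above can be sketched on a toy tree. The variant type and both functions are assumptions made for illustration; they are not the library's definitions. `List.concat_map` needs OCaml 4.10 or later.

```ocaml
(* Toy node type for illustration only. *)
type node = Text of string | Comment of string | Element of string * node list

(* Document order: each child, then that child's own descendants,
   before moving to the next sibling. The node itself is excluded. *)
let rec descendants = function
  | Element (_, children) ->
      List.concat_map (fun c -> c :: descendants c) children
  | Text _ | Comment _ -> []

(* Text nodes yield their data, elements concatenate their children,
   anything else contributes the empty string. *)
let rec text_content = function
  | Text s -> s
  | Element (_, children) -> String.concat "" (List.map text_content children)
  | Comment _ -> ""

(* <p>Hello <b>world</b>!</p> *)
let p = Element ("p", [ Text "Hello "; Element ("b", [ Text "world" ]); Text "!" ])
```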
328
764
329
765
(** {1 Cloning} *)
···
333
769
334
770
@param deep If [true], recursively clone all descendants (default: [false])
335
771
336
-
The cloned node has no parent. Attribute lists are copied by reference
337
-
(the list itself is new, but attribute strings are shared).
772
+
The cloned node has no parent. With [deep:false], only the node itself
773
+
is copied (with its attributes, but not its children).
774
+
775
+
{b Example:}
776
+
{[
777
+
let original = create_element "div" ~attrs:[("class", "box")] () in
778
+
let shallow = clone original in
779
+
let deep = clone ~deep:true original
780
+
]}
338
781
*)
+666
-107
lib/html5rw/html5rw.mli
···
5
5
6
6
(** Html5rw - Pure OCaml HTML5 Parser
7
7
8
-
This module provides a complete HTML5 parsing solution following the
9
-
WHATWG specification. It uses bytesrw for streaming input/output.
8
+
This library provides a complete HTML5 parsing solution that implements the
9
+
{{:https://html.spec.whatwg.org/multipage/parsing.html} WHATWG HTML5
10
+
parsing specification}. It can parse any HTML document - well-formed or not -
11
+
and produce a DOM (Document Object Model) tree that matches browser behavior.
12
+
13
+
{2 What is HTML?}
14
+
15
+
HTML (HyperText Markup Language) is the standard markup language for creating
16
+
web pages. An HTML document consists of nested {i elements} that describe
17
+
the structure and content of the page:
18
+
19
+
{v
20
+
<!DOCTYPE html>
21
+
<html>
22
+
<head>
23
+
<title>My Page</title>
24
+
</head>
25
+
<body>
26
+
<h1>Welcome</h1>
27
+
<p>Hello, <b>world</b>!</p>
28
+
</body>
29
+
</html>
30
+
v}
31
+
32
+
Each element is written with a {i start tag} (like [<p>]), content, and an
33
+
{i end tag} (like [</p>]). Elements can have {i attributes} that provide
34
+
additional information: [<a href="https://example.com">].
35
+
36
+
@see <https://html.spec.whatwg.org/multipage/introduction.html>
37
+
WHATWG: Introduction to HTML
38
+
39
+
{2 The DOM}
40
+
41
+
When this parser processes HTML, it doesn't just store the text. Instead,
42
+
it builds a tree structure called the DOM (Document Object Model). Each
43
+
element, text fragment, and comment becomes a {i node} in this tree:
44
+
45
+
{v
46
+
Document
47
+
└── html
48
+
├── head
49
+
│ └── title
50
+
│ └── #text "My Page"
51
+
└── body
52
+
├── h1
53
+
│ └── #text "Welcome"
54
+
└── p
55
+
├── #text "Hello, "
56
+
├── b
57
+
│ └── #text "world"
58
+
└── #text "!"
59
+
v}
60
+
61
+
This tree can be traversed, searched, and modified. The {!Dom} module
62
+
provides types and functions for working with DOM nodes.
63
+
64
+
@see <https://html.spec.whatwg.org/multipage/dom.html>
65
+
WHATWG: The elements of HTML (DOM chapter)
10
66
11
67
{2 Quick Start}
12
68
13
-
Parse HTML from a reader:
69
+
Parse HTML from a string:
14
70
{[
15
71
open Bytesrw
16
72
let reader = Bytes.Reader.of_string "<p>Hello, world!</p>" in
···
32
88
let result = Html5rw.parse reader in
33
89
let divs = Html5rw.query result "div.content"
34
90
]}
91
+
92
+
{2 Error Handling}
93
+
94
+
Unlike many parsers, HTML5 parsing {b never fails}. The WHATWG specification
95
+
defines error recovery rules for every possible malformed input, ensuring
96
+
all HTML documents produce a valid DOM tree (just as browsers do).
97
+
98
+
For example, parsing [<p>Hello<p>World] produces two paragraphs, not an
99
+
error, because [<p>] implicitly closes the previous [<p>].
100
+
101
+
If you need to detect malformed HTML (e.g., for validation), enable error
102
+
collection with [~collect_errors:true]. Errors are advisory - the parsing
103
+
still succeeds.
104
+
105
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
106
+
WHATWG: Parse errors
107
+
108
+
{2 HTML vs XHTML}
109
+
110
+
This parser implements {b HTML5 parsing}, not XHTML parsing. Key differences:
111
+
112
+
- Tag and attribute names are case-insensitive ([<DIV>] equals [<div>])
113
+
- Some end tags are optional ([<p>Hello] is valid)
114
+
- Void elements have no end tag ([<br>], not [<br></br>]); a trailing slash as in [<br/>] is accepted but ignored
115
+
- Boolean attributes need no value ([<input disabled>])
116
+
117
+
XHTML uses stricter XML rules. If you need XHTML parsing, use an XML parser.
118
+
119
+
@see <https://html.spec.whatwg.org/multipage/syntax.html>
120
+
WHATWG: The HTML syntax
35
121
*)
36
122
37
123
(** {1 Sub-modules} *)
38
124
39
-
(** DOM types and manipulation functions *)
125
+
(** DOM types and manipulation functions.
126
+
127
+
This module provides the core types for representing HTML documents as
128
+
DOM trees. It includes:
129
+
- The {!Dom.node} type representing all kinds of DOM nodes
130
+
- Functions to create, modify, and traverse nodes
131
+
- Serialization functions to convert DOM back to HTML
132
+
133
+
@see <https://html.spec.whatwg.org/multipage/dom.html>
134
+
WHATWG: The elements of HTML *)
40
135
module Dom = Html5rw_dom
41
136
42
-
(** HTML5 tokenizer *)
137
+
(** HTML5 tokenizer.
138
+
139
+
The tokenizer is the first stage of HTML5 parsing. It converts a stream
140
+
of characters into a stream of {i tokens}: start tags, end tags, text,
141
+
comments, and DOCTYPEs.
142
+
143
+
Most users don't need to use the tokenizer directly - the {!parse}
144
+
function handles everything. The tokenizer is exposed for advanced use
145
+
cases like syntax highlighting or partial parsing.
146
+
147
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#tokenization>
148
+
WHATWG: Tokenization *)
43
149
module Tokenizer = Html5rw_tokenizer
44
150
45
-
(** Encoding detection and decoding *)
151
+
(** Encoding detection and decoding.
152
+
153
+
HTML documents can use various character encodings (UTF-8, ISO-8859-1,
154
+
etc.). This module implements the WHATWG encoding sniffing algorithm
155
+
that browsers use to detect the encoding of a document:
156
+
157
+
1. Check for a BOM (Byte Order Mark)
158
+
2. Look for a [<meta charset>] declaration
159
+
3. Use HTTP Content-Type header hint (if available)
160
+
4. Fall back to UTF-8
161
+
162
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
163
+
WHATWG: Determining the character encoding
164
+
@see <https://encoding.spec.whatwg.org/>
165
+
WHATWG Encoding Standard *)
46
166
module Encoding = Html5rw_encoding
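As an illustration of the first sniffing step, here is a minimal BOM check. It is a sketch only: the `encoding` type below is a local three-constructor stand-in and `detect_bom` is a hypothetical name, not a function of the `Encoding` module.

```ocaml
(* Local stand-in type; only the encodings that have a BOM. *)
type encoding = Utf8 | Utf16le | Utf16be

(* Return the encoding announced by a leading byte order mark, if any. *)
let detect_bom (s : string) : encoding option =
  let starts p =
    String.length s >= String.length p
    && String.sub s 0 (String.length p) = p
  in
  if starts "\xEF\xBB\xBF" then Some Utf8
  else if starts "\xFF\xFE" then Some Utf16le
  else if starts "\xFE\xFF" then Some Utf16be
  else None
```

When no BOM is present the function returns `None`, and the later steps listed above would take over.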
47
167
48
-
(** CSS selector engine *)
168
+
(** CSS selector engine.
169
+
170
+
This module provides CSS selector support for querying the DOM tree.
171
+
CSS selectors are patterns used to select HTML elements based on their
172
+
tag names, attributes, classes, IDs, and position in the document.
173
+
174
+
Example selectors:
175
+
- [div] - all [<div>] elements
176
+
- [#header] - element with [id="header"]
177
+
- [.warning] - elements with [class="warning"]
178
+
- [div > p] - [<p>] elements that are direct children of [<div>]
179
+
- [[href]] - elements with an [href] attribute
180
+
181
+
@see <https://www.w3.org/TR/selectors-4/>
182
+
W3C Selectors Level 4 specification *)
49
183
module Selector = Html5rw_selector
50
184
51
-
(** HTML entity decoding *)
185
+
(** HTML entity decoding.
186
+
187
+
HTML uses {i character references} to represent characters that are
188
+
hard to type or have special meaning:
189
+
190
+
- Named references: [&amp;] (ampersand), [&lt;] (less than), [&nbsp;] (non-breaking space)
191
+
- Decimal references: [&#60;] (less than as decimal 60)
192
+
- Hexadecimal references: [&#x3C;] (less than as hex 3C)
193
+
194
+
This module decodes all 2,231 named character references defined in
195
+
the WHATWG specification, plus numeric references.
196
+
197
+
@see <https://html.spec.whatwg.org/multipage/named-characters.html>
198
+
WHATWG: Named character references *)
52
199
module Entities = Html5rw_entities
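To make the numeric forms concrete, the sketch below extracts the code point from a complete decimal or hexadecimal reference. `code_point_of_reference` is a hypothetical helper, not part of the `Entities` module, and it skips the fix-ups the specification applies to null and out-of-range values. `String.for_all` needs OCaml 4.13 or later.

```ocaml
let is_dec c = c >= '0' && c <= '9'
let is_hex c = is_dec c || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')

(* Accepts a full reference such as "&#60;" or "&#x3C;" and returns
   its code point; anything else gives None. *)
let code_point_of_reference (s : string) : int option =
  let n = String.length s in
  if n < 4 || s.[0] <> '&' || s.[1] <> '#' || s.[n - 1] <> ';' then None
  else
    let body = String.sub s 2 (n - 3) in
    let hex = body.[0] = 'x' || body.[0] = 'X' in
    let digits =
      if hex then String.sub body 1 (String.length body - 1) else body
    in
    let valid =
      digits <> "" && String.for_all (if hex then is_hex else is_dec) digits
    in
    if valid then int_of_string_opt ((if hex then "0x" else "") ^ digits)
    else None
```

Both spellings of the less-than sign resolve to the same code point, while a named reference such as `&lt;` is rejected because it needs the lookup table instead.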
53
200
54
-
(** Low-level parser access *)
201
+
(** Low-level parser access.
202
+
203
+
This module exposes the internals of the HTML5 parser for advanced use.
204
+
Most users should use the top-level {!parse} function instead.
205
+
206
+
The parser exposes:
207
+
- Insertion modes for the tree construction algorithm
208
+
- The tree builder state machine
209
+
- Lower-level parsing functions
210
+
211
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#tree-construction>
212
+
WHATWG: Tree construction *)
55
213
module Parser = Html5rw_parser
56
214
57
215
(** {1 Core Types} *)
58
216
59
-
(** DOM node type. See {!Dom} for manipulation functions. *)
217
+
(** DOM node type.
218
+
219
+
A node represents one part of an HTML document. Nodes form a tree
220
+
structure with parent/child relationships. There are several kinds:
221
+
222
+
- {b Element nodes}: HTML tags like [<div>], [<p>], [<a>]
223
+
- {b Text nodes}: Text content within elements
224
+
- {b Comment nodes}: HTML comments [<!-- ... -->]
225
+
- {b Document nodes}: The root of a document tree
226
+
- {b Document fragment nodes}: Lightweight containers
227
+
- {b Doctype nodes}: The [<!DOCTYPE html>] declaration
228
+
229
+
See {!Dom} for manipulation functions.
230
+
231
+
@see <https://html.spec.whatwg.org/multipage/dom.html>
232
+
WHATWG: The DOM *)
60
233
type node = Dom.node
61
234
62
-
(** Doctype information *)
235
+
(** DOCTYPE information.
236
+
237
+
The DOCTYPE declaration ([<!DOCTYPE html>]) appears at the start of HTML
238
+
documents. It tells browsers to use standards mode for rendering.
239
+
240
+
In HTML5, the DOCTYPE is minimal - just [<!DOCTYPE html>] with no public
241
+
or system identifiers. Legacy DOCTYPEs may have additional fields.
242
+
243
+
@see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
244
+
WHATWG: The DOCTYPE *)
63
245
type doctype_data = Dom.doctype_data = {
64
246
name : string option;
247
+
(** DOCTYPE name, typically ["html"] *)
248
+
65
249
public_id : string option;
250
+
(** Public identifier for legacy DOCTYPEs (e.g., XHTML, HTML4) *)
251
+
66
252
system_id : string option;
253
+
(** System identifier (URL) for legacy DOCTYPEs *)
67
254
}
68
255
69
-
(** Quirks mode as determined during parsing *)
256
+
(** Quirks mode as determined during parsing.
257
+
258
+
{i Quirks mode} controls how browsers render CSS and compute layouts.
259
+
It exists for backwards compatibility with old web pages that relied
260
+
on browser bugs.
261
+
262
+
- {b No_quirks}: Standards mode. The document is rendered according to
263
+
modern HTML5 and CSS specifications. Triggered by [<!DOCTYPE html>].
264
+
265
+
- {b Quirks}: Full quirks mode. The browser emulates bugs from older
266
+
browsers (primarily IE5). Triggered by missing or malformed DOCTYPEs.
267
+
Affects CSS box model, table layout, font inheritance, and more.
268
+
269
+
- {b Limited_quirks}: Almost standards mode. Only a few specific quirks
270
+
are applied, mainly affecting table cell vertical alignment.
271
+
272
+
{b Recommendation:} Always use [<!DOCTYPE html>] to ensure standards mode.
273
+
274
+
@see <https://quirks.spec.whatwg.org/>
275
+
Quirks Mode Standard
276
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#the-initial-insertion-mode>
277
+
WHATWG: How quirks mode is determined *)
70
278
type quirks_mode = Dom.quirks_mode = No_quirks | Quirks | Limited_quirks
71
279
72
-
(** Character encoding detected or specified *)
280
+
(** Character encoding detected or specified.
281
+
282
+
HTML documents are sequences of bytes that must be decoded into characters.
283
+
Different encodings interpret the same bytes differently. For example:
284
+
285
+
- UTF-8: The modern standard, supporting all Unicode characters
286
+
- Windows-1252: Common on older Western European web pages
287
+
- ISO-8859-2: Used for Central European languages
288
+
- UTF-16: Used by some Windows applications
289
+
290
+
The parser detects encoding automatically when using {!parse_bytes}.
291
+
The detected encoding is available via {!val-encoding}.
292
+
293
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
294
+
WHATWG: Determining the character encoding
295
+
@see <https://encoding.spec.whatwg.org/>
296
+
WHATWG Encoding Standard *)
73
297
type encoding = Encoding.encoding =
74
298
| Utf8
299
+
(** UTF-8: The dominant encoding for the web, supporting all Unicode *)
300
+
75
301
| Utf16le
302
+
(** UTF-16 Little-Endian: 16-bit encoding, used by Windows *)
303
+
76
304
| Utf16be
305
+
(** UTF-16 Big-Endian: 16-bit encoding, network byte order *)
306
+
77
307
| Windows_1252
308
+
(** Windows-1252 (CP-1252): Western European, superset of ISO-8859-1 *)
309
+
78
310
| Iso_8859_2
311
+
(** ISO-8859-2: Central European (Polish, Czech, Hungarian, etc.) *)
312
+
79
313
| Euc_jp
314
+
(** EUC-JP: Extended Unix Code for Japanese *)
80
315
81
316
(** A parse error encountered during HTML5 parsing.
82
317
83
-
HTML5 parsing never fails - the specification defines error recovery
84
-
for all malformed input. However, conformance checkers can report
85
-
these errors. Enable error collection with [~collect_errors:true].
318
+
HTML5 parsing {b never fails} - the specification defines error recovery
319
+
for all malformed input. However, conformance checkers can report these
320
+
errors. Enable error collection with [~collect_errors:true] if you want
321
+
to detect malformed HTML.
322
+
323
+
{b Common parse errors:}
324
+
325
+
- ["unexpected-null-character"]: Null byte in the input
326
+
- ["eof-before-tag-name"]: File ended while reading a tag
327
+
- ["unexpected-character-in-attribute-name"]: Invalid attribute syntax
328
+
- ["missing-doctype"]: Document started without [<!DOCTYPE>]
329
+
- ["duplicate-attribute"]: Same attribute appears twice on an element
330
+
331
+
The full list of parse error codes is defined in the WHATWG specification.
86
332
87
333
@see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
88
-
WHATWG parse error definitions
89
-
*)
334
+
WHATWG: Complete list of parse errors *)
90
335
type parse_error = Parser.parse_error
91
336
92
-
(** Get the error code (e.g., "unexpected-null-character"). *)
337
+
(** Get the error code string.
338
+
339
+
Error codes are lowercase with hyphens, matching the WHATWG specification
340
+
names. Examples: ["unexpected-null-character"], ["eof-in-tag"],
341
+
["missing-end-tag-name"].
342
+
343
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
344
+
WHATWG: Parse error codes *)
93
345
val error_code : parse_error -> string
94
346
95
-
(** Get the line number where the error occurred (1-indexed). *)
347
+
(** Get the line number where the error occurred (1-indexed).
348
+
349
+
Line numbers count from 1 and increment at each newline character. *)
96
350
val error_line : parse_error -> int
97
351
98
-
(** Get the column number where the error occurred (1-indexed). *)
352
+
(** Get the column number where the error occurred (1-indexed).
353
+
354
+
Column numbers count from 1 and reset at each newline. *)
99
355
val error_column : parse_error -> int
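The 1-indexed counting rules for lines and columns can be sketched as follows. `line_and_column` is a hypothetical helper that counts bytes; whether the library counts bytes or decoded characters is not stated here, so treat that as an assumption.

```ocaml
(* 1-indexed (line, column) of byte offset [pos] in [s]: the line
   increments at each newline and the column resets to 1 after it. *)
let line_and_column (s : string) (pos : int) : int * int =
  let line = ref 1 and col = ref 1 in
  for i = 0 to min pos (String.length s) - 1 do
    if s.[i] = '\n' then begin
      incr line;
      col := 1
    end
    else incr col
  done;
  (!line, !col)
```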
100
356
101
357
(** Context element for HTML fragment parsing (innerHTML).
102
358
103
-
When parsing HTML fragments, you must specify what element would
104
-
contain the fragment. This affects how certain elements are handled.
359
+
When parsing HTML fragments (like the [innerHTML] of an element), you
360
+
must specify what element would contain the fragment. This affects how
361
+
the parser handles certain elements.
362
+
363
+
{b Why context matters:}
364
+
365
+
HTML parsing rules depend on where content appears. For example:
366
+
- [<td>] is kept inside [<tr>] but dropped as a parse error inside [<div>]
367
+
- [<li>] is expected inside [<ul>] or [<ol>]; elsewhere it is still inserted, but no list element is implied
368
+
- Content inside [<table>] has special parsing rules
369
+
370
+
{b Example:}
371
+
{[
372
+
(* Parse as if content were inside a <ul> *)
373
+
let ctx = make_fragment_context ~tag_name:"ul" () in
374
+
let result = parse ~fragment_context:ctx reader
375
+
(* Now <li> elements are parsed correctly *)
376
+
]}
105
377
106
378
@see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments>
107
-
The fragment parsing algorithm
108
-
*)
379
+
WHATWG: The fragment parsing algorithm *)
109
380
type fragment_context = Parser.fragment_context
110
381
111
382
(** Create a fragment parsing context.
112
383
113
-
@param tag_name Tag name of the context element (e.g., "div", "tr")
114
-
@param namespace Namespace: [None] for HTML, [Some "svg"], [Some "mathml"]
384
+
The context element determines how the parser interprets the fragment.
385
+
Choose a context that matches where the fragment would be inserted.
115
386
387
+
@param tag_name Tag name of the context element (e.g., ["div"], ["tr"],
388
+
["ul"]). This is the element that would contain the fragment.
389
+
@param namespace Namespace of the context element:
390
+
- [None] (default): HTML namespace
391
+
- [Some "svg"]: SVG namespace
392
+
- [Some "mathml"]: MathML namespace
393
+
394
+
{b Examples:}
116
395
{[
117
-
(* Parse as innerHTML of a <ul> *)
118
-
let ctx = Html5rw.make_fragment_context ~tag_name:"ul" ()
396
+
(* Parse as innerHTML of a <div> (most common case) *)
397
+
let ctx = make_fragment_context ~tag_name:"div" ()
398
+
399
+
(* Parse as innerHTML of a <ul> - <li> elements work correctly *)
400
+
let ctx = make_fragment_context ~tag_name:"ul" ()
119
401
120
402
(* Parse as innerHTML of an SVG <g> element *)
121
-
let ctx = Html5rw.make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") ()
403
+
let ctx = make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") ()
404
+
405
+
(* Parse as innerHTML of a <table> - table-specific rules apply *)
406
+
let ctx = make_fragment_context ~tag_name:"table" ()
122
407
]}
123
-
*)
408
+
409
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments>
410
+
WHATWG: Fragment parsing algorithm *)
124
411
val make_fragment_context : tag_name:string -> ?namespace:string option ->
125
412
unit -> fragment_context
126
413
···
132
419
133
420
(** Result of parsing an HTML document.
134
421
135
-
Contains the parsed DOM tree, any errors encountered, and the
136
-
detected encoding (when parsing from bytes).
422
+
This record contains everything produced by parsing:
423
+
- The DOM tree (accessible via {!val-root})
424
+
- Any parse errors (accessible via {!val-errors})
425
+
- The detected encoding (accessible via {!val-encoding})
137
426
*)
138
427
type t = {
139
428
root : node;
429
+
(** Root node of the parsed document tree.
430
+
431
+
For full document parsing, this is a Document node containing the
432
+
DOCTYPE (if any) and [<html>] element.
433
+
434
+
For fragment parsing, this is a Document Fragment containing the
435
+
parsed elements. *)
436
+
140
437
errors : parse_error list;
438
+
(** Parse errors encountered during parsing.
439
+
440
+
This list is empty unless [~collect_errors:true] was passed to the
441
+
parse function. Errors are in the order they were encountered.
442
+
443
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
444
+
WHATWG: Parse errors *)
445
+
141
446
encoding : encoding option;
447
+
(** Character encoding detected during parsing.
448
+
449
+
This is [Some encoding] when using {!parse_bytes} with automatic
450
+
encoding detection, and [None] when using {!parse} (which expects
451
+
pre-decoded UTF-8 input). *)
142
452
}
143
453
144
454
(** {1 Parsing Functions} *)
145
455
146
456
(** Parse HTML from a [Bytes.Reader.t].
147
457
148
-
This is the primary parsing function. Create a reader from any source:
149
-
- [Bytes.Reader.of_string s] for strings
150
-
- [Bytes.Reader.of_in_channel ic] for files
151
-
- [Bytes.Reader.of_bytes b] for byte buffers
458
+
This is the primary parsing function. It reads bytes from the provided
459
+
reader and returns a DOM tree. The input should be valid UTF-8.
152
460
461
+
{b Creating readers:}
153
462
{[
154
463
open Bytesrw
155
-
let reader = Bytes.Reader.of_string "<html><body>Hello</body></html>" in
464
+
465
+
(* From a string *)
466
+
let reader = Bytes.Reader.of_string html_string
467
+
468
+
(* From a file *)
469
+
let ic = open_in "page.html" in
470
+
let reader = Bytes.Reader.of_in_channel ic
471
+
472
+
(* From a buffer *)
473
+
let reader = Bytes.Reader.of_bytes buf
474
+
]}
475
+
476
+
{b Parsing a complete document:}
477
+
{[
156
478
let result = Html5rw.parse reader
479
+
let doc = Html5rw.root result
157
480
]}
158
481
159
-
@param collect_errors If true, collect parse errors (default: false)
160
-
@param fragment_context Context element for fragment parsing
161
-
*)
162
-
val parse : ?collect_errors:bool -> ?fragment_context:fragment_context -> Bytesrw.Bytes.Reader.t -> t
482
+
{b Parsing a fragment:}
483
+
{[
484
+
let ctx = Html5rw.make_fragment_context ~tag_name:"div" () in
485
+
let result = Html5rw.parse ~fragment_context:ctx reader
486
+
]}
487
+
488
+
@param collect_errors If [true], collect parse errors. Default: [false].
489
+
Error collection has some performance overhead.
490
+
@param fragment_context Context element for fragment parsing. If provided,
491
+
the input is parsed as a fragment (like innerHTML) rather than
492
+
a complete document.
493
+
494
+
@see <https://html.spec.whatwg.org/multipage/parsing.html>
495
+
WHATWG: HTML parsing algorithm *)
496
+
val parse : ?collect_errors:bool -> ?fragment_context:fragment_context ->
497
+
Bytesrw.Bytes.Reader.t -> t
163
498
164
499
(** Parse raw bytes with automatic encoding detection.
165
500
166
-
This function implements the WHATWG encoding sniffing algorithm:
167
-
1. Check for BOM (Byte Order Mark)
168
-
2. Prescan for <meta charset>
169
-
3. Fall back to UTF-8
501
+
This function is useful when you have raw bytes and don't know the
502
+
character encoding. It implements the WHATWG encoding sniffing algorithm:
503
+
504
+
1. {b BOM detection}: Check for UTF-8, UTF-16LE, or UTF-16BE BOM
505
+
2. {b Prescan}: Look for [<meta charset="...">] in the first 1024 bytes
506
+
3. {b Transport hint}: Use the provided [transport_encoding] if any
507
+
4. {b Fallback}: Use UTF-8 (the modern web default)
508
+
509
+
The detected encoding is stored in the result's [encoding] field.
510
+
511
+
{b Example:}
512
+
{[
513
+
let bytes = Bytes.of_string (really_input_string ic (in_channel_length ic)) in
514
+
let result = Html5rw.parse_bytes bytes in
515
+
match Html5rw.encoding result with
516
+
| Some Utf8 -> print_endline "UTF-8 detected"
517
+
| Some Windows_1252 -> print_endline "Windows-1252 detected"
518
+
| _ -> ()
519
+
]}
170
520
171
-
@param collect_errors If true, collect parse errors (default: false)
172
-
@param transport_encoding Encoding from HTTP Content-Type header
173
-
@param fragment_context Context element for fragment parsing
174
-
*)
175
-
val parse_bytes : ?collect_errors:bool -> ?transport_encoding:string -> ?fragment_context:fragment_context -> bytes -> t
521
+
@param collect_errors If [true], collect parse errors. Default: [false].
522
+
@param transport_encoding Encoding hint from HTTP Content-Type header.
523
+
For example, if the server sends [Content-Type: text/html; charset=utf-8],
524
+
pass [~transport_encoding:"utf-8"].
525
+
@param fragment_context Context element for fragment parsing.
526
+
527
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
528
+
WHATWG: Determining the character encoding *)
529
+
val parse_bytes : ?collect_errors:bool -> ?transport_encoding:string ->
530
+
?fragment_context:fragment_context -> bytes -> t
176
531
177
532
(** {1 Querying} *)
178
533
179
534
(** Query the DOM tree with a CSS selector.
180
535
181
-
Supported selectors:
182
-
- Tag: [div], [p], [span]
183
-
- ID: [#myid]
184
-
- Class: [.myclass]
185
-
- Universal: [*]
186
-
- Attribute: [[attr]], [[attr="value"]], [[attr~="value"]], [[attr|="value"]]
187
-
- Pseudo-classes: [:first-child], [:last-child], [:nth-child(n)]
188
-
- Combinators: descendant (space), child (>), adjacent sibling (+), general sibling (~)
536
+
CSS selectors are patterns used to select elements in HTML documents.
537
+
This function returns all nodes matching the selector, in document order.
538
+
539
+
{b Supported selectors:}
540
+
541
+
{i Type selectors:}
542
+
- [div], [p], [span] - elements by tag name
543
+
544
+
{i Class and ID selectors:}
545
+
- [#myid] - element with [id="myid"]
546
+
- [.myclass] - elements with class containing "myclass"
547
+
548
+
{i Attribute selectors:}
549
+
- [[attr]] - elements with the [attr] attribute
550
+
- [[attr="value"]] - attribute equals value
551
+
- [[attr~="value"]] - attribute contains word
552
+
- [[attr|="value"]] - attribute starts with value or value-
553
+
- [[attr^="value"]] - attribute starts with value
554
+
- [[attr$="value"]] - attribute ends with value
555
+
- [[attr*="value"]] - attribute contains value
556
+
557
+
{i Pseudo-classes:}
558
+
- [:first-child], [:last-child] - first/last child of parent
559
+
- [:nth-child(n)] - nth child (1-indexed)
560
+
- [:only-child] - only child of parent
561
+
- [:empty] - elements with no children
562
+
- [:not(selector)] - elements not matching selector
189
563
564
+
{i Combinators:}
565
+
- [A B] - B descendants of A (any depth)
566
+
- [A > B] - B direct children of A
567
+
- [A + B] - B immediately after A (adjacent sibling)
568
+
- [A ~ B] - B after A (general sibling)
569
+
570
+
{i Universal:}
571
+
- [*] - all elements
572
+
573
+
{b Examples:}
190
574
{[
191
-
let divs = Html5rw.query result "div.content > p"
575
+
(* All paragraphs *)
576
+
let ps = query result "p"
577
+
578
+
(* Elements with class "warning" inside a div *)
579
+
let warnings = query result "div .warning"
580
+
581
+
(* Direct children of nav that are links *)
582
+
let nav_links = query result "nav > a"
583
+
584
+
(* Complex selector *)
585
+
let items = query result "ul.menu > li:first-child a[href]"
192
586
]}
193
587
194
-
@raise Selector.Selector_error if the selector is invalid
195
-
*)
588
+
@raise Selector.Selector_error if the selector syntax is invalid
589
+
590
+
@see <https://www.w3.org/TR/selectors-4/>
591
+
W3C: Selectors Level 4 *)
196
592
val query : t -> string -> node list
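The attribute selector operators listed above differ only in how they compare the expected value with the actual one. The sketch below spells that out; the `op` type and `attr_matches` are hypothetical names rather than the selector engine's API, and word splitting is simplified to spaces only. `String.starts_with` and `String.ends_with` need OCaml 4.13 or later.

```ocaml
(* One constructor per operator: [attr], =, ~=, |=, ^=, $=, *= *)
type op = Exists | Equals | Word | Dash | Prefix | Suffix | Substring

(* Naive substring search; fine for short attribute values. *)
let contains ~sub s =
  let n = String.length s and m = String.length sub in
  let rec go i = i + m <= n && (String.sub s i m = sub || go (i + 1)) in
  go 0

(* [actual] is the attribute value, or None if the attribute is absent. *)
let attr_matches (op : op) (expected : string) (actual : string option) : bool =
  match actual with
  | None -> false
  | Some v -> (
      match op with
      | Exists -> true
      | Equals -> v = expected
      | Word -> expected <> "" && List.mem expected (String.split_on_char ' ' v)
      | Dash -> v = expected || String.starts_with ~prefix:(expected ^ "-") v
      | Prefix -> expected <> "" && String.starts_with ~prefix:expected v
      | Suffix -> expected <> "" && String.ends_with ~suffix:expected v
      | Substring -> expected <> "" && contains ~sub:expected v)
```

The empty-string guards reflect the Selectors rule that `^=`, `$=`, and `*=` with an empty value match nothing.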
197
593
198
-
(** Check if a node matches a CSS selector. *)
594
+
(** Check if a node matches a CSS selector.
595
+
596
+
This is useful for filtering nodes or implementing custom traversals.
597
+
598
+
{b Example:}
599
+
{[
600
+
let is_external_link node =
601
+
matches node "a[href^='http']"
602
+
]}
603
+
604
+
@raise Selector.Selector_error if the selector syntax is invalid *)
199
605
val matches : node -> string -> bool
200
606
201
607
(** {1 Serialization} *)
202
608
203
609
(** Write the DOM tree to a [Bytes.Writer.t].
204
610
611
+
This serializes the DOM back to HTML. The output is valid HTML5 that
612
+
can be parsed to produce an equivalent DOM tree.
613
+
614
+
{b Example:}
205
615
{[
206
616
open Bytesrw
207
617
let buf = Buffer.create 1024 in
···
211
621
let html = Buffer.contents buf
212
622
]}
213
623
214
-
@param pretty If true, format with indentation (default: true)
215
-
@param indent_size Number of spaces per indent level (default: 2)
216
-
*)
217
-
val to_writer : ?pretty:bool -> ?indent_size:int -> t -> Bytesrw.Bytes.Writer.t -> unit
624
+
@param pretty If [true] (default), add indentation for readability.
625
+
If [false], output compact HTML with no added whitespace.
626
+
@param indent_size Spaces per indentation level (default: 2).
627
+
Only used when [pretty] is [true].
628
+
629
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments>
630
+
WHATWG: Serialising HTML fragments *)
631
+
val to_writer : ?pretty:bool -> ?indent_size:int -> t ->
632
+
Bytesrw.Bytes.Writer.t -> unit
218
633
219
634
(** Serialize the DOM tree to a string.
220
635
221
-
Convenience function when the output fits in memory.
636
+
Convenience function that serializes to a string instead of a writer.
637
+
Use {!to_writer} for large documents to avoid building the whole output in memory.
222
638
223
-
@param pretty If true, format with indentation (default: true)
224
-
@param indent_size Number of spaces per indent level (default: 2)
225
-
*)
639
+
@param pretty If [true] (default), add indentation for readability.
640
+
@param indent_size Spaces per indentation level (default: 2). *)
226
641
val to_string : ?pretty:bool -> ?indent_size:int -> t -> string
227
642
228
643
(** Extract text content from the DOM tree.
229
644
230
-
@param separator String to insert between text nodes (default: " ")
231
-
@param strip If true, trim whitespace (default: true)
232
-
*)
645
+
This concatenates all text nodes in the document, producing a string
646
+
with just the readable text (no HTML tags).
647
+
648
+
{b Example:}
649
+
{[
650
+
(* For document: <div><p>Hello</p><p>World</p></div> *)
651
+
let text = to_text result
652
+
(* Returns: "Hello World" *)
653
+
]}
654
+
655
+
@param separator String to insert between text nodes (default: [" "])
656
+
@param strip If [true] (default), trim leading/trailing whitespace *)
233
657
val to_text : ?separator:string -> ?strip:bool -> t -> string
234
658
235
-
(** Serialize to html5lib test format (for testing). *)
659
+
(** Serialize to html5lib test format.
660
+
661
+
This produces the tree format used by the
662
+
{{:https://github.com/html5lib/html5lib-tests} html5lib-tests} suite.
663
+
Mainly useful for testing the parser against the reference tests. *)
236
664
val to_test_format : t -> string
237
665
238
666
(** {1 Result Accessors} *)
239
667
240
-
(** Get the root node of the parsed document. *)
668
+
(** Get the root node of the parsed document.
669
+
670
+
For full document parsing, this returns a Document node. The structure is:
671
+
{v
672
+
#document
673
+
├── !doctype (if present)
674
+
└── html
675
+
├── head
676
+
└── body
677
+
v}
678
+
679
+
For fragment parsing, this returns a Document Fragment node containing
680
+
the parsed elements directly. *)
241
681
val root : t -> node
242
682
243
-
(** Get parse errors (if error collection was enabled). *)
683
+
(** Get parse errors (if error collection was enabled).
684
+
685
+
Returns an empty list if [~collect_errors:true] was not passed to the
686
+
parse function, or if the document was well-formed.
687
+
688
+
Errors are returned in the order they were encountered during parsing.
689
+
690
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
691
+
WHATWG: Parse errors *)
244
692
val errors : t -> parse_error list
245
693
246
-
(** Get the detected encoding (if parsed from bytes). *)
694
+
(** Get the detected encoding (if parsed from bytes).
695
+
696
+
Returns [Some encoding] when {!parse_bytes} was used, indicating which
697
+
encoding was detected or specified. Returns [None] when {!parse} was
698
+
used, since it expects pre-decoded UTF-8 input.
699
+
700
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
701
+
WHATWG: Determining the character encoding *)
247
702
val encoding : t -> encoding option
248
703
249
704
(** {1 DOM Utilities}
250
705
251
-
Common DOM operations are available directly. For the full API,
252
-
see the {!Dom} module.
706
+
Common DOM operations are available directly on this module. For the
707
+
full API including more advanced operations, see the {!Dom} module.
708
+
709
+
@see <https://html.spec.whatwg.org/multipage/dom.html>
710
+
WHATWG: The elements of HTML
253
711
*)
254
712
255
713
(** Create an element node.
256
-
@param namespace None for HTML, Some "svg" or Some "mathml" for foreign content
257
-
@param attrs List of (name, value) attribute pairs
258
-
*)
259
-
val create_element : string -> ?namespace:string option -> ?attrs:(string * string) list -> unit -> node
714
+
715
+
Elements are the building blocks of HTML documents. They represent tags
716
+
like [<div>], [<p>], [<a>], etc.
717
+
718
+
@param name Tag name (e.g., ["div"], ["p"], ["span"])
719
+
@param namespace Element namespace:
720
+
- [None] (default): HTML namespace
721
+
- [Some "svg"]: SVG namespace for graphics
722
+
- [Some "mathml"]: MathML namespace for math notation
723
+
@param attrs Initial attributes as [(name, value)] pairs
724
+
725
+
{b Example:}
726
+
{[
727
+
(* Simple element *)
728
+
let div = create_element "div" ()
260
729
261
-
(** Create a text node. *)
730
+
(* Element with attributes *)
731
+
let link = create_element "a"
732
+
~attrs:[("href", "/about"); ("class", "nav-link")]
733
+
()
734
+
]}
735
+
736
+
@see <https://html.spec.whatwg.org/multipage/dom.html#elements-in-the-dom>
737
+
WHATWG: Elements in the DOM *)
738
+
val create_element : string -> ?namespace:string option ->
739
+
?attrs:(string * string) list -> unit -> node
740
+
741
+
(** Create a text node.
742
+
743
+
Text nodes contain the readable text content of HTML documents.
744
+
745
+
{b Example:}
746
+
{[
747
+
let text = create_text "Hello, world!"
748
+
]} *)
262
749
val create_text : string -> node
263
750
264
-
(** Create a comment node. *)
751
+
(** Create a comment node.
752
+
753
+
Comments are preserved in the DOM but not rendered. They're written
754
+
as [<!-- text -->] in HTML.
755
+
756
+
@see <https://html.spec.whatwg.org/multipage/syntax.html#comments>
757
+
WHATWG: Comments *)
265
758
val create_comment : string -> node
266
759
267
-
(** Create an empty document node. *)
760
+
(** Create an empty document node.
761
+
762
+
The Document node is the root of an HTML document tree.
763
+
764
+
@see <https://html.spec.whatwg.org/multipage/dom.html#document>
765
+
WHATWG: The Document object *)
268
766
val create_document : unit -> node
269
767
270
-
(** Create a document fragment node. *)
768
+
(** Create a document fragment node.
769
+
770
+
Document fragments are lightweight containers for holding nodes
771
+
without a parent document. Used for template contents and fragment
772
+
parsing results.
773
+
774
+
@see <https://dom.spec.whatwg.org/#documentfragment>
775
+
DOM Standard: DocumentFragment *)
271
776
val create_document_fragment : unit -> node
272
777
273
-
(** Create a doctype node. *)
274
-
val create_doctype : ?name:string -> ?public_id:string -> ?system_id:string -> unit -> node
778
+
(** Create a doctype node.
275
779
276
-
(** Append a child node to a parent. *)
780
+
For HTML5 documents, use [create_doctype ~name:"html" ()].
781
+
782
+
@param name DOCTYPE name (usually ["html"])
783
+
@param public_id Public identifier (legacy)
784
+
@param system_id System identifier (legacy)
785
+
786
+
@see <https://html.spec.whatwg.org/multipage/syntax.html#the-doctype>
787
+
WHATWG: The DOCTYPE *)
788
+
val create_doctype : ?name:string -> ?public_id:string ->
789
+
?system_id:string -> unit -> node
790
+
791
+
(** Append a child node to a parent.
792
+
793
+
The child is added as the last child of the parent. If the child
794
+
already has a parent, it is first removed from that parent. *)
277
795
val append_child : node -> node -> unit
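The reparenting rule for [append_child] (detach from the old parent, then append) can be sketched as follows. The record type and the [make] and [detach] helpers are hypothetical and exist only for this illustration.

```ocaml
(* Hypothetical mutable node; the real record has more fields. *)
type node = {
  name : string;
  mutable parent : node option;
  mutable children : node list;
}

let make name = { name; parent = None; children = [] }

(* Remove [child] from its current parent, if any. Physical equality (!=)
   is used because the structure is cyclic (child -> parent -> child). *)
let detach child =
  match child.parent with
  | None -> ()
  | Some p ->
    p.children <- List.filter (fun c -> c != child) p.children;
    child.parent <- None

(* Append [child] as the last child of [parent], moving it if needed. *)
let append_child parent child =
  detach child;
  parent.children <- parent.children @ [ child ];
  child.parent <- Some parent
```

Appending the same node to a second parent therefore moves it rather than duplicating it.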
278
796
279
-
(** Insert a node before a reference node. *)
797
+
(** Insert a node before a reference node.
798
+
799
+
@param parent The parent node
800
+
@param new_child The node to insert
801
+
@param ref_child The existing child to insert before
802
+
803
+
@raise Not_found if [ref_child] is not a child of [parent]. *)
280
804
val insert_before : node -> node -> node -> unit
281
805
282
-
(** Remove a child node from its parent. *)
806
+
(** Remove a child node from its parent.
807
+
808
+
@raise Not_found if [child] is not a child of [parent]. *)
283
809
val remove_child : node -> node -> unit
284
810
285
-
(** Get an attribute value. *)
811
+
(** Get an attribute value.
812
+
813
+
Returns [Some value] if the attribute exists, [None] otherwise.
814
+
Name matching is case-sensitive. The parser lowercases attribute names on HTML elements, so look them up in lowercase.
815
+
816
+
@see <https://html.spec.whatwg.org/multipage/dom.html#attributes>
817
+
WHATWG: Attributes *)
286
818
val get_attr : node -> string -> string option
287
819
288
-
(** Set an attribute value. *)
820
+
(** Set an attribute value.
821
+
822
+
If the attribute exists, it is replaced. If not, it is added. *)
289
823
val set_attr : node -> string -> string -> unit
290
824
291
825
(** Check if a node has an attribute. *)
292
826
val has_attr : node -> string -> bool
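The get/set/has semantics above can be illustrated with a plain association list. This is only a sketch: storing attributes as an ordered [(name, value)] list is an assumption made for the example, and the functions take the list directly instead of a node.

```ocaml
(* Hypothetical attribute store: an association list in insertion order. *)
type attrs = (string * string) list

(* Case-sensitive lookup. *)
let get_attr (attrs : attrs) name = List.assoc_opt name attrs

let has_attr (attrs : attrs) name = List.mem_assoc name attrs

(* Replace the value in place if [name] exists, otherwise append the pair. *)
let set_attr (attrs : attrs) name value : attrs =
  if has_attr attrs name then
    List.map (fun (k, v) -> if k = name then (k, value) else (k, v)) attrs
  else attrs @ [ (name, value) ]
```

Because the lookup is case-sensitive, asking for ["HREF"] on an attribute stored as ["href"] returns [None].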
293
827
294
-
(** Get all descendant nodes. *)
828
+
(** Get all descendant nodes in document order.
829
+
830
+
Returns all nodes below this node in the tree, in the order they
831
+
appear in the HTML source (depth-first). *)
295
832
val descendants : node -> node list
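Document order corresponds to a pre-order, depth-first walk. A minimal sketch on a hypothetical immutable node type:

```ocaml
(* Toy node type for illustration only. *)
type node = { name : string; children : node list }

(* Pre-order traversal: each child is listed before its own descendants,
   and before its following siblings. The node itself is not included. *)
let rec descendants node =
  List.concat_map (fun child -> child :: descendants child) node.children

let leaf name = { name; children = [] }

(* <div><p><b></b></p><span></span></div> *)
let tree =
  { name = "div";
    children = [ { name = "p"; children = [ leaf "b" ] }; leaf "span" ] }
```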
296
833
297
-
(** Get all ancestor nodes (from parent to root). *)
834
+
(** Get all ancestor nodes from parent to root.
835
+
836
+
Returns the chain of parent nodes, starting with the immediate parent
837
+
and ending with the root of the tree (the Document node for a fully parsed document). *)
298
838
val ancestors : node -> node list
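Walking the parent chain can be sketched like this; the record with a mutable [parent] field is a hypothetical stand-in for the real node type.

```ocaml
(* Toy node with only a name and a parent link. *)
type node = { name : string; mutable parent : node option }

(* Follow parent links until a node without a parent is reached.
   The result starts with the immediate parent and ends with the root. *)
let rec ancestors node =
  match node.parent with
  | None -> []
  | Some p -> p :: ancestors p

let document = { name = "#document"; parent = None }
let html = { name = "html"; parent = Some document }
let body = { name = "body"; parent = Some html }
```

A detached node has no parent, so its ancestor list is empty.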
299
839
300
-
(** Get text content of a node and its descendants. *)
840
+
(** Get text content of a node and its descendants.
841
+
842
+
For text nodes, returns the text directly. For elements, recursively
843
+
concatenates all descendant text content. *)
301
844
val get_text_content : node -> string
302
845
303
846
(** Clone a node.
304
-
@param deep If true, also clone descendants (default: false)
305
-
*)
847
+
848
+
@param deep If [true], recursively clone all descendants.
849
+
If [false] (default), only clone the node itself. *)
306
850
val clone : ?deep:bool -> node -> node
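The difference between a shallow and a deep clone can be sketched as below. The node type is hypothetical, and the sketch assumes that a shallow clone keeps the name and attributes but starts with no children.

```ocaml
(* Toy node type for illustration only. *)
type node = {
  name : string;
  mutable attrs : (string * string) list;
  mutable children : node list;
}

(* Shallow: copy name and attributes, no children.
   Deep: additionally clone every descendant recursively. *)
let rec clone ?(deep = false) node =
  { name = node.name;
    attrs = node.attrs;
    children =
      (if deep then List.map (clone ~deep:true) node.children else []) }

let original =
  { name = "ul";
    attrs = [ ("class", "nav") ];
    children = [ { name = "li"; attrs = []; children = [] } ] }
```

In the deep case the cloned children are fresh records, so mutating them leaves [original] untouched.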
307
851
308
-
(** {1 Node Predicates} *)
852
+
(** {1 Node Predicates}
309
853
310
-
(** Test if a node is an element. *)
854
+
Functions to test what type of node you have.
855
+
*)
856
+
857
+
(** Test if a node is an element.
858
+
859
+
Elements are HTML tags like [<div>], [<p>], [<a>]. *)
311
860
val is_element : node -> bool
312
861
313
-
(** Test if a node is a text node. *)
862
+
(** Test if a node is a text node.
863
+
864
+
Text nodes contain character content within elements. *)
314
865
val is_text : node -> bool
315
866
316
-
(** Test if a node is a comment node. *)
867
+
(** Test if a node is a comment node.
868
+
869
+
Comment nodes represent HTML comments [<!-- ... -->]. *)
317
870
val is_comment : node -> bool
318
871
319
-
(** Test if a node is a document node. *)
872
+
(** Test if a node is a document node.
873
+
874
+
The document node is the root of a complete HTML document tree. *)
320
875
val is_document : node -> bool
321
876
322
-
(** Test if a node is a document fragment. *)
877
+
(** Test if a node is a document fragment.
878
+
879
+
Document fragments are lightweight containers for nodes. *)
323
880
val is_document_fragment : node -> bool
324
881
325
-
(** Test if a node is a doctype node. *)
882
+
(** Test if a node is a doctype node.
883
+
884
+
Doctype nodes represent the [<!DOCTYPE>] declaration. *)
326
885
val is_doctype : node -> bool
327
886
328
887
(** Test if a node has children. *)
+431
-93
lib/parser/html5rw_parser.mli
···
3
3
SPDX-License-Identifier: MIT
4
4
---------------------------------------------------------------------------*)
5
5
6
-
(** HTML5 Parser
6
+
(** HTML5 Parser - Low-Level API
7
7
8
8
This module provides the core HTML5 parsing functionality implementing
9
-
the WHATWG parsing specification. It handles tokenization, tree construction,
9
+
the {{:https://html.spec.whatwg.org/multipage/parsing.html} WHATWG
10
+
HTML5 parsing specification}. It handles tokenization, tree construction,
10
11
error recovery, and produces a DOM tree.
11
12
12
-
For most uses, prefer the top-level {!Html5rw} module which re-exports
13
-
these functions with a simpler interface.
13
+
For most uses, prefer the top-level {!Html5rw} module which provides
14
+
a simpler interface. This module is for advanced use cases that need
15
+
access to parser internals.
16
+
17
+
{2 How HTML5 Parsing Works}
18
+
19
+
The HTML5 parsing algorithm is unusual compared to most parsers. It was
20
+
reverse-engineered from browser behavior rather than designed from a
21
+
formal grammar. This ensures the parser handles malformed HTML exactly
22
+
like web browsers do.
23
+
24
+
The algorithm has three main phases:
25
+
26
+
{3 1. Encoding Detection}
27
+
28
+
Before parsing begins, the character encoding must be determined. The
29
+
WHATWG specification defines a "sniffing" algorithm; this parser checks, in order:
30
+
31
+
1. Check for a BOM (Byte Order Mark) at the start
32
+
2. Look for [<meta charset="...">] in the first 1024 bytes
33
+
3. Use HTTP Content-Type header hint if available
34
+
4. Fall back to UTF-8
35
+
36
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
37
+
WHATWG: Determining the character encoding
38
+
39
+
{3 2. Tokenization}
40
+
41
+
The tokenizer converts the input stream into a sequence of tokens.
42
+
It implements a state machine with over 80 states to handle:
43
+
44
+
- Data (text content)
45
+
- Tags (start tags, end tags, self-closing tags)
46
+
- Comments
47
+
- DOCTYPEs
48
+
- Character references ([&amp;], [&lt;], [&#60;])
49
+
- CDATA sections (in SVG/MathML)
50
+
51
+
The tokenizer has special handling for:
52
+
- {b Raw text elements}: [<script>], [<style>] - no markup parsing inside
53
+
- {b Escapable raw text elements}: [<textarea>], [<title>] - limited parsing
54
+
- {b RCDATA}: Content where only character references are parsed
55
+
56
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#tokenization>
57
+
WHATWG: Tokenization
58
+
59
+
{3 3. Tree Construction}
60
+
61
+
The tree builder receives tokens from the tokenizer and builds the DOM
62
+
tree. It uses {i insertion modes} - a state machine that determines how
63
+
each token should be processed based on the current document context.
64
+
65
+
{b Insertion modes} include:
66
+
- [initial]: Before the DOCTYPE
67
+
- [before_html]: Before the [<html>] element
68
+
- [before_head]: Before the [<head>] element
69
+
- [in_head]: Inside [<head>]
70
+
- [in_body]: Inside [<body>] (the most complex mode)
71
+
- [in_table]: Inside [<table>] (special handling)
72
+
- [in_template]: Inside [<template>]
73
+
- And many more...
74
+
75
+
The tree builder maintains:
76
+
- {b Stack of open elements}: Elements that have been opened but not closed
77
+
- {b List of active formatting elements}: For handling nested formatting
78
+
- {b The template insertion mode stack}: For [<template>] elements
79
+
80
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#tree-construction>
81
+
WHATWG: Tree construction
82
+
83
+
{2 Error Recovery}
84
+
85
+
A key feature of HTML5 parsing is that it {b never fails}. The specification
86
+
defines error recovery for every possible malformed input. For example:
87
+
88
+
- Missing end tags are implicitly closed
89
+
- Misnested tags are handled via the "adoption agency algorithm"
90
+
- Invalid characters are replaced with U+FFFD
91
+
- Unexpected elements are either ignored or moved to valid positions
14
92
15
-
{2 Parsing Algorithm}
93
+
This ensures every HTML document produces a valid DOM tree.
16
94
17
-
The HTML5 parsing algorithm is defined by the WHATWG specification and
18
-
consists of several phases:
95
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
96
+
WHATWG: Parse errors
19
97
20
-
1. {b Encoding sniffing}: Detect character encoding from BOM, meta tags,
21
-
or transport layer hints
22
-
2. {b Tokenization}: Convert the input stream into a sequence of tokens
23
-
(start tags, end tags, character data, comments, etc.)
24
-
3. {b Tree construction}: Build the DOM tree using a state machine with
25
-
multiple insertion modes
98
+
{2 The Adoption Agency Algorithm}
99
+
100
+
One of the most complex parts of HTML5 parsing is handling misnested
101
+
formatting elements. For example:
102
+
103
+
{v <p>Hello <b>world</p> <p>more</b> text</p> v}
26
104
27
-
The algorithm includes extensive error recovery to handle malformed HTML
28
-
in a consistent way across browsers.
105
+
Browsers don't just error out - they use the "adoption agency algorithm"
106
+
to produce sensible results. This algorithm:
107
+
1. Identifies formatting elements that span across other elements
108
+
2. Reconstructs the tree to properly nest elements
109
+
3. Moves nodes between parents as needed
29
110
30
-
@see <https://html.spec.whatwg.org/multipage/parsing.html>
31
-
The WHATWG HTML Parsing specification
111
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#adoption-agency-algorithm>
112
+
WHATWG: The adoption agency algorithm
32
113
*)
33
114
34
115
(** {1 Sub-modules} *)
35
116
117
+
(** DOM types and manipulation. *)
36
118
module Dom = Html5rw_dom
119
+
120
+
(** HTML5 tokenizer.
121
+
122
+
The tokenizer implements the tokenization stage of HTML5 parsing, converting
123
+
the decoded input stream into a sequence of tokens (start tags, end tags,
124
+
text, comments, DOCTYPEs).
125
+
126
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#tokenization>
127
+
WHATWG: Tokenization *)
37
128
module Tokenizer = Html5rw_tokenizer
129
+
130
+
(** Character encoding detection and conversion.
131
+
132
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
133
+
WHATWG: Determining the character encoding *)
38
134
module Encoding = Html5rw_encoding
135
+
136
+
(** HTML element constants and categories.
137
+
138
+
This module provides lists of element names that have special handling
139
+
in the HTML5 parser:
140
+
141
+
- {b Void elements}: Elements that cannot have children and have no end
142
+
tag ([area], [base], [br], [col], [embed], [hr], [img], [input],
143
+
[link], [meta], [source], [track], [wbr])
144
+
145
+
- {b Formatting elements}: Elements tracked in the list of active
146
+
formatting elements for the adoption agency algorithm ([a], [b], [big],
147
+
[code], [em], [font], [i], [nobr], [s], [small], [strike], [strong],
148
+
[tt], [u])
149
+
150
+
- {b Special elements}: Elements with special parsing rules that affect
151
+
scope and formatting reconstruction
152
+
153
+
@see <https://html.spec.whatwg.org/multipage/syntax.html#void-elements>
154
+
WHATWG: Void elements
155
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#formatting>
156
+
WHATWG: Formatting elements *)
39
157
module Constants : sig
40
158
val void_elements : string list
159
+
(** Elements that cannot have children: [area], [base], [br], [col],
160
+
[embed], [hr], [img], [input], [link], [meta], [source], [track], [wbr].
161
+
162
+
@see <https://html.spec.whatwg.org/multipage/syntax.html#void-elements>
163
+
WHATWG: Void elements *)
164
+
41
165
val formatting_elements : string list
166
+
(** Elements tracked for the adoption agency algorithm: [a], [b], [big],
167
+
[code], [em], [font], [i], [nobr], [s], [small], [strike], [strong],
168
+
[tt], [u].
169
+
170
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#formatting>
171
+
WHATWG: Formatting elements *)
172
+
42
173
val special_elements : string list
174
+
(** Elements with special parsing behavior that affect scope checking.
175
+
176
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#special>
177
+
WHATWG: Special elements *)
43
178
end
179
+
180
+
(** Parser insertion modes.
181
+
182
+
Insertion modes are the states of the tree construction state machine.
183
+
They determine how each token from the tokenizer should be processed
184
+
based on the current document context.
185
+
186
+
For example, a [<td>] tag is handled differently depending on whether
187
+
the parser is currently in a table context or in the body.
188
+
189
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#insertion-mode>
190
+
WHATWG: Insertion mode *)
44
191
module Insertion_mode : sig
45
192
type t
193
+
(** The insertion mode type. Values include modes like [initial],
194
+
[before_html], [in_head], [in_body], [in_table], etc. *)
46
195
end
196
+
197
+
(** Tree builder state.
198
+
199
+
The tree builder maintains the state needed for tree construction:
200
+
- Stack of open elements
201
+
- List of active formatting elements
202
+
- Template insertion mode stack
203
+
- Current insertion mode
204
+
- Foster parenting flag
205
+
206
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#tree-construction>
207
+
WHATWG: Tree construction *)
47
208
module Tree_builder : sig
48
209
type t
210
+
(** The tree builder state. *)
49
211
end
50
212
51
213
(** {1 Types} *)
52
214
53
215
(** A parse error encountered during parsing.
54
216
55
-
HTML5 parsing never fails - it always produces a DOM tree. However,
56
-
the specification defines many error conditions that conformance
57
-
checkers should report. Error collection is optional and disabled
58
-
by default for performance.
217
+
HTML5 parsing {b never fails} - it always produces a DOM tree. However,
218
+
the WHATWG specification defines 92 specific error conditions that
219
+
conformance checkers should report. These errors indicate malformed
220
+
HTML that browsers will still render (with error recovery).
221
+
222
+
{b Error categories:}
223
+
224
+
{i Tokenizer errors} (detected during tokenization):
225
+
- [abrupt-closing-of-empty-comment]: Empty comment closed abruptly, as in [<!-->] or [<!--->]
226
+
- [abrupt-doctype-public-identifier]: DOCTYPE public ID ended unexpectedly
227
+
- [eof-before-tag-name]: End of file while reading a tag name
228
+
- [eof-in-tag]: End of file inside a tag
229
+
- [missing-attribute-value]: Attribute has [=] but no value
230
+
- [unexpected-null-character]: Null byte in the input
231
+
- [unexpected-question-mark-instead-of-tag-name]: [<?] found where a tag name was expected (e.g., an XML processing instruction)
232
+
233
+
{i Tree construction errors} (detected during tree building):
234
+
- [missing-doctype]: No DOCTYPE before first element
235
+
- [unexpected-token-*]: Token appeared in wrong context
236
+
- [foster-parenting]: Content moved outside table due to invalid position
59
237
60
-
Error codes follow the WHATWG specification naming convention,
61
-
e.g., "unexpected-null-character", "eof-in-tag".
238
+
Enable error collection with [~collect_errors:true]. Error collection
239
+
has some performance overhead, so it's disabled by default.
62
240
63
241
@see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
64
-
The list of HTML5 parse errors
65
-
*)
242
+
WHATWG: Complete list of parse errors *)
66
243
type parse_error
67
244
68
245
(** Get the error code string.
69
246
70
-
Error codes are lowercase with hyphens, matching the WHATWG spec names
71
-
like "unexpected-null-character" or "eof-before-tag-name".
72
-
*)
247
+
Error codes are lowercase with hyphens, exactly matching the WHATWG
248
+
specification naming. Examples:
249
+
- ["unexpected-null-character"]
250
+
- ["eof-before-tag-name"]
251
+
- ["missing-end-tag-name"]
252
+
- ["duplicate-attribute"]
253
+
- ["missing-doctype"]
254
+
255
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
256
+
WHATWG: Parse error codes *)
73
257
val error_code : parse_error -> string
74
258
75
-
(** Get the line number where the error occurred (1-indexed). *)
259
+
(** Get the line number where the error occurred.
260
+
261
+
Line numbers are 1-indexed (first line is 1). Line breaks are
262
+
detected at LF (U+000A), CR (U+000D), and CR+LF sequences. *)
76
263
val error_line : parse_error -> int
77
264
78
-
(** Get the column number where the error occurred (1-indexed). *)
265
+
(** Get the column number where the error occurred.
266
+
267
+
Column numbers are 1-indexed (first column is 1). Columns reset
268
+
to 1 after each line break. Column counting uses code points,
269
+
not bytes or grapheme clusters. *)
79
270
val error_column : parse_error -> int
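The line and column rules described for [error_line] and [error_column] can be sketched as a function from a byte offset to a 1-indexed position. [position_at] is a hypothetical helper, not part of this API, and it assumes the input is valid UTF-8.

```ocaml
(* Compute the 1-indexed (line, column) of the code point that starts at
   byte [offset] in [s]. LF, CR and CR LF each count as one line break. *)
let position_at (s : string) (offset : int) : int * int =
  let line = ref 1 and col = ref 1 in
  let i = ref 0 in
  while !i < offset && !i < String.length s do
    (match s.[!i] with
     | '\n' -> incr line; col := 1
     | '\r' ->
       (* Treat CR LF as a single line break by skipping the LF. *)
       if !i + 1 < String.length s && s.[!i + 1] = '\n' then incr i;
       incr line; col := 1
     | c ->
       (* Count only non-continuation bytes, so that a multi-byte
          UTF-8 sequence advances the column by exactly one. *)
       if Char.code c land 0xC0 <> 0x80 then incr col);
    incr i
  done;
  (!line, !col)
```

Counting code points rather than bytes means that a character such as "é" (two bytes in UTF-8) advances the column by one.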
80
271
81
272
(** Context element for HTML fragment parsing.
82
273
83
-
When parsing an HTML fragment (innerHTML), you need to specify the
84
-
context element that would contain the fragment. This affects how
85
-
the parser handles certain elements.
274
+
When parsing HTML fragments (the content that would be assigned to
275
+
an element's [innerHTML]), the parser needs to know what element
276
+
would contain the fragment. This affects parsing in several ways:
86
277
87
-
For example, parsing [<td>] as a fragment of a [<tr>] works differently
88
-
than parsing it as a fragment of a [<div>].
278
+
{b Parser state initialization:}
279
+
- For [<title>] or [<textarea>]: Tokenizer starts in RCDATA state
280
+
- For [<style>], [<xmp>], [<iframe>], [<noembed>], [<noframes>]:
281
+
Tokenizer starts in RAWTEXT state
282
+
- For [<script>]: Tokenizer starts in script data state
283
+
- For [<noscript>]: Tokenizer starts in RAWTEXT state (if scripting enabled)
284
+
- For [<plaintext>]: Tokenizer starts in PLAINTEXT state
285
+
- Otherwise: Tokenizer starts in data state
286
+
287
+
{b Insertion mode:}
288
+
The initial insertion mode depends on the context element:
289
+
- [<template>]: "in template" mode
290
+
- [<html>]: "before head" mode
291
+
- [<head>]: "in head" mode
292
+
- [<body>], [<div>], etc.: "in body" mode
293
+
- [<table>]: "in table" mode
294
+
- And so on...
89
295
90
296
@see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments>
91
-
The HTML fragment parsing algorithm
92
-
*)
297
+
WHATWG: The fragment parsing algorithm *)
93
298
type fragment_context
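The tokenizer-state rules listed above amount to a lookup on the context element's tag name. A minimal sketch follows; the [tokenizer_state] variant and [initial_state] function are hypothetical names for this illustration, and the [scripting] flag defaulting to [false] is an assumption.

```ocaml
(* Hypothetical subset of tokenizer states relevant to fragment parsing. *)
type tokenizer_state = Data | Rcdata | Rawtext | Script_data | Plaintext

(* Pick the initial tokenizer state from the context element's tag name. *)
let initial_state ?(scripting = false) tag_name =
  match tag_name with
  | "title" | "textarea" -> Rcdata
  | "style" | "xmp" | "iframe" | "noembed" | "noframes" -> Rawtext
  | "script" -> Script_data
  | "noscript" when scripting -> Rawtext
  | "plaintext" -> Plaintext
  | _ -> Data
```

With scripting disabled, a [noscript] context falls through to the data state, so its content is tokenized as ordinary markup.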
94
299
95
300
(** Create a fragment parsing context.
96
301
97
-
@param tag_name The tag name of the context element (e.g., "div", "tr")
98
-
@param namespace Namespace: [None] for HTML, [Some "svg"], [Some "mathml"]
302
+
@param tag_name Tag name of the context element. This should be the
303
+
tag name of the element that would contain the fragment.
304
+
Common choices:
305
+
- ["div"]: General-purpose (most common)
306
+
- ["body"]: For full body content
307
+
- ["tr"]: For table row content ([<td>] elements)
308
+
- ["ul"], ["ol"]: For list content ([<li>] elements)
309
+
- ["select"]: For [<option>] elements
99
310
311
+
@param namespace Element namespace:
312
+
- [None]: HTML namespace (default)
313
+
- [Some "svg"]: SVG namespace
314
+
- [Some "mathml"]: MathML namespace
315
+
316
+
{b Examples:}
100
317
{[
101
-
(* Parse as innerHTML of a table row *)
318
+
(* Parse innerHTML of a table row - <td> works correctly *)
102
319
let ctx = make_fragment_context ~tag_name:"tr" ()
103
320
104
-
(* Parse as innerHTML of an SVG element *)
321
+
(* Parse innerHTML of an SVG group element *)
105
322
let ctx = make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") ()
323
+
324
+
(* Parse innerHTML of a select element - <option> works correctly *)
325
+
let ctx = make_fragment_context ~tag_name:"select" ()
106
326
]}
107
-
*)
327
+
328
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments>
329
+
WHATWG: Fragment parsing algorithm *)
108
330
val make_fragment_context : tag_name:string -> ?namespace:string option ->
109
331
unit -> fragment_context
110
332
111
333
(** Get the tag name of a fragment context. *)
112
334
val fragment_context_tag : fragment_context -> string
113
335
114
-
(** Get the namespace of a fragment context. *)
336
+
(** Get the namespace of a fragment context ([None] for HTML). *)
115
337
val fragment_context_namespace : fragment_context -> string option
116
338
117
339
(** Result of parsing an HTML document or fragment.
118
340
119
-
Contains the parsed DOM tree, any errors encountered (if error
120
-
collection was enabled), and the detected encoding (for byte input).
341
+
This opaque type contains:
342
+
- The DOM tree (access via {!root})
343
+
- Parse errors if collection was enabled (access via {!errors})
344
+
- Detected encoding for byte input (access via {!encoding})
121
345
*)
122
346
type t
123
347
124
348
(** {1 Parsing Functions} *)
125
349
350
+
(** Parse HTML from a byte stream reader.
351
+
352
+
This function implements the complete HTML5 parsing algorithm:
353
+
354
+
1. Reads bytes from the provided reader
355
+
2. Tokenizes the input into HTML tokens
356
+
3. Constructs a DOM tree using the tree construction algorithm
357
+
4. Returns the parsed result
358
+
359
+
The input should be valid UTF-8. For automatic encoding detection
360
+
from raw bytes, use {!parse_bytes} instead.
361
+
362
+
{b Parser behavior:}
363
+
364
+
For {b full document parsing} (no fragment context), the parser:
365
+
- Creates a Document node as the root
366
+
- Processes any DOCTYPE declaration
367
+
- Creates [<html>], [<head>], and [<body>] elements as needed
368
+
- Builds the full document tree
369
+
370
+
For {b fragment parsing} (with fragment context), the parser:
371
+
- Creates a Document Fragment as the root
372
+
- Initializes tokenizer state based on context element
373
+
- Initializes insertion mode based on context element
374
+
- Does not create implicit [<html>], [<head>], [<body>]
375
+
376
+
@param collect_errors If [true], collect parse errors in the result.
377
+
Default: [false]. Enabling error collection adds overhead.
378
+
@param fragment_context Context for fragment parsing. If provided,
379
+
the input is parsed as fragment content (like innerHTML).
380
+
381
+
@see <https://html.spec.whatwg.org/multipage/parsing.html>
382
+
WHATWG: HTML parsing *)
126
383
val parse : ?collect_errors:bool -> ?fragment_context:fragment_context ->
127
384
Bytesrw.Bytes.Reader.t -> t
128
-
(** Parse HTML from a byte stream reader.
385
+
386
+
(** Parse HTML bytes with automatic encoding detection.
387
+
388
+
This function wraps {!parse} with encoding detection, implementing the
389
+
WHATWG encoding sniffing algorithm:
390
+
391
+
{b Detection order:}
392
+
1. {b BOM}: Check first 2-3 bytes for UTF-8, UTF-16LE, or UTF-16BE BOM
393
+
2. {b Prescan}: Look for [<meta charset="...">] or
394
+
[<meta http-equiv="Content-Type" content="...charset=...">]
395
+
in the first 1024 bytes
396
+
3. {b Transport hint}: Use [transport_encoding] if provided
397
+
4. {b Fallback}: Use UTF-8
398
+
399
+
The detected encoding is stored in the result (access via {!encoding}).
129
400
130
-
This is the primary parsing function. The input must be valid UTF-8
131
-
(or will be converted from detected encoding when using {!parse_bytes}).
401
+
{b Prescan details:}
132
402
133
-
@param collect_errors If [true], collect parse errors (default: [false])
134
-
@param fragment_context Context for fragment parsing (innerHTML)
403
+
The prescan algorithm parses just enough of the document to find a
404
+
charset declaration. It handles:
405
+
- [<meta charset="utf-8">]
406
+
- [<meta http-equiv="Content-Type" content="text/html; charset=utf-8">]
407
+
- Comments and other markup are skipped
408
+
- Parsing stops after 1024 bytes
135
409
136
-
{[
137
-
open Bytesrw
138
-
let reader = Bytes.Reader.of_string "<p>Hello</p>" in
139
-
let result = parse reader
140
-
]}
141
-
*)
410
+
@param collect_errors If [true], collect parse errors. Default: [false].
411
+
@param transport_encoding Encoding hint from HTTP Content-Type header.
412
+
For example: ["utf-8"], ["iso-8859-1"], ["windows-1252"].
413
+
@param fragment_context Context for fragment parsing.
142
414
415
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
416
+
WHATWG: Determining the character encoding
417
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding>
418
+
WHATWG: Prescan algorithm *)
143
419
val parse_bytes : ?collect_errors:bool -> ?transport_encoding:string ->
144
420
?fragment_context:fragment_context -> bytes -> t
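The detection order documented for [parse_bytes] can be sketched as below. This is an illustration under stated assumptions, not the library's code: [detect_bom] and [sniff] are hypothetical names, the prescan result is passed in as an already computed [meta_charset] option instead of being scanned from the bytes, and encodings are plain label strings.

```ocaml
type bom = Utf8 | Utf16le | Utf16be

(* Step 1: recognise a byte order mark in the first 2 or 3 bytes. *)
let detect_bom (b : bytes) : bom option =
  let byte i = Char.code (Bytes.get b i) in
  let n = Bytes.length b in
  if n >= 3 && byte 0 = 0xEF && byte 1 = 0xBB && byte 2 = 0xBF then Some Utf8
  else if n >= 2 && byte 0 = 0xFF && byte 1 = 0xFE then Some Utf16le
  else if n >= 2 && byte 0 = 0xFE && byte 1 = 0xFF then Some Utf16be
  else None

(* Steps 1 to 4 in the documented order:
   BOM, then prescan result, then transport hint, then UTF-8. *)
let sniff ?transport_encoding ?meta_charset (b : bytes) : string =
  match detect_bom b with
  | Some Utf8 -> "utf-8"
  | Some Utf16le -> "utf-16le"
  | Some Utf16be -> "utf-16be"
  | None -> (
    match (meta_charset, transport_encoding) with
    | Some m, _ -> m
    | None, Some t -> t
    | None, None -> "utf-8")
```

A BOM wins over both hints in this sketch, which matches step 1 of the list above.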
145
-
(** Parse HTML bytes with automatic encoding detection.
146
421
147
-
Implements the WHATWG encoding sniffing algorithm:
148
-
1. Check for BOM (UTF-8, UTF-16LE, UTF-16BE)
149
-
2. Prescan for [<meta charset>] declaration
150
-
3. Use transport encoding hint if provided
151
-
4. Fall back to UTF-8
422
+
(** {1 Result Accessors} *)
423
+
424
+
(** Get the root node of the parsed document.
152
425
153
-
@param collect_errors If [true], collect parse errors (default: [false])
154
-
@param transport_encoding Encoding from HTTP Content-Type header
155
-
@param fragment_context Context for fragment parsing (innerHTML)
156
-
*)
426
+
For full document parsing, returns a Document node with structure:
427
+
{v
428
+
#document
429
+
├── !doctype (if DOCTYPE was present)
430
+
└── html
431
+
├── head
432
+
│ └── ... (title, meta, link, script, style)
433
+
└── body
434
+
└── ... (page content)
435
+
v}
157
436
158
-
(** {1 Result Accessors} *)
437
+
For fragment parsing, returns a Document Fragment node containing
438
+
the parsed elements directly (no implicit html/head/body).
159
439
440
+
@see <https://html.spec.whatwg.org/multipage/dom.html#document>
441
+
WHATWG: The Document object *)
160
442
val root : t -> Dom.node
161
-
(** Get the root node of the parsed document.
162
443
163
-
For full document parsing, this is a document node.
164
-
For fragment parsing, this is a document fragment node.
165
-
*)
444
+
(** Get parse errors collected during parsing.
445
+
446
+
Returns an empty list if error collection was not enabled
447
+
([~collect_errors:false] or omitted) or if the document was well-formed.
448
+
449
+
Errors are returned in the order they were encountered.
166
450
451
+
{b Example:}
452
+
{[
453
+
let result = parse ~collect_errors:true reader in
454
+
List.iter (fun e ->
455
+
Printf.printf "Line %d, col %d: %s\n"
456
+
(error_line e) (error_column e) (error_code e)
457
+
) (errors result)
458
+
]}
459
+
460
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#parse-errors>
461
+
WHATWG: Parse errors *)
167
462
val errors : t -> parse_error list
168
-
(** Get parse errors (empty if error collection was disabled). *)
463
+
464
+
(** Get the detected encoding.
465
+
466
+
Returns [Some encoding] when {!parse_bytes} was used, indicating which
467
+
encoding was detected or specified.
468
+
469
+
Returns [None] when {!parse} was used (it expects pre-decoded UTF-8).
169
470
471
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>
472
+
WHATWG: Determining the character encoding *)
170
473
val encoding : t -> Encoding.encoding option
171
-
(** Get the detected encoding (only set when using {!parse_bytes}). *)
172
474
173
475
(** {1 Querying} *)
174
476
175
-
val query : t -> string -> Dom.node list
176
477
(** Query the DOM with a CSS selector.
177
478
479
+
Returns all elements matching the selector in document order.
480
+
481
+
{b Supported selectors:}
482
+
483
+
See {!Html5rw_selector} for the complete list. Key selectors include:
484
+
- Type: [div], [p], [a]
485
+
- ID: [#myid]
486
+
- Class: [.myclass]
487
+
- Attribute: [[href]], [[type="text"]]
488
+
- Pseudo-class: [:first-child], [:nth-child(2)]
489
+
- Combinators: [div p] (descendant), [div > p] (child)
490
+
178
491
@raise Html5rw_selector.Selector_error if the selector is invalid
179
492
180
-
See {!Html5rw_selector} for supported selector syntax.
181
-
*)
493
+
@see <https://www.w3.org/TR/selectors-4/>
494
+
W3C: Selectors Level 4 *)
495
+
val query : t -> string -> Dom.node list
182
496
183
497
(** {1 Serialization} *)
184
498
499
+
(** Serialize the DOM tree to a byte writer.
500
+
501
+
Outputs valid HTML5 that can be parsed to produce an equivalent DOM tree.
502
+
The output follows the WHATWG serialization algorithm.
503
+
504
+
{b Serialization rules:}
505
+
- Void elements are written without end tags
506
+
- Attributes are quoted with double quotes
507
+
- Special characters in text/attributes are escaped
508
+
- Comments preserve their content
509
+
- DOCTYPE is serialized as [<!DOCTYPE html>]
510
+
511
+
@param pretty If [true] (default), add indentation for readability.
512
+
@param indent_size Spaces per indent level (default: 2).
513
+
514
+
@see <https://html.spec.whatwg.org/multipage/parsing.html#serialising-html-fragments>
515
+
WHATWG: Serialising HTML fragments *)
185
516
val to_writer : ?pretty:bool -> ?indent_size:int -> t ->
186
517
Bytesrw.Bytes.Writer.t -> unit
187
-
(** Serialize the DOM tree to a byte stream writer.
188
518
189
-
@param pretty If [true], format with indentation (default: [true])
190
-
@param indent_size Spaces per indent level (default: [2])
191
-
*)
519
+
(** Serialize the DOM tree to a string.
192
520
521
+
Convenience wrapper around {!to_writer} that returns a string.
522
+
523
+
@param pretty If [true] (default), add indentation for readability.
524
+
@param indent_size Spaces per indent level (default: 2). *)
193
525
val to_string : ?pretty:bool -> ?indent_size:int -> t -> string
194
-
(** Serialize the DOM tree to a string.
195
526
196
-
@param pretty If [true], format with indentation (default: [true])
197
-
@param indent_size Spaces per indent level (default: [2])
198
-
*)
199
-
200
-
val to_text : ?separator:string -> ?strip:bool -> t -> string
201
527
(** Extract text content from the DOM tree.
202
528
203
-
@param separator String between text nodes (default: [" "])
204
-
@param strip If [true], trim whitespace (default: [true])
205
-
*)
529
+
Returns the content of all text nodes in document order, joined by [separator],
530
+
with no HTML markup.
206
531
207
-
val to_test_format : t -> string
532
+
@param separator String to insert between text nodes (default: [" "])
533
+
@param strip If [true] (default), trim leading/trailing whitespace *)
534
+
val to_text : ?separator:string -> ?strip:bool -> t -> string
535
+
208
536
(** Serialize to html5lib test format.
209
537
210
-
This format is used by the html5lib test suite and shows the tree
211
-
structure with indentation and node type prefixes.
212
-
*)
538
+
This produces the tree representation format used by the
539
+
{{:https://github.com/html5lib/html5lib-tests} html5lib-tests} suite.
540
+
541
+
The format shows the tree structure with:
542
+
- Indentation indicating depth (2 spaces per level)
543
+
- Prefixes indicating node type:
544
+
- [<!DOCTYPE ...>] for DOCTYPE
545
+
- [<tagname>] for elements (with attributes on same line)
546
+
- ["text"] for text nodes
547
+
- [<!-- comment -->] for comments
548
+
549
+
Mainly useful for testing the parser against the reference test suite. *)
550
+
val to_test_format : t -> string