
Hello] is valid) - Void elements have no end tag ([
], not [
] or [

]) - Boolean attributes need no value ([])

    XHTML uses stricter XML rules. If you need XHTML parsing, use an XML
    parser.

    @see WHATWG: The HTML syntax *)

(** {1 Sub-modules} *)

(** Parse error code types.

    This module provides the {!Parse_error_code.t} variant type that
    represents all WHATWG-defined parse errors plus tree construction errors.

    @see WHATWG: Parse errors *)
module Parse_error_code = Parse_error_code

(** DOM types and manipulation functions.

    This module provides the core types for representing HTML documents as
    DOM trees. It includes:
    - The {!Dom.node} type representing all kinds of DOM nodes
    - Functions to create, modify, and traverse nodes
    - Serialization functions to convert DOM back to HTML

    @see WHATWG: The elements of HTML *)
module Dom = Dom

(** HTML5 tokenizer.

    The tokenizer is the first stage of HTML5 parsing. It converts a stream
    of characters into a stream of {i tokens}: start tags, end tags, text,
    comments, and DOCTYPEs.

    Most users don't need to use the tokenizer directly - the {!parse}
    function handles everything. The tokenizer is exposed for advanced use
    cases like syntax highlighting or partial parsing.

    @see WHATWG: Tokenization *)
module Tokenizer = Tokenizer

(** Encoding detection and decoding.

    HTML documents can use various character encodings (UTF-8, ISO-8859-1,
    etc.). This module implements the WHATWG encoding sniffing algorithm that
    browsers use to detect the encoding of a document:

    1. Check for a BOM (Byte Order Mark)
    2. Look for a [<meta charset>] declaration
    3. Use HTTP Content-Type header hint (if available)
    4. Fall back to UTF-8

    @see WHATWG: Determining the character encoding
    @see WHATWG Encoding Standard *)
module Encoding = Encoding

(** CSS selector engine.

    This module provides CSS selector support for querying the DOM tree. CSS
    selectors are patterns used to select HTML elements based on their tag
    names, attributes, classes, IDs, and position in the document.

    Example selectors:
    - [div] - all [<div>] elements
    - [#header] - element with [id="header"]
    - [.warning] - elements with [class="warning"]
    - [div > p] - [<p>] elements that are direct children of [<div>]
    - [[href]] - elements with an [href] attribute

    @see W3C Selectors Level 4 specification *)
module Selector = Selector

(** HTML entity decoding.

    HTML uses {i character references} to represent characters that are hard
    to type or have special meaning:
    - Named references: [&amp;] (ampersand), [&lt;] (less than), [&nbsp;]
      (non-breaking space)
    - Decimal references: [&#60;] (less than as decimal 60)
    - Hexadecimal references: [&#x3C;] (less than as hex 3C)

    This module decodes all 2,231 named character references defined in the
    WHATWG specification, plus numeric references.

    @see WHATWG: Named character references *)
module Entities = Entities

(** Low-level parser access.

    This module exposes the internals of the HTML5 parser for advanced use.
    Most users should use the top-level {!parse} function instead.

    The parser exposes:
    - Insertion modes for the tree construction algorithm
    - The tree builder state machine
    - Lower-level parsing functions

    @see WHATWG: Tree construction *)
module Parser = Parser

(** {1 Core Types} *)

(** DOM node type.

    A node represents one part of an HTML document. Nodes form a tree
    structure with parent/child relationships. There are several kinds:
    - {b Element nodes}: HTML tags like [<div>] and [<p>]
    - {b Text nodes}: Text content within elements
    - {b Comment nodes}: HTML comments [<!-- comment -->]
    - {b Document nodes}: The root of a document tree
    - {b Document fragment nodes}: Lightweight containers
    - {b Doctype nodes}: The [<!DOCTYPE html>] declaration

    See {!Dom} for manipulation functions.

    @see WHATWG: The DOM *)
type node = Dom.node

val pp_node : Format.formatter -> node -> unit
(** Pretty-print a DOM node. Prints a summary representation showing the node
    type and key attributes. Does not recursively print children. *)

(** DOCTYPE information.

    The DOCTYPE declaration ([<!DOCTYPE html>]) appears at the start of HTML
    documents. It tells browsers to use standards mode for rendering.

    In HTML5, the DOCTYPE is minimal - just [<!DOCTYPE html>] with no public
    or system identifiers. Legacy DOCTYPEs may have additional fields.

    @see WHATWG: The DOCTYPE *)
type doctype_data = Dom.doctype_data = {
  name : string option;  (** DOCTYPE name, typically ["html"] *)
  public_id : string option;
      (** Public identifier for legacy DOCTYPEs (e.g., XHTML, HTML4) *)
  system_id : string option;
      (** System identifier (URL) for legacy DOCTYPEs *)
}

val pp_doctype_data : Format.formatter -> doctype_data -> unit
(** Pretty-print DOCTYPE data. *)

(** Source location for nodes.

    Records the line and column where a node was found in the source HTML.
    The end position is optional for nodes like text that may span multiple
    locations. *)
type location = Dom.location = {
  line : int;  (** 1-indexed line number where the node starts *)
  column : int;  (** 1-indexed column number where the node starts *)
  end_line : int option;  (** Optional line number where the node ends *)
  end_column : int option;  (** Optional column number where the node ends *)
}

val make_location :
  line:int -> column:int -> ?end_line:int -> ?end_column:int -> unit -> location
(** Create a location. *)

val get_location : node -> location option
(** Get the source location for a node, if set.
*)

val set_location :
  node -> line:int -> column:int -> ?end_line:int -> ?end_column:int -> unit -> unit
(** Set the source location for a node. *)

(** Quirks mode as determined during parsing.

    {i Quirks mode} controls how browsers render CSS and compute layouts. It
    exists for backwards compatibility with old web pages that relied on
    browser bugs.

    - {b No_quirks}: Standards mode. The document is rendered according to
      modern HTML5 and CSS specifications. Triggered by [<!DOCTYPE html>].
    - {b Quirks}: Full quirks mode. The browser emulates bugs from older
      browsers (primarily IE5). Triggered by missing or malformed DOCTYPEs.
      Affects CSS box model, table layout, font inheritance, and more.
    - {b Limited_quirks}: Almost standards mode. Only a few specific quirks
      are applied, mainly affecting table cell vertical alignment.

    {b Recommendation:} Always use [<!DOCTYPE html>] to ensure standards mode.

    @see Quirks Mode Standard
    @see WHATWG: How quirks mode is determined *)
type quirks_mode = Dom.quirks_mode = No_quirks | Quirks | Limited_quirks

val pp_quirks_mode : Format.formatter -> quirks_mode -> unit
(** Pretty-print quirks mode. *)

(** Character encoding detected or specified.

    HTML documents are sequences of bytes that must be decoded into
    characters. Different encodings interpret the same bytes differently. For
    example:
    - UTF-8: The modern standard, supporting all Unicode characters
    - Windows-1252: Common on older Western European web pages
    - ISO-8859-2: Used for Central European languages
    - UTF-16: Used by some Windows applications

    The parser detects encoding automatically when using {!parse_bytes}. The
    detected encoding is available via {!val-encoding}.
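    To make "the same bytes, different characters" concrete, here is a
    minimal standard-library sketch that does not use this library's
    decoder. The helper name [latin1_to_utf8] is hypothetical. It treats
    each byte as an ISO-8859-1 code point, which agrees with Windows-1252
    outside the 0x80-0x9F range. [String.is_valid_utf_8] assumes OCaml 4.14
    or later.

```ocaml
(* Hypothetical helper: decode bytes as ISO-8859-1 (byte value = code point)
   and re-encode the result as UTF-8. Windows-1252 uses the same mapping
   for every byte outside 0x80-0x9F. *)
let latin1_to_utf8 (s : string) : string =
  let buf = Buffer.create (String.length s * 2) in
  String.iter
    (fun c -> Buffer.add_utf_8_uchar buf (Uchar.of_int (Char.code c)))
    s;
  Buffer.contents buf

(* "caf" followed by the single byte 0xE9. *)
let raw = "caf\xE9"

let () =
  (* Read as UTF-8, a lone 0xE9 is a truncated multi-byte sequence. *)
  assert (not (String.is_valid_utf_8 raw));
  (* Read as Latin-1, 0xE9 is U+00E9, encoded in UTF-8 as 0xC3 0xA9. *)
  assert (latin1_to_utf8 raw = "caf\xC3\xA9");
  assert (String.is_valid_utf_8 (latin1_to_utf8 raw))
```

    Picking the wrong encoding changes or rejects the text, which is why
    raw input of unknown origin is better handed to {!parse_bytes}.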
    @see WHATWG: Determining the character encoding
    @see WHATWG Encoding Standard *)
type encoding = Encoding.encoding =
  | Utf8
      (** UTF-8: The dominant encoding for the web, supporting all Unicode *)
  | Utf16le  (** UTF-16 Little-Endian: 16-bit encoding, used by Windows *)
  | Utf16be  (** UTF-16 Big-Endian: 16-bit encoding, network byte order *)
  | Windows_1252
      (** Windows-1252 (CP-1252): Western European, superset of ISO-8859-1 *)
  | Iso_8859_2
      (** ISO-8859-2: Central European (Polish, Czech, Hungarian, etc.) *)
  | Euc_jp  (** EUC-JP: Extended Unix Code for Japanese *)

val pp_encoding : Format.formatter -> encoding -> unit
(** Pretty-print an encoding using its canonical label. *)

(** A parse error encountered during HTML5 parsing.

    HTML5 parsing {b never fails} - the specification defines error recovery
    for all malformed input. However, conformance checkers can report these
    errors. Enable error collection with [~collect_errors:true] if you want
    to detect malformed HTML.

    {b Common parse errors:}
    - ["unexpected-null-character"]: Null byte in the input
    - ["eof-before-tag-name"]: File ended while reading a tag
    - ["unexpected-character-in-attribute-name"]: Invalid attribute syntax
    - ["missing-doctype"]: Document started without [<!DOCTYPE html>]
    - ["duplicate-attribute"]: Same attribute appears twice on an element

    The full list of parse error codes is defined in the WHATWG
    specification.

    @see WHATWG: Complete list of parse errors *)
type parse_error = Parser.parse_error

(** Get the error code.

    Returns the {!Parse_error_code.t} variant representing this error. This
    allows pattern matching on specific error types:
    {[
      match Html5rw.error_code err with
      | Parse_error_code.Unexpected_null_character -> (* handle *)
      | Parse_error_code.Eof_in_tag -> (* handle *)
      | Parse_error_code.Tree_construction_error msg -> (* handle tree error *)
      | _ -> (* other *)
    ]}

    Use {!Parse_error_code.to_string} to convert to a string representation.
    @see WHATWG: Parse error codes *)
val error_code : parse_error -> Parse_error_code.t

(** Get the line number where the error occurred (1-indexed).

    Line numbers count from 1 and increment at each newline character. *)
val error_line : parse_error -> int

(** Get the column number where the error occurred (1-indexed).

    Column numbers count from 1 and reset at each newline. *)
val error_column : parse_error -> int

val pp_parse_error : Format.formatter -> parse_error -> unit
(** Pretty-print a parse error with location information. *)

(** {1 Error Handling} *)

(** Global error type that wraps all errors raised by the Html5rw library.

    This module provides a unified error type for all parsing and selector
    errors, along with printers and conversion functions. Use this when you
    want to handle all possible errors from the library in a uniform way.

    {2 Usage}
    {[
      (* Converting parse errors *)
      let errors = Html5rw.errors result in
      List.iter (fun err ->
        let unified = Html5rw.Error.of_parse_error err in
        Printf.eprintf "%s\n" (Html5rw.Error.to_string unified)
      ) errors

      (* Catching selector errors *)
      match Html5rw.query result selector with
      | nodes -> (* success *)
      | exception Html5rw.Selector.Selector_error code ->
          let unified = Html5rw.Error.of_selector_error code in
          Printf.eprintf "%s\n" (Html5rw.Error.to_string unified)
    ]} *)
module Error : sig
  (** The unified error type for the Html5rw library. *)
  type t =
    | Parse_error of {
        code : Parse_error_code.t;
        line : int;
        column : int;
      }
        (** An HTML parse error, including location information.

            Parse errors occur during HTML tokenization and tree
            construction. The location indicates where in the input the
            error was detected.

            @see WHATWG: Parse errors *)
    | Selector_error of Selector.Error_code.t
        (** A CSS selector parse error.

            Selector errors occur when parsing malformed CSS selectors
            passed to {!query} or {!matches}. *)

  val of_parse_error : parse_error -> t
  (** Convert a parse error to the unified error type.
      {[
        let errors = Html5rw.errors result in
        let unified_errors = List.map Html5rw.Error.of_parse_error errors
      ]} *)

  val of_selector_error : Selector.Error_code.t -> t
  (** Convert a selector error code to the unified error type.

      {[
        match Html5rw.query result "invalid[" with
        | _ -> ()
        | exception Html5rw.Selector.Selector_error code ->
            let err = Html5rw.Error.of_selector_error code in
            Printf.eprintf "%s\n" (Html5rw.Error.to_string err)
      ]} *)

  val to_string : t -> string
  (** Convert to a human-readable error message with location information.

      Examples:
      - ["Parse error at 5:12: unexpected-null-character"]
      - ["Selector error: Expected \]"] *)

  val pp : Format.formatter -> t -> unit
  (** Pretty-printer for use with [Format] functions. *)

  val code_string : t -> string
  (** Get just the error code as a kebab-case string (without location).

      This is useful for programmatic error handling or logging.

      Examples:
      - ["unexpected-null-character"]
      - ["expected-closing-bracket"] *)
end

(** {1 Fragment Parsing} *)

(** Context element for HTML fragment parsing (innerHTML).

    When parsing HTML fragments (like the [innerHTML] of an element), you
    must specify what element would contain the fragment. This affects how
    the parser handles certain elements.

    {b Why context matters:} HTML parsing rules depend on where content
    appears. For example:
    - [] is valid inside [] but not inside [

] - [

] is valid inside [

elements are parsed correctly *) ]} @see
WHATWG: The fragment parsing algorithm *)
type fragment_context = Parser.fragment_context

(** Create a fragment parsing context.

    The context element determines how the parser interprets the fragment.
    Choose a context that matches where the fragment would be inserted.

    @param tag_name Tag name of the context element (e.g., ["div"], ["tr"],
      ["ul"]). This is the element that would contain the fragment.
    @param namespace Namespace of the context element:
      - [None] (default): HTML namespace
      - [Some "svg"]: SVG namespace
      - [Some "mathml"]: MathML namespace

    {b Examples:}
    {[
      (* Parse as innerHTML of a <div> (most common case) *)
      let ctx = make_fragment_context ~tag_name:"div" ()

      (* Parse as innerHTML of a <ul> - <li> elements work correctly *)
      let ctx = make_fragment_context ~tag_name:"ul" ()

      (* Parse as innerHTML of an SVG element *)
      let ctx = make_fragment_context ~tag_name:"g" ~namespace:(Some "svg") ()
    ]}

    @see
WHATWG: Fragment parsing algorithm *)
val make_fragment_context :
  tag_name:string -> ?namespace:string option -> unit -> fragment_context

(** Get the tag name of a fragment context. *)
val fragment_context_tag : fragment_context -> string

(** Get the namespace of a fragment context. *)
val fragment_context_namespace : fragment_context -> string option

val pp_fragment_context : Format.formatter -> fragment_context -> unit
(** Pretty-print a fragment context. *)

(** Result of parsing an HTML document.

    This record contains everything produced by parsing:
    - The DOM tree (accessible via {!val-root})
    - Any parse errors (accessible via {!val-errors})
    - The detected encoding (accessible via {!val-encoding}) *)
type t = {
  root : node;
      (** Root node of the parsed document tree.

          For full document parsing, this is a Document node containing the
          DOCTYPE (if any) and [<html>] element. For fragment parsing, this
          is a Document Fragment containing the parsed elements. *)
  errors : parse_error list;
      (** Parse errors encountered during parsing.

          This list is empty unless [~collect_errors:true] was passed to the
          parse function. Errors are in the order they were encountered.

          @see WHATWG: Parse errors *)
  encoding : encoding option;
      (** Character encoding detected during parsing.

          This is [Some encoding] when using {!parse_bytes} with automatic
          encoding detection, and [None] when using {!parse} (which expects
          pre-decoded UTF-8 input). *)
}

val pp : Format.formatter -> t -> unit
(** Pretty-print a parse result summary. *)

(** {1 Parsing Functions} *)

(** Parse HTML from a [Bytes.Reader.t].

    This is the primary parsing function. It reads bytes from the provided
    reader and returns a DOM tree. The input should be valid UTF-8.
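    Since [parse] assumes UTF-8, a caller holding raw bytes may want to check
    for a byte order mark before choosing between [parse] and
    {!parse_bytes}. The BOM check is the first sniffing step this interface
    describes. The sketch below is a standalone illustration, not the
    library's implementation. The names [bom] and [sniff_bom] are
    hypothetical, and [String.starts_with] assumes OCaml 4.13 or later.

```ocaml
(* Hypothetical result type for the sketch; not part of Html5rw. *)
type bom = Utf8_bom | Utf16le_bom | Utf16be_bom

(* Inspect the first bytes of [s] for one of the three byte order marks
   relevant to HTML: EF BB BF (UTF-8), FF FE (UTF-16LE), FE FF (UTF-16BE). *)
let sniff_bom (s : string) : bom option =
  if String.starts_with ~prefix:"\xEF\xBB\xBF" s then Some Utf8_bom
  else if String.starts_with ~prefix:"\xFF\xFE" s then Some Utf16le_bom
  else if String.starts_with ~prefix:"\xFE\xFF" s then Some Utf16be_bom
  else None

let () =
  assert (sniff_bom "\xEF\xBB\xBF<p>hi</p>" = Some Utf8_bom);
  assert (sniff_bom "\xFF\xFE<\x00" = Some Utf16le_bom);
  assert (sniff_bom "<p>hi</p>" = None)
```

    An input with no BOM is not necessarily UTF-8; it only means the later
    detection steps would have to decide.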
    {b Creating readers:}
    {[
      open Bytesrw

      (* From a string *)
      let reader = Bytes.Reader.of_string html_string

      (* From a file *)
      let ic = open_in "page.html" in
      let reader = Bytes.Reader.of_in_channel ic

      (* From a buffer *)
      let reader = Bytes.Reader.of_buffer buf
    ]}

    {b Parsing a complete document:}
    {[
      let result = Html5rw.parse reader
      let doc = Html5rw.root result
    ]}

    {b Parsing a fragment:}
    {[
      let ctx = Html5rw.make_fragment_context ~tag_name:"div" () in
      let result = Html5rw.parse ~fragment_context:ctx reader
    ]}

    @param collect_errors If [true], collect parse errors. Default: [false].
      Error collection has some performance overhead.
    @param fragment_context Context element for fragment parsing. If
      provided, the input is parsed as a fragment (like innerHTML) rather
      than a complete document.

    @see WHATWG: HTML parsing algorithm *)
val parse :
  ?collect_errors:bool ->
  ?fragment_context:fragment_context ->
  Bytesrw.Bytes.Reader.t ->
  t

(** Parse raw bytes with automatic encoding detection.

    This function is useful when you have raw bytes and don't know the
    character encoding. It implements the WHATWG encoding sniffing algorithm:

    1. {b BOM detection}: Check for UTF-8, UTF-16LE, or UTF-16BE BOM
    2. {b Prescan}: Look for [<meta charset>] in the first 1024 bytes
    3. {b Transport hint}: Use the provided [transport_encoding] if any
    4. {b Fallback}: Use UTF-8 (the modern web default)

    The detected encoding is stored in the result's [encoding] field.

    {b Example:}
    {[
      (* [ic] is an input channel opened on the HTML file *)
      let bytes =
        Bytes.of_string (really_input_string ic (in_channel_length ic)) in
      let result = Html5rw.parse_bytes bytes in
      match Html5rw.encoding result with
      | Some Utf8 -> print_endline "UTF-8 detected"
      | Some Windows_1252 -> print_endline "Windows-1252 detected"
      | _ -> ()
    ]}

    @param collect_errors If [true], collect parse errors. Default: [false].
    @param transport_encoding Encoding hint from HTTP Content-Type header.
      For example, if the server sends
      [Content-Type: text/html; charset=utf-8], pass
      [~transport_encoding:"utf-8"].
    @param fragment_context Context element for fragment parsing.

    @see WHATWG: Determining the character encoding *)
val parse_bytes :
  ?collect_errors:bool ->
  ?transport_encoding:string ->
  ?fragment_context:fragment_context ->
  bytes ->
  t

(** {1 Querying} *)

(** Query the DOM tree with a CSS selector.

    CSS selectors are patterns used to select elements in HTML documents.
    This function returns all nodes matching the selector, in document order.

    {b Supported selectors:}

    {i Type selectors:}
    - [div], [p], [span] - elements by tag name

    {i Class and ID selectors:}
    - [#myid] - element with [id="myid"]
    - [.myclass] - elements with class containing "myclass"

    {i Attribute selectors:}
    - [[attr]] - elements with the [attr] attribute
    - [[attr="value"]] - attribute equals value
    - [[attr~="value"]] - attribute contains word
    - [[attr|="value"]] - attribute starts with value or value-
    - [[attr^="value"]] - attribute starts with value
    - [[attr$="value"]] - attribute ends with value
    - [[attr*="value"]] - attribute contains value

    {i Pseudo-classes:}
    - [:first-child], [:last-child] - first/last child of parent
    - [:nth-child(n)] - nth child (1-indexed)
    - [:only-child] - only child of parent
    - [:empty] - elements with no children
    - [:not(selector)] - elements not matching selector

    {i Combinators:}
    - [A B] - B descendants of A (any depth)
    - [A > B] - B direct children of A
    - [A + B] - B immediately after A (adjacent sibling)
    - [A ~ B] - B after A (general sibling)

    {i Universal:}
    - [*] - all elements

    {b Examples:}
    {[
      (* All paragraphs *)
      let ps = query result "p"

      (* Elements with class "warning" inside a div *)
      let warnings = query result "div .warning"

      (* Direct children of nav that are links *)
      let nav_links = query result "nav > a"

      (* Complex selector *)
      let items = query result "ul.menu > li:first-child a[href]"
    ]}

    @raise Selector.Selector_error if the selector syntax is invalid

    @see W3C: Selectors Level 4 *)
val query : t -> string -> node list

(** Check if a node matches a CSS selector.
    This is useful for filtering nodes or implementing custom traversals.

    {b Example:}
    {[
      let is_external_link node =
        matches node "a[href^='http']"
    ]}

    @raise Selector.Selector_error if the selector syntax is invalid *)
val matches : node -> string -> bool

(** {1 Serialization} *)

(** Write the DOM tree to a [Bytes.Writer.t].

    This serializes the DOM back to HTML. The output is valid HTML5 that can
    be parsed to produce an equivalent DOM tree.

    {b Example:}
    {[
      open Bytesrw
      let buf = Buffer.create 1024 in
      let writer = Bytes.Writer.of_buffer buf in
      Html5rw.to_writer result writer;
      Bytes.Writer.write_eod writer;
      let html = Buffer.contents buf
    ]}

    @param pretty If [true] (default), add indentation for readability. If
      [false], output compact HTML with no added whitespace.
    @param indent_size Spaces per indentation level (default: 2). Only used
      when [pretty] is [true].

    @see WHATWG: Serialising HTML fragments *)
val to_writer :
  ?pretty:bool -> ?indent_size:int -> t -> Bytesrw.Bytes.Writer.t -> unit

(** Serialize the DOM tree to a string.

    Convenience function that serializes to a string instead of a writer.
    Use {!to_writer} for large documents to avoid memory allocation.

    @param pretty If [true] (default), add indentation for readability.
    @param indent_size Spaces per indentation level (default: 2). *)
val to_string : ?pretty:bool -> ?indent_size:int -> t -> string

(** Extract text content from the DOM tree.

    This concatenates all text nodes in the document, producing a string
    with just the readable text (no HTML tags).

    {b Example:}
    {[
      (* For document: <p>Hello</p><p>World</p> *)
      let text = to_text result
      (* Returns: "Hello World" *)
    ]}

    @param separator String to insert between text nodes (default: [" "])
    @param strip If [true] (default), trim leading/trailing whitespace *)
val to_text : ?separator:string -> ?strip:bool -> t -> string

(** Serialize to html5lib test format.

    This produces the tree format used by the
    {{:https://github.com/html5lib/html5lib-tests} html5lib-tests} suite.
    Mainly useful for testing the parser against the reference tests. *)
val to_test_format : t -> string

(** {1 Result Accessors} *)

(** Get the root node of the parsed document.

    For full document parsing, this returns a Document node. The structure
    is:
    {v
    #document
    ├── !doctype (if present)
    └── html
        ├── head
        └── body
    v}

    For fragment parsing, this returns a Document Fragment node containing
    the parsed elements directly. *)
val root : t -> node

(** Get parse errors (if error collection was enabled).

    Returns an empty list if [~collect_errors:true] was not passed to the
    parse function, or if the document was well-formed.

    Errors are returned in the order they were encountered during parsing.

    @see WHATWG: Parse errors *)
val errors : t -> parse_error list

(** Get the detected encoding (if parsed from bytes).

    Returns [Some encoding] when {!parse_bytes} was used, indicating which
    encoding was detected or specified. Returns [None] when {!parse} was
    used, since it expects pre-decoded UTF-8 input.

    @see WHATWG: Determining the character encoding *)
val encoding : t -> encoding option

(** {1 DOM Utilities}

    Common DOM operations are available directly on this module. For the
    full API including more advanced operations, see the {!Dom} module.

    @see WHATWG: The elements of HTML *)

(** Create an element node.

    Elements are the building blocks of HTML documents. They represent tags
    like [<div>], [<p>], etc.

    @param name Tag name (e.g., ["div"], ["p"], ["span"])
    @param namespace Element namespace:
      - [None] (default): HTML namespace
      - [Some "svg"]: SVG namespace for graphics
      - [Some "mathml"]: MathML namespace for math notation
    @param attrs Initial attributes as [(name, value)] pairs

    {b Example:}
    {[
      (* Simple element *)
      let div = create_element "div" ()

      (* Element with attributes *)
      let link =
        create_element "a"
          ~attrs:[("href", "/about"); ("class", "nav-link")]
          ()
    ]}

    @see WHATWG: Elements in the DOM *)
val create_element :
  string ->
  ?namespace:string option ->
  ?attrs:(string * string) list ->
  ?location:Dom.location ->
  unit ->
  node

(** Create a text node.

    Text nodes contain the readable text content of HTML documents.

    {b Example:}
    {[
      let text = create_text "Hello, world!"
    ]} *)
val create_text : ?location:Dom.location -> string -> node

(** Create a comment node.

    Comments are preserved in the DOM but not rendered. They're written as
    [<!-- comment -->] in HTML.

    @see WHATWG: Comments *)
val create_comment : ?location:Dom.location -> string -> node

(** Create an empty document node.

    The Document node is the root of an HTML document tree.

    @see WHATWG: The Document object *)
val create_document : unit -> node

(** Create a document fragment node.

    Document fragments are lightweight containers for holding nodes without
    a parent document. Used for template contents and fragment parsing
    results.

    @see DOM Standard: DocumentFragment *)
val create_document_fragment : unit -> node

(** Create a doctype node.

    For HTML5 documents, use [create_doctype ~name:"html" ()].

    @param name DOCTYPE name (usually ["html"])
    @param public_id Public identifier (legacy)
    @param system_id System identifier (legacy)

    @see WHATWG: The DOCTYPE *)
val create_doctype :
  ?name:string ->
  ?public_id:string ->
  ?system_id:string ->
  ?location:location ->
  unit ->
  node

(** Append a child node to a parent.

    The child is added as the last child of the parent.
    If the child already has a parent, it is first removed from that parent. *)
val append_child : node -> node -> unit

(** Insert a node before a reference node.

    @param parent The parent node
    @param new_child The node to insert
    @param ref_child The existing child to insert before

    Raises [Not_found] if [ref_child] is not a child of [parent]. *)
val insert_before : node -> node -> node -> unit

(** Remove a child node from its parent.

    Raises [Not_found] if [child] is not a child of [parent]. *)
val remove_child : node -> node -> unit

(** Get an attribute value.

    Returns [Some value] if the attribute exists, [None] otherwise.
    Attribute names are case-sensitive (but were lowercased during parsing).

    @see WHATWG: Attributes *)
val get_attr : node -> string -> string option

(** Set an attribute value.

    If the attribute exists, it is replaced. If not, it is added. *)
val set_attr : node -> string -> string -> unit

(** Check if a node has an attribute. *)
val has_attr : node -> string -> bool

(** Get all descendant nodes in document order.

    Returns all nodes below this node in the tree, in the order they appear
    in the HTML source (depth-first). *)
val descendants : node -> node list

(** Get all ancestor nodes from parent to root.

    Returns the chain of parent nodes, starting with the immediate parent
    and ending with the Document node. *)
val ancestors : node -> node list

(** Get text content of a node and its descendants.

    For text nodes, returns the text directly. For elements, recursively
    concatenates all descendant text content. *)
val get_text_content : node -> string

(** Clone a node.

    @param deep If [true], recursively clone all descendants. If [false]
      (default), only clone the node itself. *)
val clone : ?deep:bool -> node -> node

(** {1 Node Predicates}

    Functions to test what type of node you have. *)

(** Test if a node is an element.

    Elements are HTML tags like [<div>] and [<p>]. *)
val is_element : node -> bool

(** Test if a node is a text node.

    Text nodes contain character content within elements. *)
val is_text : node -> bool

(** Test if a node is a comment node.

    Comment nodes represent HTML comments [<!-- comment -->]. *)
val is_comment : node -> bool

(** Test if a node is a document node.

    The document node is the root of a complete HTML document tree. *)
val is_document : node -> bool

(** Test if a node is a document fragment.

    Document fragments are lightweight containers for nodes. *)
val is_document_fragment : node -> bool

(** Test if a node is a doctype node.

    Doctype nodes represent the [<!DOCTYPE html>] declaration. *)
val is_doctype : node -> bool

(** Test if a node has children. *)
val has_children : node -> bool
