OCaml HTML5 parser/serialiser based on Python's JustHTML
# html5rw - Pure OCaml HTML5 Parser

A pure OCaml HTML5 parser implementing the WHATWG HTML5 parsing specification.
This library passes the html5lib-tests suite and provides full support for
tokenization, tree construction, encoding detection, and CSS selector queries.
It was ported from the Python library [JustHTML](https://github.com/EmilStenstrom/justhtml/).

## Key Features

- **WHATWG Compliant**: Implements the full HTML5 parsing algorithm with proper error recovery
- **CSS Selectors**: Query the DOM using standard CSS selector syntax
- **Streaming I/O**: Uses bytesrw for efficient streaming input/output
- **Encoding Detection**: Automatic character encoding detection following the WHATWG algorithm
- **Entity Decoding**: Complete HTML5 named character reference support

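To give a feel for what encoding detection involves, here is a minimal, standard-library-only sketch of byte order mark (BOM) sniffing, which is one early step of the WHATWG encoding sniffing algorithm. It is an illustration only and not html5rw's implementation; the names `bom_encoding` and `sniff_bom` are hypothetical and do not belong to the library's API.

```ocaml
(* Encodings that a byte order mark can identify. *)
type bom_encoding = Utf_8 | Utf_16be | Utf_16le

(* Return the encoding signalled by a leading BOM, if any, paired with the
   BOM's length in bytes so that a caller can skip over it.
   Hypothetical helper: not part of html5rw. *)
let sniff_bom (s : string) : (bom_encoding * int) option =
  let n = String.length s in
  let byte i = Char.code s.[i] in
  if n >= 3 && byte 0 = 0xEF && byte 1 = 0xBB && byte 2 = 0xBF then
    Some (Utf_8, 3)
  else if n >= 2 && byte 0 = 0xFE && byte 1 = 0xFF then Some (Utf_16be, 2)
  else if n >= 2 && byte 0 = 0xFF && byte 1 = 0xFE then Some (Utf_16le, 2)
  else None
```

When no BOM is present, the full algorithm falls back to other signals such as transport-level metadata and `<meta>` prescanning, which this sketch leaves out.
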
## Usage

```ocaml
open Bytesrw

(* Parse HTML from a string *)
let html = "<html><body><p>Hello, world!</p></body></html>"
let reader = Bytes.Reader.of_string html
let doc = Html5rw.parse reader

(* Query with CSS selectors *)
let paragraphs = Html5rw.query doc "p"

(* Extract text content *)
let text = Html5rw.to_text doc

(* Serialize back to HTML *)
let output = Html5rw.to_string doc
```

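Because the parser reads from a bytesrw reader, the same calls should also work on a file. The sketch below makes several assumptions: that `Html5rw.parse` consumes the reader fully before returning, that `Html5rw.to_string` returns a `string`, and that OCaml 4.14 or later is available for `In_channel.with_open_bin`. The helper names `parse_file` and `html_of_file` are hypothetical.

```ocaml
open Bytesrw

(* Parse the HTML file at [path] and return the document.
   The channel is open only while [Html5rw.parse] runs, so this assumes
   that parsing consumes the reader before returning. *)
let parse_file path =
  In_channel.with_open_bin path (fun ic ->
      Html5rw.parse (Bytes.Reader.of_in_channel ic))

(* Re-serialize a file; assumes [Html5rw.to_string] returns a [string]. *)
let html_of_file path = Html5rw.to_string (parse_file path)
```
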
For fragment parsing (innerHTML):

```ocaml
open Bytesrw

(* Parse as innerHTML of a <div> *)
let ctx = Html5rw.make_fragment_context ~tag_name:"div" ()
let reader = Bytes.Reader.of_string "<p>Fragment content</p>"
let doc = Html5rw.parse ~fragment_context:ctx reader
```

## Installation

```
opam install html5rw
```

## Documentation

API documentation is available via:

```
opam install html5rw
odig doc html5rw
```

## License

MIT