OCaml HTML5 parser/serialiser based on Python's JustHTML
# html5rw - Pure OCaml HTML5 Parser and Conformance Checker

A pure OCaml HTML5 parser and validator implementing the WHATWG HTML5 specification.
This library passes the html5lib-tests suite and provides full support for
tokenization, tree construction, encoding detection, CSS selector queries, and
conformance checking.
This library was ported from [JustHTML](https://github.com/EmilStenstrom/justhtml/).

## Key Features

- **WHATWG Compliant**: Implements the full HTML5 parsing algorithm with proper error recovery
- **Conformance Checker**: Validates HTML5 documents against the WHATWG specification
- **CSS Selectors**: Query the DOM using standard CSS selector syntax
- **Streaming I/O**: Uses bytesrw for efficient streaming input/output
- **Encoding Detection**: Automatic character encoding detection following the WHATWG algorithm
- **Entity Decoding**: Complete HTML5 named character reference support
- **Multiple Output Formats**: Text, JSON (Nu validator compatible), and GNU-style output

## Libraries

- `html5rw` - Core HTML5 parser
- `html5rw.check` - Conformance checker library

## Command Line Tool

The `html5check` CLI validates HTML5 documents:

```bash
# Validate a file
html5check index.html

# Validate from stdin
cat page.html | html5check -

# JSON output (Nu validator compatible)
html5check --format=json page.html

# GNU-style output for IDE integration
html5check --format=gnu page.html

# Show only errors (suppress warnings)
html5check --errors-only page.html

# Quiet mode - show only counts
html5check --quiet page.html
```

Exit codes: 0 = valid, 1 = validation errors, 2 = I/O error.
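
The exit codes can drive scripts and CI jobs. The following is a minimal sketch, not part of the tool itself: `check_page` is a hypothetical helper name, and it assumes `html5check` is on `PATH` and uses only the `--errors-only` flag shown above.

```bash
#!/usr/bin/env bash
# Sketch: branch on html5check's documented exit codes (0, 1, 2).
# check_page is a hypothetical helper, not something html5rw provides.
check_page() {
  local file=$1
  local status=0

  # Capture the exit status without aborting under `set -e`.
  html5check --errors-only "$file" || status=$?

  case $status in
    0) echo "valid: $file" ;;
    1) echo "validation errors: $file" ;;
    2) echo "I/O error: $file" ;;
    *) echo "unexpected exit status $status: $file" ;;
  esac

  # Propagate the checker's status so callers can react to it.
  return "$status"
}
```

Calling `check_page index.html` prints the checker's own output followed by a one-line summary, and returns the same status so a CI step can fail on a non-zero result.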

## Usage

### Parsing HTML

```ocaml
open Bytesrw

(* Parse HTML from a string *)
let html = "<html><body><p>Hello, world!</p></body></html>"
let reader = Bytes.Reader.of_string html
let doc = Html5rw.parse reader

(* Query with CSS selectors *)
let paragraphs = Html5rw.query doc "p"

(* Extract text content *)
let text = Html5rw.to_text doc

(* Serialize back to HTML *)
let output = Html5rw.to_string doc
```

For fragment parsing (innerHTML):

```ocaml
open Bytesrw

(* Parse as innerHTML of a <div> *)
let ctx = Html5rw.make_fragment_context ~tag_name:"div" ()
let reader = Bytes.Reader.of_string "<p>Fragment content</p>"
let doc = Html5rw.parse ~fragment_context:ctx reader
```

### Validating HTML

```ocaml
open Bytesrw

(* Check HTML from a string *)
let html = "<html><body><p>Hello</p></body></html>"
let reader = Bytes.Reader.of_string html
let result = Htmlrw_check.check reader

(* Check for errors *)
let () =
  if Htmlrw_check.has_errors result then
    print_endline "Document has errors"

(* Get all messages *)
let () =
  let messages = Htmlrw_check.messages result in
  List.iter (fun msg ->
    Format.printf "%a@." Htmlrw_check.pp_message msg
  ) messages

(* Get formatted output *)
let text_output = Htmlrw_check.to_text result
let json_output = Htmlrw_check.to_json result
let gnu_output = Htmlrw_check.to_gnu result
```

The checker validates:
- Parse errors (malformed HTML syntax)
- Content model violations (invalid element nesting)
- Attribute errors (invalid or missing required attributes)
- Structural issues (other conformance problems)

## Installation

```
opam install html5rw
```

## Documentation

API documentation is available via:

```
opam install html5rw
odig doc html5rw
```

## License

MIT