+71
-3
README.md
···
1
-
# html5rw - Pure OCaml HTML5 Parser
2
3
-
A pure OCaml HTML5 parser implementing the WHATWG HTML5 parsing specification.
4
This library passes the html5lib-tests suite and provides full support for
5
-
tokenization, tree construction, encoding detection, and CSS selector queries.
6
This library was ported from [JustHTML](https://github.com/EmilStenstrom/justhtml/).
7
8
## Key Features
9
10
- **WHATWG Compliant**: Implements the full HTML5 parsing algorithm with proper error recovery
11
- **CSS Selectors**: Query the DOM using standard CSS selector syntax
12
- **Streaming I/O**: Uses bytesrw for efficient streaming input/output
13
- **Encoding Detection**: Automatic character encoding detection following the WHATWG algorithm
14
- **Entity Decoding**: Complete HTML5 named character reference support
15
16
## Usage
17
18
```ocaml
19
open Bytesrw
···
41
let reader = Bytes.Reader.of_string "<p>Fragment content</p>"
42
let doc = Html5rw.parse ~fragment_context:ctx reader
43
```
44
45
## Installation
46
···
1
+
# html5rw - Pure OCaml HTML5 Parser and Conformance Checker
2
3
+
A pure OCaml HTML5 parser and validator implementing the WHATWG HTML5 specification.
4
This library passes the html5lib-tests suite and provides full support for
5
+
tokenization, tree construction, encoding detection, CSS selector queries, and
6
+
conformance checking.
7
This library was ported from [JustHTML](https://github.com/EmilStenstrom/justhtml/).
8
9
## Key Features
10
11
- **WHATWG Compliant**: Implements the full HTML5 parsing algorithm with proper error recovery
12
+
- **Conformance Checker**: Validates HTML5 documents against the WHATWG specification
13
- **CSS Selectors**: Query the DOM using standard CSS selector syntax
14
- **Streaming I/O**: Uses bytesrw for efficient streaming input/output
15
- **Encoding Detection**: Automatic character encoding detection following the WHATWG algorithm
16
- **Entity Decoding**: Complete HTML5 named character reference support
17
+
- **Multiple Output Formats**: Text, JSON (Nu validator compatible), and GNU-style output
18
+
19
+
## Libraries
20
+
21
+
- `html5rw` - Core HTML5 parser
22
+
- `html5rw.check` - Conformance checker library
23
+
24
+
## Command Line Tool
25
+
26
+
The `html5check` CLI validates HTML5 documents:
27
+
28
+
```bash
# Validate a file
html5check index.html

# Validate from stdin
cat page.html | html5check -

# JSON output (Nu validator compatible)
html5check --format=json page.html

# GNU-style output for IDE integration
html5check --format=gnu page.html

# Show only errors (suppress warnings)
html5check --errors-only page.html

# Quiet mode - show only counts
html5check --quiet page.html
```
47
+
48
+
Exit codes: 0 = valid, 1 = validation errors, 2 = I/O error.
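
As a minimal sketch of how a build script might act on these exit codes, the snippet below maps them to a variant and shells out to the CLI. The names `outcome`, `outcome_of_exit_code`, and `check_file` are hypothetical and not part of html5rw; the sketch assumes `html5check` is on `PATH` and uses only the OCaml standard library (`Filename.quote_command` needs OCaml 4.10 or later).

```ocaml
(* Hypothetical helper, not part of html5rw: interpret the exit codes
   documented above for the html5check CLI. *)
type outcome =
  | Valid              (* exit code 0 *)
  | Validation_errors  (* exit code 1 *)
  | Io_error           (* exit code 2 *)
  | Unknown of int     (* any other code; not documented above *)

let outcome_of_exit_code = function
  | 0 -> Valid
  | 1 -> Validation_errors
  | 2 -> Io_error
  | n -> Unknown n

(* Run html5check on one file in quiet mode and classify the result.
   Assumes the html5check binary is installed and on PATH. *)
let check_file path =
  let cmd = Filename.quote_command "html5check" [ "--quiet"; path ] in
  outcome_of_exit_code (Sys.command cmd)
```

Keeping the code-to-variant mapping separate from the process call means the mapping can be tested without the binary installed.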
49
50
## Usage
51
+
52
+
### Parsing HTML
53
54
```ocaml
55
open Bytesrw
···
77
let reader = Bytes.Reader.of_string "<p>Fragment content</p>"
78
let doc = Html5rw.parse ~fragment_context:ctx reader
79
```
80
+
81
+
### Validating HTML
82
+
83
+
```ocaml
open Bytesrw

(* Check HTML from a string *)
let html = "<html><body><p>Hello</p></body></html>"
let reader = Bytes.Reader.of_string html
let result = Htmlrw_check.check reader

let () =
  (* Check for errors *)
  if Htmlrw_check.has_errors result then
    print_endline "Document has errors";

  (* Get all messages and print each one *)
  let messages = Htmlrw_check.messages result in
  List.iter (fun msg ->
    Format.printf "%a@." Htmlrw_check.pp_message msg
  ) messages

(* Get formatted output in each supported format *)
let text_output = Htmlrw_check.to_text result
let json_output = Htmlrw_check.to_json result
let gnu_output = Htmlrw_check.to_gnu result
```
106
+
107
+
The checker validates:
108
+
- Parse errors (malformed HTML syntax)
109
+
- Content model violations (invalid element nesting)
110
+
- Attribute errors (invalid or missing required attributes)
111
+
- Structural issues (other conformance problems)
112
113
## Installation
114