# Plan: Nu HTML Validator Test Suite Integration

This document describes how to run the Nu HTML Validator test suite against the OCaml `html5_checker` library.

## Background

The Nu HTML Checker (vnu) is the W3C's official HTML validator. Its test suite in `third_party/validator/tests/` contains ~4,300 HTML test files covering HTML5 validation rules including ARIA, microdata, element nesting, required attributes, and more.

## Test Suite Structure

### Location
```
third_party/validator/tests/
├── messages.json          # Expected error messages keyed by test path
├── html/                  # 2,601 core HTML5 tests
│   ├── attributes/
│   ├── elements/
│   ├── microdata/
│   ├── mime-types/
│   ├── obsolete/
│   ├── parser/
│   └── ...
├── html-aria/             # 712 ARIA validation tests
├── html-its/              # 90 internationalization tests
├── html-rdfa/             # 212 RDFa tests
├── html-rdfalite/         # 56 RDFa Lite tests
├── html-svg/              # 517 SVG-in-HTML tests
└── xhtml/                 # 110 XHTML tests
```

### Filename Convention

Test files use a suffix to indicate the expected outcome:

| Suffix | Meaning | Expected Result |
|--------|---------|-----------------|
| `-isvalid.html` | Valid HTML | No errors, no warnings |
| `-novalid.html` | Invalid HTML | At least one error |
| `-haswarn.html` | Valid with warning | At least one warning |
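
The suffix table can be turned into a small classifier. The following is a minimal sketch, not the planned `parse_outcome` itself: the name `outcome_of_filename` and the choice to return an `option` (so unrelated files are skipped rather than raising) are assumptions made for illustration. It ignores the file extension, and `String.ends_with` requires OCaml 4.13 or later.

```ocaml
type expected_outcome = Valid | Invalid | HasWarning

(* Hypothetical helper: classify a file by the suffix of its base name,
   ignoring the extension. Returns None when no known suffix is present. *)
let outcome_of_filename name =
  let stem = Filename.remove_extension (Filename.basename name) in
  if String.ends_with ~suffix:"-isvalid" stem then Some Valid
  else if String.ends_with ~suffix:"-novalid" stem then Some Invalid
  else if String.ends_with ~suffix:"-haswarn" stem then Some HasWarning
  else None
```

Returning `None` lets a directory walk silently skip files such as `messages.json` that do not follow the convention.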

### Expected Messages (messages.json)

For `-novalid` and `-haswarn` files, `messages.json` contains the expected error/warning message:

```json
{
  "html-aria/misc/aria-label-div-novalid.html": "The “aria-label” attribute must not be specified on any “div” element unless...",
  "html/mime-types/001-novalid.html": "Bad value \"text/html \" for attribute \"type\" on element \"link\": Bad MIME type..."
}
```

Note: Messages use Unicode curly quotes (U+201C `“` and U+201D `”`).

## Implementation Steps

### Step 1: Create Messages JSON Parser

Create `test/validator_messages.ml` exposing this interface:

```ocaml
(** Parser for third_party/validator/tests/messages.json *)

type t = (string, string) Hashtbl.t
(** Maps test file path to expected error message *)

val load : string -> t
(** [load path] loads messages.json from [path] *)

val get : t -> string -> string option
(** [get messages test_path] returns expected message for test, if any *)
```

Implementation notes:
- Use the `Jsont` library (already a dependency)
- Keys are relative paths like `"html/parser/foo-novalid.html"`
- Values are error message strings with Unicode quotes
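
The lookup half of this module needs only the standard library. The sketch below assumes the JSON object has already been decoded into a list of `(path, message)` pairs; the `Jsont` decoding step is deliberately left out, and `of_pairs` is a hypothetical helper name, not part of the planned interface.

```ocaml
type t = (string, string) Hashtbl.t

(* Hypothetical helper: build the lookup table from already-decoded
   (path, message) pairs. A later duplicate key overrides an earlier one. *)
let of_pairs pairs : t =
  let tbl = Hashtbl.create 64 in
  List.iter (fun (path, msg) -> Hashtbl.replace tbl path msg) pairs;
  tbl

(* [get messages test_path] returns the expected message, if any. *)
let get (messages : t) test_path = Hashtbl.find_opt messages test_path
```

With this split, `load` would reduce to reading the file, decoding it to pairs, and calling `of_pairs`.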

### Step 2: Create Test File Discovery

Create logic to find and classify test files:

```ocaml
type expected_outcome =
  | Valid      (* -isvalid.html: expect no errors *)
  | Invalid    (* -novalid.html: expect error matching messages.json *)
  | HasWarning (* -haswarn.html: expect warning matching messages.json *)

type test_file = {
  path : string;          (* Full filesystem path *)
  relative_path : string; (* Path relative to tests/, used as key in messages.json *)
  category : string;      (* html, html-aria, etc. *)
  expected : expected_outcome;
}

val discover_tests : string -> test_file list
(** [discover_tests tests_dir] finds all test files recursively *)

val parse_outcome : string -> expected_outcome
(** [parse_outcome filename] extracts outcome from filename suffix *)
```
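
The recursive walk behind `discover_tests` could look roughly like the sketch below. It only lists `.html` files and derives the top-level category; building full `test_file` records is left out. The names `list_test_files` and `category_of_relative_path` are hypothetical, and the code assumes `/` as the path separator so that relative paths line up with the keys in `messages.json`.

```ocaml
(* Recursively list .html files under [root], returning paths relative to
   [root] with '/' separators, sorted for a deterministic order. *)
let list_test_files root =
  let rec walk rel acc =
    let dir = if rel = "" then root else Filename.concat root rel in
    Array.fold_left
      (fun acc name ->
        let rel' = if rel = "" then name else rel ^ "/" ^ name in
        if Sys.is_directory (Filename.concat dir name) then walk rel' acc
        else if Filename.check_suffix name ".html" then rel' :: acc
        else acc)
      acc (Sys.readdir dir)
  in
  List.sort compare (walk "" [])

(* The category is the first path component, e.g. "html-aria". *)
let category_of_relative_path rel =
  match String.index_opt rel '/' with
  | Some i -> String.sub rel 0 i
  | None -> ""
```

Sorting the result keeps test ordering stable across runs, which makes report diffs easier to read.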

### Step 3: Create Test Runner

Create `test/test_validator.ml`:

```ocaml
(** Test runner for Nu HTML Validator test suite *)

(** Run a single test, returns (passed, details) *)
let run_test messages test_file =
  (* 1. Read HTML content *)
  let content = read_file test_file.path in

  (* 2. Run validator *)
  let result = Html5_checker.check
      ~collect_parse_errors:true
      ~system_id:test_file.relative_path
      (Bytesrw.Bytes.Reader.of_string content) in

  (* 3. Check result against expected outcome *)
  match test_file.expected with
  | Valid ->
    (* Should have no errors or warnings *)
    let errors = Html5_checker.errors result in
    let warnings = Html5_checker.warnings result in
    if errors = [] && warnings = [] then
      (true, "OK: No messages")
    else
      (false, Printf.sprintf "Expected valid but got %d errors, %d warnings"
         (List.length errors) (List.length warnings))

  | Invalid ->
    (* Should have at least one error matching expected message *)
    let errors = Html5_checker.errors result in
    let expected_msg = Validator_messages.get messages test_file.relative_path in
    if errors = [] then
      (false, "Expected error but got none")
    else
      check_message_match errors expected_msg

  | HasWarning ->
    (* Should have at least one warning matching expected message *)
    let warnings = Html5_checker.warnings result in
    let expected_msg = Validator_messages.get messages test_file.relative_path in
    if warnings = [] then
      (false, "Expected warning but got none")
    else
      check_message_match warnings expected_msg
```
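
The runner above calls `check_message_match` without defining it. One possible shape is sketched below. Because the type of the values returned by `Html5_checker.errors` is not shown in this plan, the sketch takes two extra parameters that are assumptions for illustration: `matches`, the comparison predicate, and `text_of`, which extracts a message's text. Treating a missing `messages.json` entry as a pass is also an assumed policy.

```ocaml
(* Sketch only: [matches] and [text_of] are assumed parameters, since the
   checker's message type is not specified in this plan. *)
let check_message_match ~matches ~text_of reported expected =
  match expected with
  | None ->
    (* Assumed policy: with no recorded message, presence is enough. *)
    (true, "OK: no expected message recorded")
  | Some exp ->
    if List.exists (fun m -> matches ~expected:exp ~actual:(text_of m)) reported
    then (true, "OK: message matched")
    else (false, Printf.sprintf "No message matched expected: %s" exp)
```

In the runner, the two extra arguments would be partially applied once so that call sites keep the two-argument form used above.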
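
### Step 4: Message Matching Strategy

The OCaml checker may produce message text that differs from the Nu validator's. Implement flexible matching:

```ocaml
(** Replace Unicode curly quotes (U+201C, U+201D) with the ASCII double
    quote. They are three bytes each in UTF-8, so [String.map], which works
    on single bytes, cannot be used here. *)
let normalize_quotes s =
  let n = String.length s in
  let buf = Buffer.create n in
  let i = ref 0 in
  while !i < n do
    if !i + 2 < n && s.[!i] = '\xE2' && s.[!i + 1] = '\x80'
       && (s.[!i + 2] = '\x9C' || s.[!i + 2] = '\x9D')
    then (Buffer.add_char buf '\x22'; i := !i + 3)
    else (Buffer.add_char buf s.[!i]; incr i)
  done;
  Buffer.contents buf

(** [contains ~sub s] is true if [sub] occurs in [s] (stdlib has no such function) *)
let contains ~sub s =
  let n = String.length s and m = String.length sub in
  let rec go i = i + m <= n && (String.sub s i m = sub || go (i + 1)) in
  go 0

(** Check if actual message matches expected *)
let message_matches ~expected ~actual =
  (* Strategy 1: Exact match *)
  actual = expected
  (* Strategy 2: Normalized match (ignore quote style) *)
  || normalize_quotes actual = normalize_quotes expected
  (* Strategy 3: Substring match; [extract_core] is still to be defined *)
  || contains ~sub:(extract_core expected) actual
```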
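
The plan does not define `extract_core`. As one possible heuristic, and purely an assumption for illustration, it could keep the part of the expected message before the first colon. For a message shaped like the `mime-types` sample, that keeps the "Bad value ... on element ..." portion and drops the detailed reason.

```ocaml
(* Hypothetical heuristic: the core of a message is the text before the
   first colon, trimmed; messages without a colon are returned trimmed. *)
let extract_core expected =
  match String.index_opt expected ':' with
  | Some i -> String.trim (String.sub expected 0 i)
  | None -> String.trim expected
```

A known weakness of this heuristic is that a colon inside a quoted value, such as a URL, cuts the core short, so a real implementation may need something more careful.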

### Step 5: Test Categories for Selective Running

Map tests to checker categories for phased enablement:

```ocaml
type checker_category =
  | Parse_errors      (* Built into parser *)
  | Nesting           (* Nesting_checker *)
  | Aria              (* Aria_checker *)
  | Required_attrs    (* Required_attr_checker *)
  | Obsolete          (* Obsolete_checker *)
  | Id_uniqueness     (* Id_checker *)
  | Table_structure   (* Table_checker *)
  | Heading_structure (* Heading_checker *)
  | Form_validation   (* Form_checker *)
  | Microdata         (* Microdata_checker *)
  | Language          (* Language_checker *)
  | Unknown

(** Infer category from test path *)
let categorize_test test_file =
  match test_file.category, extract_subcategory test_file.relative_path with
  | "html-aria", _ -> Aria
  | "html", "parser" -> Parse_errors
  | "html", "elements" -> Nesting (* mostly *)
  | "html", "attributes" -> Required_attrs
  | "html", "obsolete" -> Obsolete
  | "html", "microdata" -> Microdata
  | _ -> Unknown
```
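
`extract_subcategory` is used above but not defined. A minimal sketch, assuming the subcategory is the second path component of the relative path (as in `html/parser/...`) and that files sitting directly under a category directory have none:

```ocaml
(* Sketch: return the second path component, or "" when the file sits
   directly under its category directory or the path has no directory. *)
let extract_subcategory relative_path =
  match String.split_on_char '/' relative_path with
  | _category :: sub :: _ :: _ -> sub
  | _ -> ""
```

Returning the empty string rather than an option keeps the tuple pattern match in `categorize_test` unchanged; such files fall through to `Unknown`.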

### Step 6: Dune Build Integration

Add to `test/dune`:

```dune
(executable
 (name test_validator)
 (modules test_validator validator_messages)
 (libraries bytesrw html5rw html5rw.checker jsont jsont.bytesrw test_report))

(rule
 (alias validator-tests)
 (deps
  (glob_files_rec ../third_party/validator/tests/*.html)
  ../third_party/validator/tests/messages.json)
 (action
  (run %{exe:test_validator.exe} ../third_party/validator/tests)))
```

Note: `glob_files_rec` already descends into subdirectories, so the glob itself only needs to match file names. Use a separate alias `validator-tests` initially (not `runtest`), since many tests will fail until the checkers are integrated.

### Step 7: Reporting

Generate reports compatible with existing test infrastructure:

```ocaml
(** Print summary *)
let print_summary results =
  let by_category = group_by_category results in
  List.iter (fun (cat, tests) ->
    let passed = List.filter fst tests |> List.length in
    let total = List.length tests in
    Printf.printf "%s: %d/%d passed\n" (category_name cat) passed total
  ) by_category;

  (* Overall *)
  let total_passed = List.filter (fun (p, _) -> p) results |> List.length in
  Printf.printf "\nTotal: %d/%d passed\n" total_passed (List.length results)

(** Generate HTML report *)
let write_html_report results filename =
  (* Use Test_report module pattern from other tests *)
  ...
```
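
The grouping step used by `print_summary` can be written generically. The sketch below is an assumption about how `group_by_category` might be built: it groups `(key, value)` pairs, so the caller would first pair each result with its category. The name `group_by_key` is hypothetical.

```ocaml
(* Group (key, value) pairs by key, keeping keys in first-seen order and
   values in their original order within each group. *)
let group_by_key pairs =
  let tbl = Hashtbl.create 16 in
  let order = ref [] in
  List.iter
    (fun (k, v) ->
      match Hashtbl.find_opt tbl k with
      | Some vs -> Hashtbl.replace tbl k (v :: vs)
      | None ->
        order := k :: !order;
        Hashtbl.replace tbl k [ v ])
    pairs;
  List.rev_map (fun k -> (k, List.rev (Hashtbl.find tbl k))) !order
```

Preserving first-seen order means the summary lists categories in the order the tests were discovered.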

## Prerequisites

Before tests can pass, the following must be completed:

### 1. Wire Checker Registry into Html5_checker.check

In `lib/html5_checker/html5_checker.ml`, the `check` and `check_dom` functions currently only collect parse errors. They also need to run the registered checkers over the DOM:

```ocaml
let check ?collect_parse_errors ?system_id reader =
  let doc = Html5rw.parse reader in
  let collector = Message_collector.create () in

  (* Collect parse errors if requested *)
  if Option.value ~default:false collect_parse_errors then
    Parse_error_bridge.collect_parse_errors ?system_id doc
    |> List.iter (Message_collector.add collector);

  (* TODO: Run checkers - THIS NEEDS TO BE IMPLEMENTED *)
  let registry = Checker_registry.default () in
  Dom_walker.walk_registry registry collector (Html5rw.root doc);

  { document = doc; collector; system_id }
```

### 2. Populate Default Checker Registry

In `lib/html5_checker/checker_registry.ml`:

```ocaml
let default () =
  let reg = create () in
  register reg "nesting" Nesting_checker.checker;
  register reg "aria" Aria_checker.checker;
  register reg "required-attrs" Required_attr_checker.checker;
  register reg "obsolete" Obsolete_checker.checker;
  register reg "id" Id_checker.checker;
  register reg "table" Table_checker.checker;
  register reg "heading" Heading_checker.checker;
  register reg "form" Form_checker.checker;
  register reg "microdata" Microdata_checker.checker;
  register reg "language" Language_checker.checker;
  reg
```

### 3. Ensure Checkers Produce Compatible Messages

Review each checker's error messages against `messages.json` to ensure they can be matched. This may require:
- Using curly quotes in messages
- Matching the Nu validator's phrasing
- Including element/attribute names in the same format

## Phased Rollout

Run tests incrementally as checkers are integrated:

| Phase | Command | What's Tested |
|-------|---------|---------------|
| 1 | `--category=parse` | Parse errors only (~200 tests) |
| 2 | `--category=nesting` | + Nesting checker |
| 3 | `--category=aria` | + ARIA checker (~700 tests) |
| 4 | `--category=required` | + Required attributes |
| 5 | (all) | Full suite |

Implement command-line filtering:

```ocaml
let () =
  let tests_dir = Sys.argv.(1) in
  (* Optional second argument, e.g. "--category=aria".
     [category_of_flag : string -> checker_category] is a helper still to
     be written; comparing the raw argument string with a
     [checker_category] value would not type-check. *)
  let category_filter =
    if Array.length Sys.argv > 2 then Some (category_of_flag Sys.argv.(2))
    else None in

  let messages = Validator_messages.load (tests_dir ^ "/messages.json") in
  let tests = discover_tests tests_dir in
  let tests = match category_filter with
    | Some cat -> List.filter (fun t -> categorize_test t = cat) tests
    | None -> tests in

  run_tests messages tests
```
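
The `--category=NAME` arguments in the table need to be turned into category values before filtering. A minimal sketch follows, under the assumption that the names are exactly those shown in the table. The function names are hypothetical, the variant type is a cut-down copy of `checker_category` so the block stands alone, and `String.starts_with` requires OCaml 4.13 or later.

```ocaml
(* Cut-down copy of the plan's checker_category, for a standalone example. *)
type checker_category = Parse_errors | Nesting | Aria | Required_attrs

(* Returns the NAME part of "--category=NAME", or None for other arguments. *)
let category_name_of_flag arg =
  let prefix = "--category=" in
  if String.starts_with ~prefix arg then
    let n = String.length prefix in
    Some (String.sub arg n (String.length arg - n))
  else None

(* Map the names used in the rollout table to category values. *)
let category_of_name = function
  | "parse" -> Some Parse_errors
  | "nesting" -> Some Nesting
  | "aria" -> Some Aria
  | "required" -> Some Required_attrs
  | _ -> None

(* Combine both steps; None means "no recognised filter". *)
let category_of_arg arg = Option.bind (category_name_of_flag arg) category_of_name
```

Returning `None` for unrecognised input leaves it to the caller to decide between running the full suite and reporting a usage error.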

## Expected Test Counts by Category

Based on file counts in `third_party/validator/tests/`:

| Category | Files | Notes |
|----------|-------|-------|
| html/ | 2,601 | Core HTML5 validation |
| html-aria/ | 712 | ARIA attributes |
| html-svg/ | 517 | SVG embedded in HTML |
| html-rdfa/ | 212 | RDFa semantic markup |
| xhtml/ | 110 | XHTML variant |
| html-its/ | 90 | Internationalization |
| html-rdfalite/ | 56 | RDFa Lite |
| **Total** | **~4,300** | |

## Files to Create

| File | Purpose |
|------|---------|
| `test/validator_messages.ml` | Load and query messages.json |
| `test/test_validator.ml` | Main test runner |
| `test/dune` (modify) | Add build rules |

## Success Criteria

1. All `-isvalid.html` tests pass (no false positives)
2. All `-novalid.html` tests produce at least one error
3. All `-haswarn.html` tests produce at least one warning
4. Message content matches for implemented checkers
5. HTML report generated for review

## Reference

- Nu HTML Checker: https://validator.github.io/validator/
- Test harness reference: `third_party/validator/resources/examples/test-harness/validator-tester.py`
- Existing OCaml tests: `test/test_html5lib.ml`, `test/test_tokenizer.ml`