+365
PLAN.md
1
+
# Plan: Nu HTML Validator Test Suite Integration
2
+
3
+
This document describes how to run the Nu HTML Validator test suite against the OCaml `html5_checker` library.
4
+
5
+
## Background
6
+
7
+
The Nu HTML Checker (vnu) is the W3C's official HTML validator. Its test suite in `third_party/validator/tests/` contains ~4,300 HTML test files covering HTML5 validation rules including ARIA, microdata, element nesting, required attributes, and more.
8
+
9
+
## Test Suite Structure
10
+
11
+
### Location
12
+
```
third_party/validator/tests/
├── messages.json     # Expected error messages keyed by test path
├── html/             # 2,601 core HTML5 tests
│   ├── attributes/
│   ├── elements/
│   ├── microdata/
│   ├── mime-types/
│   ├── obsolete/
│   ├── parser/
│   └── ...
├── html-aria/        # 712 ARIA validation tests
├── html-its/         # 90 internationalization tests
├── html-rdfa/        # 212 RDFa tests
├── html-rdfalite/    # 56 RDFa Lite tests
├── html-svg/         # 517 SVG-in-HTML tests
└── xhtml/            # 110 XHTML tests
```
30
+
31
+
### Filename Convention
32
+
33
+
Test files use a suffix to indicate expected outcome:
34
+
35
+
| Suffix | Meaning | Expected Result |
|--------|---------|-----------------|
| `-isvalid.html` | Valid HTML | No errors, no warnings |
| `-novalid.html` | Invalid HTML | At least one error |
| `-haswarn.html` | Valid with warning | At least one warning |
40
+
41
+
### Expected Messages (messages.json)
42
+
43
+
For `-novalid` and `-haswarn` files, `messages.json` contains the expected error/warning message:
44
+
45
+
```json
{
  "html-aria/misc/aria-label-div-novalid.html": "The “aria-label” attribute must not be specified on any “div” element unless...",
  "html/mime-types/001-novalid.html": "Bad value \"text/html \" for attribute \"type\" on element \"link\": Bad MIME type..."
}
```
51
+
52
+
Note: Messages use Unicode curly quotes (U+201C `“` and U+201D `”`).
53
+
54
+
## Implementation Steps
55
+
56
+
### Step 1: Create Messages JSON Parser
57
+
58
+
Create `test/validator_messages.ml`:
59
+
60
+
```ocaml
61
+
(** Parser for third_party/validator/tests/messages.json *)
62
+
63
+
type t = (string, string) Hashtbl.t
64
+
(** Maps test file path to expected error message *)
65
+
66
+
val load : string -> t
67
+
(** [load path] loads messages.json from [path] *)
68
+
69
+
val get : t -> string -> string option
70
+
(** [get messages test_path] returns expected message for test, if any *)
71
+
```
72
+
73
+
Implementation notes:
74
+
- Use the `Jsont` library (already a dependency)
75
+
- Keys are relative paths like `"html/parser/foo-novalid.html"`
76
+
- Values are error message strings with Unicode quotes
77
+
78
+
### Step 2: Create Test File Discovery
79
+
80
+
Create logic to find and classify test files:
81
+
82
+
```ocaml
83
+
type expected_outcome =
84
+
| Valid (* -isvalid.html: expect no errors *)
85
+
| Invalid (* -novalid.html: expect error matching messages.json *)
86
+
| HasWarning (* -haswarn.html: expect warning matching messages.json *)
87
+
88
+
type test_file = {
89
+
path : string; (* Full filesystem path *)
90
+
relative_path : string; (* Path relative to tests/, used as key in messages.json *)
91
+
category : string; (* html, html-aria, etc. *)
92
+
expected : expected_outcome;
93
+
}
94
+
95
+
val discover_tests : string -> test_file list
96
+
(** [discover_tests tests_dir] finds all test files recursively *)
97
+
98
+
val parse_outcome : string -> expected_outcome
99
+
(** [parse_outcome filename] extracts outcome from filename suffix *)
100
+
```
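As a rough illustration of the discovery step, the sketch below uses only the OCaml standard library (4.13 or later for `String.ends_with`). The option-returning `parse_outcome_opt` is a hypothetical variant of `parse_outcome` that lets non-test files be skipped, and the traversal strategy is an assumption rather than a settled design.

```ocaml
(* Sketch only: the types mirror the declarations above. *)
type expected_outcome = Valid | Invalid | HasWarning

type test_file = {
  path : string;
  relative_path : string;
  category : string;
  expected : expected_outcome;
}

(* Classify a filename by its suffix; [None] for files that are not tests.
   Hypothetical helper: the plan itself declares a non-optional [parse_outcome]. *)
let parse_outcome_opt filename =
  let stem = Filename.remove_extension filename in
  if String.ends_with ~suffix:"-isvalid" stem then Some Valid
  else if String.ends_with ~suffix:"-novalid" stem then Some Invalid
  else if String.ends_with ~suffix:"-haswarn" stem then Some HasWarning
  else None

(* Walk [tests_dir] recursively. Relative paths are built with '/' so that
   they can be used directly as keys into messages.json. *)
let discover_tests tests_dir =
  let rec walk rel acc =
    let dir = if rel = "" then tests_dir else Filename.concat tests_dir rel in
    Array.fold_left
      (fun acc name ->
        let rel' = if rel = "" then name else rel ^ "/" ^ name in
        let full = Filename.concat tests_dir rel' in
        if Sys.is_directory full then walk rel' acc
        else
          match parse_outcome_opt name with
          | None -> acc
          | Some expected ->
            (* The category is the first path component, e.g. "html-aria". *)
            let category =
              match String.index_opt rel' '/' with
              | Some i -> String.sub rel' 0 i
              | None -> ""
            in
            { path = full; relative_path = rel'; category; expected } :: acc)
      acc (Sys.readdir dir)
  in
  (* Sort for a deterministic run order. *)
  List.sort (fun a b -> compare a.relative_path b.relative_path) (walk "" [])
```

Matching on the stem rather than on `.html` means the same classifier would also cover files with other extensions, should the suite contain any.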
101
+
102
+
### Step 3: Create Test Runner
103
+
104
+
Create `test/test_validator.ml`:
105
+
106
+
```ocaml
107
+
(** Test runner for Nu HTML Validator test suite *)
108
+
109
+
(** Run a single test, returns (passed, details) *)
110
+
let run_test messages test_file =
111
+
(* 1. Read HTML content *)
112
+
let content = read_file test_file.path in
113
+
114
+
(* 2. Run validator *)
115
+
let result = Html5_checker.check
116
+
~collect_parse_errors:true
117
+
~system_id:test_file.relative_path
118
+
(Bytesrw.Bytes.Reader.of_string content) in
119
+
120
+
(* 3. Check result against expected outcome *)
121
+
match test_file.expected with
122
+
| Valid ->
123
+
(* Should have no errors or warnings *)
124
+
let errors = Html5_checker.errors result in
125
+
let warnings = Html5_checker.warnings result in
126
+
if errors = [] && warnings = [] then
127
+
(true, "OK: No messages")
128
+
else
129
+
(false, Printf.sprintf "Expected valid but got %d errors, %d warnings"
130
+
(List.length errors) (List.length warnings))
131
+
132
+
| Invalid ->
133
+
(* Should have at least one error matching expected message *)
134
+
let errors = Html5_checker.errors result in
135
+
let expected_msg = Validator_messages.get messages test_file.relative_path in
136
+
if errors = [] then
137
+
(false, "Expected error but got none")
138
+
else
139
+
check_message_match errors expected_msg
140
+
141
+
| HasWarning ->
142
+
(* Should have at least one warning matching expected message *)
143
+
let warnings = Html5_checker.warnings result in
144
+
let expected_msg = Validator_messages.get messages test_file.relative_path in
145
+
if warnings = [] then
146
+
(false, "Expected warning but got none")
147
+
else
148
+
check_message_match warnings expected_msg
149
+
```
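The runner above relies on a `check_message_match` helper that the plan does not define. One possible shape is sketched below, assuming the checker's messages have already been reduced to plain strings (how text is extracted from a checker message is not specified here) and assuming that a test with no entry in `messages.json` falls back to a presence-only check. The optional `matches` parameter is hypothetical and defaults to exact equality.

```ocaml
(* [actual] is the list of message texts produced by the checker;
   [expected] is the messages.json entry for the test, if any.
   Returns (passed, details), like [run_test]. *)
let check_message_match ?(matches = String.equal) actual expected =
  match expected with
  | None ->
    (* Assumption: without a recorded message, producing any message is enough. *)
    (true, "OK: no expected message recorded; presence check only")
  | Some exp ->
    if List.exists (fun a -> matches exp a) actual then
      (true, "OK: message matched")
    else
      (false,
       Printf.sprintf "Expected message %S; got %d non-matching message(s)"
         exp (List.length actual))
```

Passing a looser comparison as `matches` lets the same helper serve both strict and flexible matching.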
150
+
151
+
### Step 4: Message Matching Strategy
152
+
153
+
The OCaml checker may produce different message text than the Nu validator. Implement flexible matching:
154
+
155
+
```ocaml
(** Normalize Unicode curly quotes (U+201C, U+201D) to ASCII '"'.
    Both are three-byte UTF-8 sequences (E2 80 9C and E2 80 9D), so a
    byte-wise [String.map] cannot match them. *)
let normalize_quotes s =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec go i =
    if i >= n then Buffer.contents buf
    else if i + 2 < n && s.[i] = '\xE2' && s.[i + 1] = '\x80'
            && (s.[i + 2] = '\x9C' || s.[i + 2] = '\x9D') then begin
      Buffer.add_char buf '"';
      go (i + 3)
    end else begin
      Buffer.add_char buf s.[i];
      go (i + 1)
    end
  in
  go 0

(** Check if actual message matches expected *)
let message_matches ~expected ~actual =
  (* Strategy 1: Exact match *)
  actual = expected
  (* Strategy 2: Normalized match (ignore quote style) *)
  || normalize_quotes actual = normalize_quotes expected
  (* Strategy 3: Substring match. [is_substring] and [extract_core] are local
     helpers still to be written; the stdlib [String] module has no substring test. *)
  || is_substring ~sub:(extract_core expected) actual
```
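The substring strategy needs two helpers that are not defined in this plan. A minimal standard-library sketch follows; `is_substring` is a local stand-in for a substring test, and the `extract_core` heuristic (keep the text before the first `": "`) is purely an assumption about which part of a message is most stable across implementations.

```ocaml
(* True if [sub] occurs anywhere in [s]. Naive scan, fine for short messages. *)
let is_substring ~sub s =
  let n = String.length s and m = String.length sub in
  let rec at i = i + m <= n && (String.sub s i m = sub || at (i + 1)) in
  at 0

(* Hypothetical heuristic: return the text before the first ": ",
   or the whole message when there is no such separator. *)
let extract_core expected =
  let n = String.length expected in
  let rec find i =
    if i + 1 >= n then expected
    else if expected.[i] = ':' && expected.[i + 1] = ' ' then
      String.sub expected 0 i
    else find (i + 1)
  in
  find 0
```

A looser core makes more tests pass but also hides real wording differences, so it may be worth reporting which strategy produced each match.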
172
+
173
+
### Step 5: Test Categories for Selective Running
174
+
175
+
Map tests to checker categories for phased enablement:
176
+
177
+
```ocaml
178
+
type checker_category =
179
+
| Parse_errors (* Built into parser *)
180
+
| Nesting (* Nesting_checker *)
181
+
| Aria (* Aria_checker *)
182
+
| Required_attrs (* Required_attr_checker *)
183
+
| Obsolete (* Obsolete_checker *)
184
+
| Id_uniqueness (* Id_checker *)
185
+
| Table_structure (* Table_checker *)
186
+
| Heading_structure (* Heading_checker *)
187
+
| Form_validation (* Form_checker *)
188
+
| Microdata (* Microdata_checker *)
189
+
| Language (* Language_checker *)
190
+
| Unknown
191
+
192
+
(** Infer category from test path *)
193
+
let categorize_test test_file =
194
+
match test_file.category, extract_subcategory test_file.relative_path with
195
+
| "html-aria", _ -> Aria
196
+
| "html", "parser" -> Parse_errors
197
+
| "html", "elements" -> Nesting (* mostly *)
198
+
| "html", "attributes" -> Required_attrs
199
+
| "html", "obsolete" -> Obsolete
200
+
| "html", "microdata" -> Microdata
201
+
| _ -> Unknown
202
+
```
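`categorize_test` depends on an `extract_subcategory` helper that is not shown. Assuming it simply returns the second component of the relative path (an interpretation, not something the plan states), it could look like this:

```ocaml
(* Return the directory directly below the category, e.g. "parser" for
   "html/parser/foo-novalid.html", or "" when the file sits directly
   inside the category directory. *)
let extract_subcategory relative_path =
  match String.split_on_char '/' relative_path with
  | _category :: sub :: _ :: _ -> sub
  | _ -> ""
```

Requiring at least three components prevents a filename from being mistaken for a subcategory.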
203
+
204
+
### Step 6: Dune Build Integration
205
+
206
+
Add to `test/dune`:
207
+
208
+
```dune
(executable
 (name test_validator)
 (modules test_validator validator_messages)
 (libraries bytesrw html5rw html5rw.checker jsont jsont.bytesrw test_report))

(rule
 (alias validator-tests)
 (deps
  ; glob_files_rec already recurses into subdirectories, so no ** is needed
  (glob_files_rec ../third_party/validator/tests/*.html)
  ../third_party/validator/tests/messages.json)
 (action
  (run %{exe:test_validator.exe} ../third_party/validator/tests)))
```
222
+
223
+
Note: Use a separate alias, `validator-tests`, initially (not `runtest`), since many tests will fail until the checkers are integrated.
224
+
225
+
### Step 7: Reporting
226
+
227
+
Generate reports compatible with existing test infrastructure:
228
+
229
+
```ocaml
230
+
(** Print summary *)
231
+
let print_summary results =
232
+
let by_category = group_by_category results in
233
+
List.iter (fun (cat, tests) ->
234
+
let passed = List.filter fst tests |> List.length in
235
+
let total = List.length tests in
236
+
Printf.printf "%s: %d/%d passed\n" (category_name cat) passed total
237
+
) by_category;
238
+
239
+
(* Overall *)
240
+
let total_passed = List.filter (fun (p, _) -> p) results |> List.length in
241
+
Printf.printf "\nTotal: %d/%d passed\n" total_passed (List.length results)
242
+
243
+
(** Generate HTML report *)
244
+
let write_html_report results filename =
245
+
(* Use Test_report module pattern from other tests *)
246
+
...
247
+
```
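`print_summary` assumes a `group_by_category` helper. A generic grouping function such as the sketch below could back it. The name `group_by` and the choice to preserve first-seen order are assumptions, and the shape of a result value is left to the caller through the `key` function.

```ocaml
(* Group [items] by [key], keeping keys in first-seen order and items in
   their original order within each group. *)
let group_by key items =
  let tbl = Hashtbl.create 16 in
  let order = ref [] in
  List.iter
    (fun item ->
      let k = key item in
      match Hashtbl.find_opt tbl k with
      | Some group -> Hashtbl.replace tbl k (item :: group)
      | None ->
        order := k :: !order;
        Hashtbl.add tbl k [ item ])
    items;
  (* [order] and each group were built in reverse, so reverse both. *)
  List.rev_map (fun k -> (k, List.rev (Hashtbl.find tbl k))) !order
```

Stable ordering keeps the per-category summary lines in the same order from run to run, which makes reports easier to compare.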
248
+
249
+
## Prerequisites
250
+
251
+
Before tests can pass, the following must be completed:
252
+
253
+
### 1. Wire Checker Registry into Html5_checker.check
254
+
255
+
In `lib/html5_checker/html5_checker.ml`, the `check` and `check_dom` functions currently only collect parse errors. They also need to run the registered checkers over the parsed document, along these lines:
256
+
257
+
```ocaml
258
+
let check ?collect_parse_errors ?system_id reader =
259
+
let doc = Html5rw.parse reader in
260
+
let collector = Message_collector.create () in
261
+
262
+
(* Collect parse errors if requested *)
263
+
if Option.value ~default:false collect_parse_errors then
264
+
Parse_error_bridge.collect_parse_errors ?system_id doc
265
+
|> List.iter (Message_collector.add collector);
266
+
267
+
(* TODO: Run checkers - THIS NEEDS TO BE IMPLEMENTED *)
268
+
let registry = Checker_registry.default () in
269
+
Dom_walker.walk_registry registry collector (Html5rw.root doc);
270
+
271
+
{ document = doc; collector; system_id }
272
+
```
273
+
274
+
### 2. Populate Default Checker Registry
275
+
276
+
In `lib/html5_checker/checker_registry.ml`:
277
+
278
+
```ocaml
279
+
let default () =
280
+
let reg = create () in
281
+
register reg "nesting" Nesting_checker.checker;
282
+
register reg "aria" Aria_checker.checker;
283
+
register reg "required-attrs" Required_attr_checker.checker;
284
+
register reg "obsolete" Obsolete_checker.checker;
285
+
register reg "id" Id_checker.checker;
286
+
register reg "table" Table_checker.checker;
287
+
register reg "heading" Heading_checker.checker;
288
+
register reg "form" Form_checker.checker;
289
+
register reg "microdata" Microdata_checker.checker;
290
+
register reg "language" Language_checker.checker;
291
+
reg
292
+
```
293
+
294
+
### 3. Ensure Checkers Produce Compatible Messages
295
+
296
+
Review each checker's error messages against `messages.json` to ensure they can be matched. The checkers may need to:
297
+
- Use curly quotes in messages
298
+
- Match Nu validator's phrasing
299
+
- Include element/attribute names in same format
300
+
301
+
## Phased Rollout
302
+
303
+
Run tests incrementally as checkers are integrated:
304
+
305
+
| Phase | Command | What's Tested |
|-------|---------|---------------|
| 1 | `--category=parse` | Parse errors only (~200 tests) |
| 2 | `--category=nesting` | + Nesting checker |
| 3 | `--category=aria` | + ARIA checker (~700 tests) |
| 4 | `--category=required` | + Required attributes |
| 5 | (all) | Full suite |
312
+
313
+
Implement command-line filtering:
314
+
315
+
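```ocaml
(* Map a --category value to a checker_category; the names follow the
   phased rollout table. *)
let category_of_string = function
  | "parse" -> Parse_errors
  | "nesting" -> Nesting
  | "aria" -> Aria
  | "required" -> Required_attrs
  | other -> invalid_arg ("unknown category: " ^ other)

let () =
  let tests_dir = Sys.argv.(1) in
  let prefix = "--category=" in
  let category_filter =
    if Array.length Sys.argv > 2 && String.starts_with ~prefix Sys.argv.(2) then
      let plen = String.length prefix in
      let name = String.sub Sys.argv.(2) plen (String.length Sys.argv.(2) - plen) in
      Some (category_of_string name)
    else None in

  let messages = Validator_messages.load (Filename.concat tests_dir "messages.json") in
  let tests = discover_tests tests_dir in
  let tests = match category_filter with
    | Some cat -> List.filter (fun t -> categorize_test t = cat) tests
    | None -> tests in

  run_tests messages tests
```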
329
+
330
+
## Expected Test Counts by Category
331
+
332
+
Based on file counts in `third_party/validator/tests/`:
333
+
334
+
| Category | Files | Notes |
|----------|-------|-------|
| html/ | 2,601 | Core HTML5 validation |
| html-aria/ | 712 | ARIA attributes |
| html-svg/ | 517 | SVG embedded in HTML |
| html-rdfa/ | 212 | RDFa semantic markup |
| xhtml/ | 110 | XHTML variant |
| html-its/ | 90 | Internationalization |
| html-rdfalite/ | 56 | RDFa Lite |
| **Total** | **~4,300** | |
344
+
345
+
## Files to Create
346
+
347
+
| File | Purpose |
|------|---------|
| `test/validator_messages.ml` | Load and query messages.json |
| `test/test_validator.ml` | Main test runner |
| `test/dune` (modify) | Add build rules |
352
+
353
+
## Success Criteria
354
+
355
+
1. All `-isvalid.html` tests pass (no false positives)
356
+
2. All `-novalid.html` tests produce at least one error
357
+
3. All `-haswarn.html` tests produce at least one warning
358
+
4. Message content matches for implemented checkers
359
+
5. HTML report generated for review
360
+
361
+
## Reference
362
+
363
+
- Nu HTML Checker: https://validator.github.io/validator/
364
+
- Test harness reference: `third_party/validator/resources/examples/test-harness/validator-tester.py`
365
+
- Existing OCaml tests: `test/test_html5lib.ml`, `test/test_tokenizer.ml`