# Plan: Nu HTML Validator Test Suite Integration
This document describes how to run the Nu HTML Validator test suite against the OCaml `html5_checker` library.
## Background
The Nu HTML Checker (vnu) is the W3C's official HTML validator. Its test suite in `third_party/validator/tests/` contains ~4,300 HTML test files covering HTML5 validation rules including ARIA, microdata, element nesting, required attributes, and more.
## Test Suite Structure
### Location
```
third_party/validator/tests/
├── messages.json # Expected error messages keyed by test path
├── html/ # 2,601 core HTML5 tests
│ ├── attributes/
│ ├── elements/
│ ├── microdata/
│ ├── mime-types/
│ ├── obsolete/
│ ├── parser/
│ └── ...
├── html-aria/ # 712 ARIA validation tests
├── html-its/ # 90 internationalization tests
├── html-rdfa/ # 212 RDFa tests
├── html-rdfalite/ # 56 RDFa Lite tests
├── html-svg/ # 517 SVG-in-HTML tests
└── xhtml/ # 110 XHTML tests
```
### Filename Convention
Test files use a suffix to indicate expected outcome:
| Suffix | Meaning | Expected Result |
|--------|---------|-----------------|
| `-isvalid.html` | Valid HTML | No errors, no warnings |
| `-novalid.html` | Invalid HTML | At least one error |
| `-haswarn.html` | Valid with warning | At least one warning |
### Expected Messages (messages.json)
For `-novalid` and `-haswarn` files, `messages.json` contains the expected error/warning message:
```json
{
"html-aria/misc/aria-label-div-novalid.html": "The “aria-label” attribute must not be specified on any “div” element unless...",
"html/mime-types/001-novalid.html": "Bad value \"text/html \" for attribute \"type\" on element \"link\": Bad MIME type..."
}
```
Note: Messages use Unicode curly quotes (U+201C `“` and U+201D `”`).
## Implementation Steps
### Step 1: Create Messages JSON Parser
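Create `test/validator_messages.ml` exposing the following interface (the `val` signatures themselves belong in a matching `validator_messages.mli`):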
```ocaml
(** Parser for third_party/validator/tests/messages.json *)
type t = (string, string) Hashtbl.t
(** Maps test file path to expected error message *)
val load : string -> t
(** [load path] loads messages.json from [path] *)
val get : t -> string -> string option
(** [get messages test_path] returns expected message for test, if any *)
```
Implementation notes:
- Use `Jsont` library (already a dependency)
- Keys are relative paths like `"html/parser/foo-novalid.html"`
- Values are error message strings with Unicode quotes
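
As a rough illustration of the table side of this module, the sketch below builds the `Hashtbl` from already-decoded key/value pairs and implements `get`. The JSON decoding itself is deliberately left out, since it depends on the `Jsont` API; `of_pairs` is a hypothetical helper name that does not appear in the plan.

```ocaml
type t = (string, string) Hashtbl.t

(* Hypothetical constructor: build the table from (test path, message) pairs
   that have already been decoded from messages.json. *)
let of_pairs (pairs : (string * string) list) : t =
  let tbl = Hashtbl.create (List.length pairs) in
  List.iter (fun (path, msg) -> Hashtbl.replace tbl path msg) pairs;
  tbl

(* [get messages test_path] returns the expected message for a test, if any. *)
let get (messages : t) test_path = Hashtbl.find_opt messages test_path
```

With this shape, `load` would only need to decode the JSON object into pairs and pass them to `of_pairs`.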
### Step 2: Create Test File Discovery
Create logic to find and classify test files:
```ocaml
type expected_outcome =
  | Valid       (* -isvalid.html: expect no errors or warnings *)
  | Invalid     (* -novalid.html: expect error matching messages.json *)
  | HasWarning  (* -haswarn.html: expect warning matching messages.json *)

type test_file = {
  path : string;           (* Full filesystem path *)
  relative_path : string;  (* Path relative to tests/, used as key in messages.json *)
  category : string;       (* html, html-aria, etc. *)
  expected : expected_outcome;
}

val discover_tests : string -> test_file list
(** [discover_tests tests_dir] finds all test files recursively *)

val parse_outcome : string -> expected_outcome
(** [parse_outcome filename] extracts outcome from filename suffix *)
```
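
One way the discovery logic could look, using only the standard library (OCaml 4.13+ for `String.ends_with`). This is a sketch under assumptions: `parse_outcome_opt`, `category_of_relative_path`, and `list_files` are hypothetical names, and the classifier returns an `option` so that files without a recognised suffix are skipped rather than raising, which differs from the `parse_outcome` signature above.

```ocaml
type expected_outcome = Valid | Invalid | HasWarning

type test_file = {
  path : string;
  relative_path : string;
  category : string;
  expected : expected_outcome;
}

(* Classify by filename suffix; None for files that are not test cases. *)
let parse_outcome_opt filename =
  if String.ends_with ~suffix:"-isvalid.html" filename then Some Valid
  else if String.ends_with ~suffix:"-novalid.html" filename then Some Invalid
  else if String.ends_with ~suffix:"-haswarn.html" filename then Some HasWarning
  else None

(* The category is the first component of the relative path, e.g. "html-aria". *)
let category_of_relative_path rel =
  match String.split_on_char '/' rel with
  | first :: _ :: _ -> first
  | _ -> ""

(* Recursively list files under [root] as '/'-separated paths relative to it. *)
let list_files root =
  let rec go rel =
    let dir = if rel = "" then root else Filename.concat root rel in
    Sys.readdir dir |> Array.to_list |> List.sort String.compare
    |> List.concat_map (fun name ->
         let rel' = if rel = "" then name else rel ^ "/" ^ name in
         if Sys.is_directory (Filename.concat root rel') then go rel' else [ rel' ])
  in
  go ""

(* Keep only files whose name carries one of the three outcome suffixes. *)
let discover_tests tests_dir =
  list_files tests_dir
  |> List.filter_map (fun relative_path ->
       match parse_outcome_opt (Filename.basename relative_path) with
       | None -> None
       | Some expected ->
         Some { path = Filename.concat tests_dir relative_path;
                relative_path;
                category = category_of_relative_path relative_path;
                expected })
```

Sorting the directory entries keeps the test order stable between runs, which makes report diffs easier to read.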
### Step 3: Create Test Runner
Create `test/test_validator.ml`:
```ocaml
(** Test runner for Nu HTML Validator test suite *)

(** Run a single test, returns (passed, details) *)
let run_test messages test_file =
  (* 1. Read HTML content *)
  let content = read_file test_file.path in
  (* 2. Run validator *)
  let result =
    Html5_checker.check
      ~collect_parse_errors:true
      ~system_id:test_file.relative_path
      (Bytesrw.Bytes.Reader.of_string content)
  in
  (* 3. Check result against expected outcome *)
  match test_file.expected with
  | Valid ->
    (* Should have no errors or warnings *)
    let errors = Html5_checker.errors result in
    let warnings = Html5_checker.warnings result in
    if errors = [] && warnings = [] then
      (true, "OK: No messages")
    else
      (false, Printf.sprintf "Expected valid but got %d errors, %d warnings"
         (List.length errors) (List.length warnings))
  | Invalid ->
    (* Should have at least one error matching expected message *)
    let errors = Html5_checker.errors result in
    let expected_msg = Validator_messages.get messages test_file.relative_path in
    if errors = [] then
      (false, "Expected error but got none")
    else
      check_message_match errors expected_msg
  | HasWarning ->
    (* Should have at least one warning matching expected message *)
    let warnings = Html5_checker.warnings result in
    let expected_msg = Validator_messages.get messages test_file.relative_path in
    if warnings = [] then
      (false, "Expected warning but got none")
    else
      check_message_match warnings expected_msg
```
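
The runner above calls `check_message_match`, which the plan does not define. A minimal sketch follows, assuming the checker's messages have already been reduced to plain strings (the real message type may be a record) and that a missing `messages.json` entry should degrade to a presence-only check. The optional `matches` parameter is a hypothetical hook for plugging in a looser comparison.

```ocaml
(* Hypothetical helper. [actual] is the list of message texts produced by the
   checker; [expected] is the messages.json entry for this test, if any.
   [matches expected actual] decides whether two messages agree; it defaults
   to exact string equality. *)
let check_message_match ?(matches = String.equal) actual expected =
  match expected with
  | None -> (true, "OK: no expected message recorded, presence check only")
  | Some exp ->
    if List.exists (fun act -> matches exp act) actual then
      (true, "OK: message matched")
    else
      (false,
       Printf.sprintf "Expected %S but got: %s" exp (String.concat " | " actual))
```

Returning the same `(passed, details)` pair as `run_test` lets the result flow straight into the summary without further wrapping.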
### Step 4: Message Matching Strategy
The OCaml checker may produce different message text than the Nu validator. Implement flexible matching:
```ocaml
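(** Normalize Unicode curly quotes (U+201C, U+201D) to ASCII '"'.
    OCaml chars are single bytes, so match the 3-byte UTF-8 encodings. *)
let normalize_quotes s =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec go i =
    if i >= n then Buffer.contents buf
    else if i + 2 < n && s.[i] = '\xE2' && s.[i + 1] = '\x80'
            && (s.[i + 2] = '\x9C' || s.[i + 2] = '\x9D')
    then (Buffer.add_char buf '"'; go (i + 3))
    else (Buffer.add_char buf s.[i]; go (i + 1))
  in
  go 0

(** [contains ~sub s] is true if [sub] occurs in [s] (the stdlib has no substring test) *)
let contains ~sub s =
  let n = String.length s and m = String.length sub in
  let rec at i = i + m <= n && (String.sub s i m = sub || at (i + 1)) in
  at 0

(** Check if actual message matches expected ([extract_core] is still to be defined) *)
let message_matches ~expected ~actual =
  (* Strategy 1: Exact match *)
  actual = expected
  (* Strategy 2: Normalized match (ignore quote style) *)
  || normalize_quotes actual = normalize_quotes expected
  (* Strategy 3: Substring match *)
  || contains ~sub:(extract_core expected) actual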
```
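
The plan leaves `extract_core` unspecified. Purely as an assumption, one possible definition keeps the text before the first `": "` separator, on the theory that Nu validator messages often put the general complaint first and the specific reason after a colon. Whether that is the right "core" would need checking against real entries in `messages.json`.

```ocaml
(* Hypothetical definition: keep the text before the first ": " separator,
   or the whole message if there is no such separator. *)
let extract_core msg =
  let n = String.length msg in
  let rec find i =
    if i + 1 >= n then None
    else if msg.[i] = ':' && msg.[i + 1] = ' ' then Some i
    else find (i + 1)
  in
  match find 0 with
  | Some i -> String.sub msg 0 i
  | None -> msg
```

A looser core makes Strategy 3 more forgiving but also more likely to accept a wrong message, so it may be worth logging which strategy produced each match.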
### Step 5: Test Categories for Selective Running
Map tests to checker categories for phased enablement:
```ocaml
type checker_category =
  | Parse_errors       (* Built into parser *)
  | Nesting            (* Nesting_checker *)
  | Aria               (* Aria_checker *)
  | Required_attrs     (* Required_attr_checker *)
  | Obsolete           (* Obsolete_checker *)
  | Id_uniqueness      (* Id_checker *)
  | Table_structure    (* Table_checker *)
  | Heading_structure  (* Heading_checker *)
  | Form_validation    (* Form_checker *)
  | Microdata          (* Microdata_checker *)
  | Language           (* Language_checker *)
  | Unknown

(** Infer category from test path *)
let categorize_test test_file =
  match test_file.category, extract_subcategory test_file.relative_path with
  | "html-aria", _ -> Aria
  | "html", "parser" -> Parse_errors
  | "html", "elements" -> Nesting (* mostly *)
  | "html", "attributes" -> Required_attrs
  | "html", "obsolete" -> Obsolete
  | "html", "microdata" -> Microdata
  | _ -> Unknown
```
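
`categorize_test` relies on an `extract_subcategory` helper that is not shown. A small sketch, assuming the subcategory is simply the directory directly below the top-level category and that files sitting directly in a category directory have none:

```ocaml
(* Hypothetical helper: the directory directly below the category,
   e.g. "html/parser/foo-novalid.html" -> "parser"; "" if there is none. *)
let extract_subcategory relative_path =
  match String.split_on_char '/' relative_path with
  | _category :: sub :: _ :: _ -> sub
  | _ -> ""
```

Returning an empty string rather than an `option` keeps the tuple pattern match in `categorize_test` unchanged, at the cost of a sentinel value.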
### Step 6: Dune Build Integration
Add to `test/dune`:
```dune
(executable
 (name test_validator)
 (modules test_validator validator_messages)
 (libraries bytesrw html5rw html5rw.checker jsont jsont.bytesrw test_report))

(rule
 (alias validator-tests)
 (deps
  ; glob_files_rec (dune 3.0+) already descends into subdirectories;
  ; dune globs have no "**/" directory form
  (glob_files_rec ../third_party/validator/tests/*.html)
  ../third_party/validator/tests/messages.json)
 (action
  (run %{exe:test_validator.exe} ../third_party/validator/tests)))
```
Note: Use a separate `validator-tests` alias initially (not `runtest`), since many tests will fail until the checkers are integrated.
### Step 7: Reporting
Generate reports compatible with existing test infrastructure:
```ocaml
(** Print summary *)
let print_summary results =
  let by_category = group_by_category results in
  List.iter (fun (cat, tests) ->
    let passed = List.filter fst tests |> List.length in
    let total = List.length tests in
    Printf.printf "%s: %d/%d passed\n" (category_name cat) passed total
  ) by_category;
  (* Overall *)
  let total_passed = List.filter (fun (p, _) -> p) results |> List.length in
  Printf.printf "\nTotal: %d/%d passed\n" total_passed (List.length results)

(** Generate HTML report *)
let write_html_report results filename =
  (* Use Test_report module pattern from other tests *)
  ...
```
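
`group_by_category` is not defined in the plan. It could be built on a generic grouping helper such as the sketch below, where `group_by` is a hypothetical name and the `key` function would map each result to its checker category (which assumes each result carries enough information to derive one).

```ocaml
(* Hypothetical generic helper: group [items] by [key], keeping keys in
   first-seen order and items in their original order within each group. *)
let group_by ~key items =
  let tbl = Hashtbl.create 16 in
  let order = ref [] in
  List.iter
    (fun x ->
      let k = key x in
      match Hashtbl.find_opt tbl k with
      | Some xs -> Hashtbl.replace tbl k (x :: xs)
      | None ->
        order := k :: !order;
        Hashtbl.add tbl k [ x ])
    items;
  List.rev_map (fun k -> (k, List.rev (Hashtbl.find tbl k))) !order
```

Preserving first-seen order means the summary lists categories in the order the tests were discovered, rather than in hash order.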
## Prerequisites
Before tests can pass, the following must be completed:
### 1. Wire Checker Registry into Html5_checker.check
In `lib/html5_checker/html5_checker.ml`, the `check` and `check_dom` functions currently only collect parse errors. They also need to run the registered checkers over the parsed document, along these lines:
```ocaml
let check ?collect_parse_errors ?system_id reader =
  let doc = Html5rw.parse reader in
  let collector = Message_collector.create () in
  (* Collect parse errors if requested *)
  if Option.value ~default:false collect_parse_errors then
    Parse_error_bridge.collect_parse_errors ?system_id doc
    |> List.iter (Message_collector.add collector);
  (* TODO: Run checkers - THIS NEEDS TO BE IMPLEMENTED *)
  let registry = Checker_registry.default () in
  Dom_walker.walk_registry registry collector (Html5rw.root doc);
  { document = doc; collector; system_id }
```
### 2. Populate Default Checker Registry
In `lib/html5_checker/checker_registry.ml`:
```ocaml
let default () =
  let reg = create () in
  register reg "nesting" Nesting_checker.checker;
  register reg "aria" Aria_checker.checker;
  register reg "required-attrs" Required_attr_checker.checker;
  register reg "obsolete" Obsolete_checker.checker;
  register reg "id" Id_checker.checker;
  register reg "table" Table_checker.checker;
  register reg "heading" Heading_checker.checker;
  register reg "form" Form_checker.checker;
  register reg "microdata" Microdata_checker.checker;
  register reg "language" Language_checker.checker;
  reg
```
### 3. Ensure Checkers Produce Compatible Messages
Review each checker's error messages against `messages.json` to ensure they can be matched. This may require:
- Using curly quotes in messages
- Matching the Nu validator's phrasing
- Including element/attribute names in the same format
## Phased Rollout
Run tests incrementally as checkers are integrated:
| Phase | Command | What's Tested |
|-------|---------|---------------|
| 1 | `--category=parse` | Parse errors only (~200 tests) |
| 2 | `--category=nesting` | + Nesting checker |
| 3 | `--category=aria` | + ARIA checker (~700 tests) |
| 4 | `--category=required` | + Required attributes |
| 5 | (all) | Full suite |
Implement command-line filtering:
```ocaml
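(* Map a --category=NAME argument to a checker_category (names from the table above) *)
let category_of_arg = function
  | "--category=parse" -> Some Parse_errors
  | "--category=nesting" -> Some Nesting
  | "--category=aria" -> Some Aria
  | "--category=required" -> Some Required_attrs
  | _ -> None

let () =
  let tests_dir = Sys.argv.(1) in
  let category_filter =
    if Array.length Sys.argv > 2 then category_of_arg Sys.argv.(2) else None in
  let messages = Validator_messages.load (Filename.concat tests_dir "messages.json") in
  let tests = discover_tests tests_dir in
  let tests = match category_filter with
    | Some cat -> List.filter (fun t -> categorize_test t = cat) tests
    | None -> tests in
  run_tests messages tests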
```
## Expected Test Counts by Category
Based on file counts in `third_party/validator/tests/`:
| Category | Files | Notes |
|----------|-------|-------|
| html/ | 2,601 | Core HTML5 validation |
| html-aria/ | 712 | ARIA attributes |
| html-svg/ | 517 | SVG embedded in HTML |
| html-rdfa/ | 212 | RDFa semantic markup |
| xhtml/ | 110 | XHTML variant |
| html-its/ | 90 | Internationalization |
| html-rdfalite/ | 56 | RDFa Lite |
| **Total** | **4,298** | Sum of the counts above (~4,300) |
## Files to Create
| File | Purpose |
|------|---------|
| `test/validator_messages.ml` | Load and query messages.json |
| `test/test_validator.ml` | Main test runner |
| `test/dune` (modify) | Add build rules |
## Success Criteria
1. All `-isvalid.html` tests pass (no false positives)
2. All `-novalid.html` tests produce at least one error
3. All `-haswarn.html` tests produce at least one warning
4. Message content matches for implemented checkers
5. HTML report generated for review
## Reference
- Nu HTML Checker: https://validator.github.io/validator/
- Test harness reference: `third_party/validator/resources/examples/test-harness/validator-tester.py`
- Existing OCaml tests: `test/test_html5lib.ml`, `test/test_tokenizer.ml`