OCaml HTML5 parser/serialiser based on Python's JustHTML
# Plan: Nu HTML Validator Test Suite Integration

This document describes how to run the Nu HTML Validator test suite against the OCaml `html5_checker` library.

## Background

The Nu HTML Checker (vnu) is the W3C's official HTML validator. Its test suite in `third_party/validator/tests/` contains ~4,300 HTML test files covering HTML5 validation rules including ARIA, microdata, element nesting, required attributes, and more.

## Test Suite Structure

### Location
```
third_party/validator/tests/
├── messages.json          # Expected error messages keyed by test path
├── html/                  # 2,601 core HTML5 tests
│   ├── attributes/
│   ├── elements/
│   ├── microdata/
│   ├── mime-types/
│   ├── obsolete/
│   ├── parser/
│   └── ...
├── html-aria/             # 712 ARIA validation tests
├── html-its/              # 90 internationalization tests
├── html-rdfa/             # 212 RDFa tests
├── html-rdfalite/         # 56 RDFa Lite tests
├── html-svg/              # 517 SVG-in-HTML tests
└── xhtml/                 # 110 XHTML tests
```

### Filename Convention

Test files use a suffix to indicate expected outcome:

| Suffix | Meaning | Expected Result |
|--------|---------|-----------------|
| `-isvalid.html` | Valid HTML | No errors, no warnings |
| `-novalid.html` | Invalid HTML | At least one error |
| `-haswarn.html` | Valid with warning | At least one warning |

### Expected Messages (messages.json)

For `-novalid` and `-haswarn` files, `messages.json` contains the expected error/warning message:

```json
{
  "html-aria/misc/aria-label-div-novalid.html": "The “aria-label” attribute must not be specified on any “div” element unless...",
  "html/mime-types/001-novalid.html": "Bad value \"text/html \" for attribute \"type\" on element \"link\": Bad MIME type..."
}
```

Note: Messages use Unicode curly quotes (U+201C `“` and U+201D `”`).
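
The suffix convention in the Filename Convention table can be checked mechanically. The following is a minimal sketch, not the project's actual code: it reuses the `expected_outcome` and `parse_outcome` names planned in Step 2, but returns an `option` (an assumption made here, differing from the planned signature) so that files without a recognised suffix can be skipped. It relies on `String.ends_with`, which needs OCaml 4.13 or later.

```ocaml
(* Hypothetical sketch: classify a test file by its filename suffix.
   The option return type is an assumption of this sketch; Step 2 plans
   a plain [string -> expected_outcome] signature. *)
type expected_outcome =
  | Valid       (* -isvalid.html *)
  | Invalid     (* -novalid.html *)
  | HasWarning  (* -haswarn.html *)

let parse_outcome filename =
  (* String.ends_with is available from OCaml 4.13 *)
  let has suffix = String.ends_with ~suffix filename in
  if has "-isvalid.html" then Some Valid
  else if has "-novalid.html" then Some Invalid
  else if has "-haswarn.html" then Some HasWarning
  else None
```

Returning `None` for unrecognised names lets test discovery ignore support files such as `messages.json` instead of raising.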

## Implementation Steps

### Step 1: Create Messages JSON Parser

Create `test/validator_messages.ml` exposing this interface:

```ocaml
(** Parser for third_party/validator/tests/messages.json *)

type t = (string, string) Hashtbl.t
(** Maps test file path to expected error message *)

val load : string -> t
(** [load path] loads messages.json from [path] *)

val get : t -> string -> string option
(** [get messages test_path] returns expected message for test, if any *)
```

Implementation notes:
- Use `Jsont` library (already a dependency)
- Keys are relative paths like `"html/parser/foo-novalid.html"`
- Values are error message strings with Unicode quotes

### Step 2: Create Test File Discovery

Create logic to find and classify test files, with this interface:

```ocaml
type expected_outcome =
  | Valid       (* -isvalid.html: expect no errors *)
  | Invalid     (* -novalid.html: expect error matching messages.json *)
  | HasWarning  (* -haswarn.html: expect warning matching messages.json *)

type test_file = {
  path : string;           (* Full filesystem path *)
  relative_path : string;  (* Path relative to tests/, used as key in messages.json *)
  category : string;       (* html, html-aria, etc. *)
  expected : expected_outcome;
}

val discover_tests : string -> test_file list
(** [discover_tests tests_dir] finds all test files recursively *)

val parse_outcome : string -> expected_outcome
(** [parse_outcome filename] extracts outcome from filename suffix *)
```

### Step 3: Create Test Runner

Create `test/test_validator.ml`:

```ocaml
(** Test runner for Nu HTML Validator test suite *)

(** Run a single test, returns (passed, details).
    [read_file] and [check_message_match] are helpers defined elsewhere in
    this file; [check_message_match] receives the expected message as a
    [string option]. *)
let run_test messages test_file =
  (* 1. Read HTML content *)
  let content = read_file test_file.path in

  (* 2. Run validator *)
  let result = Html5_checker.check
    ~collect_parse_errors:true
    ~system_id:test_file.relative_path
    (Bytesrw.Bytes.Reader.of_string content) in

  (* 3. Check result against expected outcome *)
  match test_file.expected with
  | Valid ->
    (* Should have no errors or warnings *)
    let errors = Html5_checker.errors result in
    let warnings = Html5_checker.warnings result in
    if errors = [] && warnings = [] then
      (true, "OK: No messages")
    else
      (false, Printf.sprintf "Expected valid but got %d errors, %d warnings"
        (List.length errors) (List.length warnings))

  | Invalid ->
    (* Should have at least one error matching expected message *)
    let errors = Html5_checker.errors result in
    let expected_msg = Validator_messages.get messages test_file.relative_path in
    if errors = [] then
      (false, "Expected error but got none")
    else
      check_message_match errors expected_msg

  | HasWarning ->
    (* Should have at least one warning matching expected message *)
    let warnings = Html5_checker.warnings result in
    let expected_msg = Validator_messages.get messages test_file.relative_path in
    if warnings = [] then
      (false, "Expected warning but got none")
    else
      check_message_match warnings expected_msg
```

### Step 4: Message Matching Strategy

The OCaml checker may produce different message text than the Nu validator.
Implement flexible matching:

```ocaml
(** Normalize Unicode curly quotes to ASCII. OCaml strings are bytes, so
    match the UTF-8 encodings of U+201C (E2 80 9C) and U+201D (E2 80 9D). *)
let normalize_quotes s =
  let n = String.length s in
  let b = Buffer.create n in
  let rec go i =
    if i >= n then Buffer.contents b
    else if i + 2 < n && s.[i] = '\xE2' && s.[i + 1] = '\x80'
            && (s.[i + 2] = '\x9C' || s.[i + 2] = '\x9D')
    then (Buffer.add_char b '"'; go (i + 3))
    else (Buffer.add_char b s.[i]; go (i + 1))
  in
  go 0

(** [contains ~sub s] is true if [sub] occurs in [s]
    (the stdlib has no substring search) *)
let contains ~sub s =
  let n = String.length s and m = String.length sub in
  let rec at i = i + m <= n && (String.sub s i m = sub || at (i + 1)) in
  at 0

(** Check if actual message matches expected.
    [extract_core] is a helper that still has to be written. *)
let message_matches ~expected ~actual =
  (* Strategy 1: Exact match *)
  if actual = expected then true
  (* Strategy 2: Normalized match (ignore quote style) *)
  else if normalize_quotes actual = normalize_quotes expected then true
  (* Strategy 3: Substring match *)
  else contains ~sub:(extract_core expected) actual
```

### Step 5: Test Categories for Selective Running

Map tests to checker categories for phased enablement:

```ocaml
type checker_category =
  | Parse_errors       (* Built into parser *)
  | Nesting            (* Nesting_checker *)
  | Aria               (* Aria_checker *)
  | Required_attrs     (* Required_attr_checker *)
  | Obsolete           (* Obsolete_checker *)
  | Id_uniqueness      (* Id_checker *)
  | Table_structure    (* Table_checker *)
  | Heading_structure  (* Heading_checker *)
  | Form_validation    (* Form_checker *)
  | Microdata          (* Microdata_checker *)
  | Language           (* Language_checker *)
  | Unknown

(** Infer category from test path *)
let categorize_test test_file =
  match test_file.category, extract_subcategory test_file.relative_path with
  | "html-aria", _ -> Aria
  | "html", "parser" -> Parse_errors
  | "html", "elements" -> Nesting  (* mostly *)
  | "html", "attributes" -> Required_attrs
  | "html", "obsolete" -> Obsolete
  | "html", "microdata" -> Microdata
  | _ -> Unknown
```

### Step 6: Dune Build Integration

Add to `test/dune`:

```dune
(executable
 (name test_validator)
 (modules test_validator validator_messages)
 (libraries bytesrw html5rw html5rw.checker jsont jsont.bytesrw test_report))

(rule
 (alias validator-tests)
 (deps
  ; glob_files_rec descends into subdirectories itself; dune globs do not
  ; accept a ** path component
  (glob_files_rec ../third_party/validator/tests/*.html)
  ../third_party/validator/tests/messages.json)
 (action
  (run %{exe:test_validator.exe} ../third_party/validator/tests)))
```

Note: Use a separate alias `validator-tests` initially (not `runtest`), since many tests will fail until the checkers are integrated.

### Step 7: Reporting

Generate reports compatible with existing test infrastructure:

```ocaml
(** Print summary *)
let print_summary results =
  let by_category = group_by_category results in
  List.iter (fun (cat, tests) ->
    let passed = List.filter fst tests |> List.length in
    let total = List.length tests in
    Printf.printf "%s: %d/%d passed\n" (category_name cat) passed total
  ) by_category;

  (* Overall *)
  let total_passed = List.filter (fun (p, _) -> p) results |> List.length in
  Printf.printf "\nTotal: %d/%d passed\n" total_passed (List.length results)

(** Generate HTML report *)
let write_html_report results filename =
  (* Use Test_report module pattern from other tests *)
  ...
```

## Prerequisites

Before tests can pass, the following must be completed:

### 1. Wire Checker Registry into Html5_checker.check

In `lib/html5_checker/html5_checker.ml`, the `check` and `check_dom` functions currently only collect parse errors.
They also need to run the registered checkers over the parsed document:

```ocaml
let check ?collect_parse_errors ?system_id reader =
  let doc = Html5rw.parse reader in
  let collector = Message_collector.create () in

  (* Collect parse errors if requested *)
  if Option.value ~default:false collect_parse_errors then
    Parse_error_bridge.collect_parse_errors ?system_id doc
    |> List.iter (Message_collector.add collector);

  (* TODO: Run checkers - THIS NEEDS TO BE IMPLEMENTED *)
  let registry = Checker_registry.default () in
  Dom_walker.walk_registry registry collector (Html5rw.root doc);

  { document = doc; collector; system_id }
```

### 2. Populate Default Checker Registry

In `lib/html5_checker/checker_registry.ml`:

```ocaml
let default () =
  let reg = create () in
  register reg "nesting" Nesting_checker.checker;
  register reg "aria" Aria_checker.checker;
  register reg "required-attrs" Required_attr_checker.checker;
  register reg "obsolete" Obsolete_checker.checker;
  register reg "id" Id_checker.checker;
  register reg "table" Table_checker.checker;
  register reg "heading" Heading_checker.checker;
  register reg "form" Form_checker.checker;
  register reg "microdata" Microdata_checker.checker;
  register reg "language" Language_checker.checker;
  reg
```

### 3. Ensure Checkers Produce Compatible Messages

Review each checker's error messages against `messages.json` to ensure they can be matched.
The checkers may need to:
- Use curly quotes in messages
- Match the Nu validator's phrasing
- Include element/attribute names in the same format

## Phased Rollout

Run tests incrementally as checkers are integrated:

| Phase | Command | What's Tested |
|-------|---------|---------------|
| 1 | `--category=parse` | Parse errors only (~200 tests) |
| 2 | `--category=nesting` | + Nesting checker |
| 3 | `--category=aria` | + ARIA checker (~700 tests) |
| 4 | `--category=required` | + Required attributes |
| 5 | (all) | Full suite |

Implement command-line filtering:

```ocaml
let () =
  let tests_dir = Sys.argv.(1) in
  (* Optional filter, e.g. "--category=aria"; the prefix is stripped *)
  let category_filter =
    if Array.length Sys.argv > 2 then
      let arg = Sys.argv.(2) and prefix = "--category=" in
      let n = String.length prefix in
      if String.starts_with ~prefix arg
      then Some (String.sub arg n (String.length arg - n))
      else Some arg
    else None in

  let messages = Validator_messages.load (tests_dir ^ "/messages.json") in
  let tests = discover_tests tests_dir in
  let tests = match category_filter with
    (* Compare names: [categorize_test] returns a [checker_category], not a
       string. Assumes [category_name] yields the short names in the table. *)
    | Some cat ->
      List.filter (fun t -> category_name (categorize_test t) = cat) tests
    | None -> tests in

  run_tests messages tests
```

## Expected Test Counts by Category

Based on file counts in `third_party/validator/tests/`:

| Category | Files | Notes |
|----------|-------|-------|
| html/ | 2,601 | Core HTML5 validation |
| html-aria/ | 712 | ARIA attributes |
| html-svg/ | 517 | SVG embedded in HTML |
| html-rdfa/ | 212 | RDFa semantic markup |
| xhtml/ | 110 | XHTML variant |
| html-its/ | 90 | Internationalization |
| html-rdfalite/ | 56 | RDFa Lite |
| **Total** | **~4,300** | |

## Files to Create

| File | Purpose |
|------|---------|
| `test/validator_messages.ml` | Load and query messages.json |
| `test/test_validator.ml` | Main test runner |
| `test/dune` (modify) | Add build rules |

## Success Criteria

1. All `-isvalid.html` tests pass (no false positives)
2. All `-novalid.html` tests produce at least one error
3. All `-haswarn.html` tests produce at least one warning
4. Message content matches for implemented checkers
5. HTML report generated for review

## Reference

- Nu HTML Checker: https://validator.github.io/validator/
- Test harness reference: `third_party/validator/resources/examples/test-harness/validator-tester.py`
- Existing OCaml tests: `test/test_html5lib.ml`, `test/test_tokenizer.ml`