OCaml HTML5 parser/serialiser based on Python's JustHTML

whats the plan

Changed files
+365
PLAN.md
# Plan: Nu HTML Validator Test Suite Integration

This document describes how to run the Nu HTML Validator test suite against the OCaml `html5_checker` library.

## Background

The Nu HTML Checker (vnu) is the W3C's official HTML validator. Its test suite in `third_party/validator/tests/` contains ~4,300 HTML test files covering HTML5 validation rules including ARIA, microdata, element nesting, required attributes, and more.

## Test Suite Structure

### Location
```
third_party/validator/tests/
├── messages.json          # Expected error messages keyed by test path
├── html/                  # 2,601 core HTML5 tests
│   ├── attributes/
│   ├── elements/
│   ├── microdata/
│   ├── mime-types/
│   ├── obsolete/
│   ├── parser/
│   └── ...
├── html-aria/             # 712 ARIA validation tests
├── html-its/              # 90 internationalization tests
├── html-rdfa/             # 212 RDFa tests
├── html-rdfalite/         # 56 RDFa Lite tests
├── html-svg/              # 517 SVG-in-HTML tests
└── xhtml/                 # 110 XHTML tests
```

### Filename Convention

Test files use a suffix to indicate expected outcome:

| Suffix | Meaning | Expected Result |
|--------|---------|-----------------|
| `-isvalid.html` | Valid HTML | No errors, no warnings |
| `-novalid.html` | Invalid HTML | At least one error |
| `-haswarn.html` | Valid with warning | At least one warning |

### Expected Messages (messages.json)

For `-novalid` and `-haswarn` files, `messages.json` contains the expected error/warning message:

```json
{
  "html-aria/misc/aria-label-div-novalid.html": "The “aria-label” attribute must not be specified on any “div” element unless...",
  "html/mime-types/001-novalid.html": "Bad value \"text/html \" for attribute \"type\" on element \"link\": Bad MIME type..."
}
```

Note: Messages use Unicode curly quotes (U+201C `“` and U+201D `”`).
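
The filename convention above can be sketched as a small suffix classifier. This is only an illustration: the names `outcome` and `classify_suffix` are hypothetical and not part of the library, and `String.ends_with` assumes OCaml 4.13 or later.

```ocaml
(* Hypothetical sketch: these names are illustrative, not the library's API. *)
type outcome = Valid | Invalid | Has_warning

(* Classify a test file by its suffix; [None] means the file is not a test
   (for example messages.json). [String.ends_with] needs OCaml >= 4.13. *)
let classify_suffix filename =
  if String.ends_with ~suffix:"-isvalid.html" filename then Some Valid
  else if String.ends_with ~suffix:"-novalid.html" filename then Some Invalid
  else if String.ends_with ~suffix:"-haswarn.html" filename then Some Has_warning
  else None

let () =
  assert (classify_suffix "html/mime-types/001-novalid.html" = Some Invalid);
  assert (classify_suffix "messages.json" = None)
```

Returning an option keeps non-test files out of the run without raising, which may be convenient when walking the directory tree recursively.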

## Implementation Steps

### Step 1: Create Messages JSON Parser

Create `test/validator_messages.ml` exposing the following interface:

```ocaml
(** Parser for third_party/validator/tests/messages.json *)

type t = (string, string) Hashtbl.t
(** Maps test file path to expected error message *)

val load : string -> t
(** [load path] loads messages.json from [path] *)

val get : t -> string -> string option
(** [get messages test_path] returns expected message for test, if any *)
```

Implementation notes:
- Use `Jsont` library (already a dependency)
- Keys are relative paths like `"html/parser/foo-novalid.html"`
- Values are error message strings with Unicode quotes

### Step 2: Create Test File Discovery

Create logic to find and classify test files:

```ocaml
type expected_outcome =
  | Valid        (* -isvalid.html: expect no errors *)
  | Invalid      (* -novalid.html: expect error matching messages.json *)
  | HasWarning   (* -haswarn.html: expect warning matching messages.json *)

type test_file = {
  path : string;           (* Full filesystem path *)
  relative_path : string;  (* Path relative to tests/, used as key in messages.json *)
  category : string;       (* html, html-aria, etc. *)
  expected : expected_outcome;
}

val discover_tests : string -> test_file list
(** [discover_tests tests_dir] finds all test files recursively *)

val parse_outcome : string -> expected_outcome
(** [parse_outcome filename] extracts outcome from filename suffix *)
```

### Step 3: Create Test Runner

Create `test/test_validator.ml`:

```ocaml
(** Test runner for Nu HTML Validator test suite *)

(* [read_file] and [check_message_match] are helpers assumed to be defined
   earlier in this file; [check_message_match] takes the messages and the
   optional expected text and returns (passed, details). *)

(** Run a single test, returns (passed, details) *)
let run_test messages test_file =
  (* 1. Read HTML content *)
  let content = read_file test_file.path in

  (* 2. Run validator *)
  let result = Html5_checker.check
    ~collect_parse_errors:true
    ~system_id:test_file.relative_path
    (Bytesrw.Bytes.Reader.of_string content) in

  (* 3. Check result against expected outcome *)
  match test_file.expected with
  | Valid ->
    (* Should have no errors or warnings *)
    let errors = Html5_checker.errors result in
    let warnings = Html5_checker.warnings result in
    if errors = [] && warnings = [] then
      (true, "OK: No messages")
    else
      (false, Printf.sprintf "Expected valid but got %d errors, %d warnings"
        (List.length errors) (List.length warnings))

  | Invalid ->
    (* Should have at least one error matching expected message *)
    let errors = Html5_checker.errors result in
    let expected_msg = Validator_messages.get messages test_file.relative_path in
    if errors = [] then
      (false, "Expected error but got none")
    else
      check_message_match errors expected_msg

  | HasWarning ->
    (* Should have at least one warning matching expected message *)
    let warnings = Html5_checker.warnings result in
    let expected_msg = Validator_messages.get messages test_file.relative_path in
    if warnings = [] then
      (false, "Expected warning but got none")
    else
      check_message_match warnings expected_msg
```

### Step 4: Message Matching Strategy

The OCaml checker may produce different message text than the Nu validator.
Implement flexible matching:

```ocaml
(** Normalize Unicode curly quotes (U+201C, U+201D) to an ASCII double quote.
    Both are three-byte UTF-8 sequences (E2 80 9C and E2 80 9D), so they
    cannot be matched one [char] at a time with [String.map]. *)
let normalize_quotes s =
  let n = String.length s in
  let b = Buffer.create n in
  let rec go i =
    if i >= n then Buffer.contents b
    else if i + 2 < n && s.[i] = '\xE2' && s.[i + 1] = '\x80'
            && (s.[i + 2] = '\x9C' || s.[i + 2] = '\x9D')
    then (Buffer.add_char b '"'; go (i + 3))
    else (Buffer.add_char b s.[i]; go (i + 1))
  in
  go 0

(** [contains ~sub s] is true if [sub] occurs in [s]; the stdlib has no
    substring search. *)
let contains ~sub s =
  let n = String.length s and m = String.length sub in
  let rec at i = i + m <= n && (String.sub s i m = sub || at (i + 1)) in
  at 0

(** Check if actual message matches expected.
    [extract_core] is a helper that still has to be defined. *)
let message_matches ~expected ~actual =
  (* Strategy 1: Exact match *)
  actual = expected
  (* Strategy 2: Normalized match (ignore quote style) *)
  || normalize_quotes actual = normalize_quotes expected
  (* Strategy 3: Substring match *)
  || contains ~sub:(extract_core expected) actual
```

### Step 5: Test Categories for Selective Running

Map tests to checker categories for phased enablement:

```ocaml
type checker_category =
  | Parse_errors       (* Built into parser *)
  | Nesting            (* Nesting_checker *)
  | Aria               (* Aria_checker *)
  | Required_attrs     (* Required_attr_checker *)
  | Obsolete           (* Obsolete_checker *)
  | Id_uniqueness      (* Id_checker *)
  | Table_structure    (* Table_checker *)
  | Heading_structure  (* Heading_checker *)
  | Form_validation    (* Form_checker *)
  | Microdata          (* Microdata_checker *)
  | Language           (* Language_checker *)
  | Unknown

(** Infer category from test path *)
let categorize_test test_file =
  match test_file.category, extract_subcategory test_file.relative_path with
  | "html-aria", _ -> Aria
  | "html", "parser" -> Parse_errors
  | "html", "elements" -> Nesting  (* mostly *)
  | "html", "attributes" -> Required_attrs
  | "html", "obsolete" -> Obsolete
  | "html", "microdata" -> Microdata
  | _ -> Unknown
```

### Step 6: Dune Build Integration

Add to `test/dune`:

```dune
(executable
 (name test_validator)
 (modules test_validator validator_messages)
 (libraries bytesrw html5rw html5rw.checker jsont jsont.bytesrw test_report))

(rule
 (alias validator-tests)
 (deps
  ; glob_files_rec already recurses into subdirectories, so no ** is needed
  (glob_files_rec ../third_party/validator/tests/*.html)
  ../third_party/validator/tests/messages.json)
 (action
  (run %{exe:test_validator.exe} ../third_party/validator/tests)))
```

Note: Use a separate alias `validator-tests` initially (not `runtest`), since many tests will fail until the checkers are integrated.

### Step 7: Reporting

Generate reports compatible with existing test infrastructure:

```ocaml
(** Print summary *)
let print_summary results =
  let by_category = group_by_category results in
  List.iter (fun (cat, tests) ->
    let passed = List.filter fst tests |> List.length in
    let total = List.length tests in
    Printf.printf "%s: %d/%d passed\n" (category_name cat) passed total
  ) by_category;

  (* Overall *)
  let total_passed = List.filter (fun (p, _) -> p) results |> List.length in
  Printf.printf "\nTotal: %d/%d passed\n" total_passed (List.length results)

(** Generate HTML report *)
let write_html_report results filename =
  (* Use Test_report module pattern from other tests *)
  ...
```

## Prerequisites

Before tests can pass, the following must be completed:

### 1. Wire Checker Registry into Html5_checker.check

In `lib/html5_checker/html5_checker.ml`, the `check` and `check_dom` functions currently only collect parse errors.
They also need to run the checker registry over the parsed document:

```ocaml
let check ?collect_parse_errors ?system_id reader =
  let doc = Html5rw.parse reader in
  let collector = Message_collector.create () in

  (* Collect parse errors if requested *)
  if Option.value ~default:false collect_parse_errors then
    Parse_error_bridge.collect_parse_errors ?system_id doc
    |> List.iter (Message_collector.add collector);

  (* TODO: Run checkers - THIS NEEDS TO BE IMPLEMENTED *)
  let registry = Checker_registry.default () in
  Dom_walker.walk_registry registry collector (Html5rw.root doc);

  { document = doc; collector; system_id }
```

### 2. Populate Default Checker Registry

In `lib/html5_checker/checker_registry.ml`:

```ocaml
let default () =
  let reg = create () in
  register reg "nesting" Nesting_checker.checker;
  register reg "aria" Aria_checker.checker;
  register reg "required-attrs" Required_attr_checker.checker;
  register reg "obsolete" Obsolete_checker.checker;
  register reg "id" Id_checker.checker;
  register reg "table" Table_checker.checker;
  register reg "heading" Heading_checker.checker;
  register reg "form" Form_checker.checker;
  register reg "microdata" Microdata_checker.checker;
  register reg "language" Language_checker.checker;
  reg
```

### 3. Ensure Checkers Produce Compatible Messages

Review each checker's error messages against `messages.json` to ensure they can be matched.
The checkers may need to:
- Use curly quotes in messages
- Match Nu validator's phrasing
- Include element/attribute names in same format

## Phased Rollout

Run tests incrementally as checkers are integrated:

| Phase | Command | What's Tested |
|-------|---------|---------------|
| 1 | `--category=parse` | Parse errors only (~200 tests) |
| 2 | `--category=nesting` | + Nesting checker |
| 3 | `--category=aria` | + ARIA checker (~700 tests) |
| 4 | `--category=required` | + Required attributes |
| 5 | (all) | Full suite |

Implement command-line filtering. The flag has to be parsed into a `checker_category` before it can be compared with the result of `categorize_test`:

```ocaml
(* Map a phase flag from the table above to a checker category. *)
let category_of_flag = function
  | "--category=parse" -> Parse_errors
  | "--category=nesting" -> Nesting
  | "--category=aria" -> Aria
  | "--category=required" -> Required_attrs
  | flag -> failwith ("Unknown category filter: " ^ flag)

let () =
  let tests_dir = Sys.argv.(1) in
  (* Optional second argument selects a single category. *)
  let category_filter =
    if Array.length Sys.argv > 2 then Some (category_of_flag Sys.argv.(2)) else None in

  let messages = Validator_messages.load (tests_dir ^ "/messages.json") in
  let tests = discover_tests tests_dir in
  let tests = match category_filter with
    | Some cat -> List.filter (fun t -> categorize_test t = cat) tests
    | None -> tests in

  run_tests messages tests
```

## Expected Test Counts by Category

Based on file counts in `third_party/validator/tests/`:

| Category | Files | Notes |
|----------|-------|-------|
| html/ | 2,601 | Core HTML5 validation |
| html-aria/ | 712 | ARIA attributes |
| html-svg/ | 517 | SVG embedded in HTML |
| html-rdfa/ | 212 | RDFa semantic markup |
| xhtml/ | 110 | XHTML variant |
| html-its/ | 90 | Internationalization |
| html-rdfalite/ | 56 | RDFa Lite |
| **Total** | **~4,300** | |

## Files to Create

| File | Purpose |
|------|---------|
| `test/validator_messages.ml` | Load and query messages.json |
| `test/test_validator.ml` | Main test runner |
| `test/dune` (modify) | Add build rules |

## Success Criteria

1. All `-isvalid.html` tests pass (no false positives)
2. All `-novalid.html` tests produce at least one error
3. All `-haswarn.html` tests produce at least one warning
4. Message content matches for implemented checkers
5. HTML report generated for review

## Reference

- Nu HTML Checker: https://validator.github.io/validator/
- Test harness reference: `third_party/validator/resources/examples/test-harness/validator-tester.py`
- Existing OCaml tests: `test/test_html5lib.ml`, `test/test_tokenizer.ml`
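
As a rough illustration of success criteria 1 to 3, the pass condition can be reduced to message counts. This is a sketch under stated assumptions: `outcome` and `meets_criteria` are hypothetical names, and message text (criterion 4) is deliberately not checked here.

```ocaml
(* Hypothetical sketch: [outcome] and [meets_criteria] are illustrative names,
   not part of the library. Only counts are considered, not message text. *)
type outcome = Valid | Invalid | Has_warning

let meets_criteria outcome ~errors ~warnings =
  match outcome with
  | Valid -> errors = 0 && warnings = 0   (* criterion 1: no false positives *)
  | Invalid -> errors > 0                 (* criterion 2: at least one error *)
  | Has_warning -> warnings > 0           (* criterion 3: at least one warning *)

let () =
  assert (meets_criteria Valid ~errors:0 ~warnings:0);
  assert (not (meets_criteria Invalid ~errors:0 ~warnings:2));
  assert (meets_criteria Has_warning ~errors:0 ~warnings:1)
```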