# Plan: Nu HTML Validator Test Suite Integration

This document describes how to run the Nu HTML Validator test suite against the OCaml `html5_checker` library.

## Background

The Nu HTML Checker (vnu) is the W3C's official HTML validator. Its test suite in `third_party/validator/tests/` contains ~4,300 HTML test files covering HTML5 validation rules including ARIA, microdata, element nesting, required attributes, and more.

## Test Suite Structure

### Location

```
third_party/validator/tests/
├── messages.json          # Expected error messages keyed by test path
├── html/                  # 2,601 core HTML5 tests
│   ├── attributes/
│   ├── elements/
│   ├── microdata/
│   ├── mime-types/
│   ├── obsolete/
│   ├── parser/
│   └── ...
├── html-aria/             # 712 ARIA validation tests
├── html-its/              # 90 internationalization tests
├── html-rdfa/             # 212 RDFa tests
├── html-rdfalite/         # 56 RDFa Lite tests
├── html-svg/              # 517 SVG-in-HTML tests
└── xhtml/                 # 110 XHTML tests
```

### Filename Convention

Test files use a suffix to indicate expected outcome:

| Suffix | Meaning | Expected Result |
|--------|---------|-----------------|
| `-isvalid.html` | Valid HTML | No errors, no warnings |
| `-novalid.html` | Invalid HTML | At least one error |
| `-haswarn.html` | Valid with warning | At least one warning |

### Expected Messages (messages.json)

For `-novalid` and `-haswarn` files, `messages.json` contains the expected error/warning message:

```json
{
  "html-aria/misc/aria-label-div-novalid.html": "The “aria-label” attribute must not be specified on any “div” element unless...",
  "html/mime-types/001-novalid.html": "Bad value \"text/html \" for attribute \"type\" on element \"link\": Bad MIME type..."
}
```

Note: Messages use Unicode curly quotes (U+201C `“` and U+201D `”`).
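
The suffix table in the Filename Convention section can be turned into a small classifier. The following is a minimal sketch using only the standard library (OCaml 4.13 or later for `String.ends_with`); the name `outcome_of_filename` is hypothetical, and returning an `option` for files that match no suffix is an assumption, not something this plan specifies.

```ocaml
(* The three outcomes from the Filename Convention table. *)
type expected_outcome = Valid | Invalid | HasWarning

(* Hypothetical helper: classify a test file by its suffix.
   Returns [None] for files that follow none of the three conventions,
   so the caller can decide whether to skip or report them. *)
let outcome_of_filename name =
  let has suffix = String.ends_with ~suffix name in
  if has "-isvalid.html" then Some Valid
  else if has "-novalid.html" then Some Invalid
  else if has "-haswarn.html" then Some HasWarning
  else None
```

Because the check looks only at the end of the string, it works equally on a bare file name and on a relative path such as the keys used in `messages.json`.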

## Implementation Steps

### Step 1: Create Messages JSON Parser

Create `test/validator_messages.ml` exposing the following interface:

```ocaml
(** Parser for third_party/validator/tests/messages.json *)

type t = (string, string) Hashtbl.t
(** Maps test file path to expected error message *)

val load : string -> t
(** [load path] loads messages.json from [path] *)

val get : t -> string -> string option
(** [get messages test_path] returns expected message for test, if any *)
```

Implementation notes:
- Use `Jsont` library (already a dependency)
- Keys are relative paths like `"html/parser/foo-novalid.html"`
- Values are error message strings with Unicode quotes

### Step 2: Create Test File Discovery

Create logic to find and classify test files, with the following types and interface:

```ocaml
type expected_outcome =
  | Valid       (* -isvalid.html: expect no errors *)
  | Invalid     (* -novalid.html: expect error matching messages.json *)
  | HasWarning  (* -haswarn.html: expect warning matching messages.json *)

type test_file = {
  path : string;           (* Full filesystem path *)
  relative_path : string;  (* Path relative to tests/, used as key in messages.json *)
  category : string;       (* html, html-aria, etc. *)
  expected : expected_outcome;
}

val discover_tests : string -> test_file list
(** [discover_tests tests_dir] finds all test files recursively *)

val parse_outcome : string -> expected_outcome
(** [parse_outcome filename] extracts outcome from filename suffix *)
```

### Step 3: Create Test Runner

Create `test/test_validator.ml`:

```ocaml
(** Test runner for Nu HTML Validator test suite *)

(** Read a whole file into a string (In_channel requires OCaml >= 4.14) *)
let read_file path = In_channel.with_open_bin path In_channel.input_all

(** Run a single test, returns (passed, details).
    [check_message_match] compares message text using the strategy in Step 4. *)
let run_test messages test_file =
  (* 1. Read HTML content *)
  let content = read_file test_file.path in
  (* 2. Run validator *)
  let result =
    Html5_checker.check ~collect_parse_errors:true
      ~system_id:test_file.relative_path
      (Bytesrw.Bytes.Reader.of_string content)
  in
  (* 3. Check result against expected outcome *)
  match test_file.expected with
  | Valid ->
    (* Should have no errors or warnings *)
    let errors = Html5_checker.errors result in
    let warnings = Html5_checker.warnings result in
    if errors = [] && warnings = [] then (true, "OK: No messages")
    else
      (false,
       Printf.sprintf "Expected valid but got %d errors, %d warnings"
         (List.length errors) (List.length warnings))
  | Invalid ->
    (* Should have at least one error matching expected message *)
    let errors = Html5_checker.errors result in
    let expected_msg = Validator_messages.get messages test_file.relative_path in
    if errors = [] then (false, "Expected error but got none")
    else check_message_match errors expected_msg
  | HasWarning ->
    (* Should have at least one warning matching expected message *)
    let warnings = Html5_checker.warnings result in
    let expected_msg = Validator_messages.get messages test_file.relative_path in
    if warnings = [] then (false, "Expected warning but got none")
    else check_message_match warnings expected_msg
```

### Step 4: Message Matching Strategy

The OCaml checker may produce different message text than the Nu validator.
Implement flexible matching:

```ocaml
(** Normalize Unicode curly quotes (U+201C, U+201D) to ASCII double quotes.
    Decodes UTF-8 code point by code point, since an OCaml [char] is a
    single byte and cannot hold a curly quote. Requires OCaml >= 4.14. *)
let normalize_quotes s =
  let buf = Buffer.create (String.length s) in
  let rec loop i =
    if i < String.length s then begin
      let d = String.get_utf_8_uchar s i in
      let u = Uchar.utf_decode_uchar d in
      (match Uchar.to_int u with
       | 0x201C | 0x201D -> Buffer.add_char buf '"'
       | _ -> Buffer.add_utf_8_uchar buf u);
      loop (i + Uchar.utf_decode_length d)
    end
  in
  loop 0;
  Buffer.contents buf

(** [contains ~substring s] is true if [substring] occurs in [s]
    (the standard library has no [String.is_substring]) *)
let contains ~substring s =
  let n = String.length substring and m = String.length s in
  let rec go i = i + n <= m && (String.sub s i n = substring || go (i + 1)) in
  go 0

(** Check if actual message matches expected.
    [extract_core] is a helper not shown here. *)
let message_matches ~expected ~actual =
  (* Strategy 1: Exact match *)
  actual = expected
  (* Strategy 2: Normalized match (ignore quote style) *)
  || normalize_quotes actual = normalize_quotes expected
  (* Strategy 3: Substring match *)
  || contains ~substring:(extract_core expected) actual
```

### Step 5: Test Categories for Selective Running

Map tests to checker categories for phased enablement:

```ocaml
type checker_category =
  | Parse_errors       (* Built into parser *)
  | Nesting            (* Nesting_checker *)
  | Aria               (* Aria_checker *)
  | Required_attrs     (* Required_attr_checker *)
  | Obsolete           (* Obsolete_checker *)
  | Id_uniqueness      (* Id_checker *)
  | Table_structure    (* Table_checker *)
  | Heading_structure  (* Heading_checker *)
  | Form_validation    (* Form_checker *)
  | Microdata          (* Microdata_checker *)
  | Language           (* Language_checker *)
  | Unknown

(** Infer category from test path *)
let categorize_test test_file =
  match test_file.category, extract_subcategory test_file.relative_path with
  | "html-aria", _ -> Aria
  | "html", "parser" -> Parse_errors
  | "html", "elements" -> Nesting  (* mostly *)
  | "html", "attributes" -> Required_attrs
  | "html", "obsolete" -> Obsolete
  | "html", "microdata" -> Microdata
  | _ -> Unknown
```

### Step 6: Dune Build Integration

Add to `test/dune`:

```dune
(executable
 (name test_validator)
 (modules test_validator validator_messages)
 (libraries bytesrw html5rw html5rw.checker jsont jsont.bytesrw test_report))

; glob_files_rec already recurses into subdirectories, so no ** is needed
(rule
 (alias validator-tests)
 (deps
  (glob_files_rec ../third_party/validator/tests/*.html)
  ../third_party/validator/tests/messages.json)
 (action
  (run %{exe:test_validator.exe} ../third_party/validator/tests)))
```

Note: Use a separate alias `validator-tests`
initially (not `runtest`) since many tests will fail until checkers are integrated.

### Step 7: Reporting

Generate reports compatible with existing test infrastructure:

```ocaml
(** Print summary *)
let print_summary results =
  let by_category = group_by_category results in
  List.iter
    (fun (cat, tests) ->
      let passed = List.filter fst tests |> List.length in
      let total = List.length tests in
      Printf.printf "%s: %d/%d passed\n" (category_name cat) passed total)
    by_category;
  (* Overall *)
  let total_passed = List.filter (fun (p, _) -> p) results |> List.length in
  Printf.printf "\nTotal: %d/%d passed\n" total_passed (List.length results)

(** Generate HTML report *)
let write_html_report results filename =
  (* Use Test_report module pattern from other tests *)
  ...
```

## Prerequisites

Before tests can pass, the following must be completed:

### 1. Wire Checker Registry into Html5_checker.check

In `lib/html5_checker/html5_checker.ml`, the `check` and `check_dom` functions currently only collect parse errors. They also need to run the registered checkers:

```ocaml
let check ?collect_parse_errors ?system_id reader =
  let doc = Html5rw.parse reader in
  let collector = Message_collector.create () in
  (* Collect parse errors if requested *)
  if Option.value ~default:false collect_parse_errors then
    Parse_error_bridge.collect_parse_errors ?system_id doc
    |> List.iter (Message_collector.add collector);
  (* TODO: Run checkers - THIS NEEDS TO BE IMPLEMENTED *)
  let registry = Checker_registry.default () in
  Dom_walker.walk_registry registry collector (Html5rw.root doc);
  { document = doc; collector; system_id }
```

### 2. Populate Default Checker Registry

In `lib/html5_checker/checker_registry.ml`:

```ocaml
let default () =
  let reg = create () in
  register reg "nesting" Nesting_checker.checker;
  register reg "aria" Aria_checker.checker;
  register reg "required-attrs" Required_attr_checker.checker;
  register reg "obsolete" Obsolete_checker.checker;
  register reg "id" Id_checker.checker;
  register reg "table" Table_checker.checker;
  register reg "heading" Heading_checker.checker;
  register reg "form" Form_checker.checker;
  register reg "microdata" Microdata_checker.checker;
  register reg "language" Language_checker.checker;
  reg
```

### 3. Ensure Checkers Produce Compatible Messages

Review each checker's error messages against `messages.json` to ensure they can be matched. This may require:
- Using curly quotes in messages
- Matching the Nu validator's phrasing
- Including element/attribute names in the same format

## Phased Rollout

Run tests incrementally as checkers are integrated:

| Phase | Command | What's Tested |
|-------|---------|---------------|
| 1 | `--category=parse` | Parse errors only (~200 tests) |
| 2 | `--category=nesting` | + Nesting checker |
| 3 | `--category=aria` | + ARIA checker (~700 tests) |
| 4 | `--category=required` | + Required attributes |
| 5 | (all) | Full suite |

Implement command-line filtering:

```ocaml
(** Parse a [--category=NAME] argument (names as in the table above) into a
    checker category, so it can be compared with [categorize_test]'s result.
    Unrecognized arguments yield [None], which runs the full suite. *)
let category_of_arg = function
  | "--category=parse" -> Some Parse_errors
  | "--category=nesting" -> Some Nesting
  | "--category=aria" -> Some Aria
  | "--category=required" -> Some Required_attrs
  | _ -> None

let () =
  let tests_dir = Sys.argv.(1) in
  let category_filter =
    if Array.length Sys.argv > 2 then category_of_arg Sys.argv.(2) else None
  in
  let messages = Validator_messages.load (tests_dir ^ "/messages.json") in
  let tests = discover_tests tests_dir in
  let tests =
    match category_filter with
    | Some cat -> List.filter (fun t -> categorize_test t = cat) tests
    | None -> tests
  in
  run_tests messages tests
```

## Expected Test Counts by Category

Based on file counts in `third_party/validator/tests/`:

| Category | Files | Notes |
|----------|-------|-------|
| html/ | 2,601 | Core HTML5 validation |
| html-aria/ | 712 | ARIA attributes |
| html-svg/ | 517 | SVG embedded in HTML |
| html-rdfa/ | 212 | RDFa semantic markup |
| xhtml/ | 110 | XHTML variant |
| html-its/ | 90 | Internationalization |
| html-rdfalite/ | 56 | RDFa Lite |
| **Total** | **~4,300** | |

## Files to Create

| File | Purpose |
|------|---------|
| `test/validator_messages.ml` | Load and query messages.json |
| `test/test_validator.ml` | Main test runner |
| `test/dune` (modify) | Add build rules |

## Success Criteria

1. All `-isvalid.html` tests pass (no false positives)
2. All `-novalid.html` tests produce at least one error
3. All `-haswarn.html` tests produce at least one warning
4. Message content matches for implemented checkers
5. HTML report generated for review

## Reference

- Nu HTML Checker: https://validator.github.io/validator/
- Test harness reference: `third_party/validator/resources/examples/test-harness/validator-tester.py`
- Existing OCaml tests: `test/test_html5lib.ml`, `test/test_tokenizer.ml`
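
## Appendix: Test Discovery Sketch

Step 2 specifies `discover_tests` only by signature. As a starting point, the recursive walk behind it might look like the sketch below, which uses only the standard library. The names `find_html_files` and `category_of` are hypothetical, and the choices to sort directory entries and to join relative paths with `/` (so they line up with `messages.json` keys) are assumptions rather than part of this plan.

```ocaml
(* Hypothetical helper: collect the paths, relative to [root], of all files
   whose name ends in ".html". Entries are visited in sorted order so the
   result is deterministic across runs. *)
let find_html_files root =
  let rec walk rel acc =
    let dir = if rel = "" then root else Filename.concat root rel in
    Sys.readdir dir
    |> Array.to_list
    |> List.sort compare
    |> List.fold_left
         (fun acc name ->
           (* Relative paths use '/' to match the keys in messages.json. *)
           let rel_path = if rel = "" then name else rel ^ "/" ^ name in
           if Sys.is_directory (Filename.concat root rel_path) then
             walk rel_path acc
           else if Filename.check_suffix name ".html" then rel_path :: acc
           else acc)
         acc
  in
  List.rev (walk "" [])

(* Hypothetical helper: the first path component, e.g. "html-aria" for
   "html-aria/misc/x-novalid.html"; "" if the path has no directory part. *)
let category_of rel_path =
  match String.index_opt rel_path '/' with
  | Some i -> String.sub rel_path 0 i
  | None -> ""
```

A full `discover_tests` would map each relative path to a `test_file` record, filling `path` with `Filename.concat tests_dir rel_path`, `category` from `category_of`, and `expected` from the filename suffix.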