Plan: Nu HTML Validator Test Suite Integration#
This document describes how to run the Nu HTML Validator test suite against the OCaml html5_checker library.
Background#
The Nu HTML Checker (vnu) is the W3C's official HTML validator. Its test suite in third_party/validator/tests/ contains ~4,300 HTML test files covering HTML5 validation rules including ARIA, microdata, element nesting, required attributes, and more.
Test Suite Structure#
Location#
third_party/validator/tests/
├── messages.json # Expected error messages keyed by test path
├── html/ # 2,601 core HTML5 tests
│ ├── attributes/
│ ├── elements/
│ ├── microdata/
│ ├── mime-types/
│ ├── obsolete/
│ ├── parser/
│ └── ...
├── html-aria/ # 712 ARIA validation tests
├── html-its/ # 90 internationalization tests
├── html-rdfa/ # 212 RDFa tests
├── html-rdfalite/ # 56 RDFa Lite tests
├── html-svg/ # 517 SVG-in-HTML tests
└── xhtml/ # 110 XHTML tests
Filename Convention#
Test files use a suffix to indicate expected outcome:
| Suffix | Meaning | Expected Result |
|---|---|---|
| -isvalid.html | Valid HTML | No errors, no warnings |
| -novalid.html | Invalid HTML | At least one error |
| -haswarn.html | Valid with warning | At least one warning |
Expected Messages (messages.json)#
For -novalid and -haswarn files, messages.json contains the expected error/warning message:
{
"html-aria/misc/aria-label-div-novalid.html": "The “aria-label” attribute must not be specified on any “div” element unless...",
"html/mime-types/001-novalid.html": "Bad value \"text/html \" for attribute \"type\" on element \"link\": Bad MIME type..."
}
Note: Messages use Unicode curly quotes (U+201C “ and U+201D ”).
Implementation Steps#
Step 1: Create Messages JSON Parser#
Create test/validator_messages.ml with the following interface:
(** Parser for third_party/validator/tests/messages.json *)
type t = (string, string) Hashtbl.t
(** Maps test file path to expected error message *)
val load : string -> t
(** [load path] loads messages.json from [path] *)
val get : t -> string -> string option
(** [get messages test_path] returns expected message for test, if any *)
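The lookup half of the interface above needs nothing beyond the standard library. The sketch below assumes the JSON decoding step (left out here, since it depends on the Jsont API) has already produced a list of (path, message) pairs. `of_pairs` is a hypothetical helper name, not part of the planned interface.

```ocaml
(* Sketch only: the table type and [get] from the interface above, plus a
   hypothetical [of_pairs] standing in for the decoded messages.json. *)
type t = (string, string) Hashtbl.t

(* Build the table from already-decoded (test path, message) pairs. *)
let of_pairs (pairs : (string * string) list) : t =
  let tbl = Hashtbl.create (List.length pairs) in
  List.iter (fun (path, msg) -> Hashtbl.replace tbl path msg) pairs;
  tbl

(* Return the expected message for a test path, if any. *)
let get (messages : t) (test_path : string) : string option =
  Hashtbl.find_opt messages test_path
```

Keeping `get` total (returning an option) lets the runner decide what a missing entry means for a given test.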
Implementation notes:
- Use the Jsont library (already a dependency)
- Keys are relative paths like "html/parser/foo-novalid.html"
- Values are error message strings with Unicode quotes
Step 2: Create Test File Discovery#
Create logic to find and classify test files:
type expected_outcome =
| Valid (* -isvalid.html: expect no errors *)
| Invalid (* -novalid.html: expect error matching messages.json *)
| HasWarning (* -haswarn.html: expect warning matching messages.json *)
type test_file = {
path : string; (* Full filesystem path *)
relative_path : string; (* Path relative to tests/, used as key in messages.json *)
category : string; (* html, html-aria, etc. *)
expected : expected_outcome;
}
val discover_tests : string -> test_file list
(** [discover_tests tests_dir] finds all test files recursively *)
val parse_outcome : string -> expected_outcome
(** [parse_outcome filename] extracts outcome from filename suffix *)
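A minimal sketch of the discovery step follows, using only the standard library (OCaml 4.13 or later for `String.ends_with`). Two things here are assumptions rather than part of the plan: the classifier returns an option so that files such as messages.json can be skipped, and the extension is dropped before the suffix is compared, on the guess that files with other extensions (for example under xhtml/) follow the same stem convention. The names `parse_outcome_opt` and `discover` are illustrative.

```ocaml
(* Sketch only: classify a test file by its filename suffix and walk the
   tests directory. Returning an option for unrecognized names is an
   assumption; the planned signature returns [expected_outcome] directly. *)
type expected_outcome = Valid | Invalid | HasWarning

let parse_outcome_opt (filename : string) : expected_outcome option =
  (* Drop the extension so only the "-isvalid" style stem suffix is compared. *)
  let stem = Filename.remove_extension (Filename.basename filename) in
  let has suffix = String.ends_with ~suffix stem in
  if has "-isvalid" then Some Valid
  else if has "-novalid" then Some Invalid
  else if has "-haswarn" then Some HasWarning
  else None

(* Recursively collect (relative_path, outcome) pairs below [tests_dir].
   Relative paths use '/' so they can serve as messages.json keys. *)
let discover (tests_dir : string) : (string * expected_outcome) list =
  let rec walk rel acc =
    let dir = if rel = "" then tests_dir else Filename.concat tests_dir rel in
    Array.fold_left
      (fun acc name ->
        let rel' = if rel = "" then name else rel ^ "/" ^ name in
        if Sys.is_directory (Filename.concat dir name) then walk rel' acc
        else
          match parse_outcome_opt name with
          | Some outcome -> (rel', outcome) :: acc
          | None -> acc)
      acc (Sys.readdir dir)
  in
  List.rev (walk "" [])
```

`Sys.readdir` does not guarantee an order, so a runner that wants stable reports would sort the result by relative path.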
Step 3: Create Test Runner#
Create test/test_validator.ml:
(** Test runner for Nu HTML Validator test suite *)
(** Run a single test, returns (passed, details) *)
let run_test messages test_file =
(* 1. Read HTML content *)
let content = read_file test_file.path in
(* 2. Run validator *)
let result = Html5_checker.check
~collect_parse_errors:true
~system_id:test_file.relative_path
(Bytesrw.Bytes.Reader.of_string content) in
(* 3. Check result against expected outcome *)
match test_file.expected with
| Valid ->
(* Should have no errors or warnings *)
let errors = Html5_checker.errors result in
let warnings = Html5_checker.warnings result in
if errors = [] && warnings = [] then
(true, "OK: No messages")
else
(false, Printf.sprintf "Expected valid but got %d errors, %d warnings"
(List.length errors) (List.length warnings))
| Invalid ->
(* Should have at least one error matching expected message *)
let errors = Html5_checker.errors result in
let expected_msg = Validator_messages.get messages test_file.relative_path in
if errors = [] then
(false, "Expected error but got none")
else
check_message_match errors expected_msg
| HasWarning ->
(* Should have at least one warning matching expected message *)
let warnings = Html5_checker.warnings result in
let expected_msg = Validator_messages.get messages test_file.relative_path in
if warnings = [] then
(false, "Expected warning but got none")
else
check_message_match warnings expected_msg
Step 4: Message Matching Strategy#
The OCaml checker may produce different message text than the Nu validator. Implement flexible matching:
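(** Normalize Unicode curly quotes (U+201C, U+201D) to ASCII '"'.
    String.map cannot do this: an OCaml char is one byte, while each
    curly quote is the three-byte UTF-8 sequence E2 80 9C or E2 80 9D. *)
let normalize_quotes s =
  let n = String.length s in
  let b = Buffer.create n in
  let rec go i =
    if i < n then
      if i + 2 < n && s.[i] = '\xE2' && s.[i + 1] = '\x80'
         && (s.[i + 2] = '\x9C' || s.[i + 2] = '\x9D')
      then (Buffer.add_char b '"'; go (i + 3))
      else (Buffer.add_char b s.[i]; go (i + 1))
  in
  go 0;
  Buffer.contents b

(** Check if actual message matches expected *)
let message_matches ~expected ~actual =
  (* Strategy 1: Exact match *)
  if actual = expected then true
  (* Strategy 2: Normalized match (ignore quote style) *)
  else if normalize_quotes actual = normalize_quotes expected then true
  (* Strategy 3: Substring match. is_substring and extract_core are helpers
     to be supplied; the standard library has no String.is_substring. *)
  else is_substring actual ~substring:(extract_core expected)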
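The runner in Step 3 calls `check_message_match`, which this plan does not define, and the substring strategy needs a substring test that the standard library does not provide. The sketch below fills both gaps under stated assumptions: messages are treated as plain strings (the real message type of Html5_checker may differ), and a test with no entry in messages.json passes as soon as any message is present.

```ocaml
(* Sketch only: a stdlib substring test plus a hypothetical
   [check_message_match] over plain message strings. *)

(* True if [substring] occurs anywhere in [s]. *)
let is_substring (s : string) ~(substring : string) : bool =
  let n = String.length s and m = String.length substring in
  let rec at i = i + m <= n && (String.sub s i m = substring || at (i + 1)) in
  at 0

(* Pass if any actual message contains the expected text. With no entry in
   messages.json, the caller's non-empty check is the only requirement. *)
let check_message_match (actual : string list) (expected : string option)
    : bool * string =
  match expected with
  | None -> (true, "OK: no expected message recorded")
  | Some exp ->
    if List.exists (fun a -> is_substring a ~substring:exp) actual then
      (true, "OK: message matched")
    else
      (false, Printf.sprintf "No message matched expected text: %s" exp)
```

The quadratic substring scan is adequate for short validator messages; quote normalization can be applied to both sides before calling it.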
Step 5: Test Categories for Selective Running#
Map tests to checker categories for phased enablement:
type checker_category =
| Parse_errors (* Built into parser *)
| Nesting (* Nesting_checker *)
| Aria (* Aria_checker *)
| Required_attrs (* Required_attr_checker *)
| Obsolete (* Obsolete_checker *)
| Id_uniqueness (* Id_checker *)
| Table_structure (* Table_checker *)
| Heading_structure (* Heading_checker *)
| Form_validation (* Form_checker *)
| Microdata (* Microdata_checker *)
| Language (* Language_checker *)
| Unknown
(** Infer category from test path *)
let categorize_test test_file =
match test_file.category, extract_subcategory test_file.relative_path with
| "html-aria", _ -> Aria
| "html", "parser" -> Parse_errors
| "html", "elements" -> Nesting (* mostly *)
| "html", "attributes" -> Required_attrs
| "html", "obsolete" -> Obsolete
| "html", "microdata" -> Microdata
| _ -> Unknown
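`categorize_test` relies on `extract_subcategory`, which is not defined in this plan. One possible reading, sketched below as an assumption, is that it returns the directory directly under the top-level category and an empty string when the file sits at the category root.

```ocaml
(* Sketch only: one possible [extract_subcategory]. The name comes from the
   code above; the behaviour is an assumption. It returns the second path
   component, or "" when the path has no subdirectory. *)
let extract_subcategory (relative_path : string) : string =
  match String.split_on_char '/' relative_path with
  | _category :: sub :: _file :: _ -> sub
  | _ -> ""
```

For example, "html/parser/foo-novalid.html" yields "parser", which is the value the match above compares against.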
Step 6: Dune Build Integration#
Add to test/dune:
(executable
 (name test_validator)
 (modules test_validator validator_messages)
 (libraries bytesrw html5rw html5rw.checker jsont jsont.bytesrw test_report))

(rule
 (alias validator-tests)
 (deps
  ; glob_files_rec recurses into subdirectories itself; the glob applies
  ; to file names only, so no ** path component is needed
  (glob_files_rec ../third_party/validator/tests/*.html)
  ../third_party/validator/tests/messages.json)
 (action
  (run %{exe:test_validator.exe} ../third_party/validator/tests)))
Note: Use separate alias validator-tests initially (not runtest) since many tests will fail until checkers are integrated.
Step 7: Reporting#
Generate reports compatible with existing test infrastructure:
(** Print summary *)
let print_summary results =
let by_category = group_by_category results in
List.iter (fun (cat, tests) ->
let passed = List.filter fst tests |> List.length in
let total = List.length tests in
Printf.printf "%s: %d/%d passed\n" (category_name cat) passed total
) by_category;
(* Overall *)
let total_passed = List.filter (fun (p, _) -> p) results |> List.length in
Printf.printf "\nTotal: %d/%d passed\n" total_passed (List.length results)
(** Generate HTML report *)
let write_html_report results filename =
(* Use Test_report module pattern from other tests *)
...
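`print_summary` depends on `group_by_category`, which is left undefined above. The sketch below is one way to write it, assuming each result is paired with its category up front as (category, (passed, details)). That result shape is an assumption; the plan does not fix how a result is tied back to its test.

```ocaml
(* Sketch only: group (category, (passed, details)) results by category.
   The input shape is an assumption, not something this plan specifies. *)
let group_by_category (results : ('cat * (bool * string)) list)
    : ('cat * (bool * string) list) list =
  let tbl = Hashtbl.create 16 in
  let order = ref [] in
  List.iter
    (fun (cat, r) ->
      match Hashtbl.find_opt tbl cat with
      | Some rs -> Hashtbl.replace tbl cat (r :: rs)
      | None ->
        order := cat :: !order;
        Hashtbl.replace tbl cat [ r ])
    results;
  (* Categories in first-seen order, results in input order. *)
  List.rev_map (fun cat -> (cat, List.rev (Hashtbl.find tbl cat))) !order
```

With this shape, the overall count would read the pass flag with `fun (_, (p, _)) -> p` instead of matching directly on a pair.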
Prerequisites#
Before tests can pass, the following must be completed:
1. Wire Checker Registry into Html5_checker.check#
In lib/html5_checker/html5_checker.ml, the check and check_dom functions currently only collect parse errors. They also need to run the registered checkers over the parsed document, along these lines:
let check ?collect_parse_errors ?system_id reader =
let doc = Html5rw.parse reader in
let collector = Message_collector.create () in
(* Collect parse errors if requested *)
if Option.value ~default:false collect_parse_errors then
Parse_error_bridge.collect_parse_errors ?system_id doc
|> List.iter (Message_collector.add collector);
(* TODO: Run checkers - THIS NEEDS TO BE IMPLEMENTED *)
let registry = Checker_registry.default () in
Dom_walker.walk_registry registry collector (Html5rw.root doc);
{ document = doc; collector; system_id }
2. Populate Default Checker Registry#
In lib/html5_checker/checker_registry.ml:
let default () =
let reg = create () in
register reg "nesting" Nesting_checker.checker;
register reg "aria" Aria_checker.checker;
register reg "required-attrs" Required_attr_checker.checker;
register reg "obsolete" Obsolete_checker.checker;
register reg "id" Id_checker.checker;
register reg "table" Table_checker.checker;
register reg "heading" Heading_checker.checker;
register reg "form" Form_checker.checker;
register reg "microdata" Microdata_checker.checker;
register reg "language" Language_checker.checker;
reg
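To make the wiring described in these two prerequisites concrete, here is a toy model of the pattern: every registered checker is applied to every element during one tree walk, and the messages are gathered in order. None of the types or functions below are the real Checker_registry, Dom_walker, or Message_collector API; they are stand-ins invented for illustration.

```ocaml
(* Toy model only: not the real Checker_registry or Dom_walker API. It shows
   the pattern of running every registered checker over every element. *)
type node = { tag : string; children : node list }

(* A checker returns the messages it has for a single element. *)
type checker = node -> string list

let registry : (string * checker) list ref = ref []

(* Append so that checkers run in registration order. *)
let register (name : string) (c : checker) : unit =
  registry := !registry @ [ (name, c) ]

(* Visit the tree depth-first, applying each checker to each element. *)
let rec walk (n : node) : string list =
  let here = List.concat_map (fun (_, check) -> check n) !registry in
  here @ List.concat_map walk n.children

let () =
  (* Hypothetical checker with a placeholder message, not Nu's wording. *)
  register "obsolete" (fun n ->
      if n.tag = "center" then [ "obsolete element: center" ] else [])
```

A single walk shared by all checkers keeps the cost at one traversal per document, which matters when the suite runs thousands of files.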
3. Ensure Checkers Produce Compatible Messages#
Review each checker's error messages against messages.json to ensure they can be matched. May need to:
- Use curly quotes in messages
- Match Nu validator's phrasing
- Include element/attribute names in same format
Phased Rollout#
Run tests incrementally as checkers are integrated:
| Phase | Command | What's Tested |
|---|---|---|
| 1 | --category=parse | Parse errors only (~200 tests) |
| 2 | --category=nesting | + Nesting checker |
| 3 | --category=aria | + ARIA checker (~700 tests) |
| 4 | --category=required | + Required attributes |
| 5 | (all) | Full suite |
Implement command-line filtering:
let () =
let tests_dir = Sys.argv.(1) in
let category_filter =
if Array.length Sys.argv > 2 then Some Sys.argv.(2) else None in
let messages = Validator_messages.load (tests_dir ^ "/messages.json") in
let tests = discover_tests tests_dir in
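let tests = match category_filter with
| Some cat ->
(* categorize_test returns a checker_category while the filter is a string,
   so compare by name; category_name must yield the command-line names *)
List.filter (fun t -> category_name (categorize_test t) = cat) tests
| None -> tests in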
run_tests messages tests
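The phase table passes the filter as `--category=NAME`, whereas the code above reads a bare second argument. A small parser can bridge the two. The sketch below is an assumption about how that might look: the flag names are taken from the phase table, the variant is cut down to the four categories listed there, and `category_of_string` and `parse_category_arg` are illustrative names. It needs OCaml 4.13 or later for `String.starts_with`.

```ocaml
(* Sketch only: parse an optional "--category=NAME" argument. The mapping
   and the reduced variant are assumptions based on the phase table. *)
type checker_category = Parse_errors | Nesting | Aria | Required_attrs

let category_of_string = function
  | "parse" -> Some Parse_errors
  | "nesting" -> Some Nesting
  | "aria" -> Some Aria
  | "required" -> Some Required_attrs
  | _ -> None

(* [Ok None] means no filter; [Error _] reports an unusable argument. *)
let parse_category_arg (arg : string option)
    : (checker_category option, string) result =
  let prefix = "--category=" in
  match arg with
  | None -> Ok None
  | Some a when String.starts_with ~prefix a ->
    let plen = String.length prefix in
    let name = String.sub a plen (String.length a - plen) in
    (match category_of_string name with
     | Some c -> Ok (Some c)
     | None -> Error ("unknown category: " ^ name))
  | Some a -> Error ("unrecognized argument: " ^ a)
```

Rejecting unknown names with an error avoids a silent run over zero tests when the flag is mistyped.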
Expected Test Counts by Category#
Based on file counts in third_party/validator/tests/:
| Category | Files | Notes |
|---|---|---|
| html/ | 2,601 | Core HTML5 validation |
| html-aria/ | 712 | ARIA attributes |
| html-svg/ | 517 | SVG embedded in HTML |
| html-rdfa/ | 212 | RDFa semantic markup |
| xhtml/ | 110 | XHTML variant |
| html-its/ | 90 | Internationalization |
| html-rdfalite/ | 56 | RDFa Lite |
| Total | ~4,300 | |
Files to Create#
| File | Purpose |
|---|---|
| test/validator_messages.ml | Load and query messages.json |
| test/test_validator.ml | Main test runner |
| test/dune (modify) | Add build rules |
Success Criteria#
- All -isvalid.html tests pass (no false positives)
- All -novalid.html tests produce at least one error
- All -haswarn.html tests produce at least one warning
- Message content matches for implemented checkers
- HTML report generated for review
Reference#
- Nu HTML Checker: https://validator.github.io/validator/
- Test harness reference: third_party/validator/resources/examples/test-harness/validator-tester.py
- Existing OCaml tests: test/test_html5lib.ml, test/test_tokenizer.ml