OCaml HTML5 parser/serialiser based on Python's JustHTML

Plan: Nu HTML Validator Test Suite Integration

This document describes how to run the Nu HTML Validator test suite against the OCaml html5_checker library.

Background

The Nu HTML Checker (vnu) is the W3C's official HTML validator. Its test suite in third_party/validator/tests/ contains ~4,300 HTML test files covering HTML5 validation rules including ARIA, microdata, element nesting, required attributes, and more.

Test Suite Structure

Location

third_party/validator/tests/
├── messages.json          # Expected error messages keyed by test path
├── html/                  # 2,601 core HTML5 tests
│   ├── attributes/
│   ├── elements/
│   ├── microdata/
│   ├── mime-types/
│   ├── obsolete/
│   ├── parser/
│   └── ...
├── html-aria/             # 712 ARIA validation tests
├── html-its/              # 90 internationalization tests
├── html-rdfa/             # 212 RDFa tests
├── html-rdfalite/         # 56 RDFa Lite tests
├── html-svg/              # 517 SVG-in-HTML tests
└── xhtml/                 # 110 XHTML tests

Filename Convention

Test files use a suffix to indicate expected outcome:

Suffix          Meaning             Expected Result
-isvalid.html   Valid HTML          No errors, no warnings
-novalid.html   Invalid HTML        At least one error
-haswarn.html   Valid with warning  At least one warning

Expected Messages (messages.json)

For -novalid and -haswarn files, messages.json contains the expected error/warning message:

{
  "html-aria/misc/aria-label-div-novalid.html": "The “aria-label” attribute must not be specified on any “div” element unless...",
  "html/mime-types/001-novalid.html": "Bad value “text/html ” for attribute “type” on element “link”: Bad MIME type..."
}

Note: Messages use Unicode curly quotes (U+201C “ and U+201D ”).

Implementation Steps

Step 1: Create Messages JSON Parser

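Create test/validator_messages.ml exposing the following interface (the val declarations themselves belong in a matching test/validator_messages.mli):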

(** Parser for third_party/validator/tests/messages.json *)

type t = (string, string) Hashtbl.t
(** Maps test file path to expected error message *)

val load : string -> t
(** [load path] loads messages.json from [path] *)

val get : t -> string -> string option
(** [get messages test_path] returns expected message for test, if any *)

Implementation notes:

  • Use Jsont library (already a dependency)
  • Keys are relative paths like "html/parser/foo-novalid.html"
  • Values are error message strings with Unicode quotes

Step 2: Create Test File Discovery

Create logic to find and classify test files:

type expected_outcome =
  | Valid      (* -isvalid.html: expect no errors *)
  | Invalid    (* -novalid.html: expect error matching messages.json *)
  | HasWarning (* -haswarn.html: expect warning matching messages.json *)

type test_file = {
  path : string;           (* Full filesystem path *)
  relative_path : string;  (* Path relative to tests/, used as key in messages.json *)
  category : string;       (* html, html-aria, etc. *)
  expected : expected_outcome;
}

val discover_tests : string -> test_file list
(** [discover_tests tests_dir] finds all test files recursively *)

val parse_outcome : string -> expected_outcome
(** [parse_outcome filename] extracts outcome from filename suffix *)
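
The classification and the recursive walk could be sketched as follows, using only the standard library. This is a minimal sketch, not the project's implementation: it assumes OCaml 4.13 or later (for String.ends_with), the names parse_outcome_opt, category_of and discover are illustrative, and returning an option instead of a bare expected_outcome is a choice made here so that files without a recognised suffix (such as messages.json) can simply be skipped.

```ocaml
type expected_outcome = Valid | Invalid | HasWarning

(* Classify a file name by its suffix; None means "not a test file". *)
let parse_outcome_opt filename =
  if String.ends_with ~suffix:"-isvalid.html" filename then Some Valid
  else if String.ends_with ~suffix:"-novalid.html" filename then Some Invalid
  else if String.ends_with ~suffix:"-haswarn.html" filename then Some HasWarning
  else None

(* First component of a '/'-separated relative path, e.g. "html-aria".
   Returns "" for a path with no directory part. *)
let category_of relative_path =
  match String.split_on_char '/' relative_path with
  | first :: _ :: _ -> first
  | _ -> ""

(* Recursively collect (relative_path, outcome) pairs below [tests_dir].
   Relative paths are joined with '/' so they can be used as keys into
   messages.json. The result is sorted for reproducible runs. *)
let discover tests_dir =
  let rec walk rel acc =
    let dir = if rel = "" then tests_dir else Filename.concat tests_dir rel in
    Array.fold_left
      (fun acc name ->
        let rel' = if rel = "" then name else rel ^ "/" ^ name in
        if Sys.is_directory (Filename.concat dir name) then walk rel' acc
        else
          match parse_outcome_opt name with
          | Some outcome -> (rel', outcome) :: acc
          | None -> acc)
      acc (Sys.readdir dir)
  in
  List.sort compare (walk "" [])
```

Building the full test_file record from each pair is then a matter of filling in path with Filename.concat tests_dir relative_path and category with category_of relative_path.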

Step 3: Create Test Runner

Create test/test_validator.ml:

(** Test runner for Nu HTML Validator test suite *)

(** Run a single test, returns (passed, details) *)
let run_test messages test_file =
  (* 1. Read HTML content *)
  let content = read_file test_file.path in

  (* 2. Run validator *)
  let result = Html5_checker.check
    ~collect_parse_errors:true
    ~system_id:test_file.relative_path
    (Bytesrw.Bytes.Reader.of_string content) in

  (* 3. Check result against expected outcome *)
  match test_file.expected with
  | Valid ->
      (* Should have no errors or warnings *)
      let errors = Html5_checker.errors result in
      let warnings = Html5_checker.warnings result in
      if errors = [] && warnings = [] then
        (true, "OK: No messages")
      else
        (false, Printf.sprintf "Expected valid but got %d errors, %d warnings"
          (List.length errors) (List.length warnings))

  | Invalid ->
      (* Should have at least one error matching expected message *)
      let errors = Html5_checker.errors result in
      let expected_msg = Validator_messages.get messages test_file.relative_path in
      if errors = [] then
        (false, "Expected error but got none")
      else
        check_message_match errors expected_msg

  | HasWarning ->
      (* Should have at least one warning matching expected message *)
      let warnings = Html5_checker.warnings result in
      let expected_msg = Validator_messages.get messages test_file.relative_path in
      if warnings = [] then
        (false, "Expected warning but got none")
      else
        check_message_match warnings expected_msg
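
The runner above relies on two helpers it does not define, read_file and check_message_match. A possible sketch of both is given here under stated assumptions: the checker's messages are assumed to have already been reduced to plain strings (the real message type of Html5_checker is not shown in this plan), a test with no entry in messages.json is treated as passing once at least one message was produced, and the comparison is a plain equality or substring test that the matching strategy of Step 4 would replace. In_channel requires OCaml 4.14 or later; contains is a hypothetical helper.

```ocaml
(* Read a whole file into a string (In_channel needs OCaml 4.14+). *)
let read_file path = In_channel.with_open_bin path In_channel.input_all

(* True if [sub] occurs somewhere in [s]; the standard library has no
   substring search, so this is a naive scan. *)
let contains ~sub s =
  let n = String.length s and m = String.length sub in
  let rec at i = i + m <= n && (String.sub s i m = sub || at (i + 1)) in
  at 0

(* Assumption: [actual] holds the text of each error or warning.
   Returns (passed, details) like run_test. *)
let check_message_match (actual : string list) (expected : string option) =
  match expected with
  | None -> (true, "OK: no expected message recorded in messages.json")
  | Some msg ->
      if List.exists (fun a -> a = msg || contains ~sub:msg a) actual then
        (true, "OK: message matched")
      else
        ( false,
          Printf.sprintf "Expected %S, got: %s" msg
            (String.concat " | " actual) )
```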

Step 4: Message Matching Strategy

The OCaml checker may produce different message text than the Nu validator. Implement flexible matching:

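(** Normalize Unicode curly quotes to ASCII. OCaml chars are single bytes,
    so the UTF-8 encodings of U+201C (E2 80 9C) and U+201D (E2 80 9D) are
    matched as three-byte sequences; String.map cannot do this. *)
let normalize_quotes s =
  let n = String.length s in
  let b = Buffer.create n in
  let rec go i =
    if i >= n then Buffer.contents b
    else if i + 2 < n && s.[i] = '\xE2' && s.[i + 1] = '\x80'
            && (s.[i + 2] = '\x9C' || s.[i + 2] = '\x9D')
    then (Buffer.add_char b '"'; go (i + 3))
    else (Buffer.add_char b s.[i]; go (i + 1))
  in
  go 0

(** [contains ~sub s] tests whether [sub] occurs in [s]
    (the standard library has no substring search). *)
let contains ~sub s =
  let n = String.length s and m = String.length sub in
  let rec at i = i + m <= n && (String.sub s i m = sub || at (i + 1)) in
  at 0

(** Check if actual message matches expected.
    [extract_core] is not defined in this plan yet. *)
let message_matches ~expected ~actual =
  (* Strategy 1: Exact match *)
  actual = expected
  (* Strategy 2: Normalized match (ignore quote style) *)
  || normalize_quotes actual = normalize_quotes expected
  (* Strategy 3: Substring match *)
  || contains ~sub:(extract_core expected) actual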

Step 5: Test Categories for Selective Running

Map tests to checker categories for phased enablement:

type checker_category =
  | Parse_errors      (* Built into parser *)
  | Nesting           (* Nesting_checker *)
  | Aria              (* Aria_checker *)
  | Required_attrs    (* Required_attr_checker *)
  | Obsolete          (* Obsolete_checker *)
  | Id_uniqueness     (* Id_checker *)
  | Table_structure   (* Table_checker *)
  | Heading_structure (* Heading_checker *)
  | Form_validation   (* Form_checker *)
  | Microdata         (* Microdata_checker *)
  | Language          (* Language_checker *)
  | Unknown

(** Infer category from test path *)
let categorize_test test_file =
  match test_file.category, extract_subcategory test_file.relative_path with
  | "html-aria", _ -> Aria
  | "html", "parser" -> Parse_errors
  | "html", "elements" -> Nesting  (* mostly *)
  | "html", "attributes" -> Required_attrs
  | "html", "obsolete" -> Obsolete
  | "html", "microdata" -> Microdata
  | _ -> Unknown
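
categorize_test depends on extract_subcategory, which is not defined above. One way to write it, assuming relative paths always use '/' as separator (as the messages.json keys do) and that the subcategory is the second path component:

```ocaml
(* Second component of a '/'-separated relative path, e.g.
   "html/parser/foo-novalid.html" gives "parser". Returns "" when the
   file sits directly inside the category directory or at the top level. *)
let extract_subcategory relative_path =
  match String.split_on_char '/' relative_path with
  | _category :: sub :: _ :: _ -> sub
  | _ -> ""
```

Returning the empty string for shallow paths keeps the match in categorize_test total: such tests fall through to the Unknown case.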

Step 6: Dune Build Integration

Add to test/dune:

(executable
 (name test_validator)
 (modules test_validator validator_messages)
 (libraries bytesrw html5rw html5rw.checker jsont jsont.bytesrw test_report))

(rule
 (alias validator-tests)
 (deps
  (glob_files_rec ../third_party/validator/tests/*.html)
  ../third_party/validator/tests/messages.json)
 (action
  (run %{exe:test_validator.exe} ../third_party/validator/tests)))

Note: Use a separate alias, validator-tests, initially (not runtest), since many tests will fail until the checkers are integrated.

Step 7: Reporting

Generate reports compatible with existing test infrastructure:

(** Print summary *)
let print_summary results =
  let by_category = group_by_category results in
  List.iter (fun (cat, tests) ->
    let passed = List.filter fst tests |> List.length in
    let total = List.length tests in
    Printf.printf "%s: %d/%d passed\n" (category_name cat) passed total
  ) by_category;

  (* Overall *)
  let total_passed = List.filter (fun (p, _) -> p) results |> List.length in
  Printf.printf "\nTotal: %d/%d passed\n" total_passed (List.length results)

(** Generate HTML report *)
let write_html_report results filename =
  (* Use Test_report module pattern from other tests *)
  ...
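
print_summary calls group_by_category without defining it. The sketch below shows one possible shape. It assumes each result is paired with its category, that is (category, (passed, details)), which is an assumption of this example and not something the plan specifies; group_by is a hypothetical generic helper.

```ocaml
(* Group items by [key], keeping groups in first-seen order and items in
   their original order within each group. *)
let group_by key items =
  let tbl = Hashtbl.create 16 in
  let order = ref [] in
  List.iter
    (fun x ->
      let k = key x in
      match Hashtbl.find_opt tbl k with
      | Some xs -> Hashtbl.replace tbl k (x :: xs)
      | None ->
          order := k :: !order;
          Hashtbl.replace tbl k [ x ])
    items;
  (* [order] and each bucket were built in reverse, so reverse both. *)
  List.rev_map (fun k -> (k, List.rev (Hashtbl.find tbl k))) !order

(* Assumed input shape: (category, (passed, details)) list. *)
let group_by_category results =
  group_by fst results
  |> List.map (fun (cat, rs) -> (cat, List.map snd rs))
```

With this shape, each group is a list of (passed, details) pairs, so the per-category count in print_summary (List.filter fst tests) works unchanged.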

Prerequisites

Before tests can pass, the following must be completed:

1. Wire Checker Registry into Html5_checker.check

In lib/html5_checker/html5_checker.ml, the check and check_dom functions currently only collect parse errors. They also need to run the registered checkers over the parsed document, along these lines:

let check ?collect_parse_errors ?system_id reader =
  let doc = Html5rw.parse reader in
  let collector = Message_collector.create () in

  (* Collect parse errors if requested *)
  if Option.value ~default:false collect_parse_errors then
    Parse_error_bridge.collect_parse_errors ?system_id doc
    |> List.iter (Message_collector.add collector);

  (* TODO: Run checkers - THIS NEEDS TO BE IMPLEMENTED *)
  let registry = Checker_registry.default () in
  Dom_walker.walk_registry registry collector (Html5rw.root doc);

  { document = doc; collector; system_id }

2. Populate Default Checker Registry

In lib/html5_checker/checker_registry.ml:

let default () =
  let reg = create () in
  register reg "nesting" Nesting_checker.checker;
  register reg "aria" Aria_checker.checker;
  register reg "required-attrs" Required_attr_checker.checker;
  register reg "obsolete" Obsolete_checker.checker;
  register reg "id" Id_checker.checker;
  register reg "table" Table_checker.checker;
  register reg "heading" Heading_checker.checker;
  register reg "form" Form_checker.checker;
  register reg "microdata" Microdata_checker.checker;
  register reg "language" Language_checker.checker;
  reg

3. Ensure Checkers Produce Compatible Messages

Review each checker's error messages against messages.json to ensure they can be matched. The checkers may need to:

  • Use curly quotes in messages
  • Match Nu validator's phrasing
  • Include element/attribute names in same format
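
As an illustration of the first and third points, a checker could build its messages through a small quoting helper. The sketch below is hypothetical: quote and bad_value are names invented for this example, and the message shape is taken from the "Bad value ..." entry quoted from messages.json earlier. It relies on the "\u{...}" string escape (OCaml 4.06 or later), which emits the code point as UTF-8.

```ocaml
(* Wrap [s] in U+201C and U+201D. Each quote is three bytes in UTF-8
   (E2 80 9C and E2 80 9D). *)
let quote s = "\u{201C}" ^ s ^ "\u{201D}"

(* Hypothetical message builder following the phrasing of the
   messages.json example for bad attribute values. *)
let bad_value ~value ~attr ~element =
  Printf.sprintf "Bad value %s for attribute %s on element %s"
    (quote value) (quote attr) (quote element)
```

Centralising the quoting in one function means a later decision to emit ASCII quotes instead only touches one place.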

Phased Rollout

Run tests incrementally as checkers are integrated:

Phase  Command              What's Tested
1      --category=parse     Parse errors only (~200 tests)
2      --category=nesting   + Nesting checker
3      --category=aria      + ARIA checker (~700 tests)
4      --category=required  + Required attributes
5      (all)                Full suite

Implement command-line filtering:

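(** Map a [--category=NAME] argument to a checker category. *)
let category_of_arg = function
  | "--category=parse" -> Some Parse_errors
  | "--category=nesting" -> Some Nesting
  | "--category=aria" -> Some Aria
  | "--category=required" -> Some Required_attrs
  | _ -> None

let () =
  let tests_dir = Sys.argv.(1) in
  let category_filter =
    if Array.length Sys.argv > 2 then category_of_arg Sys.argv.(2) else None in

  let messages =
    Validator_messages.load (Filename.concat tests_dir "messages.json") in
  let tests = discover_tests tests_dir in
  let tests = match category_filter with
    | Some cat -> List.filter (fun t -> categorize_test t = cat) tests
    | None -> tests in

  run_tests messages tests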

Expected Test Counts by Category

Based on file counts in third_party/validator/tests/:

Category        Files   Notes
html/           2,601   Core HTML5 validation
html-aria/      712     ARIA attributes
html-svg/       517     SVG embedded in HTML
html-rdfa/      212     RDFa semantic markup
xhtml/          110     XHTML variant
html-its/       90      Internationalization
html-rdfalite/  56      RDFa Lite
Total           ~4,300

Files to Create

File                        Purpose
test/validator_messages.ml  Load and query messages.json
test/test_validator.ml      Main test runner
test/dune (modify)          Add build rules

Success Criteria

  1. All -isvalid.html tests pass (no false positives)
  2. All -novalid.html tests produce at least one error
  3. All -haswarn.html tests produce at least one warning
  4. Message content matches for implemented checkers
  5. HTML report generated for review

Reference

  • Nu HTML Checker: https://validator.github.io/validator/
  • Test harness reference: third_party/validator/resources/examples/test-harness/validator-tester.py
  • Existing OCaml tests: test/test_html5lib.ml, test/test_tokenizer.ml