A pure OCaml HTML5 parser and validator implementing the WHATWG HTML5 specification. It passes the html5lib-tests suite and supports tokenization, tree construction, encoding detection, CSS selector queries, and conformance checking. The library is a port of JustHTML.

Key Features#

WHATWG Compliant: Implements the full HTML5 parsing algorithm with proper error recovery
Conformance Checker: Validates HTML5 documents against the WHATWG specification
CSS Selectors: Query the DOM using standard CSS selector syntax
Streaming I/O: Uses bytesrw for efficient streaming input/output
Encoding Detection: Automatic character encoding detection following the WHATWG algorithm
Entity Decoding: Complete HTML5 named character reference support
Multiple Output Formats: Text, JSON (Nu validator compatible), and GNU-style output

Libraries#

html5rw - Core HTML5 parser
html5rw.check - Conformance checker library

Command Line Tool#

The html5check CLI validates HTML5 documents:

# Validate a file
html5check index.html

# Validate from stdin
cat page.html | html5check -

# JSON output (Nu validator compatible)
html5check --format=json page.html

# GNU-style output for IDE integration
html5check --format=gnu page.html

# Show only errors (suppress warnings)
html5check --errors-only page.html

# Quiet mode - show only counts
html5check --quiet page.html

Exit codes: 0 = valid, 1 = validation errors, 2 = I/O error.
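
The exit codes above make html5check straightforward to gate on in scripts. The following is a minimal sketch, assuming a POSIX shell; the helper name `check_page` and the status lines it prints are illustrative and not part of the tool.

```shell
#!/bin/sh
# check_page is a hypothetical helper, not part of html5check.
# It runs the validator in quiet mode and maps the documented
# exit codes (0 = valid, 1 = validation errors, 2 = I/O error)
# to a one-line status, then returns the same code to the caller.
check_page() {
  file=$1
  html5check --quiet "$file"
  code=$?
  case $code in
    0) echo "$file: valid" ;;
    1) echo "$file: validation errors" ;;
    2) echo "$file: I/O error" ;;
    *) echo "$file: unexpected exit code $code" ;;
  esac
  return "$code"
}
```

In a CI step this could be used as `check_page index.html || exit 1`, so that both validation errors and I/O errors fail the build.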

Usage#

Parsing HTML#

open Bytesrw

(* Parse HTML from a string *)
let html = "<html><body><p>Hello, world!</p></body></html>"
let reader = Bytes.Reader.of_string html
let doc = Html5rw.parse reader

(* Query with CSS selectors *)
let paragraphs = Html5rw.query doc "p"

(* Extract text content *)
let text = Html5rw.to_text doc

(* Serialize back to HTML *)
let output = Html5rw.to_string doc

For fragment parsing (innerHTML):

(* Parse as innerHTML of a <div> *)
let ctx = Html5rw.make_fragment_context ~tag_name:"div" ()
let reader = Bytes.Reader.of_string "<p>Fragment content</p>"
let doc = Html5rw.parse ~fragment_context:ctx reader

Validating HTML#

open Bytesrw

(* Check HTML from a string *)
let html = "<html><body><p>Hello</p></body></html>"
let reader = Bytes.Reader.of_string html
let result = Htmlrw_check.check reader

(* Check for errors *)
let () =
  if Htmlrw_check.has_errors result then
    print_endline "Document has errors"

(* Print all messages *)
let () =
  List.iter (fun msg ->
    Format.printf "%a@." Htmlrw_check.pp_message msg
  ) (Htmlrw_check.messages result)

(* Get formatted output *)
let text_output = Htmlrw_check.to_text result
let json_output = Htmlrw_check.to_json result
let gnu_output = Htmlrw_check.to_gnu result

The checker reports:

Parse errors (malformed HTML syntax)
Content model violations (invalid element nesting)
Attribute errors (invalid or missing required attributes)
Structural issues (other conformance problems)

Installation#

opam install html5rw

Documentation#

API documentation is available via:

opam install html5rw
odig doc html5rw

License#

MIT

html5rw - Pure OCaml HTML5 Parser and Conformance Checker#