# html5rw - Pure OCaml HTML5 Parser and Conformance Checker

A pure OCaml HTML5 parser and validator implementing the WHATWG HTML5
specification. This library passes the html5lib-tests suite and provides
full support for tokenization, tree construction, encoding detection,
CSS selector queries, and conformance checking.

This library was ported from [JustHTML](https://github.com/EmilStenstrom/justhtml/).

## Key Features

- **WHATWG Compliant**: Implements the full HTML5 parsing algorithm with proper error recovery
- **Conformance Checker**: Validates HTML5 documents against the WHATWG specification
- **CSS Selectors**: Query the DOM using standard CSS selector syntax
- **Streaming I/O**: Uses bytesrw for efficient streaming input/output
- **Encoding Detection**: Automatic character encoding detection following the WHATWG algorithm
- **Entity Decoding**: Complete HTML5 named character reference support
- **Multiple Output Formats**: Text, JSON (Nu validator compatible), and GNU-style output

## Libraries

- `html5rw` - Core HTML5 parser
- `html5rw.check` - Conformance checker library

## Command Line Tool

The `html5check` CLI validates HTML5 documents:

```bash
# Validate a file
html5check index.html

# Validate from stdin
cat page.html | html5check -

# JSON output (Nu validator compatible)
html5check --format=json page.html

# GNU-style output for IDE integration
html5check --format=gnu page.html

# Show only errors (suppress warnings)
html5check --errors-only page.html

# Quiet mode - show only counts
html5check --quiet page.html
```

Exit codes: 0 = valid, 1 = validation errors, 2 = I/O error.

## Usage

### Parsing HTML

```ocaml
open Bytesrw

(* Parse HTML from a string *)
let html = "<p>Hello, world!</p>"
let reader = Bytes.Reader.of_string html
let doc = Html5rw.parse reader

(* Query with CSS selectors *)
let paragraphs = Html5rw.query doc "p"

(* Extract text content *)
let text = Html5rw.to_text doc

(* Serialize back to HTML *)
let output = Html5rw.to_string doc
```

For fragment parsing (innerHTML):

```ocaml
(* Parse as innerHTML of a <div> *)
let ctx = Html5rw.make_fragment_context ~tag_name:"div" ()
let reader = Bytes.Reader.of_string "<p>Fragment content</p>"
let doc = Html5rw.parse ~fragment_context:ctx reader
```

### Validating HTML

```ocaml
open Bytesrw

(* Check HTML from a string *)
let html = "<p>Hello</p>"
let reader = Bytes.Reader.of_string html
let result = Htmlrw_check.check reader

(* Check for errors *)
let () =
  if Htmlrw_check.has_errors result then
    print_endline "Document has errors"

(* Get all messages *)
let messages = Htmlrw_check.messages result
let () =
  List.iter (fun msg ->
    Format.printf "%a@." Htmlrw_check.pp_message msg
  ) messages

(* Get formatted output *)
let text_output = Htmlrw_check.to_text result
let json_output = Htmlrw_check.to_json result
let gnu_output = Htmlrw_check.to_gnu result
```

The checker validates:

- Parse errors (malformed HTML syntax)
- Content model violations (invalid element nesting)
- Attribute errors (invalid or missing required attributes)
- Structural issues (other conformance problems)

## Installation

```
opam install html5rw
```

## Documentation

API documentation is available via:

```
opam install html5rw
odig doc html5rw
```

## License

MIT
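## Example: Scripting with Exit Codes

The exit codes listed under Command Line Tool lend themselves to scripting. The following is a minimal sketch, assuming `html5check` is on `PATH` and returns the documented codes (0 = valid, 1 = validation errors, 2 = I/O error). The `check_all` helper is hypothetical and not part of html5rw; it prints one status line per file and returns non-zero if any file failed.

```bash
# Hypothetical helper (not shipped with html5rw): validate every file
# passed as an argument and report a status line for each one.
check_all() {
  failed=0
  for file in "$@"; do
    # GNU-style output keeps diagnostics readable in CI logs
    html5check --format=gnu "$file"
    status=$?
    case "$status" in
      0) echo "ok: $file" ;;
      1) echo "invalid: $file"; failed=1 ;;
      2) echo "unreadable: $file"; failed=1 ;;
      *) echo "unexpected exit code $status: $file"; failed=1 ;;
    esac
  done
  return "$failed"
}
```

A build script might call it as `check_all site/*.html || exit 1` to stop when any page fails validation or cannot be read.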