OCaml HTML5 parser/serialiser based on Python's JustHTML
# html5rw - Pure OCaml HTML5 Parser and Conformance Checker

A pure OCaml HTML5 parser and validator implementing the WHATWG HTML5 specification.
This library passes the html5lib-tests suite and provides full support for
tokenization, tree construction, encoding detection, CSS selector queries, and
conformance checking.
This library was ported from [JustHTML](https://github.com/EmilStenstrom/justhtml/).

## Key Features

- **WHATWG Compliant**: Implements the full HTML5 parsing algorithm with proper error recovery
- **Conformance Checker**: Validates HTML5 documents against the WHATWG specification
- **CSS Selectors**: Query the DOM using standard CSS selector syntax
- **Streaming I/O**: Uses bytesrw for efficient streaming input/output
- **Encoding Detection**: Automatic character encoding detection following the WHATWG algorithm
- **Entity Decoding**: Complete HTML5 named character reference support
- **Multiple Output Formats**: Text, JSON (Nu validator compatible), and GNU-style output

## Libraries

- `html5rw` - Core HTML5 parser
- `html5rw.check` - Conformance checker library

## Command Line Tool

The `html5check` CLI validates HTML5 documents:

```bash
# Validate a file
html5check index.html

# Validate from stdin
cat page.html | html5check -

# JSON output (Nu validator compatible)
html5check --format=json page.html

# GNU-style output for IDE integration
html5check --format=gnu page.html

# Show only errors (suppress warnings)
html5check --errors-only page.html

# Quiet mode - show only counts
html5check --quiet page.html
```

Exit codes: 0 = valid, 1 = validation errors, 2 = I/O error.
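
The exit codes make it straightforward to use `html5check` in scripts or CI. The following is a minimal sketch, not part of the tool itself: `describe_exit` is a hypothetical helper that maps the three documented exit codes to a short label and treats anything else as unexpected.

```bash
# Hypothetical helper (not shipped with html5check): translate an
# exit code into a short human-readable label.
describe_exit() {
  case "$1" in
    0) echo "valid" ;;
    1) echo "validation errors" ;;
    2) echo "I/O error" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}
```

A script might call it right after a quiet validation run, for example `html5check --quiet page.html; describe_exit $?`, and fail the build on any non-zero code.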

## Usage

### Parsing HTML

```ocaml
open Bytesrw

(* Parse HTML from a string *)
let html = "<html><body><p>Hello, world!</p></body></html>"
let reader = Bytes.Reader.of_string html
let doc = Html5rw.parse reader

(* Query with CSS selectors *)
let paragraphs = Html5rw.query doc "p"

(* Extract text content *)
let text = Html5rw.to_text doc

(* Serialize back to HTML *)
let output = Html5rw.to_string doc
```

For fragment parsing (innerHTML):

```ocaml
(* Parse as innerHTML of a <div> *)
let ctx = Html5rw.make_fragment_context ~tag_name:"div" ()
let reader = Bytes.Reader.of_string "<p>Fragment content</p>"
let doc = Html5rw.parse ~fragment_context:ctx reader
```

### Validating HTML

```ocaml
open Bytesrw

(* Check HTML from a string *)
let html = "<html><body><p>Hello</p></body></html>"
let reader = Bytes.Reader.of_string html
let result = Htmlrw_check.check reader

let () =
  (* Check for errors *)
  if Htmlrw_check.has_errors result then
    print_endline "Document has errors";

  (* Get all messages and print each one *)
  let messages = Htmlrw_check.messages result in
  List.iter (fun msg ->
    Format.printf "%a@." Htmlrw_check.pp_message msg
  ) messages

(* Get formatted output *)
let text_output = Htmlrw_check.to_text result
let json_output = Htmlrw_check.to_json result
let gnu_output = Htmlrw_check.to_gnu result
```

The checker validates:
- Parse errors (malformed HTML syntax)
- Content model violations (invalid element nesting)
- Attribute errors (invalid or missing required attributes)
- Structural issues (other conformance problems)

## Installation

```
opam install html5rw
```

## Documentation

API documentation is available via:

```
opam install html5rw
odig doc html5rw
```

## License

MIT