OCaml HTML5 parser/serialiser based on Python's JustHTML

mention checker

Changed files
+71 -3
README.md
@@ -1,19 +1,55 @@
-# html5rw - Pure OCaml HTML5 Parser
+# html5rw - Pure OCaml HTML5 Parser and Conformance Checker
 
-A pure OCaml HTML5 parser implementing the WHATWG HTML5 parsing specification.
+A pure OCaml HTML5 parser and validator implementing the WHATWG HTML5 specification.
 This library passes the html5lib-tests suite and provides full support for
-tokenization, tree construction, encoding detection, and CSS selector queries.
+tokenization, tree construction, encoding detection, CSS selector queries, and
+conformance checking.
 This library was ported from [JustHTML](https://github.com/EmilStenstrom/justhtml/).
 
 ## Key Features
 
 - **WHATWG Compliant**: Implements the full HTML5 parsing algorithm with proper error recovery
+- **Conformance Checker**: Validates HTML5 documents against the WHATWG specification
 - **CSS Selectors**: Query the DOM using standard CSS selector syntax
 - **Streaming I/O**: Uses bytesrw for efficient streaming input/output
 - **Encoding Detection**: Automatic character encoding detection following the WHATWG algorithm
 - **Entity Decoding**: Complete HTML5 named character reference support
+- **Multiple Output Formats**: Text, JSON (Nu validator compatible), and GNU-style output
+
+## Libraries
+
+- `html5rw` - Core HTML5 parser
+- `html5rw.check` - Conformance checker library
+
+## Command Line Tool
+
+The `html5check` CLI validates HTML5 documents:
+
+```bash
+# Validate a file
+html5check index.html
+
+# Validate from stdin
+cat page.html | html5check -
+
+# JSON output (Nu validator compatible)
+html5check --format=json page.html
+
+# GNU-style output for IDE integration
+html5check --format=gnu page.html
+
+# Show only errors (suppress warnings)
+html5check --errors-only page.html
+
+# Quiet mode - show only counts
+html5check --quiet page.html
+```
+
+Exit codes: 0 = valid, 1 = validation errors, 2 = I/O error.
 
 ## Usage
+
+### Parsing HTML
 
 ```ocaml
 open Bytesrw
@@ -41,6 +77,38 @@
 let reader = Bytes.Reader.of_string "<p>Fragment content</p>"
 let doc = Html5rw.parse ~fragment_context:ctx reader
 ```
+
+### Validating HTML
+
+```ocaml
+open Bytesrw
+
+(* Check HTML from a string *)
+let html = "<html><body><p>Hello</p></body></html>"
+let reader = Bytes.Reader.of_string html
+let result = Htmlrw_check.check reader
+
+(* Check for errors *)
+if Htmlrw_check.has_errors result then
+  print_endline "Document has errors";
+
+(* Get all messages *)
+let messages = Htmlrw_check.messages result in
+List.iter (fun msg ->
+  Format.printf "%a@." Htmlrw_check.pp_message msg
+) messages;
+
+(* Get formatted output *)
+let text_output = Htmlrw_check.to_text result in
+let json_output = Htmlrw_check.to_json result in
+let gnu_output = Htmlrw_check.to_gnu result
+```
+
+The checker validates:
+- Parse errors (malformed HTML syntax)
+- Content model violations (invalid element nesting)
+- Attribute errors (invalid or missing required attributes)
+- Structural issues (other conformance problems)
 
 ## Installation
 
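
The exit codes documented in the new README text (0 = valid, 1 = validation errors, 2 = I/O error) lend themselves to scripting. The following is a minimal sketch of a wrapper and is not part of this change: `describe_status` and `check_file` are hypothetical helper names, and the only things taken from the README are the `html5check` command, its `--quiet` flag, and the three exit codes.

```bash
# Translate an html5check exit status into a short label.
# The three meanings come from the README; anything else is
# reported as unexpected rather than guessed at.
describe_status() {
  case "$1" in
    0) echo "valid" ;;
    1) echo "validation errors" ;;
    2) echo "I/O error" ;;
    *) echo "unexpected status $1" ;;
  esac
}

# Run the checker on one file (hypothetical helper) and print a
# one-line summary. The checker's exit status is passed through
# so callers can still branch on it.
check_file() {
  if html5check --quiet "$1"; then
    code=0
  else
    code=$?
  fi
  echo "$1: $(describe_status "$code")"
  return "$code"
}
```

For example, `check_file index.html || exit 1` would stop a build script as soon as a page fails validation or cannot be read.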