OCaml HTML5 parser/serialiser based on Python's JustHTML
# html5rw - Pure OCaml HTML5 Parser

A pure OCaml HTML5 parser implementing the WHATWG HTML5 parsing specification.
This library passes the html5lib-tests suite and provides full support for
tokenization, tree construction, encoding detection, and CSS selector queries.
This library was ported from [JustHTML](https://github.com/EmilStenstrom/justhtml/).

## Key Features

- **WHATWG Compliant**: Implements the full HTML5 parsing algorithm with proper error recovery
- **CSS Selectors**: Query the DOM using standard CSS selector syntax
- **Streaming I/O**: Uses bytesrw for efficient streaming input/output
- **Encoding Detection**: Automatic character encoding detection following the WHATWG algorithm
- **Entity Decoding**: Complete HTML5 named character reference support

## Usage

```ocaml
open Bytesrw

(* Parse HTML from a string *)
let html = "<html><body><p>Hello, world!</p></body></html>"
let reader = Bytes.Reader.of_string html
let doc = Html5rw.parse reader

(* Query with CSS selectors *)
let paragraphs = Html5rw.query doc "p"

(* Extract text content *)
let text = Html5rw.to_text doc

(* Serialize back to HTML *)
let output = Html5rw.to_string doc
```

For fragment parsing (innerHTML):

```ocaml
(* Parse as innerHTML of a <div> *)
let ctx = Html5rw.make_fragment_context ~tag_name:"div" ()
let reader = Bytes.Reader.of_string "<p>Fragment content</p>"
let doc = Html5rw.parse ~fragment_context:ctx reader
```

## Installation

```
opam install html5rw
```

## Documentation

API documentation is available via:

```
opam install html5rw
odig doc html5rw
```

## License

MIT