OCaml HTML5 parser/serialiser based on Python's JustHTML

html5rw - Pure OCaml HTML5 Parser

html5rw is a pure OCaml HTML5 parser that implements the WHATWG HTML5 parsing specification. It passes the html5lib-tests suite and supports tokenization, tree construction, encoding detection, and CSS selector queries. The library is a port of JustHTML, a Python HTML5 parser.

Key Features

  • WHATWG Compliant: Implements the full HTML5 parsing algorithm with proper error recovery
  • CSS Selectors: Query the DOM using standard CSS selector syntax
  • Streaming I/O: Uses bytesrw for efficient streaming input/output
  • Encoding Detection: Automatic character encoding detection following the WHATWG algorithm
  • Entity Decoding: Complete HTML5 named character reference support
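
As a rough illustration of the encoding detection feature, the WHATWG algorithm first checks the input for a byte order mark (BOM) before falling back to later steps such as the `<meta>` prescan. The sketch below shows only that first step, in plain OCaml with no dependencies. It is not taken from html5rw's source, and the `bom` type and `sniff_bom` function are hypothetical names chosen for this example.

```ocaml
(* Illustrative only: BOM sniffing, the first step of WHATWG encoding
   detection. Hypothetical names; not html5rw's implementation. *)
type bom = Utf8 | Utf16be | Utf16le

(* [sniff_bom s] returns the encoding signalled by a leading byte order
   mark together with the mark's length in bytes, or [None] if [s] does
   not start with one. *)
let sniff_bom (s : string) : (bom * int) option =
  let byte i = Char.code s.[i] in
  let n = String.length s in
  if n >= 3 && byte 0 = 0xEF && byte 1 = 0xBB && byte 2 = 0xBF then
    Some (Utf8, 3)
  else if n >= 2 && byte 0 = 0xFE && byte 1 = 0xFF then Some (Utf16be, 2)
  else if n >= 2 && byte 0 = 0xFF && byte 1 = 0xFE then Some (Utf16le, 2)
  else None

let () =
  match sniff_bom "\xEF\xBB\xBF<p>hi</p>" with
  | Some (Utf8, len) -> Printf.printf "UTF-8 BOM, skip %d bytes\n" len
  | Some (_, len) -> Printf.printf "UTF-16 BOM, skip %d bytes\n" len
  | None -> print_endline "no BOM; fall back to later detection steps"
```

A caller would typically skip the reported number of bytes and decode the rest of the input with the detected encoding.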

Usage

open Bytesrw

(* Parse HTML from a string *)
let html = "<html><body><p>Hello, world!</p></body></html>"
let reader = Bytes.Reader.of_string html
let doc = Html5rw.parse reader

(* Query with CSS selectors *)
let paragraphs = Html5rw.query doc "p"

(* Extract text content *)
let text = Html5rw.to_text doc

(* Serialize back to HTML *)
let output = Html5rw.to_string doc

For fragment parsing (innerHTML):

(* Parse as innerHTML of a <div> *)
let ctx = Html5rw.make_fragment_context ~tag_name:"div" ()
let reader = Bytes.Reader.of_string "<p>Fragment content</p>"
let doc = Html5rw.parse ~fragment_context:ctx reader
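
The calls shown above can be combined into a small standalone program. The following is a minimal sketch that uses only the functions already listed in this README. `parse_string` is a hypothetical convenience wrapper, not part of Html5rw, and the code assumes that `Html5rw.query` returns a list of matches and that `Html5rw.to_text` returns a string. It needs to be built against the html5rw and bytesrw libraries.

```ocaml
open Bytesrw

(* Hypothetical helper (not part of Html5rw): parse a document held
   entirely in memory by wrapping the string in a bytesrw reader. *)
let parse_string (html : string) =
  Html5rw.parse (Bytes.Reader.of_string html)

let () =
  let doc = parse_string "<ul><li>one</li><li>two</li></ul>" in
  (* Assumption: the query result is a list with one element per match. *)
  let items = Html5rw.query doc "li" in
  Printf.printf "matched %d <li> elements\n" (List.length items);
  (* Assumption: to_text yields the document's text content as a string. *)
  print_endline (Html5rw.to_text doc)
```

Keeping the reader construction in one helper means the rest of a program can work with plain strings, while code that streams from a file or socket can still call `Html5rw.parse` on its own reader.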

Installation

opam install html5rw

Documentation

API documentation can be generated and viewed locally with odig once the package is installed:

opam install html5rw
odig doc html5rw

License

MIT