commits
Implements 9 specialized fuzzers to exercise different aspects of the HTML parser:
- fuzz_properties.ml: Property-based testing (8 invariants including parse
stability, output bounds, idempotence)
- fuzz_structure.ml: Structure-aware HTML generation with mutation strategies
- fuzz_exhaustion.ml: Resource exhaustion tests for DoS vectors (deep nesting,
wide trees, huge text, many attributes, etc.)
- fuzz_error_recovery.ml: Error recovery tests (12 categories including
unclosed tags, misnested elements, invalid attributes)
- fuzz_serializer.ml: Serializer-specific tests (attributes, void elements,
raw text, whitespace, entities, foreign content)
- fuzz_streaming.ml: Parsing determinism and roundtrip stability tests
- fuzz_encoding.ml: UTF-8 handling, BOM, surrogates, control characters
- fuzz_fragment.ml: Fragment parsing with various context elements
- fuzz_security.ml: mXSS stability and XSS vector handling
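
The idempotence and roundtrip-stability invariants listed above can be sketched generically. This is a minimal illustration, not the code in fuzz_properties.ml: `roundtrip_stable`, `toy_parse`, and `toy_serialize` are hypothetical names, and the parser and serializer are passed in as parameters so that no html5rw API is assumed.

```ocaml
(* Generic roundtrip-stability check: serializing a parsed document and then
   re-parsing and re-serializing it must give the same string.
   [parse] and [serialize] are parameters (assumption: any parser/serializer
   pair with these shapes can be plugged in). *)
let roundtrip_stable ~parse ~serialize (input : string) : bool =
  let s1 = serialize (parse input) in
  let s2 = serialize (parse s1) in
  String.equal s1 s2

(* Toy stand-ins, purely for illustration: "parsing" splits on spaces and
   drops empty words, "serializing" joins the words with single spaces. *)
let toy_parse s = List.filter (fun w -> w <> "") (String.split_on_char ' ' s)
let toy_serialize words = String.concat " " words

(* A deliberately unstable serializer: it appends a character that the toy
   parser folds back into the last word, so output grows on every roundtrip. *)
let bad_serialize words = String.concat " " words ^ "!"
```

A property-based fuzzer would call `roundtrip_stable` on generated inputs and fail on the first input for which it returns false.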
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add test_crash and test_pre executables to dune build
- Show namespace in DOM dump for debugging SVG/MathML issues
- Print full s2/s3 serialization output for easier diff analysis
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit adds multiple fixes to ensure stable roundtrips (parse-serialize-
parse-serialize = stable output) for malformed HTML inputs:
1. Nested formatting element handling:
   - Track open formatting elements (a, b, i, em, strong, etc.) during
     serialization
   - When encountering a nested same-type formatting element, skip the inner
     wrapper to produce flatter HTML that parses consistently
2. Empty table handling:
   - Detect tables with no real content (only comments/text)
   - Skip empty table wrappers since content would be foster-parented anyway
   - Add implicit tbody wrappers where needed for table structure
3. Structural element handling:
   - Skip nested body/head/html elements that cause parsing instability
   - Output their children directly without the invalid wrapper
4. Improved context tracking:
   - Track foreign content depth for proper SVG/MathML handling
   - Pass serialization context through recursive calls
These fixes improve AFL crash test pass rate from 49/104 (47%) to 104/104 (100%)
while maintaining 100% pass rate on all official html5lib tests.
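
The nested formatting element rule (item 1) might look roughly like the sketch below. It uses a toy DOM type rather than the real html5rw node type; `node`, `serialize_with`, and the short `formatting_elements` list are assumptions made for illustration, and text escaping is left out to keep the example small.

```ocaml
(* Toy DOM, not the library's real node type. *)
type node =
  | Text of string
  | Element of string * node list

(* Partial list taken from the names mentioned in the commit message. *)
let formatting_elements = [ "a"; "b"; "i"; "em"; "strong" ]
let is_formatting name = List.mem name formatting_elements

(* [open_fmt] holds the names of formatting elements that are currently open.
   A formatting element whose name is already open is not emitted; only its
   children are, which yields flatter output. No escaping is done here. *)
let rec serialize_with open_fmt node =
  match node with
  | Text s -> s
  | Element (name, children) ->
    let children_with fmt =
      String.concat "" (List.map (serialize_with fmt) children)
    in
    if is_formatting name && List.mem name open_fmt then
      children_with open_fmt
    else
      let open_fmt =
        if is_formatting name then name :: open_fmt else open_fmt
      in
      Printf.sprintf "<%s>%s</%s>" name (children_with open_fmt) name

let serialize node = serialize_with [] node
```

With this sketch, a `b` nested inside another `b` (directly or through an `i`) is dropped while its children are kept, so `<b>x<b>y</b></b>` in the DOM serializes as `<b>xy</b>`.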
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Per WHATWG spec section 13.2.5.8 (Tag name state), when '<' is encountered
during tag name parsing, it should be appended to the current tag token's
tag name as part of "anything else" handling - not emit the current tag
and switch to tag open state.
This fixes 3 tree-construction test failures:
- <div<div> now correctly parses as element named "div<div"
- <p>Test</p<p>Test2</p> now correctly handles </p<p> as invalid end tag
- <option><XH<optgroup> now correctly parses XH<optgroup as element name
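
A simplified sketch of the tag name state behaviour described above, assuming a byte-oriented tokenizer. `read_tag_name` is a hypothetical helper, not the tokenizer's actual function; NULL and EOF handling from the spec are omitted.

```ocaml
(* Read a tag name starting at [start] (the first character after '<').
   Whitespace, '/' and '>' end the name; every other character, including
   '<', is appended (lowercased if it is an ASCII uppercase letter).
   Returns the name and the index where reading stopped. *)
let read_tag_name (input : string) (start : int) : string * int =
  let buf = Buffer.create 16 in
  let rec loop i =
    if i >= String.length input then i
    else
      match input.[i] with
      | '\t' | '\n' | '\x0C' | ' ' | '/' | '>' -> i
      | c ->
        (* "Anything else": append to the current tag name. *)
        Buffer.add_char buf (Char.lowercase_ascii c);
        loop (i + 1)
  in
  let stop = loop start in
  (Buffer.contents buf, stop)
```

For the input `<div<div>`, reading from index 1 yields the name `div<div` and stops at the closing `>`.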
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When serializing elements inside SVG or MathML foreign content, HTML breakout
elements (per WHATWG spec section 13.2.6.5) like div, span, table, etc. would
cause the parser to exit foreign content on reparse. This creates roundtrip
instability.
To fix this, we now:
- Track foreign content context (SVG/MathML) during serialization
- Detect HTML integration points (foreignObject, desc, title in SVG)
- Prefix breakout elements with 'x-' to make them custom elements when in
foreign content, ensuring stable roundtrips
This improves the AFL crash test pass rate from 86/104 (18 failing) to
90/104 (14 failing).
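
The 'x-' prefixing could be sketched as follows. The names `context`, `child_context`, and `output_name` are hypothetical, the breakout list is deliberately partial, and the sketch does not distinguish SVG from MathML when checking integration points.

```ocaml
type context = Html_content | Foreign_content

(* Partial list of breakout element names, for illustration only; the spec
   section cited in the commit message has the full list. *)
let breakout_elements = [ "div"; "span"; "table"; "p"; "ul"; "b"; "i" ]

(* SVG elements whose children are parsed as HTML again. *)
let is_svg_integration_point name =
  List.mem name [ "foreignObject"; "desc"; "title" ]

(* Context that applies to the children of element [name]. *)
let child_context context name =
  if name = "svg" || name = "math" then Foreign_content
  else if context = Foreign_content && is_svg_integration_point name then
    Html_content
  else context

(* Name to emit for [name] when it is serialized inside [context]. *)
let output_name ~context name =
  if context = Foreign_content && List.mem name breakout_elements then
    "x-" ^ name
  else name
```

A `div` directly under `svg` would be written as `x-div`, while a `div` under `foreignObject` keeps its name because the integration point switches the context back to HTML.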
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add attribute name validation: attributes whose names contain control
  chars, whitespace, quotes, angle brackets, slash, or equals are skipped
  during serialization instead of being output as malformed HTML
- Add strict element name sanitization (ASCII-only, 0x21-0x7E excluding
  special HTML chars) to ensure consistent reparsing: invalid bytes are
  removed and the name defaults to "span" if nothing remains
- Add (allow_empty) to html5rw-js package in dune-project since lib/js
was removed
- Add test_crash.ml for analyzing fuzz crash files with roundtrip debug
- Add test_pre.ml for testing pre/textarea newline handling
Fixes roundtrip instability found by AFL fuzzing. After these fixes,
86/104 crash corpus files pass roundtrip tests. The remaining 18 are
edge cases involving complex svg+table interactions in severely
malformed input where HTML5 error recovery produces non-deterministic
DOM structures.
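
A rough sketch of the name checks described in this commit. The function names are hypothetical, and the exact set of "special HTML chars" excluded from element names is an assumption (the same characters that invalidate attribute names).

```ocaml
(* Characters treated as special in names (assumed set). *)
let is_special c =
  match c with
  | '"' | '\'' | '<' | '>' | '/' | '=' -> true
  | _ -> false

(* Attribute names: reject control chars, whitespace and special chars.
   Bytes >= 0x80 are allowed here, since the commit does not exclude them. *)
let is_valid_attr_char c =
  let code = Char.code c in
  code > 0x20 && code <> 0x7F && not (is_special c)

let valid_attr_name name =
  name <> ""
  &&
  try
    String.iter (fun c -> if not (is_valid_attr_char c) then raise Exit) name;
    true
  with Exit -> false

(* Element names: printable ASCII only (0x21-0x7E), minus special chars. *)
let is_valid_element_char c =
  let code = Char.code c in
  code >= 0x21 && code <= 0x7E && not (is_special c)

(* Drop invalid bytes; fall back to "span" if nothing is left. *)
let sanitize_element_name name =
  let buf = Buffer.create (String.length name) in
  String.iter
    (fun c -> if is_valid_element_char c then Buffer.add_char buf c)
    name;
  if Buffer.length buf = 0 then "span" else Buffer.contents buf
```

A serializer would skip any attribute for which `valid_attr_name` is false and emit `sanitize_element_name name` in place of the raw element name.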
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Serializer fixes:
- Add leading newline preservation for pre, textarea, and listing elements.
Per HTML5 spec, the parser strips a single leading newline after these
element tags, so the serializer must emit an extra newline to preserve
content that starts with a newline.
- Escape < in attribute values to prevent tag injection during reparsing
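
The two serializer fixes can be illustrated with a small sketch. `open_tag_with_newline_fix` and `escape_attr_value` are hypothetical helpers; escaping `&` and `"` alongside `<` is ordinary attribute escaping and is assumed here rather than taken from the commit.

```ocaml
(* Elements after whose start tag the parser drops one leading newline. *)
let strips_leading_newline name =
  List.mem name [ "pre"; "textarea"; "listing" ]

(* Emit the start tag (attributes omitted), plus one extra newline when the
   content itself begins with a newline, so the parser's stripping of a
   single leading newline leaves the original content intact. *)
let open_tag_with_newline_fix name (content : string) =
  let needs_extra =
    strips_leading_newline name
    && String.length content > 0
    && content.[0] = '\n'
  in
  Printf.sprintf "<%s>%s" name (if needs_extra then "\n" else "")

(* Escape an attribute value for output inside double quotes. *)
let escape_attr_value v =
  let buf = Buffer.create (String.length v) in
  String.iter
    (fun c ->
      match c with
      | '&' -> Buffer.add_string buf "&amp;"
      | '"' -> Buffer.add_string buf "&quot;"
      | '<' -> Buffer.add_string buf "&lt;"
      | c -> Buffer.add_char buf c)
    v;
  Buffer.contents buf
```

For a `pre` whose text is a newline followed by `foo`, the output starts with the start tag and two newlines in total (the extra one plus the content's own), and the reparse strips exactly one of them.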
Parser fix:
- Reset ignore_lf flag when any element is inserted, not just on character
tokens. Per HTML5 spec, only the immediately next token after
pre/textarea/listing should be checked for leading LF, but we were
persisting the flag across element insertions.
Tokenizer fix:
- Handle < in tag names per HTML5 spec: emit parse error and reconsume
in tag open state. Previously < was incorrectly added to tag names.
These fixes improve roundtrip stability for edge cases with:
- Pre-formatted elements with nested elements and leading newlines
- Attribute values containing < characters
- Malformed HTML with < inside tag names
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add fuzz_afl.ml for direct AFL++ fuzzing with afl-persistent
- Add dune-workspace with AFL instrumentation context
- Update fuzz/dune with afl-persistent dependency
- Document fuzzing workflow and discoveries in OCAML-FUZZING.md
The AFL fuzzer tests roundtrip stability, clone consistency,
selector crash resistance, and text extraction. Run with:
    dune build -x afl ./fuzz/fuzz_afl.exe
    afl-fuzz -i fuzz/input_corpus -o fuzz/output -- \
      _build/afl/fuzz/fuzz_afl.exe @@
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Bug fixes found via property-based fuzzing with Crowbar:
1. Raw text element escaping: Text content inside <script>, <style>,
   <iframe>, <xmp>, <noembed>, <noframes>, and <noscript> was being
   HTML-escaped during serialization, causing double-escaping on each
   roundtrip. Fixed by detecting raw text elements and outputting
   their content without escaping.
2. Escapable raw text elements: <textarea> and <title> content should
   only have & escaped, not < or >. Added separate handling for these.
3. Plaintext element serialization: <plaintext> content accumulated
   closing tags on each roundtrip because any content after plaintext
   gets absorbed into its content on reparse. Fixed by:
   - Not outputting closing tag for plaintext
   - Propagating "plaintext encountered" flag through serialization
   - Stopping serialization of closing tags for ancestors once
     plaintext is found
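
The three escaping behaviours described above can be sketched as a per-element escape mode. `escape_mode`, `escape_mode_for`, `escape_text`, and `close_tag` are hypothetical names for illustration, not the serializer's actual API.

```ocaml
type escape_mode = No_escape | Amp_only | Full

(* Choose how text children of element [name] are escaped. *)
let escape_mode_for name =
  match name with
  | "script" | "style" | "iframe" | "xmp" | "noembed" | "noframes"
  | "noscript" -> No_escape          (* raw text: output as-is *)
  | "textarea" | "title" -> Amp_only (* escapable raw text: only & *)
  | _ -> Full

let escape_text mode s =
  match mode with
  | No_escape -> s
  | Amp_only | Full ->
    let buf = Buffer.create (String.length s) in
    String.iter
      (fun c ->
        match c with
        | '&' -> Buffer.add_string buf "&amp;"
        | '<' when mode = Full -> Buffer.add_string buf "&lt;"
        | '>' when mode = Full -> Buffer.add_string buf "&gt;"
        | c -> Buffer.add_char buf c)
      s;
    Buffer.contents buf

(* Closing tags are suppressed for plaintext itself and for every element
   closed after a plaintext element has been serialized. *)
let close_tag ~plaintext_seen name =
  if plaintext_seen || name = "plaintext" then "" else "</" ^ name ^ ">"
```

Because raw text content is emitted unchanged, reparsing and reserializing a `script` body no longer turns `&` into `&amp;` and then `&amp;amp;` on successive roundtrips.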
Also adds fuzz/ directory with comprehensive Crowbar-based property
tests covering crash resistance, roundtrip stability, selector
parsing, DOM manipulation, and more.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add JS/WASM compilation rules for running html5lib conformance tests
in the browser:
- lib/js/htmlrw_js_tests.ml: Browser-compatible test runner that can
run tree-construction and encoding detection tests
- lib/js/dune: Updated with test runner executable for js and wasm modes
- test-regression.html: Interactive test page that loads test data files
and runs the full regression suite with progress and filtering
The test runner exposes a JavaScript API (html5rwTests) that can:
- Run individual test files
- Run all tests from a file list
- Quick parse test for simple validation
Build with: opam exec -- dune build lib/js/htmlrw-tests.js
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>