OCaml HTML5 parser/serialiser based on Python's JustHTML

Add fuzzing infrastructure with AFL++ support

- Add fuzz_afl.ml for direct AFL++ fuzzing with afl-persistent
- Add dune-workspace with AFL instrumentation context
- Update fuzz/dune with afl-persistent dependency
- Document fuzzing workflow and discoveries in OCAML-FUZZING.md

The AFL fuzzer tests roundtrip stability, clone consistency,
selector crash resistance, and text extraction. Run with:

dune build -x afl ./fuzz/fuzz_afl.exe
afl-fuzz -i fuzz/input_corpus -o fuzz/output -- \
_build/afl/fuzz/fuzz_afl.exe @@

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

+659
OCAML-FUZZING.md
··· 1 + # OCaml Fuzzing with Crowbar 2 + 3 + This guide covers setting up property-based fuzz testing for OCaml libraries using [Crowbar](https://github.com/stedolan/crowbar). Crowbar supports two modes: 4 + 5 + 1. **QuickCheck mode** - Fast, deterministic testing without AFL (good for CI) 6 + 2. **AFL mode** - Coverage-guided fuzzing with American Fuzzy Lop 7 + 8 + ## Quick Start 9 + 10 + ### 1. Add Dependencies 11 + 12 + Add `crowbar` to your opam dependencies: 13 + 14 + ```bash 15 + opam install crowbar 16 + ``` 17 + 18 + ### 2. Create Fuzz Test Directory 19 + 20 + Create a `fuzz/` directory at the root of your project: 21 + 22 + ``` 23 + project/ 24 + ├── lib/ 25 + ├── fuzz/ 26 + │ ├── dune 27 + │ └── fuzz_mylib.ml 28 + └── dune-project 29 + ``` 30 + 31 + ### 3. Add Dune Configuration 32 + 33 + Create `fuzz/dune`: 34 + 35 + ```lisp 36 + (executable 37 + (name fuzz_mylib) 38 + (libraries mylib crowbar)) 39 + ``` 40 + 41 + For multiple fuzz targets: 42 + 43 + ```lisp 44 + (executable 45 + (name fuzz_parser) 46 + (libraries mylib crowbar)) 47 + 48 + (executable 49 + (name fuzz_encoder) 50 + (libraries mylib crowbar)) 51 + ``` 52 + 53 + ### 4. Write Fuzz Tests 54 + 55 + See [Test Structure](#test-structure) below for patterns and examples. 56 + 57 + ### 5. Run Tests 58 + 59 + **QuickCheck mode** (no AFL required): 60 + ```bash 61 + dune exec ./fuzz/fuzz_mylib.exe 62 + ``` 63 + 64 + **AFL mode** (requires AFL instrumentation): 65 + ```bash 66 + # Build with AFL instrumentation 67 + dune build --context afl 68 + 69 + # Run AFL fuzzer 70 + mkdir -p input_corpus output 71 + echo "seed" > input_corpus/seed 72 + afl-fuzz -i input_corpus -o output -- _build/afl/fuzz/fuzz_mylib.exe @@ 73 + ``` 74 + 75 + ## Test Structure 76 + 77 + ### Basic Test Pattern 78 + 79 + ```ocaml 80 + open Crowbar 81 + 82 + (* Property: operation shouldn't crash on arbitrary input *) 83 + let () = 84 + add_test ~name:"mylib_no_crash" [ bytes ] @@ fun input -> 85 + match MyLib.parse input with 86 + | Ok _ | Error _ -> check true 87 + 88 + (* Property: roundtrip encode/decode *) 89 + let () = 90 + add_test ~name:"mylib_roundtrip" [ int ] @@ fun n -> 91 + let encoded = MyLib.encode n in 92 + match MyLib.decode encoded with 93 + | Error e -> failwith ("roundtrip failed: " ^ e) 94 + | Ok decoded -> check (decoded = n) 95 + ``` 96 + 97 + ### Crowbar Generators 98 + 99 + Crowbar provides generators for creating test inputs: 100 + 101 + ```ocaml 102 + open Crowbar 103 + 104 + (* Primitive generators *) 105 + let _ = bool (* bool *) 106 + let _ = int (* int (small range in quickcheck mode) *) 107 + let _ = int8 (* 0-255 *) 108 + let _ = int32 (* int32 *) 109 + let _ = int64 (* int64 *) 110 + let _ = float (* float *) 111 + let _ = bytes (* arbitrary byte string *) 112 + let _ = bytes_fixed 10 (* exactly 10 bytes *) 113 + 114 + (* Range generator *) 115 + let _ = range 100 (* 0 to 99 *) 116 + 117 + (* Choice from constants *) 118 + let _ = choose [ 119 + const "hello"; 120 + const "world"; 121 + const ""; 122 + ] 123 + 124 + (* Mapping over generators *) 125 + let positive_int = map [int] (fun n -> abs n) 126 + 127 + (* Optional values *) 128 + let maybe_int = option int (* int option *) 129 + 130 + (* Lists *) 131 + let int_list = list int (* int list *) 132 + 133 + (* Pairs and tuples *) 134 + let pair = pair int bytes (* int * bytes *) 135 + ``` 136 + 137 + ### Common Test Patterns 138 + 139 + #### 1. Crash Resistance (Robustness) 140 + 141 + Test that arbitrary input doesn't cause crashes: 142 + 143 + ```ocaml 144 + let () = 145 + add_test ~name:"parser_no_crash" [ bytes ] @@ fun input -> 146 + (* Should never raise an exception *) 147 + let _ = try Parser.parse input with _ -> [] in 148 + check true 149 + ``` 150 + 151 + #### 2. Roundtrip Property 152 + 153 + Test encode/decode symmetry: 154 + 155 + ```ocaml 156 + let () = 157 + add_test ~name:"codec_roundtrip" [ int ] @@ fun value -> 158 + match encode value with 159 + | Error e -> failwith ("encode failed: " ^ e) 160 + | Ok encoded -> 161 + match decode encoded with 162 + | Error e -> failwith ("decode failed: " ^ e) 163 + | Ok decoded -> check (decoded = value) 164 + ``` 165 + 166 + #### 3. Idempotence 167 + 168 + Test that applying operation twice equals applying once: 169 + 170 + ```ocaml 171 + let () = 172 + add_test ~name:"normalize_idempotent" [ bytes ] @@ fun input -> 173 + let once = normalize input in 174 + let twice = normalize once in 175 + check (once = twice) 176 + ``` 177 + 178 + #### 4. Known Edge Cases 179 + 180 + Test specific problematic inputs: 181 + 182 + ```ocaml 183 + let () = 184 + add_test ~name:"empty_input" [ const () ] @@ fun () -> 185 + match parse "" with 186 + | Ok _ | Error _ -> check true 187 + 188 + let () = 189 + add_test ~name:"boundary_values" 190 + [ choose [ 191 + const 0; 192 + const (-1); 193 + const max_int; 194 + const min_int; 195 + ] ] 196 + @@ fun n -> 197 + match process n with 198 + | Ok _ | Error _ -> check true 199 + ``` 200 + 201 + #### 5. Type Mismatch Handling 202 + 203 + Test graceful handling of wrong types: 204 + 205 + ```ocaml 206 + let () = 207 + add_test ~name:"type_mismatch" [ bytes ] @@ fun input -> 208 + (* Parsing arbitrary bytes as int should fail gracefully *) 209 + match decode_int input with 210 + | Ok _ -> check true (* unexpected success is fine *) 211 + | Error _ -> check true (* expected failure *) 212 + ``` 213 + 214 + #### 6. Comparison with Reference Implementation 215 + 216 + ```ocaml 217 + let () = 218 + add_test ~name:"matches_reference" [ int; int ] @@ fun a b -> 219 + let result = MyLib.add a b in 220 + let expected = a + b in 221 + check (result = expected) 222 + ``` 223 + 224 + ### Handling Expected Failures 225 + 226 + When some inputs are expected to fail: 227 + 228 + ```ocaml 229 + (* Large integers may exceed safe precision *) 230 + let max_safe_int = 9007199254740991 231 + 232 + let () = 233 + add_test ~name:"large_int_handling" [ range max_int ] @@ fun n -> 234 + match encode n with 235 + | Error _ -> 236 + (* Large ints beyond safe precision correctly rejected *) 237 + check (n > max_safe_int) 238 + | Ok encoded -> 239 + match decode encoded with 240 + | Error _ -> check (n > max_safe_int) 241 + | Ok decoded -> check (decoded = n) 242 + ``` 243 + 244 + ### Float Comparison 245 + 246 + Use epsilon comparison for floating-point roundtrips: 247 + 248 + ```ocaml 249 + let () = 250 + add_test ~name:"float_roundtrip" 251 + [ choose [ const 0.0; const 3.14; const (-2.5) ] ] 252 + @@ fun f -> 253 + match encode f |> Result.bind decode with 254 + | Error e -> failwith e 255 + | Ok decoded -> 256 + (* Use epsilon comparison for floats *) 257 + check (Float.abs (decoded -. f) < 0.0001) 258 + ``` 259 + 260 + ## AFL Fuzzing Setup 261 + 262 + ### Prerequisites 263 + 264 + - OCaml compiler with AFL support 265 + - AFL or AFL++ installed 266 + - Dune 3.0+ 267 + 268 + ### Dune Workspace Configuration 269 + 270 + Create or update `dune-workspace` at project root: 271 + 272 + ```lisp 273 + (lang dune 3.20) 274 + 275 + (context default) 276 + 277 + (context 278 + (default 279 + (name afl) 280 + (profile afl))) 281 + 282 + (env 283 + (afl 284 + (ocamlopt_flags (:standard -afl-instrument)))) 285 + ``` 286 + 287 + This creates an `afl` build context that instruments binaries for coverage-guided fuzzing. 288 + 289 + ### Building for AFL 290 + 291 + ```bash 292 + dune build --context afl 293 + ``` 294 + 295 + Instrumented binaries are output to `_build/afl/`. 296 + 297 + ### Running AFL 298 + 299 + ```bash 300 + # Create input corpus with seed files 301 + mkdir -p input_corpus 302 + echo '{"key": "value"}' > input_corpus/json1 303 + echo 'key: value' > input_corpus/yaml1 304 + 305 + # Run AFL fuzzer 306 + afl-fuzz -i input_corpus -o findings -- _build/afl/fuzz/fuzz_mylib.exe @@ 307 + ``` 308 + 309 + ### Common AFL Issues 310 + 311 + #### Core Pattern Error 312 + 313 + ``` 314 + AFL: Oops, your system is configured to send core dump notifications to an external utility... 315 + ``` 316 + 317 + **Fix:** 318 + ```bash 319 + sudo su -c 'echo core > /proc/sys/kernel/core_pattern' 320 + ``` 321 + 322 + #### CPU Scaling Error 323 + 324 + ``` 325 + AFL: The CPU frequency scaling governor is set to "powersave"... 326 + ``` 327 + 328 + **Fix:** 329 + ```bash 330 + sudo cpufreq-set -g performance 331 + # Or use AFL_SKIP_CPUFREQ=1 to ignore (not recommended) 332 + ``` 333 + 334 + ## Complete Example 335 + 336 + Here's a complete fuzz test file for a YAML library: 337 + 338 + ```ocaml 339 + (** Property-based tests for MyYaml using Crowbar *) 340 + 341 + open Crowbar 342 + 343 + (* Helper functions *) 344 + let decode codec s = 345 + let reader = Bytesrw.Bytes.Reader.of_string s in 346 + MyYaml.decode codec reader 347 + 348 + let encode codec v = 349 + let buf = Buffer.create 256 in 350 + let writer = Bytesrw.Bytes.Writer.of_buffer buf in 351 + match MyYaml.encode codec v writer with 352 + | Ok () -> Ok (Buffer.contents buf) 353 + | Error e -> Error e 354 + 355 + (* Test: Arbitrary input shouldn't crash *) 356 + let () = 357 + add_test ~name:"yaml_no_crash" [ bytes ] @@ fun input -> 358 + match decode Codec.string input with 359 + | Ok _ | Error _ -> check true 360 + 361 + (* Test: String roundtrip *) 362 + let () = 363 + add_test ~name:"yaml_string_roundtrip" 364 + [ choose [ const "hello"; const ""; const "with\nnewline" ] ] 365 + @@ fun s -> 366 + match encode Codec.string s with 367 + | Error e -> failwith ("encode: " ^ e) 368 + | Ok yaml -> 369 + match decode Codec.string yaml with 370 + | Error e -> failwith ("decode: " ^ e) 371 + | Ok decoded -> check (decoded = s) 372 + 373 + (* Test: Int roundtrip *) 374 + let () = 375 + add_test ~name:"yaml_int_roundtrip" [ range 1000000 ] @@ fun n -> 376 + match encode Codec.int n with 377 + | Error e -> failwith ("encode: " ^ e) 378 + | Ok yaml -> 379 + match decode Codec.int yaml with 380 + | Error e -> failwith ("decode: " ^ e) 381 + | Ok decoded -> check (decoded = n) 382 + 383 + (* Test: Bool roundtrip *) 384 + let () = 385 + add_test ~name:"yaml_bool_roundtrip" [ bool ] @@ fun b -> 386 + match encode Codec.bool b with 387 + | Error e -> failwith ("encode: " ^ e) 388 + | Ok yaml -> 389 + match decode Codec.bool yaml with 390 + | Error e -> failwith ("decode: " ^ e) 391 + | Ok decoded -> check (decoded = b) 392 + 393 + (* Test: List roundtrip *) 394 + let () = 395 + add_test ~name:"yaml_list_roundtrip" 396 + [ choose [ const []; const [1;2;3]; const [0] ] ] 397 + @@ fun lst -> 398 + let codec = Codec.list Codec.int in 399 + match encode codec lst with 400 + | Error e -> failwith ("encode: " ^ e) 401 + | Ok yaml -> 402 + match decode codec yaml with 403 + | Error e -> failwith ("decode: " ^ e) 404 + | Ok decoded -> check (decoded = lst) 405 + 406 + (* Test: Empty input *) 407 + let () = 408 + add_test ~name:"yaml_empty" [ const () ] @@ fun () -> 409 + match decode Codec.string "" with 410 + | Ok _ | Error _ -> check true 411 + 412 + (* Test: Nested structures *) 413 + let () = 414 + add_test ~name:"yaml_nested" 415 + [ choose [ const [[]]; const [[1;2];[3;4]] ] ] 416 + @@ fun nested -> 417 + let codec = Codec.list (Codec.list Codec.int) in 418 + match encode codec nested with 419 + | Error e -> failwith ("encode: " ^ e) 420 + | Ok yaml -> 421 + match decode codec yaml with 422 + | Error e -> failwith ("decode: " ^ e) 423 + | Ok decoded -> check (decoded = nested) 424 + ``` 425 + 426 + ## Tips for Writing Effective Fuzz Tests 427 + 428 + 1. **Start with crash resistance** - The most basic test ensures arbitrary input doesn't crash your code. 429 + 430 + 2. **Test roundtrips** - If you have encode/decode pairs, roundtrip tests catch many bugs. 431 + 432 + 3. **Use `choose` for edge cases** - Combine random generation with known problematic values. 433 + 434 + 4. **Keep generators focused** - Generate only valid inputs for properties that require them. 435 + 436 + 5. **Handle expected failures gracefully** - Use pattern matching to distinguish expected vs unexpected failures. 437 + 438 + 6. **Test at boundaries** - Empty inputs, maximum values, and type boundaries often reveal bugs. 439 + 440 + 7. **Run QuickCheck mode in CI** - It's fast and catches many issues without AFL setup. 441 + 442 + 8. **Use AFL for deep testing** - AFL's coverage guidance finds edge cases that random testing misses. 443 + 444 + ## Bugs Found by This Approach 445 + 446 + Example bugs discovered through fuzz testing: 447 + 448 + - **Empty string encoding** - Empty strings weren't properly quoted, causing decode failures 449 + - **Empty array encoding** - Empty arrays `[]` weren't using flow style, breaking roundtrips 450 + - **Large integer precision** - Integers > 2^53-1 lost precision in float-based formats 451 + - **Option type decoding** - Nested type combinators weren't being unwrapped during scalar decoding 452 + - **Special character handling** - Strings with newlines or special YAML characters needed escaping 453 + 454 + ### HTML Parser Bugs (html5rw) 455 + 456 + Fuzz testing html5rw discovered these serialization bugs: 457 + 458 + - **Raw text element double-escaping** - Elements like `<script>`, `<style>`, `<xmp>`, `<iframe>`, `<noembed>`, `<noframes>`, `<noscript>` were having their content HTML-escaped during serialization, which broke roundtrips (e.g., `<` becoming `&lt;` inside script tags) 459 + - **Escapable raw text handling** - Elements `<textarea>` and `<title>` need special treatment: only `&` needs escaping, not `<` or `>` 460 + - **Plaintext element serialization** - The `<plaintext>` element cannot be closed in HTML5, so serialization must stop after emitting its content 461 + 462 + ## AFL++ Compatibility Notes 463 + 464 + When using AFL++ (modern fork of AFL) with OCaml: 465 + 466 + ### Crowbar Persistent Mode Issues 467 + 468 + Crowbar's built-in AFL support may not work with AFL++. The `@@` file-based mode with Crowbar often results in "No instrumentation detected" errors even when instrumentation is present. 469 + 470 + **Solution**: Create a separate AFL fuzzer using `afl-persistent` directly: 471 + 472 + ```ocaml 473 + (* fuzz_afl.ml - Simple AFL-compatible fuzzer *) 474 + 475 + let fuzz_input input = 476 + try 477 + (* Your fuzzing logic here *) 478 + let result = MyLib.parse input in 479 + let serialized = MyLib.serialize result in 480 + (* Check properties, assert on failures *) 481 + if not (some_property serialized) then 482 + assert false; 483 + true 484 + with 485 + | Assert_failure _ -> raise (Assert_failure ("", 0, 0)) 486 + | _ -> true (* Expected failures for malformed input *) 487 + 488 + let read_file filename = 489 + let ic = open_in_bin filename in 490 + let n = in_channel_length ic in 491 + let s = really_input_string ic n in 492 + close_in ic; 493 + s 494 + 495 + let () = 496 + AflPersistent.run (fun () -> 497 + if Array.length Sys.argv < 2 then begin 498 + Printf.eprintf "Usage: %s <input_file>\n" Sys.argv.(0); 499 + exit 1 500 + end; 501 + let input = read_file Sys.argv.(1) in 502 + ignore (fuzz_input input) 503 + ) 504 + ``` 505 + 506 + Add `afl-persistent` to dependencies in dune: 507 + 508 + ```lisp 509 + (executable 510 + (name fuzz_afl) 511 + (libraries mylib afl-persistent)) 512 + ``` 513 + 514 + ### Building with AFL Instrumentation 515 + 516 + Use `-x afl` flag (not `--context afl`): 517 + 518 + ```bash 519 + dune build -x afl ./fuzz/fuzz_afl.exe 520 + ``` 521 + 522 + ### Environment Variables for AFL++ 523 + 524 + ```bash 525 + # Suppress core_pattern warning (when you can't modify /proc/sys/kernel/core_pattern) 526 + export AFL_I_DONT_CARE_ABOUT_MISSING_CRASHES=1 527 + 528 + # Skip CPU frequency scaling check 529 + export AFL_SKIP_CPUFREQ=1 530 + 531 + # Run AFL 532 + afl-fuzz -i input_corpus -o output -- _build/afl/fuzz/fuzz_afl.exe @@ 533 + ``` 534 + 535 + ### Verifying Instrumentation 536 + 537 + Use `afl-showmap` to verify instrumentation is detected: 538 + 539 + ```bash 540 + afl-showmap -o /tmp/map -- _build/afl/fuzz/fuzz_afl.exe /tmp/test_input 541 + cat /tmp/map | head -20 # Should show tuples like "000062:3" 542 + ``` 543 + 544 + A healthy instrumentation should capture 200+ tuples for a non-trivial program. 545 + 546 + ## Creating Effective Seed Corpus 547 + 548 + For HTML parsers, include diverse constructs: 549 + 550 + ```bash 551 + mkdir -p fuzz/input_corpus 552 + 553 + # Basic structure 554 + echo '<html><head><title>Test</title></head><body><p>Hello</p></body></html>' > fuzz/input_corpus/basic.html 555 + 556 + # Raw text elements (critical for escaping bugs) 557 + echo '<script>if (a < b && c > d) {}</script>' > fuzz/input_corpus/script.html 558 + echo '<style>.foo { content: "<bar>" }</style>' > fuzz/input_corpus/style.html 559 + 560 + # Escapable raw text elements 561 + echo '<textarea>Text with <html> tags & entities</textarea>' > fuzz/input_corpus/textarea.html 562 + 563 + # Tables (complex nesting rules) 564 + echo '<table><tr><td>Cell</td></tr></table>' > fuzz/input_corpus/table.html 565 + 566 + # Forms 567 + echo '<form><input type="text"><button>Submit</button></form>' > fuzz/input_corpus/form.html 568 + 569 + # SVG/MathML (foreign content) 570 + echo '<svg><circle r="50"/></svg>' > fuzz/input_corpus/svg.html 571 + echo '<math><mrow><mi>x</mi></mrow></math>' > fuzz/input_corpus/math.html 572 + 573 + # Entities 574 + echo '<p>&amp; &lt; &gt; &nbsp;</p>' > fuzz/input_corpus/entities.html 575 + 576 + # Template elements 577 + echo '<template><div>Content</div></template>' > fuzz/input_corpus/template.html 578 + 579 + # Malformed HTML (parser recovery) 580 + echo '<p>Unclosed<div>Mixed</p></div>' > fuzz/input_corpus/malformed.html 581 + 582 + # Comments 583 + echo '<!--comment-->Text<!---->More' > fuzz/input_corpus/comment.html 584 + ``` 585 + 586 + ## Recommended Test Properties for Parsers 587 + 588 + ### 1. Roundtrip Stabilization 589 + 590 + The most effective parser test: parse → serialize → parse → serialize should produce identical output on the second iteration: 591 + 592 + ```ocaml 593 + let () = 594 + add_test ~name:"roundtrip_stabilizes" [ bytes ] @@ fun input -> 595 + let r1 = parse input in 596 + let s1 = serialize r1 in 597 + let r2 = parse s1 in 598 + let s2 = serialize r2 in 599 + let r3 = parse s2 in 600 + let s3 = serialize r3 in 601 + (* First roundtrip may differ, but second must stabilize *) 602 + check (s2 = s3) 603 + ``` 604 + 605 + ### 2. Clone Consistency 606 + 607 + For DOM-like structures with clone operations: 608 + 609 + ```ocaml 610 + let () = 611 + add_test ~name:"clone_identical" [ bytes ] @@ fun input -> 612 + let doc = parse input in 613 + let root = get_root doc in 614 + let cloned = clone ~deep:true root in 615 + check (to_html root = to_html cloned) 616 + ``` 617 + 618 + ### 3. Error Consistency 619 + 620 + Parse errors should be deterministic: 621 + 622 + ```ocaml 623 + let () = 624 + add_test ~name:"errors_consistent" [ bytes ] @@ fun input -> 625 + let r1 = parse input in 626 + let r2 = parse input in 627 + check (get_errors r1 = get_errors r2) 628 + ``` 629 + 630 + ## Workflow for Fuzz-Driven Bug Fixing 631 + 632 + 1. **Run QuickCheck mode first** - Fast iteration, catches obvious bugs 633 + ```bash 634 + dune exec ./fuzz/fuzz_mylib.exe 635 + ``` 636 + 637 + 2. **Fix bugs, re-run** - Iterate until QuickCheck passes 638 + 639 + 3. **Run AFL for deeper coverage** - 5-10 minutes initially 640 + ```bash 641 + dune build -x afl ./fuzz/fuzz_afl.exe 642 + AFL_I_DONT_CARE_ABOUT_MISSING_CRASHES=1 AFL_SKIP_CPUFREQ=1 \ 643 + timeout 600 afl-fuzz -i fuzz/input_corpus -o fuzz/output -- \ 644 + _build/afl/fuzz/fuzz_afl.exe @@ 645 + ``` 646 + 647 + 4. **Check for crashes** - Examine `fuzz/output/default/crashes/` 648 + ```bash 649 + ls fuzz/output/default/crashes/ 650 + # Reproduce crash: 651 + _build/afl/fuzz/fuzz_afl.exe fuzz/output/default/crashes/id:000000,... 652 + ``` 653 + 654 + 5. **Fix and repeat** - Continue until no crashes found after extended runs 655 + 656 + 6. **Measure coverage** - Track corpus growth and edge coverage 657 + ```bash 658 + cat fuzz/output/default/fuzzer_stats | grep -E "(execs_done|corpus_count|saved_crashes)" 659 + ```
+12
dune-workspace
··· 1 + (lang dune 3.20) 2 + 3 + (context default) 4 + 5 + (context 6 + (default 7 + (name afl) 8 + (profile afl))) 9 + 10 + (env 11 + (afl 12 + (ocamlopt_flags (:standard -afl-instrument))))
+4
fuzz/dune
··· 1 1 (executable 2 2 (name fuzz_html5rw) 3 3 (libraries bytesrw html5rw crowbar)) 4 + 5 + (executable 6 + (name fuzz_afl) 7 + (libraries bytesrw html5rw afl-persistent))
+92
fuzz/fuzz_afl.ml
··· 1 + (** Simple AFL-compatible fuzzer for html5rw 2 + 3 + This fuzzer reads input from a file (passed as command line argument) 4 + and runs several property tests on it. It uses AflPersistent for 5 + efficient AFL fuzzing. 6 + *) 7 + 8 + (* Helper to create a bytes reader from a string *) 9 + let reader_of_string s = Bytesrw.Bytes.Reader.of_string s 10 + 11 + (* Serialize a parse result to string *) 12 + let serialize result = 13 + Html5rw.to_string ~pretty:false result 14 + 15 + (* Main fuzzing function - returns true if test passes *) 16 + let fuzz_input input = 17 + try 18 + (* Test 1: Parse should not crash *) 19 + let result = Html5rw.parse (reader_of_string input) in 20 + 21 + (* Test 2: Serialization should not crash *) 22 + let serialized = serialize result in 23 + 24 + (* Test 3: Reparse should not crash *) 25 + let result2 = Html5rw.parse (reader_of_string serialized) in 26 + let serialized2 = serialize result2 in 27 + 28 + (* Test 4: Roundtrip should stabilize (s2 == s3) *) 29 + let result3 = Html5rw.parse (reader_of_string serialized2) in 30 + let serialized3 = serialize result3 in 31 + 32 + if serialized2 <> serialized3 then begin 33 + Printf.eprintf "ROUNDTRIP BUG:\n"; 34 + Printf.eprintf "Input: %s\n" (String.escaped (String.sub input 0 (min 100 (String.length input)))); 35 + Printf.eprintf "s2: %s\n" (String.escaped (String.sub serialized2 0 (min 100 (String.length serialized2)))); 36 + Printf.eprintf "s3: %s\n" (String.escaped (String.sub serialized3 0 (min 100 (String.length serialized3)))); 37 + (* Signal a bug to AFL by aborting *) 38 + assert false 39 + end; 40 + 41 + (* Test 5: Text extraction should not crash *) 42 + let _ = Html5rw.to_text result in 43 + 44 + (* Test 6: Clone should produce identical output *) 45 + let root = Html5rw.root result in 46 + let cloned = Html5rw.clone ~deep:true root in 47 + let original_html = Html5rw.Dom.to_html ~pretty:false root in 48 + let cloned_html = Html5rw.Dom.to_html ~pretty:false cloned in 49 + 50 + if original_html <> cloned_html then begin 51 + Printf.eprintf "CLONE BUG:\n"; 52 + Printf.eprintf "Original: %s\n" (String.escaped (String.sub original_html 0 (min 100 (String.length original_html)))); 53 + Printf.eprintf "Cloned: %s\n" (String.escaped (String.sub cloned_html 0 (min 100 (String.length cloned_html)))); 54 + assert false 55 + end; 56 + 57 + (* Test 7: Selector queries should not crash (test a few common patterns) *) 58 + let selectors = ["*"; "div"; ".class"; "#id"; "div > p"; "[attr]"] in 59 + List.iter (fun sel -> 60 + try 61 + let _ = Html5rw.query result sel in () 62 + with Html5rw.Selector.Selector_error _ -> () 63 + ) selectors; 64 + 65 + true 66 + with 67 + | Assert_failure _ -> 68 + (* Re-raise assert failures so AFL sees the crash *) 69 + raise (Assert_failure ("", 0, 0)) 70 + | _ -> 71 + (* Other exceptions are expected for malformed input *) 72 + true 73 + 74 + (* Read file contents *) 75 + let read_file filename = 76 + let ic = open_in_bin filename in 77 + let n = in_channel_length ic in 78 + let s = really_input_string ic n in 79 + close_in ic; 80 + s 81 + 82 + (* Main entry point *) 83 + let () = 84 + (* Use AflPersistent for efficient AFL fuzzing *) 85 + AflPersistent.run (fun () -> 86 + if Array.length Sys.argv < 2 then begin 87 + Printf.eprintf "Usage: %s <input_file>\n" Sys.argv.(0); 88 + exit 1 89 + end; 90 + let input = read_file Sys.argv.(1) in 91 + ignore (fuzz_input input) 92 + )