Speculative HTML parsing (off-main-thread tokenization) #145

Summary

Implement speculative HTML parsing by running a second HTML tokenizer on a background thread while the main thread executes scripts, so that resources further down the document can be discovered and fetched earlier.

Background

When the HTML parser encounters a <script> tag, it must pause tree construction and execute the script (which may call document.write). During this pause, a tokenizer could speculatively continue scanning the rest of the document to discover resources (<link>, <img>, <script src>) so that they can be fetched in parallel. On script-heavy pages this overlaps network fetches with script execution instead of serializing them, which is where the load-time improvement comes from.
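As a rough illustration of the kind of scan described above, here is a minimal Rust sketch of resource discovery. All names (`ResourceKind`, `PreloadHint`, `attr_value`, `discover_resources`) are hypothetical and not taken from the existing codebase. The sketch assumes lowercase attribute names with double-quoted values. It would be confused by a `>` inside an attribute value or a `<` inside inline script text, which a real tokenizer state machine has to handle.

```rust
/// Kinds of preloadable resources named in this issue.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResourceKind {
    Stylesheet,
    Script,
    Image,
    Preload,
}

/// A discovered URL and what it points at (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PreloadHint {
    pub kind: ResourceKind,
    pub url: String,
}

/// Returns the value of `name="..."` inside a tag's text, if present.
/// Assumption: lowercase names, double-quoted values, whitespace before the name
/// (so `data-src` is not mistaken for `src`).
fn attr_value<'a>(tag: &'a str, name: &str) -> Option<&'a str> {
    let needle = format!("{}=\"", name);
    let mut search_from = 0;
    while let Some(rel) = tag[search_from..].find(&needle) {
        let at = search_from + rel;
        let start = at + needle.len();
        if at > 0 && tag.as_bytes()[at - 1].is_ascii_whitespace() {
            let end = tag[start..].find('"')?;
            return Some(&tag[start..start + end]);
        }
        search_from = start;
    }
    None
}

/// Scans `html` for tags and collects preloadable URLs. No DOM is built.
pub fn discover_resources(html: &str) -> Vec<PreloadHint> {
    let mut hints = Vec::new();
    let mut rest = html;
    while let Some(open) = rest.find('<') {
        let after = &rest[open + 1..];
        let close = match after.find('>') {
            Some(i) => i,
            None => break, // unterminated tag: stop scanning
        };
        let tag = &after[..close];
        rest = &after[close + 1..];
        // Closing tags start with '/', so their name comes out empty and is ignored.
        let name = tag
            .split(|c: char| c.is_ascii_whitespace() || c == '/')
            .next()
            .unwrap_or("")
            .to_ascii_lowercase();
        let found = match name.as_str() {
            "script" => attr_value(tag, "src").map(|u| (ResourceKind::Script, u)),
            "img" => attr_value(tag, "src").map(|u| (ResourceKind::Image, u)),
            "link" => match (attr_value(tag, "rel"), attr_value(tag, "href")) {
                (Some("stylesheet"), Some(u)) => Some((ResourceKind::Stylesheet, u)),
                (Some("preload"), Some(u)) => Some((ResourceKind::Preload, u)),
                _ => None,
            },
            _ => None,
        };
        if let Some((kind, url)) = found {
            hints.push(PreloadHint { kind, url: url.to_string() });
        }
    }
    hints
}
```

Calling `discover_resources` on the unparsed remainder of the document yields the list that would be handed to the resource loader. Inline scripts (no `src`) produce no hint here.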

Acceptance Criteria

Speculative tokenizer: a second tokenizer instance that scans ahead in the HTML input while the main parser is blocked on script execution
Resource discovery: the speculative tokenizer identifies preloadable resources:
- <link rel="stylesheet" href="..."> — CSS files
- <script src="..."> — JavaScript files
- <img src="..."> — Images
- <link rel="preload" href="..." as="..."> — Preload hints
Preload queue: discovered URLs are sent to the resource loader for early fetching
Speculation results: tokens produced by the speculative tokenizer are buffered and reused by the main parser if document.write did not invalidate them
Invalidation: if document.write injects content, discard speculative results from that point forward and re-tokenize
Thread safety: speculative tokenizer runs on a background thread; communication via channels (std::sync::mpsc)
Main parser behavior is unchanged when speculation is disabled
All existing HTML parsing tests pass
Add tests for: speculation hit (no document.write), speculation miss (document.write invalidates)
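The buffering and invalidation criteria above could be sketched roughly as follows. This is an assumption about shape, not a prescribed design: `SpeculativeToken`, `SpeculationBuffer`, `invalidate_from`, and `take_valid` are hypothetical names, and tokens are assumed to carry the byte offset at which they start, so that "from that point forward" can be expressed as an offset comparison.

```rust
/// A token produced ahead of the main parser, tagged with the byte offset
/// in the input where it starts (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SpeculativeToken {
    pub offset: usize,
    pub text: String,
}

/// Buffer of speculative results waiting to be reused by the main parser.
#[derive(Debug, Default)]
pub struct SpeculationBuffer {
    tokens: Vec<SpeculativeToken>,
}

impl SpeculationBuffer {
    /// Records one speculatively produced token.
    pub fn push(&mut self, token: SpeculativeToken) {
        self.tokens.push(token);
    }

    /// Called when document.write injects content at `write_offset`.
    /// Drops every token starting at or after that offset and returns
    /// how many were dropped; that range must then be re-tokenized.
    pub fn invalidate_from(&mut self, write_offset: usize) -> usize {
        let before = self.tokens.len();
        self.tokens.retain(|t| t.offset < write_offset);
        before - self.tokens.len()
    }

    /// Hands all surviving tokens to the main parser and empties the buffer.
    pub fn take_valid(&mut self) -> Vec<SpeculativeToken> {
        std::mem::take(&mut self.tokens)
    }
}
```

A speculation-hit test would push tokens and check that `take_valid` returns all of them. A speculation-miss test would call `invalidate_from` first and check that only tokens before the write offset survive.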

Implementation Notes

The speculative tokenizer only needs to find tags — it doesn't need to build a DOM tree
It can use a simplified state machine that only tracks tag names and src/href attributes
Communication: main thread sends (html_bytes, start_offset) to speculative thread; speculative thread sends back Vec
Use std::thread::spawn and std::sync::mpsc::channel — no external threading crates
The speculative tokenizer should be conservative: if it encounters <script> (without src), it should stop speculating until the main parser catches up
This is an optimization — the browser must work correctly with speculation disabled
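One possible shape for the thread and channel wiring described in these notes, using only `std::thread::spawn` and `std::sync::mpsc::channel`. The message types (`SpeculationRequest`, `SpeculationReply`) and the functions `find`, `scan`, and `spawn_speculator` are hypothetical. In particular, the issue does not spell out the element type of the returned Vec, so the reply here simply carries URL strings plus the offset where scanning stopped. The scanner is deliberately crude: it assumes lowercase tag names and double-quoted attributes, and does not check `rel` on `<link>`. It is a stand-in for the real simplified state machine.

```rust
use std::sync::mpsc;
use std::thread;

/// Main thread -> speculative thread: input bytes and where to start.
#[derive(Debug)]
pub struct SpeculationRequest {
    pub html: Vec<u8>,
    pub start_offset: usize,
}

/// Speculative thread -> main thread (hypothetical shape).
#[derive(Debug)]
pub struct SpeculationReply {
    pub urls: Vec<String>,
    /// Offset where scanning stopped: end of input, an unterminated tag,
    /// or the start of an inline <script>.
    pub stopped_at: usize,
}

/// Byte-slice substring search; `needle` must be non-empty.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Crude stand-in scanner: reports src/href of script, img, and link tags,
/// and stops conservatively at the first <script> without a src attribute.
fn scan(html: &[u8], start_offset: usize) -> SpeculationReply {
    let mut urls = Vec::new();
    let mut pos = start_offset.min(html.len());
    while let Some(open) = find(&html[pos..], b"<") {
        let tag_start = pos + open;
        let tag_end = match find(&html[tag_start..], b">") {
            Some(rel) => tag_start + rel,
            None => return SpeculationReply { urls, stopped_at: tag_start },
        };
        let tag = &html[tag_start + 1..tag_end];
        pos = tag_end + 1;
        let is_script = tag.starts_with(b"script");
        let attr: &[u8] = if is_script || tag.starts_with(b"img") {
            &b"src=\""[..]
        } else if tag.starts_with(b"link") {
            &b"href=\""[..]
        } else {
            continue; // not a tag we preload from
        };
        let url = find(tag, attr).and_then(|at| {
            let start = at + attr.len();
            let len = find(&tag[start..], b"\"")?;
            Some(String::from_utf8_lossy(&tag[start..start + len]).into_owned())
        });
        match url {
            Some(u) => urls.push(u),
            // Inline script: stop until the main parser catches up.
            None if is_script => return SpeculationReply { urls, stopped_at: tag_start },
            None => {}
        }
    }
    SpeculationReply { urls, stopped_at: html.len() }
}

/// Spawns the speculative thread and returns both channel ends plus its handle.
/// The thread exits when the request Sender is dropped.
pub fn spawn_speculator() -> (
    mpsc::Sender<SpeculationRequest>,
    mpsc::Receiver<SpeculationReply>,
    thread::JoinHandle<()>,
) {
    let (req_tx, req_rx) = mpsc::channel::<SpeculationRequest>();
    let (reply_tx, reply_rx) = mpsc::channel::<SpeculationReply>();
    let handle = thread::spawn(move || {
        for req in req_rx {
            let reply = scan(&req.html, req.start_offset);
            if reply_tx.send(reply).is_err() {
                break; // main thread dropped its Receiver
            }
        }
    });
    (req_tx, reply_rx, handle)
}
```

With speculation disabled, the main parser would simply never call `spawn_speculator`, which keeps its behavior unchanged. Reporting `stopped_at` lets the main thread decide where a later request should resume.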

Dependencies

None — independent of other Phase 15 work.

Phase

Phase 15: Performance