we (web engine): Experimental web browser project to understand the limits of Claude

Speculative HTML parsing (off-main-thread tokenization) #145

Open. Opened by pierrelf.com

Summary

Implement speculative HTML parsing by running the HTML tokenizer on a background thread while the main thread executes scripts, enabling faster page loads.

Background

When the HTML parser encounters a <script> tag, it must pause tree construction and execute the script (which may call document.write). During this pause, the tokenizer could speculatively continue scanning the rest of the document to discover resources (<link>, <img>, <script src>) that can be fetched in parallel. Because those fetches overlap with script execution instead of waiting for it to finish, script-heavy pages stand to benefit the most.
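To make the scan-ahead idea concrete, here is a minimal sketch of a resource-discovery pass. It is not the project's tokenizer: `Preload`, `attr`, and `discover_preloads` are hypothetical names, and the scanner assumes quoted attribute values with no whitespace around `=` and no `>` inside attribute values. It stops at the first inline <script>, since that script could call document.write.

```rust
/// Kinds of resources the scan can hand to the loader (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum Preload {
    Stylesheet(String),
    Script(String),
    Image(String),
    Hint { href: String, as_type: String },
}

/// Return the quoted value of attribute `name` inside the text of one tag.
/// Simplification: only matches ` name="..."` or ` name='...'`.
fn attr(tag: &str, name: &str) -> Option<String> {
    // ASCII lowercasing keeps byte offsets aligned with `tag`.
    let lower = tag.to_ascii_lowercase();
    for quote in ['"', '\''] {
        let needle = format!(" {name}={quote}");
        if let Some(pos) = lower.find(&needle) {
            let start = pos + needle.len();
            let len = tag[start..].find(quote)?;
            return Some(tag[start..start + len].to_string());
        }
    }
    None
}

/// Scan `html` for preloadable resources, in document order.
pub fn discover_preloads(html: &str) -> Vec<Preload> {
    let mut found = Vec::new();
    let mut rest = html;
    while let Some(open) = rest.find('<') {
        let after = &rest[open + 1..];
        let close = match after.find('>') {
            Some(i) => i,
            None => break, // unterminated tag: nothing more to scan yet
        };
        let tag = &after[..close];
        rest = &after[close + 1..];
        // Closing tags start with '/', so their name comes out empty.
        let name = tag
            .split(|c: char| c.is_ascii_whitespace() || c == '/')
            .next()
            .unwrap_or("")
            .to_ascii_lowercase();
        match name.as_str() {
            "script" => match attr(tag, "src") {
                Some(src) => found.push(Preload::Script(src)),
                None => break, // inline script: stop speculating here
            },
            "img" => {
                if let Some(src) = attr(tag, "src") {
                    found.push(Preload::Image(src));
                }
            }
            "link" => {
                let rel = attr(tag, "rel").unwrap_or_default().to_ascii_lowercase();
                if let Some(href) = attr(tag, "href") {
                    match rel.as_str() {
                        "stylesheet" => found.push(Preload::Stylesheet(href)),
                        "preload" => {
                            let as_type = attr(tag, "as").unwrap_or_default();
                            found.push(Preload::Hint { href, as_type });
                        }
                        _ => {}
                    }
                }
            }
            _ => {}
        }
    }
    found
}
```

Calling `discover_preloads` on the unparsed remainder of the document yields the URLs to enqueue. A real implementation would also need to handle comments, unquoted attributes, and <base href>, which this sketch ignores.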

Acceptance Criteria

  • Speculative tokenizer: a second tokenizer instance that scans ahead in the HTML input while the main parser is blocked on script execution
  • Resource discovery: the speculative tokenizer identifies preloadable resources:
    • <link rel="stylesheet" href="..."> — CSS files
    • <script src="..."> — JavaScript files
    • <img src="..."> — Images
    • <link rel="preload" href="..." as="..."> — Preload hints
  • Preload queue: discovered URLs are sent to the resource loader for early fetching
  • Speculation results: tokens produced by the speculative tokenizer are buffered and reused by the main parser unless document.write has invalidated them
  • Invalidation: if document.write injects content, discard speculative results from that point forward and re-tokenize
  • Thread safety: speculative tokenizer runs on a background thread; communication via channels (std::sync::mpsc)
  • Main parser behavior is unchanged when speculation is disabled
  • All existing HTML parsing tests pass
  • Add tests for: speculation hit (no document.write), speculation miss (document.write invalidates)
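The buffering and invalidation criteria above could be modeled roughly as follows. This is a sketch under assumptions, not the intended design: `SpecToken` and `SpeculationBuffer` are hypothetical names, and it assumes each speculative token records the input offset at which it started, so that a document.write at a given offset can discard everything from that point forward.

```rust
/// A speculatively produced token plus the input offset where it began
/// (hypothetical shape; the real token type is not specified in the issue).
#[derive(Debug, Clone, PartialEq)]
pub struct SpecToken {
    pub offset: usize,
    pub text: String,
}

/// Holds speculative tokens until the main parser either reuses or discards them.
#[derive(Default)]
pub struct SpeculationBuffer {
    tokens: Vec<SpecToken>, // kept in increasing offset order
}

impl SpeculationBuffer {
    /// Record a token produced by the speculative tokenizer.
    pub fn push(&mut self, token: SpecToken) {
        self.tokens.push(token);
    }

    /// document.write injected content at `offset`: drop every token that
    /// starts at or after it. Returns how many tokens were discarded.
    pub fn invalidate_from(&mut self, offset: usize) -> usize {
        let before = self.tokens.len();
        self.tokens.retain(|t| t.offset < offset);
        before - self.tokens.len()
    }

    /// Hand the surviving tokens to the main parser, leaving the buffer empty.
    pub fn take(&mut self) -> Vec<SpecToken> {
        std::mem::take(&mut self.tokens)
    }
}
```

A speculation hit is then a `take` with no preceding `invalidate_from`. A miss calls `invalidate_from` with the write position, after which the main parser re-tokenizes from that offset.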

Implementation Notes

  • The speculative tokenizer only needs to find tags — it doesn't need to build a DOM tree
  • It can use a simplified state machine that only tracks tag names and src/href attributes
  • Communication: main thread sends (html_bytes, start_offset) to speculative thread; speculative thread sends back Vec
  • Use std::thread::spawn and std::sync::mpsc::channel — no external threading crates
  • The speculative tokenizer should be conservative: if it encounters <script> (without src), it should stop speculating until the main parser catches up
  • This is an optimization — the browser must work correctly with speculation disabled
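The channel-based communication described in these notes might look roughly like the sketch below, using only std::thread::spawn and std::sync::mpsc::channel. The names `Request`, `Reply`, `spawn_speculator`, `next_attr`, and `scan_urls` are hypothetical. The reply type is assumed here to be a list of URL strings, since the issue does not show the element type. The scanner is a deliberately naive stand-in that collects double-quoted src/href values.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// (html_bytes, start_offset), as described in the notes above.
pub type Request = (Vec<u8>, usize);
/// Assumed reply shape: discovered URLs in document order.
pub type Reply = Vec<String>;

/// Position and key length of the earliest `src="` or `href="` in `text`.
fn next_attr(text: &str) -> Option<(usize, usize)> {
    ["src=\"", "href=\""]
        .iter()
        .filter_map(|key| text.find(*key).map(|i| (i, key.len())))
        .min()
}

/// Naive stand-in scanner: collect double-quoted src/href values.
fn scan_urls(text: &str) -> Vec<String> {
    let mut urls = Vec::new();
    let mut rest = text;
    while let Some((i, key_len)) = next_attr(rest) {
        let value = &rest[i + key_len..];
        match value.find('"') {
            Some(end) => {
                urls.push(value[..end].to_string());
                rest = &value[end + 1..];
            }
            None => break, // unterminated attribute value
        }
    }
    urls
}

/// Start the background thread and return the main thread's channel ends.
pub fn spawn_speculator() -> (Sender<Request>, Receiver<Reply>) {
    let (req_tx, req_rx) = channel::<Request>();
    let (rep_tx, rep_rx) = channel::<Reply>();
    thread::spawn(move || {
        // The loop ends once the main thread drops its Sender.
        for (bytes, start) in req_rx {
            // An out-of-range offset yields an empty slice instead of a panic.
            let tail = bytes.get(start..).unwrap_or_default();
            let text = String::from_utf8_lossy(tail);
            if rep_tx.send(scan_urls(&text)).is_err() {
                break; // main thread stopped listening
            }
        }
    });
    (req_tx, rep_rx)
}
```

In this arrangement the main thread would send a request just before running a script and read the reply afterwards with `recv` or `try_recv`. Dropping the `Sender` shuts the worker down, so a build with speculation disabled simply never calls `spawn_speculator`.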

Dependencies

None — independent of other Phase 15 work.

Phase

Phase 15: Performance

AT URI
at://did:plc:meotu43t6usg4qdwzenk4s2t/sh.tangled.repo.issue/3mi523wvmqf2s