we (web engine): Experimental web browser project to understand the limits of Claude

Encoding sniffing: BOM, HTTP charset, meta prescan #72

Open, opened by pierrelf.com

Phase 8: Resource Loading + Character Encoding + Real Page Loading

Implement encoding detection/sniffing per the WHATWG Encoding Standard and HTML spec.

Requirements

  • BOM sniffing: detect UTF-8 BOM (EF BB BF), UTF-16LE BOM (FF FE), UTF-16BE BOM (FE FF) at start of byte stream
  • HTTP Content-Type charset: extract charset parameter from Content-Type header (e.g., text/html; charset=utf-8)
  • HTML meta prescan: scan the first 1024 bytes of an HTML document for <meta charset="..."> or <meta http-equiv="Content-Type" content="text/html; charset=..."> without full HTML parsing
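  • Encoding confidence levels: tentative vs certain (BOM and HTTP header are certain; meta prescan and the default fallback are tentative)
  • Default encoding: fall back to windows-1252 for HTML if no encoding is detected (the HTML spec leaves the default implementation-defined and suggests windows-1252 for most locales)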
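
The BOM requirement above could be sketched roughly as follows. `BomEncoding` and `sniff_bom` are hypothetical names used only for illustration (the real code would presumably return the project's own `Encoding` type); the byte patterns are the ones listed in the requirement. Returning the BOM length alongside the encoding is an assumption, made so a caller can skip the BOM before decoding.

```rust
/// Encodings a BOM can identify.
/// Hypothetical stand-in for the project's `Encoding` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BomEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Returns the encoding signalled by a leading BOM together with the BOM
/// length in bytes, or `None` if the stream does not start with a BOM.
pub fn sniff_bom(bytes: &[u8]) -> Option<(BomEncoding, usize)> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((BomEncoding::Utf8, 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some((BomEncoding::Utf16Le, 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some((BomEncoding::Utf16Be, 2))
    } else {
        // Too short or no recognised BOM: let later sniffing steps decide.
        None
    }
}
```

A truncated prefix such as `EF BB` is treated as "no BOM" here, so the later steps (HTTP header, meta prescan, default) still get a chance to run.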

API

pub enum EncodingSource {
    Bom,
    HttpHeader,
    MetaPrescan,
    Default,
}

pub fn sniff_encoding(
    bytes: &[u8],
    http_content_type: Option<&str>,
) -> (Encoding, EncodingSource);
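
One way `sniff_encoding` might combine the individual detectors is sketched below, assuming each detector has already produced an `Option`. The `Encoding` enum here is a small stand-in (the real type comes from the codec work this issue depends on), and `resolve_encoding`, `Confidence`, and `confidence()` are hypothetical names. Treating the default fallback as tentative is an assumption that follows the HTML spec's sniffing algorithm.

```rust
/// Stand-in for the project's `Encoding` type (assumption for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Utf8,
    Utf16Le,
    Utf16Be,
    Windows1252,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EncodingSource {
    Bom,
    HttpHeader,
    MetaPrescan,
    Default,
}

/// Hypothetical confidence type derived from the source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Confidence {
    Tentative,
    Certain,
}

impl EncodingSource {
    /// BOM and HTTP header are certain; prescan and default are tentative.
    pub fn confidence(self) -> Confidence {
        match self {
            EncodingSource::Bom | EncodingSource::HttpHeader => Confidence::Certain,
            EncodingSource::MetaPrescan | EncodingSource::Default => Confidence::Tentative,
        }
    }
}

/// Applies the priority order BOM > HTTP header > meta prescan > default.
pub fn resolve_encoding(
    bom: Option<Encoding>,
    http: Option<Encoding>,
    meta: Option<Encoding>,
) -> (Encoding, EncodingSource) {
    if let Some(enc) = bom {
        return (enc, EncodingSource::Bom);
    }
    if let Some(enc) = http {
        return (enc, EncodingSource::HttpHeader);
    }
    if let Some(enc) = meta {
        return (enc, EncodingSource::MetaPrescan);
    }
    (Encoding::Windows1252, EncodingSource::Default)
}
```

Deriving the confidence from `EncodingSource` keeps the proposed return type unchanged while still exposing the tentative/certain distinction to the parser.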

Acceptance Criteria

  • BOM detection works for UTF-8, UTF-16LE, UTF-16BE
  • HTTP charset extraction handles quoted and unquoted values
  • Meta prescan finds charset in <meta> tags within first 1024 bytes
  • Correct priority: BOM > HTTP header > meta prescan > default
  • No external dependencies, no unsafe
  • Unit tests for each detection method and priority ordering
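
For the quoted/unquoted criterion, a minimal sketch of charset extraction might look like this. `extract_charset` is a hypothetical helper name, and the parsing is deliberately simplified: it splits parameters on `;`, so a `;` inside a quoted value is not handled, and it returns the raw label without mapping it to an `Encoding`.

```rust
/// Extracts the raw `charset` label from a Content-Type header value.
/// Simplified sketch: does not handle ';' inside quoted strings or
/// backslash escapes, and does not validate the label.
pub fn extract_charset(content_type: &str) -> Option<&str> {
    // Skip the media type itself; parameters follow the first ';'.
    for param in content_type.split(';').skip(1) {
        if let Some((name, value)) = param.split_once('=') {
            // Parameter names are matched case-insensitively.
            if !name.trim().eq_ignore_ascii_case("charset") {
                continue;
            }
            let value = value.trim();
            // Strip one pair of surrounding double quotes, if present.
            let value = value
                .strip_prefix('"')
                .and_then(|v| v.strip_suffix('"'))
                .unwrap_or(value);
            if !value.is_empty() {
                return Some(value);
            }
        }
    }
    None
}
```

The returned label would still need a case-insensitive lookup against the Encoding Standard's label table before it can be used as an encoding.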

Dependencies

Depends on: WHATWG Encoding: UTF-8 and UTF-16 codecs, Legacy single-byte encodings

AT URI
at://did:plc:meotu43t6usg4qdwzenk4s2t/sh.tangled.repo.issue/3mhkt62ryxo2l