Encoding sniffing: BOM, HTTP charset, meta prescan #72

Phase 8: Resource Loading + Character Encoding + Real Page Loading

Implement encoding detection/sniffing per the WHATWG Encoding Standard and HTML spec.

Requirements

BOM sniffing: detect UTF-8 BOM (EF BB BF), UTF-16LE BOM (FF FE), UTF-16BE BOM (FE FF) at start of byte stream
HTTP Content-Type charset: extract charset parameter from Content-Type header (e.g., text/html; charset=utf-8)
HTML meta prescan: scan the first 1024 bytes of an HTML document for <meta charset="..."> or <meta http-equiv="Content-Type" content="text/html; charset=..."> without full HTML parsing
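Encoding confidence levels: tentative vs certain (BOM and HTTP header are certain; meta prescan and the default fallback are tentative)
Default encoding: fall back to windows-1252 for HTML if no encoding is detected (the HTML spec leaves the default implementation-defined and suggests windows-1252 for most locales)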
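
The two "certain" sources above could be sketched roughly as follows. This is a minimal illustration, not the intended implementation: the helper names `sniff_bom` and `charset_from_content_type` are hypothetical, encodings are represented as plain label strings instead of the project's `Encoding` type, and the header parser assumes the charset value itself contains no `;`.

```rust
/// Hypothetical helper: returns the encoding label implied by a BOM,
/// together with the BOM length in bytes so the caller can skip it.
pub fn sniff_bom(bytes: &[u8]) -> Option<(&'static str, usize)> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(("UTF-8", 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some(("UTF-16LE", 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some(("UTF-16BE", 2))
    } else {
        None
    }
}

/// Hypothetical helper: extracts the charset parameter from a
/// Content-Type header value, lowercased, with surrounding quotes removed.
/// Assumption: parameters are separated by ';' and values contain no ';'.
pub fn charset_from_content_type(value: &str) -> Option<String> {
    // The first segment is the media type; parameters follow it.
    for param in value.split(';').skip(1) {
        if let Some((name, val)) = param.split_once('=') {
            if !name.trim().eq_ignore_ascii_case("charset") {
                continue;
            }
            let val = val.trim();
            // Strip one pair of surrounding double quotes, if present.
            let val = val
                .strip_prefix('"')
                .and_then(|v| v.strip_suffix('"'))
                .unwrap_or(val);
            if !val.is_empty() {
                return Some(val.to_ascii_lowercase());
            }
        }
    }
    None
}
```

Returning the BOM length alongside the label is a design choice of this sketch: the decoder usually needs to skip those bytes, and the signature in the API section does not expose them.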

API

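/// Which step of the sniffing algorithm produced the encoding.
pub enum EncodingSource {
    /// Byte order mark at the start of the byte stream (certain).
    Bom,
    /// charset parameter of the HTTP Content-Type header (certain).
    HttpHeader,
    /// <meta> declaration found by the prescan (tentative).
    MetaPrescan,
    /// No declaration found; the fallback encoding was used.
    Default,
}

/// Determines the encoding of `bytes`, checking sources in priority order:
/// BOM, then HTTP header, then meta prescan, then the default.
pub fn sniff_encoding(
    bytes: &[u8],
    http_content_type: Option<&str>,
) -> (Encoding, EncodingSource);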
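
As a rough illustration of how the priority order and the confidence levels could hang together, the sketch below derives confidence from the source and picks the first available result. `Confidence`, `confidence()`, and `pick_by_priority` are hypothetical names that are not part of the API above. `EncodingSource` is repeated (with derives added) only so the block compiles on its own, and the encoding type is left generic because `Encoding` comes from the dependency issues.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EncodingSource {
    Bom,
    HttpHeader,
    MetaPrescan,
    Default,
}

/// Hypothetical type modelling the "tentative vs certain" requirement.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Confidence {
    Tentative,
    Certain,
}

impl EncodingSource {
    /// BOM and HTTP header are certain; meta prescan and the default
    /// fallback are tentative.
    pub fn confidence(self) -> Confidence {
        match self {
            EncodingSource::Bom | EncodingSource::HttpHeader => Confidence::Certain,
            EncodingSource::MetaPrescan | EncodingSource::Default => Confidence::Tentative,
        }
    }
}

/// Hypothetical combinator: given the result of each detection step,
/// returns the first one present in priority order, or the fallback.
pub fn pick_by_priority<E>(
    bom: Option<E>,
    http: Option<E>,
    meta: Option<E>,
    fallback: E,
) -> (E, EncodingSource) {
    if let Some(e) = bom {
        return (e, EncodingSource::Bom);
    }
    if let Some(e) = http {
        return (e, EncodingSource::HttpHeader);
    }
    if let Some(e) = meta {
        return (e, EncodingSource::MetaPrescan);
    }
    (fallback, EncodingSource::Default)
}
```

Deriving confidence from the source keeps the return type of `sniff_encoding` unchanged; whether confidence should instead be returned explicitly is left open here.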

Acceptance Criteria

BOM detection works for UTF-8, UTF-16LE, UTF-16BE
HTTP charset extraction handles quoted and unquoted values
Meta prescan finds charset in <meta> tags within first 1024 bytes
Correct priority: BOM > HTTP header > meta prescan > default
No external dependencies, no unsafe
Unit tests for each detection method and priority ordering
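
For the meta prescan criterion, a deliberately simplified sketch might look like the following. It is not the HTML spec's prescan algorithm: it does not skip comments, does not check `http-equiv`, does not verify that `<meta` is followed by whitespace or `/`, and gives up on a tag if the first `charset` is not followed by `=`. The function names are hypothetical and the result is a lowercased label string, not an `Encoding`.

```rust
/// Hypothetical, simplified prescan over at most the first 1024 bytes.
pub fn prescan_meta_charset(bytes: &[u8]) -> Option<String> {
    let window = &bytes[..bytes.len().min(1024)];
    // Lowercase a copy so tag and attribute matching is case-insensitive.
    let lower: Vec<u8> = window.iter().map(|b| b.to_ascii_lowercase()).collect();
    let mut pos = 0;
    while let Some(off) = find(&lower[pos..], b"<meta") {
        let tag_start = pos + off + 5;
        // The tag ends at the next '>' or at the end of the window.
        let tag_end = lower[tag_start..]
            .iter()
            .position(|&b| b == b'>')
            .map_or(lower.len(), |i| tag_start + i);
        if let Some(label) = charset_in_tag(&lower[tag_start..tag_end]) {
            return Some(label);
        }
        pos = tag_end;
    }
    None
}

/// Naive byte-substring search; `needle` must be non-empty.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn skip_ws(s: &[u8]) -> &[u8] {
    let n = s.iter().take_while(|b| b.is_ascii_whitespace()).count();
    &s[n..]
}

/// Looks for `charset = value` inside one tag; the value may be quoted.
fn charset_in_tag(tag: &[u8]) -> Option<String> {
    let after = &tag[find(tag, b"charset")? + 7..];
    let rest = skip_ws(after);
    if rest.first() != Some(&b'=') {
        return None;
    }
    let rest = skip_ws(&rest[1..]);
    let first = *rest.first()?;
    let value = if first == b'"' || first == b'\'' {
        // Quoted value: read up to the matching closing quote.
        let body = &rest[1..];
        let close = body.iter().position(|&b| b == first)?;
        &body[..close]
    } else {
        // Unquoted value: read up to whitespace, ';', or a quote.
        let end = rest
            .iter()
            .position(|&b| b.is_ascii_whitespace() || b == b';' || b == b'"' || b == b'\'')
            .unwrap_or(rest.len());
        &rest[..end]
    };
    if value.is_empty() {
        None
    } else {
        Some(String::from_utf8_lossy(value).into_owned())
    }
}
```

Because the search runs on the whole tag text, the same code path covers both `<meta charset="...">` and the `content="text/html; charset=..."` form. A declaration that starts after byte 1024 is ignored.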

Dependencies

Depends on: WHATWG Encoding: UTF-8 and UTF-16 codecs, Legacy single-byte encodings