Phase 8 — Resource Loading + Character Encoding + Real Page Loading
Implement encoding detection/sniffing per the WHATWG Encoding Standard and HTML spec.
Requirements
- BOM sniffing: detect UTF-8 BOM (EF BB BF), UTF-16LE BOM (FF FE), UTF-16BE BOM (FE FF) at start of byte stream
- HTTP Content-Type charset: extract the charset parameter from the Content-Type header (e.g., text/html; charset=utf-8)
- HTML meta prescan: scan the first 1024 bytes of an HTML document for <meta charset="..."> or <meta http-equiv="Content-Type" content="text/html; charset=..."> without full HTML parsing
- Encoding confidence levels: tentative vs certain (BOM and HTTP header are certain; meta prescan is tentative)
- Default encoding: fall back to Windows-1252 for HTML if no encoding is detected (per spec)
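The confidence rule in the list above could be expressed as a small mapping from the detection source. This is only a sketch: `Confidence` and `confidence_for` are hypothetical names that do not appear in the API below, and `EncodingSource` is repeated here so the snippet compiles on its own. Treating the default fallback as tentative is an assumption, since the requirements only state the confidence of the other three sources.

```rust
/// Where the encoding decision came from (mirrors the enum in the API section).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EncodingSource {
    Bom,
    HttpHeader,
    MetaPrescan,
    Default,
}

/// Hypothetical confidence type; not part of the specified API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Confidence {
    Tentative,
    Certain,
}

/// Maps a detection source to its confidence level.
pub fn confidence_for(source: EncodingSource) -> Confidence {
    match source {
        // Stated in the requirements: BOM and HTTP header are certain.
        EncodingSource::Bom | EncodingSource::HttpHeader => Confidence::Certain,
        // Meta prescan is tentative per the requirements; treating the
        // default fallback as tentative too is an assumption of this sketch.
        EncodingSource::MetaPrescan | EncodingSource::Default => Confidence::Tentative,
    }
}
```

Deriving confidence from the source keeps the return type of `sniff_encoding` unchanged; a caller that needs the confidence can compute it from the second tuple element.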
API
pub enum EncodingSource {
    Bom,
    HttpHeader,
    MetaPrescan,
    Default,
}

pub fn sniff_encoding(
    bytes: &[u8],
    http_content_type: Option<&str>,
) -> (Encoding, EncodingSource);
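One possible shape for `sniff_encoding` is sketched below, using only the standard library. Several parts are assumptions made for illustration: the `Encoding` enum is a stand-in for whatever type the codec work provides, `encoding_for_label` covers only a handful of labels, and the helper names (`bom_encoding`, `charset_from_content_type`, `prescan_meta`) are hypothetical. The meta prescan here is deliberately naive: it does not skip comments or handle every attribute form that the HTML spec's prescan algorithm does, and the header parser does not handle a `;` inside a quoted value.

```rust
// Stand-in for the real encoding type (assumption of this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding { Utf8, Utf16Le, Utf16Be, Windows1252 }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EncodingSource { Bom, HttpHeader, MetaPrescan, Default }

/// Maps a small, illustrative subset of encoding labels (case-insensitive).
fn encoding_for_label(label: &str) -> Option<Encoding> {
    match label.trim().to_ascii_lowercase().as_str() {
        "utf-8" | "utf8" => Some(Encoding::Utf8),
        "utf-16le" | "utf-16" => Some(Encoding::Utf16Le),
        "utf-16be" => Some(Encoding::Utf16Be),
        "windows-1252" | "iso-8859-1" | "latin1" | "us-ascii" => Some(Encoding::Windows1252),
        _ => None,
    }
}

/// Checks the start of the byte stream for a BOM.
fn bom_encoding(bytes: &[u8]) -> Option<Encoding> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(Encoding::Utf8)
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some(Encoding::Utf16Le)
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some(Encoding::Utf16Be)
    } else {
        None
    }
}

/// Extracts the charset parameter value, accepting quoted and unquoted forms.
fn charset_from_content_type(header: &str) -> Option<&str> {
    // Skip the media type itself, then look at each `;`-separated parameter.
    header.split(';').skip(1).find_map(|param| {
        let (name, value) = param.split_once('=')?;
        if !name.trim().eq_ignore_ascii_case("charset") {
            return None;
        }
        let value = value.trim();
        // charset="utf-8" and charset=utf-8 both yield utf-8.
        Some(value.strip_prefix('"').and_then(|v| v.strip_suffix('"')).unwrap_or(value))
    })
}

/// Naive prescan: looks for `charset=` inside `<meta ...>` in the first 1024 bytes.
fn prescan_meta(bytes: &[u8]) -> Option<Encoding> {
    let head = &bytes[..bytes.len().min(1024)];
    // Lowercased lossy copy: only ASCII markup matters for this scan.
    let text = String::from_utf8_lossy(head).to_ascii_lowercase();
    let mut rest = text.as_str();
    while let Some(start) = rest.find("<meta") {
        let tag = &rest[start..];
        let tag = &tag[..tag.find('>').unwrap_or(tag.len())];
        if let Some(pos) = tag.find("charset") {
            let after = tag[pos + "charset".len()..].trim_start();
            if let Some(value) = after.strip_prefix('=') {
                let is_quote = |c: char| c == '"' || c == '\'';
                let value = value.trim_start().trim_start_matches(is_quote);
                let end = value
                    .find(|c: char| is_quote(c) || c == ';' || c == '/' || c.is_ascii_whitespace())
                    .unwrap_or(value.len());
                if let Some(enc) = encoding_for_label(&value[..end]) {
                    // A UTF-16 label found by the prescan is treated as UTF-8,
                    // as the HTML spec's prescan does.
                    return Some(match enc {
                        Encoding::Utf16Le | Encoding::Utf16Be => Encoding::Utf8,
                        other => other,
                    });
                }
            }
        }
        rest = &rest[start + "<meta".len()..];
    }
    None
}

/// Applies the sources in priority order: BOM, HTTP header, meta prescan, default.
pub fn sniff_encoding(bytes: &[u8], http_content_type: Option<&str>) -> (Encoding, EncodingSource) {
    if let Some(enc) = bom_encoding(bytes) {
        return (enc, EncodingSource::Bom);
    }
    let from_header = http_content_type
        .and_then(charset_from_content_type)
        .and_then(encoding_for_label);
    if let Some(enc) = from_header {
        return (enc, EncodingSource::HttpHeader);
    }
    if let Some(enc) = prescan_meta(bytes) {
        return (enc, EncodingSource::MetaPrescan);
    }
    (Encoding::Windows1252, EncodingSource::Default)
}
```

In this sketch an unrecognized label in the HTTP header is ignored and detection falls through to the meta prescan, rather than being treated as an error.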
Acceptance Criteria
- BOM detection works for UTF-8, UTF-16LE, UTF-16BE
- HTTP charset extraction handles quoted and unquoted values
- Meta prescan finds charset in <meta> tags within first 1024 bytes
- Correct priority: BOM > HTTP header > meta prescan > default
- No external dependencies, no unsafe
- Unit tests for each detection method and priority ordering
Dependencies
Depends on: WHATWG Encoding: UTF-8 and UTF-16 codecs, Legacy single-byte encodings