Phase 8 - Resource Loading + Character Encoding + Real Page Loading
Implement UTF-8 and UTF-16 text encoding/decoding in the encoding crate per the WHATWG Encoding Standard.
Requirements
- UTF-8 decoder: streaming byte-to-codepoint decoder with error handling (replacement character U+FFFD for invalid sequences)
- UTF-8 encoder: codepoint-to-byte encoder
- UTF-16LE decoder: decode UTF-16 little-endian byte sequences to codepoints, with BOM handling
- UTF-16BE decoder: decode UTF-16 big-endian byte sequences to codepoints, with BOM handling
- Surrogate pair handling: proper decoding of surrogate pairs in UTF-16
- Public API: decode(bytes, encoding) -> String and encode(text, encoding) -> Vec<u8>
- Encoding enum/label lookup per WHATWG spec (e.g., "utf-8", "utf8", "unicode-1-1-utf-8" all map to UTF-8)
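To make the label lookup and surrogate pair requirements concrete, here is a minimal sketch of one possible shape. The names Encoding, for_label, and decode_utf16 are hypothetical and not taken from the crate, the label table covers only a few of the labels in the WHATWG table, and BOM handling is simplified to stripping a BOM that matches the requested byte order.

```rust
/// Hypothetical encoding enum; the variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

impl Encoding {
    /// Look up an encoding by label: strip ASCII whitespace, then compare
    /// case-insensitively. Only a subset of the WHATWG labels is listed.
    pub fn for_label(label: &str) -> Option<Encoding> {
        let trimmed =
            label.trim_matches(|c: char| matches!(c, '\t' | '\n' | '\x0C' | '\r' | ' '));
        match trimmed.to_ascii_lowercase().as_str() {
            "utf-8" | "utf8" | "unicode-1-1-utf-8" => Some(Encoding::Utf8),
            // In the WHATWG table the bare label "utf-16" means UTF-16LE.
            "utf-16" | "utf-16le" => Some(Encoding::Utf16Le),
            "utf-16be" => Some(Encoding::Utf16Be),
            _ => None,
        }
    }
}

/// Decode UTF-16 bytes to a String. Lone surrogates and a trailing odd
/// byte are replaced with U+FFFD.
pub fn decode_utf16(bytes: &[u8], big_endian: bool) -> String {
    // Simplification: strip only a BOM that matches the requested byte order.
    let bom: [u8; 2] = if big_endian { [0xFE, 0xFF] } else { [0xFF, 0xFE] };
    let bytes = bytes.strip_prefix(&bom[..]).unwrap_or(bytes);

    // Group the bytes into 16-bit code units.
    let mut pairs = bytes.chunks_exact(2);
    let units: Vec<u16> = pairs
        .by_ref()
        .map(|p| {
            let pair = [p[0], p[1]];
            if big_endian { u16::from_be_bytes(pair) } else { u16::from_le_bytes(pair) }
        })
        .collect();

    let mut out = String::new();
    let mut i = 0;
    while i < units.len() {
        let unit = units[i];
        i += 1;
        let code_point: u32 = match unit {
            // Lead surrogate: combine with the next unit if it is a trail surrogate.
            0xD800..=0xDBFF => match units.get(i).copied() {
                Some(trail @ 0xDC00..=0xDFFF) => {
                    i += 1;
                    0x10000 + (((unit as u32) - 0xD800) << 10) + ((trail as u32) - 0xDC00)
                }
                // Unpaired lead: emit U+FFFD and reprocess the next unit normally.
                _ => 0xFFFD,
            },
            // Trail surrogate with no preceding lead.
            0xDC00..=0xDFFF => 0xFFFD,
            _ => unit as u32,
        };
        out.push(char::from_u32(code_point).unwrap_or('\u{FFFD}'));
    }
    // An odd number of bytes leaves half a code unit at the end of input.
    if !pairs.remainder().is_empty() {
        out.push('\u{FFFD}');
    }
    out
}
```

With this shape, decode(bytes, encoding) could dispatch on the Encoding variant and call decode_utf16 with the matching byte order. Building a Vec<u16> first keeps the sketch short; a streaming decoder would carry the pending lead surrogate and any leftover byte as state instead.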
Acceptance Criteria
- UTF-8 decoder passes all valid/invalid sequence tests
- UTF-16LE/BE decoders handle surrogate pairs and BOM
- Encoding label lookup is case-insensitive and handles aliases
- No external dependencies, no unsafe code
- Unit tests with edge cases (overlong sequences, lone surrogates, BOM stripping)
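The UTF-8 edge cases listed above (overlong sequences, lone surrogates, BOM stripping) can be illustrated with a small streaming decoder. The sketch below is modeled on the state machine in the WHATWG Encoding Standard's UTF-8 decoder. The names Utf8Decoder, feed, finish, and decode_utf8 are hypothetical, and it is a starting point under those assumptions, not the crate's implementation.

```rust
/// Streaming UTF-8 decoder state (hypothetical type, modeled on the WHATWG
/// "UTF-8 decoder" algorithm). Uses only std and no unsafe code.
pub struct Utf8Decoder {
    code_point: u32,
    bytes_needed: u8,
    bytes_seen: u8,
    lower: u8, // lowest acceptable value for the next continuation byte
    upper: u8, // highest acceptable value for the next continuation byte
}

impl Utf8Decoder {
    pub fn new() -> Self {
        Utf8Decoder { code_point: 0, bytes_needed: 0, bytes_seen: 0, lower: 0x80, upper: 0xBF }
    }

    fn reset(&mut self) {
        *self = Utf8Decoder::new();
    }

    /// Decode one chunk, appending to `out`. A sequence may span chunks.
    pub fn feed(&mut self, chunk: &[u8], out: &mut String) {
        let mut i = 0;
        while i < chunk.len() {
            let b = chunk[i];
            if self.bytes_needed == 0 {
                i += 1;
                match b {
                    0x00..=0x7F => out.push(b as char),
                    // 0xC0 and 0xC1 are excluded: they could only start overlong forms.
                    0xC2..=0xDF => {
                        self.bytes_needed = 1;
                        self.code_point = (b & 0x1F) as u32;
                    }
                    0xE0..=0xEF => {
                        if b == 0xE0 { self.lower = 0xA0; } // rejects overlong 3-byte forms
                        if b == 0xED { self.upper = 0x9F; } // rejects surrogates U+D800..U+DFFF
                        self.bytes_needed = 2;
                        self.code_point = (b & 0x0F) as u32;
                    }
                    0xF0..=0xF4 => {
                        if b == 0xF0 { self.lower = 0x90; } // rejects overlong 4-byte forms
                        if b == 0xF4 { self.upper = 0x8F; } // rejects values above U+10FFFF
                        self.bytes_needed = 3;
                        self.code_point = (b & 0x07) as u32;
                    }
                    _ => out.push('\u{FFFD}'),
                }
            } else if b < self.lower || b > self.upper {
                // Invalid continuation: emit U+FFFD and reprocess this byte
                // as a possible start byte (i is deliberately not advanced).
                self.reset();
                out.push('\u{FFFD}');
            } else {
                i += 1;
                self.lower = 0x80;
                self.upper = 0xBF;
                self.code_point = (self.code_point << 6) | (b & 0x3F) as u32;
                self.bytes_seen += 1;
                if self.bytes_seen == self.bytes_needed {
                    out.push(char::from_u32(self.code_point).unwrap_or('\u{FFFD}'));
                    self.reset();
                }
            }
        }
    }

    /// Signal end of input; a truncated sequence becomes one U+FFFD.
    pub fn finish(&mut self, out: &mut String) {
        if self.bytes_needed != 0 {
            self.reset();
            out.push('\u{FFFD}');
        }
    }
}

/// One-shot helper: strip a leading UTF-8 BOM, then decode everything.
pub fn decode_utf8(bytes: &[u8]) -> String {
    const BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];
    let bytes = bytes.strip_prefix(&BOM[..]).unwrap_or(bytes);
    let mut out = String::new();
    let mut decoder = Utf8Decoder::new();
    decoder.feed(bytes, &mut out);
    decoder.finish(&mut out);
    out
}
```

Under this sketch, the overlong pair C0 AF decodes to two replacement characters, the encoded surrogate ED A0 80 to three, and a sequence split across two feed calls decodes the same as if it had arrived in one chunk. These make natural first unit tests for the criteria above.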
Context
The encoding crate currently exists as a stub (lib.rs with only a header comment). This issue implements the core codecs that all other encoding work depends on.