WHATWG Encoding: UTF-8 and UTF-16 codecs #70

Phase 8 — Resource Loading + Character Encoding + Real Page Loading

Implement UTF-8 and UTF-16 text encoding/decoding in the encoding crate per the WHATWG Encoding Standard.

Requirements

UTF-8 decoder: streaming byte-to-codepoint decoder with error handling (replacement character U+FFFD for invalid sequences)
UTF-8 encoder: codepoint-to-byte encoder
UTF-16LE decoder: decode UTF-16 little-endian byte sequences to codepoints, with BOM handling
UTF-16BE decoder: decode UTF-16 big-endian byte sequences to codepoints, with BOM handling
Surrogate pair handling: proper decoding of surrogate pairs in UTF-16
Public API: decode(bytes, encoding) -> String and encode(text, encoding) -> Vec<u8>
Encoding enum/label lookup per WHATWG spec (e.g., "utf-8", "utf8", "unicode-1-1-utf-8" all map to UTF-8)
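The label lookup and UTF-16 requirements above could be sketched roughly as follows. This is a minimal illustration, not the crate's actual design: the names `Encoding`, `for_label`, and `decode_utf16` are assumptions made for the example, and the label lists are the UTF-8, UTF-16BE, and UTF-16LE labels as I understand them from the WHATWG Encoding Standard. The decoder strips a BOM that matches its own byte order and substitutes U+FFFD for unpaired surrogates and a trailing odd byte.

```rust
/// Hypothetical enum; the issue only asks for "Encoding enum/label lookup".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

impl Encoding {
    /// Maps a label to an encoding: strip leading/trailing ASCII whitespace,
    /// then compare ASCII-case-insensitively against the known aliases.
    pub fn for_label(label: &str) -> Option<Encoding> {
        // ASCII whitespace here means tab, LF, FF, CR, and space.
        let trimmed = label.trim_matches(|c| matches!(c, '\t' | '\n' | '\x0C' | '\r' | ' '));
        match trimmed.to_ascii_lowercase().as_str() {
            "unicode-1-1-utf-8" | "unicode11utf8" | "unicode20utf8" | "utf-8" | "utf8"
            | "x-unicode20utf8" => Some(Encoding::Utf8),
            "unicodefffe" | "utf-16be" => Some(Encoding::Utf16Be),
            "csunicode" | "iso-10646-ucs-2" | "ucs-2" | "unicode" | "unicodefeff" | "utf-16"
            | "utf-16le" => Some(Encoding::Utf16Le),
            _ => None,
        }
    }
}

/// Hypothetical helper that a public `decode` could dispatch to.
/// Decodes UTF-16 bytes in the given byte order.
pub fn decode_utf16(bytes: &[u8], big_endian: bool) -> String {
    // Strip a leading BOM only if it matches the requested byte order.
    let bom: [u8; 2] = if big_endian { [0xFE, 0xFF] } else { [0xFF, 0xFE] };
    let body = if bytes.starts_with(&bom) { &bytes[2..] } else { bytes };

    let mut out = String::new();
    let mut lead: Option<u16> = None; // pending high (lead) surrogate
    for pair in body.chunks_exact(2) {
        let unit = if big_endian {
            u16::from_be_bytes([pair[0], pair[1]])
        } else {
            u16::from_le_bytes([pair[0], pair[1]])
        };
        if let Some(high) = lead.take() {
            if (0xDC00..=0xDFFF).contains(&unit) {
                // Valid pair: combine into a supplementary-plane code point.
                let cp = 0x10000 + ((u32::from(high) - 0xD800) << 10) + (u32::from(unit) - 0xDC00);
                out.push(char::from_u32(cp).unwrap_or('\u{FFFD}'));
                continue;
            }
            // Unpaired high surrogate: emit U+FFFD, then handle `unit` normally.
            out.push('\u{FFFD}');
        }
        match unit {
            0xD800..=0xDBFF => lead = Some(unit),
            0xDC00..=0xDFFF => out.push('\u{FFFD}'), // lone low surrogate
            _ => out.push(char::from_u32(u32::from(unit)).unwrap_or('\u{FFFD}')),
        }
    }
    // A dangling high surrogate or an odd trailing byte is one error at end of input.
    if lead.is_some() || body.len() % 2 == 1 {
        out.push('\u{FFFD}');
    }
    out
}
```

Taking a whole slice keeps the sketch short; a streaming variant would need to carry the pending lead byte and lead surrogate across calls.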

Acceptance Criteria

UTF-8 decoder passes all tests for valid and invalid byte sequences, substituting U+FFFD for invalid input
UTF-16LE/BE decoders handle surrogate pairs and BOM
Encoding label lookup is case-insensitive and handles aliases
No external dependencies and no unsafe code
Unit tests with edge cases (overlong sequences, lone surrogates, BOM stripping)
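To make the UTF-8 edge cases concrete, here is one possible shape for a streaming decoder. It is a sketch under assumptions, not the planned implementation: `Utf8Decoder`, `feed`, `finish`, and `decode_utf8` are hypothetical names, and the state machine follows the lower/upper continuation-boundary approach described in the WHATWG Encoding Standard as I understand it. Overlong forms and encoded surrogates are rejected by narrowing the accepted range of the first continuation byte.

```rust
/// Hypothetical streaming UTF-8 decoder; state persists across `feed` calls.
pub struct Utf8Decoder {
    code_point: u32,
    remaining: u8, // continuation bytes still expected
    lower: u8,     // accepted range for the next continuation byte
    upper: u8,
}

impl Utf8Decoder {
    pub fn new() -> Self {
        Self { code_point: 0, remaining: 0, lower: 0x80, upper: 0xBF }
    }

    fn reset(&mut self) {
        *self = Self::new();
    }

    /// Decodes one chunk, appending text to `out`. May be called repeatedly.
    pub fn feed(&mut self, chunk: &[u8], out: &mut String) {
        let mut i = 0;
        while i < chunk.len() {
            let b = chunk[i];
            if self.remaining == 0 {
                i += 1;
                match b {
                    0x00..=0x7F => out.push(char::from(b)),
                    0xC2..=0xDF => {
                        self.remaining = 1;
                        self.code_point = u32::from(b & 0x1F);
                    }
                    0xE0..=0xEF => {
                        if b == 0xE0 { self.lower = 0xA0; } // rejects overlong 3-byte forms
                        if b == 0xED { self.upper = 0x9F; } // rejects surrogates U+D800..U+DFFF
                        self.remaining = 2;
                        self.code_point = u32::from(b & 0x0F);
                    }
                    0xF0..=0xF4 => {
                        if b == 0xF0 { self.lower = 0x90; } // rejects overlong 4-byte forms
                        if b == 0xF4 { self.upper = 0x8F; } // rejects values above U+10FFFF
                        self.remaining = 3;
                        self.code_point = u32::from(b & 0x07);
                    }
                    // 0x80..=0xC1 and 0xF5..=0xFF can never start a sequence.
                    _ => out.push('\u{FFFD}'),
                }
            } else if b < self.lower || b > self.upper {
                // Bad continuation: emit one U+FFFD and reprocess `b` as a lead byte
                // (note that `i` is deliberately not advanced here).
                self.reset();
                out.push('\u{FFFD}');
            } else {
                i += 1;
                self.lower = 0x80;
                self.upper = 0xBF;
                self.code_point = (self.code_point << 6) | u32::from(b & 0x3F);
                self.remaining -= 1;
                if self.remaining == 0 {
                    out.push(char::from_u32(self.code_point).unwrap_or('\u{FFFD}'));
                    self.code_point = 0;
                }
            }
        }
    }

    /// Signals end of input; a truncated sequence becomes a single U+FFFD.
    pub fn finish(&mut self, out: &mut String) {
        if self.remaining != 0 {
            self.reset();
            out.push('\u{FFFD}');
        }
    }
}

/// One-shot convenience wrapper that also strips a leading UTF-8 BOM.
pub fn decode_utf8(bytes: &[u8]) -> String {
    let body = if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) { &bytes[3..] } else { bytes };
    let mut out = String::new();
    let mut decoder = Utf8Decoder::new();
    decoder.feed(body, &mut out);
    decoder.finish(&mut out);
    out
}
```

With this sketch, the overlong pair `C0 80` yields two U+FFFD characters, and both the overlong `E0 80 80` and the encoded surrogate `ED A0 80` yield three, because each rejected byte is reconsidered as a potential lead byte. BOM stripping is done only in the one-shot wrapper here; a streaming caller would have to handle it separately.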

Context

The encoding crate currently exists as a stub (lib.rs with only a header comment). This issue implements the core codecs that all other encoding work depends on.