we (web engine): Experimental web browser project to understand the limits of Claude

WHATWG Encoding: UTF-8 and UTF-16 codecs #70

Status: open. Opened by pierrelf.com

Phase 8: Resource Loading + Character Encoding + Real Page Loading

Implement UTF-8 and UTF-16 text encoding/decoding in the encoding crate per the WHATWG Encoding Standard.

Requirements

  • UTF-8 decoder: streaming byte-to-codepoint decoder with error handling (replacement character U+FFFD for invalid sequences)
  • UTF-8 encoder: codepoint-to-byte encoder
  • UTF-16LE decoder: decode UTF-16 little-endian byte sequences to codepoints, with BOM handling
  • UTF-16BE decoder: decode UTF-16 big-endian byte sequences to codepoints
  • Surrogate pair handling: proper decoding of surrogate pairs in UTF-16
  • Public API: decode(bytes, encoding) -> String and encode(text, encoding) -> Vec<u8>
  • Encoding enum/label lookup per WHATWG spec (e.g., "utf-8", "utf8", "unicode-1-1-utf-8" all map to UTF-8)
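
As a rough illustration of the label lookup and UTF-16 requirements above, here is a minimal sketch using only the standard library and no unsafe. The names Encoding, lookup, and decode_utf16 are hypothetical and not taken from the crate; the label table covers only the WHATWG labels for UTF-8 and UTF-16, and BOM handling is simplified to dropping a leading U+FEFF that matches the chosen byte order.

```rust
/// Encodings covered by this issue (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Map a label to an encoding: trim ASCII whitespace, then compare
/// case-insensitively. Labels for other encodings return None in this sketch.
pub fn lookup(label: &str) -> Option<Encoding> {
    // WHATWG ASCII whitespace: TAB, LF, FF, CR, SPACE.
    let trimmed =
        label.trim_matches(|c: char| matches!(c, '\t' | '\n' | '\x0C' | '\r' | ' '));
    match trimmed.to_ascii_lowercase().as_str() {
        "unicode-1-1-utf-8" | "unicode11utf8" | "unicode20utf8" | "utf-8" | "utf8"
        | "x-unicode20utf8" => Some(Encoding::Utf8),
        "unicodefffe" | "utf-16be" => Some(Encoding::Utf16Be),
        "csunicode" | "iso-10646-ucs-2" | "ucs-2" | "unicode" | "unicodefeff"
        | "utf-16" | "utf-16le" => Some(Encoding::Utf16Le),
        _ => None,
    }
}

/// Decode UTF-16 bytes in the given byte order, replacing errors with U+FFFD.
pub fn decode_utf16(bytes: &[u8], big_endian: bool) -> String {
    let mut out = String::new();
    let mut lead: Option<u16> = None; // pending lead (high) surrogate
    let mut first = true;
    let mut chunks = bytes.chunks_exact(2);
    for pair in &mut chunks {
        let unit = if big_endian {
            u16::from_be_bytes([pair[0], pair[1]])
        } else {
            u16::from_le_bytes([pair[0], pair[1]])
        };
        if first {
            first = false;
            if unit == 0xFEFF {
                continue; // drop a leading BOM in the matching byte order
            }
        }
        if let Some(hi) = lead.take() {
            if (0xDC00..=0xDFFF).contains(&unit) {
                // Valid pair: combine into a supplementary code point.
                let cp = 0x10000 + (((hi as u32) - 0xD800) << 10) + ((unit as u32) - 0xDC00);
                out.push(char::from_u32(cp).unwrap_or('\u{FFFD}'));
                continue;
            }
            // Lone lead surrogate: emit U+FFFD, then handle `unit` normally.
            out.push('\u{FFFD}');
        }
        match unit {
            0xD800..=0xDBFF => lead = Some(unit),
            0xDC00..=0xDFFF => out.push('\u{FFFD}'), // lone trail surrogate
            _ => out.push(char::from_u32(unit as u32).unwrap_or('\u{FFFD}')),
        }
    }
    // Truncated input (dangling lead surrogate or odd trailing byte): one error.
    if lead.is_some() || !chunks.remainder().is_empty() {
        out.push('\u{FFFD}');
    }
    out
}
```

A public decode(bytes, encoding) function could dispatch on the Encoding value to a decoder like this one. The boolean byte-order parameter is only a convenience for the sketch.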

Acceptance Criteria

  • UTF-8 decoder passes all valid/invalid sequence tests
  • UTF-16LE/BE decoders handle surrogate pairs and BOM
  • Encoding label lookup is case-insensitive and handles aliases
  • No external dependencies, no unsafe
  • Unit tests with edge cases (overlong sequences, lone surrogates, BOM stripping)
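
To make the edge cases above concrete, the following is a minimal sketch of a UTF-8 decoder modeled on the state machine in the WHATWG Encoding Standard (bytes needed plus a lower and upper bound for the next continuation byte). The function name decode_utf8 is hypothetical, and the sketch decodes a complete buffer in one call rather than streaming; a streaming version would keep the same state across calls.

```rust
/// Decode UTF-8 bytes to a String, replacing each invalid sequence with U+FFFD.
pub fn decode_utf8(bytes: &[u8]) -> String {
    const BOM: [u8; 3] = [0xEF, 0xBB, 0xBF];
    // Strip a leading UTF-8 BOM if present.
    let bytes = if bytes.starts_with(&BOM) { &bytes[3..] } else { bytes };

    let mut out = String::new();
    let mut cp: u32 = 0; // code point being assembled
    let mut needed: u8 = 0; // continuation bytes still expected
    let (mut lower, mut upper) = (0x80u8, 0xBFu8); // allowed range for next byte
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if needed == 0 {
            match b {
                0x00..=0x7F => out.push(b as char),
                // 0xC0 and 0xC1 are excluded: they could only start overlong forms.
                0xC2..=0xDF => {
                    needed = 1;
                    cp = (b & 0x1F) as u32;
                }
                0xE0..=0xEF => {
                    if b == 0xE0 {
                        lower = 0xA0; // rejects overlong 3-byte forms
                    }
                    if b == 0xED {
                        upper = 0x9F; // rejects encoded surrogates
                    }
                    needed = 2;
                    cp = (b & 0x0F) as u32;
                }
                0xF0..=0xF4 => {
                    if b == 0xF0 {
                        lower = 0x90; // rejects overlong 4-byte forms
                    }
                    if b == 0xF4 {
                        upper = 0x8F; // rejects code points above U+10FFFF
                    }
                    needed = 3;
                    cp = (b & 0x07) as u32;
                }
                _ => out.push('\u{FFFD}'),
            }
            i += 1;
        } else if b < lower || b > upper {
            // Bad continuation byte: emit one error, reset the state, and
            // reprocess this byte as a potential lead byte (no `i += 1`).
            needed = 0;
            lower = 0x80;
            upper = 0xBF;
            out.push('\u{FFFD}');
        } else {
            lower = 0x80;
            upper = 0xBF;
            cp = (cp << 6) | (b & 0x3F) as u32;
            needed -= 1;
            if needed == 0 {
                out.push(char::from_u32(cp).unwrap_or('\u{FFFD}'));
            }
            i += 1;
        }
    }
    if needed != 0 {
        out.push('\u{FFFD}'); // input ended in the middle of a sequence
    }
    out
}
```

Under this approach the overlong sequence C0 AF yields two replacement characters, and the encoded surrogate ED A0 80 yields three, because each rejected byte is reprocessed as a possible lead byte.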

Context

The encoding crate currently exists as a stub (lib.rs with only a header comment). This issue implements the core codecs that all other encoding work depends on.

AT URI
at://did:plc:meotu43t6usg4qdwzenk4s2t/sh.tangled.repo.issue/3mhkt5dfoaq2e