we (web engine): Experimental web browser project to understand the limits of Claude

WHATWG Encoding: Legacy single-byte encodings #71

Open. Opened by pierrelf.com

Phase 8: Resource Loading + Character Encoding + Real Page Loading

Implement legacy single-byte text encodings in the encoding crate per the WHATWG Encoding Standard.

Requirements

  • Windows-1252 (aka cp1252): the most common legacy Western encoding
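  • ISO-8859-1 (Latin-1): in the WHATWG Encoding Standard, iso-8859-1 and latin1 are labels for windows-1252 rather than a separate encoding, so they decode through the Windows-1252 table and do not map bytes 0x80–0x9F straight to the first 256 Unicode code points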
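  • ISO-8859-2 through ISO-8859-16: Central/Eastern European, Cyrillic, Greek, Arabic, Hebrew, etc. The standard defines ISO-8859-2 to -8, ISO-8859-8-I, and ISO-8859-10, -13, -14, -15, -16; iso-8859-9 is a label for windows-1254 and iso-8859-11 a label for windows-874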
  • Windows-874, Windows-1250 through Windows-1258: Windows codepages
  • macintosh (Mac OS Roman)
  • IBM866, KOI8-R, KOI8-U: Cyrillic encodings

Each single-byte encoding is a 128-entry lookup table mapping bytes 0x80–0xFF to Unicode codepoints (bytes 0x00–0x7F are ASCII).
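
As a rough illustration of the table layout described above, the sketch below builds the Windows-1252 table at compile time. The names WINDOWS_1252_HIGH, build_windows_1252 and WINDOWS_1252 are hypothetical and not taken from the encoding crate. It relies on the fact that in the WHATWG windows-1252 index only bytes 0x80–0x9F differ from an identity mapping.

```rust
/// Code points for bytes 0x80..=0x9F in windows-1252, in byte order.
/// Bytes such as 0x81 and 0x8D map to the matching C1 control code points.
const WINDOWS_1252_HIGH: [u16; 32] = [
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

/// Builds the full 128-entry table, indexed by `byte - 0x80`.
const fn build_windows_1252() -> [u16; 128] {
    let mut table = [0u16; 128];
    let mut i = 0;
    while i < 128 {
        table[i] = if i < 32 {
            WINDOWS_1252_HIGH[i]
        } else {
            // Bytes 0xA0..=0xFF map to the code point with the same value.
            0x80 + i as u16
        };
        i += 1;
    }
    table
}

/// Hypothetical name for the finished table (entry 0 corresponds to byte 0x80).
pub const WINDOWS_1252: [u16; 128] = build_windows_1252();
```

Most other tables have no such shortcut and would probably be written out as 128 literal entries copied from the standard's index files.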

Implementation

  • Define each encoding as a [u16; 128] or [char; 128] lookup table (indexed by byte - 0x80)
  • Decoder: for bytes < 0x80, use ASCII; for bytes >= 0x80, look up in table
  • Unmapped bytes produce U+FFFD (replacement character)
  • Register all WHATWG encoding labels as aliases
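
The decoding steps listed above could be sketched roughly as follows. This is a minimal sketch, not the crate's API: decode_single_byte, UNMAPPED and DEMO_TABLE are hypothetical names, and the demo table is invented for illustration rather than being a real WHATWG index. It assumes unmapped entries are stored as 0xFFFD directly in the table, which is one of several possible conventions.

```rust
/// Assumed convention: unmapped table entries hold U+FFFD directly.
const UNMAPPED: u16 = 0xFFFD;

/// Made-up table for illustration only (NOT a real WHATWG index):
/// identity mapping for 0x80..=0xFF, except byte 0x81, which is unmapped.
const fn demo_table() -> [u16; 128] {
    let mut t = [0u16; 128];
    let mut i = 0;
    while i < 128 {
        t[i] = 0x80 + i as u16;
        i += 1;
    }
    t[1] = UNMAPPED; // index 1 corresponds to byte 0x81
    t
}

const DEMO_TABLE: [u16; 128] = demo_table();

/// Decodes `bytes` with a 128-entry table indexed by `byte - 0x80`.
pub fn decode_single_byte(bytes: &[u8], table: &[u16; 128]) -> String {
    let mut out = String::with_capacity(bytes.len());
    for &b in bytes {
        if b < 0x80 {
            // ASCII range passes through unchanged.
            out.push(b as char);
        } else {
            let cp = table[(b - 0x80) as usize];
            // A u16 in the surrogate range is not a valid char, so fall
            // back to the replacement character instead of panicking.
            out.push(char::from_u32(cp as u32).unwrap_or('\u{FFFD}'));
        }
    }
    out
}
```

With the demo table, decode_single_byte(b"caf\xE9 \x81", &DEMO_TABLE) yields "café" followed by a space and U+FFFD. Storing U+FFFD in the table keeps the hot loop branch-free, at the cost of not being able to tell an unmapped byte apart from a mapped replacement character.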

Acceptance Criteria

  • Windows-1252 and ISO-8859-1 (a WHATWG label for windows-1252) decode every byte 0x00–0xFF to the code point the standard specifies
  • At least 5 additional single-byte encodings implemented
  • All WHATWG-specified label aliases map correctly
  • No external dependencies, no unsafe
  • Unit tests with known byte sequences and expected Unicode output
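
For the label-alias criterion, a lookup might look something like the sketch below. The function name encoding_for_label is hypothetical, and the match arms cover only a small subset of the labels in the WHATWG table; a real registry would list all of them. The standard's lookup strips leading and trailing ASCII whitespace and compares ASCII case-insensitively, which is what this sketch tries to mirror.

```rust
/// Resolves a label to a canonical encoding name, or `None` if unknown.
/// Only a handful of WHATWG labels are included here for illustration.
pub fn encoding_for_label(label: &str) -> Option<&'static str> {
    // Strip ASCII whitespace (TAB, LF, FF, CR, SPACE) from both ends.
    let trimmed = label.trim_matches(|c: char| c.is_ascii_whitespace());
    // Labels are matched ASCII case-insensitively.
    let lower = trimmed.to_ascii_lowercase();
    match lower.as_str() {
        "ascii" | "cp1252" | "iso-8859-1" | "l1" | "latin1" | "us-ascii"
        | "windows-1252" | "x-cp1252" => Some("windows-1252"),
        "cskoi8r" | "koi" | "koi8" | "koi8-r" | "koi8_r" => Some("KOI8-R"),
        "koi8-ru" | "koi8-u" => Some("KOI8-U"),
        "866" | "cp866" | "csibm866" | "ibm866" => Some("IBM866"),
        "csmacintosh" | "mac" | "macintosh" | "x-mac-roman" => Some("macintosh"),
        _ => None,
    }
}
```

A unit test for this criterion could then assert, for example, that encoding_for_label("  Latin1\n") returns Some("windows-1252") and that an unknown label such as "utf-7" returns None.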

Dependencies

Depends on: WHATWG Encoding: UTF-8 and UTF-16 codecs (for the shared Encoding trait/API)

AT URI
at://did:plc:meotu43t6usg4qdwzenk4s2t/sh.tangled.repo.issue/3mhkt5plwbw2k