
we (web engine): Experimental web browser project to understand the limits of Claude

WHATWG Encoding: Legacy single-byte encodings #71

Open. Opened by pierrelf.com 1 week ago.

Phase 8: Resource Loading + Character Encoding + Real Page Loading

Implement legacy single-byte text encodings in the encoding crate per the WHATWG Encoding Standard.

Requirements

Windows-1252 (aka cp1252): the most common legacy Western encoding
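ISO-8859-1 (Latin-1): the WHATWG standard treats iso-8859-1 and latin1 as labels for windows-1252 rather than a separate encoding, so bytes 0x80–0x9F decode through the windows-1252 table instead of mapping directly to U+0080–U+009F
ISO-8859-2 through ISO-8859-16: Central/Eastern European, Cyrillic, Arabic, Greek, Hebrew, etc. (the WHATWG standard defines no separate ISO-8859-9, -11, or -12; iso-8859-9 and iso-8859-11 are labels for windows-1254 and windows-874)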
Windows-874, Windows-1250 through Windows-1258: Windows codepages
macintosh (Mac OS Roman)
IBM866, KOI8-R, KOI8-U: Cyrillic encodings

Each single-byte encoding is a 128-entry lookup table mapping bytes 0x80–0xFF to Unicode codepoints (bytes 0x00–0x7F are ASCII).
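A minimal sketch of that table layout for windows-1252, assuming a [u16; 128] table indexed by byte - 0x80. The names WINDOWS_1252_LOW, build_windows_1252 and WINDOWS_1252 are illustrative and not taken from the crate. Only bytes 0x80–0x9F differ from the identity mapping, so the sketch spells out those 32 entries and fills the rest in a const fn:

```rust
/// windows-1252 code points for bytes 0x80..=0x9F.
/// Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D map to the C1 control of the same value.
const WINDOWS_1252_LOW: [u16; 32] = [
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, // 0x80..=0x87
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F, // 0x88..=0x8F
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, // 0x90..=0x97
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178, // 0x98..=0x9F
];

/// Builds the full 128-entry table at compile time.
/// Index i corresponds to byte 0x80 + i.
const fn build_windows_1252() -> [u16; 128] {
    let mut table = [0u16; 128];
    let mut i = 0;
    while i < 128 {
        table[i] = if i < 32 {
            WINDOWS_1252_LOW[i]
        } else {
            // Bytes 0xA0..=0xFF map to the code point with the same value.
            0x80 + i as u16
        };
        i += 1;
    }
    table
}

/// Hypothetical table name; indexed by `byte - 0x80`.
pub const WINDOWS_1252: [u16; 128] = build_windows_1252();
```

Most other encodings in the list have no such shortcut and would need all 128 entries written out from the WHATWG index files.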

Implementation

Define each encoding as a [u16; 128] or [char; 128] lookup table (indexed by byte - 0x80)
Decoder: for bytes < 0x80, use ASCII; for bytes >= 0x80, look up in table
Unmapped bytes produce U+FFFD (replacement character)
Register all WHATWG encoding labels as aliases
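The decoding steps above could look roughly like the following. This is a sketch, not the crate's API: decode_single_byte and UNMAPPED are hypothetical names, and storing 0xFFFD in unmapped table slots is an assumption made here so that a [u16; 128] table needs no separate "missing" marker.

```rust
/// Assumed sentinel for table slots the encoding leaves unmapped.
/// It is U+FFFD itself, so the decoder needs no special case for it.
pub const UNMAPPED: u16 = 0xFFFD;

/// Decodes `bytes` with a 128-entry table indexed by `byte - 0x80`.
pub fn decode_single_byte(bytes: &[u8], table: &[u16; 128]) -> String {
    let mut out = String::with_capacity(bytes.len());
    for &b in bytes {
        let ch = if b < 0x80 {
            // 0x00..=0x7F is plain ASCII.
            b as char
        } else {
            let cp = table[usize::from(b - 0x80)];
            // A u16 that is not a valid scalar value (a surrogate)
            // also falls back to the replacement character.
            char::from_u32(u32::from(cp)).unwrap_or('\u{FFFD}')
        };
        out.push(ch);
    }
    out
}
```

A [char; 128] table would avoid the char::from_u32 conversion at the cost of twice the table size; either choice fits the no-unsafe requirement.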

Acceptance Criteria

Windows-1252 decodes correctly, including input labeled iso-8859-1 (a windows-1252 label in the WHATWG standard)
At least 5 additional single-byte encodings implemented
All WHATWG-specified label aliases map correctly
No external dependencies, no unsafe
Unit tests with known byte sequences and expected Unicode output
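For the label criterion, a sketch of label resolution with a matching unit test. The WHATWG standard strips leading and trailing ASCII whitespace from a label and compares it ASCII case-insensitively. The enum SingleByteEncoding, the LABELS slice and encoding_for_label are hypothetical names, and the label list is only a subset of what the standard defines for these four encodings.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SingleByteEncoding {
    Windows1252,
    Ibm866,
    Koi8R,
    Koi8U,
}

/// Subset of the WHATWG labels, stored lowercase.
const LABELS: &[(&str, SingleByteEncoding)] = &[
    ("windows-1252", SingleByteEncoding::Windows1252),
    ("cp1252", SingleByteEncoding::Windows1252),
    ("x-cp1252", SingleByteEncoding::Windows1252),
    ("iso-8859-1", SingleByteEncoding::Windows1252),
    ("latin1", SingleByteEncoding::Windows1252),
    ("l1", SingleByteEncoding::Windows1252),
    ("ascii", SingleByteEncoding::Windows1252),
    ("us-ascii", SingleByteEncoding::Windows1252),
    ("ibm866", SingleByteEncoding::Ibm866),
    ("cp866", SingleByteEncoding::Ibm866),
    ("866", SingleByteEncoding::Ibm866),
    ("koi8-r", SingleByteEncoding::Koi8R),
    ("koi8", SingleByteEncoding::Koi8R),
    ("koi8-u", SingleByteEncoding::Koi8U),
    ("koi8-ru", SingleByteEncoding::Koi8U),
];

/// Resolves a label to an encoding, or None if the label is unknown.
pub fn encoding_for_label(label: &str) -> Option<SingleByteEncoding> {
    // ASCII whitespace: tab, LF, form feed, CR, space.
    let trimmed =
        label.trim_matches(|c: char| matches!(c, '\t' | '\n' | '\x0C' | '\r' | ' '));
    LABELS
        .iter()
        .find(|(name, _)| name.eq_ignore_ascii_case(trimmed))
        .map(|&(_, enc)| enc)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn labels_are_trimmed_and_case_insensitive() {
        assert_eq!(
            encoding_for_label(" Latin1\n"),
            Some(SingleByteEncoding::Windows1252)
        );
        assert_eq!(encoding_for_label("KOI8-R"), Some(SingleByteEncoding::Koi8R));
        assert_eq!(encoding_for_label("koi8-ru"), Some(SingleByteEncoding::Koi8U));
        assert_eq!(encoding_for_label("no-such-label"), None);
    }
}
```

A linear scan is enough for a sketch; a sorted slice with binary search would be one way to scale this to the full label list without external dependencies.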

Dependencies

Depends on: WHATWG Encoding: UTF-8 and UTF-16 codecs (for the shared Encoding trait/API)

