Implement a JavaScript lexer/tokenizer conforming to ECMAScript 2024 specification.
## Scope
Create the we-js crate with a tokenizer that converts JavaScript source text into a stream of tokens.
## Token Types

- Identifiers and keywords: all ES2024 keywords (`var`, `let`, `const`, `function`, `class`, `if`, `else`, `for`, `while`, `do`, `switch`, `case`, `break`, `continue`, `return`, `throw`, `try`, `catch`, `finally`, `new`, `delete`, `typeof`, `instanceof`, `void`, `in`, `of`, `import`, `export`, `default`, `async`, `await`, `yield`, etc.)
- Punctuators: all operators and delimiters (`+`, `-`, `*`, `/`, `%`, `**`, `=`, `==`, `===`, `!=`, `!==`, `<`, `>`, `<=`, `>=`, `&&`, `||`, `??`, `?.`, `...`, `=>`, etc.)
- Numeric literals: decimal, hex (`0x`), octal (`0o`), binary (`0b`), floating point, exponential notation
- String literals: single- and double-quoted, with escape sequences (`\n`, `\t`, `\uXXXX`, `\u{XXXXX}`, `\\`, etc.)
- Template literals: backtick strings with `${...}` interpolation support (TemplateHead, TemplateMiddle, TemplateTail tokens)
- Regular expression literals: `/pattern/flags`
- Comments: single-line (`//`) and multi-line (`/* */`); skip or optionally preserve
- Boolean/null literals: `true`, `false`, `null`
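A `Token` enum covering these categories might be sketched as follows; the variant names and payload types are illustrative, not mandated by this issue:

```rust
// Illustrative Token enum; variant names/payloads are a sketch, not the
// required API. Position data would be attached per the Features section.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Identifier(String),
    Keyword(String),    // e.g. "let", "async"
    Punctuator(String), // e.g. "=>", "?."
    Number(f64),
    String(String),     // contents with escapes resolved
    TemplateHead(String),
    TemplateMiddle(String),
    TemplateTail(String),
    Regex { pattern: String, flags: String },
    Boolean(bool),
    Null,
    Comment(String),    // only emitted if comments are preserved
}

fn main() {
    let t = Token::Keyword("let".to_string());
    assert_eq!(t, Token::Keyword("let".to_string()));
    assert_ne!(Token::Null, Token::Boolean(false));
    println!("{:?}", t);
}
```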
## Features
- Track source positions (line, column) for error reporting
- Handle Unicode identifiers (basic support: at minimum ASCII plus common Unicode letters)
- Distinguish the division operator `/` from the start of a RegExp literal based on context
- Automatic semicolon insertion awareness (track newlines between tokens)
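The division-vs-RegExp distinction can be driven by the kind of the previous significant token: after a value-producing token, `/` continues an expression (division); where an expression is expected, it begins a RegExp literal. A simplified sketch (the `PrevToken` kinds are hypothetical, and edge cases such as `)` closing an `if` condition are deliberately ignored here):

```rust
// Heuristic sketch, not the full spec rule: a `/` after a value-producing
// token is division; a `/` where an expression is expected starts a regex.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum PrevToken {
    None,           // start of input
    Identifier,     // e.g. `a` in `a / b`
    NumericLiteral,
    CloseParen,     // caveat: wrong for `if (x) /re/...`
    CloseBracket,
    Punctuator,     // e.g. `=`, `(`, `,`
    Keyword,        // e.g. `return`, `typeof`
}

fn slash_starts_regex(prev: PrevToken) -> bool {
    !matches!(
        prev,
        PrevToken::Identifier
            | PrevToken::NumericLiteral
            | PrevToken::CloseParen
            | PrevToken::CloseBracket
    )
}

fn main() {
    assert!(!slash_starts_regex(PrevToken::Identifier)); // `a / b` is division
    assert!(slash_starts_regex(PrevToken::Punctuator));  // `x = /re/` is a regex
    assert!(slash_starts_regex(PrevToken::Keyword));     // `return /re/` is a regex
}
```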
## Acceptance Criteria

- `Token` enum with all ES2024 token types
- `Lexer` struct that takes a `&str` source and produces a `Vec<Token>` or an iterator
- Correct tokenization of all numeric literal forms
- Correct string literal parsing with escape sequences
- Template literal tokenization
- Keyword vs identifier distinction
- Source position tracking (line/column on each token)
- Unit tests covering each token type, edge cases, and error handling
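As a starting point for the position-tracking criterion, here is a minimal sketch that tokenizes only identifier/number runs and single-character punctuators while maintaining line/column counters; the `tokenize` helper and `Token` shape are illustrative, not the required `Lexer` API:

```rust
// Minimal line/column bookkeeping sketch (assumed shapes, not the final API).
#[derive(Debug, PartialEq)]
struct Token {
    text: String,
    line: u32,
    col: u32,
}

fn tokenize(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let (mut line, mut col) = (1u32, 1u32); // 1-based positions
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c == '\n' {
            chars.next();
            line += 1;
            col = 1;
        } else if c.is_whitespace() {
            chars.next();
            col += 1;
        } else if c.is_ascii_alphanumeric() || c == '_' || c == '$' {
            // Identifier/number run; record where it started.
            let (start_line, start_col) = (line, col);
            let mut text = String::new();
            while let Some(&c) = chars.peek() {
                if c.is_ascii_alphanumeric() || c == '_' || c == '$' {
                    text.push(c);
                    chars.next();
                    col += 1;
                } else {
                    break;
                }
            }
            tokens.push(Token { text, line: start_line, col: start_col });
        } else {
            // Everything else: a single-character punctuator.
            chars.next();
            tokens.push(Token { text: c.to_string(), line, col });
            col += 1;
        }
    }
    tokens
}

fn main() {
    let toks = tokenize("let x =\n42;");
    assert_eq!(toks[0].text, "let");
    assert_eq!((toks[3].line, toks[3].col), (2, 1)); // `42` starts at line 2, col 1
    println!("{:?}", toks);
}
```

A real lexer would extend the same loop with the multi-character punctuators, string/template states, and the `/` disambiguation described above.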
Phase 10 — JavaScript Engine (issue 1 of 15). No dependencies.