# spacez
spaCy's en_core_web_sm NER pipeline, reimplemented in zig.
hash embeddings → residual CNN → BILUO transition parser, all from the original weights. no python, no runtime dependencies, ~6MB binary.
## install

```sh
zig fetch --save https://tangled.sh/@zzstoatzz.io/spacez/archive/main
```

then in `build.zig`:

```zig
const spacez = b.dependency("spacez", .{}).module("spacez");
exe.root_module.addImport("spacez", spacez);
```
## usage

```zig
const spacez = @import("spacez");

var model = try spacez.Model.load(spacez.en_core_web_sm);

var entities: [64]spacez.SpanEntity = undefined;
const n = spacez.recognize(&model, "Barack Obama visited Paris.", &entities);

for (entities[0..n]) |e| {
    // e.label = .PERSON, .GPE, etc.
    // e.start / e.end = byte offsets into source text
}
```
weights are embedded at compile time — no files to download or manage.
## what's here
- hash embeddings — murmurhash + bloom → 4 embedding tables, reduced via maxout
- CNN encoder — 4 residual blocks (seq2col → dense → maxout → layernorm → residual add)
- transition parser — greedy BILUO with 73 actions (18 entity types × 4 transitions + 1)
- tokenizer — full port of spaCy's English tokenizer (~1,000 special cases, prefix/suffix/infix rules)
- SIMD ops — vectorized matvec, vadd using zig's `@Vector` intrinsics
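the hash-embedding trick deserves a sketch: instead of a vocabulary table, each word hashes into several small tables, and a word is only indistinguishable from another if it collides in every table at once (the bloom-filter idea). here's an illustrative Python sketch — table sizes are made up, `zlib.crc32` stands in for murmurhash, and the learned maxout reduction is simplified to an elementwise max:

```python
import zlib

import numpy as np

N_ROWS, WIDTH, N_TABLES = 1024, 96, 4  # illustrative sizes, not the real ones

def hashed_rows(word: str) -> list[int]:
    # one row index per table; each table gets its own seed
    # (zlib.crc32 stands in for murmurhash here)
    return [zlib.crc32(f"{seed}:{word}".encode()) % N_ROWS
            for seed in range(N_TABLES)]

rng = np.random.default_rng(0)
tables = [rng.standard_normal((N_ROWS, WIDTH)) for _ in range(N_TABLES)]

def embed(word: str) -> np.ndarray:
    # gather one candidate row from each table, then reduce the four
    # candidates elementwise (a stand-in for the learned maxout layer)
    rows = np.stack([t[i] for t, i in zip(tables, hashed_rows(word))])
    return rows.max(axis=0)  # shape (WIDTH,)
```

the payoff is that table size is fixed regardless of vocabulary: unseen words still get a (collision-shared) vector instead of a single OOV embedding.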
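the seq2col step in the CNN encoder is the part that gives each residual block context: every token's vector is concatenated with its neighbors' before the dense layer. a minimal Python sketch, assuming window size 1 and zero-padding at the edges:

```python
import numpy as np

def seq2col(X: np.ndarray, window: int = 1) -> np.ndarray:
    # concatenate each token's vector with its neighbors, zero-padded
    # at the edges: row t becomes [x[t-w] ... x[t] ... x[t+w]]
    n, d = X.shape
    pad = np.zeros((window, d))
    padded = np.vstack([pad, X, pad])
    return np.hstack([padded[i:i + n] for i in range(2 * window + 1)])
```

stacking four such blocks gives each token an effective receptive field of roughly nine tokens, which is where the model's "context" comes from despite every layer being local.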
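the parser side is easier to see from the decoding direction: one BILUO action per token, and a valid action sequence maps deterministically to spans. a hedged Python sketch (function name hypothetical, token-index spans rather than byte offsets):

```python
def biluo_to_spans(tags: list[str]) -> list[tuple[int, int, str]]:
    # convert one BILUO tag per token into (start, end, label) token spans
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":                       # outside any entity
            start = None
        elif tag.startswith("U-"):           # unit: single-token entity
            spans.append((i, i + 1, tag[2:]))
            start = None
        elif tag.startswith("B-"):           # begin a multi-token entity
            start = i
        elif tag.startswith("L-") and start is not None:  # last token
            spans.append((start, i + 1, tag[2:]))
            start = None
    return spans

# "Barack Obama visited Paris ."
print(biluo_to_spans(["B-PERSON", "L-PERSON", "O", "U-GPE", "O"]))
# → [(0, 2, 'PERSON'), (3, 4, 'GPE')]
```

the real parser is greedy in the forward direction — at each token it scores all 73 actions, masks the ones invalid in the current state, and commits to the best — but the tag-to-span mapping above is the invariant both directions share.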
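the tokenizer's core loop is worth sketching too: split on whitespace, peel prefix punctuation from the front, peel suffix punctuation from the back, then split what remains on infix rules. the rule tables below are tiny stand-ins for the real prefix/suffix/infix tables and ~1,000 special cases:

```python
import re

PREFIXES = ("(", '"')
SUFFIXES = ("'s", ")", ".", ",", "!", "?", '"')  # longest match listed first
INFIX = re.compile(r"((?<=[a-z])-(?=[a-z]))")     # keep the hyphen as a token

def tokenize(text: str) -> list[str]:
    tokens = []
    for chunk in text.split():
        # peel prefix punctuation from the front
        while chunk.startswith(PREFIXES):
            tokens.append(chunk[0])
            chunk = chunk[1:]
        # peel suffix punctuation from the back (re-emit in text order)
        tail = []
        while chunk and chunk.endswith(SUFFIXES):
            suffix = next(s for s in SUFFIXES if chunk.endswith(s))
            tail.append(suffix)
            chunk = chunk[: -len(suffix)]
        if chunk:
            tokens.extend(t for t in INFIX.split(chunk) if t)
        tokens.extend(reversed(tail))
    return tokens

print(tokenize("(Obama's dog-house.)"))
# → ['(', 'Obama', "'s", 'dog', '-', 'house', '.', ')']
```

matching spaCy exactly means matching this loop's order of operations, which is why the port reproduces the peel-then-split structure rather than using a single regex pass.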
## validation
73/73 test sentences produce identical entities and byte offsets to spaCy en_core_web_sm 3.8.0, including multi-byte UTF-8, contractions, hyphens, possessives, informal text, and zero-entity inputs.
```sh
# head-to-head comparison (requires spaCy + en_core_web_sm)
uv run --python 3.12 --with spacy \
  --with 'en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl' \
  python scripts/compare.py
```
## entity types
PERSON, ORG, GPE, PRODUCT, EVENT, WORK_OF_ART, FAC, NORP, LOC
(plus CARDINAL, DATE, LANGUAGE, MONEY, PERCENT, ORDINAL, QUANTITY, TIME, LAW — inherited from the model, for 18 types total)