# spacez
spaCy's en_core_web_sm NER pipeline, reimplemented in zig.
hash embeddings → residual CNN → BILUO transition parser, all from the original weights. no python, no runtime dependencies, ~6MB binary.
## install

```sh
zig fetch --save https://tangled.sh/@zzstoatzz.io/spacez/archive/main
```

then in `build.zig`:

```zig
const spacez = b.dependency("spacez", .{}).module("spacez");
exe.root_module.addImport("spacez", spacez);
```
## usage

```zig
const spacez = @import("spacez");

var model = try spacez.Model.load(spacez.en_core_web_sm);

var entities: [64]spacez.SpanEntity = undefined;
const n = spacez.recognize(&model, "Barack Obama visited Paris.", &entities);

for (entities[0..n]) |e| {
    // e.label = .PERSON, .GPE, etc.
    // e.start / e.end = byte offsets into source text
}
```
weights are embedded at compile time — no files to download or manage.
## what's here
- hash embeddings — murmurhash + bloom → 4 embedding tables, reduced via maxout
- CNN encoder — 4 residual blocks (seq2col → dense → maxout → layernorm → residual add)
- transition parser — greedy BILUO with 73 actions (18 entity types × 4 transitions + 1)
- tokenizer — full port of spaCy's English tokenizer (~1,000 special cases, prefix/suffix/infix rules)
- SIMD ops — vectorized matvec, vadd using zig's `@Vector` intrinsics
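the hash-embedding trick deserves a sketch: instead of a vocabulary table, each word hashes into several small tables, and a word is only indistinguishable from another if it collides in every table at once (the bloom-filter idea). here's an illustrative Python sketch — table sizes are made up, `zlib.crc32` stands in for murmurhash, and the learned maxout reduction is simplified to an elementwise max:

```python
import zlib

import numpy as np

N_ROWS, WIDTH, N_TABLES = 1024, 96, 4  # illustrative sizes, not the real ones

def hashed_rows(word: str) -> list[int]:
    # one row index per table; each table gets its own seed
    # (zlib.crc32 stands in for murmurhash here)
    return [zlib.crc32(f"{seed}:{word}".encode()) % N_ROWS
            for seed in range(N_TABLES)]

rng = np.random.default_rng(0)
tables = [rng.standard_normal((N_ROWS, WIDTH)) for _ in range(N_TABLES)]

def embed(word: str) -> np.ndarray:
    # gather one candidate row from each table, then reduce the four
    # candidates elementwise (a stand-in for the learned maxout layer)
    rows = np.stack([t[i] for t, i in zip(tables, hashed_rows(word))])
    return rows.max(axis=0)  # shape (WIDTH,)
```

the payoff is that table size is fixed regardless of vocabulary: unseen words still get a (collision-shared) vector instead of a single OOV embedding.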
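the seq2col step in the CNN encoder is the part that gives each residual block context: every token's vector is concatenated with its neighbors' before the dense layer. a minimal Python sketch, assuming window size 1 and zero-padding at the edges:

```python
import numpy as np

def seq2col(X: np.ndarray, window: int = 1) -> np.ndarray:
    # concatenate each token's vector with its neighbors, zero-padded
    # at the edges: row t becomes [x[t-w] ... x[t] ... x[t+w]]
    n, d = X.shape
    pad = np.zeros((window, d))
    padded = np.vstack([pad, X, pad])
    return np.hstack([padded[i:i + n] for i in range(2 * window + 1)])
```

stacking four such blocks gives each token an effective receptive field of roughly nine tokens, which is where the model's "context" comes from despite every layer being local.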
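the parser side is easier to see from the decoding direction: one BILUO action per token, and a valid action sequence maps deterministically to spans. a hedged Python sketch (function name hypothetical, token-index spans rather than byte offsets):

```python
def biluo_to_spans(tags: list[str]) -> list[tuple[int, int, str]]:
    # convert one BILUO tag per token into (start, end, label) token spans
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":                       # outside any entity
            start = None
        elif tag.startswith("U-"):           # unit: single-token entity
            spans.append((i, i + 1, tag[2:]))
            start = None
        elif tag.startswith("B-"):           # begin a multi-token entity
            start = i
        elif tag.startswith("L-") and start is not None:  # last token
            spans.append((start, i + 1, tag[2:]))
            start = None
    return spans

# "Barack Obama visited Paris ."
print(biluo_to_spans(["B-PERSON", "L-PERSON", "O", "U-GPE", "O"]))
# → [(0, 2, 'PERSON'), (3, 4, 'GPE')]
```

the real parser is greedy in the forward direction — at each token it scores all 73 actions, masks the ones invalid in the current state, and commits to the best — but the tag-to-span mapping above is the invariant both directions share.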
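the tokenizer's core loop is worth sketching too: split on whitespace, peel prefix punctuation from the front, peel suffix punctuation from the back, then split what remains on infix rules. the rule tables below are tiny stand-ins for the real prefix/suffix/infix tables and ~1,000 special cases:

```python
import re

PREFIXES = ("(", '"')
SUFFIXES = ("'s", ")", ".", ",", "!", "?", '"')  # longest match listed first
INFIX = re.compile(r"((?<=[a-z])-(?=[a-z]))")     # keep the hyphen as a token

def tokenize(text: str) -> list[str]:
    tokens = []
    for chunk in text.split():
        # peel prefix punctuation from the front
        while chunk.startswith(PREFIXES):
            tokens.append(chunk[0])
            chunk = chunk[1:]
        # peel suffix punctuation from the back (re-emit in text order)
        tail = []
        while chunk and chunk.endswith(SUFFIXES):
            suffix = next(s for s in SUFFIXES if chunk.endswith(s))
            tail.append(suffix)
            chunk = chunk[: -len(suffix)]
        if chunk:
            tokens.extend(t for t in INFIX.split(chunk) if t)
        tokens.extend(reversed(tail))
    return tokens

print(tokenize("(Obama's dog-house.)"))
# → ['(', 'Obama', "'s", 'dog', '-', 'house', '.', ')']
```

matching spaCy exactly means matching this loop's order of operations, which is why the port reproduces the peel-then-split structure rather than using a single regex pass.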
## validation
73/73 test sentences produce identical entities and byte offsets to spaCy en_core_web_sm 3.8.0, including multi-byte UTF-8, contractions, hyphens, possessives, informal text, and zero-entity inputs.
```sh
# head-to-head comparison (requires spaCy + en_core_web_sm)
uv run --python 3.12 --with spacy \
  --with 'en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl' \
  python scripts/compare.py
```
## entity types
PERSON, ORG, GPE, PRODUCT, EVENT, WORK_OF_ART, FAC, NORP, LOC
(plus CARDINAL, DATE, LANGUAGE, MONEY, PERCENT, ORDINAL, QUANTITY, TIME, LAW — inherited from the model, for 18 types total)