motet#

A mote of data, composed.

A personal search indexer that crawls the corners of the web you care about and indexes them into a local full-text search engine. The opposite of Google — small, focused, private, and yours.

                  ┌──────────────┐
                  │  motet CLI   │
                  │  & Web UI    │
                  └──────┬───────┘
                         │
              ┌──────────┴──────────┐
              │     motet_core      │
              ├──────────┬──────────┤
              │ Crawlers │  Query   │
              │ (blog,   │  Engine  │
              │  yelp,   │ (BM25)   │
              │  reddit, │          │
              │  crates) │          │
              ├──────────┴──────────┤
              │  Tantivy │  SQLite  │
              │  (index) │  (meta)  │
              └──────────┴──────────┘
                     │          │
          ┌──────────┘          └──────────┐
          ▼                                ▼
  ~/.local/share/motet/index/   ~/.local/share/motet/motet.db

Features#

Focused — you choose what gets indexed, no SEO spam or ads
Fast — sub-millisecond queries over a compact local index
Private — runs locally, no tracking, no cloud dependency
Dual interface — CLI for the terminal, React web UI in the browser
Single binary — the web frontend is embedded in the Rust binary
Extensible — add new sources by implementing the crawler trait
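
The README does not show the crawler trait itself, so the following is only a sketch of what such an extension point might look like. Every name here (`Crawler`, `Document`, `StaticCrawler`) is hypothetical and may not match what motet_core actually exposes.

```rust
// Hypothetical shapes: motet_core's real trait and document type may differ.
#[derive(Debug, Clone, PartialEq)]
struct Document {
    url: String,
    title: String,
    body: String,
}

/// Assumed interface: a crawler yields the documents for one configured source.
trait Crawler {
    /// Label stored with each document, e.g. "blog".
    fn kind(&self) -> &str;
    /// Fetch and parse the source. Errors are plain strings to keep the sketch small.
    fn crawl(&self) -> Result<Vec<Document>, String>;
}

/// Toy source that returns canned pages instead of touching the network.
struct StaticCrawler {
    pages: Vec<(String, String, String)>, // (url, title, body)
}

impl Crawler for StaticCrawler {
    fn kind(&self) -> &str {
        "static"
    }

    fn crawl(&self) -> Result<Vec<Document>, String> {
        Ok(self
            .pages
            .iter()
            .map(|(url, title, body)| Document {
                url: url.clone(),
                title: title.clone(),
                body: body.clone(),
            })
            .collect())
    }
}

fn main() {
    let crawler = StaticCrawler {
        pages: vec![(
            "https://example.com/post".to_string(),
            "Example post".to_string(),
            "Some body text.".to_string(),
        )],
    };
    let docs = crawler.crawl().expect("static crawl cannot fail");
    println!("{} document(s) from a '{}' source", docs.len(), crawler.kind());
}
```

A real crawler would presumably do network I/O and HTML parsing inside `crawl`; the canned pages here only illustrate the shape of the interface.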

Quick Start#

# Initialize default config (This Week in Rust + Scout Magazine)
motet init

# Crawl all configured sources
motet crawl

# Search your index
motet search "rust async"

# Start the web UI
motet serve
# → http://127.0.0.1:3838

Installation#

From Source#

cargo install --path motet_cli

Nix#

nix build   # or: nix develop (enters the dev shell)

Configuration#

Sources are defined in ~/.config/motet/sources.json. Run motet init to generate a default config, then edit it to add your own sources.

{
  "sources": {
    "this_week_in_rust": {
      "kind": "blog",
      "url": "https://this-week-in-rust.org/",
      "crawl_interval": "7d",
      "selector": "article",
      "max_pages": 50,
      "source_kind_label": "blog"
    },
    "scout_magazine": {
      "kind": "blog",
      "url": "https://scoutmagazine.ca/category/food-drink/",
      "crawl_interval": "3d",
      "selector": "article",
      "max_pages": 50,
      "source_kind_label": "restaurant"
    }
  }
}

Source Kinds#

Kind	Description	Status
`blog`	Generic HTML blog scraper	Implemented
`yelp`	Yelp Fusion API	Planned
`reddit`	Reddit posts via API	Planned
`crates_io`	crates.io package metadata	Planned

Configuration Fields#

Field	Required	Description
`kind`	yes	Crawler type (see table above)
`url`	yes for `blog`	Base URL to crawl
`crawl_interval`	no	Re-crawl frequency (`30m`, `12h`, `1d`, `7d`)
`selector`	no	CSS selector for article elements
`max_pages`	no	Maximum pages to crawl per run
`source_kind_label`	no	Facet label for filtering (`blog`, `restaurant`)
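
As an illustration of the `crawl_interval` format, here is a minimal sketch of turning values such as `30m` or `7d` into a duration. The function name `parse_interval` is hypothetical, and the sketch assumes only the `m`, `h`, and `d` suffixes from the examples above are valid; motet's own parser may accept more.

```rust
use std::time::Duration;

/// Parse an interval such as "30m", "12h", "1d" or "7d" into a Duration.
/// Assumption: only the m/h/d suffixes shown in the table are accepted.
fn parse_interval(s: &str) -> Option<Duration> {
    let s = s.trim();
    let unit = s.chars().last()?;
    let digits = &s[..s.len() - unit.len_utf8()];
    // Reject empty counts and anything that is not a plain ASCII digit (e.g. "+5").
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let n: u64 = digits.parse().ok()?;
    let secs_per_unit: u64 = match unit {
        'm' => 60,
        'h' => 3_600,
        'd' => 86_400,
        _ => return None,
    };
    // checked_mul guards against overflow on absurdly large counts.
    n.checked_mul(secs_per_unit).map(Duration::from_secs)
}

fn main() {
    for raw in ["30m", "12h", "1d", "7d", "soon"] {
        match parse_interval(raw) {
            Some(d) => println!("{raw} = {} seconds", d.as_secs()),
            None => println!("{raw} is not a valid interval"),
        }
    }
}
```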

CLI Reference#

motet init                          Write default config
motet crawl                         Crawl all sources
motet crawl --source <name>         Crawl one source
motet crawl --dry-run               Show what would be crawled
motet search <query>                Search the index
motet search <query> --limit 20     Limit result count
motet serve                         Start web UI on :3838
motet serve --port 8080             Custom port
motet stats                         Show index statistics

Crates#

Crate	Description
`motet_core`	Library — crawlers, index, query, config
`motet_cli`	Binary — CLI commands and web server

How It Works#

Crawling#

Each source has a crawler that fetches pages, extracts text content, and produces structured documents. The generic blog crawler:

Fetches the index page at the configured URL
Extracts article links using the CSS selector
Fetches each article and extracts title + body text
Produces a snippet (first ~300 chars) for search result display
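
The snippet step can be sketched as follows. This is an assumption about how such a step might be written, not motet's actual code: the name `make_snippet`, the whitespace collapsing, the word-boundary trimming, and the trailing `...` are all illustrative choices. The one detail that matters in Rust is counting characters rather than bytes, so the cut never lands inside a multi-byte UTF-8 sequence.

```rust
/// Build a display snippet: collapse whitespace, then keep at most `max_chars` characters.
fn make_snippet(body: &str, max_chars: usize) -> String {
    // Collapse runs of whitespace (including newlines from HTML) into single spaces.
    let collapsed = body.split_whitespace().collect::<Vec<_>>().join(" ");
    if collapsed.chars().count() <= max_chars {
        return collapsed;
    }
    // Take whole characters so the cut is always on a char boundary.
    let cut: String = collapsed.chars().take(max_chars).collect();
    // Prefer to end at the last space so a word is not split in half.
    let trimmed = match cut.rfind(' ') {
        Some(idx) if idx > 0 => &cut[..idx],
        _ => cut.as_str(),
    };
    format!("{}...", trimmed)
}

fn main() {
    let body = "Rust   async\n\nruntimes compared: ".to_string() + &"lorem ipsum ".repeat(40);
    let snippet = make_snippet(&body, 300);
    println!("{} chars: {}", snippet.chars().count(), snippet);
}
```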

Indexing#

Documents are stored in two places:

Store	Contents	Purpose
Tantivy	URL, title, body, facets, tags, date	Full-text search (BM25)
SQLite	Crawl timestamps, ETags, structured metadata	Freshness, dedup, domain-specific data
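
To make the freshness and dedup roles of the metadata store concrete, here is a small sketch of the two decisions it could support: whether a source is due for a re-crawl, and whether a fetched page is unchanged according to its ETag. The `CrawlRecord` struct and both function names are hypothetical; the real SQLite schema is not shown in this README.

```rust
use std::time::Duration;

/// Hypothetical row from the SQLite metadata store.
struct CrawlRecord {
    last_crawled_unix: u64,
    etag: Option<String>,
}

/// A source is due when it has never been crawled or its interval has elapsed.
fn is_due(record: Option<&CrawlRecord>, now_unix: u64, interval: Duration) -> bool {
    match record {
        None => true,
        // saturating_sub avoids underflow if the clock moved backwards.
        Some(r) => now_unix.saturating_sub(r.last_crawled_unix) >= interval.as_secs(),
    }
}

/// A page can skip re-indexing when the server reports the ETag already stored.
/// If either side has no ETag, the page is treated as possibly changed.
fn is_unchanged(record: &CrawlRecord, response_etag: Option<&str>) -> bool {
    match (record.etag.as_deref(), response_etag) {
        (Some(old), Some(new)) => old == new,
        _ => false,
    }
}

fn main() {
    let record = CrawlRecord {
        last_crawled_unix: 1_000_000,
        etag: Some("\"abc123\"".to_string()),
    };
    let week = Duration::from_secs(7 * 86_400);
    println!("due after one day: {}", is_due(Some(&record), 1_000_000 + 86_400, week));
    println!("due after eight days: {}", is_due(Some(&record), 1_000_000 + 8 * 86_400, week));
    println!("unchanged: {}", is_unchanged(&record, Some("\"abc123\"")));
}
```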

Searching#

Queries run against the Tantivy index using BM25 scoring across the title and body fields. Results include the source kind and name as facets for filtering.
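
For readers unfamiliar with BM25, the sketch below computes the score of a single query term in a single field. It is a standalone illustration of the formula, not Tantivy's code: the function names are hypothetical, `K1 = 1.2` and `B = 0.75` are the commonly used defaults, the idf is the Lucene-style variant, and summing the title and body scores in `main` is an assumption about how the two fields are combined.

```rust
const K1: f64 = 1.2; // term-frequency saturation
const B: f64 = 0.75; // strength of document-length normalization

/// Inverse document frequency: rarer terms get a larger weight.
fn idf(doc_count: u64, docs_with_term: u64) -> f64 {
    let n = doc_count as f64;
    let df = docs_with_term as f64;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// BM25 score of one term in one field of one document.
fn bm25_term(tf: f64, field_len: f64, avg_field_len: f64, idf: f64) -> f64 {
    // Longer-than-average fields are penalized, shorter ones are boosted.
    let norm = K1 * (1.0 - B + B * field_len / avg_field_len);
    idf * tf * (K1 + 1.0) / (tf + norm)
}

fn main() {
    // Made-up statistics: 1000 documents, 40 of which contain the term.
    let weight = idf(1_000, 40);
    let title_score = bm25_term(1.0, 6.0, 8.0, weight);
    let body_score = bm25_term(3.0, 450.0, 600.0, weight);
    println!("title = {title_score:.3}, body = {body_score:.3}");
    println!("combined = {:.3}", title_score + body_score);
}
```

Two properties are worth noticing: the score grows with term frequency but saturates, and the same term frequency counts for more in a short field than in a long one.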

Storage Layout#

~/.config/motet/
└── sources.json              # Source configuration

~/.local/share/motet/
├── index/                    # Tantivy full-text index
└── motet.db                  # SQLite metadata

Development#

nix develop     # Enter dev shell (Rust 1.90 + Node 22)
cargo build     # Build
cargo test      # Test
cargo clippy    # Lint

Building the Web UI#

cd motet_web
npm install
npm run build   # Output to dist/, embedded in binary

License#

Apache-2.0 OR MIT
