# web-archive

web archiver with MASL bundle mode for ATProto. captures web pages as content-addressed bundles stored on your PDS with optional IPFS pinning. includes a recursive site crawler for archiving entire websites.

## what it does

- single mode: archives a single HTML page, storing its CID in a `systems.witchcraft.archive.capture` ATProto record
- bundle mode: archives a page plus all subresources (CSS, JS, images, fonts) as a MASL bundle; each resource gets its own content-addressed blob on your PDS
- site mode (`site_archive.py`): recursively crawls a website breadth-first, archives each page as a bundle, and creates a site manifest linking them together. internal links are rewritten to point to sibling archive captures.
- CSS `url()` scanning: follows `@import` and `url()` references in stylesheets to capture fonts, background images, and so on
- IPFS pinning: optionally pins the HTML to IPFS via a local kubo node
- PDS blob storage: uploads all resources as PDS blobs with proper content-type headers
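the stylesheet scanning step can be sketched with a couple of regexes. this is a minimal illustration, not the tool's actual implementation; the function name and patterns are assumptions:

```python
import re

# hypothetical helper: extract resource URLs referenced from a stylesheet.
# matches url(...) in unquoted, single- and double-quoted forms, plus
# the bare-string form of @import ("@import url(...)" is caught by URL_RE).
URL_RE = re.compile(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""")
IMPORT_RE = re.compile(r"""@import\s+['"]([^'"]+)['"]""")

def css_references(css_text):
    """Return the set of URLs a stylesheet pulls in (fonts, images, imports)."""
    refs = set(URL_RE.findall(css_text))
    refs.update(IMPORT_RE.findall(css_text))
    # data: URIs are already inline and need no separate capture
    return {r for r in refs if not r.startswith("data:")}
```

each discovered URL would then be resolved against the stylesheet's own URL and fetched like any other subresource.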

## usage

```sh
# single page archive
python web_archive.py https://example.com

# bundle mode (page + all subresources)
python web_archive.py https://example.com --bundle

# bundle with resource limit
python web_archive.py https://example.com --bundle --max-resources 50

# skip IPFS pinning
python web_archive.py https://example.com --no-ipfs

# list all archives
python web_archive.py --list

# search archives
python web_archive.py --search "example"

# verify archive integrity
python web_archive.py --verify <rkey>
```
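integrity verification presumably recomputes each resource's CID and compares it against the stored one. a sketch of deriving a spec-conformant CIDv1 string (raw codec, sha2-256) from content bytes using only the standard library; the encoding choice is an assumption, not necessarily what `web_archive.py` does internally:

```python
import base64
import hashlib

def cid_v1_raw(data: bytes) -> str:
    """CIDv1 for raw bytes: multibase(base32) of
    <version 0x01><codec raw 0x55><multihash: sha2-256 0x12, len 0x20, digest>."""
    digest = hashlib.sha256(data).digest()
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # multibase base32 is lowercase, unpadded, and prefixed with 'b'
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
```

every CIDv1 built this way starts with `bafkrei` (the header bytes fix the first characters) and is 59 characters long.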

## site archiver

```sh
# dry-run: show what would be archived
python site_archive.py https://example.com --dry-run

# archive a site (default: depth 2, max 30 pages)
python site_archive.py https://example.com

# customize crawl depth and page limit
python site_archive.py https://example.com --depth 3 --max-pages 50

# list site archives
python site_archive.py --list

# show status of a site archive
python site_archive.py --status <rkey>
```
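the crawl described above (breadth-first, with depth and page limits, same-host only) can be sketched as a pure planning function. `crawl_order` and its `get_links` callback are illustrative names, not the actual API of `site_archive.py`:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl_order(start_url, get_links, depth=2, max_pages=30):
    """Breadth-first crawl plan: (url, depth) pairs in visit order.
    get_links(url) -> iterable of hrefs; only same-host links are followed."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, d = queue.popleft()
        order.append((url, d))
        if d >= depth:
            continue  # don't expand links past the depth limit
        for href in get_links(url):
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, d + 1))
    return order
```

because the frontier is a FIFO queue, pages at depth 1 are all archived before any page at depth 2, so the `--max-pages` cutoff favors pages close to the start URL.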

## auth

set these environment variables:

```sh
export ATP_PDS_URL=https://your.pds.example.com
export ATP_HANDLE=your.handle
export ATP_PASSWORD=your-app-password
```
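these variables feed the standard `com.atproto.server.createSession` XRPC call. a minimal sketch of building that request; the helper name is hypothetical and the scripts here may authenticate differently:

```python
import os

def session_request():
    """Build the com.atproto.server.createSession call from the env vars above.
    Returns (url, json_payload) ready to POST with any HTTP client; the
    response carries accessJwt for authenticating later uploads."""
    pds = os.environ["ATP_PDS_URL"].rstrip("/")
    url = f"{pds}/xrpc/com.atproto.server.createSession"
    payload = {"identifier": os.environ["ATP_HANDLE"],
               "password": os.environ["ATP_PASSWORD"]}
    return url, payload
```

use an app password here rather than your account password, since it is stored in the shell environment.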

## dependencies

```sh
pip install requests beautifulsoup4
```

optional: the `ipfs` CLI (kubo) for IPFS pinning

## record types

- `systems.witchcraft.archive.capture`: single page captures
- `systems.witchcraft.archive.bundle`: bundle archives with a MASL-shaped manifest
  - `masl.resources`: path → {src: CID, content-type} (spec-conformant CID strings)
  - `blobs`: path → ATProto blob ref (for content retrieval from the PDS)
  - archive metadata (url, title, capturedAt, etc.) at the top level
  - see the MASL spec for the manifest format
- `systems.witchcraft.archive.site`: site manifests linking multiple page bundles
  - page list with URLs, titles, depths, and bundle rkeys
  - link map for internal link rewriting between archived pages
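an illustrative shape for a bundle record, pieced together from the fields listed above; exact field names and nesting may differ in the real lexicon, and the CID strings are placeholders:

```python
# illustrative only -- fields beyond those documented above are guesses
bundle_record = {
    "$type": "systems.witchcraft.archive.bundle",
    # archive metadata at the top level
    "url": "https://example.com/",
    "title": "Example Domain",
    "capturedAt": "2024-01-01T00:00:00Z",
    # MASL-shaped manifest: path -> {src: CID, content-type}
    "masl": {
        "resources": {
            "index.html": {"src": "bafkrei...", "content-type": "text/html"},
            "style.css": {"src": "bafkrei...", "content-type": "text/css"},
        },
    },
    # path -> ATProto blob ref, for retrieving the bytes from the PDS
    "blobs": {
        "index.html": {
            "$type": "blob",
            "ref": {"$link": "bafkrei..."},
            "mimeType": "text/html",
            "size": 1256,
        },
    },
}
```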

## viewer

archived pages can be viewed with the archive viewer hosted on wisp.place.

## license

MIT