web-archive#
web archiver with MASL bundle mode for ATProto. captures web pages as content-addressed bundles stored on your PDS with optional IPFS pinning. includes a recursive site crawler for archiving entire websites.
what it does#
- single mode: archives a single HTML page with its CID stored as a systems.witchcraft.archive.capture ATProto record
- bundle mode: archives a page + all subresources (CSS, JS, images, fonts) as a MASL bundle; each resource gets its own content-addressed blob on your PDS
- site mode (site_archive.py): recursively crawls a website (BFS), archives each page as a bundle, creates a site manifest linking them all together. internal links are rewritten to point to sibling archive captures.
- CSS url() scanning: follows @import and url() references in stylesheets to capture fonts, background images, etc.
- IPFS pinning: optionally pins the HTML to IPFS via a local kubo node
- PDS blob storage: uploads all resources as PDS blobs with proper content-type headers
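the CSS url() scanning step can be illustrated roughly as follows. this is a minimal sketch, not the code in web_archive.py: the function name css_references and both regular expressions are assumptions made for illustration, and a real stylesheet parser would handle edge cases (comments, escaped characters) that these patterns do not.

```python
import re
from urllib.parse import urljoin

# Matches url(...) with optional single or double quotes around the target.
URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)
# Matches the quoted form @import "x.css"; the @import url(...) form is
# already covered by URL_RE.
IMPORT_RE = re.compile(r"""@import\s+(['"])(.*?)\1""", re.IGNORECASE)


def css_references(css_text, base_url):
    """Return absolute URLs referenced by a stylesheet, in first-seen order.

    Hypothetical helper: base_url is the URL the stylesheet was fetched from,
    so relative references resolve against the stylesheet, not the page.
    """
    found = []
    for pattern in (IMPORT_RE, URL_RE):
        for match in pattern.finditer(css_text):
            ref = match.group(2).strip()
            # Skip empty references and inline data: URIs, which need no fetch.
            if not ref or ref.lower().startswith("data:"):
                continue
            absolute = urljoin(base_url, ref)
            if absolute not in found:
                found.append(absolute)
    return found
```

each URL returned this way would then be fetched and stored like any other subresource of the bundle.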
usage#
# single page archive
python web_archive.py https://example.com
# bundle mode (page + all subresources)
python web_archive.py https://example.com --bundle
# bundle with resource limit
python web_archive.py https://example.com --bundle --max-resources 50
# skip IPFS pinning
python web_archive.py https://example.com --no-ipfs
# list all archives
python web_archive.py --list
# search archives
python web_archive.py --search "example"
# verify archive integrity
python web_archive.py --verify <rkey>
site archiver#
# dry-run: show what would be archived
python site_archive.py https://example.com --dry-run
# archive a site (default: depth 2, max 30 pages)
python site_archive.py https://example.com
# customize crawl depth and page limit
python site_archive.py https://example.com --depth 3 --max-pages 50
# list site archives
python site_archive.py --list
# show status of a site archive
python site_archive.py --status <rkey>
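for intuition about how --depth and --max-pages bound a breadth-first crawl, here is a small sketch. it is not the logic in site_archive.py: crawl_order and the get_links callback are hypothetical names, and the same-host rule and the exact meaning of "depth" (start page counted as depth 0) are assumptions.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl_order(start_url, get_links, max_depth=2, max_pages=30):
    """Breadth-first walk over same-host links; returns [(url, depth), ...].

    get_links is a hypothetical callback that takes a page URL and returns
    the href values found on that page.
    """
    host = urlparse(start_url).netloc
    start = urldefrag(start_url).url
    seen = {start}
    queue = deque([(start, 0)])
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        pages.append((url, depth))
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for href in get_links(url):
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

because the queue is first-in first-out, every page at depth 1 is visited before any page at depth 2, so a low --max-pages value keeps the pages closest to the start URL.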
auth#
set these environment variables:
export ATP_PDS_URL=https://your.pds.example.com
export ATP_HANDLE=your.handle
export ATP_PASSWORD=your-app-password
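as a rough sketch of how these variables could be turned into a PDS session: com.atproto.server.createSession is the standard ATProto login endpoint, but the helper names below (load_credentials, build_login_request, create_session) are hypothetical, and the sketch uses only the standard library rather than whatever web_archive.py does internally.

```python
import json
import os
import urllib.request


def load_credentials(env=None):
    """Read the three variables above; raise if any is missing or empty."""
    env = os.environ if env is None else env
    names = ("ATP_PDS_URL", "ATP_HANDLE", "ATP_PASSWORD")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    # Strip a trailing slash so the XRPC path can be appended cleanly.
    return env["ATP_PDS_URL"].rstrip("/"), env["ATP_HANDLE"], env["ATP_PASSWORD"]


def build_login_request(pds_url, handle, password):
    """Build (but do not send) the createSession POST request."""
    body = json.dumps({"identifier": handle, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{pds_url}/xrpc/com.atproto.server.createSession",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(pds_url, handle, password):
    """Send the login request; the JSON reply includes accessJwt and did."""
    request = build_login_request(pds_url, handle, password)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

using an app password rather than the main account password limits the damage if the environment leaks.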
dependencies#
pip install requests beautifulsoup4
optional: ipfs CLI (kubo) for IPFS pinning
record types#
- systems.witchcraft.archive.capture: single page captures
- systems.witchcraft.archive.bundle: bundle archives with MASL-shaped manifest
  - masl.resources: path → {src: CID, content-type} (spec-conformant CID strings)
  - blobs: path → ATProto blob ref (for content retrieval from PDS)
  - archive metadata (url, title, capturedAt, etc) at top level
  - see MASL spec for the manifest format
- systems.witchcraft.archive.site: site manifests linking multiple page bundles
  - page list with URLs, titles, depths, and bundle rkeys
  - link map for internal link rewriting between archived pages
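to make the bundle record layout more concrete, here is an illustrative sketch of how such a record might be assembled. the field names url, title, capturedAt, masl.resources and blobs come from the list above; everything else is an assumption, including the build_bundle_record helper, the nesting of resources under a "masl" key, the reuse of one CID string in both maps, and the placeholder CID values. the blob ref shape follows the usual ATProto JSON encoding.

```python
def build_bundle_record(url, title, captured_at, resources):
    """Assemble a bundle-style record as a plain dict.

    resources maps a path to (cid, content_type, size_in_bytes); this input
    shape is hypothetical and chosen only to keep the example short.
    """
    masl_resources = {}
    blobs = {}
    for path, (cid, content_type, size) in resources.items():
        # MASL-shaped entry: path -> {src: CID, content-type}
        masl_resources[path] = {"src": cid, "content-type": content_type}
        # ATProto blob ref, used to fetch the bytes back from the PDS.
        blobs[path] = {
            "$type": "blob",
            "ref": {"$link": cid},
            "mimeType": content_type,
            "size": size,
        }
    return {
        "$type": "systems.witchcraft.archive.bundle",
        # Archive metadata sits at the top level of the record.
        "url": url,
        "title": title,
        "capturedAt": captured_at,
        "masl": {"resources": masl_resources},
        "blobs": blobs,
    }


record = build_bundle_record(
    "https://example.com/",
    "Example Domain",
    "2024-01-01T00:00:00Z",
    {
        # Placeholder CIDs, not real content hashes.
        "/": ("bafkrei-placeholder-html", "text/html", 1256),
        "/style.css": ("bafkrei-placeholder-css", "text/css", 312),
    },
)
```

keeping the two maps keyed by the same paths means a viewer can look up a resource's CID and its blob ref with a single path.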
viewer#
archived pages can be viewed with the archive viewer hosted on wisp.place.
license#
MIT