# web-archive

web archiver with MASL bundle mode for ATProto. captures web pages as content-addressed bundles stored on your PDS with optional IPFS pinning. includes a recursive site crawler for archiving entire websites.

## what it does

- single mode: archives a single HTML page, storing its CID in a `systems.witchcraft.archive.capture` ATProto record
- bundle mode: archives a page plus all subresources (CSS, JS, images, fonts) as a MASL bundle; each resource gets its own content-addressed blob on your PDS
- site mode (`site_archive.py`): recursively crawls a website breadth-first, archives each page as a bundle, and creates a site manifest linking them together. internal links are rewritten to point to sibling archive captures.
- CSS `url()` scanning: follows `@import` and `url()` references in stylesheets to capture fonts, background images, and so on
- IPFS pinning: optionally pins the HTML to IPFS via a local kubo node
- PDS blob storage: uploads all resources as PDS blobs with proper content-type headers
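the stylesheet scanning step can be sketched with a couple of regexes. this is a minimal illustration, not the tool's actual implementation; the function name and patterns are assumptions:

```python
import re

# hypothetical helper: extract resource URLs referenced from a stylesheet.
# matches url(...) in unquoted, single- and double-quoted forms, plus
# the bare-string form of @import ("@import url(...)" is caught by URL_RE).
URL_RE = re.compile(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""")
IMPORT_RE = re.compile(r"""@import\s+['"]([^'"]+)['"]""")

def css_references(css_text):
    """Return the set of URLs a stylesheet pulls in (fonts, images, imports)."""
    refs = set(URL_RE.findall(css_text))
    refs.update(IMPORT_RE.findall(css_text))
    # data: URIs are already inline and need no separate capture
    return {r for r in refs if not r.startswith("data:")}
```

each discovered URL would then be resolved against the stylesheet's own URL and fetched like any other subresource.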

## usage

```sh
# single page archive
python web_archive.py https://example.com

# bundle mode (page + all subresources)
python web_archive.py https://example.com --bundle

# bundle with resource limit
python web_archive.py https://example.com --bundle --max-resources 50

# skip IPFS pinning
python web_archive.py https://example.com --no-ipfs

# list all archives
python web_archive.py --list

# search archives
python web_archive.py --search "example"

# verify archive integrity
python web_archive.py --verify <rkey>
```
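integrity verification presumably recomputes each resource's CID and compares it against the stored one. a sketch of deriving a spec-conformant CIDv1 string (raw codec, sha2-256) from content bytes using only the standard library; the encoding choice is an assumption, not necessarily what `web_archive.py` does internally:

```python
import base64
import hashlib

def cid_v1_raw(data: bytes) -> str:
    """CIDv1 for raw bytes: multibase(base32) of
    <version 0x01><codec raw 0x55><multihash: sha2-256 0x12, len 0x20, digest>."""
    digest = hashlib.sha256(data).digest()
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # multibase base32 is lowercase, unpadded, and prefixed with 'b'
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
```

every CIDv1 built this way starts with `bafkrei` (the header bytes fix the first characters) and is 59 characters long.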

## site archiver

```sh
# dry-run: show what would be archived
python site_archive.py https://example.com --dry-run

# archive a site (default: depth 2, max 30 pages)
python site_archive.py https://example.com

# customize crawl depth and page limit
python site_archive.py https://example.com --depth 3 --max-pages 50

# list site archives
python site_archive.py --list

# show status of a site archive
python site_archive.py --status <rkey>
```
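the crawl described above (breadth-first, with depth and page limits, same-host only) can be sketched as a pure planning function. `crawl_order` and its `get_links` callback are illustrative names, not the actual API of `site_archive.py`:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl_order(start_url, get_links, depth=2, max_pages=30):
    """Breadth-first crawl plan: (url, depth) pairs in visit order.
    get_links(url) -> iterable of hrefs; only same-host links are followed."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, d = queue.popleft()
        order.append((url, d))
        if d >= depth:
            continue  # don't expand links past the depth limit
        for href in get_links(url):
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, d + 1))
    return order
```

because the frontier is a FIFO queue, pages at depth 1 are all archived before any page at depth 2, so the `--max-pages` cutoff favors pages close to the start URL.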

## auth

set these environment variables:

```sh
export ATP_PDS_URL=https://your.pds.example.com
export ATP_HANDLE=your.handle
export ATP_PASSWORD=your-app-password
```
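these variables feed the standard `com.atproto.server.createSession` XRPC call. a minimal sketch of building that request; the helper name is hypothetical and the scripts here may authenticate differently:

```python
import os

def session_request():
    """Build the com.atproto.server.createSession call from the env vars above.
    Returns (url, json_payload) ready to POST with any HTTP client; the
    response carries accessJwt for authenticating later uploads."""
    pds = os.environ["ATP_PDS_URL"].rstrip("/")
    url = f"{pds}/xrpc/com.atproto.server.createSession"
    payload = {"identifier": os.environ["ATP_HANDLE"],
               "password": os.environ["ATP_PASSWORD"]}
    return url, payload
```

use an app password here rather than your account password, since it is stored in the shell environment.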

## dependencies

```sh
pip install requests beautifulsoup4
```

optional: the `ipfs` CLI (kubo) for IPFS pinning

## record types

- `systems.witchcraft.archive.capture`: single page captures
- `systems.witchcraft.archive.bundle`: bundle archives with a MASL-shaped manifest
  - `masl.resources`: path → {src: CID, content-type} (spec-conformant CID strings)
  - `blobs`: path → ATProto blob ref (for content retrieval from the PDS)
  - archive metadata (url, title, capturedAt, etc.) at the top level
  - see the MASL spec for the manifest format
- `systems.witchcraft.archive.site`: site manifests linking multiple page bundles
  - page list with URLs, titles, depths, and bundle rkeys
  - link map for internal link rewriting between archived pages
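an illustrative shape for a bundle record, pieced together from the fields listed above; exact field names and nesting may differ in the real lexicon, and the CID strings are placeholders:

```python
# illustrative only -- fields beyond those documented above are guesses
bundle_record = {
    "$type": "systems.witchcraft.archive.bundle",
    # archive metadata at the top level
    "url": "https://example.com/",
    "title": "Example Domain",
    "capturedAt": "2024-01-01T00:00:00Z",
    # MASL-shaped manifest: path -> {src: CID, content-type}
    "masl": {
        "resources": {
            "index.html": {"src": "bafkrei...", "content-type": "text/html"},
            "style.css": {"src": "bafkrei...", "content-type": "text/css"},
        },
    },
    # path -> ATProto blob ref, for retrieving the bytes from the PDS
    "blobs": {
        "index.html": {
            "$type": "blob",
            "ref": {"$link": "bafkrei..."},
            "mimeType": "text/html",
            "size": 1256,
        },
    },
}
```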

## viewer

archived pages can be viewed with the archive viewer hosted on wisp.place.

## license

MIT