Weighs the soul of incoming HTTP requests to stop AI crawlers

chore(default-config): allowlist common crawl (#753)

This may seem strange, but allowlisting Common Crawl gives scrapers
less incentive to crawl sites themselves: they can grab the data from
Common Crawl's public archive instead of scraping it again.

Authored by Xe Iaso and committed by GitHub (7c099644, d7a758f8)

+1 -1
data/bots/ai-catchall.yaml
```diff
 # Warning: May contain user agents that _must_ be blocked in robots.txt, or the opt-out will have no effect.
 - name: "ai-catchall"
   user_agent_regex: >-
-    AI2Bot|Ai2Bot-Dolma|aiHitBot|Amazonbot|anthropic-ai|Brightbot 1.0|Bytespider|CCBot|Claude-Web|cohere-ai|cohere-training-data-crawler|Cotoyogi|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Google-CloudVertexBot|GoogleOther|GoogleOther-Image|GoogleOther-Video|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo Bot|meta-externalagent|Meta-ExternalAgent|meta-externalfetcher|Meta-ExternalFetcher|NovaAct|omgili|omgilibot|Operator|PanguBot|Perplexity-User|PerplexityBot|PetalBot|QualifiedBot|Scrapy|SemrushBot-OCOB|SemrushBot-SWA|Sidetrade indexer bot|TikTokSpider|Timpibot|VelenPublicWebCrawler|Webzio-Extended|wpbot|YouBot
+    AI2Bot|Ai2Bot-Dolma|aiHitBot|Amazonbot|anthropic-ai|Brightbot 1.0|Bytespider|Claude-Web|cohere-ai|cohere-training-data-crawler|Cotoyogi|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Google-CloudVertexBot|GoogleOther|GoogleOther-Image|GoogleOther-Video|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|imgproxy|ISSCyberRiskCrawler|Kangaroo Bot|meta-externalagent|Meta-ExternalAgent|meta-externalfetcher|Meta-ExternalFetcher|NovaAct|omgili|omgilibot|Operator|PanguBot|Perplexity-User|PerplexityBot|PetalBot|QualifiedBot|Scrapy|SemrushBot-OCOB|SemrushBot-SWA|Sidetrade indexer bot|TikTokSpider|Timpibot|VelenPublicWebCrawler|Webzio-Extended|wpbot|YouBot
   action: DENY
```
+3 -1
data/bots/ai-robots-txt.yaml
```diff
 # Warning: Contains user agents that _must_ be blocked in robots.txt, or the opt-out will have no effect.
 # Note: Blocks human-directed/non-training user agents
+#
+# CCBot is allowed because if Common Crawl is allowed, then scrapers don't need to scrape to get the data.
 - name: "ai-robots-txt"
   user_agent_regex: >-
-    AI2Bot|Ai2Bot-Dolma|aiHitBot|Amazonbot|Andibot|anthropic-ai|Applebot|Applebot-Extended|bedrockbot|Brightbot 1.0|Bytespider|CCBot|ChatGPT-User|Claude-SearchBot|Claude-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Cotoyogi|Crawlspace|Diffbot|DuckAssistBot|EchoboxBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Google-CloudVertexBot|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo Bot|meta-externalagent|Meta-ExternalAgent|meta-externalfetcher|Meta-ExternalFetcher|MistralAI-User/1.0|MyCentralAIScraperBot|NovaAct|OAI-SearchBot|omgili|omgilibot|Operator|PanguBot|Panscient|panscient.com|Perplexity-User|PerplexityBot|PetalBot|PhindBot|Poseidon Research Crawler|QualifiedBot|QuillBot|quillbot.com|SBIntuitionsBot|Scrapy|SemrushBot|SemrushBot-BA|SemrushBot-CT|SemrushBot-OCOB|SemrushBot-SI|SemrushBot-SWA|Sidetrade indexer bot|TikTokSpider|Timpibot|VelenPublicWebCrawler|Webzio-Extended|wpbot|YandexAdditional|YandexAdditionalBot|YouBot
+    AI2Bot|Ai2Bot-Dolma|aiHitBot|Amazonbot|Andibot|anthropic-ai|Applebot|Applebot-Extended|bedrockbot|Brightbot 1.0|Bytespider|ChatGPT-User|Claude-SearchBot|Claude-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Cotoyogi|Crawlspace|Diffbot|DuckAssistBot|EchoboxBot|FacebookBot|facebookexternalhit|Factset_spyderbot|FirecrawlAgent|FriendlyCrawler|Google-CloudVertexBot|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo Bot|meta-externalagent|Meta-ExternalAgent|meta-externalfetcher|Meta-ExternalFetcher|MistralAI-User/1.0|MyCentralAIScraperBot|NovaAct|OAI-SearchBot|omgili|omgilibot|Operator|PanguBot|Panscient|panscient.com|Perplexity-User|PerplexityBot|PetalBot|PhindBot|Poseidon Research Crawler|QualifiedBot|QuillBot|quillbot.com|SBIntuitionsBot|Scrapy|SemrushBot|SemrushBot-BA|SemrushBot-CT|SemrushBot-OCOB|SemrushBot-SI|SemrushBot-SWA|Sidetrade indexer bot|TikTokSpider|Timpibot|VelenPublicWebCrawler|Webzio-Extended|wpbot|YandexAdditional|YandexAdditionalBot|YouBot
   action: DENY
```
+2 -1
data/crawlers/_allow-good.yaml
```diff
 - import: (data)/crawlers/internet-archive.yaml
 - import: (data)/crawlers/kagibot.yaml
 - import: (data)/crawlers/marginalia.yaml
-- import: (data)/crawlers/mojeekbot.yaml
+- import: (data)/crawlers/mojeekbot.yaml
+- import: (data)/crawlers/commoncrawl.yaml
```
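Anubis applies the first rule that matches a request, so this new ALLOW entry only takes effect when the `_allow-good.yaml` imports are evaluated before the DENY lists. A minimal sketch of that ordering, assuming a top-level `bots:` key as in the default `botPolicies.yaml` (the exact set of imports shown is illustrative, not the shipped default):

```yaml
# Illustrative ordering only, not the shipped default policy file.
bots:
  # Allow rules are imported first so they win over the deny catchalls below.
  - import: (data)/crawlers/_allow-good.yaml
  - import: (data)/bots/ai-robots-txt.yaml
  - import: (data)/bots/ai-catchall.yaml
```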
+12
data/crawlers/commoncrawl.yaml
```diff
+- name: common-crawl
+  user_agent_regex: CCBot
+  action: ALLOW
+  # https://index.commoncrawl.org/ccbot.json
+  remote_addresses:
+    [
+      "2600:1f28:365:80b0::/60",
+      "18.97.9.168/29",
+      "18.97.14.80/29",
+      "18.97.14.88/30",
+      "98.85.178.216/32",
+    ]
```
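Because the rule sets both `user_agent_regex` and `remote_addresses`, a request should only be allowed when its User-Agent matches CCBot *and* its source address falls inside one of Common Crawl's published ranges; a scraper merely spoofing the CCBot user agent from another network falls through to the remaining rules. Operators who would rather keep blocking Common Crawl can lean on the same first-match ordering. A hedged sketch of such an override (the rule name `deny-common-crawl` is made up for illustration):

```yaml
# Illustrative override: deny CCBot before the default allow list is imported.
bots:
  - name: deny-common-crawl # hypothetical rule name
    user_agent_regex: CCBot
    action: DENY
  - import: (data)/crawlers/_allow-good.yaml
```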
+1
docs/docs/CHANGELOG.md
```diff
 - Add translation for German language ([#741](https://github.com/TecharoHQ/anubis/pull/741))
 - Remove the "Success" interstitial after a proof of work challenge is concluded.
 - Add option for forcing a specific language ([#742](https://github.com/TecharoHQ/anubis/pull/742))
+- Allow [Common Crawl](https://commoncrawl.org/) by default so scrapers have less incentive to scrape

 ### Potentially breaking changes
```