A web scraper built to search for specific information about a given compound (and its pseudonyms)
0
fork

Configure Feed

Select the types of activity you want to include in your feed.

Added a formal pipeline to make sure that we don't supply duplicate values.

+21 -1
+18 -1
Scrapy/pipelines.py
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem


class FourmiPipeline(object):

    def __init__(self):
        # (attribute, value) pairs already passed through this pipeline.
        self.known_values = set()

    def process_item(self, item, spider):
        """
        Process incoming items, dropping exact duplicates.

        :param item: The incoming item
        :param spider: The spider which scraped the item
        :return: The item, if its (attribute, value) pair is new
        :raise DropItem: If the (attribute, value) pair was already seen
        """
        value = item['attribute'], item['value']
        if value in self.known_values:
            raise DropItem("Duplicate item found: %s" % item)
        self.known_values.add(value)
        return item
+3
Scrapy/settings.py
SPIDER_MODULES = ['Scrapy.spiders']
NEWSPIDER_MODULE = 'Scrapy.spiders'

# Enable the duplicate-dropping pipeline for all spiders.
ITEM_PIPELINES = {
    'Scrapy.pipelines.FourmiPipeline': 100,
}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Fourmi (+http://www.yourdomain.com)'