How to Extract and Process Data with Scrapy - duidaima 堆代码

Published 2 months ago

人走茶凉

In this article we take a closer look at Scrapy and learn how to use it to extract and process data.

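1. Data Extraction: Selectors and Items
In Scrapy, data is extracted mainly through Selectors. A Selector picks elements out of an HTML document using XPath or CSS expressions. In a spider, calling the xpath or css method on the response object returns the matching selectors (a SelectorList). For example, we can modify our QuotesSpider to use Selectors to extract the text and author of each quote: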

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

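    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for quote in response.css('div.quote'):
            # ::text selects the text node; get() returns the first match, or None.
            text = quote.css('span.text::text').get()
            author = quote.css('span small::text').get()
            print(f'Text: {text}, Author: {author}')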

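Scrapy also provides the Item class, which lets you declare the structure of the data you want to collect. Items are a good fit for structured data such as the quotes we scrape from quotes.toscrape.com: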

import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()

We can then modify the QuotesSpider so that it creates and yields QuoteItem objects:

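import scrapy

# QuoteItem is the Item class defined above. If it lives in another module,
# import it here (the module path depends on your project layout).

class QuotesSpider(scrapy.Spider):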
    name = "quotes"

    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('span small::text').get()
            yield item

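2. Data Processing: Pipelines
Scrapy uses item pipelines to process the Items that a spider scrapes from web pages. Every Item the spider yields is sent to the Item Pipeline for processing.
The Item Pipeline is a set of components that run in a defined order, each one a single processing step. Every component is a Python class that must implement a process_item method. The method must either return an item object or raise a DropItem exception; a dropped item is not processed by any later pipeline component.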

For example, we can add a Pipeline that saves the collected quotes to a JSON Lines file (one JSON object per line):

import json

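class JsonWriterPipeline:

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open('quotes.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.file.close()

    def process_item(self, item, spider):
        # Write each item as one JSON object per line (JSON Lines).
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item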
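Because the pipeline writes one JSON object per line, the output can be read back with the standard library alone. The sketch below uses a hypothetical helper, read_json_lines, and a throwaway sample file with invented data, so it runs without Scrapy or a real crawl.

```python
import json
import tempfile
from pathlib import Path


def read_json_lines(path):
    """Return the objects stored one per line in a JSON Lines file."""
    items = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items


# Demonstration with a temporary file in the same format as quotes.jl.
sample = [
    {"text": "An example quote.", "author": "Example Author"},
    {"text": "Another example.", "author": "Second Author"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "quotes.jl"
    path.write_text(
        "".join(json.dumps(row) + "\n" for row in sample), encoding="utf-8"
    )
    loaded = read_json_lines(path)

print(loaded)
```

Reading line by line keeps memory use flat for large crawls, which is one reason JSON Lines is a convenient output format for scrapers.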

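Then enable the Pipeline in the project's settings file (settings.py). The integer assigned to each pipeline sets its run order: lower values run first, and values are conventionally chosen in the 0-1000 range: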

ITEM_PIPELINES = {
   'tutorial.pipelines.JsonWriterPipeline': 1,
}
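When more than one pipeline is enabled, the order values decide which component sees an item first. The fragment below is only a sketch: RequireTextPipeline is a hypothetical validation class that the original project does not define, and the values 100 and 300 are arbitrary.

```python
# settings.py (sketch; RequireTextPipeline is a hypothetical class)
ITEM_PIPELINES = {
    # Lower value runs first: validate items before writing them out.
    'tutorial.pipelines.RequireTextPipeline': 100,
    'tutorial.pipelines.JsonWriterPipeline': 300,
}

# The effective run order, from first to last.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Leaving gaps between the values makes it easy to slot another component in later without renumbering.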

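In this article we looked more closely at Scrapy, including how to extract data with Selectors and Items and how to process it with Pipelines. In the next article we will cover more complex situations, such as logging in, handling cookies, and keeping a crawler from being detected and blocked by websites.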
