Published 2019-03-20 · Updated 2025-04-15 · Tech / python / crawlers

Running a crawler from within a Python project

The usual way to use scrapy is to write a Spider and then start it with a command-line instruction, for example:

scrapy runspider quotes_spider.py -o quotes.json

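This approach is a poor fit for large projects and for customized crawls, so we need a way to start a crawl from our own code through scrapy's API.
A common alternative is to use scrapyd or Scrapy Cloud; neither is used here.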

Using CrawlerProcess

scrapy.crawler.CrawlerProcess starts a Twisted reactor for you and sets up the log format and shutdown handlers. Scrapy's own commands use this class to start their crawls.
Here is a simple example:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess(settings={
    # FEEDS replaces the deprecated FEED_FORMAT / FEED_URI pair (Scrapy 2.1+)
    'FEEDS': {'items.json': {'format': 'json'}},
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

Using CrawlerRunner

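The scrapy.crawler.CrawlerRunner class gives you more control: it does not create a reactor and does not interfere with an existing one, so you run the reactor yourself and stop it once the crawl is done.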

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished

Original post: http://www.lephee.net/2019/03/20/scrapy-in-code/

Author: LePhee
