Preface
After finishing that article in the morning, I slept through the afternoon, and in the evening I figured I'd try Scrapy and compare which approach is faster. This was my first time downloading images with Scrapy; my first attempt used requests... ridiculously slow, since it was single-threaded. Later I dug through the docs, adapted the official example, and got it working. This post covers the pitfalls I ran into; the speed comparison is at the end.
Main Text
I'll skip the site analysis; see the previous article.
First, create a new project:
➜ scrapy git:(master) ✗ scrapy startproject ins_crawl
Next, generate the spider:
➜ scrapy git:(master) ✗ cd ins_crawl
➜ ins_crawl git:(master) ✗ scrapy genspider ins instagram.com
For easier reading, here's the project tree first:
.
├── ins_crawl
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-37.pyc
│   │   ├── items.cpython-37.pyc
│   │   ├── middlewares.cpython-37.pyc
│   │   ├── pipelines.cpython-37.pyc
│   │   └── settings.cpython-37.pyc
│   ├── images
│   │   ├── InsImagesPipeline.py
│   │   ├── __init__.py
│   │   └── __pycache__
│   │       ├── InsImagesPipeline.cpython-37.pyc
│   │       └── __init__.cpython-37.pyc
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-37.pyc
│       │   ├── config.cpython-37.pyc
│       │   └── ins.cpython-37.pyc
│       ├── config.py
│       └── ins.py
└── scrapy.cfg
6 directories, 21 files
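(Note: the images package with InsImagesPipeline.py, and spiders/config.py, are files I added by hand; startproject does not generate them.)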
Open the ins_crawl/spiders/ins.py file; the code follows, and do read the comments:
import scrapy
import requests
import json
import logging
from urllib.parse import (urlencode, urljoin)
from ins_crawl.spiders.config import *
from ins_crawl.items import InsCrawlItem

LOGGER = logging.getLogger(__name__)


class InsSpider(scrapy.Spider):
    name = 'ins'
    allowed_domains = ['instagram.com']
    start_urls = ['http://instagram.com/']

    def __init__(self, username='taeri__taeri', *args, **kwargs):
        """
        :param username: the account to crawl; can be passed on the command line
        """
        super(InsSpider, self).__init__(*args, **kwargs)
        self.username = username
        self.shared_data = self.get_shared_data()

    def request(self, end_cursor, callback):
        """
        The request method; does what its name suggests.
        """
        url = urljoin(self.start_urls[0], 'graphql/query/') + '?'
        params = {
            'query_hash': 'f2405b236d85e8296cf30347c9f08c2a',
            'variables': '{{"id":"{0}","first":{1},"after":"{2}"}}'.format(
                self.user_id, 50, end_cursor),
        }
        url = url + urlencode(params)
        request = scrapy.Request(url=url, callback=callback,
                                 meta={'proxy': 'http://127.0.0.1:8001'})
        request.cookies['csrftoken'] = CSRFTOKEN
        request.headers['User-Agent'] = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) '
                                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                                         'Chrome/74.0.3729.131 Safari/537.36')
        return request

    def start_requests(self):
        """
        Override start_requests.
        """
        if self.shared_data is not None:
            user = self.shared_data['entry_data']['ProfilePage'][0]['graphql']['user']
            self.user_id = user['id']
            self.count = user['edge_owner_to_timeline_media']['count']
            LOGGER.info('\n{}\nUser id:{}\nTotal {} photos.\n{}\n'.format(
                '-' * 20, self.user_id, self.count, '-' * 20))
            for i, url in enumerate(self.start_urls):
                yield self.request("", self.parse_item)
        else:
            LOGGER.error('-----[ERROR] shared_data is None.')

    def parse_item(self, response):
        j = json.loads(response.text)
        edge_media = j['data']['user']['edge_owner_to_timeline_media']
        edges = edge_media['edges']
        if edges:
            for edge in edges:
                item = InsCrawlItem()
                item['image_url'] = edge['node']['display_url']
                item['username'] = self.username
                yield item
            has_next_page = edge_media['page_info']['has_next_page']
            if has_next_page:
                end_cursor = edge_media['page_info']['end_cursor']
                yield self.request(end_cursor, self.parse_item)
            else:
                LOGGER.info('All photos fetched.')

    def get_shared_data(self):
        """
        Fetch the window._sharedData JSON embedded in the profile page.
        :return: parsed dict, or None on failure
        """
        try:
            proxies = {
                'http': 'http://' + PROXY,
                'https': 'https://' + PROXY
            }
            with requests.get(self.start_urls[0] + self.username,
                              proxies=proxies) as resp:
                html = resp.text
                if html is not None and '_sharedData' in html:
                    shared_data = html.split("window._sharedData = ")[1].split(
                        ";</script>")[0]
                    if not shared_data:
                        print('Not found [shared data]')
                        exit(1)
                    return json.loads(shared_data)
        except Exception as exc:
            LOGGER.error('[-----] %s', repr(exc))
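The spider can then be launched with scrapy crawl, feeding the username in from the command line via -a (this is Scrapy's standard mechanism for the username argument of __init__ above):

➜ ins_crawl git:(master) ✗ scrapy crawl ins -a username=taeri__taeri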
config.py
PROXY = '127.0.0.1:8001'
CSRFTOKEN = ''
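CSRFTOKEN is left empty here for you to fill in. One possible way to grab a value, as an untested sketch that assumes Instagram still sets a csrftoken cookie on a plain GET of the home page:

import requests

PROXY = '127.0.0.1:8001'

# Request the home page through the same local proxy and read the
# csrftoken cookie from the response; paste the value into config.py.
resp = requests.get('https://www.instagram.com/',
                    proxies={'https': 'https://' + PROXY})
print(resp.cookies.get('csrftoken'))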
items.py
import scrapy


class InsCrawlItem(scrapy.Item):
    image_url = scrapy.Field()
    username = scrapy.Field()
I didn't change pipelines.py, so I won't paste it.
InsImagesPipeline.py, adapted from the official example: media-pipeline
import logging
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

LOGGER = logging.getLogger(__name__)


class InsImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        image_url = item['image_url']
        yield scrapy.Request(image_url, meta={'proxy': 'http://127.0.0.1:8001'})

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        print('-----[DOWNLOADING] start download:', item['image_url'])
        return item
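On a related note, ImagesPipeline can also be told where to put each file by overriding file_path(). Here is a minimal sketch that groups photos into per-user folders; the class name PerUserImagesPipeline is hypothetical and not part of this project:

import os
from urllib.parse import urlparse

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class PerUserImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # carry the username in meta so file_path can read it back
        yield scrapy.Request(item['image_url'],
                             meta={'proxy': 'http://127.0.0.1:8001',
                                   'username': item['username']})

    def file_path(self, request, response=None, info=None):
        # save as <IMAGES_STORE>/<username>/<original file name>
        name = os.path.basename(urlparse(request.url).path)
        return '{}/{}'.format(request.meta['username'], name)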
InsProxyMiddlewares.py
from ins_crawl.spiders.config import *


class InsProxyMiddlewares(object):

    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8001'
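Note that the spider above already sets meta['proxy'] on every request, so this middleware is strictly redundant there. If you want to rely on it instead, it has to be registered in settings.py, and the dotted path must match wherever the class actually lives; assuming the file sits directly in the ins_crawl package, the entry would look like:

DOWNLOADER_MIDDLEWARES = {
    'ins_crawl.InsProxyMiddlewares.InsProxyMiddlewares': 543,
}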
settings.py
BOT_NAME = 'scrapy'

SPIDER_MODULES = ['ins_crawl.spiders']
NEWSPIDER_MODULE = 'ins_crawl.spiders'

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
}

DOWNLOADER_MIDDLEWARES = {
    'ins_crawl.middlewares.InsCrawlDownloaderMiddleware': 543,
}

ITEM_PIPELINES = {
    'ins_crawl.pipelines.InsCrawlPipeline': 2,
    'ins_crawl.images.InsImagesPipeline.InsImagesPipeline': 1,
}

IMAGES_STORE = '/Users/2h0n91i2hen/Pictures/Instagram/'
Run
The user taeri__taeri currently has 430 photos.
Fetching all 430 photos:
Scrapy took 87 seconds, about 0.2023 s per photo.
asyncio+aiohttp took 21 seconds, about 0.0488 s per photo.
Then I switched to another user, ponysmakeup, with 964 photos:
asyncio+aiohttp took 42.9 seconds, about 0.0445 s per photo.
Scrapy took 159.9 seconds, about 0.1659 s per photo.
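One caveat: the settings.py above never touches Scrapy's concurrency settings, so this is an untuned Scrapy run. The standard knobs to experiment with look like this (the values are just an example, not something benchmarked here):

# in settings.py
CONCURRENT_REQUESTS = 32             # default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 32  # default is 8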
Summary
I expected Scrapy to be fast, but it turned out it couldn't beat the asyncio+aiohttp lineup. Next I plan to write a proxy pool with aiohttp+asyncio+aioredis; I'm budgeting about a week for it, though who knows how long it will really take (:
Project: ins_crawl