How do you implement an asynchronous crawler in Python?
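Because every request blocks on network I/O, a sequential crawler spends most of its time waiting. We can speed it up by fetching pages concurrently, using either multiple threads or coroutines. Here we use Python coroutines, provided by gevent.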
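# coding: utf8
"""Coroutine-based crawler that fetches pages concurrently."""
# Patch the standard library before importing requests so its sockets cooperate with gevent
from gevent import monkey
monkey.patch_all()

import requests
from lxml import etree
from gevent.pool import Pool


def main():
    # 1. Define the page URLs and the parsing rule
    crawl_urls = [
        'https://book.douban.com/subject/25862578/',
        'https://book.douban.com/subject/26698660/',
        'https://book.douban.com/subject/2230208/'
    ]
    rule = "//div[@id='wrapper']/h1/span/text()"
    # 2. Crawl: at most 10 greenlets run at the same time
    pool = Pool(size=10)
    for url in crawl_urls:
        pool.spawn(crawl, url, rule)
    # Wait for every greenlet to finish
    pool.join()


def crawl(url, rule):
    # 3. Send the HTTP request
    response = requests.get(url, timeout=10)
    # 4. Parse the HTML; the XPath may match nothing if the page is blocked or changed
    matches = etree.HTML(response.text).xpath(rule)
    result = matches[0] if matches else None
    # 5. Save the result (printed here for simplicity)
    print(result)


if __name__ == '__main__':
    main()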
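
The multithreading alternative mentioned at the start can be sketched in a similar way with the standard library's thread pool. This is only a minimal illustration, not part of the original crawler: `crawl_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real request-and-parse step so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_all(urls, fetch, max_workers=10):
    """Call fetch(url) for every URL on a thread pool.

    Results are returned in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))


def fake_fetch(url):
    # Hypothetical stand-in for an HTTP request followed by XPath parsing
    return "title for " + url


if __name__ == '__main__':
    urls = ['https://example.com/a', 'https://example.com/b']
    for title in crawl_all(urls, fake_fetch):
        print(title)
```

In a real crawler, a function that takes a single URL, requests the page, and returns the parsed title would be passed in place of `fake_fetch`. Threads suit a small number of URLs; coroutines usually scale better when there are many concurrent connections.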