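Scrapy

For lots of simple pages, I got tired of hand-writing crawlers line by line, so I decided to give Scrapy a try. The main things I wanted to evaluate:

Development efficiency
Exception handling
Anti-ban mechanisms

For usage, see the tutorial on the official site, scrapy.org. It is very thorough.

Installing Scrapy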

sudo pip install scrapy

On Mac OS X 10.11 El Capitan this fails with

build/temp.macosx-10.10-x86_64-2.7/_openssl.c:400:10: fatal error: 'openssl/aes.h' file not found

At first I assumed openssl was not installed, so I tried

$ brew install openssl
Warning: openssl-1.0.1j already installed

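That was not the cause. The real problem is that Mac OS X 10.11 El Capitan dropped the openssl header files. See the post "mac osx 10.11 编译 git 2.6.1 报错" (error when compiling git 2.6.1 on Mac OS X 10.11).

The fix is to run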

xcode-select -p

This prints the Xcode working directory. Mine is /Applications/Xcode.app/Contents/Developer. Change into that directory and search for ssl.h

$ cd /Applications/Xcode.app/Contents/Developer
$ find . -name ssl.h
./Toolchains/XcodeDefault.xctoolchain/usr/lib/swift-migrator/sdk/MacOSX.sdk/usr/include/openssl/ssl.h

Go to the directory that contains openssl and copy it into /usr/local/include/

cd Toolchains/XcodeDefault.xctoolchain/usr/lib/swift-migrator/sdk/MacOSX.sdk/usr/include/
cp -R openssl /usr/local/include/

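Why copy it to /usr/local/include/? Because the error log shows that clang was invoked with -I/usr/local/include/

In fact, the Scrapy installation above is still not complete. If running it raises one of the errors below, apply the command that follows the error. If not, skip this step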

ImportError: Twisted requires zope.interface 3.6.0 or later: no module named zope.interface.

sudo pip install zope.interface

ImportError: No module named cryptography.hazmat.bindings.openssl.binding

sudo pip install pyOpenSSL==0.13

Ubuntu is not trouble-free either. On 14.04 the install fails with

fatal error: libxml/xmlversion.h: No such file or directory

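The fix is to remove the partial install and add the missing development headers, then install Scrapy again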

sudo pip uninstall scrapy
sudo apt-get install libxml2-dev libxslt1-dev python-dev

On a VPS you may also run out of memory

command 'x86_64-linux-gnu-gcc' failed with exit status 4

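To diagnose the problem, check dmesg | tail

For the fix, see the note on setting up a swap partition

Hello world! Crawling a single page as a warm-up

For example, scrape the quotes from reddit's r/quotes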

# -*- coding: utf-8 -*-

import scrapy

class RedditQuoteSpider(scrapy.Spider):
    name = 'reddit-quote'
    start_urls = ['https://www.reddit.com/r/quotes/']

    def parse(self, response):
        # Select the text of every <a class="title"> link and print it.
        for text in response.css('a.title::text'):
            print(text.extract())

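Simple, right? Compared with the BeautifulSoup + Requests combination, the code is more concise and skips some boilerplate. Add my hand-rolled vim snippets and the development speed gets scary.

Scrapy shell - a killer debugging tool

Scrapy shell uses IPython when it is installed (otherwise it falls back to the standard Python console), so install IPython beforehand.

I used to hit the same problem over and over: when debugging a crawler's selectors (locating the target elements), every tweak meant rerunning the crawler, and every run fetched the original page again, which wasted a huge amount of time. Scrapy shell solves this. For example, to debug scraping of the sunzhongwei.com page, run this in a terminal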

scrapy shell "http://www.sunzhongwei.com"

You will see output like this

[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x104c6bb10>
[s]   item       {}
[s]   request    <GET http://www.sunzhongwei.com>
[s]   response   <200 http://www.sunzhongwei.com>
[s]   settings   <scrapy.settings.Settings object at 0x104c6ba90>
[s]   spider     <DefaultSpider 'default' at 0x10519f950>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

Clearly, response is the object we will use most when debugging: it is the fetched page. For example, to grab the title of the current page

In [11]: print(response.css("title::text")[0].extract())
大象笔记

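Here is a quick introduction to the two kinds of selectors built into Scrapy for extracting page content

css, similar to jQuery selectors; this is the one that suits me better
xpath

Both css and xpath return a selector list. To get a list of unicode strings out of it, use extract() or re(). When you only need a single value, extract_first() and re_first() are much more convenient than extract()[0]: they return None when nothing matches, whereas extract()[0] raises an IndexError that you would have to handle.

A tip: do you need to start a new scrapy shell for every page you debug? No. You can also start scrapy shell without a URL and load the URL from inside the shell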

scrapy shell
In [4]: fetch("http://www.baidu.com")

An even handier debugging tool than Scrapy shell - Chrome

How to get an XPath

Right click on the node => "Copy XPath"

How to verify an XPath

You can use $x in the Chrome javascript console. No extensions needed.

ex: $x("//img")

Also the search box in the web inspector will accept xpath

Reference: Is there a way to get the xpath in google chrome?

Chrome really is an amazing tool!

Bootstrapping Scrapy code

In fact, the code for the Hello world example can be generated automatically like this

scrapy startproject reddit
cd reddit
scrapy genspider quote reddit.com
cd reddit/spiders
vim quote.py

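If you cannot remember the commands, no problem: scrapy -h lists them. Run it from inside a project and more commands show up.

How to keep database writes from blocking page crawling in Scrapy

See https://segmentfault.com/a/1190000002645467 and http://doc.scrapy.org/en/latest/topics/item-pipeline.html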
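As a rough illustration of the idea, here is a minimal sketch of an item pipeline that hands each insert to a single background thread, so process_item returns immediately. This is a swapped-in technique built on the standard library, not necessarily what the linked articles do (Twisted's adbapi is another option). The class name, the quotes table, the "text" field and the database path are all hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


class SQLiteQuotePipeline:
    """Hypothetical pipeline: queues each insert on one worker thread."""

    def __init__(self, db_path="quotes.db"):
        self.db_path = db_path  # assumed location of the SQLite file

    def open_spider(self, spider):
        # A single worker keeps writes ordered and confines the
        # SQLite connection to one thread.
        self.executor = ThreadPoolExecutor(max_workers=1)
        self.executor.submit(self._connect).result()

    def _connect(self):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS quotes (text TEXT)")
        self.conn.commit()

    def _insert(self, text):
        self.conn.execute("INSERT INTO quotes (text) VALUES (?)", (text,))
        self.conn.commit()

    def process_item(self, item, spider):
        # Queue the write and return at once; the crawl does not wait for it.
        self.executor.submit(self._insert, item["text"])
        return item

    def close_spider(self, spider):
        # Runs after all queued inserts, then releases the worker thread.
        self.executor.submit(self.conn.close).result()
        self.executor.shutdown(wait=True)
```

To try it, the class would be registered in the project's ITEM_PIPELINES setting under whatever module path it lives in. Note that this sketch drops write errors silently, because the futures returned by submit are never checked; a real pipeline would want to log them.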

Avoiding bans

After setting USER_AGENTS, the Nginx log clearly shows that the User-Agent header of the requests has changed.

112.249.229.127 - - [24/Apr/2016:17:22:53 +0800] "GET / HTTP/1.1" 200 10725 "-" "Scrapy/1.0.3 (+http://scrapy.org)"
112.249.229.127 - - [24/Apr/2016:17:25:50 +0800] "GET / HTTP/1.1" 200 10725 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

For crawlers with a low request rate, that is enough.

Reference: 如何让你的scrapy爬虫不再被ban (How to stop your Scrapy crawler from getting banned)
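USER_AGENTS is not a built-in Scrapy setting as far as I know (the built-in one is the single-valued USER_AGENT), so a list like this is normally consumed by a custom downloader middleware. Below is a minimal sketch of such a middleware. The class name and the second user agent string are made up for illustration; the first string is the one seen in the log above.

```python
import random

# Hypothetical pool of user agent strings to rotate through.
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; "
    ".NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0",
]


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: picks a random User-Agent per request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header before the request is sent.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

To take effect, the class has to be listed in the DOWNLOADER_MIDDLEWARES setting under its module path. A priority below 500 (for example 400) makes it run before Scrapy's own UserAgentMiddleware, which only fills in the header when it is not already set.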

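About the author 🌱

I am a developer from Yantai, Shandong. If you have an interesting topic or a software development need, feel free to add me on WeChat (zhongwei) for a chat.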
