Crawling the titles of every page on a website with Go and colly

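I recently took over a long-neglected website whose page titles, keywords, and descriptions are a mess: many pages share the same title, or have very long titles stuffed with redundant keywords.

The site has so many sections and standalone pages that checking the code or the pages by hand is unlikely to find every problem page.

To find the problem pages quickly, I decided to write a crawler in Go with colly that automatically checks the title and other SEO settings across the whole site.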

hello world, colly

Using the Douban website as an example:

package main

import (
	"fmt"
	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

func main() {
	c := colly.NewCollector()
	extensions.RandomUserAgent(c)
	extensions.Referer(c)

	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println(e.Text)
	})

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	//c.Visit("http://go-colly.org/")
	c.Visit("https://www.douban.com")
}

Output:

> go run main.go
Visiting https://www.douban.com
豆瓣
Visiting https://book.douban.com
豆瓣读书
Visiting https://accounts.douban.com/passport/login?source=book
登录豆瓣
...

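However, the results also include links to Sina Weibo, so the crawler needs to be restricted to specific domains.

Restricting the requested domains

Reference: http://go-colly.org/docs/examples/basic/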

c := colly.NewCollector(
    colly.AllowedDomains("www.douban.com"),
)

If you need URL-level precision, see colly's URL filter example: http://go-colly.org/docs/examples/url_filter/

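Rate limiting

Since I'm checking my own site, I'm worried about crawling my own server to death... so rate limiting is a must. (The snippet below needs the "time" package imported.)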

c.Limit(&colly.LimitRule{
	DomainGlob:  "www.douban.com",
	Parallelism: 1,
	Delay:       2 * time.Second,
})

The LimitRule fields in detail:

type LimitRule struct {
    // DomainRegexp is a regular expression to match against domains
    DomainRegexp string
    // DomainGlob is a glob pattern to match against domains
    DomainGlob string
    // Delay is the duration to wait before creating a new request to the matching domains
    Delay time.Duration
    // RandomDelay is the extra randomized duration to wait added to Delay before creating a new request
    RandomDelay time.Duration
    // Parallelism is the number of the maximum allowed concurrent requests of the matching domains
    Parallelism int
    // contains filtered or unexported fields
}

Custom request headers

Set the request headers to those of either a desktop (PC) or a mobile client.

Full code

It's on GitHub:

https://github.com/sunzhongwei/go_seo_checker

References

colly's official documentation: https://godoc.org/github.com/gocolly/colly
colly examples: http://go-colly.org/docs/examples/basic/

