While checking Webmaster Tools, I noticed that my blog's indexed page count kept dropping, and scraper sites were starting to beat my own posts in the search results. Time to fight back. Now that I know my blog is under attack, let's declare war on the crawlers.
So I once again pulled out goaccess to analyze the nginx logs.
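For reference, the invocation looks roughly like this; the log path and the COMBINED format are assumptions about a typical setup, so adjust them to yours:

```sh
# Open an interactive terminal report over the nginx access log
# (log path and COMBINED format are assumptions about a typical setup)
goaccess -f /var/log/nginx/access.log --log-format=COMBINED
```

The browser breakdown it produced looked like this: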
| Visitors | % | Name |
|---|---|---|
| 58222 | 50.71% | Unknown |
| 21231 | 18.49% | Safari |
| 18422 | 16.04% | Chrome |
| 7049 | 6.14% | Crawlers |
At least 6% of the traffic is crawlers, and there is no way to tell what the 50% of Unknown traffic really is. Some of these crawlers are legitimate, but of course there are bad ones in the mix too. Let's get rid of those pests first. A search turned up some nginx configuration for the job:
```nginx
# These rules go inside the server {} (or a location {}) block

# Forbid scraping tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}

# Block specific User-Agents as well as empty ones
if ($http_user_agent ~ "FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$") {
    return 403;
}

# Forbid any request method other than GET|HEAD|POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
```
This conveniently blocks curl as well:
```sh
$ curl -I -s www.phodal.com
HTTP/1.1 403 Forbidden
Server: mokcy/0.17.9
Date: Thu, 09 Apr 2015 14:22:42 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 169
Connection: keep-alive
```
When the User-Agent matches one of the crawlers above, the server returns 403. But if the client identifies itself as Googlebot:
```sh
$ curl -I -s -A 'Googlebot' www.phodal.com
HTTP/1.1 200 OK
Server: mokcy/0.17.9
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Vary: Accept-Encoding
Vary: Accept-Language, Cookie
Content-Language: en
X-UA-Compatible: IE=Edge,chrome=1
Date: Thu, 09 Apr 2015 14:39:11 GMT
X-Page-Speed: Powered By Phodal
Cache-Control: max-age=0, no-cache
```
Exactly as expected.

Of course, this is only the beginning; there is more work to do.
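One obvious next measure is rate limiting, so that even crawlers that forge a browser User-Agent cannot hammer the site. Here is a minimal sketch using nginx's limit_req module; the zone name `antibot` and the rates are illustrative assumptions, not values from my actual config:

```nginx
# In the http {} block: track clients by IP and allow 10 requests/second on average
# (the zone name "antibot" and the numbers below are placeholder assumptions)
limit_req_zone $binary_remote_addr zone=antibot:10m rate=10r/s;

server {
    location / {
        # Permit bursts of up to 20 extra requests, reject the rest (503 by default)
        limit_req zone=antibot burst=20 nodelay;
    }
}
```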
And for those running Apache instead of nginx, here is an equivalent configuration:
<Directory "/var/www">
# Anti crawlers
SetEnvIfNoCase User-Agent ".*(^$|FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms)" BADBOT
Deny from env=BADBOT
Order Allow,Deny
Allow from all
</Directory>
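Note that `Order`/`Allow`/`Deny` is Apache 2.2 syntax. If you are on Apache 2.4, the equivalent with mod_authz_core would look roughly like this (the UA list is shortened here for readability):

```apache
<Directory "/var/www">
    # Flag bad bots exactly as above (full UA list omitted for brevity)
    SetEnvIfNoCase User-Agent "(FeedDemon|AhrefsBot|MJ12bot|YisouSpider)" BADBOT
    <RequireAll>
        # Allow everyone except requests flagged as BADBOT
        Require all granted
        Require not env BADBOT
    </RequireAll>
</Directory>
```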