
antispider's Introduction

antispider

Notes on the anti-crawler measures I have run into and how I worked around them. Discussion welcome!

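No restrictions on second-level directories


First visit goes through a JS interstitial redirect, probably to verify the IP


Extremely long page load time


Discuz forum board API


Requires a valid Referer header


JS redirect: changde.py


Encrypted cookie verification on Tianyancha: test_down_tianyancha.py


Ridiculous CAPTCHA, with verification failing 99% of the time

http://xygs.gsaic.gov.cn/gsxygs/pub!list.do


Douban FM and other Douban sites: HTTPS, loosely checked cookie parameters: test_down_douban.py

URL gains a _dsign parameter after the JS executes: get_dsign.py

Page shows "Security check in progress..." and is redirected by JS to the normal page after 5 seconds

Text replaced with CSS styles

Limits on request rate and proxy type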

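  • https://m.guazi.com/bj/dazhong/
  • Keep the request rate below 0.5 requests/s.
  • When using a proxy, HTTP URLs need an HTTP proxy and HTTPS URLs need an HTTPS proxy; mixing them is equivalent to using no proxy at all.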
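
The two rules above can be sketched roughly as follows. This is only a minimal illustration, not code from this repository: the names `Throttle`, `proxies_for`, and `MIN_INTERVAL` are hypothetical, and the 2-second interval is simply the 0.5 requests/s limit turned into a delay.

```python
import time
from urllib.parse import urlsplit

# 0.5 requests/s means at most one request every 2 seconds.
MIN_INTERVAL = 2.0


class Throttle:
    """Hypothetical helper: wait() blocks until min_interval has passed."""

    def __init__(self, min_interval=MIN_INTERVAL,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def proxies_for(url, http_proxy, https_proxy):
    """Hypothetical helper: build a proxies dict keyed by the URL's scheme."""
    scheme = urlsplit(url).scheme.lower()
    if scheme == "https":
        return {"https": https_proxy}
    if scheme == "http":
        return {"http": http_proxy}
    raise ValueError("unsupported scheme: %r" % scheme)
```

Calling `throttle.wait()` before each request and passing the result of `proxies_for(...)` as the `proxies` argument of `requests.get` would be one way to use it; the proxy addresses themselves are up to you.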

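Clever use of \r, which behaves differently across platforms, to give crawler developers a headache

  • On Linux, \r is interpreted as a carriage return. If a page uses \r as its line break, it displays fine in the browser and on Windows, but when the text is printed on Linux the characters after each \r overwrite the ones before it, so the output shows far less than the page does. Anyone unfamiliar with what \r means will probably spend a very, very long time debugging this.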
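
One possible way to defend against this, shown here as a minimal sketch (the function name `normalize_newlines` is made up for illustration), is to normalize line endings before printing or storing the scraped text:

```python
def normalize_newlines(text):
    # Convert Windows (CR LF) and bare CR line endings to LF so that a
    # terminal never treats CR as "return to the start of the line".
    return text.replace("\r\n", "\n").replace("\r", "\n")


scraped = "first line\rsecond line\r\nthird line"
print(normalize_newlines(scraped))
```

The CR LF pair is replaced first so that it does not turn into two line breaks.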

Crawler tip: getting the MP4 download URL for Xigua Video

antispider's Issues

Question about character order

A question:
This is a piece of JS I extracted with a regex.
(%E3%80%82%E4%B8%80%E4%B8%89%E4%B8%8A%E4%B8%8B%E4%B8%8D%E4%BA%86%E4%BA%8C%E4%BD%8E%E5%92%8C%E5%9C%B0%E5%A4%9A%E5%A4%A7%E5%A5%BD%E5%B0%91%E5%BE%88%E5%BE%97%E6%98%AF%E7%9A%84%E7%9D%80%E8%BF%91%E9%AB%98%EF%BC%81%EF%BC%8CNx_());=IK_((Nx_()1;11;18;23;13;17;3;0;6;8;22;9;5;19;20;15;12;7;10;4;2;21;16;14),hZ_(;))
The decoded characters are (。一三上下不了二低和地多大好少很得是的着近高!,). So my question is: does 1;11;18;23;13;17;3;0;6;8;22;9;5;19;20;15;12;7;10;4;2;21;16;14 represent indices into those characters? Does it correspond to hs_kw(index)_mainBf?
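
One way to test that reading is sketched below. It rests on an assumption the thread does not confirm: that the k-th number in the list is the position, within the decoded string, of the character for `hs_kw{k}`. The variable names are illustrative only.

```python
from urllib.parse import unquote

# Percent-encoded characters and index list copied from the JS above.
encoded = ("%E3%80%82%E4%B8%80%E4%B8%89%E4%B8%8A%E4%B8%8B%E4%B8%8D"
           "%E4%BA%86%E4%BA%8C%E4%BD%8E%E5%92%8C%E5%9C%B0%E5%A4%9A"
           "%E5%A4%A7%E5%A5%BD%E5%B0%91%E5%BE%88%E5%BE%97%E6%98%AF"
           "%E7%9A%84%E7%9D%80%E8%BF%91%E9%AB%98%EF%BC%81%EF%BC%8C")
order = "1;11;18;23;13;17;3;0;6;8;22;9;5;19;20;15;12;7;10;4;2;21;16;14"

chars = list(unquote(encoded))               # one entry per decoded character
indices = [int(i) for i in order.split(";")]

# Assumption: keywords[k] is the text that replaces hs_kw{k}.
keywords = [chars[i] for i in indices]
for k, ch in enumerate(keywords):
    print(k, ch)
```

There are 24 decoded characters and the list is a permutation of 0 to 23, which fits the index interpretation; whether the result matches the visible page text still has to be checked against the page itself.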

Runtime error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x9a in position 918: illegal multibyte sequence
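
The report does not say where the error is raised. A common cause on Chinese-locale Windows is that Python 3's `open()` falls back to the locale encoding (gbk) when no `encoding` is given, and the file is actually in another encoding. Assuming that is what happened here, a minimal sketch of a workaround is to read the bytes and decode them explicitly; `read_text` and its fallback list are hypothetical, not part of this repository.

```python
def read_text(path, encodings=("utf-8", "gb18030")):
    # Read raw bytes and try each candidate encoding in turn instead of
    # relying on the platform default (gbk on Chinese-locale Windows).
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode %s with any of %r" % (path, encodings))
```

If the encoding is already known, passing it straight to `open(path, encoding="utf-8")` is simpler.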

Running autohome.py has no effect

The spans are not replaced:

&nbsp;&nbsp;&nbsp;&nbsp;自去年12月12日提车之后<span class='hs_kw0_mainmx'></span>基本<span class='hs_kw1_mainmx'></span>就没驾驶博越去<span class='hs_kw2_mainmx'></span>点<span class='hs_kw3_mainass='hs_kw4_mainmx'></span>方<span class='hs_kw5_mainmx'></span>这次<span class='hs_kw6_mainmx'></span>朋友商量之后<span class='hs_kw0_mainmx'></span>决定自驾去天津<span class='hs_kw0_mainmx'></s='hs_kw7_mainmx'></span>可以带<span class='hs_kw8_mainmx'></span><span class='hs_kw9_mainmx'></span>越越<span class='hs_kw0_mainmx'></span>去欣赏<span class='hs_kw10_mainmx'></span><span class='hainmx'></span>他乡<span class='hs_kw3_mainmx'></span>风光<span class='hs_kw5_mainmx'></span>由于<span class='hs_kw12_mainmx'></span>第<span class='hs_kw10_mainmx'></span>次去天津<span class='hs_k</span>所以道路<span class='hs_kw1_mainmx'></span><span class='hs_kw13_mainmx'></span>太熟悉<span class='hs_kw0_mainmx'></span>还<span class='hs_kw7_mainmx'></span>博越为我提供<span class='hs_kw1x'></span>精准<span class='hs_kw3_mainmx'></span>导航系统<span class='hs_kw0_mainmx'></span>跟随<span class='hs_kw8_mainmx'></span>博野<span class='hs_kw3_mainmx'></span>脚步<span class='hs_kw0_mn>踏<span class='hs_kw1_mainmx'></span>前往天津<span class='hs_kw3_mainmx'></span>征程<span class='hs_kw5_mainmx'></span><br />&nbsp;&nbsp;&nbsp;全程<span class='hs_kw15_mainmx'></span>速<span cl0_mainmx'></span>由保定北<span class='hs_kw1_mainmx'></span>京港澳<span class='hs_kw15_mainmx'></span>速北京方向<span class='hs_kw0_mainmx'></span>再转入荣乌<span class='hs_kw15_mainmx'></span>速'hs_kw5_mainmx'></span><span class='hs_kw10_mainmx'></span>路由朋友担当摄影<span class='hs_kw0_mainmx'></span>拍<span class='hs_kw3_mainmx'></span>照片都<span class='hs_kw12_mainmx'></span>路<spaw1_mainmx'></span><span class='hs_kw3_mainmx'></span>风景<span class='hs_kw5_mainmx'></span><span class='hs_kw15_mainmx'></span>速途中<span class='hs_kw10_mainmx'></span>路驾驶博越<span class='hsx'></span>给我<span class='hs_kw3_mainmx'></span>感觉非常稳重<span class='hs_kw0_mainmx'></span>方向指向精准<span class='hs_kw5_mainmx'></span><span class='hs_kw13_mainmx'></span><span 
class='hs_/span><span class='hs_kw13_mainmx'></span>说<span class='hs_kw0_mainmx'></span>吉利真<span class='hs_kw3_mainmx'></span><span class='hs_kw12_mainmx'></span>在用心造车<span class='hs_kw0_mainmx'><越<span class='hs_kw0_mainmx'></span>已经成为同级别车型中<span class='hs_kw3_mainmx'></span>标杆产品<span class='hs_kw5_mainmx'></span><br />&nbsp;&nbsp;<span class='hs_kw11_mainmx'></span>面<spamainmx'></span>就为<span class='hs_kw17_mainmx'></span>家奉<span class='hs_kw1_mainmx'></span>精美<span class='hs_kw17_mainmx'></span>图

a bug in antispider/autohome.py

Line 270 of antispider/autohome.py:

# Get all variables
var_regex = "var\s+(\w+)=(.*?);\s"

should be:

# Get all variables
var_regex = "var\s+(\w+)\s*=\s*([\'\"].*?[\'\"]);\s*"

because cases like the following exist:
var bs_=';12';

Thanks for your code. :)
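
The difference between the two patterns can be checked with a short sketch like the one below. The sample JS string is made up for illustration (two variables with no whitespace between statements); only the two patterns come from the issue.

```python
import re

# Pattern quoted in the issue as the current one.
OLD_VAR_REGEX = r"var\s+(\w+)=(.*?);\s"
# Pattern proposed in the issue: the value must be a quoted string, so a
# semicolon inside the quotes no longer ends the match early.
NEW_VAR_REGEX = r"""var\s+(\w+)\s*=\s*(['"].*?['"]);\s*"""

# Hypothetical minified input containing the case from the issue.
sample_js = "var a_='x';var bs_=';12';"

old_matches = re.findall(OLD_VAR_REGEX, sample_js)
new_matches = re.findall(NEW_VAR_REGEX, sample_js)
print(old_matches)
print(new_matches)
```

On this sample the old pattern finds nothing, because it requires whitespace after each semicolon, while the proposed one returns both variables with their quoted values.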

Exception on autohome SUV [FIXED]

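When crawling SUV models on Autohome, the program fails with an "index out of range" error.
Investigation shows that SUV is one of the obfuscated keywords, but because it is an English keyword it is not URL-escaped. The regex therefore fails to capture it, the dictionary ends up 3 entries short, and an index runs past the end of the dictionary at runtime, which causes the error.

For example: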

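import json
import requests
from autohome import get_params  # assumed to come from this repository's autohome.py

res = requests.get("http://car.autohome.com.cn/config/spec/1646.html")
res.encoding = 'gb18030'
item = get_params(res.text)
print(json.dumps(item, ensure_ascii=False, indent=4))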

The deobfuscated JS is shown below. SUV, as the first three characters, is not captured because it is not written in %xx form.

SUV%E4%B8%87%E4%B8%AD%E4%BA%AC%E4%BB%B7%E4%BC%98%E4%BD%93%E4%BE%9B%E4%BF%9D%E5%85%83%E5%85%A8%E5%87%86%E5%87%91%E5%88%97%E5%88%B6%E5%89%8D%E5%8A%9B%E5%8A%9F%E5%8A%A8%E5%8A%A9%E5%8C%97%E5%8D%8E%E5%8E%8B%E5%8F%B7%E5%90%88%E5%90%8D%E5%90%8E%E5%90%B8%E5%95%86%E5%96%B7%E5%99%A8%E5%9C%B0%E5%9E%8B%E5%A4%87%E5%A4%9A%E5%A4%A7%E5%A4%AE%E5%AD%90%E5%AE%9A%E5%AE%9E%E5%AE%B9%E5%AE%BD%E5%AF%B8%E5%AF%BC%E5%B0%BA%E5%B7%AE%E5%B9%B4%E5%BA%A6%E5%BC%8F%E5%BC%B9%E5%BE%84%E5%BE%B7%E6%82%AC%E6%88%96%E6%89%AD%E6%89%BF%E6%8C%87%E6%8E%92%E6%95%B0%E6%95%B4%E6%9C%80%E6%9C%BA%E6%9D%86%E6%9E%84%E6%9E%B6%E6%A0%87%E6%A0%BC%E6%A2%B0%E6%AC%A7%E6%AF%94%E6%B0%94%E6%B2%B9%E6%B5%8B%E6%B6%B2%E7%82%B9%E7%84%B6%E7%87%83%E7%8B%AC%E7%8E%87%E7%8E%AF%E7%94%B5%E7%9B%96%E7%9B%98%E7%9F%A9%E7%A6%BB%E7%A7%AF%E7%A7%B0%E7%A8%8B%E7%A8%B3%E7%AB%8B%E7%AE%B1%E7%B0%A7%E7%B4%A7%E7%BB%BC%E7%BC%A9%E7%BC%B8%E7%BD%AE%E8%80%97%E8%83%8E%E8%87%AA%E8%93%9D%E8%A1%8C%E8%A7%84%E8%B1%AA%E8%B4%A8%E8%B7%9D%E8%BD%A6%E8%BD%AC%E8%BD%AE%E8%BD%B4%E8%BD%BD%E8%BF%9B%E8%BF%9E%E9%80%9A%E9%80%9F%E9%85%8D%E9%87%8F%E9%93%81%E9%93%9D%E9%95%BF%E9%97%A8%E9%97%B4%E9%9A%99%E9%9B%85%E9%A3%8E%E9%A9%B1%E9%A9%BB%E9%AB%98%E9%BC%93C%

I suspect the other English letters in there will cause the same problem. I suggest fixing this by adjusting the regex.
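
One possible direction, sketched here under assumptions, is to skip the %xx-only regex and let `urllib.parse.unquote` decode the string, since it passes plain ASCII such as SUV through unchanged. `split_keywords` is a hypothetical helper, not the fix that was actually applied to autohome.py, and it assumes each decoded character is its own dictionary entry (which matches the dictionary being 3 entries short).

```python
from urllib.parse import unquote


def split_keywords(encoded):
    # unquote() decodes %XX escapes as UTF-8 and leaves unescaped ASCII
    # letters as they are, so "SUV" yields three separate entries.
    return list(unquote(encoded))


# First few characters of the string quoted above.
print(split_keywords("SUV%E4%B8%87%E4%B8%AD"))
```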
