ajax-crawler
============
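Write a crawler that crawls a target website with multiple threads and extracts every AJAX request URL together with its parameters. For example:

Target site: qq.com, baidu.com, weibo.com, etc. (pick any one).

Tech stack: Python 2.7+ / PhantomJS / MongoDB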

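Basic requirements:
1. An elegant design and implementation;
2. You may use Python built-in modules and well-regarded third-party modules to speed up the task;
3. Document your implementation approach clearly in README.md in the project directory;
4. Save screenshots of the program running and of its results in /screenshot/ in the project directory;
5. Do all of the coding work on GitHub.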

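Special requirements:
1. Use a concurrency mechanism to make the crawler more efficient: a thread pool or coroutines;
2. Cover as many subdomains of the site as possible and, while keeping the Cookie session alive, collect as many AJAX request URLs and parameters as possible across the whole domain (for example, all of qq.com);
3. Save these URLs, their parameters, and related fields to MongoDB;
4. On the command line, -h shows the program's help, -n sets the number of concurrent threads (default 10), and -l stops the crawl after a given number of AJAX links (default: no limit).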
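
The command-line options and the thread pool from the list above could be wired together roughly as follows. This is a minimal sketch that assumes Python 3 and only the standard library; the positional `url` argument, the `AjaxCollector` class, and the `fetch_ajax` callback are hypothetical names introduced for illustration and are not part of the task statement.

```python
import argparse
import threading
from concurrent.futures import ThreadPoolExecutor


def build_parser():
    """Command-line options from the requirements: -h, -n and -l."""
    parser = argparse.ArgumentParser(
        description="Crawl a site and collect AJAX request URLs and parameters.")
    # Hypothetical positional argument: the page to start crawling from.
    parser.add_argument("url", help="start URL")
    parser.add_argument("-n", type=int, default=10, dest="threads",
                        help="number of concurrent threads (default: 10)")
    parser.add_argument("-l", type=int, default=None, dest="limit",
                        help="stop after this many AJAX links (default: no limit)")
    return parser


class AjaxCollector:
    """Thread-safe store that refuses new records once the limit is reached."""

    def __init__(self, limit=None):
        self.limit = limit
        self.records = []
        self._lock = threading.Lock()

    def add(self, record):
        with self._lock:
            if self.limit is not None and len(self.records) >= self.limit:
                return False
            self.records.append(record)
            return True


def crawl(pages, fetch_ajax, threads=10, limit=None):
    """Run fetch_ajax(page) over pages in a thread pool and collect results."""
    collector = AjaxCollector(limit)

    def worker(page):
        for record in fetch_ajax(page):
            if not collector.add(record):
                break  # limit reached, stop processing this page

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() drains the iterator so worker exceptions surface here.
        list(pool.map(worker, pages))
    return collector.records
```

A caller would parse the arguments with `build_parser().parse_args()` and pass `args.threads` and `args.limit` to `crawl`. The lock keeps the `-l` limit exact even when several workers find AJAX links at the same moment.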

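When you submit your work to us, please include the GitHub URL for this task, along with a description of your local development environment and working habits.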

----------------------------------
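Approaches considered:
(Discarded) 1. Work out the AJAX URLs and parameters by analyzing the JavaScript. After analyzing Baidu's JS, I found that reconstructing the complete request URLs is quite difficult. It could be done by modifying a JS interpreter, but that would take a lot of time and effort.
2. Similar to XSS testing: use PhantomJS to trigger every event that can fire an AJAX request (onclick, keyboard events, etc.), then capture the URLs and parameters. (Tip: jQuery can hook all AJAX requests globally with $.ajaxPrefilter(function(){console.log(arguments)}).)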
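
The capture step in approach 2 could look something like the sketch below. Note that it swaps in a different hook from the jQuery tip: it wraps `XMLHttpRequest` directly, so that pages not using jQuery are also covered. The `AJAX ` log prefix, the `XHR_HOOK_JS` constant, and `parse_console_line` are hypothetical names; how the script is injected and how console output is read back depends on the browser driver and is left out here. Only URL-encoded request bodies are parsed.

```python
import json
from urllib.parse import urlsplit, parse_qsl

# JavaScript to inject before the page's own scripts run. It wraps
# XMLHttpRequest.open/send so every request is reported via console.log.
XHR_HOOK_JS = r"""
(function () {
  var open = XMLHttpRequest.prototype.open;
  var send = XMLHttpRequest.prototype.send;
  XMLHttpRequest.prototype.open = function (method, url) {
    this.__ajax = {method: method, url: url};
    return open.apply(this, arguments);
  };
  XMLHttpRequest.prototype.send = function (body) {
    this.__ajax.body = body || "";
    console.log("AJAX " + JSON.stringify(this.__ajax));
    return send.apply(this, arguments);
  };
})();
"""


def parse_console_line(line):
    """Turn one 'AJAX {...}' console line into address and parameter fields."""
    if not line.startswith("AJAX "):
        return None  # unrelated console output
    raw = json.loads(line[len("AJAX "):])
    parts = urlsplit(raw["url"])
    # Parameters can arrive in the query string, the request body, or both.
    params = parse_qsl(parts.query, keep_blank_values=True)
    params += parse_qsl(raw.get("body") or "", keep_blank_values=True)
    return {
        "method": raw["method"].upper(),
        "address": parts._replace(query="", fragment="").geturl(),
        "params": dict(params),
    }
```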
------------------------------------
Structure and tools:
Python (the main program; Scrapy may also be used) + Selenium with PhantomJS (AJAX data extraction) + MongoDB (data storage)
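
For the storage part, one possible record shape is sketched below. The field names (`found_on`, `seen_at`, and so on) and the helpers `make_document` and `save` are assumptions for illustration, not the project's actual schema. `collection` is assumed to be a pymongo-style collection object exposing `replace_one`. Parameters are stored as name/value pairs rather than as dictionary keys, because parameter names may contain characters such as `.` that are awkward in MongoDB field names.

```python
import hashlib
from datetime import datetime, timezone


def make_document(method, address, params, page_url):
    """Build one storable record; the field names are illustrative."""
    # The key ignores parameter values, so the same endpoint called with
    # different values maps to a single document.
    key = method + " " + address + "?" + "&".join(sorted(params))
    return {
        "_id": hashlib.sha1(key.encode("utf-8")).hexdigest(),
        "method": method,
        "address": address,
        "params": [{"name": k, "value": v} for k, v in sorted(params.items())],
        "found_on": page_url,
        "seen_at": datetime.now(timezone.utc),
    }


def save(collection, doc):
    """Upsert so an endpoint seen on many pages is stored only once."""
    collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```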
--------------------------------------
