gter_bbs_spider's Introduction

##Study-Abroad Forum Spider


This spider crawls a study-abroad forum. This run targets the board for North American offer results, whose page looks like this:

pic

The spider runs asynchronously with gevent, stores the results in MongoDB, and then exports them to a CSV file.
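As a rough illustration of the gevent approach, the sketch below fetches a batch of pages concurrently through a bounded greenlet pool. It is not the code in engine.py: the `fetch` and `crawl` helpers and the pool size of 20 are assumptions made for the example.

```python
# Hypothetical sketch; not the project's engine.py.
from gevent import monkey
monkey.patch_all()  # make blocking socket calls cooperative; run this before other imports

from urllib.request import urlopen

from gevent.pool import Pool


def fetch(url):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def crawl(urls, fetch=fetch, concurrency=20):
    """Fetch every URL with at most `concurrency` greenlets in flight.

    Results come back in the same order as `urls`.
    """
    pool = Pool(concurrency)
    return pool.map(fetch, urls)
```

Passing a different `fetch` callable makes it possible to try the pool logic without touching the network.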

The crawled data is in go_america_to_study_data.csv. I hope you put it to good use.

###usage


Install MongoDB:

MacOS:

brew install mongodb

Linux:

Ubuntu/debian:

sudo apt-get install mongodb

CentOS:

sudo yum install mongodb

Install the Python dependencies:

pip install -r requirements.txt

Then open config.py and set the URL of your start page and the page number; the configuration is very simple. Also set the name of your MongoDB collection.
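The actual contents of config.py are not shown here, so the following is only a guess at its shape: a start URL, a page range, and the MongoDB collection name. Every name and value below (`START_URL`, `START_PAGE`, `END_PAGE`, `COLLECTION_NAME`, the example.com URL) is a placeholder.

```python
# Hypothetical config.py layout; the real file's variable names may differ.
START_URL = "http://example.com/forum-offers-{page}.html"  # placeholder URL template
START_PAGE = 1              # first page to crawl
END_PAGE = 5                # last page to crawl (inclusive)
COLLECTION_NAME = "offers"  # MongoDB collection that receives the results


def page_urls(template=START_URL, first=START_PAGE, last=END_PAGE):
    """Expand the URL template into one URL per page."""
    return [template.format(page=n) for n in range(first, last + 1)]
```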

Then run:

python engine.py

Export the data in MongoDB to CSV:

python dbToCsv.py

That's it.
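For reference, an export step along these lines can be written with the standard `csv` module. This is a sketch rather than the contents of dbToCsv.py: `export_csv` is a hypothetical helper, the field names are made up, and a plain list of dicts stands in for the documents read from MongoDB.

```python
import csv
import io


def export_csv(documents, fieldnames, out):
    """Write dict-like documents to `out` as CSV, one row per document.

    Keys not listed in `fieldnames` (such as MongoDB's `_id`) are skipped,
    and missing keys become empty cells.
    """
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            extrasaction="ignore", restval="")
    writer.writeheader()
    count = 0
    for doc in documents:
        writer.writerow(doc)
        count += 1
    return count


# Stand-in records; the real ones would come from the MongoDB collection.
docs = [
    {"_id": 1, "school": "Example University", "degree": "MS"},
    {"_id": 2, "school": "Sample Institute"},
]
buf = io.StringIO()
written = export_csv(docs, ["school", "degree"], buf)
```

With pymongo, the `documents` argument could be the cursor returned by a collection's `find()`, and `out` a file opened with `newline=""`.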

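PS: On the lowest-tier Alibaba Cloud instance, the crawl took about an hour, as far as I remember. If you are interested, you could use `multiprocessing` to crawl with several processes. My hardware is simply too weak to do better.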
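One way to act on that suggestion is to split the page range into chunks and hand each chunk to a worker process. The sketch uses the standard library's `multiprocessing` module. `split_pages`, `crawl_chunk`, and `crawl_parallel` are hypothetical names, and `crawl_chunk` only returns the page numbers it was given where a real worker would fetch and store them.

```python
from multiprocessing import Pool


def split_pages(first, last, workers):
    """Split the inclusive range first..last into up to `workers` contiguous chunks."""
    pages = list(range(first, last + 1))
    if not pages:
        return []
    size = -(-len(pages) // workers)  # ceiling division
    return [pages[i:i + size] for i in range(0, len(pages), size)]


def crawl_chunk(pages):
    """Placeholder worker: a real one would crawl each page in `pages`."""
    return list(pages)


def crawl_parallel(first, last, workers=4):
    """Run crawl_chunk over each chunk in its own process and merge the results."""
    chunks = split_pages(first, last, workers)
    with Pool(processes=workers) as pool:
        results = pool.map(crawl_chunk, chunks)
    return [page for chunk in results for page in chunk]
```

Call `crawl_parallel(1, 10, workers=3)` under an `if __name__ == "__main__":` guard so that it also works on platforms that spawn rather than fork worker processes.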

###data analysis


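Afterwards I did some very basic processing of the CSV file in R. I found that the better the city a university is in, the more applications it receives, and that the applicants' undergraduate schools are mostly Project 985 and Project 211 universities. As for the degree, MS still appears to be the easier one to get into. The top three majors applied for are chemistry, economics, and computer science, so the competition there is clearly very tough.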

gter_bbs_spider's People

Contributors

salamer
