douban_group's Introduction

I just want to find out how many groups there are on Douban.

Notes

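  • Crawl through proxies.
  • Collect the proxies from the major free-proxy websites.
  • Store the data in MongoDB. Record format: {'name': '', 'gid': '', 'members': 1000, 'created_at': '2017-01-25', 'owner_name': 'xxxx', 'owner_id': 'xxxx'}
  • Group page URL pattern: https://www.douban.com/group/(gid)
  • Group members page URL pattern: https://www.douban.com/group/(gid)/members
  • The URL queue is managed in Redis with three keys: urls, urls_success, and urls_failed. Each URL is retried up to 3 times.
  • Spend the 10 minutes before the service starts warming up the proxy pool.
  • There are three main processes: one crawls proxies, one checks whether the proxies are still valid, and one crawls Douban.
  • Each main process runs multiple threads to do its work.
  • When the run finishes, generate a summary report: elapsed time, total count, and failure count.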
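The URL patterns, the record format, and the three-key queue from the notes can be sketched as below. This is only an illustration under stated assumptions, not the project's code: the names `group_url`, `members_url`, `make_record`, `UrlQueue`, and `MAX_ATTEMPTS` are hypothetical, the queue is an in-memory stand-in for the three Redis keys so the example runs without a server, and reading "retried 3 times" as three attempts in total is an assumption.

```python
from collections import deque

BASE = "https://www.douban.com/group"
MAX_ATTEMPTS = 3  # assumption: the retry limit of 3 means 3 attempts in total


def group_url(gid):
    """Group page, following the pattern https://www.douban.com/group/(gid)."""
    return f"{BASE}/{gid}"


def members_url(gid):
    """Members page, following the pattern .../group/(gid)/members."""
    return f"{BASE}/{gid}/members"


def make_record(name, gid, members, created_at, owner_name, owner_id):
    """Build a document in the shape listed in the notes."""
    return {
        "name": name,
        "gid": gid,
        "members": int(members),
        "created_at": created_at,  # kept as a 'YYYY-MM-DD' string
        "owner_name": owner_name,
        "owner_id": owner_id,
    }


class UrlQueue:
    """In-memory stand-in for the Redis keys urls, urls_success, urls_failed."""

    def __init__(self):
        self.urls = deque()
        self.urls_success = set()
        self.urls_failed = set()
        self.attempts = {}  # hypothetical per-URL attempt counter

    def push(self, url):
        """Enqueue a URL unless it is already pending or finished."""
        if url in self.urls_success or url in self.urls_failed or url in self.urls:
            return False
        self.urls.append(url)
        return True

    def pop(self):
        """Return the next pending URL, or None when the queue is empty."""
        return self.urls.popleft() if self.urls else None

    def mark_success(self, url):
        self.urls_success.add(url)

    def mark_failure(self, url):
        """Count a failed attempt; requeue until the limit, then give up."""
        self.attempts[url] = self.attempts.get(url, 0) + 1
        if self.attempts[url] >= MAX_ATTEMPTS:
            self.urls_failed.add(url)
        else:
            self.urls.append(url)
```

In a real deployment the deque and the two sets would presumably map onto a Redis list and Redis sets, which keeps the queue state shared across the crawler's threads and processes.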

Code module flow

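  • runserver starts the three main processes. The two proxy-pool processes begin working right away; the Douban crawler process holds and does nothing for the first ten minutes, coordinated through a flag in Redis.
  • Proxy crawler process
    • Starts 5 threads, each crawling one website and preferring high-anonymity proxies. If a thread gets blocked, it crawls through a proxy, trying at most three proxies.
    • Saves the proxies to Redis. Each proxy is checked first, and the checked result is stored directly.
  • Proxy checker process
    • Pulls out all the proxies and checks them with 10 threads. Valid proxies are put back; invalid ones are discarded.
  • Douban crawler process
    • Runs 10 threads, prefers to go through a proxy, and retries at most 3 times.
    • URLs matched on a page are stored in urls. A URL that is crawled successfully goes to urls_success, and one that fails goes to urls_failed. urls_success and urls_failed are used for deduplication.
    • Records that parse successfully are stored in MongoDB.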
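The checker and crawler steps above might look roughly like the following. It is a sketch under assumptions rather than the repository's implementation: `filter_proxies`, `crawl_one`, `is_alive`, and `fetch` are hypothetical names, plain Python sets stand in for urls_success and urls_failed, and falling back to a direct request (proxy `None`) when fewer than three proxies are available is one possible reading of "prefers to go through a proxy".

```python
from concurrent.futures import ThreadPoolExecutor

CHECK_THREADS = 10  # the notes use 10 threads for proxy checking
MAX_ATTEMPTS = 3    # assumption: "retries at most 3 times" read as 3 attempts


def filter_proxies(proxies, is_alive):
    """Check every proxy concurrently and keep only the ones that pass.

    `is_alive` is a hypothetical callable(proxy) -> bool; a real check
    would typically request a known page through the proxy.
    """
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=CHECK_THREADS) as pool:
        results = list(pool.map(is_alive, proxies))
    return [p for p, ok in zip(proxies, results) if ok]


def crawl_one(url, proxies, fetch, succeeded, failed):
    """Fetch `url` with up to MAX_ATTEMPTS tries and record the outcome.

    `fetch` is a hypothetical callable(url, proxy) -> str that raises on
    error. `succeeded` and `failed` are sets standing in for the Redis
    keys urls_success and urls_failed. Returns the page body, or None if
    the URL was already handled or every attempt failed.
    """
    if url in succeeded or url in failed:
        return None  # deduplication: this URL has been processed before
    # Proxies come first; remaining slots fall back to a direct request.
    attempts = (list(proxies) + [None] * MAX_ATTEMPTS)[:MAX_ATTEMPTS]
    for proxy in attempts:
        try:
            body = fetch(url, proxy)
        except Exception:
            continue  # dead proxy or blocked request; try the next option
        succeeded.add(url)
        return body
    failed.add(url)
    return None
```

Passing `fetch` and `is_alive` in as callables keeps the retry and deduplication logic independent of the HTTP client, so it can be exercised without touching the network.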
