
mspider's Introduction

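Mspider will be updated soon. Stay tuned.

Mspider 2.0: a web link crawler

Crawler features

  1. Configurable number of threads
  2. Configurable crawl depth
  3. Configurable number of pages to crawl
  4. Configurable crawl duration
  5. Configurable domain focus and filtering (multiple values separated by commas ",")
  6. Configurable keyword focus and filtering (multiple values separated by commas ",")
  7. URL similarity filtering (can be switched on or off)
  8. 3 download modes: static, dynamic, and mixed
  9. 3 crawl strategies: breadth-first, depth-first, and random-first
  10. 2 display modes at runtime
  11. Data storage (the database is MongoDB)
  12. Built-in dictionary of start URLs
  13. Automatic proxy pool selection (not yet complete)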
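
The three crawl strategies in item 9 differ only in which pending URL is taken next. The following is a minimal sketch of that idea, not Mspider's actual code: the `Frontier` class and its method names are hypothetical, and only the numeric policy values (0, 1, 2) are taken from the `--spider-policy` help text. It is written for Python 3.

```python
import random
from collections import deque


class Frontier:
    """Hypothetical queue of pending URLs; not taken from Mspider's source."""

    # Values mirror the --spider-policy help text.
    BREADTH_FIRST, DEPTH_FIRST, RANDOM_FIRST = 0, 1, 2

    def __init__(self, policy=BREADTH_FIRST, rng=None):
        self.policy = policy
        self._urls = deque()
        self._rng = rng or random.Random()

    def push(self, url):
        """Add a newly discovered URL to the frontier."""
        self._urls.append(url)

    def pop(self):
        """Return the next URL to crawl according to the policy."""
        if not self._urls:
            raise IndexError("pop from an empty frontier")
        if self.policy == self.BREADTH_FIRST:
            return self._urls.popleft()   # oldest first (FIFO)
        if self.policy == self.DEPTH_FIRST:
            return self._urls.pop()       # newest first (LIFO)
        # Random-first: pick any pending URL with equal probability.
        index = self._rng.randrange(len(self._urls))
        url = self._urls[index]
        del self._urls[index]
        return url

    def __len__(self):
        return len(self._urls)
```

A crawler loop would call `pop()`, download the page, and `push()` every link it finds, so switching strategy only means changing the `policy` argument.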

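v2.0 release notes

This update mainly covers the following.

  1. Built a global variables class
  2. Built the UrlRule rule class
  3. Optimized the crawl workflow
  4. Completed the filter tags
  5. Updated the similarity check function
  6. Adopted the gevent model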
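
The README does not say how the similarity check works, so the sketch below shows only one common approach and should not be read as Mspider's algorithm. It assumes that two URLs are "similar" when they share a host, a path shape (with runs of digits collapsed), and the same set of query parameter names. The names `url_pattern` and `SimilarityFilter` are hypothetical, and the code targets Python 3.

```python
import re
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url):
    """Reduce a URL to a structural key (assumed similarity rule)."""
    parts = urlsplit(url)
    # Collapse digit runs so /news/123.html and /news/456.html match.
    path = re.sub(r"\d+", "{n}", parts.path)
    # Keep only the parameter names, ignoring their order and values.
    names = sorted({name for name, _ in
                    parse_qsl(parts.query, keep_blank_values=True)})
    return (parts.netloc.lower(), path, tuple(names))


class SimilarityFilter:
    """Remembers URL patterns already seen; can be switched off."""

    def __init__(self, enabled=True):
        self.enabled = enabled
        self._seen = set()

    def is_new(self, url):
        """Return True the first time a URL (or its pattern) is seen."""
        key = url_pattern(url) if self.enabled else url
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

With the filter disabled, only exact duplicates are rejected, which corresponds to the on/off switch described in the feature list.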

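June 26: v2.0 technical update

  1. Dynamic download mode no longer downloads images (a large speedup)
  2. Dynamic download mode can set the UA (User-Agent) field
  3. Strengthened the regular expression that extracts links from pages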
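
The regular expression Mspider uses for link extraction is not shown here. As an illustration of the technique only, the sketch below pulls `href` values out of raw HTML with a deliberately simple pattern and resolves them against the page URL. `LINK_RE` and `extract_links` are hypothetical names, and the pattern is an assumption rather than the project's own. The code targets Python 3.

```python
import re
from urllib.parse import urljoin, urldefrag

# Matches href="...", href='...' and unquoted href=... (assumed pattern).
LINK_RE = re.compile(r"""href\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)


def extract_links(html, base_url):
    """Return absolute http(s) links found in html, in order, deduplicated."""
    links = []
    seen = set()
    for raw in LINK_RE.findall(html):
        # Skip pseudo-links that never lead to a crawlable page.
        if raw.lower().startswith(("javascript:", "mailto:", "#")):
            continue
        # Resolve relative links and drop the #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, raw))
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

A regular expression is fast but can miss or misread unusual markup; an HTML parser is the more robust choice when accuracy matters more than speed.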

TODO

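In the future, the crawler module will be migrated as a whole into Mscanner, where it will serve as the link acquisition engine.

Feedback on any problems with the crawler's features is welcome.

Bug reports, feature requests, and criticism

Contact: Manning on WooYun (乌云)

QQ: 408468023

Reference articles

《爬虫技术浅析》 (A Brief Analysis of Crawler Technology): overview of the techniques used

《爬虫技术实战》 (Crawler Technology in Practice): Mspider usage examples

Screenshot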

Usage: 
       MMMM   MMMM                              MM                                         
     MMMMMMMMMMMMMMM                          MM MMM       MMMMMMM                         
    MM      M      MM                         M   MM       MM   MM                         
    M               M     MMMMMM  MMMMMMMM    MMMMMM   MMMMMM   MM   MMMMMMMM     MMMMMM   
    M    MM   MM    M   MMM   MM MM      MMM  M   MM  MM    M   MM  MM      MMM  MM    M   
    M    MM   MM    M   M     MMMM         M  M   MM M      M   MM MM   MM    M MM     M   
    M    MM   MM    M  MM    MMMM   MMMM   MM M   MMMM   MMMM   MMMM   MM     MMMM   MMM   
    M    MM   MM    M MM    MM  M   MMMM   MM M   MM M   MMMM   MM M   MMMMMMMMMMM   M     
    M    MM   MM    M M     MM MM   M     MM  M   MM MM        MM  MM      MM   MM   M     
    M    MM   MM    MMM  MMMM  MM   MM   MM   M   MM  MMM    MMM    MMM    MMM  MM   M     
    MMMMMMMMMMMMMMMMM MMMM     MM   MMMMM     MMMMMM    MMMMMM        MMMMMM    MMMMMM     
                               MM   MM                                                     
                                MMMMMM                                                     
                                                                              by Manning

Options:
  -h, --help            show this help message and exit
  -u MSPIDER_URL, --url=MSPIDER_URL
                        Start the domain name
  -t MSPIDER_THREADS_NUM, --threads=MSPIDER_THREADS_NUM
                        Number of threads
  --depth=MSPIDER_DEPTH
                        Crawling depth
  --count=MSPIDER_COUNT
                        Crawling number: The default download 100000000 pages
  --time=MSPIDER_TIME   Crawl time: The default crawl for 7 days
  --similarity=MSPIDER_SIMILARITY
                        Similarity check: True   False
  --storage=MSPIDER_STORAGE
                        Storage true save  false don't save
  --spider-model=MSPIDER_MODEL
                        Crawling mode: Static 0  Dynamic 1  Mixed 2
  --spider-policy=MSPIDER_POLICY
                        Crawling strategy: Breadth-first 0  Depth-first 1
                        Random-first 2
  --focus-keyword=MSPIDER_FOCUS_KEYWORD
                        Focus keyword in URL's path
  --filter-keyword=MSPIDER_FILTER_KEYWORD
                        Filter keyword in URL's path
  --filter-domain=MSPIDER_FILTER_DOMAIN
                        Filter domain
  --focus-domain=MSPIDER_FOCUS_DOMAIN
                        Focus domain
  --random-agent=MSPIDER_AGENT
                        like sqlmap --random-agent default is false: no random
  --print-all=MSPIDER_PRINT_ALL
                        mspider_print_all
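
The focus and filter options accept several values separated by commas. A minimal sketch of how such values might be applied to a candidate URL follows. It is an assumption for illustration, not Mspider's implementation: `split_option` and `url_allowed` are hypothetical helpers, domains are matched as the host itself or a parent of it, and keywords are matched as substrings of the URL path. The code targets Python 3.

```python
from urllib.parse import urlsplit


def split_option(value):
    """Split a comma-separated option value into clean lowercase items."""
    return [item.strip().lower()
            for item in (value or "").split(",") if item.strip()]


def _domain_matches(host, domain):
    # True for the domain itself and for any of its subdomains.
    return host == domain or host.endswith("." + domain)


def url_allowed(url, focus_domain="", filter_domain="",
                focus_keyword="", filter_keyword=""):
    """Decide whether a URL passes the focus and filter rules (assumed)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.lower()

    # Filters reject a URL as soon as any listed value matches.
    if any(_domain_matches(host, d) for d in split_option(filter_domain)):
        return False
    if any(k in path for k in split_option(filter_keyword)):
        return False

    # Focus lists, when given, require at least one match.
    focus_domains = split_option(focus_domain)
    if focus_domains and not any(_domain_matches(host, d)
                                 for d in focus_domains):
        return False
    focus_keywords = split_option(focus_keyword)
    if focus_keywords and not any(k in path for k in focus_keywords):
        return False
    return True
```

For example, `url_allowed(url, focus_domain="example.com,example.net")` keeps only URLs on either domain, while an empty option leaves that rule inactive.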

