Code Monkey home page Code Monkey logo

lego's Introduction

= About Lego =
    Lego is an advance web crawler library written in python. It
provides a number of methods to mine data from kinds of sites. You 
could use YAML to create crawler templates don't need to know write 
python code for a web crawl job.
This project is based on SuperMario (http://github.com/bububa/SuperMario/).

== License ==
BSD License
See 'LICENSE' for details.

== Requirements ==
Platform: *nix like system (Unix, Linux, Mac OS X, etc.)
Python: 2.5+
Storage: mongodb
Some other python models:
    - bububa.SuperMario

== Features ==
  + YAML style templates;
  + keywords IDF calculation
  + keywords COEF calculation
  + Sitemanager
  + Smart Crawling
  + 

== DEMO ==
Sample YAML file to crawl snipplr.com

#snipplr.yaml
run: !SiteCrawler
    check: !Crawlable
        yaml_file: sites.yaml
        label: snipplr
        logger:
            filename: snipplr.log
    crawler: !YAMLStorage
        yaml_file: sites.yaml
        label: snipplr
        method: update_config
        data: !DetailCrawler
            sleep: 3
            proxies: !File
                method: readlines
                filename: /proxies
            pages: !PaginateCrawler
                proxies: !File
                    method: readlines
                    filename: /proxies
                url_pattern: http://snipplr.com/all/page/{NUMBER}
                start_no: 0
                end_no: 0
                multithread: True
                wrapper: '<ol class="snippets marg">([^^].*?)</ol>'
                logger:
                    filename: snipplr.log
            url_pattern: '/view/\d+/[^^]*?/$'
            wrapper:
                title: '<h1>([^^]*?)</h1>'
                language: '<p class="nomarg"><span class="rgt">Published in: ([^^]*?)</span>'
                author: '<h2>Posted By</h2>\s+<p><a[^^]*?>([^^]*?)</a>'
                code: '<a rel="nofollow" href="/view.php\?codeview&amp;id=(\d+)">'
                comment: '<div class="description">([^^]*?)</div>'
                tag: '<h2>Tagged</h2>\s+<p>([^^]*?)</p>'
            essential_fields:
                - title
                - language
                - code
            multithread: True
            remove_external_duplicate: True
            logger:
                filename: snipplr.log
            page_callback: !Document
                label: snipplr
                method: write
                page: None
                logger:
                    filename: snipplr.log
            furthure: 
                tag: 
                    parser: !Steps
                        inputs: None
                        steps: 
                            - !Dict
                                dictionary: None
                                method: member
                                args: 
                                    - tag
                            - !Regx
                                string: None
                                pattern: <a[^^]*?>([^^]*?)</a>
                                multiple: True
                code: 
                    parser: !Steps
                        inputs: None
                        steps: 
                            - !Dict
                                dictionary: None
                                method: member
                                args: 
                                    - code
                            - !String
                                args: None
                                base_str: 'http://snipplr.com/view.php?codeview&id=%s'
                            - !Init 
                                inputs: None 
                                obj: !URLCrawler
                                    urls: None
                                params: 
                                    save_output: True
                                    wrapper: '<textarea[^^]*?class="copysource">([^^]*?)</textarea>'
                            - !Array
                                arr: None
                                method: member
                                args: 
                                    - 0
                            - !Dict
                                dictionary: None
                                method: member
                                args: 
                                    - wrapper

-------------------------------------------------------------------
#sites.yaml
snipplr: 
    duration: 1800
    end_no: 1
    last_updated_at: 1263883143.0
    start_no: 1
    step: 1

--------------------------------------------------------------------

Run the crawler in shell

python lego.py -c snipplr.yaml

The results are stored in mongodb. You could use inserter Module to insert the data into mysql database.



  

lego's People

Contributors

bububa avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

amoebafactor

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.