
pyspider's Introduction

pyspider

A Powerful Spider (Web Crawler) System in Python. TRY IT NOW!

  • Write scripts in Python with a powerful API
  • Python 2 and 3
  • Powerful WebUI with script editor, task monitor, project manager and result viewer
  • JavaScript pages supported!
  • MySQL, MongoDB and SQLite as database backends
  • Task priority, retry, periodic crawling, recrawl by age and more
  • Distributed architecture
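
To make "task priority" and "recrawl by age" concrete, here is a minimal, self-contained sketch of the idea. It is not pyspider's scheduler: the `ToyScheduler` class, its method names and the explicit `now` argument are all hypothetical, chosen only to show how a priority queue and an age check could interact.

```python
import heapq
import itertools


class ToyScheduler:
    """Illustrative in-memory queue; NOT pyspider's scheduler."""

    def __init__(self):
        self._heap = []                  # entries: (-priority, seq, url)
        self._seq = itertools.count()    # tie-breaker keeps FIFO order
        self._last_crawled = {}          # url -> timestamp of last crawl

    def submit(self, url, now, priority=0, age=0):
        """Queue url unless it was crawled less than `age` seconds ago."""
        last = self._last_crawled.get(url)
        if last is not None and now - last < age:
            return False                 # still fresh, skip the recrawl
        heapq.heappush(self._heap, (-priority, next(self._seq), url))
        return True

    def pop(self, now):
        """Return the highest-priority url and record its crawl time."""
        _, _, url = heapq.heappop(self._heap)
        self._last_crawled[url] = now
        return url


ONE_DAY = 24 * 60 * 60
TEN_DAYS = 10 * ONE_DAY

sched = ToyScheduler()
sched.submit("http://example.com/a", now=0, priority=1, age=TEN_DAYS)
sched.submit("http://example.com/b", now=0, priority=5, age=TEN_DAYS)
first = sched.pop(now=0)     # higher priority is served first
second = sched.pop(now=0)

# One day later the page is still fresh, so it is not queued again.
requeued_early = sched.submit("http://example.com/a", now=ONE_DAY, age=TEN_DAYS)
# After eleven days the age has expired and the page is queued again.
requeued_late = sched.submit("http://example.com/a", now=11 * ONE_DAY, age=TEN_DAYS)
```

Passing `now` explicitly instead of reading the clock keeps the sketch deterministic; a real scheduler would also persist this state in one of the database backends listed above.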

Sample Code

from libs.base_handler import *

class Handler(BaseHandler):
    '''
    this is a sample handler
    '''
    # Run on_start once a day (24*60 minutes).
    @every(minutes=24*60, seconds=0)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # age is given in seconds: 10*24*60*60 is 10 days.
    @config(age=10*24*60*60)
    def index_page(self, response):
        # Follow every link whose href starts with "http://".
        for each in response.doc('a[href^="http://"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict is the result for this page.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
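
The selector `a[href^="http://"]` in `index_page` matches only anchors whose `href` attribute begins with `http://`, so `https://` links and relative links are not followed. The sketch below reproduces that filter with the standard library's `html.parser` so it runs without pyspider installed. The `HttpLinkCollector` class and the sample HTML are made up for illustration and are not part of pyspider.

```python
from html.parser import HTMLParser


class HttpLinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with "http://"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Same condition as the CSS attribute selector [href^="http://"].
        if href and href.startswith("http://"):
            self.links.append(href)


SAMPLE_HTML = """
<a href="http://example.com/one">absolute http</a>
<a href="https://example.com/two">https, not matched</a>
<a href="/three">relative, not matched</a>
<a name="anchor">no href</a>
"""

collector = HttpLinkCollector()
collector.feed(SAMPLE_HTML)
links = collector.links
```

If a crawl should also follow `https://` links, a broader selector such as `a[href^="http"]` would be one way to do it.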

Demo

Installation

  • Python 2.6, 2.7, 3.3 or 3.4
  • pip install --allow-all-external -r requirements.txt
  • Run ./run.py, then visit http://localhost:5000/

If you are using Ubuntu, install the binary packages first:

apt-get install python python-dev python-distribute python-pip libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml

Documents

TODO

v0.3.0 (current)

  • as a package
  • run.py parameters
  • sortable projects list #12
  • PostgreSQL supported via SQLAlchemy (with the power of SQLAlchemy, pyspider also supports Oracle, SQL Server, etc.)
  • benchmarking
  • python3 support
  • documents
  • pypi release version

v0.4.0

  • a visual scraping interface like portia

more

  • local mode, loading scripts from file.
  • edit script with local vim via WebDAV
  • in-browser debugger like Werkzeug

???

  • works as a framework (all components running in one process, no threads)
  • shell mode like scrapy shell

Contribute

License

Licensed under the Apache License, Version 2.0

pyspider's People

Contributors

binux, dodysw, eiriksm, buptsb
