scrapekit

Did you know the entire web was made of data? You probably did. Scrapekit helps you get that data with simple Python scripts. Based on requests, the library handles caching, threading and logging.

See the full documentation.

Example

from scrapekit import Scraper

scraper = Scraper('example')


@scraper.task
def get_index():
    # Fetch the index page and emit each table row to the next task.
    url = 'http://databin.pudo.org/t/b2d9cf'
    doc = scraper.get(url).html()
    for row in doc.findall('.//tr'):
        yield row


@scraper.task
def get_row(row):
    # Receives one row emitted by get_index.
    columns = row.findall('./td')
    print(columns)


# Pipe the output of get_index into get_row.
pipeline = get_index | get_row

if __name__ == '__main__':
    pipeline.run()
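
To make the `get_index | get_row` idea more concrete, here is a minimal, self-contained sketch of how a pipe operator over tasks could work in plain Python. It is not scrapekit's implementation: the `Task` class, its `next` attribute and the sample functions are all hypothetical, and the sketch leaves out the caching, threading and logging that the library adds.

```python
import inspect


class Task:
    """Hypothetical stand-in for a scraper task (not scrapekit's class)."""

    def __init__(self, fn):
        self.fn = fn
        self.next = None  # downstream task, if any

    def __or__(self, other):
        # `a | b` appends b to the end of the chain that starts at a and
        # returns the head, so `a | b | c` builds a -> b -> c.
        tail = self
        while tail.next is not None:
            tail = tail.next
        tail.next = other
        return self

    def run(self, *args):
        result = self.fn(*args)
        if self.next is None:
            return
        if inspect.isgenerator(result):
            # A generator task feeds each yielded item downstream.
            for item in result:
                self.next.run(item)
        elif result is not None:
            self.next.run(result)


seen = []


@Task
def numbers():
    for n in range(3):
        yield n


@Task
def double(n):
    return n * 2


@Task
def collect(n):
    seen.append(n)


pipeline = numbers | double | collect
pipeline.run()
print(seen)
```

Because the first task is a generator, each yielded item flows through the rest of the chain one at a time, which mirrors how each table row reaches `get_row` in the example above.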
  

Works well with

Scrapekit doesn't aim to provide all functionality necessary for scraping. Specifically, it doesn't address HTML parsing, data storage and data validation. For these needs, check the following libraries:

  • lxml for HTML/XML parsing; much faster and more flexible than BeautifulSoup.
  • dataset is a sister library of scrapekit that simplifies storing semi-structured data in SQL databases.
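
As a rough illustration of the parsing side, the sketch below extracts cell text from a small table using the same `findall` paths as the example above. The sample markup and the `extract_rows` helper are made up for illustration. The import falls back to the standard library's ElementTree, which has a compatible API, so the snippet runs even without lxml installed. For real-world pages that are not well-formed, lxml's `lxml.html` module is the more tolerant choice.

```python
try:
    from lxml import etree  # preferred when lxml is installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, compatible API

# Hypothetical sample markup standing in for a fetched page.
SAMPLE = """
<table>
  <tr><td>Alice</td><td>42</td></tr>
  <tr><td>Bob</td><td>17</td></tr>
</table>
""".strip()


def extract_rows(markup):
    """Return the text of every <td>, grouped by <tr>."""
    root = etree.fromstring(markup)
    rows = []
    for tr in root.findall('.//tr'):
        rows.append([td.text for td in tr.findall('./td')])
    return rows


rows = extract_rows(SAMPLE)
print(rows)
```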

Existing tools

  • Scrapy is a much more mature and comprehensive framework for developing scrapers. On the other hand, it requires you to develop scrapers within its class system. This can be too heavyweight for a simple script to grab data off a web site.
  • scrapelib is a thin wrapper around requests that does throttling, retries and caching.
  • MechanicalSoup binds BeautifulSoup and requests into an imperative, stateful API.

Credits and license

Scrapekit is licensed under the terms of the MIT license, which is also included in LICENSE. It was developed through projects of ICFJ, ANCIR and ICIJ.

Contributors

pudo, iromli, rizziepit
