django-kickstarter-scraper

Scrapes Kickstarter and stores the results in a MySQL database using Django. It's currently designed to be started manually, but could be useful as a cron job with a few tweaks, like the ability to parse 'recently added projects' pages.

As of early July 2013, it's functional, but poorly documented, and probably written all wrong. I had used very little Django or even Python prior to this project. This project was supposed to teach me more about both, and it has.

This was a fun project, but I don't have any plans to update this. I'll be watching for pull requests.

###Installation

I'm writing all this from memory. If I'm missing steps, please add them. Or just make it smarter ;)

Needs a bunch of libraries to work. This may help:

sudo easy_install django south django-tastypie BeautifulSoup4 urllib2 termcolor

Like any other Django app, you'll need to set up a database. The current settings file expects a MySQL database named kickscrape on localhost. Don't forget to sync the database.

python manage.py syncdb
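For reference, a settings entry matching that description might look roughly like the sketch below. The database name and host come from the text above. The user, password, and port values are placeholders I'm assuming, not the project's actual settings.

```python
# Sketch of the DATABASES entry in settings.py.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",  # Django's built-in MySQL backend
        "NAME": "kickscrape",                  # database name mentioned above
        "HOST": "localhost",                   # database is expected on localhost
        "USER": "kickscrape_user",             # assumption: replace with your MySQL user
        "PASSWORD": "change-me",               # assumption: replace with your password
        "PORT": "",                            # empty string means the default port
    }
}
```

The kickscrape database itself has to exist in MySQL before syncdb can create tables in it.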

There might be a way for Python to create files before writing to them, but I don't know it. Run these from the root of the Django structure, the same folder as manage.py.

mkdir logs
cd logs
touch full.log process.log queue.log url.log error.log
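The same setup can also be done from Python. This is a minimal sketch, not something the scraper does today: the helper name ensure_log_files is made up for illustration, and it assumes base_dir is the folder that holds manage.py.

```python
import os

# Log file names taken from the touch command above.
LOG_NAMES = ["full.log", "process.log", "queue.log", "url.log", "error.log"]


def ensure_log_files(base_dir, names=LOG_NAMES):
    """Create base_dir/logs and an empty file for each name if missing."""
    log_dir = os.path.join(base_dir, "logs")
    if not os.path.isdir(log_dir):
        os.makedirs(log_dir)
    paths = []
    for name in names:
        path = os.path.join(log_dir, name)
        # Append mode creates the file when absent and leaves existing content alone.
        with open(path, "a"):
            pass
        paths.append(path)
    return paths
```

Calling ensure_log_files(".") from the manage.py folder should leave the same five files in logs/ as the shell commands do.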

South might require some additional setup. I don't remember. You can always disable it in the installed apps section (INSTALLED_APPS) of settings, or check the South documentation.

####DB dump

Here's a DB dump from my last scrape. It has 87K projects that you won't have to download. You'll still have to seed it with search pages in the URLQueue table.

Kickscrape DB Dump

###Use

This command starts the scraper

python manage.py kickthreads

This command stops the scraper

python manage.py stopscrape

The scraper runs until the URLQueue table is empty. This also means that you need to fill that table with Kickstarter URLs before starting. Search pages or project pages work best.
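The run-until-empty behavior can be pictured with the sketch below. It is not the actual kickthreads code: an in-memory deque stands in for the URLQueue table, and drain_queue, handle_url, and should_stop are hypothetical names. It also assumes that handling one page can add newly found URLs to the queue, which is my guess at why search pages make good seeds.

```python
from collections import deque


def drain_queue(queue, handle_url, should_stop=lambda: False):
    """Process URLs until the queue is empty or should_stop() returns True.

    handle_url(url) returns an iterable of newly discovered URLs, which are
    appended to the queue (e.g. a search page yielding project pages).
    """
    seen = set(queue)
    processed = []
    while queue and not should_stop():
        url = queue.popleft()
        for found in handle_url(url):
            if found not in seen:  # skip URLs that were already queued
                seen.add(found)
                queue.append(found)
        processed.append(url)
    return processed


def fake_handler(url):
    # Stand-in for fetching and parsing: the search page "finds" two projects.
    if url == "search":
        return ["project-a", "project-b"]
    return []


print(drain_queue(deque(["search"]), fake_handler))
# ['search', 'project-a', 'project-b']
```

An empty queue ends the loop immediately, which is why the table needs seeding first. The should_stop hook is where a stop signal like the one from stopscrape could plug in.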

###Other Notes

  • Pretty much everything happens in that one kickthreads file. If I knew more about Python, I'd probably break it out into modules.
  • Supports the following pages: Projects, Project Backers, Backer Profiles, Search Results
  • Doesn't support the home page or curator pages or any others you can find. Yet ;)
  • The last time I ran it was in May 2013. Kickstarter periodically changes their HTML, and it's likely some of the parsing will need to be rewritten.
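To give a feel for how the supported page types could be told apart, here is a rough sketch. The URL path patterns are assumptions for illustration only and may not match Kickstarter's real or current URLs, and classify_url is a hypothetical helper, not a function from this project. It uses Python 3's urllib.parse.

```python
from urllib.parse import urlparse


def classify_url(url):
    """Return a page-type label for a Kickstarter URL, or None if unsupported.

    The path patterns below are assumptions, not verified against the site.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if parts[:2] == ["projects", "search"]:
        return "search"    # assumed: /projects/search?term=...
    if len(parts) == 3 and parts[0] == "projects":
        return "project"   # assumed: /projects/<creator>/<slug>
    if len(parts) == 4 and parts[0] == "projects" and parts[3] == "backers":
        return "backers"   # assumed: /projects/<creator>/<slug>/backers
    if len(parts) == 2 and parts[0] == "profile":
        return "profile"   # assumed: /profile/<name>
    return None            # home page, curator pages, and anything else
```

A dispatcher like this would let each page type get its own parsing function, which is one way the single kickthreads file could be split into modules.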

Contributors

neight-allen