(͡ ° ͜ʖ͡°) Simple Reddit Crawler

Lightweight Reddit crawler using Python and MySQL

Saving Threads:

Run python reader/reader.py /r/yoursubreddithere

Saving Comments:

Run python reader/reader.py --get-comments

How to build

  1. git clone this repository.

  2. Run the create-database.sql script in your MySQL instance

  3. Install Python pip using sudo apt-get install python-pip

  4. Install PyMySQL using sudo pip install PyMySQL

  5. Open reader/reader.py, search for userAgent = "" and enter a User-Agent there. Skipping this step will cause Reddit to block your requests.
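
As a rough illustration of step 5, the sketch below attaches a User-Agent to a request for a subreddit's newest threads. It is not taken from reader/reader.py: the build_request helper, the USER_AGENT value, and the use of Python 3's urllib are assumptions made for this example. The request is only built, not sent.

```python
import urllib.request

# Hypothetical value: replace it with a string that identifies your own crawler.
USER_AGENT = "simple-reddit-crawler/0.1 (by /u/your_username)"


def build_request(subreddit, user_agent=USER_AGENT):
    """Build (but do not send) a request for a subreddit's newest threads."""
    if not user_agent:
        # Mirrors the warning in step 5: an empty User-Agent gets blocked.
        raise ValueError("Set a User-Agent before sending requests to Reddit")
    url = "https://www.reddit.com" + subreddit.rstrip("/") + "/new.json"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("/r/yoursubreddithere")
print(req.full_url)
print(req.get_header("User-agent"))
```

Failing early on an empty value is a design choice of this sketch; it surfaces the problem before any request reaches Reddit.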

How the Crawler works

The crawler runs in 2 steps: threads and comments.

When reading Threads:
  1. The script reads all the new threads in your subreddit of choice. Reddit's /new listing returns 25 threads per request by default, so only 25 threads are read at a time.

  2. Then, it inserts all the threads found in the "threads" table.

  3. By checking the ID of the thread given by Reddit (thread_id column in our "threads" table), we detect if that thread has already been read. Duplicate threads are ignored.
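
The three steps above can be sketched roughly as follows. This is an illustration, not the code in reader/reader.py: a Python list and set stand in for the MySQL "threads" table, the function names are hypothetical, and the sample dictionary only imitates the shape of a Reddit listing (data, children, then each thread's data with its id and title).

```python
def extract_threads(listing):
    """Pull (thread_id, title) pairs out of a parsed /new.json listing."""
    children = listing.get("data", {}).get("children", [])
    return [(child["data"]["id"], child["data"]["title"]) for child in children]


def store_new_threads(listing, seen_ids, table):
    """Append unseen threads to `table` and return how many were inserted."""
    inserted = 0
    for thread_id, title in extract_threads(listing):
        if thread_id in seen_ids:
            continue  # duplicate thread: already read, so ignore it
        seen_ids.add(thread_id)
        table.append({"thread_id": thread_id, "title": title})
        inserted += 1
    return inserted


# Made-up sample data in the shape of a listing response.
first_read = {"data": {"children": [
    {"kind": "t3", "data": {"id": "abc1", "title": "First thread"}},
    {"kind": "t3", "data": {"id": "abc2", "title": "Second thread"}},
]}}
second_read = {"data": {"children": [
    {"kind": "t3", "data": {"id": "abc2", "title": "Second thread"}},
    {"kind": "t3", "data": {"id": "abc3", "title": "Third thread"}},
]}}

seen_ids, threads_table = set(), []
print(store_new_threads(first_read, seen_ids, threads_table))
print(store_new_threads(second_read, seen_ids, threads_table))
```

With a real database, the same effect could come from a unique key on the thread_id column, assuming the schema in create-database.sql defines one.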

When reading Comments:
  1. The script loops through all the threads stored in the "threads" table and makes one JSON request for the comments of each thread.

  2. Then, it inserts all the comments in the "comments" table.

  3. By checking the ID of the comment given by Reddit (comment_id column in our "comments" table), we detect if that comment has already been read. Duplicate comments are ignored.
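
A minimal sketch of handling one thread's comments response is shown below. It assumes the usual shape of that response: a two-element array whose second element is a listing of comments (kind "t1"), where each comment's replies field is either an empty string or another listing. Whether reader/reader.py follows nested replies is not stated here, so the recursion is an assumption, and the function name and sample data are hypothetical.

```python
def flatten_comments(listing):
    """Yield (comment_id, body) for every comment in a listing, depth-first."""
    for child in listing.get("data", {}).get("children", []):
        if child.get("kind") != "t1":
            continue  # skip "more" placeholders and anything that is not a comment
        data = child["data"]
        yield data["id"], data.get("body", "")
        replies = data.get("replies")
        if isinstance(replies, dict):  # an empty string means there are no replies
            yield from flatten_comments(replies)


# Made-up sample in the shape of a thread response: [thread listing, comment listing].
response = [
    {"data": {"children": [{"kind": "t3", "data": {"id": "abc1"}}]}},
    {"data": {"children": [
        {"kind": "t1", "data": {"id": "c1", "body": "top", "replies": {
            "data": {"children": [
                {"kind": "t1", "data": {"id": "c2", "body": "reply", "replies": ""}},
            ]}}}},
        {"kind": "more", "data": {"children": ["c9"]}},
        {"kind": "t1", "data": {"id": "c3", "body": "another", "replies": ""}},
    ]}},
]

seen_comment_ids = set()
for comment_id, body in flatten_comments(response[1]):
    if comment_id in seen_comment_ids:
        continue  # duplicate comment: ignore it
    seen_comment_ids.add(comment_id)
    print(comment_id, body)
```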

Important:

Since Reddit limits the number of JSON requests to one every two seconds, reading comments takes longer and longer as more threads are stored. Eventually a single pass over the comments takes so long that more than 25 threads are posted in the meantime, and some threads are lost.
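
To make the cost concrete, here is a small sketch of a throttle that spaces requests at least two seconds apart, plus a lower bound on the duration of one comment pass. The Throttle class and minimum_pass_seconds function are hypothetical names for this example, not part of reader/reader.py, and the two-second interval is the figure quoted above.

```python
import time


class Throttle:
    """Block so that consecutive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last = None

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


def minimum_pass_seconds(thread_count, min_interval=2.0):
    """Lower bound on one comment pass: one request per stored thread."""
    return thread_count * min_interval


# With 1,000 stored threads, one pass needs at least this many minutes.
print(minimum_pass_seconds(1000) / 60)
```

Calling wait() before each request keeps the crawler under the limit; the lower bound grows linearly with the number of stored threads, which is why a comment pass eventually outlasts the window in which 25 new threads appear.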

To avoid this, first read threads for a certain period of time, and read their comments only after all of those threads are in the database.

To do that, run python reader/reader.py /r/yoursubreddithere to store only the new threads. Leave this script running for as long as you need.

Then, stop it and run python reader/reader.py --get-comments to store only the comments from the threads read above. Note that this script will run repeatedly to get new comments, so stop its execution when enough comments have been captured.

You can check the result of each run in the logs table.
