

GrabIt

GrabIt is a tool built to archive self-posts, images, GIFs, and videos from subreddits and users on Reddit. The program runs from the command line and requires Python 3.

Installation

Get your Reddit API credentials.

Install all the dependencies.

pip3 install -r requirements.txt

Add the Reddit API client ID and secret through the terminal as shown below, replacing the strings in quotes with your credentials:

python3 RedditGrabber.py --reddit_id "client_id_here" --reddit_secret "client_secret_here"

If you do not wish to enter them through the terminal, you can also add the client ID and secret to the config.json file in the resources folder.
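
As a rough sketch, the corresponding entries in the config.json file might look like this; the key names are an assumption here, chosen to mirror the CLI flags:

{
    "reddit_id": "client_id_here",
    "reddit_secret": "client_secret_here"
}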

Usage and Arguments

Subreddits, users, or a submission URL are positional arguments and must be entered at the start. Subreddits are entered without any prefix, whereas users must be entered with a "u/" before the username. To download from a single subreddit, in this case r/diy:

python3 RedditGrabber.py diy

You can also pass in a list of subreddits and users as a .txt file, with each subreddit or user on its own line.

python3 RedditGrabber.py subs.txt
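
For example, a subs.txt covering two subreddits and one user (note the "u/" prefix on users) would contain:

diy
DataHoarder
u/GallowBoob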

Below are all the optional arguments that you can use; a combined example follows the list:

-h, --help                      show this help message and exit

-p POSTS, --posts POSTS         Number of posts to grab on each cycle
--search SEARCH                 Search for submissions in a subreddit
--sort SORT                     Sort submissions by "hot", "new", "top", or "controversial"
--time_filter TIME_FILTER       Filter sorted submission by "all", "day", "hour", "month", 
                                "week", or "year"
-w WAIT, --wait WAIT            Wait time between subreddits in seconds
-c CYCLES, --cycles CYCLES      Number of times to repeat after wait time
-o OUTPUT, --output OUTPUT      Set base directory to start download
-t OUTPUT_TEMPLATE, --output_template OUTPUT_TEMPLATE
                                Specify output template for download
--allow_nsfw                    Include nsfw posts too
-v, --verbose                   Enable verbose output
--pushshift                     Only use pushshift to grab submissions
--ignore_duplicate              Ignore duplicate media submissions
--blacklist BLACKLIST           Avoid downloading a user or subreddit
--reddit_id REDDIT_ID           Reddit client ID
--reddit_secret REDDIT_SECRET   Reddit client secret
--imgur_cookie IMGUR_COOKIE     Imgur authautologin cookie
--db_location                   Set location of database file
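
For example, combining several of these flags to grab 50 top posts of the past week from r/diy, waiting 600 seconds between cycles and repeating twice:

python3 RedditGrabber.py diy -p 50 --sort top --time_filter week -w 600 -c 2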

Output Template

By default the program saves files by subreddit, then by author. If you would like to change this, you can specify an output template.

The default can be represented by -t '%(subreddit)s/%(author)s/%(id)s-%(title)s.%(ext)s'. If you would like to save by author only and name each file by its title, you can use -t '%(author)s/%(title)s.%(ext)s'.

Note: if you use this parameter you must specify a template for the filename, including %(ext)s, if you wish the files to save properly. If you only wish to change the output directory, you can use the --output parameter.
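
The template syntax matches Python's %-style string formatting, so you can preview how a given template expands. A minimal sketch, using invented sample values:

# Preview how a %-style output template expands; the tag names mirror
# the table below, but the sample values are made up for illustration.
template = "%(subreddit)s/%(author)s/%(id)s-%(title)s.%(ext)s"
fields = {
    "subreddit": "diy",
    "author": "example_user",
    "id": "abc123",
    "title": "My bookshelf build",
    "ext": "jpg",
}
print(template % fields)  # diy/example_user/abc123-My bookshelf build.jpg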

Below are the available tags:

Tag          Description
author       The author of the submission
subreddit    The subreddit of the submission
id           ID of the submission
created_utc  Time the submission was created
title        Title of the submission
ext          File extension

Blacklist

If you wish to avoid downloading a specific user or subreddit you can blacklist them. Below are examples of how you would blacklist the user u/GallowBoob and the subreddit r/Documentaries.

python3 RedditGrabber.py --blacklist u/GallowBoob
python3 RedditGrabber.py --blacklist r/Documentaries

Search

You can search a subreddit using keywords along with sorting and time filters. Below is an example of a simple search on r/all for "breakfast cereal".

python3 RedditGrabber.py all --search "breakfast cereal"

If you do not use the "--sort" flag, results default to sorting by relevance; otherwise you can use "hot", "top", "new", or "comments". While searching you can also filter results by time using the "--time_filter" flag with "all", "day", "hour", "month", "week", or "year". Below is an example searching r/DataHoarder for "sata fire", sorted by top submissions and retrieving links only from the past year.

python3 RedditGrabber.py DataHoarder --search "sata fire" --sort top --time_filter year

Imgur Cookie

Imgur requires users to log in to view NSFW content on its site. Therefore, if you wish to download such content that has been posted to Reddit, you will need to provide the cookie used to verify an Imgur login.

Using the --imgur_cookie flag, provide the authautologin cookie data. You can find this cookie in your browser's storage inspector (Chrome, Edge, Firefox, Safari).

python3 RedditGrabber.py --imgur_cookie "abcdefghi9876%jklmnop54321qrstu"

The cookie is then stored in the config.json file for future use. If you wish to update the cookie, run the command above with the new value.
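
As a minimal sketch of what the cookie does (not necessarily how GrabIt implements it), a client attaches the authautologin cookie to its Imgur requests; here with the requests library and placeholder values:

import requests

# Placeholder cookie value; substitute your own authautologin data.
cookie_value = "abcdefghi9876%jklmnop54321qrstu"
session = requests.Session()
session.cookies.set("authautologin", cookie_value, domain=".imgur.com")
# Requests made through this session now carry the login cookie.
response = session.get("https://imgur.com/a/some_album_id")  # placeholder URL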


grabit's Issues

Connection timeout on saveAlbum

This may be due to rate limiting; it didn't happen until last month.

Traceback (most recent call last):
  File "/usr/lib/python3.5/urllib/request.py", line 1254, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib/python3.5/http/client.py", line 1106, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.5/http/client.py", line 1151, in _send_request
    self.endheaders(body)
  File "/usr/lib/python3.5/http/client.py", line 1102, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python3.5/http/client.py", line 934, in _send_output
    self.send(msg)
  File "/usr/lib/python3.5/http/client.py", line 877, in send
    self.connect()
  File "/usr/lib/python3.5/http/client.py", line 1252, in connect
    super().connect()
  File "/usr/lib/python3.5/http/client.py", line 849, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/usr/lib/python3.5/socket.py", line 711, in create_connection
    raise err
  File "/usr/lib/python3.5/socket.py", line 702, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "RedditGrabber.py", line 165, in <module>
    main(subR, posts)
  File "RedditGrabber.py", line 115, in main
    grabber(subR, direct, posts)
  File "RedditGrabber.py", line 65, in grabber
    saveAlbum(albumId, str(submission.author), str(submission.subreddit), title, direct)
  File "/home/boxy/Documents/RedditImageBackup/handlers/ImgurDownloader2.py", line 53, in saveAlbu$
    urllib.request.urlretrieve(image.link, os.path.join(folder, "(" + str(counter) + ") " + str(im$ge.id) + type))
  File "/usr/lib/python3.5/urllib/request.py", line 188, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 466, in open
    response = self._open(req, data)
  File "/usr/lib/python3.5/urllib/request.py", line 484, in _open
    '_open', req)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 1297, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/usr/lib/python3.5/urllib/request.py", line 1256, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 110] Connection timed out>
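
If rate limiting is the cause, one possible mitigation (a sketch, not GrabIt's actual code) is to wrap the urlretrieve call in a retry loop with exponential backoff:

import time
import urllib.error
import urllib.request

def retrieve_with_retry(url, dest, attempts=4, base_delay=5):
    # Retry on connection failures, doubling the delay each time,
    # which gives a rate limiter room to cool off.
    for attempt in range(attempts):
        try:
            return urllib.request.urlretrieve(url, dest)
        except (urllib.error.URLError, TimeoutError):
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)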

ValueError: Invalid values

Traceback (most recent call last):
  File "user/GrabIt/RedditGrabber.py", line 183, in <module>
    main(parser)
  File "user/GrabIt/RedditGrabber.py", line 125, in main
    feeder(subR, parser)
  File "user/GrabIt/RedditGrabber.py", line 83, in feeder
    submission_queue = Reddit(subR, parser).queue()
  File "user/GrabIt/resources/interfaces/reddit.py", line 65, in queue
    submissions = reddit.subreddit(self.subR).controversial(limit=int(posts), time_filter=self.parser.time_filter)
  File "user/.local/lib/python3.9/site-packages/praw/models/helpers.py", line 313, in __call__
    return Subreddit(self._reddit, display_name=display_name)
  File "user/.local/lib/python3.9/site-packages/praw/models/reddit/subreddit.py", line 561, in __init__
    super().__init__(reddit, _data=_data)
  File "user/.local/lib/python3.9/site-packages/praw/models/listing/mixins/subreddit.py", line 71, in __init__
    super().__init__(reddit, _data=_data)
  File "user/.local/lib/python3.9/site-packages/praw/models/reddit/base.py", line 65, in __init__
    raise ValueError(
ValueError: An invalid value was specified for display_name. Check that the argument for the display_name parameter is not empty.
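
The empty display_name suggests a blank entry reached praw, for example a trailing newline in a subs.txt list. A defensive sketch (illustrative, not GrabIt's actual code) would skip blank lines when reading the file:

def read_sub_list(path):
    # Strip whitespace and drop empty lines so no blank name is
    # ever passed to praw as a subreddit display_name.
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]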

praw outdated

The praw package is outdated. Error message:
Version 7.5.0 of praw is outdated. Version 7.6.0 was released 1 day ago
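
Upgrading the package should clear the warning:

pip3 install --upgrade praw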
