pistocop / subreddit-comments-dl
Download subreddit comments
Home Page: https://www.pistocop.dev/posts/subreddit_downloader/
License: GNU General Public License v3.0
Hi,
I have decided to use your application to download data for my project: all comments from a single small subreddit (80k subscribers) from the past 5 years.
I found the framework very easy to use, but I couldn't find a reliable way to verify that all comments have been downloaded. I'm currently running a process with a batch size of 1024 and 1000 laps; after 2 days, 27 laps have been processed, but it's impossible to know how many more I need.
Would you be able to advise on this?
Hi,
First of all, thank you for this great program!
I have used your code successfully to scrape a subreddit over specific UTC date ranges. However, I have run into a problem where I can't scrape anything past the UTC timestamp 1670743183.
my input to terminal:
python src/subreddit_downloader.py --reddit-id --reddit-secret --reddit-username --debug --batch-size 500 --utc-after 1670743183
The error is below. I have no idea why this is occurring, any advice would be greatly appreciated! Thank you.
subreddit_downloader.py 308
typer.run(main)
main.py 859 run
app()
main.py 214 __call__
return get_command(self)(*args, **kwargs)
core.py 829 __call__
return self.main(*args, **kwargs)
core.py 782 main
rv = self.invoke(ctx)
core.py 1066 invoke
return ctx.invoke(self.callback, **ctx.params)
core.py 610 invoke
return callback(*args, **kwargs)
main.py 497 wrapper
return callback(**use_params) # type: ignore
contextlib.py 79 inner
return func(*args, **kwds)
subreddit_downloader.py 299 main
assert utc_lower_bound < utc_upper_bound, f"utc_lower_bound '{utc_lower_bound}' should be " \
TypeError: '<' not supported between instances of 'NoneType' and 'str'
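For context, this error is what Python 3 raises whenever an ordering comparison mixes `None` with a string: the assert in subreddit_downloader.py compares the two UTC bounds with `<`, and one of them was apparently left unset. A minimal reproduction (the variable names mirror the assert, but the values are illustrative):

```python
# In Python 3, ordering None against a str raises TypeError; this is the
# comparison the assert in subreddit_downloader.py performs when one of
# the UTC bounds was never supplied.
utc_lower_bound = None          # bound that was never supplied
utc_upper_bound = "1670743183"  # CLI arguments arrive as strings

try:
    assert utc_lower_bound < utc_upper_bound
except TypeError as exc:
    print(exc)  # '<' not supported between instances of 'NoneType' and 'str'
```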
Thank you for the very useful code!
Not an issue per se, but a question: If I want to download only specific threads from a subreddit, do you have any suggestions for the best way to do it?
Thanks :)
Dear pistocop
Hi, this is Bogyeom Kim. First of all, thanks for your code. It helped our project a lot.
While using it, I was wondering how to work out the UTC date that is passed as an argument to subreddit_downloader.py.
For example, you showed that in order to download the News comments posted after 1 January 2021, you used --utc-after 1609459201.
It would be a great help if you could let me know how to turn an ordinary date into this UTC timestamp form.
Thank you.
Best regards,
Bogyeom Kim
Implement a converter so that a human-readable date format (e.g. 2021/05/21) can be passed as input (--utc-after) instead of a UTC timestamp (e.g. 1609459200).
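A minimal sketch of such a converter, assuming the date is given as YYYY/MM/DD and interpreted as midnight UTC (the function name is illustrative, not part of the repo):

```python
from datetime import datetime, timezone

def human_date_to_utc(date_str: str) -> int:
    """Convert a YYYY/MM/DD date string to a Unix UTC timestamp."""
    dt = datetime.strptime(date_str, "%Y/%m/%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

print(human_date_to_utc("2021/01/01"))  # 1609459200
```

The same helper answers the question above: 1 January 2021 at midnight UTC is 1609459200, so --utc-after 1609459201 starts one second after New Year.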
The data downloaded, but the script throws the following error after executing !python subreddit-comments-dl/src/dataset_builder.py. I ran this on Google Colab.
dataset_builder.py 191
Error in sys.excepthook:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/pretty_errors/__init__.py", line 728, in excepthook
writer.write_location(code.co_filename, traceback.tb_lineno, code.co_name)
File "/usr/local/lib/python3.7/dist-packages/pretty_errors/__init__.py", line 491, in write_location
self.config.function_color, function
File "/usr/local/lib/python3.7/dist-packages/pretty_errors/__init__.py", line 428, in output_text
if count == 0 or count % line_length != 0 or self.config.full_line_newline:
ZeroDivisionError: integer division or modulo by zero
Original exception was:
Traceback (most recent call last):
File "subreddit-comments-dl/src/dataset_builder.py", line 191, in <module>
typer.run(main)
File "/usr/local/lib/python3.7/dist-packages/typer/main.py", line 859, in run
app()
File "/usr/local/lib/python3.7/dist-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "subreddit-comments-dl/src/dataset_builder.py", line 171, in main
header, rows = csv_reader(csv_path)
File "subreddit-comments-dl/src/dataset_builder.py", line 127, in csv_reader
for row_id, row in enumerate(file_reader):
_csv.Error: line contains NUL
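The "line contains NUL" error usually means the downloaded CSV contains stray NUL (\x00) bytes, which Python's csv module refuses to parse. A possible workaround (a sketch, not part of the repo's code) is to strip them from each line before handing the file to csv.reader:

```python
import csv

def csv_reader_skip_nul(csv_path: str):
    """Read a CSV file, stripping NUL bytes that would crash csv.reader."""
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as f:
        # Generator that removes NUL characters line by line.
        cleaned = (line.replace("\x00", "") for line in f)
        reader = csv.reader(cleaned)
        header = next(reader)
        rows = list(reader)
    return header, rows
```

Applying a filter like this inside csv_reader in dataset_builder.py should let the build continue past the corrupted lines.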
Does this program scrape the comments of a given post in the order of their occurrence, without breaking the hierarchy? The praw library helps with scraping all the comments, but they are not in order. Please let me know whether this program can do that and which command I should use.
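As a side note, thread order can be recovered with a depth-first traversal that yields each comment before its replies. The sketch below uses a stand-in `Comment` class, but praw comment objects expose a similar `.replies` attribute (after `submission.comments.replace_more(limit=None)` expands the tree), so the same `walk` logic should apply:

```python
# Sketch: depth-first traversal that keeps each comment before its replies,
# preserving the thread hierarchy. `Comment` is a stand-in for praw objects.
from dataclasses import dataclass, field

@dataclass
class Comment:
    body: str
    replies: list = field(default_factory=list)

def walk(comments, depth=0):
    """Yield (depth, body) pairs in thread order (parent before children)."""
    for c in comments:
        yield depth, c.body
        yield from walk(c.replies, depth + 1)

thread = [
    Comment("top 1", [Comment("reply 1.1"), Comment("reply 1.2")]),
    Comment("top 2"),
]
print(list(walk(thread)))
# [(0, 'top 1'), (1, 'reply 1.1'), (1, 'reply 1.2'), (0, 'top 2')]
```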
Thanks for putting this together! I was getting UnicodeEncodeErrors on line 99, inside the dictlist_to_csv method. Adding encoding='utf-8' to with open(file_path, 'w', newline='') as output_file (line 96) fixed that for me!
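The fix amounts to forcing UTF-8 on the output handle so non-ASCII comment bodies serialize cleanly on platforms whose default encoding is narrower. A minimal sketch of a dictlist_to_csv-style helper with the fix applied (illustrative, not the repo's exact code):

```python
import csv

def dictlist_to_csv(dictlist, file_path):
    """Write a list of dicts to CSV, forcing UTF-8 so non-ASCII
    comment bodies don't raise UnicodeEncodeError."""
    with open(file_path, "w", newline="", encoding="utf-8") as output_file:
        writer = csv.DictWriter(output_file, fieldnames=dictlist[0].keys())
        writer.writeheader()
        writer.writerows(dictlist)
```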