pistocop / subreddit-comments-dl
Download subreddit comments
Home Page: https://www.pistocop.dev/posts/subreddit_downloader/
License: GNU General Public License v3.0
Hi,
I have decided to use your application to download data for my project: all comments from a single small subreddit (80k subscribers) from the past 5 years.
I found the framework very easy to use, but I couldn't find a reliable way to verify that all comments have been downloaded. I'm currently running a process with a batch size of 1024 and 1000 laps; after 2 days, 27 laps have been processed, but it's impossible to know how many more I need.
Would you be able to advise on this?
Hi,
First of all, thank you for this great program!
I have used your code successfully to scrape a subreddit over specific UTC date ranges. However, I have run into a problem where I can't scrape anything past the UTC timestamp 1670743183.
my input to terminal:
python src/subreddit_downloader.py --reddit-id --reddit-secret --reddit-username --debug --batch-size 500 --utc-after 1670743183
The error is below. I have no idea why this is occurring, any advice would be greatly appreciated! Thank you.
subreddit_downloader.py 308
typer.run(main)
main.py 859 run
app()
main.py 214 __call__
return get_command(self)(*args, **kwargs)
core.py 829 __call__
return self.main(*args, **kwargs)
core.py 782 main
rv = self.invoke(ctx)
core.py 1066 invoke
return ctx.invoke(self.callback, **ctx.params)
core.py 610 invoke
return callback(*args, **kwargs)
main.py 497 wrapper
return callback(**use_params) # type: ignore
contextlib.py 79 inner
return func(*args, **kwds)
subreddit_downloader.py 299 main
assert utc_lower_bound < utc_upper_bound, f"utc_lower_bound '{utc_lower_bound}' should be " \
TypeError: '<' not supported between instances of 'NoneType' and 'str'
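For context, this error is what Python 3 raises whenever an ordering comparison mixes `None` with a string: the assert in subreddit_downloader.py compares the two UTC bounds with `<`, and one of them was apparently left unset. A minimal reproduction (the variable names mirror the assert, but the values are illustrative):

```python
# In Python 3, ordering None against a str raises TypeError; this is the
# comparison the assert in subreddit_downloader.py performs when one of
# the UTC bounds was never supplied.
utc_lower_bound = None          # bound that was never supplied
utc_upper_bound = "1670743183"  # CLI arguments arrive as strings

try:
    assert utc_lower_bound < utc_upper_bound
except TypeError as exc:
    print(exc)  # '<' not supported between instances of 'NoneType' and 'str'
```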
Thank you for the very useful code!
Not an issue per se, but a question: If I want to download only specific threads from a subreddit, do you have any suggestions for the best way to do it?
Thanks :)
Dear pistocop
Hi, this is Bogyeom Kim. First of all, thanks for your code. It helped our project a lot.
While using it, I was wondering how to work out the UTC date that is passed as an argument to subreddit_downloader.py.
For example, you showed that in order to download the News comments posted after 1 January 2021, you used --utc-after 1609459201.
It would be a great help if you could let me know how to turn an ordinary date into this UTC timestamp form.
Thank you.
Best regards,
Bogyeom Kim
Implement a converter so that a human-readable date format (e.g. 2021/05/21) can be passed as input (--utc-after) instead of a UTC timestamp (e.g. 1609459200).
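A minimal sketch of such a converter, assuming the date is given as YYYY/MM/DD and interpreted as midnight UTC (the function name is illustrative, not part of the repo):

```python
from datetime import datetime, timezone

def human_date_to_utc(date_str: str) -> int:
    """Convert a YYYY/MM/DD date string to a Unix UTC timestamp."""
    dt = datetime.strptime(date_str, "%Y/%m/%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

print(human_date_to_utc("2021/01/01"))  # 1609459200
```

The same helper answers the question above: 1 January 2021 at midnight UTC is 1609459200, so --utc-after 1609459201 starts one second after New Year.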
The data downloaded, but the script throws the following error after executing !python subreddit-comments-dl/src/dataset_builder.py. I ran this on Google Colab.
dataset_builder.py 191
Error in sys.excepthook:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/pretty_errors/__init__.py", line 728, in excepthook
writer.write_location(code.co_filename, traceback.tb_lineno, code.co_name)
File "/usr/local/lib/python3.7/dist-packages/pretty_errors/__init__.py", line 491, in write_location
self.config.function_color, function
File "/usr/local/lib/python3.7/dist-packages/pretty_errors/__init__.py", line 428, in output_text
if count == 0 or count % line_length != 0 or self.config.full_line_newline:
ZeroDivisionError: integer division or modulo by zero
Original exception was:
Traceback (most recent call last):
File "subreddit-comments-dl/src/dataset_builder.py", line 191, in <module>
typer.run(main)
File "/usr/local/lib/python3.7/dist-packages/typer/main.py", line 859, in run
app()
File "/usr/local/lib/python3.7/dist-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "subreddit-comments-dl/src/dataset_builder.py", line 171, in main
header, rows = csv_reader(csv_path)
File "subreddit-comments-dl/src/dataset_builder.py", line 127, in csv_reader
for row_id, row in enumerate(file_reader):
_csv.Error: line contains NUL
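The "line contains NUL" error usually means the downloaded CSV contains stray NUL (\x00) bytes, which Python's csv module refuses to parse. A possible workaround (a sketch, not part of the repo's code) is to strip them from each line before handing the file to csv.reader:

```python
import csv

def csv_reader_skip_nul(csv_path: str):
    """Read a CSV file, stripping NUL bytes that would crash csv.reader."""
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as f:
        # Generator that removes NUL characters line by line.
        cleaned = (line.replace("\x00", "") for line in f)
        reader = csv.reader(cleaned)
        header = next(reader)
        rows = list(reader)
    return header, rows
```

Applying a filter like this inside csv_reader in dataset_builder.py should let the build continue past the corrupted lines.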
Does this program scrape the comments of a given post in the order of their occurrence, without breaking the hierarchy? The praw library helps with scraping all the comments, but they are not in order. Please let me know whether this program can do that and which command I should use.
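As a side note, thread order can be recovered with a depth-first traversal that yields each comment before its replies. The sketch below uses a stand-in `Comment` class, but praw comment objects expose a similar `.replies` attribute (after `submission.comments.replace_more(limit=None)` expands the tree), so the same `walk` logic should apply:

```python
# Sketch: depth-first traversal that keeps each comment before its replies,
# preserving the thread hierarchy. `Comment` is a stand-in for praw objects.
from dataclasses import dataclass, field

@dataclass
class Comment:
    body: str
    replies: list = field(default_factory=list)

def walk(comments, depth=0):
    """Yield (depth, body) pairs in thread order (parent before children)."""
    for c in comments:
        yield depth, c.body
        yield from walk(c.replies, depth + 1)

thread = [
    Comment("top 1", [Comment("reply 1.1"), Comment("reply 1.2")]),
    Comment("top 2"),
]
print(list(walk(thread)))
# [(0, 'top 1'), (1, 'reply 1.1'), (1, 'reply 1.2'), (0, 'top 2')]
```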
Thanks for putting this together! I was getting UnicodeEncodeErrors on line 99, inside the dictlist_to_csv method. Adding encoding='utf-8' to with open(file_path, 'w', newline='') as output_file (line 96) fixed that for me!
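The fix amounts to forcing UTF-8 on the output handle so non-ASCII comment bodies serialize cleanly on platforms whose default encoding is narrower. A minimal sketch of a dictlist_to_csv-style helper with the fix applied (illustrative, not the repo's exact code):

```python
import csv

def dictlist_to_csv(dictlist, file_path):
    """Write a list of dicts to CSV, forcing UTF-8 so non-ASCII
    comment bodies don't raise UnicodeEncodeError."""
    with open(file_path, "w", newline="", encoding="utf-8") as output_file:
        writer = csv.DictWriter(output_file, fieldnames=dictlist[0].keys())
        writer.writeheader()
        writer.writerows(dictlist)
```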