subreddit-comments-dl's People

Contributors

pistocop

subreddit-comments-dl's Issues

Pragmatic to download all comments from subreddit after date?

Hi,

I have decided to use your application to download data for my project: all comments from a single small subreddit (80k subscribers) over the past 5 years.

I found the framework very easy to use, but I couldn't find a reliable way to ensure that all comments are downloaded. I'm currently running a process with a batch size of 1024 and 1000 laps. After 2 days, 27 laps have been processed, but I cannot tell how many more I need.

Would you be able to advise on this?

can't scrape past a certain date

Hi,
First of all, thank you for this great program!
I have used your code successfully to scrape a subreddit for specific UTC date ranges. However, I have encountered a problem where I can't scrape anything past the UTC timestamp 1670743183.

My terminal input:
python src/subreddit_downloader.py --reddit-id --reddit-secret --reddit-username --debug --batch-size 500 --utc-after 1670743183

The error is below. I have no idea why this is occurring, any advice would be greatly appreciated! Thank you.

subreddit_downloader.py 308
typer.run(main)

main.py 859 run
app()

main.py 214 __call__
return get_command(self)(*args, **kwargs)

core.py 829 __call__
return self.main(*args, **kwargs)

core.py 782 main
rv = self.invoke(ctx)

core.py 1066 invoke
return ctx.invoke(self.callback, **ctx.params)

core.py 610 invoke
return callback(*args, **kwargs)

main.py 497 wrapper
return callback(**use_params) # type: ignore

contextlib.py 79 inner
return func(*args, **kwds)

subreddit_downloader.py 299 main
assert utc_lower_bound < utc_upper_bound, f"utc_lower_bound '{utc_lower_bound}' should be " \

TypeError:
'<' not supported between instances of 'NoneType' and 'str'
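
The traceback shows a None value being compared with a string. A minimal sketch of a guard against that kind of comparison follows. It assumes the bounds arrive as optional strings; only the names utc_lower_bound and utc_upper_bound come from the traceback, while the helper normalize_utc_bounds is hypothetical and not part of the project.

```python
from typing import Optional, Tuple


def normalize_utc_bounds(
    utc_lower_bound: Optional[str],
    utc_upper_bound: Optional[str],
) -> Tuple[Optional[int], Optional[int]]:
    """Hypothetical helper: cast each bound to int and compare only when both are set."""
    lower = None if utc_lower_bound is None else int(utc_lower_bound)
    upper = None if utc_upper_bound is None else int(utc_upper_bound)
    # Skip the comparison when either bound is missing, so None is never compared.
    if lower is not None and upper is not None and lower >= upper:
        raise ValueError(
            f"utc_lower_bound '{lower}' should be less than utc_upper_bound '{upper}'"
        )
    return lower, upper


print(normalize_utc_bounds(None, "1670743183"))  # (None, 1670743183)
```

Casting to int before comparing also avoids comparing timestamps as strings, which would order them lexicographically rather than numerically.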

how to know utc_date

Dear pistocop

Hi, this is Bogyeom Kim. First of all, thanks for your code. It helps our project a lot.
While using it, I wondered how to obtain the UTC date that subreddit_downloader.py takes as an argument.
For example, you showed that in order to download the News comments after 1 January 2021, you used --utc-after 1609459201.
It would be a great help if you could let me know how to turn an ordinary date into UTC date form.

Thank you.

Best regards,
Bogyeom Kim
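
The value 1609459201 in the example above is a Unix timestamp: the number of seconds since 1 January 1970 00:00:00 UTC. A minimal sketch of the conversion with the Python standard library follows. The function names to_utc_timestamp and from_utc_timestamp are hypothetical helpers for illustration, not part of the project.

```python
from datetime import datetime, timezone


def to_utc_timestamp(year, month, day, hour=0, minute=0, second=0):
    """Return the Unix timestamp (seconds) for a calendar date interpreted as UTC."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())


def from_utc_timestamp(timestamp):
    """Return the timezone-aware UTC datetime for a Unix timestamp."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# One second past midnight on 1 January 2021, UTC.
print(to_utc_timestamp(2021, 1, 1, second=1))  # 1609459201
```

Passing tzinfo=timezone.utc matters: without it, Python treats the datetime as local time and the result shifts by the machine's UTC offset.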

ZeroDivisionError: integer division or modulo by zero

The data downloaded, but the script throws the following error after executing !python subreddit-comments-dl/src/dataset_builder.py. I ran this on Google Colab.

dataset_builder.py 191 Error in sys.excepthook:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/pretty_errors/__init__.py", line 728, in excepthook
writer.write_location(code.co_filename, traceback.tb_lineno, code.co_name)
File "/usr/local/lib/python3.7/dist-packages/pretty_errors/__init__.py", line 491, in write_location
self.config.function_color, function
File "/usr/local/lib/python3.7/dist-packages/pretty_errors/__init__.py", line 428, in output_text
if count == 0 or count % line_length != 0 or self.config.full_line_newline:
ZeroDivisionError: integer division or modulo by zero

Original exception was:
Traceback (most recent call last):
File "subreddit-comments-dl/src/dataset_builder.py", line 191, in <module>
typer.run(main)
File "/usr/local/lib/python3.7/dist-packages/typer/main.py", line 859, in run
app()
File "/usr/local/lib/python3.7/dist-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/typer/main.py", line 497, in wrapper
return callback(**use_params) # type: ignore
File "subreddit-comments-dl/src/dataset_builder.py", line 171, in main
header, rows = csv_reader(csv_path)
File "subreddit-comments-dl/src/dataset_builder.py", line 127, in csv_reader
for row_id, row in enumerate(file_reader):
_csv.Error: line contains NUL
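
The underlying error, "line contains NUL", means the CSV file holds NUL bytes that the csv module on this Python version refuses to parse. One common workaround is to strip those bytes from each line before it reaches the reader. The sketch below assumes a UTF-8 file with a header row; read_csv_without_nul is a hypothetical stand-in and not the project's own csv_reader.

```python
import csv


def read_csv_without_nul(csv_path):
    """Read a CSV file, dropping NUL characters; return (header, rows)."""
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        # Remove NUL characters lazily, line by line, before csv parses them.
        cleaned_lines = (line.replace("\0", "") for line in csv_file)
        rows = list(csv.reader(cleaned_lines))
    if not rows:
        return [], []
    return rows[0], rows[1:]
```

This silently discards the NUL characters, so it is worth checking afterwards whether the affected rows still look complete.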

scraping all the comments in order

Does this program scrape comments of a given post in the order of their occurrence without messing with the hierarchy? The praw library helps in scraping all the comments but they are not in order. Please let me know if this program can do that and the command I should use.

UnicodeEncodeError

Thanks for putting this together! I was getting UnicodeEncodeErrors in line 99 of the dictlist_to_csv method. Adding encoding='utf-8' to the with open(file_path, 'w', newline='') as output_file statement (line 96) fixed that for me!
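
For reference, a minimal sketch of what the fix described above might look like. Only the function name dictlist_to_csv and the open(...) call come from the report; the signature and the rest of the body are assumptions for illustration, not the project's actual code.

```python
import csv


def dictlist_to_csv(file_path, dictlist):
    """Write a list of dicts to CSV (assumed signature), forcing UTF-8 output."""
    if not dictlist:
        return
    # encoding='utf-8' avoids UnicodeEncodeError when the platform default
    # encoding cannot represent characters such as emoji.
    with open(file_path, 'w', newline='', encoding='utf-8') as output_file:
        writer = csv.DictWriter(output_file, fieldnames=list(dictlist[0].keys()))
        writer.writeheader()
        writer.writerows(dictlist)
```

Without an explicit encoding, open() falls back to the platform's default encoding, which on some systems cannot encode every character found in Reddit comments.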
