
PushshiftDumps's Introduction

This repo contains example Python scripts for processing the Reddit dump files created by Pushshift. The files can be downloaded from https://files.pushshift.io/reddit/ or torrented from https://academictorrents.com/details/f37bb9c0abe350f0f1cbd4577d0fe413ed07724e.

  • single_file.py decompresses and iterates over a single zst-compressed file (see the sketch after this list)
  • iterate_folder.py does the same, but for all files in a folder
  • combine_folder_multiprocess.py uses separate processes to iterate over multiple files in parallel, writing lines that match the passed-in criteria to text files, then combining those into a final zst-compressed file
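
All three scripts share the same streaming pattern. Here is a minimal sketch of it, assuming the zstandard package is installed; the file name is just an example:

    import json
    import zstandard

    def iterate_lines(path):
        # Pushshift dumps are compressed with a large zstd window,
        # so max_window_size must be raised above the default
        with open(path, "rb") as file_handle:
            reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(file_handle)
            buffer = b""
            while True:
                chunk = reader.read(2**27)  # ~128 MB of decompressed data per read
                if not chunk:
                    break
                lines = (buffer + chunk).split(b"\n")
                buffer = lines[-1]  # carry the trailing partial line into the next read
                for line in lines[:-1]:
                    yield json.loads(line)  # each line is one comment or submission

    for obj in iterate_lines("RC_2023-01.zst"):
        pass  # do something with each dict here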

PushshiftDumps's People

Contributors

pde, watchful1

PushshiftDumps's Issues

Error when processing the "url" field of RS_2022-08 and RS_2022-09

I am able to use the combine_folder_multiprocess script on the "selftext" field, but trying to match on the "url" field of RS_2022-08 and RS_2022-09 results in the following error:

WARNING: File failed /submissions/submissions/RS_2022-08.zst: 'NoneType' object has no attribute 'lower'
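
(For anyone hitting this: a hedged guess at the cause is that some submissions carry null in their url field, so the script's .lower() call needs a None guard. Roughly, with obj being one decoded line and "youtube.com" a placeholder search value:)

    value = obj.get("url")
    # skip submissions where "url" is null instead of calling .lower() on None
    matched = value is not None and "youtube.com" in value.lower()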

Use of File in Academic Manuscript (DOI Request)

Hello Watchful1,

I am a student at the University of Pittsburgh and am conducting a study that would benefit from using your script to extract the data from the Pushshift archive files.

Could you generate a DOI for your work? This would allow me to cite your work in my paper correctly and give you recognition for your very well-designed script.

I have included a link to the instructions for creating a DOI below:
https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content#issuing-a-persistent-identifier-for-your-repository-with-zenodo

Thank you,
TYB25

How to filter combinations of keywords instead of just single keywords? (about combine_folder_multiprocess.py)

Thank you for sharing your code. It's been incredibly helpful in extracting the specific Reddit data I need!

However, I've encountered an issue. I can successfully use the command:
python3 combine_folder_multiprocess.py reddit --field title --value cold,fever --output pushshift
to fetch submissions with titles containing either "cold" or "fever".

But, when I try to search for specific keyword combinations like "common cold" or "fever symptoms" using:
python3 combine_folder_multiprocess.py reddit --field title --value common cold,fever symptoms --output pushshift
I encounter the following error:

combine_folder_multiprocess.py: error: unrecognized arguments: cold,fever symptoms

Could you advise on how to filter for phrases (multi-word terms containing spaces) instead of single words?

Looking forward to your guidance, and thanks in advance!
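
(A note for anyone hitting this: the shell splits unquoted arguments on spaces before the script ever sees them, so quoting the whole value should pass the phrases through intact, assuming the script's matching handles spaces within an entry:)

python3 combine_folder_multiprocess.py reddit --field title --value "common cold,fever symptoms" --output pushshift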

Encoding error with script combine_folder_multiprocess.py

Hi there,

I am using combine_folder_multiprocess.py to extract submission data from RS_2018-01.zst to RS_2021-12.zst. The script appears to work correctly until RS_2021-07, at which point I receive the error:

INFO: File D:\.Datasets\pushshift\reddit\submissions\RS_2021-07.zst errored 'charmap' codec can't encode characters in position 1893-1897: character maps to <undefined>

If I try to rerun the script, it reports that it processed a majority of the files:

INFO: Processed 42 of 54 files with 234.46 of 330.14 gigabytes

But then it errors out again once it tries to pick up from where it left off, this time indicating the error is with a different .zst file:

 WARNING: File failed D:\.Datasets\pushshift\reddit\submissions\RS_2021-11.zst: 'charmap' codec can't encode character '\U0001f4a9' in position 2001: character maps to <undefined>

I noticed a similar issue reported in the Pushshift subreddit a few months ago. Any assistance you can provide here is greatly appreciated.

Thanks!
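
(A hedged guess at the cause: on Windows, open() defaults to the locale codec, reported as 'charmap', which cannot represent characters such as the U+1F4A9 emoji in the log above. Opening the output handle with an explicit UTF-8 encoding avoids this, along these lines:)

    # force UTF-8 on the output handle instead of the Windows locale default
    line = "example line containing \U0001f4a9"
    with open("output.txt", "w", encoding="utf-8") as handle:
        handle.write(line + "\n")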

Very confused about how to use the combine-multiple-files script

In the script you give two example invocations:

  1. python3 combine_folder_multiprocess.py reddit/comments --value wallstreetbets
  2. python3 combine_folder_multiprocess.py reddit --field author --value Watchful1,spez --output pushshift

However, this is not working for me. My command is:
python combine_folder_multiprocess.py subreddit --value AOC --output pushshift

My data is inside the folder 'subreddits', and I am getting:
2023-06-21 20:01:08,025 - INFO: Loading files from: subreddits
2023-06-21 20:01:08,026 - INFO: Writing output to working folder
2023-06-21 20:01:08,028 - INFO: Checking field subreddit for value aoc_comments
2023-06-21 20:01:08,187 - WARNING: Args don't match args from json file. Delete working folder

Could you please provide a clearer illustration of how to use the scripts? In the script you also say that it assumes the files have the RS_ and RC_ prefixes, which is not the case for the files I downloaded, as you can see. A response would be appreciated, thank you.
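
(A hedged reading of the log for anyone else stuck here: the first positional argument is the input folder, so it should name the actual folder, 'subreddits' rather than 'subreddit', and the "Args don't match" warning suggests deleting the leftover working folder before rerunning with changed arguments, e.g.:)

python combine_folder_multiprocess.py subreddits --value AOC --output pushshift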


using the --field argument properly

I'm confused about how to use the '--field' argument on the command line when I run combine_folder_multiprocess.py. I'm wondering if we have control over the fields that are saved to CSV. For example, could I have it write the author, ID, date, subreddit, votes, awards, etc. to the output .csv file?
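
(The --field argument appears to control only which field is matched, not what is written out. A hedged sketch of picking output columns yourself, reusing the iterate_lines helper sketched in the introduction; the field names are illustrative, not a guaranteed schema:)

    import csv

    fields = ["author", "id", "created_utc", "subreddit", "score"]
    with open("output.csv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(fields)
        for obj in iterate_lines("RS_2022-08.zst"):
            writer.writerow([obj.get(name, "") for name in fields])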

Understanding the memory requirements of combine_folder_multiprocess.py

While attempting to run through some of the Pushshift dumps with combine_folder_multiprocess.py, I find the script consumes all of my memory and is halted by the OOM killer. This occurs even if I set the --processes value to 2.

Is it possible to estimate the potential memory consumption required to run through any particular file in order to calculate the most efficient number of processes that should be used?
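
(A rough back-of-the-envelope, under the assumption that each worker holds one full zstd decompression window plus one read chunk; this is an assumption about the script's internals, not a measured figure, and Python object overhead can multiply it:)

    window_bytes = 2**31   # assumed zstd window per worker
    chunk_bytes = 2**27    # assumed read chunk per worker
    processes = 2
    estimate_gb = processes * (window_bytes + chunk_bytes) / 2**30
    print(f"~{estimate_gb:.1f} GB minimum")  # prints ~4.2 GB for two workers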

No data is extracted?

Hello, I am trying to extract data from a _comments.zst and a _submissions.zst file. Following your script (single_file.py), I only get the lines below in the cmd logs, and nothing else is output:

D:\historicArchives\PushshiftDumps\scripts>py single_file.py ..._comments.zst
2022-10-05 07:11:33 : 100,000 : 0 : 22,807,050:53%
2022-11-15 22:50:16 : 200,000 : 0 : 34,865,950:81%
2022-12-21 22:43:45 : 300,000 : 0 : 42,810,910:100%
Complete : 330,112 : 0

D:\historicArchives\PushshiftDumps\scripts>py single_file.py ..._submissions.zst
Complete : 41,052 : 0

What can I do to get the actual data? I am confused. Thanks.
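
(For anyone else here: single_file.py only logs progress counts, so to get data out you add your own handling inside its loop. A hedged sketch using the iterate_lines helper from the introduction; the file name and filter term are placeholders:)

    import json

    with open("matches.ndjson", "w", encoding="utf-8") as out_handle:
        for obj in iterate_lines("RC_2022-06.zst"):
            if "gamestop" in obj.get("body", "").lower():  # placeholder filter
                out_handle.write(json.dumps(obj) + "\n")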

Also, I would like to inform you that the two links you show in the description of the page are no longer available:
....The files can be downloaded from [here](https://files.pushshift.io/reddit/) or torrented from [here](https://academictorrents.com/details/f37bb9c0abe350f0f1cbd4577d0fe413ed07724e).

multiprocessing.pool vs ProcessPoolExecutor

I've been having an issue using multiprocessing to filter down the entire 2005-2022 dataset, and I won't be able to limit it to just one subreddit. I'm currently working through an issue where combine_folder_multiprocess will hang. I ran into that a few times with smaller chunks of the Reddit data, but I was able to just kill and restart it and it would pick up where it left off; not so with the 2 TB dataset. The process of debugging this is made harder by multiprocessing.Pool having a tendency to fail silently (especially if the OOM killer kicks in), whereas ProcessPoolExecutor raises a BrokenProcessPool exception. The two have effectively the same features, but ProcessPoolExecutor is probably what will get the most updates going forward.
https://stackoverflow.com/questions/65115092/occasional-deadlock-in-multiprocessing-pool
https://bugs.python.org/issue22393#msg315684
https://stackoverflow.com/questions/24896193/whats-the-difference-between-pythons-multiprocessing-and-concurrent-futures

Other than that suggestion (and I'll send a PR if I end up porting it over and it works well), I'll update this issue with what works. But how much RAM does the system where you process the entire dataset have? Right now the machine I'm using has 32 GB, and I gave it 20 workers because I have 24 cores and wanted to use my computer while it was running. I could easily give the machine more; it's a WSL VM currently assigned half my system memory. Would you expect 10 vs. 20 workers, 32 vs. 64 GB of RAM, etc., to have major effects on whether the script completes?
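
For reference, a minimal sketch of the ProcessPoolExecutor pattern described above; process_file and input_files are placeholders for the script's real per-file work:

    from concurrent.futures import ProcessPoolExecutor, as_completed

    def process_file(path):
        # placeholder for the real decompress/filter/write work
        return path

    input_files = ["RC_2020-01.zst", "RC_2020-02.zst"]  # placeholder list

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=10) as executor:
            futures = {executor.submit(process_file, path): path for path in input_files}
            for future in as_completed(futures):
                try:
                    # raises BrokenProcessPool if a worker died, e.g. OOM-killed
                    future.result()
                except Exception as err:
                    print(f"{futures[future]} failed: {err}")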

Unable to understand error message when running the combine_folder_multiprocess.py script

Hi, this is the error message I get when I try running the combine_folder_multiprocess.py script. Both the script and the data files (e.g. RC_2020_01.zst) are in the same folder. Is there something I have not configured correctly? It doesn't seem to be loading the compressed files. Thanks.

python combine_folder_multiprocess.py reddit/comments --value wallstreetbets
2023-03-06 13:03:40,653 - INFO: Loading files from: reddit/comments
2023-03-06 13:03:40,653 - INFO: Writing output to working folder
2023-03-06 13:03:40,653 - INFO: Checking field subreddit for value wallstreetbets
2023-03-06 13:03:40,805 - INFO: Existing input file was read, if this is not correct you should delete the pushshift_working folder and run this script again
2023-03-06 13:03:40,805 - INFO: Processed 0 of 0 files with 0.00 of 0.00 gigabytes
Traceback (most recent call last):
  File "D:\py\combine_folder_multiprocess.py", line 390, in <module>
    log.info(f"{total_lines_processed:,}, {total_lines_errored} errored : {(total_bytes_processed / (2**30)):.2f} gb, {(total_bytes_processed / total_bytes) * 100:.0f}% : {files_processed}/{len(input_files)}")
ZeroDivisionError: division by zero
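
(A hedged reading: "Processed 0 of 0 files" means no input files were matched at all, so the real problem is that the dumps were not found in the given folder, possibly compounded by a stale pushshift_working folder from an earlier run. The crash itself is just the progress line dividing by a zero total; a defensive version of that line would be:)

    # variables named as in the traceback; a zero total means no files were found
    total_bytes_processed, total_bytes = 0, 0
    percent_done = (total_bytes_processed / total_bytes * 100) if total_bytes > 0 else 0.0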

How to export content to csv?

Hey @Watchful1, I ran the script to iterate over the contents of the zst dumps, but the output only shows the number of lines it has iterated over. How do I export the contents to a CSV file so that I can start using it for analysis and model building?
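
(One approach, sketched under the assumption that each iterated object is a plain dict, as in the iterate_lines helper from the introduction; the field list is an example, missing keys are blanked, and extra keys are dropped:)

    import csv

    fields = ["id", "author", "created_utc", "subreddit", "score", "body"]
    with open("comments.csv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields, restval="", extrasaction="ignore")
        writer.writeheader()
        for obj in iterate_lines("RC_2023-01.zst"):
            writer.writerow(obj)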
