
PushshiftDumps's Introduction

This repo contains example Python scripts for processing the Reddit dump files created by Pushshift. The files can be downloaded from https://files.pushshift.io/reddit/ or torrented from https://academictorrents.com/details/f37bb9c0abe350f0f1cbd4577d0fe413ed07724e.

  • single_file.py decompresses and iterates over a single zst-compressed file (see the sketch after this list)
  • iterate_folder.py does the same, but for all files in a folder
  • combine_folder_multiprocess.py uses separate processes to iterate over multiple files in parallel, writing lines that match the passed-in criteria to text files, then combining those into a final zst-compressed file
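
All three scripts share the same streaming pattern. Here is a minimal sketch of it, assuming the zstandard package is installed; the file name is just an example:

    import json
    import zstandard

    def iterate_lines(path):
        # Pushshift dumps are compressed with a large zstd window,
        # so max_window_size must be raised above the default
        with open(path, "rb") as file_handle:
            reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(file_handle)
            buffer = b""
            while True:
                chunk = reader.read(2**27)  # ~128 MB of decompressed data per read
                if not chunk:
                    break
                lines = (buffer + chunk).split(b"\n")
                buffer = lines[-1]  # carry the trailing partial line into the next read
                for line in lines[:-1]:
                    yield json.loads(line)  # each line is one comment or submission

    for obj in iterate_lines("RC_2023-01.zst"):
        pass  # do something with each dict here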

PushshiftDumps's People

Contributors

pde, watchful1

PushshiftDumps's Issues

Error when processing the "url" field of RS_2022-08 and RS_2022-09

I am able to use the combine_folder_multiprocess script on the "selftext" field, but trying to match on the "url" field of RS_2022-08 and RS_2022-09 results in the following error:

WARNING: File failed /submissions/submissions/RS_2022-08.zst: 'NoneType' object has no attribute 'lower'
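
(For anyone hitting this: a hedged guess at the cause is that some submissions carry null in their url field, so the script's .lower() call needs a None guard. Roughly, with obj being one decoded line and "youtube.com" a placeholder search value:)

    value = obj.get("url")
    # skip submissions where "url" is null instead of calling .lower() on None
    matched = value is not None and "youtube.com" in value.lower()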

Use of File in Academic Manuscript (DOI Request)

Hello Watchful1,

I am a student at the University of Pittsburgh and am conducting a study that would benefit from using your script to extract the data from the Pushshift archive files.

Could you generate a DOI for your work? This would allow me to cite your work in my paper correctly and give you recognition for your very well-designed script.

I have included a link to the instructions for creating a DOI below:
https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content#issuing-a-persistent-identifier-for-your-repository-with-zenodo

Thank you,
TYB25

How to filter combinations of keywords instead of just single keywords? (about combine_folder_multiprocess.py)

Thank you for sharing your code. It's been incredibly helpful in extracting the specific Reddit data I need!

However, I've encountered an issue. I can successfully use the command:
python3 combine_folder_multiprocess.py reddit --field title --value cold,fever --output pushshift
to fetch submissions with titles containing either "cold" or "fever".

But, when I try to search for specific keyword combinations like "common cold" or "fever symptoms" using:
python3 combine_folder_multiprocess.py reddit --field title --value common cold,fever symptoms --output pushshift
I encounter the following error:

combine_folder_multiprocess.py: error: unrecognized arguments: cold,fever symptoms

Could you advise on how to filter for phrases (multi-word terms containing spaces) instead of single words?

Looking forward to your guidance, and thanks in advance!
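
(A note for anyone hitting this: the shell splits unquoted arguments on spaces before the script ever sees them, so quoting the whole value should pass the phrases through intact, assuming the script's matching handles spaces within an entry:)

python3 combine_folder_multiprocess.py reddit --field title --value "common cold,fever symptoms" --output pushshift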

Encoding error with script combine_folder_multiprocess.py

Hi there,

I am using combine_folder_multiprocess.py to extract submission data from RS_2018-01.zst to RS_2021-12.zst. The script appears to work correctly until RS_2021-07, at which point I receive the error:

INFO: File D:\.Datasets\pushshift\reddit\submissions\RS_2021-07.zst errored 'charmap' codec can't encode characters in position 1893-1897: character maps to <undefined>

If I try to rerun the script, it reports that it processed a majority of the files:

INFO: Processed 42 of 54 files with 234.46 of 330.14 gigabytes

But then it errors out again once it tries to pick up from where it left off, this time indicating the error is with a different .zst file:

 WARNING: File failed D:\.Datasets\pushshift\reddit\submissions\RS_2021-11.zst: 'charmap' codec can't encode character '\U0001f4a9' in position 2001: character maps to <undefined>

I noticed a similar issue reported in the Pushshift subreddit a few months ago. Any assistance you can provide here is greatly appreciated.

Thanks!
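
(A hedged guess at the cause: on Windows, open() defaults to the locale codec, reported as 'charmap', which cannot represent characters such as the U+1F4A9 emoji in the log above. Opening the output handle with an explicit UTF-8 encoding avoids this, along these lines:)

    # force UTF-8 on the output handle instead of the Windows locale default
    line = "example line containing \U0001f4a9"
    with open("output.txt", "w", encoding="utf-8") as handle:
        handle.write(line + "\n")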

Very confused about how to use the combine-multiple-files script

In the script you give two example invocations:

  1. python3 combine_folder_multiprocess.py reddit/comments --value wallstreetbets
  2. python3 combine_folder_multiprocess.py reddit --field author --value Watchful1,spez --output pushshift

However, this is not working for me. My command is:
python combine_folder_multiprocess.py subreddit --value AOC --output pushshift

My data is inside the folder 'subreddits', and I am getting:
2023-06-21 20:01:08,025 - INFO: Loading files from: subreddits
2023-06-21 20:01:08,026 - INFO: Writing output to working folder
2023-06-21 20:01:08,028 - INFO: Checking field subreddit for value aoc_comments
2023-06-21 20:01:08,187 - WARNING: Args don't match args from json file. Delete working folder

Could you please provide a clearer illustration of how to use the scripts? In the script you also say that it assumes the files have the RS_ and RC_ prefixes, which is not the case for the files I downloaded, as you can see. A response would be appreciated, thank you.
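
(A hedged reading of the log for anyone else stuck here: the first positional argument is the input folder, so it should name the actual folder, 'subreddits' rather than 'subreddit', and the "Args don't match" warning suggests deleting the leftover working folder before rerunning with changed arguments, e.g.:)

python combine_folder_multiprocess.py subreddits --value AOC --output pushshift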


using the --field argument properly

I'm confused about how to use the '--field' argument on the command line when I run combine_folder_multiprocess.py. I'm wondering if we have control over the fields that are saved to CSV. For example, could I have it write the author, ID, date, subreddit, votes, awards, etc. to the output .csv file?
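
(The --field argument appears to control only which field is matched, not what is written out. A hedged sketch of picking output columns yourself, reusing the iterate_lines helper sketched in the introduction; the field names are illustrative, not a guaranteed schema:)

    import csv

    fields = ["author", "id", "created_utc", "subreddit", "score"]
    with open("output.csv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(fields)
        for obj in iterate_lines("RS_2022-08.zst"):
            writer.writerow([obj.get(name, "") for name in fields])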

Understanding the memory requirements of combine_folder_multiprocess.py

While attempting to run through some of the Pushshift dumps with combine_folder_multiprocess.py, I find the script consumes all of my memory and is halted by the OOM killer. This occurs even if I set the --processes value to 2.

Is it possible to estimate the potential memory consumption required to run through any particular file in order to calculate the most efficient number of processes that should be used?
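
(A rough back-of-the-envelope, under the assumption that each worker holds one full zstd decompression window plus one read chunk; this is an assumption about the script's internals, not a measured figure, and Python object overhead can multiply it:)

    window_bytes = 2**31   # assumed zstd window per worker
    chunk_bytes = 2**27    # assumed read chunk per worker
    processes = 2
    estimate_gb = processes * (window_bytes + chunk_bytes) / 2**30
    print(f"~{estimate_gb:.1f} GB minimum")  # prints ~4.2 GB for two workers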

No data is extracted?

Hello, I am trying to extract data from a _comments.zst and a _submissions.zst file. Following your script (single_file.py), I only get the lines below in the cmd logs, and nothing else is output:

D:\historicArchives\PushshiftDumps\scripts>py single_file.py ..._comments.zst
2022-10-05 07:11:33 : 100,000 : 0 : 22,807,050:53%
2022-11-15 22:50:16 : 200,000 : 0 : 34,865,950:81%
2022-12-21 22:43:45 : 300,000 : 0 : 42,810,910:100%
Complete : 330,112 : 0

D:\historicArchives\PushshiftDumps\scripts>py single_file.py ..._submissions.zst
Complete : 41,052 : 0

What can I do to get the actual data? I am confused. Thanks.
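
(For anyone else here: single_file.py only logs progress counts, so to get data out you add your own handling inside its loop. A hedged sketch using the iterate_lines helper from the introduction; the file name and filter term are placeholders:)

    import json

    with open("matches.ndjson", "w", encoding="utf-8") as out_handle:
        for obj in iterate_lines("RC_2022-06.zst"):
            if "gamestop" in obj.get("body", "").lower():  # placeholder filter
                out_handle.write(json.dumps(obj) + "\n")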

Also, I would like to inform you that the two links you show in the description of the page are no longer available:
....The files can be downloaded from [here](https://files.pushshift.io/reddit/) or torrented from [here](https://academictorrents.com/details/f37bb9c0abe350f0f1cbd4577d0fe413ed07724e).

multiprocessing.pool vs ProcessPoolExecutor

I've been having an issue using multiprocessing to filter down the entire 2005-2022 dataset, and I won't be able to limit it to just one subreddit. I'm currently working through an issue where combine_folder_multiprocess will hang. I ran into that a few times with smaller chunks of the Reddit data, but I was able to just kill and restart it and it would pick up where it left off; not so with the 2 TB dataset. The process of debugging this is made harder by multiprocessing.Pool having a tendency to fail silently (especially if the OOM killer kicks in), whereas ProcessPoolExecutor raises a BrokenProcessPool exception. The two have effectively the same features, but ProcessPoolExecutor is probably what will get the most updates going forward.
https://stackoverflow.com/questions/65115092/occasional-deadlock-in-multiprocessing-pool
https://bugs.python.org/issue22393#msg315684
https://stackoverflow.com/questions/24896193/whats-the-difference-between-pythons-multiprocessing-and-concurrent-futures

Other than that suggestion (and I'll send a PR if I end up porting it over and it works well), I'll update this issue with what works. But how much RAM does the system where you process the entire dataset have? Right now the machine I'm using has 32 GB, and I gave it 20 workers because I have 24 cores and wanted to use my computer while it was running. I could easily give the machine more; it's a WSL VM currently assigned half my system memory. Would you expect 10 vs. 20 workers, 32 vs. 64 GB of RAM, etc., to have major effects on whether the script completes?
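
For reference, a minimal sketch of the ProcessPoolExecutor pattern described above; process_file and input_files are placeholders for the script's real per-file work:

    from concurrent.futures import ProcessPoolExecutor, as_completed

    def process_file(path):
        # placeholder for the real decompress/filter/write work
        return path

    input_files = ["RC_2020-01.zst", "RC_2020-02.zst"]  # placeholder list

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=10) as executor:
            futures = {executor.submit(process_file, path): path for path in input_files}
            for future in as_completed(futures):
                try:
                    # raises BrokenProcessPool if a worker died, e.g. OOM-killed
                    future.result()
                except Exception as err:
                    print(f"{futures[future]} failed: {err}")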

Unable to understand error message when running the combine_folder_multiprocess.py script

Hi, this is the error message I get when I try running the combine_folder_multiprocess.py script. Both the script and the data files (e.g. RC_2020_01.zst) are in the same folder. Is there something I have not configured correctly? It doesn't seem to be loading the compressed files. Thanks.

python combine_folder_multiprocess.py reddit/comments --value wallstreetbets
2023-03-06 13:03:40,653 - INFO: Loading files from: reddit/comments
2023-03-06 13:03:40,653 - INFO: Writing output to working folder
2023-03-06 13:03:40,653 - INFO: Checking field subreddit for value wallstreetbets
2023-03-06 13:03:40,805 - INFO: Existing input file was read, if this is not correct you should delete the pushshift_working folder and run this script again
2023-03-06 13:03:40,805 - INFO: Processed 0 of 0 files with 0.00 of 0.00 gigabytes
Traceback (most recent call last):
  File "D:\py\combine_folder_multiprocess.py", line 390, in <module>
    log.info(f"{total_lines_processed:,}, {total_lines_errored} errored : {(total_bytes_processed / (2**30)):.2f} gb, {(total_bytes_processed / total_bytes) * 100:.0f}% : {files_processed}/{len(input_files)}")
ZeroDivisionError: division by zero
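
(A hedged reading: "Processed 0 of 0 files" means no input files were matched at all, so the real problem is that the dumps were not found in the given folder, possibly compounded by a stale pushshift_working folder from an earlier run. The crash itself is just the progress line dividing by a zero total; a defensive version of that line would be:)

    # variables named as in the traceback; a zero total means no files were found
    total_bytes_processed, total_bytes = 0, 0
    percent_done = (total_bytes_processed / total_bytes * 100) if total_bytes > 0 else 0.0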

How to export content to csv?

Hey @Watchful1, I ran the script to iterate over the contents of the zst dumps, but the output only shows the number of lines it has iterated over. How do I export the contents to a CSV file so that I can start using it for analysis and model building?
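
(One approach, sketched under the assumption that each iterated object is a plain dict, as in the iterate_lines helper from the introduction; the field list is an example, missing keys are blanked, and extra keys are dropped:)

    import csv

    fields = ["id", "author", "created_utc", "subreddit", "score", "body"]
    with open("comments.csv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields, restval="", extrasaction="ignore")
        writer.writeheader()
        for obj in iterate_lines("RC_2023-01.zst"):
            writer.writerow(obj)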
