Comments (36)
Sent to both of you. Hope @several27 will upload it soon.
from fakenewscorpus.
Hi all, thank you very much for your patience.
Thanks to @pmacinec the dataset can now be downloaded from: https://storage.googleapis.com/researchably-fake-news-recognition/news_cleaned_2018_02_13.csv.zip
@lerpier to read the dataset with minimal RAM usage, use the ‘chunksize’ parameter in pandas.
E.g.: https://cmdlinetips.com/2018/01/how-to-load-a-massive-file-as-small-chunks-in-pandas/
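A minimal sketch of the chunked-reading pattern the link describes; it uses a small synthetic CSV (`news.csv` is a made-up stand-in) so it runs anywhere — with the real corpus you would point `read_csv` at the downloaded file instead.

```python
import pandas as pd

# Small synthetic CSV standing in for the corpus, so the sketch is self-contained.
pd.DataFrame({'id': range(10), 'type': ['fake', 'reliable'] * 5}).to_csv('news.csv', index=False)

total_rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv('news.csv', chunksize=3):
    total_rows += len(chunk)

print(total_rows)  # 10
```

Each iteration sees one DataFrame of at most `chunksize` rows, which keeps peak memory roughly proportional to the chunk size rather than the file size.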
@pmacinec Well now I feel incredibly dumb, haha. That makes sense. I actually remember trying the strategy of appending chunks before, though, and I ran into the problem that at some point everything just started ending up in the wrong columns. I will try it again tomorrow. Hopefully I can figure it out and won't need to bother you any longer.
That said, the subset you are describing sounds very useful, especially since I likely won't be able to work with the entire thing using my own resources anyway. If you could send it to me at [email protected], I would be very grateful.
I am trying to download as well. The AWS link is showing the following error.
Hi, is the data still available for download?
Apologies for the inconvenience, the download is temporarily down. I'm working on bringing it back.
If anyone had stored a local copy of the dataset, I'd appreciate them sending it over.
I probably have a local copy, so I can upload it somewhere. I also have the subset of only fake and reliable news that someone mentioned in another issue, if needed. I can probably upload it to OneDrive, if it is possible to upload files that large. Or do you have any idea where to upload it?
I've just lost my copy this morning... Maybe you could use the AWS free tier to upload the dataset to a bucket?
@pmacinec If you can upload it to OneDrive and send me over a link, that'd be amazing (maciej[at]researchably.com) I'll copy it over to a new cloud and share a public free link here. Thanks!
@several27 I have already sent you a download link, so I hope everyone will soon be able to download it from the new cloud.
@pmacinec , would you mind sending me the link to download it too?
Appreciate it!
Yes, of course, just send me your email. Or should I send it to the email you have public on your profile?
@pmacinec, could you please send me the link to download it too? I urgently need to use it. Thank you! My email is [email protected]
Hi @pmacinec, can I trouble you to send the link to me as well? Thanks! [email protected]
Hey @pmacinec sorry for bugging but could you send me the link as well? Thank you very much :) [email protected]
@pmacinec I would appreciate it if it could be sent to my email address at [email protected]
Thank you once again! :)
Would it be possible to also upload the subset of only fake and reliable articles that @pmacinec discussed?
If @several27 wants, I can also upload this subset or do it just for you (or maybe share code with you to extract only those articles from the whole dataset, processed in chunks). Just let me know.
An upload of the subset would be great. I don't have easy access to resources to process the entire dataset, which is why such a subset would be very convenient. Let's wait for @pmacinec to see if he agrees with you sharing this, either publicly or privately. Many thanks in advance!
@several27 I tried something like this and it mostly worked, but I ran into some issues after several chunks (not sure why). Would it be okay for @pmacinec to share the subset with only fake and real articles? Either privately via an emailed link, or with a public link here?
@pmacinec @several27 I don't mean to be a bother, but is this still an option (uploading the fake/reliable subset)? I could post an email that you could share a link to, so you don't have to share it publicly. It might be my code, but when I try to process it myself I run into problems, so that would be a huge help.
Maybe you can first share your code and describe your problems.
To anyone who ever wants only a subset of the data with specific labels, please try the following code to get only fake and reliable news.
import pandas as pd

chunksize = 200000  # depending on your memory, this can be much bigger but also smaller
for chunk in pd.read_csv('data/data.csv', chunksize=chunksize, encoding='utf-8', engine='python'):
    x = chunk[(chunk['type'] == 'reliable') | (chunk['type'] == 'fake')]
    ...
Hope this helps.
Hm. This code is somewhat different from the one I tried. I'll try it tomorrow/Monday and see if it works. Thanks! :)
@pmacinec I tried running your code, but it just gives me a df of 129,194 articles from The New York Times. No other sources and no fake articles at all. I also tried reading in the entire file in chunks, which still raised a memory error. Reading in the entire file as is nearly blew up my pc (as expected, haha). Reading in just some rows using nrows works just fine, though.
@several27 what is the code you used to extract just fake and real? Is it the same that @pmacinec posted or did you do something different? What setup did you use to process it? I'm running python on a local jupyter server using a python 3.5 environment.
I would love to work with a (reasonably large, but not complete) subset of the real/fake articles in this dataset since none of the other fake news datasets suit my definitions of 'fake news' as well. However, I think the sheer size of the data is unfortunately causing some issues for me here. A subset of fake and reliable articles would be an absolute lifesaver right now, just so I don't have to process the entire file on my poor laptop. If you could possibly share that with me, I would be eternally grateful :).
(I'm really sorry if I come off as 'pushy'. I'm a bit stressed about this project).
@Ierpier it is probably because you didn't finish the code above. The x variable stores all fake and reliable news of the current chunk; when another chunk is processed, it is overwritten. To not lose the data currently stored in x, you have to accumulate it somewhere (e.g. a new dataframe) where it will not be overwritten in the loop.
The append function can be helpful to you: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html
Note: be careful, even the fake and reliable news alone need a lot of memory!
Or, if you are OK with having only 100,000 news articles (50k reliable, 50k fake), write me your email and I will share it with you. I can share all the reliable and fake news approximately next weekend. But please try the advice above first.
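The accumulation pattern described above can be sketched as follows. This is a self-contained toy example (the tiny `data.csv` it builds is made up); it also uses `pd.concat` in place of `DataFrame.append`, since append was later deprecated in pandas.

```python
import pandas as pd

# Tiny synthetic CSV standing in for the corpus, so the sketch runs anywhere.
pd.DataFrame({
    'type': ['fake', 'reliable', 'bias', 'fake', 'satire', 'reliable'],
    'content': ['a', 'b', 'c', 'd', 'e', 'f'],
}).to_csv('data.csv', index=False)

parts = []  # accumulate each chunk's filtered rows here
for chunk in pd.read_csv('data.csv', chunksize=2):
    # Keep this chunk's fake/reliable rows before the loop variable
    # is overwritten by the next chunk.
    parts.append(chunk[chunk['type'].isin(['fake', 'reliable'])])

# Stitch the pieces into one DataFrame (concat replaces the deprecated append).
subset = pd.concat(parts, ignore_index=True)
print(len(subset))  # 4
```

The key point is that the filtered rows are stored in a list that outlives the loop, so nothing is lost between iterations.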
@pmacinec
Hello Peter,
could you please share the dataset link with me, as the link provided by @several27 is not working now? My email is [email protected]
Hello @impawan, I don't have the data available on Google Drive anymore. I probably have a backup of this dataset on my disk, but I don't have it with me. I should be able to upload it in approximately 2 weeks.
Thanks @pmacinec for helping me with this. I will wait for an update from you. 👍
Hello @impawan. I have uploaded the dataset again and can share it with some of you; just write to me or give me your email.
But please, @several27, are you going to upload the data again? Will the data be available for others in the future? I have only created a temporary solution for now (as before).
Hello @pmacinec. You've done a lot to keep the dataset "alive". Would you please share the dataset link with me, since the link provided by @several27 is not working, or a piece of the data, for example 100,000/200,000 news articles (50/100k reliable, 50/100k fake)? My email is: [email protected]. I'll appreciate your help. Thank you!
Hello @lgotsev. The link that @several27 provided should be working, because the data are uploaded to GitHub as part of a release (https://github.com/several27/FakeNewsCorpus/releases/tag/v1.0).
Let me know if the dataset is for any reason not working again (fortunately, I still have a copy).
Thank you @pmacinec for your quick answer. I've tried several times to open the files using 7-Zip, but unpacking gives an error after extracting almost 3 GB of data. Perhaps another tool, or using the command prompt, could help. I've noticed a new issue about this problem from July 2020 which is still open. So please give some advice on how to deal with it, or perhaps there is once again an issue with the files. Thank you!
If there is a problem with unpacking multiple zip files, maybe it makes sense to divide the data into chunks, zip each chunk separately, and then upload each zip here on GitHub.
@several27 what do you think? If needed, I can do something like that and prepare a pull request.
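A rough sketch of the chunk-and-zip idea, using pandas' built-in zip compression on write. The file names (`full.csv`, `part_000.csv.zip`, ...) and the chunk size are illustrative; for the real corpus the chunk size would be chosen so each archive stays under the upload limit.

```python
import pandas as pd

# Small stand-in for the full corpus, so the sketch is self-contained.
pd.DataFrame({'id': range(6), 'type': ['fake', 'reliable'] * 3}).to_csv('full.csv', index=False)

# Split the CSV into fixed-size chunks and compress each chunk into its own
# zip archive, so no single file is too large to upload.
paths = []
for i, chunk in enumerate(pd.read_csv('full.csv', chunksize=2)):
    path = f'part_{i:03d}.csv.zip'
    chunk.to_csv(path, index=False, compression='zip')  # pandas writes a zip archive directly
    paths.append(path)

print(paths)  # ['part_000.csv.zip', 'part_001.csv.zip', 'part_002.csv.zip']
```

Each part can later be read back independently with `pd.read_csv('part_000.csv.zip')`, since read_csv infers the compression from the extension.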