
Comments (36)

pmacinec avatar pmacinec commented on May 22, 2024 1

Sent to both of you. Hope @several27 will upload it soon.


several27 avatar several27 commented on May 22, 2024 1

Hi all, thank you very much for your patience.

Thanks to @pmacinec, the dataset can now be downloaded from: https://storage.googleapis.com/researchably-fake-news-recognition/news_cleaned_2018_02_13.csv.zip


several27 avatar several27 commented on May 22, 2024 1

@Ierpier, to read the dataset with minimal RAM usage, use the 'chunksize' parameter in pandas.

E.g.: https://cmdlinetips.com/2018/01/how-to-load-a-massive-file-as-small-chunks-in-pandas/
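Something along these lines should work (a minimal sketch only; the file path and chunk size are placeholders, not the corpus' exact layout):

import pandas as pd

# read the corpus in fixed-size pieces instead of loading it all at once;
# the path below is a placeholder for wherever you saved the extracted CSV
for chunk in pd.read_csv('news_cleaned_2018_02_13.csv', chunksize=100000):
    # each `chunk` is a regular DataFrame of up to 100,000 rows,
    # so per-chunk processing keeps peak memory low
    print(chunk.shape)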


Ierpier avatar Ierpier commented on May 22, 2024 1

@pmacinec Well, now I feel incredibly dumb, haha. That makes sense. I do remember trying the strategy of appending chunks before, though, and I ran into the problem that at some point everything started ending up in the wrong columns. I will try it again tomorrow, though. Hopefully I can figure it out and won't need to bother you any longer.

That said, the subset you are describing sounds very useful, especially since I likely won't be able to work with the entire thing using my own resources anyway. If you could send it to me at [email protected], I would be very grateful.


nabanita- avatar nabanita- commented on May 22, 2024

I am trying to download as well. The AWS link is showing the following error:
[screenshot from 2019-02-24 15-48-15]


ushetaks avatar ushetaks commented on May 22, 2024

Hi, is the data still available for downloading?


several27 avatar several27 commented on May 22, 2024

Apologies for the inconvenience, the download is temporarily down. I'm working on bringing it back.

If anyone had stored a local copy of the dataset, I'd appreciate them sending it over.


pmacinec avatar pmacinec commented on May 22, 2024

I probably have a local copy, so I can upload it somewhere. I also have the subset of only fake and reliable news that someone mentioned in another issue, if needed. I could probably upload it to OneDrive, if it allows files that large. Or do you have any idea where to upload it?


AIRLegend avatar AIRLegend commented on May 22, 2024

I've just lost my copy this morning... Maybe you could use the AWS free tier to upload the dataset to a bucket?


several27 avatar several27 commented on May 22, 2024

@pmacinec If you can upload it to OneDrive and send me a link, that'd be amazing (maciej[at]researchably.com). I'll copy it over to a new cloud and share a free public link here. Thanks!


pmacinec avatar pmacinec commented on May 22, 2024

@several27 I have already sent you a download link, so I hope everyone will be able to download it from the new cloud soon.


gao-xian-peh avatar gao-xian-peh commented on May 22, 2024

@pmacinec, would you mind sending me the link to download it too?

Appreciate it!


pmacinec avatar pmacinec commented on May 22, 2024

Yes, of course, just send me your email. Or should I send it to the email you have public on your profile?


ushetaks avatar ushetaks commented on May 22, 2024


juewang1996 avatar juewang1996 commented on May 22, 2024

@pmacinec, could you please send me the link to download it too? I urgently need to use it. Thank you! My email is [email protected]


ushetaks avatar ushetaks commented on May 22, 2024


 avatar commented on May 22, 2024

Hi @pmacinec, could I trouble you to send the link to me as well? Thanks! [email protected]


Kerrah avatar Kerrah commented on May 22, 2024

Hey @pmacinec, sorry for bugging you, but could you send me the link as well? Thank you very much :) [email protected]


gao-xian-peh avatar gao-xian-peh commented on May 22, 2024

@pmacinec I would appreciate it if it could be sent to my email address at [email protected]

Thank you once again! :)


Ierpier avatar Ierpier commented on May 22, 2024

Would it be possible to also upload the subset of only fake and reliable articles that @pmacinec discussed?


pmacinec avatar pmacinec commented on May 22, 2024

If @several27 agrees, I can also upload this subset, or do it just for you (or maybe share code to extract only those articles from the whole dataset, processed in chunks). Just let me know.


Ierpier avatar Ierpier commented on May 22, 2024

An upload of the subset would be great. I don't have easy access to the resources needed to process the entire dataset, which is why such a subset would be very convenient in practice. Let's wait and see if @several27 agrees with sharing it, either publicly or privately. Many thanks in advance!


Ierpier avatar Ierpier commented on May 22, 2024

@several27 I tried something like this and it mostly worked, but I ran into some issues after several chunks (not sure why). Would it be okay for @pmacinec to share the subset with only fake and real articles? Either privately with me via an emailed link, or with a public link here?


Ierpier avatar Ierpier commented on May 22, 2024

@pmacinec @several27 I don't mean to be a bother, but is this still an option (uploading the fake/reliable subset)? I could post an email address that you could share a link to, so you don't have to share it publicly. It might be my code, but when I try to process it myself I run into problems, so that would be a huge help.


pmacinec avatar pmacinec commented on May 22, 2024

Maybe you could first share your code and the problems you ran into.

To anyone who ever wants only the subset of data with specific labels, please try the following code to extract just the fake and reliable news.

import pandas as pd

chunksize = 200000  # can be larger or smaller depending on your memory
for chunk in pd.read_csv('data/data.csv', chunksize=chunksize, encoding='utf-8', engine='python'):
    # keep only the rows labelled 'reliable' or 'fake' in this chunk
    x = chunk[(chunk['type'] == 'reliable') | (chunk['type'] == 'fake')]
    ...

Hope this will help.


Ierpier avatar Ierpier commented on May 22, 2024

Hm. This code is somewhat different from the one I tried. I'll try it tomorrow/monday and see if it works. Thanks! :)


Ierpier avatar Ierpier commented on May 22, 2024

@pmacinec I tried running your code, but it just gives me a dataframe of 129194 articles by the New York Times. No other sources and no fake articles at all. I also tried reading in the entire file in chunks, which still raised a memory error. Reading in the entire file as is nearly blew up my PC (as expected, haha). Reading in just some rows using nrows works just fine, though.

@several27 what is the code you used to extract just the fake and real articles? Is it the same as what @pmacinec posted, or did you do something different? What setup did you use to process it? I'm running Python on a local Jupyter server using a Python 3.5 environment.

I would love to work with a (reasonably large, but not complete) subset of the real/fake articles in this dataset since none of the other fake news datasets suit my definitions of 'fake news' as well. However, I think the sheer size of the data is unfortunately causing some issues for me here. A subset of fake and reliable articles would be an absolute lifesaver right now, just so I don't have to process the entire file on my poor laptop. If you could possibly share that with me, I would be eternally grateful :).

(I'm really sorry if I come off as 'pushy'. I'm a bit stressed about this project).


pmacinec avatar pmacinec commented on May 22, 2024

@Ierpier it is probably because you didn't finish the code above. The x variable holds only the fake and reliable news of the current chunk, and it is overwritten when the next chunk is processed. To avoid losing the data currently stored in x, you have to append it somewhere (e.g. to a new dataframe) where it will not be overwritten inside the loop.

The append function can be helpful to you: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html

Note: be careful, even just the fake and reliable news needs a lot of memory!

Or, if you are OK with having only 100,000 articles (50k reliable, 50k fake), send me your email and I will share it with you. I can share all the reliable and fake news approximately next weekend. But please try the advice above first.
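For illustration, a minimal sketch of that accumulate-then-concatenate pattern (I use pd.concat here rather than DataFrame.append, which is deprecated in recent pandas versions; the path, chunk size, and output file name are placeholders):

import pandas as pd

filtered_parts = []
for chunk in pd.read_csv('data/data.csv', chunksize=200000, encoding='utf-8', engine='python'):
    # keep only the labels we care about from this chunk
    filtered_parts.append(chunk[chunk['type'].isin(['reliable', 'fake'])])

# stitch the per-chunk results together into one dataframe and save it
subset = pd.concat(filtered_parts, ignore_index=True)
subset.to_csv('fake_reliable_subset.csv', index=False)

This way only one raw chunk is held in memory at a time, plus the filtered rows collected so far.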


impawan avatar impawan commented on May 22, 2024

@pmacinec
Hello Peter,
could you please share the dataset link with me, as the link provided by @several27 is not working now? My email is [email protected]


pmacinec avatar pmacinec commented on May 22, 2024

Hello @impawan, I don't have the data available on Google Drive anymore. I probably have a backup of this dataset on my disk, but I don't have it with me. I will be able to upload it in approximately 2 weeks.


impawan avatar impawan commented on May 22, 2024

Thanks @pmacinec for helping me with this. I will wait for an update from you. 👍


pmacinec avatar pmacinec commented on May 22, 2024

Hello @impawan. I have uploaded the dataset again and I can share it with some of you; just write to me or give me your email.

But please, @several27, are you going to upload the data again? Will the data be available for others in the future? I have only created a temporary solution for now (as before).


lgotsev avatar lgotsev commented on May 22, 2024

Hello @pmacinec. You've done a lot to keep the dataset "alive". Would you please share the dataset link with me, as the link provided by @several27 is not working, or a piece of the data, for example 100,000/200,000 articles (50/100k reliable, 50/100k fake)? My email is: [email protected]. I'd appreciate your help. Thank you!


pmacinec avatar pmacinec commented on May 22, 2024

Hello @lgotsev. The link that @several27 provided should be working, because the data are uploaded to GitHub as part of a release (https://github.com/several27/FakeNewsCorpus/releases/tag/v1.0).

Let me know if the dataset is for any reason not available again (fortunately, I still have a copy).


lgotsev avatar lgotsev commented on May 22, 2024

Thank you @pmacinec for your quick answer. I've tried several times to open the files using 7-Zip, but unpacking gives an error after extracting almost 3 GB of data. Perhaps another tool or the command prompt could help. I've noticed a new issue about this problem from July 2020 which is still open. So please give some advice on how to deal with it, or perhaps there is once again an issue with the files. Thank you!


pmacinec avatar pmacinec commented on May 22, 2024

If there is a problem with unpacking the multi-part zip files, maybe it makes sense to divide the data into chunks, zip each chunk separately, and then upload each zip here on GitHub.

@several27 what do you think? If needed, I can do something like that and prepare a pull request.
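Roughly, the split could look like this (a sketch only; the chunk size and file names are placeholders, and pandas can write zip-compressed CSVs directly):

import pandas as pd

# write the corpus back out as several independently zipped CSV parts,
# each one small enough to attach to a GitHub release
for i, chunk in enumerate(pd.read_csv('data/data.csv', chunksize=1000000, encoding='utf-8', engine='python')):
    chunk.to_csv(f'news_part_{i:03d}.csv.zip', index=False, compression='zip')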

