
fakenewscorpus's Introduction

OpenSources Data

Fake News Corpus

This is an open source dataset composed of millions of news articles, mostly scraped from a curated list of 1001 domains from http://www.opensources.co/. Because the list does not contain many reliable websites, articles from the NYTimes and the WebHose English News Articles dataset have additionally been included to better balance the classes. The corpus is mainly intended for training deep learning algorithms for fake news recognition. The dataset is still a work in progress, and for now the public version includes only 9,408,908 articles (745 out of 1001 domains).

Downloading

https://github.com/several27/FakeNewsCorpus/releases/tag/v1.0

How was the corpus created?

The corpus was created by scraping (using scrapy) all the domains provided by http://www.opensources.co/. The raw HTML was then processed with the newspaper library to extract the article text along with some additional fields (listed below). Each article was assigned the same label as the one associated with its domain. All the source code is available at FakeNewsRecognition and will be made more “usable” in the next few months.
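
For illustration, here is a minimal sketch of the kind of per-article extraction the newspaper library performs. The URL is hypothetical, and the actual FakeNewsRecognition pipeline parses HTML already fetched by scrapy, so details may differ.

```python
from newspaper import Article

# Hypothetical URL, used only for illustration.
url = "https://example.com/some-article"

article = Article(url)
article.download()   # fetch the raw HTML (scrapy handles this step in the real pipeline)
article.parse()      # extract title, text, authors and meta fields
article.nlp()        # derive keywords and summary (requires NLTK data)

record = {
    "url": url,
    "title": article.title,
    "content": article.text,
    "authors": article.authors,
    "keywords": article.keywords,
    "meta_keywords": article.meta_keywords,
    "meta_description": article.meta_description,
    "summary": article.summary,
}
```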

Formatting

The corpus is formatted as a CSV and contains the following fields:

  • id
  • domain
  • type
  • url
  • content
  • scraped_at
  • inserted_at
  • updated_at
  • title
  • authors
  • keywords
  • meta_keywords
  • meta_description
  • tags
  • summary
  • source (opensources, nytimes, or webhose)
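
A minimal loading sketch with pandas, assuming the extracted release file is named news_cleaned_2018_02_13.csv (the file name used in the hosted release):

```python
import pandas as pd

# Peek at the first rows without loading the full multi-gigabyte file.
df = pd.read_csv("news_cleaned_2018_02_13.csv", nrows=1_000)

print(df.columns.tolist())                      # should match the field list above
print(df[["domain", "type", "title"]].head())
```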

Available types

More information on http://www.opensources.co.

Type | Tag | Count (so far) | Description
Fake News | fake | 928,083 | Sources that entirely fabricate information, disseminate deceptive content, or grossly distort actual news reports.
Satire | satire | 146,080 | Sources that use humor, irony, exaggeration, ridicule, and false information to comment on current events.
Extreme Bias | bias | 1,300,444 | Sources that come from a particular point of view and may rely on propaganda, decontextualized information, and opinions distorted as facts.
Conspiracy Theory | conspiracy | 905,981 | Sources that are well-known promoters of kooky conspiracy theories.
State News | state | 0 | Sources in repressive states operating under government sanction.
Junk Science | junksci | 144,939 | Sources that promote pseudoscience, metaphysics, naturalistic fallacies, and other scientifically dubious claims.
Hate News | hate | 117,374 | Sources that actively promote racism, misogyny, homophobia, and other forms of discrimination.
Clickbait | clickbait | 292,201 | Sources that provide generally credible content, but use exaggerated, misleading, or questionable headlines, social media descriptions, and/or images.
Proceed With Caution | unreliable | 319,830 | Sources that may be reliable but whose contents require further verification.
Political | political | 2,435,471 | Sources that provide generally verifiable information in support of certain points of view or political orientations.
Credible | reliable | 1,920,139 | Sources that circulate news and information in a manner consistent with traditional and ethical practices in journalism. (Remember: even credible sources sometimes rely on clickbait-style headlines or occasionally make mistakes. No news organization is perfect, which is why a healthy news diet consists of multiple sources of information.)
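
The per-type counts above can be recomputed from the corpus itself. A sketch with pandas, again assuming the release's file name, streams only the type column in chunks so the full file never has to fit in memory:

```python
import pandas as pd

# Tally articles per type without loading the whole CSV at once.
counts = pd.Series(dtype="float64")
for chunk in pd.read_csv("news_cleaned_2018_02_13.csv",
                         usecols=["type"], chunksize=500_000):
    counts = counts.add(chunk["type"].value_counts(), fill_value=0)

print(counts.astype("int64").sort_values(ascending=False))
```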

List of domains

You can find the full list of domains in websites.csv.

Limitations

The dataset was not manually filtered, so some of the labels might not be correct and some of the URLs might point to other pages on a website rather than to actual articles. However, because the corpus is intended for training machine learning algorithms, these problems should not pose a practical issue.

Additionally, once the dataset is finalised (for now only about 80% has been cleaned and published), I do not intend to update it, so it might quickly become outdated for purposes other than content-based algorithms. However, any contributions are welcome!

Contributing

Because I am currently the only person working on this corpus, I would really appreciate any contributions. If you have found wrong labels associated with any articles, weirdly formatted content, or URLs that do not point to an article, feel free to open an issue with the problem and the exact article id, and I will do my best to respond promptly. Because of the size of the corpus, I could not host it on GitHub, so unfortunately pull requests cannot be used to collaboratively work on the data for now; however, I'm open to any ideas 🙂

Acknowledgments


fakenewscorpus's Issues

download the data

Is the data still open to the public? I clicked the link but can't download the data.

File Corrupt

The file news.csv.z01 from the split series seems to be corrupt when extracting the archive. Should I skip this file by substituting a dummy, or is there a replacement/fix?

Unable to extract the complete dataset

I used this command to create a combined zip from the split parts in a directory:
zip -F (name of the last part of the archive, which ends with .zip, not .z0X) --out (output name of the combined archive).zip
and then unzip (archive name).zip to unpack it. Unpacking gives an error after extracting 2.8 GB of data.
I also used p7zip to unpack individual files and the combined file, and I get an error.
@several27 It'd be great if you could help me with this.

Download data

Hi,

Would it be possible for you to host the dataset somewhere else as a more accessible download?

I have tried downloading the dataset via awscli, but it throws an error that indicates a permission and/or region mismatch. Is the bucket still public?

wrong openmagazines.com content

Hello!

Thank you for this huge dataset!
I am currently working with it ("fake" and "reliable" labels only for now), and I will probably find some problems in it, the major ones of which I will post here :).

First one:
the contents of openmagazines.com articles are mostly the following:
"This website uses cookies to improve your experience. We'll assume you're ok with this, but you can opt-out if you wish. Accept"
for 1023 entries out of 1081.
As a quick fix, automatically replacing this content with None or "" would be cleaner :).

Cannot save data to database

Hi,

I am trying to save the data into a Cassandra database, but it cannot interpret the CSV. I then tried to check the file with Excel and a pandas DataFrame, and both reported that it is not in a valid CSV format.

Could you help me find a way to store the data in Cassandra?
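
If a downstream loader chokes on the file, skipping malformed rows at read time often helps. A hedged sketch with pandas, assuming the release's file name (on_bad_lines requires pandas 1.3 or later):

```python
import pandas as pd

# Read a slice of the corpus, discarding rows that break CSV parsing
# instead of aborting the whole load.
df = pd.read_csv(
    "news_cleaned_2018_02_13.csv",
    nrows=100_000,          # remove nrows (or use chunksize) for the full file
    on_bad_lines="skip",    # pandas >= 1.3; older versions use error_bad_lines=False
)
print(len(df), "rows parsed")
```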

Getting the date of when each article was published

Hi @several27 ,

I am currently using your corpus for an NLP project and was wondering whether you have the publication date of each article available. Otherwise, I was wondering whether you have the raw HTML of each article that I could download, since I can retrieve these dates from the raw HTML. A lot of the domains are dead, so I can no longer look these dates up on the internet.

Thanks,
Changxiao

Put a sample in the readme

Any chance of getting the head of the dataset into the readme? The file is so big that I've been wasting a lot of time just trying to access and organize it. Having the first 5-10 rows would have saved me a headache. I'm sure someone else could benefit from this.

CSV File Error

Hello, after I extracted the news.csv.zip file and opened it in Excel, it only showed me a grayed-out screen. Excel did not recognize a file as being open, since I could not Save As or perform any other actions. I also tried opening the CSV via the Data tab's Get External Data option, which did not work either. I believe the files you have uploaded may be corrupted. I would appreciate it if you could update them, or if anyone knows a solution to my problem, please respond.

Thank You

How to split the file

Hello, can someone give me some tips on how to split the file into smaller files?

Thanks.
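
One way to split the corpus is to read it in chunks with pandas and write each chunk back out, which respects quoted multi-line article bodies that a naive line-based split would break. A sketch assuming the release's file name:

```python
import pandas as pd

# Stream the corpus and write each chunk to its own smaller CSV.
reader = pd.read_csv("news_cleaned_2018_02_13.csv", chunksize=100_000)
for i, chunk in enumerate(reader):
    chunk.to_csv(f"news_part_{i:03d}.csv", index=False)
```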

Label "rumor"

Awesome work, but in the files we can see a type "rumor" that is not documented in the readme...

Citing the dataset

Hi @several27!

I'm doing some research on fake news datasets for automatic detection, and your corpus is the most complete I've found; it could be really useful for my study! But I don't know whether you have a preferred way to cite your work or whether it's published somewhere.
Do you have an email address where I can contact you?

Thanks!

Not able to get the zip file via the wget command in Google Colab

Hi,

When I try to download the file using the wget command on Google Colab, I get the error below:

--2020-01-03 04:14:40-- https://storage.googleapis.com/researchably-fake-news-recognition/news_cleaned_2018_02_13.csv.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.97.128, 2404:6800:4008:c03::80
Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.97.128|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2020-01-03 04:14:40 ERROR 403: Forbidden.

NYTimes data?

Dear Maciej,

Thanks a lot for making this amazing dataset available! :D I have one quick question and a comment.

  1. I found that this dataset includes 1.5M NYTimes articles; can you elaborate a little more on how you collected them?
  2. I'd love to use this dataset for research, but the lack of details on the data collection procedure (e.g., when the collection started and ended, and what the time range of the collected news articles is) makes it really hard to use this data for academic purposes. If you could describe how you collected this data, it would be greatly helpful!

Thanks,
Jisun
