nela-gt's Introduction

NELA-GT repository

This repository contains usage examples for the NELA-GT-2020 data set with Python 3.

NELA-GT-2022

Metadata

| Dataset name | nela-gt-2022 |
| --- | --- |
| Formats | Sqlite3, JSON |
| No. of articles | 1778361 |
| No. of sources | 361 |
| No. of embedded tweets | 346283 |
| No. of articles w/ tweets | 137150 |
| Collection period | 2022-01-01 to 2022-12-31 |

NELA-GT-2021

If you use this dataset in your work, please cite us as follows:

@misc{
    gruppi2020nelagt2021,
    title={NELA-GT-2021: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles},
    author={Maurício Gruppi and Benjamin D. Horne and Sibel Adalı},
    year={2021},
    eprint={---},
    archivePrefix={arXiv},
    primaryClass={cs.CY}
}

Data

Metadata

| Dataset name | nela-gt-2021 |
| --- | --- |
| Formats | Sqlite3, JSON |
| No. of articles | 1856509 |
| No. of sources | 367 |
| No. of embedded tweets | 405449 |
| No. of articles w/ tweets | 153663 |
| Collection period | 2021-01-01 to 2021-12-31 |

Download

Limitations

Because the articles collected from news sources may be copyrighted, we transform the original text so that it cannot be used for its originally intended purpose, i.e., being read by individuals to consume journalistic information. The transformed text can still be used for text analysis.

For articles with more than 200 tokens, we replace 7 tokens with @ every 100 tokens. For articles with fewer than 200 tokens, we replace 5 consecutive tokens with @ every 20 tokens. This makes it unlikely that a user will read NELA-GT to consume news, while keeping most of the content that is useful for analysis (about 7% of the tokens are replaced in larger articles).
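The masking rule above can be illustrated with a minimal sketch. This is not the authors' preprocessing code: whitespace tokenization, masking the first tokens of each window, replacing each masked token with a single @, and treating an article of exactly 200 tokens as a short article are all assumptions made for illustration. The function name `mask_tokens` is hypothetical.

```python
def mask_tokens(text: str) -> str:
    """Illustrative version of the masking rule (not the original code)."""
    tokens = text.split()  # assumption: tokens are whitespace-separated
    if len(tokens) > 200:
        run, window = 7, 100   # long articles: 7 tokens every 100
    else:
        run, window = 5, 20    # short articles: 5 tokens every 20
    for start in range(0, len(tokens), window):
        # Assumption: the masked run sits at the start of each window.
        for i in range(start, min(start + run, len(tokens))):
            tokens[i] = "@"
    return " ".join(tokens)


# A 300-token article has 3 windows of 100 tokens, so 21 tokens are masked.
long_article = " ".join(f"w{i}" for i in range(300))
masked_long = mask_tokens(long_article)
masked_fraction = masked_long.split().count("@") / 300

# A 40-token article has 2 windows of 20 tokens, so 10 tokens are masked.
short_article = " ".join(f"w{i}" for i in range(40))
masked_short = mask_tokens(short_article)
```

Under these assumptions, 7 of every 100 tokens are masked in long articles, which matches the roughly 7% figure quoted above.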

Tables

Table: Newsdata

Each data point collected corresponds to an article and contains the fields described below.

| Field | Type | Description |
| --- | --- | --- |
| id | string | ID of the article. |
| date | string | Date of publication (YYYY-MM-DD). |
| source | string | Name of the source. |
| title | string | Article's headline. |
| content | string | Article's body text. |
| author | string | Author who signed the article. |
| published | string | Date-time string as provided by the source. |
| published_utc | integer | Unix timestamp of publication. |
| collection_utc | integer | Unix timestamp of the collection date. |
| url | string | URL of the article. |

Table: Tweet

Each entry corresponds to an embedded tweet observed in the article with id article_id.

| Field | Type | Description |
| --- | --- | --- |
| id | string | ID of the embedded tweet. |
| article_id | string | ID of the article that contains the embedded tweet. |
| embedded_tweet | string | ID/URL of the embedded tweet. |
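Because each tweet row carries the article_id of the article it was found in, the two tables can be joined on that key. The sketch below counts embedded tweets per article using Python's standard sqlite3 module. The table names `newsdata` and `tweet` are assumed from the table headings above, and the in-memory database with made-up rows and a reduced set of columns only stands in for the real file.

```python
import sqlite3


def tweets_per_article(conn):
    """Count embedded tweets per article via tweet.article_id = newsdata.id."""
    query = """
        SELECT n.id, n.source, COUNT(t.id) AS n_tweets
        FROM newsdata AS n
        JOIN tweet AS t ON t.article_id = n.id
        GROUP BY n.id, n.source
        ORDER BY n_tweets DESC, n.id
    """
    return conn.execute(query).fetchall()


# Tiny in-memory stand-in for the real database (made-up rows).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE newsdata (id TEXT, source TEXT, title TEXT)")
demo.execute("CREATE TABLE tweet (id TEXT, article_id TEXT, embedded_tweet TEXT)")
demo.executemany(
    "INSERT INTO newsdata VALUES (?, ?, ?)",
    [("a1", "examplesource", "First"),
     ("a2", "othersource", "Second"),
     ("a3", "othersource", "Third")],
)
demo.executemany(
    "INSERT INTO tweet VALUES (?, ?, ?)",
    [("t1", "a1", "111"), ("t2", "a1", "222"), ("t3", "a2", "333")],
)
rows = tweets_per_article(demo)
```

Articles without embedded tweets (here `a3`) do not appear in the result because the query uses an inner join; a LEFT JOIN would keep them with a count of zero.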

Aggregated labels

We provide aggregated labels based on Media Bias/Fact Check reports, classifying each source as:

  • Reliable - class 0
  • Unreliable - class 1
  • Mixed - class 2
  • Null - invalid label, -1 or null

These labels can be found in labels.csv

Note: the labels used in this aggregation were collected from Media Bias/Fact Check on Mar 20, 2020.
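A minimal sketch of reading the aggregated labels and mapping the class numbers to names follows. The column names `source` and `label` are assumptions and should be checked against the header of the actual labels.csv; the sample CSV text is made up. Missing values, -1, and sources absent from the file are all treated as unlabeled (`None`).

```python
import csv
import io

# Class numbers as listed above; anything else is treated as unlabeled.
LABEL_NAMES = {0: "reliable", 1: "unreliable", 2: "mixed"}


def read_source_labels(csv_file, source_col="source", label_col="label"):
    """Return {source name: label name or None}.

    The column names are hypothetical defaults, not confirmed from labels.csv.
    """
    labels = {}
    for row in csv.DictReader(csv_file):
        raw = (row.get(label_col) or "").strip()
        try:
            value = int(float(raw))  # accepts "1" as well as "1.0"
        except (ValueError, OverflowError):
            value = -1               # empty or non-numeric: invalid label
        labels[row[source_col]] = LABEL_NAMES.get(value)
    return labels


# Made-up sample standing in for labels.csv.
sample = io.StringIO(
    "source,label\n"
    "sourcea,0\n"
    "sourceb,1.0\n"
    "sourcec,2\n"
    "sourced,-1\n"
    "sourcee,\n"
)
labels = read_source_labels(sample)
unknown = labels.get("not-in-file")  # sources missing from the file
```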

Examples

Please refer to these examples for details on how to use our dataset with Python 3 and Pandas.

load-sqlite3.py

  • How to load the data from the Sqlite3 database using SQL queries.
    • Loading data from single or multiple sources from the database
    • Loading data from the database into a Pandas dataframe

Usage:

python3 load-sqlite3.py <path-to-database>
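As a rough illustration of what loading one or several sources involves (this is not the contents of load-sqlite3.py), the sketch below runs a parameterized query with the standard sqlite3 module. The table name `newsdata` is assumed from the table description above, `load_articles` is a hypothetical helper, and the in-memory database with made-up rows stands in for the real file.

```python
import sqlite3


def load_articles(conn, sources):
    """Return articles from the given sources as a list of dicts."""
    sources = list(sources)
    if not sources:
        return []
    # One "?" placeholder per source keeps the query parameterized.
    placeholders = ",".join("?" for _ in sources)
    query = (
        "SELECT id, date, source, title FROM newsdata "
        f"WHERE source IN ({placeholders}) ORDER BY date, id"
    )
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    return [dict(row) for row in conn.execute(query, sources)]


# Tiny in-memory stand-in for the real database (made-up rows).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE newsdata (id TEXT, date TEXT, source TEXT, title TEXT)")
demo.executemany(
    "INSERT INTO newsdata VALUES (?, ?, ?, ?)",
    [("a1", "2020-01-02", "examplesource", "First"),
     ("a2", "2020-01-01", "othersource", "Second"),
     ("a3", "2020-01-03", "thirdsource", "Third")],
)
one_source = load_articles(demo, ["examplesource"])
two_sources = load_articles(demo, ["examplesource", "othersource"])
```

With the real file, the connection would come from `sqlite3.connect(<path-to-database>)`. The same query and parameters can also be passed to `pandas.read_sql_query` to obtain a dataframe directly.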

load-json.py

  • How to load NELA in JSON format with Python 3.
    • Loading a single source's JSON
    • Loading a directory of NELA JSON files - WARNING: this consumes a lot of memory

Usage:

python3 load-json.py <path-to-file>
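The memory warning above suggests iterating over a directory one file at a time rather than loading everything at once. The sketch below is not load-json.py itself; it assumes each source file is a JSON array of article objects with the fields of the Newsdata table, and that files end in `.json`. Both function names are hypothetical, and the temporary directory with made-up articles stands in for the real data.

```python
import json
import tempfile
from pathlib import Path


def load_source_file(path):
    """Load one source's JSON file (assumed to be a list of article dicts)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_articles(directory):
    """Yield articles file by file, so only one source is in memory at a time."""
    for path in sorted(Path(directory).glob("*.json")):
        yield from load_source_file(path)


# Made-up files standing in for a directory of NELA JSON files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "examplesource.json").write_text(
        json.dumps([{"id": "a1", "source": "examplesource", "title": "First"}]),
        encoding="utf-8",
    )
    Path(tmp, "othersource.json").write_text(
        json.dumps([
            {"id": "a2", "source": "othersource", "title": "Second"},
            {"id": "a3", "source": "othersource", "title": "Third"},
        ]),
        encoding="utf-8",
    )
    single = load_source_file(Path(tmp, "examplesource.json"))
    titles = [article["title"] for article in iter_articles(tmp)]
```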

About NELA-GT-2020

Citation

If you use this dataset in your work, please cite us as follows:

@misc{
    gruppi2020nelagt2020,
    title={NELA-GT-2020: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles},
    author={Maurício Gruppi and Benjamin D. Horne and Sibel Adalı},
    year={2021},
    eprint={---},
    archivePrefix={arXiv},
    primaryClass={cs.CY}
}

Data

We release our main news dataset NELA-GT-2020 along with two subsets, created by doing keyword searches on the main dataset. We introduce the NELA-GT-ELECTIONS dataset, containing articles related to the 2020 U.S. Presidential Elections, and the NELA-GT-COVID19 subset, which contains articles related to the COVID-19 pandemic.

Metadata

| Dataset name | NELA-GT-2020 | NELA-GT-ELECTIONS | NELA-GT-COVID19 |
| --- | --- | --- | --- |
| Formats | Sqlite3, JSON | Sqlite3, JSON | Sqlite3, JSON |
| No. of articles | 1779127 | 294504 | 699803 |
| No. of sources | 519 | 403 | 493 |
| No. of embedded tweets | 410784 | 107771 | 158855 |
| Collection period | 2020-01-01 to 2020-12-31 | 2020-01-01 to 2020-12-31 | 2020-01-01 to 2020-12-31 |

Download

  • News Data

  • Source Labels: CSV

    • This file contains the credibility label for news sources in the dataset (reliable, unreliable, mixed).

For more details about this dataset, see the paper.

nela-gt's Issues

Source level labels as proxy for news label

First off, thank you so much for your efforts in making the NELA-GT datasets available. I would like to extend the use case of the dataset to news classification. Does it make sense to use source-level labels directly as news-level labels? I understand that the labelling won't be 100% accurate, but would it be accurate enough to be used for fake news classification problems?

Labels - NELA-GT-2020

Hello,

I'm currently working with the news data from NELA-GT-2020. I'm using the labels.csv file to assign labels to the articles. However, I've encountered an issue where I can't perfectly match the sources' scores with the sources mentioned in the articles. To clarify, there are 519 sources in the dataset, but the labels.csv file, which can be found here, only contains 336 sources.

My question is, should I consider the missing sources as unlabeled?

Additionally, I've noticed that there are sources without labels that seem very similar to others that are labeled. For instance, there are sources without labels like "chicagosun-times" and "thehuffingtonpost," while there are labeled sources like "chicagosuntimes" and "huffingtonpost."

I would like to express my gratitude in advance for any help you can provide regarding this issue.
Alexandra Silva

Data preprocessing details

Could you give more details on how you preprocess the data? I noticed underscore characters are present instead of some special characters, for example.

It would be ideal if you could share the code you used to preprocess the data. I am comparing another dataset to NELA and I need to apply the same preprocessing steps to make sure the discriminators don't pick up preprocessing differences between the datasets.

Thank you for your help!

About the '@@@@@@@' in contents

I have read your paper 'NELA-GT-2020: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles' and downloaded the dataset from https://doi.org/10.7910/DVN/CHMUYZ. I found that each article has ‘@@@@@@@’ popping up in place of normal words, which causes great trouble for research. Can I have the true content of every news article? By the way, can I know which keywords you used to get the 2020 U.S. Presidential Election subset?

outgoing links

Great dataset, thanks for your work. I wonder if you can also provide the outgoing links in the articles? It would also be interesting to see tweets that link to those news articles, besides the ones embedded in the articles.

Bests
ZP

2021 Version of the dataset?

Hello, will you release a newer version of the dataset? This resource is very useful, and having an up-to-date version would be great.
Thank you!

Keywords used for the 2020 election split

Thanks for your work!
I'm trying to base my study on your valuable corpus and found the keywords you provided to filter the two subsets in the 2021 corpus helpful.
I'm wondering if you could provide the keywords you used to filter the 2020 covid/election subsets, like the ones you provided in the 2021 version. I found the 2021 covid keywords all-encompassing, while the 2021 Capitol riot keywords are very limited to that event.
If you still have that, could you please share the file? I'd really appreciate it :)
