
Early Rumor Detection Using Neural Hawkes Process with a New Benchmark Dataset (Accepted at NAACL 2022)

This is the official repository of the EArly Rumor Detection (EARD) model HEARD and the early detection benchmark dataset BEARD.

  • BEARD is the first EARD-oriented dataset, built by collecting early-stage information relevant to claims reported on the fact-checking website Snopes. It contains 1,198 rumors and non-rumors reported during 2015/03-2021/01, with around 3.3 million relevant posts.
  • HEARD is a novel EARD model based on the Neural Hawkes Process to automatically determine an optimal time point for stable early detection.

Requirements

To install requirements:

pip install -r requirements.txt

Training

To train HEARD with default hyperparameters, run this command:

python -u Main.py

After training, it will automatically output the results in terms of accuracy, recall, precision, early rate and SEA. You may also set the hyperparameters and the data directory in config.json to train on your own dataset with custom hyperparameters.

Evaluation

By setting evaluate_only in config.json to true, you can use the same command python -u Main.py to test the trained model without training.
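
For example, you can switch to evaluation mode programmatically with a short script (a sketch only; evaluate_only is the only config.json field documented here, and the rest of the file is repo-specific):

import json

# Sketch: set the evaluate_only flag documented above so that
# `python -u Main.py` evaluates the trained model instead of training it.
with open("config.json", "r", encoding="utf-8") as f:
    cfg = json.load(f)

cfg["evaluate_only"] = True

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=4)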

Datasets

Dataset | Class     | #Instance | #Post     | AvgLen (hrs)
TWITTER | rumor     | 498       | 182,499   | 2,538
TWITTER | non-rumor | 494       | 48,064    | 1,456
PHEME   | rumor     | 1,972     | 31,230    | 10
PHEME   | non-rumor | 3,830     | 71,210    | 19
BEARD   | rumor     | 531       | 2,644,807 | 1,432
BEARD   | non-rumor | 667       | 657,925   | 1,683

Results

HEARD achieves the following performance on general classification metrics (accuracy and F1 score) and EARD-specific metrics (early rate and SEA):

Dataset | Acc   | F1    | ER    | SEA
TWITTER | 0.716 | 0.714 | 0.348 | 0.789
PHEME   | 0.823 | 0.805 | 0.284 | 0.841
BEARD   | 0.789 | 0.788 | 0.490 | 0.765

Results of Early Detection Accuracy Over Time:

[Figures: early detection accuracy over time on TWITTER, PHEME and BEARD]

BEARD Dataset

This section aims to help users obtain the public BEARD dataset, which consists of claims, queries and relevant tweets (in the form of tweet IDs), and to prepare the input data for our HEARD model.

Obtain BEARD dataset

Relevant files can be found in BEARD.zip.

BEARD_claims.json

This file provides the claim information and the Twitter search queries we used to gather the dataset. Each key-value pair of an instance contains: the instance ID, the claim content, the query and the claim publish time. The data format is:

{
   "instance id": {"claim": claim, "seed query": [sub_query, sub_query, ...], "time": time}
}
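
For instance, the file can be inspected with a few lines of Python (a minimal sketch based only on the schema above):

import json

# Load the claim file and print the first instance; field names follow the
# schema documented above.
with open("BEARD_claims.json", "r", encoding="utf-8") as f:
    claims = json.load(f)

for eid, item in claims.items():
    print(eid, item["time"], item["claim"])
    print("seed queries:", item["seed query"])
    break  # show only the first instance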

BEARD_posts.json

This file contains the tweet IDs for the 1,198 instances. The data format is as follows:

{
   "instance id": {"eid": instance id, "post_ids": [id, id, ...]}
}

Note that we cannot release the specific content of tweets due to the terms of use of Twitter data. Users can download the content via the Twitter API or by following this blog.
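
As one possible way to hydrate the tweet IDs, the sketch below uses the third-party tweepy library with the Twitter API v2; it is not part of this repository, and you need your own API credentials:

import json

import tweepy

# Sketch: fetch tweet content for the IDs in BEARD_posts.json via the
# Twitter API v2 (up to 100 IDs per lookup call; rate limits apply).
with open("BEARD_posts.json", "r", encoding="utf-8") as f:
    posts = json.load(f)

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # replace with your credential

for eid, item in posts.items():
    ids = item["post_ids"]
    for i in range(0, len(ids), 100):
        resp = client.get_tweets(ids=ids[i:i + 100], tweet_fields=["created_at", "text"])
        for tweet in resp.data or []:
            print(eid, tweet.id, tweet.created_at, tweet.text)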

Prepare input data

After obtaining the tweet content, users need to prepare the input data for the HEARD model. Users can follow BERTweet [code] to pre-process the text and this example to generate tf-idf vectors. Alternatively, users can refer to our example code in data_process.py.
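
As a rough illustration of the tf-idf step (a sketch only; data_process.py defines the exact preprocessing used for HEARD), scikit-learn's TfidfVectorizer can produce the 1000-dimensional vectors expected by the input format below:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: fit a 1000-dimensional tf-idf vocabulary on the (BERTweet
# pre-processed) post texts; tokenization and fitting choices in
# data_process.py may differ.
texts = ["normalized post text 1", "normalized post text 2"]  # placeholder data
vectorizer = TfidfVectorizer(max_features=1000)
tfidf = vectorizer.fit_transform(texts)  # sparse matrix, shape [num_posts, <=1000]
vecs = tfidf.toarray().tolist()          # dense lists, one vector per text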

After the above steps, each input instance should be in the following format, which will be fed into our HEARD model:

{
  "eid": {
      "label": "1",  # 1 for rumor, 0 for non-rumor
      "merge_seqs": {
          "merge_times": [[timestamp, timestamp, ...], [timestamp, timestamp, ...], ...],
          "merge_tids": [[post_id, post_id, ...], [post_id, post_id, ...], ...],
          "merge_vecs": [[...], [...], ...]  # one 1000-dim tf-idf vector per interval, so merge_vecs has shape [num_intervals, 1000]
      }
  },
  ...
}
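
For concreteness, the sketch below (not the repository's own code) assembles one instance into this structure, assuming each interval is already represented by its post timestamps, post IDs and a single 1000-dimensional tf-idf vector; how posts are grouped into intervals follows the HEARD code and is not shown here, and the output filename is only an example:

import json

# Hypothetical per-interval inputs.
interval_times = [[1425168000, 1425168060], [1425171600]]  # POSIX timestamps
interval_tids = [["1001", "1002"], ["1003"]]               # post IDs
interval_vecs = [[0.0] * 1000, [0.0] * 1000]               # one tf-idf vector per interval

instance = {
    "example_eid": {
        "label": "1",  # 1 for rumor, 0 for non-rumor
        "merge_seqs": {
            "merge_times": interval_times,
            "merge_tids": interval_tids,
            "merge_vecs": interval_vecs,  # shape [num_intervals, 1000]
        },
    }
}

with open("input_data.json", "w", encoding="utf-8") as f:  # example path; set the actual data dir in config.json
    json.dump(instance, f)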

Citation

If you use this code in your research, please cite our paper.

@inproceedings{zeng-gao-2022-early,
    title = "{E}arly Rumor Detection Using Neural {H}awkes Process with a New Benchmark Dataset",
    author = "Zeng, Fengzhu  and
      Gao, Wei",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.302",
    pages = "4105--4117"
}

Contact for issues

References & Open sources
