RevDet

RevDet is an algorithm for robust and efficient event detection and tracking in large news feeds. It tracks events with an iterative clustering approach. Many events continue to develop for days or even months, yet RevDet detects and tracks them while using only a constant amount of main memory. It takes three inputs: news article data as per-day files (sorted by ascending event timestamp, with two required columns: a list of locations and a heading), a window size, and a threshold for the BIRCH clustering algorithm. It then forms event chains and writes each chain to a separate file.
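The constant-memory claim follows from only ever holding one window of days at a time. The sketch below is a minimal illustration of that idea, not the RevDet implementation: `sliding_windows` is a hypothetical helper, each "day" is just a list of headings, and the clustering step that RevDet would run on each window is omitted.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple


def sliding_windows(
    days: Iterable[List[str]], window_size: int = 8
) -> Iterator[Tuple[List[str], ...]]:
    """Yield the most recent `window_size` days each time a new day arrives.

    Hypothetical helper: memory is bounded by the window, because the deque
    drops the oldest day automatically once `maxlen` is reached.
    """
    window: Deque[List[str]] = deque(maxlen=window_size)
    for day in days:
        window.append(day)
        # A real pipeline would cluster the articles in `window` here.
        yield tuple(window)
```

However many days the feed contains, no more than `window_size` days are held in the deque at once.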

The figure below shows the per-day active event chains over a year, as formed by RevDet, against the ground-truth chains. To form these chains, RevDet used only the memory required to store eight days of data.

Dataset

The event chain algorithm has been run on the w2e_gkg dataset, which has been prepared as below:

Dataset Link: https://drive.google.com/file/d/1Xc_9FJkaYsCcNPMatlHvHmyGr7NJAPSN/view?usp=sharing

Running RevDet

First, the w2e_gkg dataset needs pre-processing to remove redundant (duplicate) news articles. It then has to be split into per-day files, which serve as the input to the algorithm. Both steps can be done by running prepare_data.py:

python3 prepare_data.py
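For readers who want a feel for what these two steps involve, here is a rough sketch. It is not the contents of prepare_data.py: the function name `split_per_day`, the `date` column, and the rule for what counts as a duplicate (same date and same normalized heading) are all assumptions made for illustration.

```python
import csv
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def split_per_day(rows: Iterable[Dict[str, str]], out_dir: str) -> List[Path]:
    """Drop duplicate (date, heading) rows, then write one CSV file per date.

    Assumes every row has "date" (ISO format, so string sort is chronological)
    and "heading" keys, and that all rows share the same columns.
    """
    seen = set()
    per_day: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for row in rows:
        key = (row["date"], row["heading"].strip().lower())
        if key in seen:
            continue  # redundant article, skip it
        seen.add(key)
        per_day[row["date"]].append(row)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for day in sorted(per_day):
        path = out / f"{day}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(per_day[day][0].keys()))
            writer.writeheader()
            writer.writerows(per_day[day])
        written.append(path)
    return written
```

This version groups everything in memory before writing, which is fine for a small sample but would need streaming for a full-size dataset.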

You can now run the script run_revdet.py to run RevDet on the prepared dataset and evaluate the formed chains against the ground-truth chains. A plot of precision, recall, and F-measure for different window sizes can be generated with:

python3 run_revdet.py --plotgraph
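As a reminder of how the three plotted metrics relate, the sketch below computes them for a single formed chain against a single ground-truth chain, each represented as a set of article identifiers. How run_revdet.py actually pairs and scores chains is not described here, so the function name and the set-based representation are assumptions.

```python
from typing import Set, Tuple


def precision_recall_f(formed: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for one formed chain vs. one true chain."""
    overlap = len(formed & truth)
    precision = overlap / len(formed) if formed else 0.0
    recall = overlap / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```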

A macro-level comparison plot of the ground-truth and formed chains can be generated with:

python3 run_revdet.py --plotactivechains
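One way to obtain the per-day counts behind such a comparison is sketched below. It assumes a chain is "active" on every day between its first and last article, inclusive; that definition and the helper name `active_chains_per_day` are illustrative and may not match what run_revdet.py does.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Dict, Iterable, Tuple


def active_chains_per_day(spans: Iterable[Tuple[date, date]]) -> Dict[date, int]:
    """Count how many chains are active on each day.

    Each span is a (first_day, last_day) pair for one chain, both inclusive.
    """
    counts: Counter = Counter()
    for start, end in spans:
        day = start
        while day <= end:
            counts[day] += 1
            day += timedelta(days=1)
    return dict(counts)
```

Running it once on the formed chains and once on the ground-truth chains gives two series that can be plotted against each other.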

Other options for run_revdet.py

Setting input and output directories

  • --inputchains: Directory for redundancy-removed input event chains. Default is redundancy_removed_chains/.
  • --outputchains: Directory for output event chains. Default is output_chains/.
  • --perdaydata: Directory for per-day data. Default is per_day_data/.

Algorithm Options

  • --birch_thresh: Threshold for the BIRCH algorithm. Default is 2.3.
  • --window_size: Window size for the RevDet algorithm. Default is 8.
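The options above could be declared with argparse roughly as follows. This is a sketch built from the flag names and defaults listed in this README, not a copy of run_revdet.py: the argument types and the treatment of the two plot options as boolean switches are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Run RevDet (illustrative parser)")
    # Input and output directories
    parser.add_argument("--inputchains", default="redundancy_removed_chains/")
    parser.add_argument("--outputchains", default="output_chains/")
    parser.add_argument("--perdaydata", default="per_day_data/")
    # Algorithm options (types are assumed)
    parser.add_argument("--birch_thresh", type=float, default=2.3)
    parser.add_argument("--window_size", type=int, default=8)
    # Plot switches (assumed to be on/off flags)
    parser.add_argument("--plotgraph", action="store_true")
    parser.add_argument("--plotactivechains", action="store_true")
    return parser
```

With a parser like this, an invocation such as `python3 run_revdet.py --window_size 4 --plotgraph` would override the window size and leave every other option at its default.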

Reference

Azeemi, A. H., Sohail, M. H., Zubair, T., Maqbool, M., Younas, I., & Shafiq, O. (2021). RevDet: Robust and Memory Efficient Event Detection and Tracking in Large News Feeds. International Workshop on Advanced Analytics and Learning on Temporal Data @ ECML PKDD, 2021 (In-Press).

Preprint: arXiv:2103.04390.

revdet's Issues

Dataset access not public

Hello,
Thank you for your contribution.
I tried to access the dataset, but I need to do an access request. Can you please make it public because it's better for me or other people who want to access it. Thank you.
