
twitter-fake-claim-detection

Introduction

This is a submission for the term project of the course CS 529 - Topics and Tools in Social Media Data Mining at the Indian Institute of Technology, Guwahati. It provides a solution to the problem of fake news detection on tweets. For this, we use the following information present in a tweet:

  1. Text content of the tweet
  2. Related tweets on the basis of hashtags
  3. Related tweets on the basis of mentions


Running the Project

  • Dependencies

    Command:

      pip3 install -r requirements.txt
    

    Purpose:

    This installs all the dependencies of the project. Use Python 3 to run the project.


  • Dataset

    Link to the dataset: https://drive.google.com/drive/folders/1gSx4S9i6Haul4TQRkoNQtj3sRHVwGFQ3

    Instructions:

    Download the dataset and place it at the same level as the source code. First extract the downloaded zip files, and then name the extracted folders "politifact_fake" and "politifact_real".

    Description of Dataset:

    It consists of two types of tweets: fake and real, placed in two different folders. Each folder (fake/real) contains many sub-folders, and each sub-folder contains several files. The file of interest is tweets.json, which stores the tweets as a Python list whose elements are JSON objects with the following keys:

    1. text :- The content of the tweet.
    2. user_id :- Twitter account id.
    3. created_at :- Creation time of the tweet (in milliseconds).
    4. tweet_id :- Tweet id.
    5. user_name :- Username of the account that created the tweet.
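
    For reference, the sketch below shows one way to load these files; it is only illustrative and assumes the folder naming described above.

      import json
      from pathlib import Path

      # Iterate over every sub-folder of the fake split ("politifact_fake" as named above).
      for tweets_file in Path("politifact_fake").glob("*/tweets.json"):
          with open(tweets_file) as f:
              tweets = json.load(f)  # a list of dicts with the keys listed above
          for tweet in tweets:
              print(tweet["tweet_id"], tweet["user_name"], tweet["text"][:80])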

  • Data Mining

    Command:

      python3 tweet_mining.py
    

    NOTE:

    I. This step has already been executed, and you can skip it unless you wish to use a different dataset than the one given in the previous section.

    II. Before executing the command, ensure that the dataset root folder is present at the source file level.

    III. Since the dataset is huge, it has been submitted directly and is not available in cloud storage as of now.


    Output:

    Two folders named politifact_fake_related_tweets and politifact_real_related_tweets are created. They contain sub-folders corresponding to each tweet, which store the related tweets as described below.

    Purpose:

    The purpose of this step is to extract and store all the related tweets (found via hashtags and mentions) of a particular tweet. For this, we iterate over all the files of the dataset and over all the tweets within each file (tweets.json), and use the Tweepy API to perform the following (a sketch follows the list):

    1. For each mention in the tweet, extract 20 tweets created since the tweet's creation date by the mentioned Twitter handle, and store them in a CSV file named mention_<mention_text>.csv under the folder named tweet_<tweet_id>.
    2. For each hashtag in the tweet, extract 20 tweets created since the tweet's creation date that contain the same hashtag, and store them in a CSV file named hashtag_<hashtag_text>.csv under the folder named tweet_<tweet_id>.
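
    A minimal sketch of this mining loop for a single tweet is shown below. It assumes Tweepy v4 naming (OAuth1UserHandler, user_timeline, search_tweets); the credentials, the save_related_tweets helper, and the omission of date filtering are illustrative assumptions rather than the exact code in tweet_mining.py.

      import csv
      import os
      import tweepy

      # Hypothetical credentials; replace with your own Twitter API keys.
      auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
      api = tweepy.API(auth, wait_on_rate_limit=True)

      def save_related_tweets(tweet_id, mentions, hashtags):
          """Fetch up to 20 related tweets per mention/hashtag and buffer them as CSV files."""
          folder = f"tweet_{tweet_id}"
          os.makedirs(folder, exist_ok=True)
          # Tweets from each mentioned handle's timeline.
          for handle in mentions:
              statuses = api.user_timeline(screen_name=handle, count=20, tweet_mode="extended")
              with open(os.path.join(folder, f"mention_{handle}.csv"), "w", newline="") as f:
                  writer = csv.writer(f)
                  for status in statuses:
                      writer.writerow([status.id, status.created_at, status.full_text])
          # Tweets that share a hashtag with the source tweet.
          for tag in hashtags:
              statuses = api.search_tweets(q=f"#{tag}", count=20, tweet_mode="extended")
              with open(os.path.join(folder, f"hashtag_{tag}.csv"), "w", newline="") as f:
                  writer = csv.writer(f)
                  for status in statuses:
                      writer.writerow([status.id, status.created_at, status.full_text])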

  • Data Pre-processing

    Command:

      python3 preprocess.py
    

    Output:

    This step takes around 2 days. The result is a CSV file containing around 1,090,000 tweets with the following columns:

    1. Tweet :- The tweet text.
    2. TweetCreatedAt :- Date of creation of the tweet.
    3. TopKHashAndMentionTweets :- The concatenation of the content of the five most relevant hashtag and mention tweets related to the tweet in the first column.
    4. Label :- Either fake or real, depending on whether the tweet in the first column is fake or real.

    Purpose:

    The purpose of this step is to find, for each tweet, the five most relevant tweets among its mined related tweets. It also buffers the training data as a CSV file, which is used in the next step for training, testing, and prediction.
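
    The ranking method is not spelled out here, so the following is only an illustrative sketch using TF-IDF cosine similarity; the top_k_related helper is hypothetical.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      def top_k_related(source_tweet, related_tweets, k=5):
          """Return the k related tweets most similar to the source tweet (TF-IDF cosine)."""
          corpus = [source_tweet] + related_tweets
          tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
          scores = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
          ranked = sorted(zip(scores, related_tweets), reverse=True)
          return [tweet for _, tweet in ranked[:k]]

      # The selected tweets are concatenated to form TopKHashAndMentionTweets.
      top5 = top_k_related("text of the source tweet", ["related tweet one", "related tweet two"])
      top_k_column_value = " ".join(top5)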


  • Training and Prediction

    Command:

      python3 main.py
    

    Output:

    This command takes a huge amount of time (on average, 5 minutes per tweet). It produces a file named training_data.csv, which buffers the five most important sentences of the retrieved Wikipedia and New York Times articles; these sentences make up the evidence set used to make predictions.

    This command broadly involves the following sub-steps:

    1. Retrieval of the evidence set and creation of training_data.csv. This step also prints the accuracy of the pre-trained model after each iteration.
    2. Training of the model using training_data.csv.
    3. Testing of the trained model, done in parallel with training after every five epochs; the accuracy is reported.
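
    The model and training loop are implemented in main.py; purely as an illustration of working with training_data.csv, a simplified baseline might look like the sketch below. The column names "Tweet", "Evidence", and "Label" and the logistic-regression classifier are assumptions, not the actual schema or model.

      import pandas as pd
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline

      # Hypothetical column names; adjust to the actual schema of training_data.csv.
      data = pd.read_csv("training_data.csv")
      texts = data["Tweet"] + " " + data["Evidence"]   # claim text plus its evidence sentences
      labels = data["Label"]                           # "fake" or "real"

      X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

      model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression(max_iter=1000))
      model.fit(X_train, y_train)
      print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))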
