
twitter-fake-claim-detection

Introduction

This is a submission for the term project of the course CS 529 - Topics and Tools in Social Media Data Mining at the Indian Institute of Technology, Guwahati. It provides a solution to the problem of fake news detection on tweets. For this, we use the following information present in a tweet:

  1. Text content of the tweet
  2. Related tweets on the basis of hashtags
  3. Related tweets on the basis of mentions


Running the Project

  • Dependencies

    Command:

      pip3 install -r requirements.txt
    

    Purpose:

    This installs all the dependencies of the project. Use Python 3 to run the project.


  • Dataset

    Link to the dataset: https://drive.google.com/drive/folders/1gSx4S9i6Haul4TQRkoNQtj3sRHVwGFQ3

    Instructions:

    Download the dataset and place it at the same level as the source code. First extract the downloaded zip files, and then name the extracted folders "politifact_fake" and "politifact_real".

    Description of Dataset:

    It consists of two types of tweets: fake and real, placed in two different folders. Each folder (fake/real) contains many sub-folders, and each sub-folder contains several files. The file of interest is tweets.json, which stores the tweets as a Python list whose elements are JSON objects with the following keys:

    1. text :- The content of the tweet.
    2. user_id :- Twitter account id.
    3. created_at :- Creation time of the tweet (in milliseconds).
    4. tweet_id :- Tweet id.
    5. user_name :- Username of the account that created the tweet.
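
    For reference, the sketch below shows one way to load these files; it is only illustrative and assumes the folder naming described above.

      import json
      from pathlib import Path

      # Iterate over every sub-folder of the fake split ("politifact_fake" as named above).
      for tweets_file in Path("politifact_fake").glob("*/tweets.json"):
          with open(tweets_file) as f:
              tweets = json.load(f)  # a list of dicts with the keys listed above
          for tweet in tweets:
              print(tweet["tweet_id"], tweet["user_name"], tweet["text"][:80])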

  • Data Mining

    Command:

      python3 tweet_mining.py
    

    NOTE:

    I. This step has already been executed, and you can skip it unless you wish to use a different dataset than the one given in the previous section.

    II. Before executing the command, ensure that the dataset root folder is present at the source file level.

    III. Since the dataset is huge, it has been submitted directly and is not available in cloud storage as of now.


    Output:

    Two folders named politifact_fake_related_tweets and politifact_real_related_tweets are created. They contain sub-folders corresponding to each tweet, which store the related tweets as described below.

    Purpose:

    The purpose of this step is to extract and store all the related tweets (found via hashtags and mentions) of a particular tweet. For this, we iterate over all the files of the dataset and over all the tweets within each file (tweets.json), and use the Tweepy API to perform the following (a sketch follows the list):

    1. For each mention in the tweet, extract 20 tweets created since the tweet's creation date by the mentioned Twitter handle, and store them in a CSV file named mention_<mention_text>.csv under the folder named tweet_<tweet_id>.
    2. For each hashtag in the tweet, extract 20 tweets created since the tweet's creation date that contain the same hashtag, and store them in a CSV file named hashtag_<hashtag_text>.csv under the folder named tweet_<tweet_id>.
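
    A minimal sketch of this mining loop for a single tweet is shown below. It assumes Tweepy v4 naming (OAuth1UserHandler, user_timeline, search_tweets); the credentials, the save_related_tweets helper, and the omission of date filtering are illustrative assumptions rather than the exact code in tweet_mining.py.

      import csv
      import os
      import tweepy

      # Hypothetical credentials; replace with your own Twitter API keys.
      auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
      api = tweepy.API(auth, wait_on_rate_limit=True)

      def save_related_tweets(tweet_id, mentions, hashtags):
          """Fetch up to 20 related tweets per mention/hashtag and buffer them as CSV files."""
          folder = f"tweet_{tweet_id}"
          os.makedirs(folder, exist_ok=True)
          # Tweets from each mentioned handle's timeline.
          for handle in mentions:
              statuses = api.user_timeline(screen_name=handle, count=20, tweet_mode="extended")
              with open(os.path.join(folder, f"mention_{handle}.csv"), "w", newline="") as f:
                  writer = csv.writer(f)
                  for status in statuses:
                      writer.writerow([status.id, status.created_at, status.full_text])
          # Tweets that share a hashtag with the source tweet.
          for tag in hashtags:
              statuses = api.search_tweets(q=f"#{tag}", count=20, tweet_mode="extended")
              with open(os.path.join(folder, f"hashtag_{tag}.csv"), "w", newline="") as f:
                  writer = csv.writer(f)
                  for status in statuses:
                      writer.writerow([status.id, status.created_at, status.full_text])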

  • Data Pre-processing

    Command:

      python3 preprocess.py
    

    Output:

    This step takes around 2 days. The result is a CSV file containing around 1,090,000 tweets with the following columns:

    1. Tweet :- The tweet text.
    2. TweetCreatedAt :- Date of creation of the tweet.
    3. TopKHashAndMentionTweets :- The concatenation of the content of the five most relevant hashtag and mention tweets related to the tweet in the first column.
    4. Label :- Either fake or real, depending on whether the tweet in the first column is fake or real.

    Purpose:

    The purpose of this step is to find, for each tweet, the five most relevant tweets among its mined related tweets. It also buffers the training data as a CSV file, which is used in the next step for training, testing, and prediction.
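
    The ranking method is not spelled out here, so the following is only an illustrative sketch using TF-IDF cosine similarity; the top_k_related helper is hypothetical.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      def top_k_related(source_tweet, related_tweets, k=5):
          """Return the k related tweets most similar to the source tweet (TF-IDF cosine)."""
          corpus = [source_tweet] + related_tweets
          tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
          scores = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
          ranked = sorted(zip(scores, related_tweets), reverse=True)
          return [tweet for _, tweet in ranked[:k]]

      # The selected tweets are concatenated to form TopKHashAndMentionTweets.
      top5 = top_k_related("text of the source tweet", ["related tweet one", "related tweet two"])
      top_k_column_value = " ".join(top5)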


  • Training and Prediction

    Command:

      python3 main.py
    

    Output:

    This command takes a huge amount of time (on average, 5 minutes per tweet). It produces a file named training_data.csv, which buffers the five most important sentences of the retrieved Wikipedia and New York Times articles; these sentences make up the evidence set used to make predictions.

    This command broadly involves the following sub-steps:

    1. Retrieval of the evidence set and creation of training_data.csv. This step also prints the accuracy of the pre-trained model after each iteration.
    2. Training of the model using training_data.csv.
    3. Testing of the trained model, done in parallel with training after every five epochs; the accuracy is reported.
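
    The model and training loop are implemented in main.py; purely as an illustration of working with training_data.csv, a simplified baseline might look like the sketch below. The column names "Tweet", "Evidence", and "Label" and the logistic-regression classifier are assumptions, not the actual schema or model.

      import pandas as pd
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline

      # Hypothetical column names; adjust to the actual schema of training_data.csv.
      data = pd.read_csv("training_data.csv")
      texts = data["Tweet"] + " " + data["Evidence"]   # claim text plus its evidence sentences
      labels = data["Label"]                           # "fake" or "real"

      X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

      model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression(max_iter=1000))
      model.fit(X_train, y_train)
      print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))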
