This project is forked from shuheng-liu/fake-news-detection-pipeline


Pipeline for detecting fake news, covering data ingestion, doc embedding, classifier hypertuning & model ensembling. Quick walkthrough available in README. Execution logs on my FloydHub page.

Home Page: https://www.floydhub.com/wish1104/projects/fake-news/jobs

License: Apache License 2.0

Python 100.00%

fake-news-detection-pipeline's Introduction

migrated from this repo of mine

Fake News Detection Pipeline

Collaborators: Shuheng Liu, Qiaoyi Yin, Yuyuan Fang

Group project materials for fake news detection at Hollis Lab, GEC Academy

Project Plan



Notice for Collaborators

Doing Train-Test Split

Specifying random_state in sklearn.model_selection.train_test_split() ensures the same split on different datasets (of the same length) and on different machines. (See this link)

For the purposes of this project, we will use random_state=58 for each split.

When grid/random searching for the best set of hyperparameters, a 75%-25% train-test split is used, with 5-fold cross-validation applied to the 75% training samples.
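
A minimal sketch of how a fixed random_state keeps splits reproducible (a toy dataset is used here only for illustration):

from sklearn.model_selection import train_test_split

X = list(range(100))             # any dataset of a fixed length
y = [i % 2 for i in range(100)]

# Two separate calls with random_state=58 assign the same rows to train/test,
# so collaborators on different machines obtain identical splits.
split_a = train_test_split(X, y, test_size=0.25, random_state=58)
split_b = train_test_split(X, y, test_size=0.25, random_state=58)
assert split_a[0] == split_b[0]  # identical training rows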

Directory to Push Models

There is a model/ directory nested under the project. Please name your model file model_name.py and place it under the model/ directory (e.g. model/KNN.py) before pushing to this repo.
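
As an illustration of the naming convention only, a hypothetical model/KNN.py might look like the sketch below (the actual files in model/ define the tuned estimators):

# model/KNN.py -- hypothetical example of a model file following the convention
from sklearn.neighbors import KNeighborsClassifier

# Expose a configured estimator so it can be imported elsewhere,
# e.g. `from model.KNN import knn`
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")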

Downloadables

Before trying to reproduce our results, note that pre-computed embeddings can be downloaded from the URLs below. Consider downloading them and storing them in the pretrained/ folder under this repository, which will save a lot of time.

URL for Different Embeddings Precomputed on Cloud
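
As a sketch of the intended workflow, the snippet below downloads one file into pretrained/; the URL and filename are placeholders, so substitute a real pair from the precomputed-embedding URLs referenced above:

import os
import urllib.request

# Placeholder values -- replace with a real URL/filename for a precomputed embedding.
EMBEDDING_URL = "https://example.com/precomputed_embedding.npy"
TARGET = os.path.join("pretrained", "precomputed_embedding.npy")

os.makedirs("pretrained", exist_ok=True)
if not os.path.exists(TARGET):
    urllib.request.urlretrieve(EMBEDDING_URL, TARGET)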

Hypertuning Logs, Code, and Stats

The logs, code, and stats from hypertuning all of the simple models (that is, excluding the ensemble model) can be found here.

Quick Walkthrough (Presentation)

Below is the final presentation, originally implemented as a Jupyter notebook. To see the original presentation file, run the following command in your terminal:

git log -- "UCB Final Project.ipynb"

or,

git checkout f7e1c41

Alternatively, visit this link, which takes you back in history.

Infrastructure for Embeddings

The classes DocumentSequence and DocumentEmbedder below can be found in the doc_utils/ sub-package. Different ways of computing embeddings (doc2vec, naive doc2vec, one-hot) and their hyperparameter choices are encapsulated in these files. Below is a snapshot of these classes and their methods.

class DocumentSequence:
    def __init__(self, raw_docs, clean=False, sw=None, punct=None): ...
    # setters (only to be called internally)
    def _set_tokenized(self, clean=False, sw=None, punct=None): ...
    def _set_tagged(self): ...
    def _set_dictionary(self): ...
    def _set_bow(self): ...
    # getters (exposed)
    def get_dictionary(self): ...  
    dictionary = property(get_dictionary)  # property field of get_dictionary()
    def get_tokenized(self): ...  
    tokenized = property(get_tokenized)  # property field of get_tokenized()
    def get_tagged(self): ...  
    tagged = property(get_tagged)  # property field of get_tagged()
    def get_bow(self): ...  
    bow = property(get_bow)  # property field of get_bow()
class DocumentEmbedder:
    def __init__(self, docs: DocumentSequence, pretrained_word2vec=None): ...
    # setters (only to be called internally)    
    def _set_word2vec(self): ...
    def _set_doc2vec(self, vector_size=300, window=5, min_count=5, dm=1, epochs=20): ...
    def _set_naive_doc2vec(self, normalizer='l2'): ...
    def _set_tfidf(self): ...
    def _set_onehot(self, scorer='tfidf'): ...
    # getters (exposed)
    def get_onehot(self, scorer='tfidf'): ...  
    onehot = property(get_onehot)  # property field of get_onehot()
    def get_doc2vec(self, vectors_size=300, window=5, min_count=5, dm=1, epochs=20): ... 
    doc2vec = property(get_doc2vec)  # property field of get_doc2vec()
    def get_naive_doc2vec(self, normalizer='l2'): ...  
    naive_doc2vec = property(get_naive_doc2vec)  # property field of get_naive_doc2vec()
    def get_tfidf_score(self): ...  
    tfidf = property(get_tfidf_score)  # property field of get_tfidf_score()
import pandas as pd
from string import punctuation
from nltk.corpus import stopwords

df = pd.read_csv("./fake_or_real_news.csv")

# obtain the raw news texts and titles
raw_text = df['text'].values
raw_title = df['title'].values
df['label'] = df['label'].apply(lambda label: 1 if label == "FAKE" else 0)

# build two instances for preprocessing raw data
from doc_utils import DocumentSequence
texts = DocumentSequence(raw_text, clean=True, sw=stopwords.words('english'), punct=punctuation)
titles = DocumentSequence(raw_title, clean=True, sw=stopwords.words('english'), punct=punctuation)

df.head()
Unnamed: 0 title text label title_vectors
0 8476 You Can Smell Hillary’s Fear Daniel Greenfield, a Shillman Journalism Fello... 1 [ 1.1533764e-02 4.2144405e-03 1.9692603e-02 ...
1 10294 Watch The Exact Moment Paul Ryan Committed Pol... Google Pinterest Digg Linkedin Reddit Stumbleu... 1 [ 0.11267698 0.02518966 -0.00212591 0.021095...
2 3608 Kerry to go to Paris in gesture of sympathy U.S. Secretary of State John F. Kerry said Mon... 0 [ 0.04253004 0.04300297 0.01848392 0.048672...
3 10142 Bernie supporters on Twitter erupt in anger ag... — Kaydee King (@KaydeeKing) November 9, 2016 T... 1 [ 0.10801624 0.11583211 0.02874823 0.061732...
4 875 The Battle of New York: Why This Primary Matters It's primary day in New York and front-runners... 0 [ 1.69016439e-02 7.13498285e-03 -7.81233795e-...

Embedding Computation

URLs

import numpy as np  # needed below for concatenating title and text vectors
from doc_utils import DocumentEmbedder

try:
    from embedding_utils import EmbeddingLoader

    loader = EmbeddingLoader("pretrained/")
    news_embeddings = loader.get_d2v("concat", vec_size=300, win_size=23, min_count=5, dm=0, epochs=500)
    labels = loader.get_label()

except FileNotFoundError as e:
    print(e)
    print("Cannot find existing embeddings, computing new ones now")

    pretrained = "./pretrained/GoogleNews-vectors-negative300.bin"
    text_embedder = DocumentEmbedder(texts, pretrained_word2vec=pretrained)
    title_embedder = DocumentEmbedder(titles, pretrained_word2vec=pretrained)

    text_embeddings = text_embedder.get_doc2vec(vectors_size=300, window=23, min_count=5, dm=0, epochs=500)
    title_embeddings = title_embedder.get_doc2vec(vectors_size=300, window=23, min_count=5, dm=0, epochs=500)
    
    # concatenate title vectors and text vectors
    news_embeddings = np.concatenate((title_embeddings, text_embeddings), axis=1)
    labels = df['label'].values

Embedding Visualization

from embedding_utils import visualize_embeddings

# visualize the news embeddings in the graph
# MUST run "tensorboard --logdir visual/" on the command line and visit localhost:6006 to see the visualization
visualize_embeddings(embedding_values=news_embeddings, label_values=labels, texts=raw_title)
print("visit http://localhost:6006 to see the result")
# ATTENTION: This cell must be manually stopped
visit http://localhost:6006 to see the result

Some screenshots of the TensorBoard are shown below. We visualize the document embeddings with T-SNE projections onto 2D and 3D spaces. Each red data point indicates a piece of FAKE news, and each blue one indicates a piece of REAL news. As the visualization shows, the two categories are well separated.

2D T-SNE

red for fake ones, blue for real ones


3D T-SNE

red for fake ones, blue for real ones

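
TensorBoard computes the projection interactively; as a minimal offline sketch of the same idea (assuming news_embeddings and labels from above, scikit-learn's TSNE, and matplotlib), a static 2D plot could be produced like this:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the concatenated document embeddings down to 2 dimensions.
projected = TSNE(n_components=2, random_state=58).fit_transform(news_embeddings)

# Red for fake news (label 1), blue for real news (label 0).
colors = ['red' if label == 1 else 'blue' for label in labels]
plt.scatter(projected[:, 0], projected[:, 1], c=colors, s=3)
plt.title("2D T-SNE of news embeddings")
plt.show()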

Visualizing Bigram Statistics

import itertools
import nltk
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

## Get tokenized words of fake news and real news independently
real_text = df[df['label'] == 0]['text'].values
fake_text = df[df['label'] == 1]['text'].values
sw = [word for word in stopwords.words("english")] + ["``", "“"]
other_puncts = u'.,;《》?!“”‘’@#¥%…&×()——+【】{};;●,。&~、|\s::````'
punct = punctuation + other_puncts
fake_words = DocumentSequence(fake_text, clean=True, sw=sw, punct=punct)
real_words = DocumentSequence(real_text, clean=True, sw=sw, punct=punct)

## Get cleaned text using chain
real_words_all = list(itertools.chain(*real_words.get_tokenized()))
fake_words_all = list(itertools.chain(*fake_words.get_tokenized()))

## Draw a horizontal histogram of the most common bigrams
def plot_most_common_words(num_to_show, words_list, title=""):
    # count bigram frequencies over the tokenized words
    bigrams = nltk.bigrams(words_list)
    counter = Counter(bigrams)
    most_common = counter.most_common(num_to_show)
    labels = [" ".join(bigram) for bigram, count in most_common]
    values = [count for bigram, count in most_common]

    indexes = np.arange(len(labels))
    width = 1
    
    plt.title(title)
    plt.barh(indexes, values, width)
    plt.yticks(indexes + width * 0.2, labels)
    plt.show()
plot_most_common_words(20, fake_words_all, "Fake News: Most Frequent Bigrams")
plot_most_common_words(20, real_words_all, "Real News: Most Frequent Bigrams")

(Bigram frequency histograms for fake and real news omitted)

Binary Classification

Train-Val-Test Split

(with 75% of data for 5-fold Random CV, 25% for testing)

from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.model_selection._search import BaseSearchCV
import pickle as pkl

seed = 58

# perform the split which gets us the train data and the test data
news_train, news_test, labels_train, labels_test = train_test_split(news_embeddings, labels,
                                                                    test_size=0.25,
                                                                    random_state=seed,
                                                                    stratify=labels)

Hypertuned Classifiers

We used random search on different embeddings to find the best hyperparameters.
The following imports each classifier with the near-optimal parameters found in our experiments.
The random-search process itself is omitted.

from model.hypyertuned_models import mlp, knn, qda, gdb, svc, gnb, rf, lg
from model.hypyertuned_models import classifiers as classifiers_list

We list the best-performing hyperparameters in the following chart.
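
As a quick alternative to the chart, the tuned hyperparameters can also be printed directly from the imported estimators (a minimal sketch using scikit-learn's get_params()):

# Print the tuned hyperparameters of each classifier imported above.
for clf in classifiers_list:
    print(clf.__class__.__name__)
    for name, value in sorted(clf.get_params().items()):
        print("    {} = {}".format(name, value))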

from sklearn.metrics import classification_report

# print details of testing results
for model in classifiers_list:
    model.fit(news_train, labels_train)
    labels_pred = model.predict(news_test)
    
    # Report the metrics
    target_names = ['Real', 'Fake']
    print(model.__class__.__name__)
    print(classification_report(y_true=labels_test, y_pred=labels_pred, target_names=target_names, digits=3))
MLPClassifier
             precision    recall  f1-score   support

       Real      0.956     0.950     0.953       793
       Fake      0.950     0.956     0.953       791

avg / total      0.953     0.953     0.953      1584

KNeighborsClassifier
             precision    recall  f1-score   support

       Real      0.849     0.905     0.876       793
       Fake      0.898     0.838     0.867       791

avg / total      0.874     0.872     0.872      1584

QuadraticDiscriminantAnalysis
             precision    recall  f1-score   support

       Real      0.784     0.995     0.877       793
       Fake      0.993     0.726     0.839       791

avg / total      0.889     0.860     0.858      1584

GradientBoostingClassifier
             precision    recall  f1-score   support

       Real      0.921     0.868     0.894       793
       Fake      0.875     0.925     0.899       791

avg / total      0.898     0.896     0.896      1584

SVC
             precision    recall  f1-score   support

       Real      0.944     0.939     0.942       793
       Fake      0.940     0.944     0.942       791

avg / total      0.942     0.942     0.942      1584

GaussianNB
             precision    recall  f1-score   support

       Real      0.848     0.793     0.820       793
       Fake      0.805     0.857     0.830       791

avg / total      0.826     0.825     0.825      1584

RandomForestClassifier
             precision    recall  f1-score   support

       Real      0.868     0.805     0.835       793
       Fake      0.817     0.877     0.846       791

avg / total      0.843     0.841     0.841      1584

LogisticRegression
             precision    recall  f1-score   support

       Real      0.921     0.929     0.925       793
       Fake      0.929     0.920     0.924       791

avg / total      0.925     0.925     0.925      1584

Histogram of CV/Test Scores

(Bar chart of cross-validation and test scores omitted)
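
A minimal sketch (assuming the fitted classifiers_list and the held-out split from above) of how the test-score portion of such a chart could be reproduced with matplotlib; the cross-validation scores come from the omitted random-search runs:

import matplotlib.pyplot as plt

# Test accuracy of each (already fitted) classifier on the held-out 25%.
names = [clf.__class__.__name__ for clf in classifiers_list]
scores = [clf.score(news_test, labels_test) for clf in classifiers_list]

plt.barh(range(len(names)), scores)
plt.yticks(range(len(names)), names)
plt.xlabel("test accuracy")
plt.title("Test scores of hypertuned classifiers")
plt.show()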

TF-IDF

Getting sparse matrix

from gensim import corpora, models
from scipy.sparse import csr_matrix

def bow2sparse(tfidf, corpus):
    # Convert a gensim TF-IDF-weighted bag-of-words corpus into a scipy CSR matrix
    rows = [index for index, line in enumerate(corpus) for _ in tfidf[line]]  # document indices
    cols = [elem[0] for line in corpus for elem in tfidf[line]]               # token ids
    data = [elem[1] for line in corpus for elem in tfidf[line]]               # tf-idf weights
    return csr_matrix((data, (rows, cols)))

tfidf = models.TfidfModel(texts.get_bow())
tfidf_matrix = bow2sparse(tfidf, texts.get_bow())

## split the data
news_train, news_test, labels_train, labels_test = train_test_split(tfidf_matrix, 
                                                                    labels,
                                                                    test_size=0.25,
                                                                    random_state=seed)
dictionary is not set for <tools.DocumentSequence object at 0x11766bac8>, setting dictionary automatically
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Logistic Regression
lg = LogisticRegression(C=104.31438384172546, penalty='l2')

# Naive Bayes
nb = MultinomialNB(alpha=0.01977091215797838)

classifiers_list = [lg, nb]

from sklearn.metrics import classification_report

# print details of testing results
for model in classifiers_list:
    model.fit(news_train, labels_train)
    labels_pred = model.predict(news_test)
    
    # Report the metrics
    target_names = ['Real', 'Fake']
    print(str(model))
    print(classification_report(y_true=labels_test, y_pred=labels_pred, target_names=target_names, digits=3))
LogisticRegression(C=104.31438384172546, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
             precision    recall  f1-score   support

       Real      0.964     0.913     0.938       820
       Fake      0.912     0.963     0.937       764

avg / total      0.939     0.938     0.938      1584

MultinomialNB(alpha=0.01977091215797838, class_prior=None, fit_prior=True)
             precision    recall  f1-score   support

       Real      0.899     0.930     0.914       820
       Fake      0.922     0.887     0.905       764

avg / total      0.910     0.910     0.910      1584

Feature Ranking with Logistic Coefficients

# LogisticRegression
lg = LogisticRegression(C=104.31438384172546, penalty='l2')

# Fit on the whole data set
lg.fit(tfidf_matrix, labels)

# map each coefficient to its word and sort by coefficient
abs_features = []
num_features = tfidf_matrix.shape[1]  # number of vocabulary terms (columns of the tf-idf matrix)
for i in range(num_features):
    coef = lg.coef_[0, i]
    abs_features.append((coef, texts.get_dictionary()[i]))

sorted_result = sorted(abs_features, reverse=True)
fake_importance = [x for x in sorted_result if x[0] > 3]   # strongly positive coefficients -> FAKE
real_importance = [x for x in sorted_result if x[0] < -4]  # strongly negative coefficients -> REAL
from wordcloud import WordCloud, STOPWORDS

def print_wordcloud(df, title=''):
    wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', width=1200, height=1000).generate(
        " ".join(df['word'].values))
    plt.imshow(wordcloud)
    plt.title(title)
    plt.axis('off')
    plt.show()

Words with an inclination to predict 'FAKE' news

df2 = pd.DataFrame(fake_importance, columns=['importance', 'word'])
df2.head(30)
importance word
0 13.781102 0
1 13.562957 2016
2 13.490582 october
3 13.062496 hillary
4 11.192181
5 9.829864 article
6 9.411360 election
7 8.903777 november
8 8.181044 share
9 7.564924 print
10 7.507189 source
11 7.418819 via
12 7.150410 fbi
13 6.939386 establishment
14 6.752492 us
15 6.549759 please
16 6.421927 28
17 6.111584 wikileaks
18 5.914297 russia
19 5.777677 4
20 5.701762
21 5.701082 email
22 5.633363 war
23 5.461951 corporate
24 5.432547 26
25 5.248264 photo
26 5.205658 1
27 5.178585 healthcare
28 5.066447 google
29 5.055815 free
print_wordcloud(df2,'FAKE NEWS')


Words with an inclination to predict 'REAL' news

df3 = pd.DataFrame(real_importance, columns=['importance', 'word'])
df3.tail(30)
importance word
48 -5.819761 march
49 -5.820939 state
50 -5.911077 attacks
51 -5.911102 deal
52 -5.918800 monday
53 -5.937717 saturday
54 -6.068661 president
55 -6.108548 conservatives
56 -6.197634 sanders
57 -6.316225 continue
58 -6.577535 ``
59 -6.595120 polarization
60 -6.629481 fox
61 -6.644741 gop
62 -6.681231 ohio
63 -6.899471 convention
64 -7.051062 jobs
65 -7.260832 debate
66 -7.274652 friday
67 -7.580725 tuesday
68 -7.847131 cruz
69 -8.058610 candidates
70 -8.348688 conservative
71 -8.440797 says
72 -8.828907 islamic
73 -10.438137
74 -10.851531 --
75 -14.864650 ''
76 -14.912260 said
77 -16.351588 's
print_wordcloud(df3,'REAL NEWS')


Ensemble Learning

In addition, we used an ensemble vote classifier over the hypertuned models to try to obtain a better prediction through ensemble learning.
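
The internals of EnsembleVoter are not shown in this walkthrough; conceptually, hard majority voting over classifiers trained on different feature sets looks like the sketch below (an illustration, not the project's implementation):

import numpy as np

def majority_vote(classifiers, Xs_test):
    # Each classifier predicts on its own feature matrix; result has shape (n_classifiers, n_samples).
    all_preds = np.array([clf.predict(X) for clf, X in zip(classifiers, Xs_test)])
    # For binary 0/1 labels, the majority label is 1 when more than half of the voters say 1
    # (ties break toward 0 in this sketch).
    return (all_preds.mean(axis=0) > 0.5).astype(int)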


from model.ensemble_learning import EnsembleVoter

d2v_500 = loader.get_d2v(corpus="concat", win_size=23, epochs=500)
d2v_100 = loader.get_d2v(corpus="concat", win_size=13, epochs=100)
onehot = loader.get_onehot(corpus="concat", scorer="tfidf")
labels = loader.get_label()

d2v_500_train, d2v_500_test, d2v_100_train, d2v_100_test, onehot_train, onehot_test, labels_train, labels_test = \
    train_test_split(d2v_500, d2v_100, onehot, labels, test_size=0.25, stratify=labels, random_state=seed)

classifiers = [mlp, svc, qda, lg]
Xs_train = [d2v_500_train, d2v_100_train, d2v_100_train, onehot_train]
Xs_test = [d2v_500_test, d2v_100_test, d2v_100_test, onehot_test]

ens_voter = EnsembleVoter(classifiers, Xs_train, Xs_test, labels_train, labels_test)
ens_voter.fit()
print("Test score of EnsembleVoter: ", ens_voter.score())
Test score of MLPClassifier: 0.9526515151515151
Test score of SVC: 0.9425505050505051
Test score of QuadraticDiscriminantAnalysis: 0.9463383838383839
Test score of LogisticRegression: 0.9513888888888888
Fittng aborted because all voters are fitted and not using refit=True
Test score of EnsembleVoter:  0.963901203293
