
gay_marriage_corpus_study's Introduction

Think of the Consequences: A Decade of Discourse About Same-sex Marriage

Collaborators:

Babak Hemmatian

Brown University, Department of Cognitive, Linguistic and Psychological Sciences

Sabina J. Sloman

Carnegie Mellon University, Department of Social and Decision Sciences

Steven A. Sloman, Uriel Cohen Priva

Brown University, Department of Cognitive, Linguistic and Psychological Sciences

Citation

Hemmatian, B., Sloman, S.J., Cohen Priva, U., & Sloman, S.A. (2019). Think of the consequences: A decade of discourse about same-sex marriage. Behavior Research Methods. https://doi.org/10.3758/s13428-019-01215-3.

Outline

This repository holds Python and R code that can be used to extract information about trends in Reddit comments related to same-sex marriage since 2006. Trivial changes to the filter at the end of lda_defaults.py make it possible to extract and analyze comments related to any other topic.

This repository was developed in the process of preparing an academic manuscript published in the peer-reviewed journal Behavior Research Methods (see the citation above). Results reported in the manuscript and included in this repository are based on Reddit data from January 2006 to September 2017.

More details about the contents of each directory in this repository can be found in readme.txt files included in the relevant folder.

Dataset

The code is written to work with a pre-existing corpus composed of all posts on Reddit from 2006 until the present. This compressed corpus ignores hierarchical associations between comments. Sample files from the dataset are included in the repository in the Sample_Original_Reddit_Data folder. The repository also includes functions for downloading and managing the corpus data, which can be used to retrieve certain high-level statistics, pre-process the data, or draw equally sized random subsamples of comments from different years for further analysis. The outputs of these functions are written to file and can be readily imported in later runs.
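
As a minimal sketch of the retrieval-and-filtering step, assuming a monthly dump with one JSON comment per line (the file name, field names, and regular expression below are illustrative; the actual filter lives at the end of lda_defaults.py):

import bz2
import json
import re

# Illustrative pattern; the project's actual filter is defined in lda_defaults.py.
PATTERN = re.compile(r"(same.sex|gay)\s+marriage", re.IGNORECASE)

def relevant_comments(path):
    """Stream comments from a compressed monthly dump and yield those whose
    body matches the filter."""
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            comment = json.loads(line)
            body = comment.get("body", "")
            if body not in ("[deleted]", "[removed]") and PATTERN.search(body):
                yield comment

# Example: count the relevant comments in one (hypothetical) monthly file.
# print(sum(1 for _ in relevant_comments("RC_2007-10.bz2")))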

The original text of the posts in our reported corpus, as well as pre-processed versions of them, can be found in the Corpus folder.

Analysis Tools and Results

Topic Modeling

You can use Latent Dirichlet Allocation (LDA) to create and examine topic models of the corpus via Reddit_LDA_Analysis.py. Default hyperparameters for the model to be trained can be set using lda_defaults.py, and can be overridden by assigning the desired value in lda_config.py. Functions for determining top topics via contribution to contextual word-topic assignments, sampling top comments associated with each topic that pass certain criteria, and extracting temporal trends are included. Unless the default path variable is changed, this file should be in the same directory as Utils.py.
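
A rough sketch of what the training step looks like with gensim (the hyperparameter values below are illustrative; the real defaults come from lda_defaults.py and can be overridden in lda_config.py):

from gensim import corpora, models

# Pre-processed comments: one list of tokens per document (toy example).
texts = [["marriage", "equality", "state", "law"],
         ["church", "belief", "freedom", "religion"]]

dictionary = corpora.Dictionary(texts)               # token <-> id mapping
bow_corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors

# Illustrative hyperparameters, not the manuscript's settings.
lda = models.LdaModel(bow_corpus, id2word=dictionary,
                      num_topics=50, alpha="auto", passes=5)

for topic_id, words in lda.show_topics(num_topics=5, num_words=10, formatted=False):
    print(topic_id, [word for word, _ in words])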

Models developed for the manuscript using this module can be found in the Learned_Models folder.

A sample of comments most representative of each topic in the LDA model reported in the manuscript can be found in the Most_Repr_Comments folder. The following IDs can be used to identify the reported topics:

  • 4 employer attitude and regulations
  • 12 religious arguments
  • 14 cultural and historical status
  • 16 forcing vs. allowing behaviors
  • 22 politicians' stance
  • 27 children of same-sex parents
  • 28 same-sex marriage as a policy issue
  • 33 personal anecdotes
  • 48 freedom of belief
  • 49 LGBT rights

The set of top words associated with models with various numbers of topics (discussed in the manuscript) can be found in Top_Words.

Python files with "impactful" in the filename were used to find the most popular and unpopular posts in the corpus (based on upvotes) and gather ratings for their association with two specific classes of arguments (see the manuscript for more details). The set of sampled comments that were used in our study and their associated ratings can be found in the Impactful_Comments folder.

Word-based model

You can use nullmodel.py to train a keyword-based binary classifier on a set of rated comments (in the case of our manuscript, ratings of the most impactful posts in the corpus). The model uses the KL divergence between word-document co-occurrence distributions to find the keywords that best represent one of the two classes.
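
A minimal sketch of that idea (the function and variable names below are hypothetical, not the actual interface of nullmodel.py): estimate a smoothed word distribution for each class and rank words by their per-word contribution to the KL divergence from one class's distribution to the other's.

from collections import Counter
import numpy as np

def keyword_scores(class_a_docs, class_b_docs, smoothing=1.0):
    """Rank words by their contribution to D_KL(P_a || P_b), where P_a and P_b
    are smoothed word distributions estimated from the two classes of comments."""
    counts_a = Counter(word for doc in class_a_docs for word in doc)
    counts_b = Counter(word for doc in class_b_docs for word in doc)
    vocab = sorted(set(counts_a) | set(counts_b))
    freq_a = np.array([counts_a[w] for w in vocab], dtype=float) + smoothing
    freq_b = np.array([counts_b[w] for w in vocab], dtype=float) + smoothing
    p_a, p_b = freq_a / freq_a.sum(), freq_b / freq_b.sum()
    contribution = p_a * np.log(p_a / p_b)   # per-word term of the KL divergence
    return sorted(zip(vocab, contribution), key=lambda pair: -pair[1])

# Words at the top of the list are the best keywords for the first class:
# print(keyword_scores(consequence_docs, value_docs)[:20])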

Regression based on topics and keywords

R code was developed to examine the predictive capacity and classification accuracy of LDA and the word-based model for predicting ratings of impactful comments using linear and mixed-effects regressions. The code, as well as the associated data and results reported in the manuscript, can be found in Regression_Analyses. This folder also includes code for plotting the contribution of top topics to the LDA model.
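
The analysis code itself is in R, but the shape of the models is easy to sketch; the Python analogue below uses statsmodels and hypothetical column and file names (rating, topic_4, rater, impactful_comment_ratings.csv) standing in for the actual files in Regression_Analyses:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical layout: one row per (comment, rater) pair with the human rating,
# the comment's probability under selected topics, and a rater identifier.
df = pd.read_csv("impactful_comment_ratings.csv")  # hypothetical file name

# Linear regression: predict ratings from topic probabilities.
ols_fit = smf.ols("rating ~ topic_4 + topic_12 + topic_33", data=df).fit()

# Mixed-effects regression: the same fixed effects plus random intercepts per rater.
mixed_fit = smf.mixedlm("rating ~ topic_4 + topic_12 + topic_33",
                        data=df, groups=df["rater"]).fit()

print(ols_fit.summary())
print(mixed_fit.summary())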

Recurrent Neural Networks (in progress)

You can set hyperparameters and call functions to create recurrent neural network (NN) models of the corpus using Combined_NN_Model.py. Depending on the hyperparameters, this code can be used to create a language model of the corpus (trained to predict the upcoming word in a given comment) or a comment classifier. The language model can be used as pre-training for the classifier if training data is scarce. The current default version of the code predicts the sign of each comment's upvotes (negative, neutral, positive). This file also needs to be in the same directory as Utils.py. A separate neural network for regression over human interval ratings of comments will be added to the repository.
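
A minimal sketch of the classifier configuration, assuming TensorFlow/Keras and illustrative hyperparameters (the actual architecture and settings live in Combined_NN_Model.py):

import tensorflow as tf

VOCAB_SIZE = 20000   # illustrative values; the real ones are configurable hyperparameters
EMBED_DIM = 128
HIDDEN_DIM = 256
NUM_CLASSES = 3      # sign of a comment's upvotes: negative, neutral, positive

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
    tf.keras.layers.LSTM(HIDDEN_DIM),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# x_*: integer-encoded comments padded to a fixed length; y_*: labels in {0, 1, 2}
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5)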

Facebook Comments

The collaborators plan to extend this project by including comments from major news outlets on Facebook. The Facebook folder holds the unfinished code for scraping and analyzing Facebook comments. Old_Code also includes unfinished code that is archived for the internal use of the collaborators.


gay_marriage_corpus_study's Issues

Memory issue

Hi Sabina,

Since I couldn't get gensim to run on my Windows machine, I decided to go with the lda package. It requires a document-term co-occurrence matrix as input, and I spent all day today writing code to create that matrix. While it worked on a toy body of text I created, the computer apparently runs out of memory before creating such a matrix for two files from the corpus. Do you know how I can get around this issue? I added the code to the parsereddit file. I have also added the lda code (copied from the package; it might need changes) as a comment at the end of the file.

Thanks!

Monthly topic contributions

  • Monthly topic contributions
  • Consider any topic with an average contribution of at least 2% over the years a top topic
  • Remove shared variables from the multicore function to speed up processing

Linear regression parameter estimates for top topic trends

For each top topic, run a linear regression on the time series of the topic's monthly contribution (a minimal sketch follows the list below). This will allow us to test:

  • Whether the slope estimates are significantly different from 0
  • Whether the slope estimates are significantly different from each other
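
A minimal sketch of the per-topic test with scipy (the monthly contribution series below is a toy stand-in for the real output of the trend-extraction functions); linregress reports the slope estimate, its standard error, and a p-value for the slope differing from 0, and slopes of two topics can then be compared with a z-test on the two estimates:

import numpy as np
from scipy import stats

# Toy stand-in: topic id -> contribution per month.
monthly_contribution = {
    12: [0.021, 0.024, 0.026, 0.029, 0.031, 0.034],
    28: [0.040, 0.039, 0.041, 0.038, 0.040, 0.041],
}

for topic, series in monthly_contribution.items():
    months = np.arange(len(series))
    fit = stats.linregress(months, series)
    print(f"topic {topic}: slope={fit.slope:.4f}, stderr={fit.stderr:.4f}, p={fit.pvalue:.3g}")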

When should we filter posts that mention gay marriage?

From what I understand you're working on something that:

  1. Reads JSON data from a file,
  2. Creates a Python object that stores the lines of the file that match a regex, and
  3. Then lemmatizes/analyzes the Python object.

I propose filtering posts before saving the response from the API to a file. To be more specific, we could filter the posts that get passed from json_obj (the JSON-formatted API response) to data (the JSON data to be written to a file) in lines 42-43 of get_all_posts.py.

Does that make sense? If so, I can definitely implement that. Also, did you tell me that you've already done some work writing the regex we'll use to filter posts?

Regression NN

  • Rewrite the NN code to train it for regression on values-consequences-preferences human ratings
  • The code should work with the ratings stored in a CSV file, formatted like the sampled comments resulting from the LDA analysis
  • Retain the ability to use the network for classification, if only to compare performance with regression

Bootstrapping LDA models

Create a separate module to create n bootstrapped LDA models, each using a fixed-size subset of our data. These models can then be used to generate confidence intervals around the model as a whole (using metrics like model alignment), and around trends in the prevalence of any given topic over time.

Relatedly, think about other ways of aligning two topics from different models, e.g.

  • The number of comments for which the topics are both considered the "most probable" in their respective models
  • The MSE between the probabilities the two topics are assigned across individual comments, i.e. sum([ (prob(model1, topic1, comment) - prob(model2, topic2, comment))**2 for comment in comments ])
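
A sketch of the second measure, assuming two trained gensim LdaModel objects scored on the same bag-of-words corpus (the function below is illustrative, not part of the repository):

def topic_mse(model_1, model_2, topic_1, topic_2, bow_corpus):
    """Mean squared difference between the probabilities that two topics
    (one from each model) receive across the same set of comments."""
    squared_diffs = []
    for bow in bow_corpus:
        probs_1 = dict(model_1.get_document_topics(bow, minimum_probability=0.0))
        probs_2 = dict(model_2.get_document_topics(bow, minimum_probability=0.0))
        squared_diffs.append((probs_1[topic_1] - probs_2[topic_2]) ** 2)
    return sum(squared_diffs) / len(squared_diffs)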

Break up Utils

I think it would be nice (i.e. more usable and readable) to break up Utils into a few separate modules, e.g. a Parser, an LDATrainer, an NNTrainer, a TopicContributionCalculator, etc. @BabakHemmatian, what do you think? I'd be more than happy to take this on if you think it's a good idea.

Edit: As of 1716955, this is done except for one outstanding thing that's still bothering me: top topics are still calculated directly from Reddit_LDA_Analysis. I think this should be incorporated into the LDAModel class as a class method (and the report object incorporated as an LDAModel class attribute).

Also, I'd like to collect all filenames into a set of configurable class attributes, rather than having them be hard-coded in the bodies of various class methods.

Rating predictors

Which topics predict the value vs. consequence vs. preference human judgments best? This can be done using linear models to confirm the results of LDA.

Update the parser

  • Update the list of stopwords so that words with deontic value are not removed from the original comments (see the sketch after this list)
  • Make sure removal of special characters does not hinder interpretability of preprocessed text
  • Ensure that the hashsum check does not result in infinite loops
  • Ensure that it is easy to continue parsing if the Parser is stopped after processing a specific file
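
A sketch of the stopword adjustment, assuming NLTK's English stopword list as the starting point (the parser may use a different list, and the set of deontic terms below is illustrative):

from nltk.corpus import stopwords

# Words with deontic force that should survive stopword removal.
DEONTIC_TERMS = {"should", "shouldn't", "must", "mustn't", "ought", "need"}

STOPWORDS = set(stopwords.words("english")) - DEONTIC_TERMS

def remove_stopwords(tokens):
    return [token for token in tokens if token.lower() not in STOPWORDS]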

Why are we getting comment indices from RC_Count_List during indexing for training and test sets?

The variable indices created in Define_Sets and passed to Create_New_Sets is read from RC_Count_List (which is a list of counts per month IIRC):

ipdb> indices
[0, 0, 3, 11, 4, 17, 30, 8, 12, 14, 38, 15, 14, 28, 26, 27, 23, 34, 57, 68, 99, 95, 69, 94, 76, 92, 71, 79, 178, 185, 104, 166, 177, 366, 1394, 371]

However, it seems to be treated as though it were a list of all comment indices. In Create_New_Sets it's used to determine the comment indices from which the train and eval sets are sampled:

num_comm = indices[-1]
indices = range(num_comm)
...
LDA_sets['eval'] = sample(indices,num_eval)
LDA_sets['train'] = set(indices).difference(set(LDA_sets['eval']))

How can we respond to uneven amounts of data across media sources and years?

I just put two DataFrames in the repo that should give us a sense of how our data is distributed across platforms and years. You can open them like this:

import pandas as pd
pd.read_pickle('n_posts.pandas')

n_posts.pandas shows the number of posts we have grouped by source and year, and n_comments.pandas shows the number of comments we have grouped by source and year.

As I wrote in my email, the only caveat to keep in mind is that the program crashed while collecting data for 2015 posts from Fox News--because there was so much of it. So while looking at these numbers, assume that we could also have tens of thousands of comments from Fox News in both 2015 and 2016.

One thing I noticed is that we have no data from 2008, and very little from 2009.

So I have two questions (well, two categories of questions...):

  1. I haven't really studied to what extent there's an unequal distribution if we just group by source partisanship (in other words, it's totally possible that, for example, the thousands of FoxNews comments are balanced out by thousands of nytimes comments). Do you notice anything about these numbers that we should (or shouldn't) correct for? What else should I be asking about the distribution of this data, other than how it groups by source/year?

  2. Given the potential for an uneven distribution across sources and years, do you think we should re-weight or re-sample our data, and if so, how? Some vaguely formulated proposals I have (a pandas sketch of the first appears after this list) are to:

  • Sample randomly and evenly across platforms (this would not account for the lack of data from '08).
  • Almost equivalently, re-weight observations so comments from sources (and years?) with less data are more heavily weighted.
  • If either of these approaches makes it so we don't have enough data, or are heavily weighting ridiculously small amounts of data, we could also look into retrieving data from additional news sources.
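
A sketch of the first option in pandas, assuming a DataFrame with one row per comment and 'source'/'year' columns (the pickled n_posts/n_comments files only hold the grouped counts, so this is not a drop-in script):

import pandas as pd

# Toy stand-in for a per-comment DataFrame.
comments = pd.DataFrame({
    "source": ["FoxNews", "FoxNews", "nytimes", "nytimes", "nytimes", "nytimes"],
    "year":   [2015, 2016, 2015, 2015, 2016, 2016],
    "text":   ["...", "...", "...", "...", "...", "..."],
})

# Draw the same number of comments from every source-year cell,
# capped by the size of the smallest cell.
n_per_cell = comments.groupby(["source", "year"]).size().min()
balanced = (comments.groupby(["source", "year"], group_keys=False)
                    .apply(lambda cell: cell.sample(n=n_per_cell, random_state=0)))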

Edit: Here is the infographic that inspired my initial choice of platform: http://wilkins.law.harvard.edu/projects/2017-08_mediacloud/Graphics/Fig7_11.pdf. It's taken from this paper: Faris, Robert and Roberts, Hal and Etling, Bruce and Bourassa, Nikki and Zuckerman, Ethan and Benkler, Yochai, Partisanship, Propaganda, and Disinformation: Online Media and the 2016 U.S. Presidential Election (August 2017). Berkman Klein Center Research Publication 2017-6. Available at SSRN. Quoting from the paper, "[t]he size of each media source node is in proportion to the number of other media sources that link to that source at least once" and "[t]he colors on the map reflect the partisan pattern of attention to the media sources based on the sharing behavior of Twitter users who have clear partisan allegiances".

Raw comment files contain comments from other months

Just posting this for the sake of documentation: Sometimes raw comment files contain comments from other months, e.g. the file for 2007-12 contains two (relevant) comments from 2007-11. This explains discrepancies between RC_Count_List and RC_Count_Dict.

Random subsample bug

The random subsampling function seems to have a bug: many of the numbers get written to file multiple times, so the sets of unique indices comprising the training and evaluation sets are much smaller than they should be. I made a lot of changes to different parts of the code, but I didn't touch your function, so I don't think this is my doing. I will email you the correctly parsed dataset right away so that you won't have to run the parser again and jump through all the hoops I had to over the past few days.

Diverse samples

  • Random comment samples
  • Probabilities of all top topics for each sample
  • Different categories of top topics based on the values/consequences distinction
