RDSMproj

RDSMproj (Rare Diseases Social Media Project) is a project for the National Center for Advancing Translational Sciences (NCATS) at the NIH. It mines social media (Reddit) for subreddits related to rare diseases listed in the GARD database. The project matches rare diseases to subreddits, downloads the post and comment data, and then analyzes the text to identify the topics people are discussing.

Overview

The project is split into four packages as part of rdsmproj:

  1. mapper is a Python package that maps text to one or more rare diseases using nltk and spaCy. It is also known as NormMap V2.
  2. sm_reddit is a collection of scripts that use pmaw to download Reddit post and comment text for topic modeling or other text analyses.
  3. tm_t2v is a Python package that creates topic models of text using Top2Vec.
  4. tm_lda is a (legacy) Python package that creates topic models of text, primarily using LDA as implemented by Gensim. This package was used in this paper.
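
To make the idea behind mapper concrete, here is a minimal sketch of mapping free text to disease names by normalized phrase matching. It is not the mapper API: the `DISEASE_TERMS` lexicon, `normalize`, and `map_text` are hypothetical names invented for illustration, and the real package relies on nltk, spaCy, and GARD data rather than a hand-written dictionary.

```python
import re

# Hypothetical mini-lexicon: canonical disease name -> known synonyms.
# The real mapper draws on GARD; this dictionary is for illustration only.
DISEASE_TERMS = {
    "cystic fibrosis": ["cystic fibrosis", "mucoviscidosis"],
    "huntington disease": ["huntington disease", "huntington's disease"],
}


def normalize(text):
    """Lowercase, drop apostrophes, replace other punctuation with spaces."""
    text = re.sub(r"['\u2019]", "", text.lower())
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def map_text(text, lexicon=DISEASE_TERMS):
    """Return sorted canonical names whose synonyms appear as whole phrases."""
    padded = f" {normalize(text)} "
    matches = set()
    for canonical, synonyms in lexicon.items():
        # Padding with spaces prevents matches inside longer words.
        if any(f" {normalize(s)} " in padded for s in synonyms):
            matches.add(canonical)
    return sorted(matches)
```

Under these assumptions, a subreddit description that mentions "Huntington's disease" maps to the canonical entry "huntington disease", while text with no listed term maps to an empty list.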

Installation

Ensure that you have up-to-date copies of pip, setuptools, and wheel before installing.

pip install --upgrade pip setuptools wheel

For now, each package above is installed separately. Install each one from PyPI:

pip install rdsmproj[mapper]
pip install rdsmproj[sm_reddit]
pip install rdsmproj[tm_t2v]
pip install rdsmproj[tm_lda]
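
After installing, one way to confirm that the distribution is present is to query its version from Python. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of rdsmproj.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if rdsmproj is installed, otherwise None.
print(installed_version("rdsmproj"))
```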

Quick Start

For more information, see the API guide.

Examples using sm_reddit

sm_reddit.GetPosts

from rdsmproj import sm_reddit

pmaw_args = {'limit':1000}
# Example subreddit 'MachineLearning'.
# Passes pmaw arguments to search_submissions.
sm_reddit.GetPosts(name='MachineLearning', silence=False, pmaw_args=pmaw_args)
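
For orientation, the sketch below shows roughly what forwarding keyword arguments to pmaw's `search_submissions` could look like. It is an assumption about the general pattern, not the internals of `sm_reddit.GetPosts`; `fetch_posts` and its `api` parameter are hypothetical and exist only so the function can be tried with a stand-in client.

```python
def fetch_posts(subreddit, api=None, **pmaw_args):
    """Query submissions for one subreddit and return them as a list."""
    if api is None:
        # Imported lazily so the function can be exercised with a stub client.
        from pmaw import PushshiftAPI
        api = PushshiftAPI()
    # Extra keyword arguments (for example limit=1000) pass straight through.
    return list(api.search_submissions(subreddit=subreddit, **pmaw_args))
```

Calling `fetch_posts('MachineLearning', limit=1000)` without an `api` argument needs pmaw installed and network access.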

sm_reddit.GetRedditComments

from pathlib import Path

from rdsmproj import sm_reddit, utils

pmaw_args = {'limit':1000}
# Default path to where the post data is located.
path = utils.get_data_path('posts')
# Loads the posts saved by the GetPosts example.
data = utils.load_json(Path(path,'MachineLearning_posts.json'))
# Example passes pmaw arguments to search_submission_comment_ids.
sm_reddit.GetRedditComments(data=data, silence=False, pmaw_args=pmaw_args)

Example using preprocess to process text data.

preprocess.Preprocess

from rdsmproj import preprocess as pp

# Example processes the comment data for use with tm_lda or tm_t2v.
data = pp.PreProcess(name='MachineLearning')
documents, tokenized_documents, id2word, corpus = data()
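
Judging by their names, the four return values resemble the usual inputs for a Gensim-style topic model: raw documents, token lists, an id-to-word mapping, and a bag-of-words corpus. The sketch below builds the last three with the standard library only, to show the shape of the data. It is an assumption, not the package's preprocessing pipeline; `tokenize` and `build_corpus` are hypothetical names, and a plain dict stands in for a Gensim dictionary.

```python
import re
from collections import Counter


def tokenize(document):
    """Split a document into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", document.lower())


def build_corpus(documents):
    """Return (tokenized_documents, id2word, corpus) for a list of strings."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    token2id = {token: idx for idx, token in enumerate(vocabulary)}
    id2word = {idx: token for token, idx in token2id.items()}
    # Each document becomes a sorted list of (token_id, count) pairs.
    corpus = [sorted(Counter(token2id[t] for t in doc).items())
              for doc in tokenized]
    return tokenized, id2word, corpus


docs = ["Rare disease support", "Support, support!"]
tokenized_docs, id2word, corpus = build_corpus(docs)
```

Here each corpus entry is a list of `(token_id, count)` pairs, so the second document reduces to a single pair for "support" with a count of 2.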

Example using tm_t2v to create and analyze a Top2Vec model.

tm_t2v.Top2VecModel

from rdsmproj import tm_t2v

embedding_model = 'doc2vec'
name = 'MachineLearning'
clustering_method = 'leaf'
i = 0

# Creates and saves a model.
model = tm_t2v.Top2VecModel(name,
                            f'{name}_{embedding_model}_{clustering_method}_{i}',
                            documents=documents,
                            embedding_model=embedding_model,
                            speed='fast-learn'
                            ).fit()
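
The model name above encodes the subreddit, embedding model, clustering method, and a run index, which suggests comparing several configurations with repeated runs. Assuming that is the intent, the sketch below generates such names for a small grid. `run_names` is a hypothetical helper, and the `'eom'` value is an assumed alternative to `'leaf'`, not something this README specifies.

```python
from itertools import product


def run_names(name, embedding_models, clustering_methods, n_runs):
    """Return one model name per (embedding, clustering, run index) combination."""
    return [
        f"{name}_{embedding}_{clustering}_{i}"
        for embedding, clustering, i in product(
            embedding_models, clustering_methods, range(n_runs)
        )
    ]


names = run_names("MachineLearning", ["doc2vec"], ["leaf", "eom"], n_runs=2)
```

Each generated name could then be passed as the second argument to `tm_t2v.Top2VecModel` in a loop, so every saved model stays distinguishable by its configuration.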

tm_t2v.AnalyzeTopics

# Analyzes model and records the results.
tm_t2v.AnalyzeTopics(model=model,
                     model_name=f'{name}_{embedding_model}_{clustering_method}_{i}',
                     subreddit_name=name,
                     tokenized_docs=tokenized_documents,
                     id2word=id2word,
                     corpus=corpus,
                     model_type='Top2Vec')

To Do

  • Test package install from TestPyPI.
  • Update main README.md Quick Start with examples for most packages.
  • Create sm_reddit README.md.
  • Create tm_t2v README.md.
  • Create tm_lda README.md.
  • Create API guide and documentation pages.
  • Add visualizations and flowcharts to the readme files.
  • Upload to PyPI.

Contributors

bkaras1, devonleadman, wzkariampuzha
