data4democracy / drug-spending

Project to understand pharmaceutical spending, currently focused on US government programs.

Python 0.77% R 0.88% HTML 68.09% Jupyter Notebook 30.27%

medicare data-analysis data-science civic-tech healthcare

drug-spending's Introduction

drug-spending

Work on this project has ended. Check out conclusions and lessons learned in WRAP-UP.md.

Slack: #drug-spending

Project Leads:

Current: @darya.akimova, @chaya.stern
Former: @mattgawarecki, @jenniferthompson

Maintainers (people with commit access): @dhuppenkothen, @skirmer

Project Description: At its heart, this project seeks to gain a deeper understanding of where and how Medicare tax dollars are being spent. Healthcare is an increasingly important issue for many Americans; the Centers for Medicare and Medicaid Services estimate over 41 million Americans were enrolled in Medicare prescription drug coverage programs as of October 2016.

Because healthcare spending is a very real concern, we want to make it real -- not just for people who like reading graphs and looking at statistics, but for everybody. We're harnessing the power of data and modern computing to find answers to the questions people keep asking, and to make those answers easily understandable for anyone who wants to know more; questions like:

Which conditions are we spending the most to treat?
How much are people paying out of their own pockets for prescription drugs?
What could Medicare and the American people do to save money, while also ensuring the same quality of care?

In conducting this research, we hope to gain new insights and create a positive impact for healthcare consumers and providers across the United States. For more details, head to our objectives.

Getting started

If you haven't already, read this first. Then:

Things you should know about

"First-timers" are welcome! Whether you're trying to learn data science, hone your coding skills, or get started collaborating over the web, we're happy to help. (Sidenote: with respect to Git and GitHub specifically, our github-playground repo and the #github-help Slack channel are good places to start.)
We've got (GitHub) Issues. Ready to dive in and do some good? Check out our weekly update and our issues board. Issues are how we officially keep track of the work we're doing, what we've done, and what we'd like to do next. If you'd like to work on something, comment on the issue and/or ping a lead on Slack so we can make assignments.

You can identify different issue types by their tags. If you're new to either GitHub or data science, pay special attention to:
- first-pr: smaller issues to cut your teeth on as a first-time contributor
- beginner-friendly: issues suitable for those with less experience or in need of mentorship
We believe good code is reviewed code. All commits to this repository are approved by project maintainers and/or leads (listed above). The goal here is not to criticize or judge your abilities! Rather, sharing insights and achievements this way ensures that we all continue to learn and grow. Code reviews help us continually refine the project's scope and direction, as well as encourage the discussion we need for it to thrive.
This README belongs to everyone. If we've missed some crucial information or left anything unclear, edit this document and submit a pull request. We welcome the feedback! Up-to-date documentation is critical to what we do, and changes like this are a great way to make your first contribution to the project.

Currently utilized skills

Take a look at this list to get an idea of the tools and knowledge we're leveraging. If you're good with any of these, or if you'd like to get better at them, this might be a good project to get involved with!

Python 3 (scripting, analysis, Jupyter notebooks, visualization)
R (analysis, R Markdown notebooks, visualization)
JavaScript (visualization)
Data extraction/ETL
Data cleaning
Data analysis

FAQ and other useful info

Downloading this repository

To download the code and data inside this repository, you'll need Git. Once you've got the necessary tools, open a command prompt and run `git clone https://github.com/data4democracy/drug-spending.git` to start downloading your own working copy. Once the command finishes, you should see a new `drug-spending` directory in the current directory's file listing. That's where you'll find it!

Project structure (or, "how do I find `thing`?")

Data: all our datasets are housed in our repo on data.world, which both keeps our GitHub repo streamlined and allows us to take advantage of data.world features like querying and discussion. If you're using R or Python, data.world has query clients for both. (R client; Python client)
Documentation:
- See our docs directory for general documentation, including more detailed objectives and (coming soon) a glossary of terms. We'll add other docs there as we go.
- Our datadictionaries directory contains an overview of our current available datasets, as well as detailed data dictionaries for each and tips on how to most effectively contribute more data.
Source code and notebooks: We currently have one directory each for Python and R code, with subdirectories for analyses/visualizations; notebooks; apps (eg, Flask/Shiny); and data collection/cleaning scripts.

Core data sets

https://data.world/data4democracy/drug-spending

Performing data analysis

There are many ways to analyze the data in this repository, but "notebook" formats like Jupyter and R Markdown are the most common. The setup process for these tools is in-depth enough to be outside the scope of this README, so please refer to documentation at the aforementioned links if necessary. If something isn't working quite right for you, that's okay! Continue reading to see how you can reach out for assistance.

Talking to people/asking for help

If you have questions or you'd like to discuss something on your mind, reach out to us in the #drug-spending channel on Slack. Project leads and maintainers are available for troubleshooting, brainstorming, mentorship, and just about anything else you might need.

System requirements (suggested)

Git (check out the github-playground repository if you need a good place to get accustomed)
An analytical language of your choice (Python, R, Julia, etc.)
Python 3 (for Jupyter/.ipynb notebook files)
RStudio (for R Markdown/.Rmd notebook files)

drug-spending's People

Contributors

jenniferthompson kylerbrown margaretmf sgalletta213 jacobcoblentz chrisjewell dhuppenkothen davidlibland cduvallet skirmer mattgrieser keytond rkahne scottschwalm kathy0305 nikitasingh981 rflprr kasiarachuta-zz andypicke ziyadnazem darya-akimova mattbrown88 jlistman etcadinfinitum fanying2015 amandasmith2 proof-by-accident n2itn chayast anandkarthick prlakhani kylestahl jmm4138 brainy749 boston123456 d-ghale bsipin20 lazuraslong veena-v-g fagan2888 batterysnoopy mentors4edu mandar-karhade ashwinmoorkoth1 diargot

drug-spending's Issues

Tidy drug_list.json

Status

Assigning this to myself. Currently working on formatting the nested list of therapeutic areas into a workable format.

Task

Tidy and/or possibly explore the drug_list.json dataset, found on data.world
Data dictionary: https://github.com/Data4Democracy/drug-spending/blob/master/datadictionaries/drug_list.md
Tidy format reference: https://ramnathv.github.io/pycon2014-r/explore/tidy.html

What we're looking for

Tidying:

Convert the .json to a .csv
Convert to tidy format, particularly paying attention to the drug classes
Separate the name column into a brand_name and generic_name, or similar, where appropriate
Clean up the approval_status column, so that the date can be easily converted to date format
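
As a rough illustration of the tidying steps above, here is a minimal sketch using only the Python standard library. The record layout, the "Brand (generic)" naming pattern, and the "Approved Month YYYY" status text are all assumptions made up for this example; the real drug_list.json may be structured differently, so check the data dictionary before adapting it.

```python
import csv
import io
import json
import re
from datetime import datetime

# Hypothetical records: the real drug_list.json layout may differ.
SAMPLE_JSON = """
[
  {"name": "Examplex (examplamide)",
   "approval_status": "Approved March 2015",
   "therapeutic_areas": ["Cardiology", "Endocrinology"]},
  {"name": "genericol",
   "approval_status": "Approved January 2012",
   "therapeutic_areas": ["Neurology"]}
]
"""

# Assumed naming pattern: "Brand (generic)".
NAME_PATTERN = re.compile(r"^(?P<brand>[^()]+?)\s*\((?P<generic>[^()]+)\)$")


def split_name(name):
    """Split 'Brand (generic)' into two parts; leave brand empty otherwise."""
    match = NAME_PATTERN.match(name.strip())
    if match:
        return match.group("brand"), match.group("generic")
    return "", name.strip()


def parse_approval_date(status):
    """Pull a 'Month YYYY' date out of the status text as an ISO string."""
    match = re.search(r"([A-Z][a-z]+ \d{4})", status)
    if not match:
        return ""
    return datetime.strptime(match.group(1), "%B %Y").date().isoformat()


def tidy(records):
    """Yield one row per (drug, therapeutic area) pair."""
    for record in records:
        brand, generic = split_name(record["name"])
        approved = parse_approval_date(record["approval_status"])
        for area in record["therapeutic_areas"]:
            yield {"brand_name": brand, "generic_name": generic,
                   "approval_date": approved, "therapeutic_area": area}


rows = list(tidy(json.loads(SAMPLE_JSON)))

# Write the tidy rows as CSV (to an in-memory buffer for this example).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Repeating the drug on one row per therapeutic area keeps the file tidy and makes the "how many drugs have multiple categories" question a simple group-and-count.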

Other:

How many of the drugs can be matched to the Medicare spending data?
How many drugs have multiple categories? Could the information in this dataset be useful for categorizing drugs based on therapeutic use?

How this will help

The drug_list.json and usp_drug_classification.csv files seem to include the most accessible drug category information: their classification systems lean toward therapeutic classification rather than the scientific/pharmacological approach of some of the others. However, drug_list.json needs some tidying to convert it into a more user-friendly format. Another issue with this dataset is that the specific_treatment column will need some language processing to be usable. To judge whether that work is worth it, we first need to know how many of the drugs in this file appear in the Medicare spending files.

Collect and clean Medicare Part B spending data

Task

Download the Part B spending data from CMS and convert it to an accessible format that's easier to analyze.

How this will help

Part D is a large component of drug spending with respect to Medicare, but Part B is significant, as well. If we incorporate Part B statistics into our decision-making and analyses, we can make comparisons and get a better picture for Medicare drug spending as a whole.

Create GitHub doc listing current and potential data sources

As the project evolves, we're collecting lots of disparate sources of data, and getting leads to look into others. We'd love for someone to put together a list of

what we're using currently
what has been suggested but not yet looked into
what has been looked into and is unlikely to be helpful

to help prevent effort duplication and better document what we have and what we need.

Design dashboard UI

Currently blocked: Need to decide which information will be displayed by the dashboard

As we define requirements for our dashboard, we need to sketch out a friendly, intuitive user interface to display the information it holds.

Find drugs with the largest year-over-year increases in claims, spending

Task

Analyze all available data to determine which drugs and/or types of drugs show the biggest year-over-year increases in claims and spending (total and/or per-user).
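
One possible way to frame this analysis is sketched below. The drug names and dollar amounts are invented for illustration; in practice the per-year totals would come from the spending files on data.world.

```python
# Hypothetical per-drug totals by year; the values are invented.
spending = {
    "drug_a": {2013: 100.0, 2014: 150.0, 2015: 165.0},
    "drug_b": {2013: 80.0, 2014: 60.0, 2015: 120.0},
}


def yoy_changes(by_year):
    """Return {year: fractional change versus the previous year}."""
    years = sorted(by_year)
    changes = {}
    for prev, curr in zip(years, years[1:]):
        # Skip gaps in the series and zero baselines.
        if curr - prev == 1 and by_year[prev] > 0:
            changes[curr] = (by_year[curr] - by_year[prev]) / by_year[prev]
    return changes


def largest_increases(data, top_n=5):
    """Rank (drug, year, change) tuples by year-over-year change."""
    rows = [(drug, year, change)
            for drug, by_year in data.items()
            for year, change in yoy_changes(by_year).items()]
    return sorted(rows, key=lambda r: r[2], reverse=True)[:top_n]


top = largest_increases(spending, top_n=2)
for drug, year, change in top:
    print(f"{drug} {year}: {change:+.0%}")
```

The same function works for claims or per-user spending; only the input dictionary changes. Relative change favors drugs with small baselines, so it may be worth ranking by absolute change as well.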

How this will help

If we can understand which medicines are being used more often, it gives us more ideas as to how America's health issues continue to evolve. Using this information, we can ask deeper questions and search for ways to improve the quality of healthcare for millions of Americans.

Tidy and upload Medicare Part B Drug Spending data (2011-2015)

Status

Assigning this to myself.

Task

Medicare Part B Spending data .xlsx format, 2011-2015: https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Information-on-Prescription-Drugs/2015MedicareData.html

What we're looking for

Tidy data that can be uploaded to data.world, alongside the original data file.
Create data dictionaries for all files uploaded

How this will help

While Medicare Part D covers outpatient drugs (drugs that you would administer yourself at home), Part B covers drugs that would typically be administered in a hospital setting (through an IV pump, for example). Many cancer drugs would be covered by Part B, for example, because they must be administered with a special pump. These datasets for Part B spending can help us better understand overall Medicare spending on drugs.

Visualize/describe how payroll taxes are allocated when funding Medicare Part A

Medicare Part A is the only piece of Medicare funded directly by payroll taxes. We'd like to show how each tax dollar is spent on different categories of expenditure: home health care, skilled nursing facilities, hospital stays, etc.

Add datasets collected & cleaned by `read_data.py` to data.world

read_data.py collects and tidies several datasets not currently available on data.world - we need to add these to the data.world repo + document them here for use by all.

Visualize/model relationship between lobbying expenditures and brand name prices

We have data from OpenSecrets on lobbying expenditures from pharmaceutical companies, and keys are in progress to join this with Medicare Part D spending data (see issue #37). Once we can join these datasets, we'd like to see if there is a relationship between lobbying expenditures and Medicare costs for those companies' medications.

Find more Medicare Part D Spending data (if it exists)

Status

Currently being discussed in the comments down below.

More data is available for 2010, but only for 40 drugs that were selected according to cms.gov criteria for the 2014 dashboard.
D4D Slack member Gilsha shared a research article on potential Medicare savings if brand name drugs are exchanged for generics ( https://jamanetwork.com/journals/jamainternalmedicine/article-abstract/2668631?redirect=true ). The data presented in this paper suggests that the authors also only had access to the 2011-2015 data, which may mean that they could not find or access data from earlier years.

Task

Investigate whether Medicare Part D Spending datasets exist for other years, such as 2016 or before 2011, and upload them to data.world.

May be related to Issue #49

What we're looking for

Data for Medicare Part D spending in .csv format, with variables similar to the datasets already on data.world (see any of the spending-201x.csv files on data.world for reference). It's enough if someone can find where this data can be found, even if you're not sure how to download and/or tidy it yourself.

How this will help

More data across the years can help us better understand how Medicare Part D spending has changed over time since its implementation.

Add context to CMS data on Medicare Part D

We're currently working mainly with Medicare Part D claims & spending data, which is informative on its own; however, we need context in order to have a good idea of the broader picture. Some ideas:

What % of overall Medicare spending does Part D comprise? What % of the overall federal and HHS budgets?
What % of Medicare recipients opt for Part D coverage?
What significant events or legislation might help us understand either overall spending patterns or usage patterns for specific drugs/drug classes? (example: benzodiazepine prescriptions were not covered until 2013)
How do prices paid by Medicare Part D plans compare to prices paid by non-Medicare commercial plans? Self-pay patients?
How have Medicare Part D premiums changed during the time that we have claims data?

CMS may have some of this data available, or we may need to look at other sources.

More general context: What are these drugs? A straightforward way to add context to names of generics (eg, "alprazolam" is better known as "Xanax", and anything containing "metformin" is a diabetes medication) would be helpful in all of these projects.

Create glossary of terms found in our datasets/context

Healthcare has lots of abbreviations, acronyms and terms. Most of us don't know them all. We need a glossary to help us understand them. Ideally, we'd have a format like:

term: quick description. Link(s) for more info.
RxNorm: normalized names for clinical drugs to enable linking between sources. More info: NIH/National Library of Medicine.

Terms we've come across so far:

RxNorm
NDC
Medicare
Medicaid
HCPCS

Add others below and I'll update the above list.

[WIP] Join Medicare data to USP drug classification

Status

Assigned to myself

Task

We want to join the Medicare spending data sets with the USP drug classification so that we can relate drug spending to therapeutic indications. I will be starting with the USP drug classification because we find that their description is more accessible than ATC (which is more technical).

What we're looking for

I want to join the Medicare drugname_generic with either usp_drug or drug_example. The usp_drug is the active ingredient and drug_example is the salt and/or formulations. I think the drugname_generic is also just the active ingredient.
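
A minimal sketch of such a join is shown below, assuming the two columns can be compared after lowercasing and stripping salt names. The sample rows, the usp_category column name, and the list of salt suffixes are hypothetical; the real columns are likely to need more cleanup than this.

```python
import re

# Hypothetical rows; real values in both files will be messier.
spending_rows = [
    {"drugname_generic": "Alprazolam", "total_spending": 10.0},
    {"drugname_generic": "METFORMIN HCL", "total_spending": 20.0},
    {"drugname_generic": "unknownium", "total_spending": 5.0},
]
usp_rows = [
    {"usp_drug": "alprazolam", "usp_category": "Anxiolytics"},
    {"usp_drug": "metformin", "usp_category": "Blood Glucose Regulators"},
]

# Assumed salt suffixes to drop so that only the active ingredient remains.
SALT_SUFFIXES = {"hcl", "hydrochloride"}


def normalize(name):
    """Lowercase, drop punctuation, and remove known salt suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in SALT_SUFFIXES)


usp_by_key = {normalize(row["usp_drug"]): row for row in usp_rows}

# Left join: keep every spending row, with None where no USP match exists.
joined = []
for row in spending_rows:
    match = usp_by_key.get(normalize(row["drugname_generic"]))
    joined.append({**row,
                   "usp_category": match["usp_category"] if match else None})
```

Keeping unmatched rows (rather than dropping them) makes it easy to count how much spending is still unclassified.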

How this will help

Joining these datasets will allow us to group Medicare drug spending by therapeutic indication.

Consolidate a list of data sources for drugs and their uses

Task

Create a list of data sources linking drug names to their uses and/or other similar drugs. Post the end result in this issue so we'll have a record we can look back on.

How this will help

Issue #6 also relates to drug uses; specifically, it seeks to link the drugs in our Medicare Part D data set to their respective purposes. While we started #6 with a good set of data to work from, there's a growing list of places we can look to get more information. With the right set(s) of eyes, we might be able to cover the CMS drug list more comprehensively. Before we can do that, though, we need to actually list out all the sources we're aware of.

Breakdown of drug spending by use(s)/purpose(s)

Task

Analyze Medicare Part D spending to build an understanding of the most treated conditions, as well as what's the most expensive to treat.

This is your chance to break out the tables, graphs, and charts!

How this will help

If we can get an idea of what Part D is being used to treat, we can do further exploration to find ways of reducing cost and improving overall quality of care.

Move drug spending data to data.world

Task

Upload drug spending data to data.world for long-term storage so it can be queried and downloaded easily and independently.

How this will help

By making our data available on data.world, we accomplish a few things:

GitHub is meant for storing code; data.world is much better suited to storing and querying data
We'll be keeping the source code repository "clean" and free of potentially large, unchanging files
Likewise, anyone solely interested in our data will not have to download our code to look at it
Putting our data on data.world will expose it to a larger and more focused community more in tune with data science and analysis

Tidy, document and submit data from OpenPaymentsData.CMS.gov

OpenPaymentsData.CMS.gov has data available on payments received by private physicians and teaching hospitals, broken down by type of expense, company, etc. We need someone to download this data, make sure it's in a tidy format, document it, and upload it for project use to our data.world repo. For details on the data contribution process, see our /datadictionaries README and data dictionary template.

Associate drugs with their therapeutic uses

Task

We currently have a listing of drug names (both brand and generic) and a separate list of ATC codes. We'd like to find a way to associate these two data sets in such a way that we can look up a given drug's therapeutic uses using its name.

How this will help

If we can establish a link between drug names and their uses, we'll be able to learn a ton about which diseases and conditions Medicare is spending money to treat. Among other things, this also opens the door to comparing cost and popularity of different drugs within the same class over time.

Things you need to know

Based on prior efforts, this task may take some significant effort to complete. Drugs often go by various names (even chemically/generically), so doing a simple text search may not be viable. Expect to have to deal with lots of exceptions and edge cases. We may even need to acquire more comprehensive data, which could require you to solicit other organizations.

Scrape Merck Manuals for drug names and uses

Task

The Merck Manuals website contains a listing of drugs, mapping generic names to brand names and listing usage indications (i.e., what the drug is prescribed for) with each one. We'd like to gather this data to build on our efforts to map drugs to their uses.

Start here: Merck Manuals Professional Version - Drug Information

This issue was spun off from #14.

Things you should know

The Merck Manuals website defaults to its consumer version. To see the professional version, one must select it explicitly. Hotlinks to the professional version redirect to the consumer version unless this selection is done beforehand. This issue can be circumvented by setting the HTTP Referer header to the value http://www.merckmanuals.com/professional.
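
The Referer workaround described above might look like the following sketch, which uses only the standard library. The function name and the URL passed to it are placeholders, and the actual fetch is left commented out so nothing is sent until you have checked the site's terms of use.

```python
import urllib.request

# Header value taken from the note above.
PROFESSIONAL_REFERER = "http://www.merckmanuals.com/professional"


def build_request(url):
    """Prepare a GET request that carries the Referer header."""
    return urllib.request.Request(
        url, headers={"Referer": PROFESSIONAL_REFERER})


# Placeholder URL; substitute the drug listing page you want to fetch.
request = build_request("http://www.merckmanuals.com/professional")

# To actually fetch the page (requires network access):
# with urllib.request.urlopen(request, timeout=30) as response:
#     html = response.read().decode("utf-8", errors="replace")
```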

Retrieving usage indicators for a drug may prove more complex than simply getting its name. Usage indicators are contained in a modal pop-up that appears when the user clicks on a drug name. Because the modal is controlled via JavaScript, the markup containing the desired information may not be visible to a basic "naive" scraper. This modal is a definitive guide to the drug in fine-grained detail, so some substantial text parsing may also be necessary.

What we're looking for

Output from this task should be one or more data files (CSV, feather, or otherwise). In this output, the following information should be recorded for each drug: generic name, brand name, and usage indicator(s).

How this will help

A robust dataset that correlates drugs with the conditions they're used to treat will prove invaluable as we start to dig into Medicare data. With the detail the Merck Manuals provide, we may be able to provide the clearest picture to date as to trends in Medicare drug spending and create snapshots that show how the Medicare population's health has changed over time.

Explore and extract data from data.cms.gov

Data.CMS.gov is the Centers for Medicare and Medicaid Services' data repository. Every time we go to the site we find more data available. We need people to look for potentially helpful datasets there, crosscheck against what we already have available, and post when they find promising data so that we can create new issues for collecting/tidying/uploading specific datasets.

Investigate + add data from drugbank.ca

@cduvallet made us aware of a site (drugbank.ca) that looks like it has very promising data! We need someone to

Contact them to make sure it's OK if we download and process the data and include it in our data.world repo, of course giving proper citations (their TOS look promising)
Determine exactly what data is available and what would be most helpful in our context
Download, tidy and add that data to our data.world repo, along with a data dictionary

Inventory gathered data and document relationships among sets

Status

The details of this issue are currently being discussed in the comments below. This issue may contain elements where development work is helpful, but is not primarily code-driven.

Task

We should take a look at all the data we've gathered to document how and by which fields various datasets are interconnected.

What we're looking for

To ensure it fits in with all our existing documentation, the result of work on this issue should go into a Markdown file in the /docs directory of the repo. This file should list out the following:

all the data sources we've gathered: what they're called, and a one-line description of what they contain
any "key" fields that join together one or more datasets: names of the field(s) and a one-line description of what they represent

Optionally, it would be nice to have a graphical representation of how our datasets interconnect. This can be done programmatically, through the use of a graph visualization tool, or manually.
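
For the programmatic route, one small starting point is to list which datasets share column names. The sketch below uses column names mentioned elsewhere in these issues, but the column sets are incomplete and partly assumed, and it only finds columns with identical names; fields that mean the same thing under different names (such as drugname_brand and proprietaryname) still have to be documented by hand.

```python
from itertools import combinations

# Partial, assumed column sets; fill these in from the data dictionaries.
datasets = {
    "spending-201x.csv": {"drugname_brand", "drugname_generic"},
    "drug_uses.csv": {"drugname_brand", "drugname_generic", "substance"},
    "FDA_NDC_Product.csv": {"proprietaryname", "nonproprietaryname",
                            "substancename"},
}


def shared_keys(tables):
    """Map each pair of datasets to the column names they have in common."""
    edges = {}
    for left, right in combinations(sorted(tables), 2):
        common = tables[left] & tables[right]
        if common:
            edges[(left, right)] = sorted(common)
    return edges


edges = shared_keys(datasets)
for (left, right), keys in edges.items():
    print(f"{left} <-> {right}: {', '.join(keys)}")
```

The resulting pairs can be pasted into the Markdown file as a table or fed to a graph visualization tool as an edge list.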

How this will help

Knowing which data sets are related makes it much easier for people to think about what insights can be gathered from them. It also identifies gaps in our understanding of the data we have and shows us what we should try to collect in the future.

Investigate new data source: Medical Expenditure Panel Survey

Task

The Agency for Healthcare Research and Quality collects data annually that is used to make nationally representative estimates of health care use, expenditures, sources of payment, and health insurance coverage. Investigate the website housing this data, meps.ahrq.gov, and find any data files that include details on prescription drug spending.

How this will help

The data available on the site could help us understand the trends in prescription drug spending over the last decade. It could help add context to our analysis as well, as the data is not limited to Medicaid and Medicare recipients.

What we want to get out of this

The data we're interested in is a subset of what is available, so a first task might be to compile a list of names and descriptions of the data of interest. Some of the data sets have nearly 2000 features, so it would also be very helpful if fields of interest were identified for each data set and clearly documented.

Investigate new data source: Chronic Conditions Data Warehouse

Before you continue

NOTE: We believe this website is only reachable from a United States IP. If you're working outside the US, you may not be able to work on this issue.

Task

The Centers for Medicare and Medicaid Services have another website that contains all sorts of information geared toward reducing spending on chronic diseases and conditions. Explore the Chronic Conditions Data Warehouse website and investigate any public data sets to see if they could help us put together interesting new data analyses.

How this will help

Sourcing new data that's related to Medicare drug spending -- especially different ways of looking at and categorizing spending -- will give our team the tools they need to build useful insights that people will want to see.

What we want to get out of this

To keep a record of the information you find, we'd like a short write-up in this issue thread explaining any available data sets that might be useful to us.

Create keys to join Medicare Part D spending, manufacturer, and lobbying info

We need to be able to join related datasets (stored at data.world) that currently don't have keys in common. Prime candidates currently include:

drugdata_clean.csv
Pharma_Lobby.csv
all spending-201x.csvs

Join Medicare Part D spending data to ATC Classification System

Status

@darwinyfu has expressed interest in working on this problem
Ongoing: @Anandkarthick has made progress on this issue by matching the spending files to drug_uses.csv (on data.world) by the drugname_generic

Update:

Upon working with it, I've discovered that drug_uses.csv is missing drugs that should be in the ATC system (and are present in the atc_codes_clean.csv dataset). It seems that the atc_codes_clean.csv is the way to go, even though it's currently in a messier state.

Update 3/1/2018

Matched approximately 3k out of 4.5k items in the Medicare Part D dataset (see /drug-spending/R/datawrangling/atc_merge_atc_codes_clean_da and /atc_merge_drug_uses_csv_da for the notebooks), but I am putting this issue on hold and switching directions because:

ATC classification offers limited matching on brand name. While it is possible to get some successful matches based on the generic/chemical names, this approach becomes complicated for drugs with multiple active ingredients
Unsure how to combine multiple ATC code assignments, both for single compounds that have multiple ATC classifications and for drugs with multiple (2-8) active ingredients.
Started hitting a wall where a lot of research and hand annotation would be required to make progress

Task

Join drugs in the Medicare Part D spending data to their ATC Classification System categories by joining the spending-201x.csv files on data.world to either atc_codes_clean.csv ~~or drug_uses.csv.~~

What we're looking for

Potential outputs that would help greatly to further the goal of this project:

A csv file of the two joined files if able to match all of the drugs in the Medicare spending files to ATC classification categories
OR both of the following (because the above will probably be challenging and/or a big time investment for one person):
A work in progress csv file with all successfully matched drugs
A work in progress csv file containing all of the drugs that COULD NOT be matched
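
The matched/unmatched pair of work-in-progress files could be produced along these lines. This is a sketch with a hypothetical in-memory lookup standing in for the cleaned ATC data, and it only does exact matching on lowercased names; output goes to strings here rather than to files.

```python
import csv
import io

# Hypothetical inputs standing in for the spending and ATC files.
spending_generics = ["alprazolam", "metformin hcl", "unknownium"]
atc_lookup = {"alprazolam": "N05BA12", "metformin": "A10BA02"}


def split_matches(names, lookup):
    """Separate names with an exact (lowercased) match from those without."""
    matched, unmatched = [], []
    for name in names:
        key = name.strip().lower()
        if key in lookup:
            matched.append({"drugname_generic": name,
                            "atc_code": lookup[key]})
        else:
            unmatched.append({"drugname_generic": name})
    return matched, unmatched


def to_csv(rows, fieldnames):
    """Serialize rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


matched, unmatched = split_matches(spending_generics, atc_lookup)
matched_csv = to_csv(matched, ["drugname_generic", "atc_code"])
unmatched_csv = to_csv(unmatched, ["drugname_generic"])
```

Note that "metformin hcl" lands in the unmatched file here: exact matching misses salt forms and multi-ingredient drugs, which is where hand annotation or fuzzier matching would come in.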

The potential routes of matching drugs to their ATC classification categories:

`atc_codes_clean.csv` dataset

Match the drugname_generic column from any of the spending-201x.csv files on data.world to the atc_codes_clean.csv file on data.world by the level5 or kegg columns. However, these columns are very messy in the atc_codes_clean.csv file. To be honest, I didn't put much effort into cleaning them because I wasn't sure how these columns would be used down the line.

drug_uses.csv dataset (which may be the easier of the two)

Two options:

Join the drugname_brand column in any of the spending-201x.csv datasets to the drugname_brand column in drug_uses.csv.
OR

Match the drugname_generic column from any of the spending-201x.csv files on data.world to the drugname_generic or substance or name columns in drug_uses.csv. These columns in the drug_uses.csv dataset are much cleaner than the atc_codes_clean.csv and I only realized that this file contained ATC classification information after I uploaded the atc_codes_clean.csv.

Side note: all of the spending-201x.csv files should have the same drugs in the same format, since they come from a parent wide .xlsx file that had the years data spread across the columns. If the drugs from one of the spending files are matched successfully, then the same steps should successfully join the other files, so no need to try and compare all of the spending files to find differences in drug names.

Ideal file formats for the analysis:

Jupyter notebook with code outputs
OR
R Markdown file (ideally knitted to html) with code outputs

How this will help

We're working towards matching drugs to their therapeutic uses. The USP Classification system, FDA drug approval data, and the ATC Classification system seem to be the best potential grouping categories of drugs from the datasets that have already been collected. The ATC classification system leans more toward scientific/medical jargon, but it could be useful down the line.

Explore and/or tidy the FDA_NDC_Product dataset

Status

@proof-by-accident has made progress on tidying the FDA_NDC_Product.csv dataset and is planning on tackling the exploration questions below.

Task

Explore and/or possibly tidy the FDA_NDC_Product.csv dataset, found on data.world
Data dictionary: https://github.com/Data4Democracy/drug-spending/blob/master/datadictionaries/FDA_NDC_Product.md
Tidy format reference: https://ramnathv.github.io/pycon2014-r/explore/tidy.html

What we're looking for

Tidying:
- Main columns that need tidying are: nonproprietaryname, substancename, active_numerator_strength, and active_ingred_unit. All of these columns may contain multiple values in one cell if a drug has multiple active ingredients. Ideal format is a separate row for each active ingredient (but other suggestions on better formatting are welcome)
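
The "one row per active ingredient" format could be produced with something like the sketch below. The sample row is invented, and the semicolon delimiter is an assumption about how multiple values are packed into a cell; nonproprietaryname is left out because its separators may not line up with the other columns.

```python
# Hypothetical row; the ";" delimiter is an assumption about the file.
row = {
    "proprietaryname": "ExampleCombo",
    "substancename": "DRUG ONE; DRUG TWO",
    "active_numerator_strength": "10; 20",
    "active_ingred_unit": "mg/1; mg/1",
}

MULTI_VALUE_COLUMNS = ("substancename", "active_numerator_strength",
                       "active_ingred_unit")


def explode(record, columns=MULTI_VALUE_COLUMNS, sep=";"):
    """Return one copy of the record per active ingredient."""
    parts = {col: [value.strip() for value in record[col].split(sep)]
             for col in columns}
    lengths = {len(values) for values in parts.values()}
    if len(lengths) != 1:
        # Columns disagree on the ingredient count; flag for manual review.
        raise ValueError(f"Mismatched value counts in {record!r}")
    exploded = []
    for index in range(lengths.pop()):
        new_record = dict(record)
        for col in columns:
            new_record[col] = parts[col][index]
        exploded.append(new_record)
    return exploded


tidy_rows = explode(row)
```

Raising on mismatched counts, rather than silently padding, surfaces the rows that need a closer look.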

~~Convert all string columns to lower case (can leave pharm_classes column as is, if uncomfortable evaluating the quality of the result)~~

~~Check if nonproprietaryname and substancename have the same values (after converting both to matching format)~~
Tidying completed! Tidied file: fda_ndc_product_tidy.csv on data.world.

Potential questions for exploration:

How many drugs have multiple active ingredients?
How many of the drugs found in the Medicare spending datasets have multiple active ingredients according to the FDA_NDC_Product.csv dataset? Try matching proprietaryname from the FDA_NDC_Product.csv dataset to the drugname_brand column in the spending_201x.csv datasets to address this question. After matching, what is the relationship between the nonproprietaryname and/or substancename columns from the FDA_NDC_Product.csv dataset and the drugname_generic column in the spending_201x.csv datasets?
How many of the drugs can at all be matched between the Medicare spending datasets and the FDA_NDC_Product dataset?
Anything else that seems interesting

How this will help

An option for matching the drugs in the Medicare spending datasets to therapeutic uses is to do so by active ingredients. It seems that the drugname_generic column in the spending datasets should be the main active ingredient that can be used for matching, but I had not considered drugs that may have multiple active ingredients (although drugname_generic appears to list multiple compounds for some drugs, so it may also include the active ingredients). The FDA_NDC_Product.csv dataset seems to contain a comprehensive list of the active ingredients in these drugs. Tidying the FDA_NDC_Product dataset and comparing it with the Medicare spending datasets is an important step in accurately matching drugs to therapeutic uses.

Update and finalize data contribution docs

Leaving this here mostly as a to-do list.

Add link to best practices from data.world (https://docs.google.com/document/d/1p5A2DQ5gFC7XVKNVDw_ifKnycv_j1udmqY1M0rjbcxo/edit)
Clarify what should go in README vs what should go in data dictionary
~~Add link to objectives statement~~
Add preference for CSV + feather formats unless there's a reason to do otherwise
~~Edit folder to datadictionaries once reorg PR is merged~~
Clarify that one does not simply upload into data.world (must request contributor invite)

Create visualization to help understand physician/hospital payments

OpenPaymentsData.CMS.gov has data available on physician and teaching hospital payments from pharmaceutical and device companies. We'd like to explore it to see information like...

How do patterns vary by physician specialty?
How have payment patterns changed over time?
other questions of interest

Note that this is blocked until issue #45 is completed.

Call for new leadership

The existing team of maintainers is no longer able to give this project the proper attention it deserves. If you're interested in picking up where we've left off, submit your comments here or in the project Slack channel (#p-drug-spending).
