This repository is a sandbox for exploration into several areas:
- First, running automated safety checks on data inputs to an ML pipeline using tox and GitHub Actions.
- Second, running a test suite with pytest, flake8, and mypy.
- Third, experimenting with the Cookiecutter DS project structure.
- Fourth, learning more about the AACT database of clinical trials.
- Finally, various methods of keeping credentials out of repositories.
The impetus for testing these options came from a YouTube video by mCoding that I recently watched (see here). Of course, once I got started there were multiple concepts I wanted to work with, hence the expanding list above.
Cloning this repository provides most of the files required to run the tests. The data the tests are based on, however, comes from the Aggregate Analysis of ClinicalTrials.gov (AACT) database. This database hosts data submitted to ClinicalTrials.gov about proposed medical trials. Connecting to the database requires credentials, which are free to obtain.
To use your credentials with this repository, copy the credentials.ini.example file to credentials.ini. Then populate credentials.ini with your username and password for the AACT database.
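As a rough illustration of how the tests might pick up these values, here is a minimal sketch that reads credentials.ini with Python's standard configparser module. The function name load_credentials, the section name aact, and the keys username and password are assumptions made for this example; the actual layout is whatever credentials.ini.example defines.

```python
import configparser
from pathlib import Path


def load_credentials(path="credentials.ini", section="aact"):
    """Return (username, password) read from an INI file.

    The section and key names are hypothetical; match them to the
    layout used in credentials.ini.example.
    """
    # parser.read() silently skips missing files, so check explicitly.
    if not Path(path).is_file():
        raise FileNotFoundError(
            f"{path} not found; copy credentials.ini.example to create it"
        )
    # Disable interpolation so a '%' in a password is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)
    return parser[section]["username"], parser[section]["password"]
```

Calling load_credentials() from the repository root would then return the username and password as a tuple, so the values never need to appear in code that is committed.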
- smooth data flows for updated datasets based on SQL queries
- translate helper functions for modern scikit-learn Pipelines
- concatenate all data
- build model
- check model - use https://stackoverflow.com/questions/61877496/how-to-ensure-persistent-sklearn-models-on-bit-level
- produce outputs
- more tests (refactor for better code coverage?)
- give clean targets in Makefile
- create Sphinx documentation