jakevdp / jupyterworkflow Goto Github PK

Reproducible Data Analysis Workflow in Jupyter

Home Page: https://jakevdp.github.io/blog/2017/03/03/reproducible-data-analysis-in-jupyter/

License: MIT License

Jupyter Notebook 99.83% Python 0.16% Makefile 0.01%

jupyterworkflow's Introduction

Reproducible Data Analysis Workflow in Jupyter

Jupyter notebooks provide a useful environment for interactive exploration of data. A common question, though, is how you can progress from this nonlinear, interactive, trial-and-error style of analysis to a more linear and reproducible analysis based on organized, well-tested code. This series of videos shows an example of how I approach reproducible data analysis within the Jupyter notebook.

Blog Post with Videos & Descriptions

Each video is approximately 5-8 minutes; the videos are listed in the Jupyter Notebook linked above. Alternatively, you can view the playlist directly on YouTube.


jupyterworkflow's Issues

BUG : testing the number of unique hours in the dataset

Shouldn't https://github.com/jakevdp/JupyterWorkflow/blob/master/jupyterworkflow/tests/test_data.py#L10 be assert len(...) == 24 instead of assert len(... == 24)? The test was passing because comparing the NumPy array with 24 produces a boolean array of non-zero length, and a non-zero length is truthy, so the assertion can never fail on non-empty data.

Sorry if I'm missing something obvious.
Kudos to the awesome playlist :)
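
The effect of the misplaced parenthesis can be reproduced in isolation. The sketch below is not the repository's test; the function names and the stand-in hour values are hypothetical, chosen only to show that the buggy form passes even when the data holds just 5 distinct hours.

```python
import numpy as np


def buggy_check_passes(hours):
    """Mimics `assert len(np.unique(hours) == 24)` (parenthesis misplaced)."""
    # `np.unique(hours) == 24` is an elementwise comparison, so len() returns
    # the number of unique values, which is truthy for any non-empty input.
    return bool(len(np.unique(hours) == 24))


def correct_check_passes(hours):
    """Mimics `assert len(np.unique(hours)) == 24`."""
    return len(np.unique(hours)) == 24


# Hypothetical stand-in data: only 5 distinct hours, each repeated 3 times.
too_few_hours = np.repeat(np.arange(5), 3)
full_day = np.repeat(np.arange(24), 3)

print(buggy_check_passes(too_few_hours))    # passes even though it should fail
print(correct_check_passes(too_few_hours))  # fails, as intended
print(correct_check_passes(full_day))       # passes
```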

JupyterWorkflow.ipynb: plt.style.use('seaborn') -> not available style in mpl 1.5.1

In jupyter notebook JupyterWorkflow.ipynb

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')

Matplotlib version = 1.5.1

The 'seaborn' style is not available in this version of matplotlib.

OSError: 'seaborn' not found in the style library and input is not a valid URL or path. See `style.available` for list of available styles.

Available styles:

['seaborn-pastel',
 'seaborn-ticks',
 'ggplot',
 'seaborn-bright',
 'seaborn-talk',
 'seaborn-notebook',
 'seaborn-deep',
 'grayscale',
 'seaborn-muted',
 'seaborn-dark-palette',
 'seaborn-colorblind',
 'bmh',
 'seaborn-poster',
 'seaborn-paper',
 'fivethirtyeight',
 'seaborn-white',
 'classic',
 'seaborn-dark',
 'seaborn-whitegrid',
 'dark_background',
 'seaborn-darkgrid']
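
One way to keep the notebook running across matplotlib versions is to pick the first preferred style that is actually installed. The helper below is a hypothetical sketch, not code from the notebook; the name pick_style, the preference order, and the 'classic' fallback are all assumptions.

```python
def pick_style(preferred, available, default="classic"):
    """Return the first name in `preferred` that appears in `available`.

    `default` is returned when none of the preferred styles is installed.
    """
    for name in preferred:
        if name in available:
            return name
    return default


# Subset of the styles reported above for matplotlib 1.5.1.
available_styles = ["seaborn-pastel", "ggplot", "seaborn-darkgrid", "classic"]

# 'seaborn' is missing from the list, so the next preference is chosen.
print(pick_style(["seaborn", "seaborn-darkgrid", "ggplot"], available_styles))
```

In the notebook this could be used as plt.style.use(pick_style(['seaborn', 'seaborn-darkgrid'], plt.style.available)), since plt.style.available lists the installed style names.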

Continuous Integration

What I have found is that notebooks tend to break within a year or two: I might have different versions of Python packages installed, the upstream data may become unavailable, or something else changes. Also, when people send PRs, you want to be sure you can merge them safely.

Part of the solution is to actually run the notebooks themselves on Travis, as long as they finish in a reasonable time (say 30 minutes or less). That pins each run to a particular set of package versions, so if somebody sends a PR and an unrelated notebook breaks, it's easier to debug. At the very least you can execute the notebooks using the nbconvert command-line tool. But it'd be nice to also test that they actually work, not just that they run; I don't have a good solution for this.

The other problem is that the website I got the data from can change so that I can no longer download it, and then my whole pipeline, which depended on the exact format and so on, is useless. I don't have a good solution for this, since typically you don't own the copyright to the data, so you can't just upload it somewhere yourself. The data can also be big.
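
A partial mitigation, which does nothing about the copyright or size concerns, is to fail early with a clear message when the downloaded file no longer has the expected format. The sketch below checks only the CSV header; the function name and the column names are hypothetical.

```python
import csv
import io


def check_csv_header(text, expected_columns):
    """Raise ValueError if any expected column is missing from the CSV header.

    `text` is the CSV content as a string. Extra columns and extra rows are
    tolerated, so a dataset that keeps growing still passes.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])  # empty input yields an empty header
    missing = [col for col in expected_columns if col not in header]
    if missing:
        raise ValueError(f"upstream data is missing columns: {missing}")
    return header


# Hypothetical sample of downloaded data.
sample = "Date,West,East\n2017-01-01 00:00,4,9\n"
print(check_csv_header(sample, ["Date", "West", "East"]))
```

Running such a check at the start of the pipeline, or in CI, turns a silent upstream change into an explicit failure that names the missing columns.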

BUG : Cluster labels are switched when the analysis is rerun

In https://github.com/jakevdp/JupyterWorkflow/blob/master/UnsupervisedAnalysis.ipynb , correct me if I'm wrong, but the cluster labels seem to be switched compared with the first time the analysis was run. The switch can also be seen in the "Analyzing outliers" section of the notebook, where the results now show all weekdays instead of only the weekdays with weekend-like ride patterns.

Does the GMM assign the labels 0 and 1 in the same way it did during the last run?
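
Mixture-model cluster labels are arbitrary integers, so nothing guarantees that the same cluster gets the same number on every fit. One common workaround is to relabel the clusters by a property of the data itself. The sketch below is not taken from the notebook; the function name and the choice of ordering feature are assumptions. It renumbers clusters in order of ascending mean of a chosen feature, so label 0 always denotes the cluster with the smallest mean.

```python
import numpy as np


def canonical_labels(labels, feature):
    """Renumber cluster labels by ascending mean of `feature` per cluster."""
    labels = np.asarray(labels)
    feature = np.asarray(feature, dtype=float)
    ids = np.unique(labels)
    means = np.array([feature[labels == k].mean() for k in ids])
    # Old label ids ordered from smallest to largest cluster mean.
    order = ids[np.argsort(means)]
    mapping = {old: new for new, old in enumerate(order)}
    return np.array([mapping[label] for label in labels])


# Hypothetical feature values for four days, clustered twice with the
# arbitrary label numbers swapped between the two runs.
feature = [10.0, 12.0, 1.0, 2.0]
run_a = [0, 0, 1, 1]
run_b = [1, 1, 0, 0]

print(canonical_labels(run_a, feature))
print(canonical_labels(run_b, feature))  # same labelling as for run_a
```

Passing a fixed random_state to scikit-learn's GaussianMixture also makes repeated fits on identical data reproducible, but the labels can still switch once the underlying data changes, which is why ordering by a data-derived quantity may be the more robust option.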
