jupyterworkflow's Introduction

Reproducible Data Analysis Workflow in Jupyter

Jupyter notebooks provide a useful environment for interactive exploration of data. A common question, though, is how you can progress from this nonlinear, interactive, trial-and-error style of analysis to a more linear and reproducible analysis based on organized, well-tested code. This series of videos shows an example of how I approach reproducible data analysis within the Jupyter notebook.

Each video is approximately 5-8 minutes; the videos are listed in the Jupyter Notebook linked above. Alternatively, you can view the playlist directly on YouTube.

jupyterworkflow's People

Contributors

dokeeffe, harrymvr, jakevdp

jupyterworkflow's Issues

JupyterWorkflow.ipynb: plt.style.use('seaborn') -> not available style in mpl 1.5.1

In the Jupyter notebook JupyterWorkflow.ipynb, the following cell raises an error:

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')

Matplotlib version = 1.5.1

The plain 'seaborn' style is not available in this Matplotlib version; only the 'seaborn-*' variants listed below are.

OSError: 'seaborn' not found in the style library and input is not a valid URL or path. See `style.available` for list of available styles.

Available styles:

['seaborn-pastel',
 'seaborn-ticks',
 'ggplot',
 'seaborn-bright',
 'seaborn-talk',
 'seaborn-notebook',
 'seaborn-deep',
 'grayscale',
 'seaborn-muted',
 'seaborn-dark-palette',
 'seaborn-colorblind',
 'bmh',
 'seaborn-poster',
 'seaborn-paper',
 'fivethirtyeight',
 'seaborn-white',
 'classic',
 'seaborn-dark',
 'seaborn-whitegrid',
 'dark_background',
 'seaborn-darkgrid']
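One possible workaround, sketched here rather than taken from the notebook, is to pick the first style that the installed Matplotlib actually offers. The helper name `pick_style` and the order of candidates are assumptions for illustration; the list of available styles is abbreviated from the one reported above.

```python
def pick_style(candidates, available):
    """Return the first candidate style found in `available`, or None."""
    for name in candidates:
        if name in available:
            return name
    return None


# Abbreviated copy of the styles reported for Matplotlib 1.5.1 in this issue.
available_in_1_5_1 = [
    "seaborn-pastel", "seaborn-ticks", "ggplot", "grayscale",
    "classic", "seaborn-whitegrid", "seaborn-darkgrid",
]

# Prefer 'seaborn', then fall back to a close variant, then to 'ggplot'.
chosen = pick_style(["seaborn", "seaborn-darkgrid", "ggplot"], available_in_1_5_1)
print(chosen)  # seaborn-darkgrid
```

In the notebook itself, `available` would be `plt.style.available`, and the result would be passed to `plt.style.use(...)` only when it is not None.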

Continuous Integration

I have found that notebooks tend to break within a year or two: I may have different versions of Python packages installed, the upstream data may become unavailable, or something else changes. You also want to be sure that PRs people send can be merged safely.

Part of the solution is to test the notebooks themselves on Travis, as long as they run in a reasonable time (say, 30 minutes or less). Each run then tests against a particular set of package versions, so if somebody sends a PR and an unrelated notebook breaks, it's easier to debug. At the very least, you can execute the notebooks with the nbconvert command-line tool. It would be nice to also test that they actually work as intended, not just that they run, but I don't have a good solution for that.
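As a rough sketch of what such a CI step could look like, the snippet below builds a `jupyter nbconvert --execute` command and runs it, treating a non-zero exit code as a failed notebook. The function names, the output directory, and the 30-minute timeout are assumptions chosen to match the discussion above, not something taken from this repository.

```python
import subprocess


def nbconvert_command(notebook_path, output_dir, timeout_seconds=1800):
    """Build the command line that executes one notebook with nbconvert."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        # Per-cell execution timeout; 1800 s mirrors the "30 min" budget.
        "--ExecutePreprocessor.timeout={}".format(timeout_seconds),
        "--output-dir", output_dir,
        notebook_path,
    ]


def run_notebook(notebook_path, output_dir, timeout_seconds=1800):
    """Return True if the notebook executed without raising an error."""
    cmd = nbconvert_command(notebook_path, output_dir, timeout_seconds)
    return subprocess.run(cmd).returncode == 0
```

A CI script could call `run_notebook("JupyterWorkflow.ipynb", "executed")` for each notebook and fail the build if any call returns False. This only checks that every cell runs; it says nothing about whether the results are correct.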

The other problem is that the website I got the data from can change so that I can no longer download it, and then my whole pipeline, which depended on the exact format, is useless. I don't have a good solution for this either: typically you don't own the copyright to the data, so you can't just upload it somewhere yourself. The data can also be large.
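This does not solve the availability or copyright problem, but one partial mitigation is to record a checksum of the data file the analysis was developed against, so that a changed upstream file is detected immediately instead of surfacing as a confusing failure later in the pipeline. The sketch below assumes the data has already been downloaded to a local file; the function names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def data_is_unchanged(path, expected_sha256):
    """Return True if the file at `path` matches the recorded digest."""
    return sha256_of_file(path) == expected_sha256
```

The expected digest would be stored alongside the notebooks, and a test could assert `data_is_unchanged(...)` before the analysis runs.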

BUG : Cluster labels are switched when the analysis is rerun

In https://github.com/jakevdp/JupyterWorkflow/blob/master/UnsupervisedAnalysis.ipynb , correct me if I'm wrong, but the cluster labels seem to be switched compared with the first time the analysis was run. The switch can also be seen in the "Analyzing outliers" section of the notebook, where the results now show all weekdays instead of the weekdays with weekend-like ride patterns.

Does the GMM assign the labels 0 and 1 the same way it did during the last run?
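Cluster labels from a mixture model are arbitrary, so they can swap between fits. One way to make downstream code independent of that, sketched here under the assumption that the clusters differ in size (for example, more weekday-like days than weekend-like days), is to renumber the labels by cluster size after fitting. The helper `relabel_by_size` is hypothetical and not part of the notebook.

```python
from collections import Counter


def relabel_by_size(labels):
    """Renumber cluster labels so that 0 is the largest cluster, 1 the next, etc."""
    counts = Counter(labels)
    # Sort by descending size; break ties by original label for determinism.
    order = sorted(counts, key=lambda lab: (-counts[lab], lab))
    mapping = {old: new for new, old in enumerate(order)}
    return [mapping[lab] for lab in labels]


# Two runs that found the same clusters but with swapped label numbers.
run_a = [0, 0, 0, 1, 1]
run_b = [1, 1, 1, 0, 0]
print(relabel_by_size(run_a) == relabel_by_size(run_b))  # True
```

Another option is to fix the random seed when fitting, for instance through the `random_state` parameter of scikit-learn's `GaussianMixture`, though that only makes the labels repeatable for the same data and library version.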
