
scipy-2016-sklearn's Introduction

SciPy 2016 Scikit-learn Tutorial

Based on the SciPy 2015 tutorial by Kyle Kastner and Andreas Mueller.

Instructors

  • Sebastian Raschka
  • Andreas Mueller

The video recording of the tutorial is now available via YouTube.


This repository will contain the teaching material and other info associated with our scikit-learn tutorial at SciPy 2016, held July 11-17 in Austin, Texas.

Parts 1 to 12 make up the morning session, while parts 13 to 23 will be presented in the afternoon.

Schedule:

The 2-part tutorial will be held on Tuesday, July 12, 2016.

  • Parts 1 to 12: 8:00 AM - 12:00 PM (Room 105)
  • Parts 13 to 23: 1:30 PM - 5:30 PM (Room 105)

(You can find the full SciPy 2016 tutorial schedule here.)

Obtaining the Tutorial Material

If you have a GitHub account, it is probably most convenient to fork the GitHub repository. If you don’t have a GitHub account, you can download the repository as a .zip file by heading over to the GitHub repository (https://github.com/amueller/scipy-2016-sklearn) in your browser and clicking the green “Download” button in the upper right.

Please note that we may add and improve the material until shortly before the tutorial session, and we recommend that you update your copy of the materials one day before the tutorial. If you have a GitHub account and forked/cloned the repository via GitHub, you can sync your existing fork via the following commands:

git remote add upstream https://github.com/amueller/scipy-2016-sklearn.git
git fetch upstream
git checkout master
git merge upstream/master

If you don’t have a GitHub account, you may have to re-download the .zip archive from GitHub.
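
If you cloned the repository directly instead of forking it, the initial download and later updates reduce to the standard commands (the directory name simply follows the repository name):

git clone https://github.com/amueller/scipy-2016-sklearn.git
cd scipy-2016-sklearn
git pull origin master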

Installation Notes

This tutorial will require recent installations of

  • numpy
  • scipy
  • matplotlib
  • pandas
  • pillow
  • scikit-learn
  • IPython
  • Jupyter Notebook

The last one is important; you should be able to type:

jupyter notebook

in your terminal window and see the notebook panel load in your web browser. Try opening and running a notebook from the material to check that it works.

For users who do not yet have these packages installed, a relatively painless way to install all the requirements is to use a Python distribution such as Anaconda CE, which includes the most relevant Python packages for science, math, engineering, and data analysis; Anaconda can be downloaded and installed for free, including for commercial use and redistribution. The code examples in this tutorial should be compatible with Python 2.7, Python 3.4, and Python 3.5.
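
If you want to verify your setup by hand, a minimal sketch for checking the interpreter and core package versions (assuming the packages listed above are installed) looks like this:

# print the interpreter version and the version of each core package
import sys
import numpy, scipy, matplotlib, sklearn

print(sys.version)
for module in (numpy, scipy, matplotlib, sklearn):
    print(module.__name__, module.__version__)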

After obtaining the material, we strongly recommend opening and executing the Jupyter notebook check_env.ipynb, which is located at the top level of this repository. You can open the notebook by executing

jupyter notebook check_env.ipynb

inside this repository. Inside the Notebook, you can run the code cell by clicking on the "Run Cells" button as illustrated in the figure below:

Finally, if your environment satisfies the requirements for the tutorials, the executed code cell will produce an output message as shown below:

Although not required, we also recommend updating the required Python packages to their latest versions to ensure the best compatibility with the teaching material. You can upgrade already-installed packages by executing

  • pip install [package-name] --upgrade
  • or conda update [package-name]
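
For example, to upgrade scikit-learn you would run pip install scikit-learn --upgrade or conda update scikit-learn.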

Data Downloads

The data for this tutorial is not included in the repository. We will be using several data sets during the tutorial: most are built into scikit-learn, which includes code that automatically downloads and caches the data.

Because the wireless network at conferences can often be spotty, it would be a good idea to download these data sets before arriving at the conference. Please run python fetch_data.py to download all necessary data beforehand.

The download size of the data files is approximately 280 MB, and after fetch_data.py has extracted the data, the ./notebook/dataset folder will take up about 480 MB on your local solid state or hard drive.

Outline

Morning Session

  • 01 Introduction to machine learning with sample applications, Supervised and Unsupervised learning [[view](notebooks/01\ Introduction\ to\ Machine\ Learning.ipynb)]
  • 02 Scientific Computing Tools for Python: NumPy, SciPy, and matplotlib [[view](notebooks/02\ Scientific\ Computing\ Tools\ in\ Python.ipynb)]
  • 03 Data formats, preparation, and representation [[view](notebooks/03\ Data\ Representation\ for\ Machine\ Learning.ipynb)]
  • 04 Supervised learning: Training and test data [[view](notebooks/04\ Training\ and\ Testing\ Data.ipynb)]
  • 05 Supervised learning: Estimators for classification [[view](notebooks/05\ Supervised\ Learning\ -\ Classification.ipynb)]
  • 06 Supervised learning: Estimators for regression analysis [[view](notebooks/06\ Supervised\ Learning\ -\ Regression.ipynb)]
  • 07 Unsupervised learning: Unsupervised Transformers [[view](notebooks/07\ Unsupervised\ Learning\ -\ Transformations\ and\ Dimensionality\ Reduction.ipynb)]
  • 08 Unsupervised learning: Clustering [[view](notebooks/08\ Unsupervised\ Learning\ -\ Clustering.ipynb)]
  • 09 The scikit-learn estimator interface [[view](notebooks/09\ Review\ of\ Scikit-learn\ API.ipynb)]
  • 10 Preparing a real-world dataset (titanic) [[view](notebooks/10\ Case\ Study\ -\ Titanic\ Survival.ipynb)]
  • 11 Working with text data via the bag-of-words model [[view](notebooks/11\ Text\ Feature\ Extraction.ipynb)]
  • 12 Application: Text classification for SMS spam detection [[view](notebooks/12\ Case\ Study\ -\ SMS\ Spam\ Detection.ipynb)]

Afternoon Session

  • 13 Cross-Validation [[view](notebooks/13\ Cross\ Validation.ipynb)]
  • 14 Model complexity and grid search for adjusting hyperparameters [[view](notebooks/14\ Model\ Complexity\ and\ GridSearchCV.ipynb)]
  • 15 Scikit-learn Pipelines [[view](notebooks/15\ Pipelining\ Estimators.ipynb)]
  • 16 Supervised learning: Performance metrics for classification [[view](notebooks/16\ Performance\ metrics\ and\ Model\ Evaluation.ipynb)]
  • 17 Supervised learning: Linear Models [[view](notebooks/17\ In\ Depth\ -\ Linear\ Models.ipynb)]
  • 18 Supervised learning: Support Vector Machines [[view](notebooks/18\ In\ Depth\ -\ Support\ Vector\ Machines.ipynb)]
  • 19 Supervised learning: Decision trees and random forests, and ensemble methods [[view](notebooks/19\ In\ Depth\ -\ Trees\ and\ Forests.ipynb)]
  • 20 Supervised learning: feature selection [[view](notebooks/20\ Feature\ Selection.ipynb)]
  • 21 Unsupervised learning: Hierarchical and density-based clustering algorithms [[view](notebooks/21\ Unsupervised\ learning\ -\ Hierarchical\ and\ density-based\ clustering\ algorithms.ipynb)]
  • 22 Unsupervised learning: Non-linear dimensionality reduction [[view](notebooks/22\ Unsupervised\ learning\ -\ Non-linear\ dimensionality\ reduction.ipynb)]
  • 23 Supervised learning: Out-of-core learning [[view](notebooks/23\ Out-of-core\ Learning\ Large\ Scale\ Text\ Classification.ipynb)]

scipy-2016-sklearn's People

Contributors

amueller, kastnerkyle, nelson-liu, rasbt, rhiever, scw, stavxyz, w3d3


scipy-2016-sklearn's Issues

Add outro to slides

I think that a nice way to wrap up the workshop is to point the students to things they can do next, including books, online courses, projects to get involved in, etc.

You can see one example of such an outro in my recent workshop here (final section).

No SMS data

I also can't find the SMS dataset in the repo or fetch_data.py :(

Another renaming problem

Now I am confused:

The notebooks are:

13 Cross Validation.ipynb
14 Model Complexity and GridSearchCV
15 Pipelining Estimators.ipynb
16 Performance metrics and Model Evaluation.ipynb

The outline in the README is:

13 Cross-Validation
14 Model Complexity: Overfitting and underfitting
15 Grid search for adjusting hyperparameters
16 Scikit-learn Pipelines
17 Supervised learning: Performance metrics for classification

I assume you summarized

14 Model Complexity: Overfitting and underfitting
15 Grid search for adjusting hyperparameters

into

14 Model Complexity and GridSearchCV

??

We need another round of renaming, then. I can take care of that.

coordinate 7, 8, 9

They are currently weird. I'm not sure we should make this all about eigenfaces. I'd rather talk about more methods. But I'm not sure.

Add a section on automated machine learning

Perhaps the workshop already has too much material for the usual runtime (~8 hours), but how about a practical section on automated machine learning? I taught this section at the end of a version of this sklearn workshop, and it was quite well-received. I think it's a good topic to mention to beginners (even if just as a teaser).

pyyaml and Pillow dependency

FYI

  1. jupyter notebook fails while loading nbextensions with ImportError: No module named yaml. Fixed by pip-installing pyyaml.
  2. python fetch_data.py fails after downloading LFW data due to missing PIL library. Fixed by pip-installing Pillow (see Newmu/stylize#1)
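
For reference, the corresponding fixes are one pip command each:

pip install pyyaml
pip install Pillow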

Refine the explanation for the sklearn metrics section

Currently the sklearn metrics section discusses a whole bunch of metrics but doesn't seem to go into detail on why you would use one particular metric. One point I usually try to make about metrics is that the "correct" metric depends critically on your problem, e.g.,

  • if you're doing spam detection, maybe a FN isn't so bad, so you can use a metric that focuses on minimizing FP (i.e., not flagging legitimate mail)
  • but if you're doing cancer detection, a FN is disastrous, so you would use a metric that focuses on minimizing FN (even at the expense of more FP)
  • etc.

IMO it's a good idea to give students an intuition behind why we choose certain metrics in ML.
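
As a minimal sketch of how this intuition maps onto scikit-learn's metrics (the toy labels below are invented purely for illustration):

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]  # 1 = positive class (spam, or cancer)
y_pred = [1, 0, 0, 0, 0, 1]  # two false negatives, one false positive

# precision penalizes FP -- the quantity to watch in spam filtering
print(precision_score(y_true, y_pred))  # 0.5

# recall penalizes FN -- the quantity to watch in cancer screening
print(recall_score(y_true, y_pred))  # 0.333...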

move feature selection up?

Currently feature selection is after a bunch of unsupervised learning. I think it should be together with the supervised learning.

Bigram & character-level tokenization

Hi, Andy,
I am a bit confused about the bigram & character-level tokenization in sklearn (e.g., as shown in Nb 03.4) ... say we have the following text, tokenized as follows:

>>> X
['Some say the world will end in fire,', 'Some say in ice.']

>>> char_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
>>> char_vectorizer.fit(X)

>>> print(char_vectorizer.get_feature_names())
[' e', ' f', ' i', ' s', ' t', ' w', 'ay', 'ce', 'd ', 'e ', 'e,', 'e.', 'en', 'fi', 'he', 'ic', 'il', 'in', 'ir', 'l ', 'ld', 'll', 'me', 'n ', 'nd', 'om', 'or', 're', 'rl', 'sa', 'so', 'th', 'wi', 'wo', 'y ']

Why would we end up with these single characters when we set ngram_range=(2, 2)? I thought we'd only get those for, e.g., ngram_range=(1, x)?
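
(A quick sanity check suggests these are not actually single characters: entries such as ' e' are two-character n-grams with a leading or trailing space.)

>>> print(all(len(ngram) == 2 for ngram in char_vectorizer.get_feature_names()))
True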

!!! No titanic data

The Titanic dataset seems to be missing (notebook 10). Do you have it on your local drive? It's probably pretty small, so we could just add it to the repo instead of fetching it online.
