
scipy-2016-sklearn's Introduction

SciPy 2016 Scikit-learn Tutorial

Based on the SciPy 2015 tutorial by Kyle Kastner and Andreas Mueller.

Instructors

  • Sebastian Raschka
  • Andreas Mueller

The video recording of the tutorial is now available via YouTube.


This repository will contain the teaching material and other info associated with our scikit-learn tutorial at SciPy 2016, held July 11-17 in Austin, Texas.

Parts 1 to 12 make up the morning session, while parts 13 to 23 will be presented in the afternoon.

Schedule:

The 2-part tutorial will be held on Tuesday, July 12, 2016.

  • Parts 1 to 12: 8:00 AM - 12:00 PM (Room 105)
  • Parts 13 to 23: 1:30 PM - 5:30 PM (Room 105)

(You can find the full SciPy 2016 tutorial schedule here.)

Obtaining the Tutorial Material

If you have a GitHub account, it is probably most convenient to fork the GitHub repository. If you don’t have a GitHub account, you can download the repository as a .zip file by heading over to the GitHub repository (https://github.com/amueller/scipy-2016-sklearn) in your browser and clicking the green “Download” button in the upper right.

Please note that we may add and improve the material until shortly before the tutorial session, and we recommend that you update your copy of the materials one day before the tutorial. If you have a GitHub account and forked/cloned the repository via GitHub, you can sync your existing fork via the following commands:

git remote add upstream https://github.com/amueller/scipy-2016-sklearn.git
git fetch upstream
git checkout master
git merge upstream/master

If you don’t have a GitHub account, you may have to re-download the .zip archive from GitHub.
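
If you cloned the repository directly instead of forking it, the initial download and later updates reduce to the standard commands (the directory name simply follows the repository name):

git clone https://github.com/amueller/scipy-2016-sklearn.git
cd scipy-2016-sklearn
git pull origin master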

Installation Notes

This tutorial will require recent installations of

  • numpy
  • scipy
  • matplotlib
  • pandas
  • pillow
  • scikit-learn
  • IPython
  • Jupyter Notebook

The last one is important; you should be able to type:

jupyter notebook

in your terminal window and see the notebook panel load in your web browser. Try opening and running a notebook from the material to check that it works.

For users who do not yet have these packages installed, a relatively painless way to install all the requirements is to use a Python distribution such as Anaconda CE, which includes the most relevant Python packages for science, math, engineering, and data analysis; Anaconda can be downloaded and installed for free, including for commercial use and redistribution. The code examples in this tutorial should be compatible with Python 2.7, Python 3.4, and Python 3.5.
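
If you want to verify your setup by hand, a minimal sketch for checking the interpreter and core package versions (assuming the packages listed above are installed) looks like this:

# print the interpreter version and the version of each core package
import sys
import numpy, scipy, matplotlib, sklearn

print(sys.version)
for module in (numpy, scipy, matplotlib, sklearn):
    print(module.__name__, module.__version__)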

After obtaining the material, we strongly recommend opening and executing the Jupyter notebook check_env.ipynb, which is located at the top level of this repository. You can open the notebook by executing

jupyter notebook check_env.ipynb

inside this repository. Inside the Notebook, you can run the code cell by clicking on the "Run Cells" button as illustrated in the figure below:

Finally, if your environment satisfies the requirements for the tutorials, the executed code cell will produce an output message as shown below:

Although not required, we also recommend updating the required Python packages to their latest versions to ensure the best compatibility with the teaching material. You can upgrade already-installed packages by executing

  • pip install [package-name] --upgrade
  • or conda update [package-name]
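
For example, to upgrade scikit-learn you would run pip install scikit-learn --upgrade or conda update scikit-learn.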

Data Downloads

The data for this tutorial is not included in the repository. We will be using several data sets during the tutorial: most are built into scikit-learn, which includes code that automatically downloads and caches the data.

Because the wireless network at conferences can often be spotty, it would be a good idea to download these data sets before arriving at the conference. Please run python fetch_data.py to download all necessary data beforehand.

The download size of the data files is approximately 280 MB, and after fetch_data.py has extracted the data, the ./notebook/dataset folder will take up about 480 MB on your local solid state or hard drive.

Outline

Morning Session

  • 01 Introduction to machine learning with sample applications, Supervised and Unsupervised learning [[view](notebooks/01\ Introduction\ to\ Machine\ Learning.ipynb)]
  • 02 Scientific Computing Tools for Python: NumPy, SciPy, and matplotlib [[view](notebooks/02\ Scientific\ Computing\ Tools\ in\ Python.ipynb)]
  • 03 Data formats, preparation, and representation [[view](notebooks/03\ Data\ Representation\ for\ Machine\ Learning.ipynb)]
  • 04 Supervised learning: Training and test data [[view](notebooks/04\ Training\ and\ Testing\ Data.ipynb)]
  • 05 Supervised learning: Estimators for classification [[view](notebooks/05\ Supervised\ Learning\ -\ Classification.ipynb)]
  • 06 Supervised learning: Estimators for regression analysis [[view](notebooks/06\ Supervised\ Learning\ -\ Regression.ipynb)]
  • 07 Unsupervised learning: Unsupervised Transformers [[view](notebooks/07\ Unsupervised\ Learning\ -\ Transformations\ and\ Dimensionality\ Reduction.ipynb)]
  • 08 Unsupervised learning: Clustering [[view](notebooks/08\ Unsupervised\ Learning\ -\ Clustering.ipynb)]
  • 09 The scikit-learn estimator interface [[view](notebooks/09\ Review\ of\ Scikit-learn\ API.ipynb)]
  • 10 Preparing a real-world dataset (titanic) [[view](notebooks/10\ Case\ Study\ -\ Titanic\ Survival.ipynb)]
  • 11 Working with text data via the bag-of-words model [[view](notebooks/11\ Text\ Feature\ Extraction.ipynb)]
  • 12 Application: Text classification for SMS spam detection [[view](notebooks/12\ Case\ Study\ -\ SMS\ Spam\ Detection.ipynb)]

Afternoon Session

  • 13 Cross-Validation [[view](notebooks/13\ Cross\ Validation.ipynb)]
  • 14 Model complexity and grid search for adjusting hyperparameters [[view](notebooks/14\ Model\ Complexity\ and\ GridSearchCV.ipynb)]
  • 15 Scikit-learn Pipelines [[view](notebooks/15\ Pipelining\ Estimators.ipynb)]
  • 16 Supervised learning: Performance metrics for classification [[view](notebooks/16\ Performance\ metrics\ and\ Model\ Evaluation.ipynb)]
  • 17 Supervised learning: Linear Models [[view](notebooks/17\ In\ Depth\ -\ Linear\ Models.ipynb)]
  • 18 Supervised learning: Support Vector Machines [[view](notebooks/18\ In\ Depth\ -\ Support\ Vector\ Machines.ipynb)]
  • 19 Supervised learning: Decision trees and random forests, and ensemble methods [[view](notebooks/19\ In\ Depth\ -\ Trees\ and\ Forests.ipynb)]
  • 20 Supervised learning: feature selection [[view](notebooks/20\ Feature\ Selection.ipynb)]
  • 21 Unsupervised learning: Hierarchical and density-based clustering algorithms [[view](notebooks/21\ Unsupervised\ learning\ -\ Hierarchical\ and\ density-based\ clustering\ algorithms.ipynb)]
  • 22 Unsupervised learning: Non-linear dimensionality reduction [[view](notebooks/22\ Unsupervised\ learning\ -\ Non-linear\ dimensionality\ reduction.ipynb)]
  • 23 Supervised learning: Out-of-core learning [[view](notebooks/23\ Out-of-core\ Learning\ Large\ Scale\ Text\ Classification.ipynb)]

scipy-2016-sklearn's People

Contributors

amueller, kastnerkyle, nelson-liu, rasbt, rhiever, scw, stavxyz, w3d3


scipy-2016-sklearn's Issues

Add outro to slides

I think that a nice way to wrap up the workshop is to point the students to things they can do next, including books, online courses, projects to get involved in, etc.

You can see one example of such an outro in my recent workshop here (final section).

No SMS data

I also can't find the SMS dataset in the repo or fetch_data.py :(

Another renaming problem

Now I am confused:

The notebooks are:

13 Cross Validation.ipynb
14 Model Complexity and GridSearchCV
15 Pipelining Estimators.ipynb
16 Performance metrics and Model Evaluation.ipynb

The outline in the README is:

13 Cross-Validation
14 Model Complexity: Overfitting and underfitting
15 Grid search for adjusting hyperparameters
16 Scikit-learn Pipelines
17 Supervised learning: Performance metrics for classification

I assume you summarized

14 Model Complexity: Overfitting and underfitting
15 Grid search for adjusting hyperparameters

into

14 Model Complexity and GridSearchCV

??

We need another round of renaming, then. I can take care of that.

coordinate 7, 8, 9

They are currently weird. I'm not sure we should make this all about eigenfaces. I'd rather talk about more methods. But I'm not sure.

Add a section on automated machine learning

Perhaps the workshop already has too much material for the usual runtime (~8 hours), but how about a practical section on automated machine learning? I taught this section at the end of a version of this sklearn workshop, and it was quite well-received. I think it's a good topic to mention to beginners (even if just as a teaser).

pyyaml and Pillow dependency

FYI

  1. jupyter notebook fails while loading nbextensions with ImportError: No module named yaml. Fixed by pip-installing pyyaml.
  2. python fetch_data.py fails after downloading LFW data due to missing PIL library. Fixed by pip-installing Pillow (see Newmu/stylize#1)
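
For reference, the corresponding fixes are one pip command each:

pip install pyyaml
pip install Pillow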

Refine the explanation for the sklearn metrics section

Currently the sklearn metrics section discusses a whole bunch of metrics but doesn't seem to go into detail on why you would use one particular metric. One point I usually try to make about metrics is that the "correct" metric depends critically on your problem, e.g.,

  • if you're doing spam detection, maybe a FN isn't so bad, so you can use a metric that focuses on minimizing FP (i.e., not flagging legitimate mail)
  • but if you're doing cancer detection, a FN is disastrous, so you would use a metric that focuses on minimizing FN (even at the expense of more FP)
  • etc.

IMO it's a good idea to give students an intuition behind why we choose certain metrics in ML.
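
As a minimal sketch of how this intuition maps onto scikit-learn's metrics (the toy labels below are invented purely for illustration):

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]  # 1 = positive class (spam, or cancer)
y_pred = [1, 0, 0, 0, 0, 1]  # two false negatives, one false positive

# precision penalizes FP -- the quantity to watch in spam filtering
print(precision_score(y_true, y_pred))  # 0.5

# recall penalizes FN -- the quantity to watch in cancer screening
print(recall_score(y_true, y_pred))  # 0.333...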

move feature selection up?

Currently feature selection is after a bunch of unsupervised learning. I think it should be together with the supervised learning.

Bigram & character-level tokenization

Hi, Andy,
I am a bit confused about the bigram & character-level tokenization in sklearn (e.g., as shown in Nb 03.4) ... say we have the following text, tokenized as follows:

>>> X
['Some say the world will end in fire,', 'Some say in ice.']

>>> char_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
>>> char_vectorizer.fit(X)

>>> print(char_vectorizer.get_feature_names())
[' e', ' f', ' i', ' s', ' t', ' w', 'ay', 'ce', 'd ', 'e ', 'e,', 'e.', 'en', 'fi', 'he', 'ic', 'il', 'in', 'ir', 'l ', 'ld', 'll', 'me', 'n ', 'nd', 'om', 'or', 're', 'rl', 'sa', 'so', 'th', 'wi', 'wo', 'y ']

Why would we end up with these single characters when we set ngram_range=(2, 2)? I thought we'd only get those for, e.g., ngram_range=(1, x)?
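
(A quick sanity check suggests these are not actually single characters: entries such as ' e' are two-character n-grams with a leading or trailing space.)

>>> print(all(len(ngram) == 2 for ngram in char_vectorizer.get_feature_names()))
True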

!!! No titanic data

The Titanic dataset seems to be missing (notebook 10). Do you have it on your local drive? It's probably pretty small, so we could just add it to the repo instead of fetching it online.
