scikit-learn-mooc's Issues

Specify categories rather than handle_unknown for OneHotEncoder?

This would make it easier to switch from OneHotEncoder to OrdinalEncoder. At the moment, probably for historical reasons, we have a mix of handle_unknown and categories:

❯ git grep -iP 'onehotencoder.+categories' python_scripts/
python_scripts/04_parameter_tuning_sol_02.py:categorical_processor = OneHotEncoder(categories=categories)
❯ git grep -iP 'onehotencoder.+handle_unk' python_scripts/
python_scripts/03_categorical_pipeline.py:    OneHotEncoder(handle_unknown='ignore'),
python_scripts/03_categorical_pipeline_column_transformer.py:    ('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'),
python_scripts/03_categorical_pipeline_ex_02.py:# `OneHotEncoder(handle_unknown="ignore", sparse=False)` to force the use a
python_scripts/03_categorical_pipeline_sol_01.py:    OneHotEncoder(handle_unknown="ignore"),
python_scripts/03_categorical_pipeline_sol_02.py:# `OneHotEncoder(handle_unknown="ignore", sparse=False)` to force the use a
python_scripts/03_categorical_pipeline_sol_02.py:     OneHotEncoder(handle_unknown="ignore", sparse=False),
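
For illustration, a minimal sketch of the proposed approach (the DataFrame and column are hypothetical): deriving the categories from the data lets the same argument work for both encoders, so switching becomes a one-line change.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data for illustration
data = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
categories = [data[column].unique() for column in data.columns]

encoder = OneHotEncoder(categories=categories)
# Switching is now a one-line change:
# encoder = OrdinalEncoder(categories=categories)
encoder.fit(data)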

Toggle navigation not working on pages with slides

https://inria.github.io/scikit-learn-mooc/linear_models/slides.html

Clicking on "Toggle navigation" icon (left arrow on the top left):
image

does not do anything. It does work on any other page I tried e.g.https://inria.github.io/scikit-learn-mooc/python_scripts/linear_models.html.

On the console:

Uncaught TypeError: $(...).tooltip is not a function
    initTooltips https://inria.github.io/scikit-learn-mooc/_static/sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:8
    jQuery 9
    initTooltips https://inria.github.io/scikit-learn-mooc/_static/sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:8
    sbRunWhenDOMLoaded https://inria.github.io/scikit-learn-mooc/_static/sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:6
    <anonymous> https://inria.github.io/scikit-learn-mooc/_static/sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:14
sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:8:81

Maybe there is some conflict between remark.js and the sphinx-book-theme JavaScript?

Not very important, but flagging it in case someone has suggestions.

My notes about possible improvements from the EuroSciPy tutorial

Moved from: lesteve/scikit-learn-tutorial#3

This is not very structured, so feel free to edit, comment, open other issues for bigger chunks of work:

Content

  • have a TOC per notebook?

  • a tinyurl (or huit.re) link for easier access to the GitHub repo (README)

  • a first notebook with a TOC so that Binder opens directly on this notebook

  • Should we get rid of education-num everywhere, since this is the same as
    education?

  • Put solutions in a different folder? They interfere with the notebooks: you
    have to say "open the 02 notebook, but not the one with the exercise" ...

  • naming: harmonize df vs adult_census; maybe data is good enough.

  • Can we have links in notebooks to another notebook that work locally, on
    Binder, on the MOOC platform, etc.?

  • Question about the pipeline with the scaler: does it compute the mean on the
    training set? You have to explain how the Pipeline works, i.e. that it calls
    .fit and .transform. Maybe you don't have to explain everything; you can just
    say the parameters are modified only in .fit (so not in .predict). See the
    sketch after this list.

  • Question about the pipeline: why is it useful rather than just writing the
    code yourself? You have to explain .fit and .fit_transform. Hmmm, maybe you
    can just add a comment about why the Pipeline is useful in general.

  • minor: sparse=False in OneHotEncoder just for visualization purposes (to see the numpy
    array).

  • For exercises, add a link to the similar example, e.g. for OrdinalEncoder add
    a link to what we did with OneHotEncoder.

  • Too-wide code: the numerical_columns and categorical_columns lists should wrap
    at 'capital-loss' and 'marital-status'. I think we should use a dedicated
    formatter, maybe black (though I feel it takes too much vertical space) or
    maybe yapf with some nice settings.

  • Say that education-num is not the number of years of education (I said that we
    could expect this, but I did not say it was not true).

  • Young people work part-time. Say that non-working people (students) are not
    part of the survey.

  • Harmonize the way we get categorical_columns vs numerical_columns. Some code
    uses dtypes, some uses explicit column names.

  • Side comments about the train-test split: the goal is not to memorize. Should
    there be more detail for the MOOC? Or links to the first part about
    overfitting vs underfitting?

  • 02 exercise 01: use train_test_split rather than cross_val_score

  • Different kinds of preprocessing: add a link to the user guide. A question
    was: what happens if the data is not Gaussian?

  • n_iter_ is a list for some reason ...

  print(
    f"The accuracy using a {model.__class__.__name__} is "
    f"{model.score(data_test, target_test):.3f} with a fitting time of "
    f"{elapsed_time:.3f} seconds in {model[-1].n_iter_} iterations"
  )
  The accuracy using a Pipeline is 0.818 with a fitting time of 0.809 seconds in [13] iterations

  • Cross-validation explanation plot: add a legend for blue vs red. It looks like
    there might be better images in the scikit-learn documentation.

  • handle_unknown='ignore': explain the reason more: it encodes a category as all
    zeros at test time when it has not been seen in the training data.
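
To illustrate the scaler question above, a minimal sketch (with made-up data) of
what Pipeline does: the scaler's statistics are computed only during .fit, and
.predict reuses them through .transform without refitting.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0, 0, 1, 1])

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)  # scaler.fit_transform, then the classifier's .fit

print(model[0].mean_)  # [2.5]: the mean was computed on the training set only
model.predict(np.array([[10.0]]))  # scaler.transform (no refitting), then .predict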

Miscellaneous

  • Timings are very slow on Binder:
    0.7 s for a LogisticRegression fit locally vs 5.6 s on Binder;
    2 minutes (rather than ~10 s on Olivier's machine) for the reference pipeline
    (no numerical scaling and integer-coded categories) in
    02_basic_preprocessing_exercise_03_solution.ipynb

Add details by comparing MAE and MSE

In the regression metrics notebook, we are missing an explanation of the differences between MSE and MAE.
I would expect to show the quadratic and linear curves and explain the effect of a big outlier on both metrics.
I think it can be explained through graphics in an intuitive manner.
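
As a starting point, a small numeric sketch (with made-up values) of the effect
described above: a single large outlier inflates the MSE far more than the MAE.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 6.9, 9.2])

print(mean_absolute_error(y_true, y_pred))  # 0.15
print(mean_squared_error(y_true, y_pred))   # 0.025

y_pred[0] = 30.0  # one big outlier in the predictions
print(mean_absolute_error(y_true, y_pred))  # 6.85 (~46x larger)
print(mean_squared_error(y_true, y_pred))   # 182.265 (~7300x larger)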

Fix header warning

In /home/lesteve/dev/scikit-learn-mooc/jupyter-book/python_scripts/linear_regression_non_linear_link.py:

WARNING: Non-consecutive header level increase; 0 to 3

Probably this happened after splitting the notebook.

Which datasets to use?

I am opening this issue so that we can settle on the datasets to use across the entire MOOC.
I think we now have enough notebooks to have a good idea of the types of datasets required to present the different concepts.
It is a follow-up to #97 (comment).

The current state is the following:

adult_census

  • local CSV
  • large number of samples
  • classification
  • numerical + categorical features
  • contains NA
  • contains rare categories
  • used in all 4 beginners' notebooks (as is)

california_housing

  • sklearn fetcher
  • large number of samples
  • regression
  • all numerical
  • no NA
  • used in cross-validation (as is), ensemble (as is), linear model (as is) notebooks

penguins

  • local CSV
  • small number of samples
  • classification
  • numerical + categorical features
  • contains NA
  • used in ensemble (subset of numerical features / drop NA), linear_model (subset of features / drop NA / regression + classification), trees (subset of features / drop NA / regression + classification)

blood-transfusion-service-center

  • openml
  • classification
  • numerical
  • no NA
  • imbalanced without processing
  • used in metrics

Ames

  • openml
  • regression
  • categorical + numerical
  • no NA
  • only numerical columns used -> of interest for the non-Gaussianity of the target

synthetic dataset

  • numpy
  • numerical
  • used in feature selection to create a large number of synthetic features

make_classification, make_moons, make_gaussian_quantiles

  • sklearn
  • numerical
  • used in feature selection to create a large number of synthetic features (make_classification)
  • used in linear models for intuitive 2D non-linear datasets.
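
For reference, a sketch of how these datasets are typically loaded (the local CSV
path is an assumption based on the repo layout):

import pandas as pd
from sklearn.datasets import fetch_california_housing, fetch_openml, make_moons

adult_census = pd.read_csv("../datasets/adult-census.csv")  # local CSV (assumed path)
housing = fetch_california_housing(as_frame=True)  # scikit-learn fetcher
blood = fetch_openml(name="blood-transfusion-service-center", as_frame=True)  # OpenML
X, y = make_moons(n_samples=100, noise=0.1)  # synthetic generator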

Use some CSS to simplify presenter mode for slides

Presenter mode is an easy way to show notes for remark.js slides. However, it shows things that we probably don't want someone who is just trying to read the slides to see:

  • next slide and next slide notes
  • timer

[screenshot: presenter mode showing the next-slide preview, notes, and timer]

@brospars if you had some time to look at this, it would be greatly appreciated!

Playing a bit with my modest CSS skills, I can remove these elements with the following CSS:

.remark-notes-area .remark-bottom-area .remark-notes-preview-area {
    display: none;
}

.remark-preview-area .remark-slide-container {
    display: none;
}

.remark-toolbar .remark-toolbar-timer {
    display: none;
}

[screenshot: presenter mode after applying the CSS]

It would be better if the main slide took up the whole height though, and I have to say I don't really know how easy that is.

Adapt first notebooks with KNearestNeighbors / LogisticRegression

The idea is to have

  • a first simple notebook with KNearestNeighbors, as done by @GaelVaroquaux, i.e. separate CSVs for train and test with numerical-only features
  • a second notebook as we had, with the full dataset + pandas slicing to select numerical features + train_test_split + the LogisticRegression warning
  • add a few words about LogisticRegression to give some simple intuition like 0.1 * age + 3.3 * education-num > 0.5
  • an exercise with n_neighbors=1 yielding 100% training accuracy (see the sketch after this list)
  • also #108, to say that KNearestNeighbors is used for didactic purposes and may not be very useful in practice. It is probably not worth putting in the notebook, but I am also curious about the accuracy of KNearestNeighbors compared to the DummyClassifier, to be honest.
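
A quick sketch (with made-up data) of the n_neighbors=1 exercise idea: a
1-nearest-neighbor model memorizes the training set, so its training accuracy is
trivially 100%.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data for illustration
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(model.score(X, y))  # 1.0: each training point is its own nearest neighbor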

We might want to use `make_column_selector`, which might be more explicit in the construction since we specify the dtype better:

from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_include=["int", "float"])
numerical_columns = numerical_columns_selector(data)

Originally posted by @glemaitre in https://github.com/INRIA/scikit-learn-mooc/pull/13/files

Terminology consistency

This might be a nitpick, but I think being consistent in the terminology used in the notebooks would be helpful for learners (the people 😛). We should, of course, note the other names something is called, but be consistent within the notebooks. Potentially we could use terms consistent with those used in scikit-learn?

Off the top of my head these are used a little interchangeably:

  • estimator/learner/model
  • fit/learn/train
  • data point/sample/instance (I vote for not using instance, because we also talk about programming class instances)
  • infer/inference & predict/prediction

Suggestions for the linear notebook

These suggestions come from PR #37:

  • We use the terms formula/formulate, rule, and model (sometimes a bit interchangeably), but we never explicitly explain the relationship between them or what a model is, e.g. that a model is a formula relating features to the target. We do talk about it a bit in the first notebook, but it does not clearly relate 'model' to 'rule':
    https://github.com/INRIA/scikit-learn-mooc/blob/master/python_scripts/01_tabular_data_exploration.py#L202-203

  • We switch between using the terms infer/inference and predict/prediction. This is a nitpick, but I think we should
    consistently use only one term, or explain that they mean the same thing.

  • It might not be necessary, but should we clarify that 'curve' here just means 'line' and does not need to be curved, i.e. it can be a straight line?

  • (l.352) Shall we already talk about kernels here? If so, it might require more description of what a kernel is. More description of the 'decision function' would be nice too.

  • (l.866) Would it be worth giving the logistic regression model equation or explaining more about these coefficients? (A sketch of what this could show follows below.)
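
A sketch (with made-up data) of what explaining the coefficients could show: the
decision function is a weighted sum of the features plus an intercept, and the
logistic sigmoid of that sum gives the predicted probability.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data for illustration
rng = np.random.RandomState(0)
X = rng.randn(50, 2)
y = (X[:, 0] - 2 * X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)
linear_part = X @ model.coef_.ravel() + model.intercept_  # w . x + b
proba = 1 / (1 + np.exp(-linear_part))  # logistic sigmoid

# Matches predict_proba for the positive class:
assert np.allclose(proba, model.predict_proba(X)[:, 1])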

Reconsider the use of pandas_profiling

The dependencies of pandas_profiling are crazy. It adds difficulty and risk to running the tutorial. Virtual environments are not well mastered by non-experts, in particular in connection with Jupyter notebooks.

Slow notebooks inside GitHub Actions

https://inria.github.io/scikit-learn-mooc/notebook_timings.html

https://313-246063957-gh.circle-artifacts.com/0/jupyter-book/notebook_timings.html

Notebook                                     CircleCI   GH Action
python_scripts/04_basic_parameters_tuning   50 s       390 s
python_scripts/ensemble                     50 s       150 s

Maybe @ogrisel has some suggestions on this one (oversubscription due to the number of CPUs not being correctly detected)?

According to: https://docs.github.com/en/free-pro-team@latest/actions/reference/specifications-for-github-hosted-runners#supported-runners-and-hardware-resources

Each virtual machine has the same hardware resources available.

  • 2-core CPU
  • 7 GB of RAM memory
  • 14 GB of SSD disk space
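
If oversubscription is indeed the cause, one possible mitigation (an assumption,
not something decided in this issue) is to cap the thread counts explicitly so
BLAS/OpenMP do not spawn more threads than the runner's 2 cores:

import os

# Must be set before numpy/scipy/scikit-learn are imported.
os.environ["OMP_NUM_THREADS"] = "2"
os.environ["OPENBLAS_NUM_THREADS"] = "2"
os.environ["MKL_NUM_THREADS"] = "2"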

Check the need for max_iter in LogisticRegression

The need for increasing max_iter will depend on the categorical features.
I am modifying some code right now, but I think we should make another pass to check where this warning is issued.

Improvements once 0.24 is released

Once 0.24 is released, we should make the following improvements:

  • Use the MAPE whenever it makes sense.
  • Check again the plotting function using the new API.
  • No need to specify the categories in advance in the OrdinalEncoder; use rare categories instead. We need to postpone this until OneHotEncoder supports such simple processing as well.
  • Add references to some scikit-learn examples (interpretation of linear coefficients, common pitfalls, limitations of feature importances, etc.).

Suggestions

Moved from lesteve/scikit-learn-tutorial#20 by @lucyleeow.

I have some minor suggestions:

01_tabular_data_exploration:

  • adult_census.profile_report() tells us that there are a few duplicate rows. It may be worthwhile to explain how these duplicate entries may (or may not) affect prediction.

02_basic_preprocessing:

  • Convergence warning: you explain that this tells us that our model stopped learning because it reached the maximum number of iterations allowed, and that scaling the data will help. Can you expand on what convergence means, why increasing the number of allowed iterations is a bad idea, and why scaling the data helps?
  • Explain what the StandardScaler does? Maybe not everyone knows the equation (see the quick sketch below).
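
For the record, a quick sketch (with made-up data) of what StandardScaler
computes: z = (x - mean) / std, with the mean and standard deviation estimated on
the training data.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration
X = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler().fit(X)
z = (X - scaler.mean_) / scaler.scale_  # scale_ is the per-feature std
assert np.allclose(z, scaler.transform(X))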

04_basic_parameters_tuning:

  • I think you need to explain more about the hyper-parameter C. Maybe even just give them a useful link to read on regularisation and overfitting?
  • For the last cell:
# imports added for context; preprocessor, data and target come from earlier cells
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
model = make_pipeline(
    preprocessor, LogisticRegressionCV(max_iter=1000, solver='lbfgs', cv=5)
)
score = cross_val_score(model, data, target, n_jobs=4, cv=5)

You don't provide a Cs argument like you do above, and it might be worth mentioning that by default it tests a grid of 10 C values.

Fix remarkjs-pdf timeout

I was annoyed to have the CI red because of the remarkjs-pdf timeout, so I commented it out in b92c7cf.

If you have time to look at it @brospars, maybe patching remarkjs-pdf to have a bigger timeout (as you did in your PR to remarkjs-pdf), or using decktape to generate the PDF (as mentioned in the remark.js doc), that would be much appreciated.

Originally posted by @lesteve in #14 (comment)

Create index.md to navigate notebooks while giving the course

We have more and more notebooks, which makes it harder and harder to tell people "open this notebook, but not this one whose filename is almost the same". Also, with long filenames in videoconference settings, people don't see the full filename (in JupyterLab at least), and the filenames all start the same.

We should have an index.md file that makes it easier to navigate between notebooks. We would always come back to this index.md file at the end of each notebook and say "OK, we did this notebook and now we are going to do this one", click, and off we go.

The binder link in the README could go to this index.

In an ideal world, index.md could be generated from the _toc.yml file, but I think it would be OK to write it manually at first to test the idea.
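
A hypothetical sketch of what that generation could look like (the _toc.yml
structure assumed here follows jupyter-book conventions and should be checked
against the actual file):

import yaml  # requires pyyaml

with open("jupyter-book/_toc.yml") as f:
    toc = yaml.safe_load(f)  # assumed: a list of {"file": ...} entries

with open("notebooks/index.md", "w") as f:
    f.write("# Notebook index\n\n")
    for entry in toc:
        if isinstance(entry, dict) and "file" in entry:
            name = entry["file"]
            f.write(f"- [{name}]({name}.ipynb)\n")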

Hypothesis usage

Just an FYI: I added Hypothesis support in 56ae8b3.

This may be useful to add quick notes/comments while reading the content without leaving the page. I am not really sure how this would interact with GitHub issues; we'll see.

If you select some text, you get some icons to annotate/highlight. Here is a screenshot:

[screenshot: the annotate/highlight icons shown when selecting text]

You need to create an account at https://hypothes.is to be able to create annotations.

Pair .py files to notebooks or rendered_notebooks folder?

The python_scripts/bla.py files are still paired to notebooks/bla.ipynb. That means that, when using the Jupytext extension in JupyterLab or the Jupyter notebook server, opening the .py file as a notebook and saving it will write the companion notebook to notebooks/bla.ipynb.

Should python_scripts/bla.py be paired with rendered_notebooks/bla.ipynb instead?

This feels more consistent and would allow regenerating the rendered notebooks from Jupyter rather than through the Makefile.

For the people most likely to use "Open .py file as notebook", @GaelVaroquaux @ogrisel @glemaitre, let me know what you think!
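
For what it's worth, a hedged sketch of how the re-pairing could be done with the
Jupytext CLI (the directory-pairing syntax below is my reading of the Jupytext
docs and should be double-checked):

import subprocess

# Hypothetical: re-pair python_scripts/bla.py with rendered_notebooks/bla.ipynb
# instead of notebooks/bla.ipynb.
subprocess.run(
    [
        "jupytext", "--set-formats",
        "rendered_notebooks//ipynb,python_scripts//py:percent",
        "python_scripts/bla.py",
    ],
    check=True,
)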

Reduce boilerplate

Being didactic requires focusing on the important things and avoiding side messages.

The notebooks have too much boilerplate:

  • calls to IPython.display
  • mpl.rcParams
  • plotting functions such as plot_tree_decision_function

I'm not quite sure what the best way to address this problem is. Maybe defining helper functions in a module could help?
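
As one option, a hypothetical helpers module sketch, so that each notebook starts
with a single import instead of repeating the setup:

# helpers.py (hypothetical module name)
import matplotlib as mpl

def setup_plots():
    """Apply the shared plotting defaults used across the notebooks."""
    mpl.rcParams["figure.figsize"] = (8, 6)
    mpl.rcParams["font.size"] = 12

# In a notebook:
#     from helpers import setup_plots
#     setup_plots()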

Glossary?

Would it be useful to have a glossary? Maybe it could help remind people of the terms and the terminology we use, e.g.,

here we state that we will use the terms 'hyper-parameter' and 'parameter' interchangeably:

# In this notebook we will use the words "hyper-parameters" and "parameters" interchangeably

but this kind of thing may be easy to forget as you work through the material.

Rename intro notebooks

I find the current names not descriptive enough. I think we should rename "basic" to "intro" in general and more specifically:

  • 02_basic_preprocessing => 02_numerical_pipeline
  • 03_basic_categorical_variables => 03_categorical_pipeline
  • 04_basic_parameter_tuning => 04_parameter_tuning

For the exercises & solutions we could shorten the names to something like:

  • 02_numerical_pipeline_E01.ipynb
  • 02_numerical_pipeline_S01.ipynb

The lexicographical order should still be good.
