scikit-learn-mooc's Issues

Specify categories rather than handle_unknown for OneHotEncoder?

This would make it easier to switch from OneHotEncoder to OrdinalEncoder. At the moment, probably for historical reasons, we have a mix of handle_unknown and categories:

❯ git grep -iP 'onehotencoder.+categories' python_scripts/
python_scripts/04_parameter_tuning_sol_02.py:categorical_processor = OneHotEncoder(categories=categories)
❯ git grep -iP 'onehotencoder.+handle_unk' python_scripts/
python_scripts/03_categorical_pipeline.py:    OneHotEncoder(handle_unknown='ignore'),
python_scripts/03_categorical_pipeline_column_transformer.py:    ('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'),
python_scripts/03_categorical_pipeline_ex_02.py:# `OneHotEncoder(handle_unknown="ignore", sparse=False)` to force the use a
python_scripts/03_categorical_pipeline_sol_01.py:    OneHotEncoder(handle_unknown="ignore"),
python_scripts/03_categorical_pipeline_sol_02.py:# `OneHotEncoder(handle_unknown="ignore", sparse=False)` to force the use a
python_scripts/03_categorical_pipeline_sol_02.py:     OneHotEncoder(handle_unknown="ignore", sparse=False),
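
For illustration, a minimal sketch of the proposed approach (the DataFrame and column are hypothetical): deriving the categories from the data lets the same argument work for both encoders, so switching becomes a one-line change.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data for illustration
data = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
categories = [data[column].unique() for column in data.columns]

encoder = OneHotEncoder(categories=categories)
# Switching is now a one-line change:
# encoder = OrdinalEncoder(categories=categories)
encoder.fit(data)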

Toggle navigation not working on pages with slides

https://inria.github.io/scikit-learn-mooc/linear_models/slides.html

Clicking on "Toggle navigation" icon (left arrow on the top left):
image

does not do anything. It does work on any other page I tried e.g.https://inria.github.io/scikit-learn-mooc/python_scripts/linear_models.html.

On the console:

Uncaught TypeError: $(...).tooltip is not a function
    initTooltips https://inria.github.io/scikit-learn-mooc/_static/sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:8
    jQuery 9
    initTooltips https://inria.github.io/scikit-learn-mooc/_static/sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:8
    sbRunWhenDOMLoaded https://inria.github.io/scikit-learn-mooc/_static/sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:6
    <anonymous> https://inria.github.io/scikit-learn-mooc/_static/sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:14
sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:8:81

Maybe there is some conflict between remark.js and the sphinx-book-theme JavaScript?

Not very important, but flagging it in case someone has suggestions.

My notes about possible improvements from the EuroSciPy tutorial

Moved from: lesteve/scikit-learn-tutorial#3

This is not very structured, so feel free to edit, comment, open other issues for bigger chunks of work:

Content

  • have a TOC per notebook?

  • a tinyurl (or huit.re) link for easier access to the GitHub repo (README)

  • a first notebook with a TOC so that Binder opens directly on this notebook

  • Should we get rid of education-num everywhere, since this is the same as
    education?

  • Put solutions in a different folder? They interfere with the notebooks: you
    have to say "open the 02 notebook, but not the one with the exercise" ...

  • naming: harmonize df vs adult_census; maybe data is good enough.

  • Can we have links in notebooks to another notebook that work locally, on
    Binder, on the MOOC platform, etc.?

  • Question about the pipeline with the scaler: does it compute the mean on the
    training set? You have to explain how the Pipeline works, i.e. that it calls
    .fit and .transform. Maybe you don't have to explain everything; you can just
    say the parameters are modified only in .fit (so not in .predict). See the
    sketch after this list.

  • Question about the pipeline: why is it useful rather than just writing the
    code yourself? You have to explain .fit and .fit_transform. Hmmm, maybe you
    can just add a comment about why the Pipeline is useful in general.

  • minor: sparse=False in OneHotEncoder just for visualization purposes (to see the numpy
    array).

  • For exercises, add a link to the similar example, e.g. for OrdinalEncoder add
    a link to what we did with OneHotEncoder.

  • Too-wide code: the numerical_columns and categorical_columns lists should wrap
    at 'capital-loss' and 'marital-status'. I think we should use a dedicated
    formatter, maybe black (though I feel it takes too much vertical space) or
    maybe yapf with some nice settings.

  • Say that education-num is not the number of years of education (I said that we
    could expect this, but I did not say it was not true).

  • Young people work part-time. Say that non-working people (students) are not
    part of the survey.

  • Harmonize the way we get categorical_columns vs numerical_columns. Some code
    uses dtypes, some uses explicit column names.

  • Side comments about the train-test split: the goal is not to memorize. Should
    there be more detail for the MOOC? Or links to the first part about
    overfitting vs underfitting?

  • 02 exercise 01: use train_test_split rather than cross_val_score

  • Different kinds of preprocessing: add a link to the user guide. A question
    was: what happens if the data is not Gaussian?

  • n_iter_ is a list for some reason ...

  print(
    f"The accuracy using a {model.__class__.__name__} is "
    f"{model.score(data_test, target_test):.3f} with a fitting time of "
    f"{elapsed_time:.3f} seconds in {model[-1].n_iter_} iterations"
  )
  The accuracy using a Pipeline is 0.818 with a fitting time of 0.809 seconds in [13] iterations

  • Cross-validation explanation plot: add a legend for blue vs red. It looks like
    there might be better images in the scikit-learn documentation.

  • handle_unknown='ignore': explain the reason more: it encodes a category as all
    zeros at test time when it has not been seen in the training data.
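
To illustrate the scaler question above, a minimal sketch (with made-up data) of
what Pipeline does: the scaler's statistics are computed only during .fit, and
.predict reuses them through .transform without refitting.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0, 0, 1, 1])

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)  # scaler.fit_transform, then the classifier's .fit

print(model[0].mean_)  # [2.5]: the mean was computed on the training set only
model.predict(np.array([[10.0]]))  # scaler.transform (no refitting), then .predict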

Miscellaneous

  • Timings are very slow on Binder:
    0.7 s for a LogisticRegression fit locally vs 5.6 s on Binder;
    2 minutes (rather than ~10 s on Olivier's machine) for the reference pipeline
    (no numerical scaling and integer-coded categories) in
    02_basic_preprocessing_exercise_03_solution.ipynb

Add details by comparing MAE and MSE

In the regression metrics notebook, we are missing an explanation of the differences between MSE and MAE.
I would expect to show the quadratic and linear curves and explain the effect of a big outlier on both metrics.
I think it can be explained through graphics in an intuitive manner.
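
As a starting point, a small numeric sketch (with made-up values) of the effect
described above: a single large outlier inflates the MSE far more than the MAE.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 6.9, 9.2])

print(mean_absolute_error(y_true, y_pred))  # 0.15
print(mean_squared_error(y_true, y_pred))   # 0.025

y_pred[0] = 30.0  # one big outlier in the predictions
print(mean_absolute_error(y_true, y_pred))  # 6.85 (~46x larger)
print(mean_squared_error(y_true, y_pred))   # 182.265 (~7300x larger)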

Fix header warning

In /home/lesteve/dev/scikit-learn-mooc/jupyter-book/python_scripts/linear_regression_non_linear_link.py:

WARNING: Non-consecutive header level increase; 0 to 3

Probably this happened after splitting the notebook.

Which datasets to use?

I am opening this issue so that we can settle on the datasets to use across the entire MOOC.
I think we now have enough notebooks to have a good idea of the types of datasets required to present the different concepts.
It is a follow-up to #97 (comment).

The current state is the following:

adult_census

  • local CSV
  • large number of samples
  • classification
  • numerical + categorical features
  • contains NA
  • contains rare categories
  • used in all 4 beginners' notebooks (as is)

california_housing

  • sklearn fetcher
  • large number of samples
  • regression
  • all numerical
  • no NA
  • used in cross-validation (as is), ensemble (as is), linear model (as is) notebooks

penguins

  • local CSV
  • small number of samples
  • classification
  • numerical + categorical features
  • contains NA
  • used in ensemble (subset of numerical features / drop NA), linear_model (subset of features / drop NA / regression + classification), trees (subset of features / drop NA / regression + classification)

blood-transfusion-service-center

  • openml
  • classification
  • numerical
  • no NA
  • imbalanced without processing
  • used in metrics

Ames

  • openml
  • regression
  • categorical + numerical
  • no NA
  • only numerical columns used -> of interest for the non-Gaussianity of the target

synthetic dataset

  • numpy
  • numerical
  • used in feature selection to create a large number of synthetic features

make_classification, make_moons, make_gaussian_quantiles

  • sklearn
  • numerical
  • used in feature selection to create a large number of synthetic features (make_classification)
  • used in linear models for intuitive 2D non-linear datasets.
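
For reference, a sketch of how these datasets are typically loaded (the local CSV
path is an assumption based on the repo layout):

import pandas as pd
from sklearn.datasets import fetch_california_housing, fetch_openml, make_moons

adult_census = pd.read_csv("../datasets/adult-census.csv")  # local CSV (assumed path)
housing = fetch_california_housing(as_frame=True)  # scikit-learn fetcher
blood = fetch_openml(name="blood-transfusion-service-center", as_frame=True)  # OpenML
X, y = make_moons(n_samples=100, noise=0.1)  # synthetic generator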

Use some CSS to simplify presenter mode for slides

Presenter mode is an easy way to show notes for remark.js slides. However, it shows things that we probably don't want someone who is just trying to read the slides to see:

  • next slide and next slide notes
  • timer

[screenshot: presenter mode showing the next-slide preview, notes, and timer]

@brospars if you had some time to look at this, it would be greatly appreciated!

Playing a bit with my modest CSS skills, I can remove these elements with the following CSS:

.remark-notes-area .remark-bottom-area .remark-notes-preview-area {
    display: none;
}

.remark-preview-area .remark-slide-container {
    display: none;
}

.remark-toolbar .remark-toolbar-timer {
    display: none;
}

[screenshot: presenter mode after applying the CSS]

It would be better if the main slide took up the whole height though, and I have to say I don't really know how easy that is.

Adapt first notebooks with KNearestNeighbors / LogisticRegression

The idea is to have

  • a first simple notebook with KNearestNeighbors, as done by @GaelVaroquaux, i.e. separate CSVs for train and test with numerical-only features
  • a second notebook as we had, with the full dataset + pandas slicing to select numerical features + train_test_split + the LogisticRegression warning
  • add a few words about LogisticRegression to give some simple intuition like 0.1 * age + 3.3 * education-num > 0.5
  • an exercise with n_neighbors=1 yielding 100% training accuracy (see the sketch after this list)
  • also #108, to say that KNearestNeighbors is used for didactic purposes and may not be very useful in practice. It is probably not worth putting in the notebook, but I am also curious about the accuracy of KNearestNeighbors compared to the DummyClassifier, to be honest.
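
A quick sketch (with made-up data) of the n_neighbors=1 exercise idea: a
1-nearest-neighbor model memorizes the training set, so its training accuracy is
trivially 100%.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data for illustration
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(model.score(X, y))  # 1.0: each training point is its own nearest neighbor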

We might want to use `make_column_selector`, which might be more explicit in the construction since we specify the dtype better:

from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_include=["int", "float"])
numerical_columns = numerical_columns_selector(data)

Originally posted by @glemaitre in https://github.com/INRIA/scikit-learn-mooc/pull/13/files

Terminology consistency

This might be a nitpick, but I think being consistent in the terminology used in the notebooks would be helpful for learners (the people 😛). We should, of course, note the other names something is called, but be consistent within the notebooks. Potentially we could use terms consistent with those used in scikit-learn?

Off the top of my head these are used a little interchangeably:

  • estimator/learner/model
  • fit/learn/train
  • data point/sample/instance (I vote for not using instance, because we also talk about programming class instances)
  • infer/inference & predict/prediction

Suggestions for the linear notebook

These suggestions come from PR #37:

  • We use the terms formula/formulate, rule, and model (sometimes a bit interchangeably), but we never explicitly explain the relationship between them or what a model is, e.g. that a model is a formula relating features to the target. We do talk about it a bit in the first notebook, but it does not clearly relate 'model' to 'rule':
    https://github.com/INRIA/scikit-learn-mooc/blob/master/python_scripts/01_tabular_data_exploration.py#L202-203

  • We switch between using the terms infer/inference and predict/prediction. This is a nitpick, but I think we should
    consistently use only one term, or explain that they mean the same thing.

  • It might not be necessary, but should we clarify that 'curve' here just means 'line' and does not need to be curved, i.e. it can be a straight line?

  • (l.352) Shall we already talk about kernels here? If so, it might require more description of what a kernel is. More description of the 'decision function' would be nice too.

  • (l.866) Would it be worth giving the logistic regression model equation or explaining more about these coefficients? (A sketch of what this could show follows below.)
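
A sketch (with made-up data) of what explaining the coefficients could show: the
decision function is a weighted sum of the features plus an intercept, and the
logistic sigmoid of that sum gives the predicted probability.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data for illustration
rng = np.random.RandomState(0)
X = rng.randn(50, 2)
y = (X[:, 0] - 2 * X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)
linear_part = X @ model.coef_.ravel() + model.intercept_  # w . x + b
proba = 1 / (1 + np.exp(-linear_part))  # logistic sigmoid

# Matches predict_proba for the positive class:
assert np.allclose(proba, model.predict_proba(X)[:, 1])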

Reconsider the use of pandas_profiling

The dependencies of pandas_profiling are crazy. It adds difficulty and risk to running the tutorial. Virtual environments are not well mastered by non-experts, in particular in connection with Jupyter notebooks.

Slow notebooks inside GitHub Actions

https://inria.github.io/scikit-learn-mooc/notebook_timings.html

https://313-246063957-gh.circle-artifacts.com/0/jupyter-book/notebook_timings.html

Notebook                                     CircleCI   GH Action
python_scripts/04_basic_parameters_tuning   50 s       390 s
python_scripts/ensemble                     50 s       150 s

Maybe @ogrisel has some suggestions on this one (oversubscription due to the number of CPUs not being correctly detected)?

According to: https://docs.github.com/en/free-pro-team@latest/actions/reference/specifications-for-github-hosted-runners#supported-runners-and-hardware-resources

Each virtual machine has the same hardware resources available.

  • 2-core CPU
  • 7 GB of RAM memory
  • 14 GB of SSD disk space
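
If oversubscription is indeed the cause, one possible mitigation (an assumption,
not something decided in this issue) is to cap the thread counts explicitly so
BLAS/OpenMP do not spawn more threads than the runner's 2 cores:

import os

# Must be set before numpy/scipy/scikit-learn are imported.
os.environ["OMP_NUM_THREADS"] = "2"
os.environ["OPENBLAS_NUM_THREADS"] = "2"
os.environ["MKL_NUM_THREADS"] = "2"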

Check the need for max_iter in LogisticRegression

The need for increasing max_iter will depend on the categorical features.
I am modifying some code right now, but I think we should make another pass to check where this warning is issued.

Improvements once 0.24 is released

Once 0.24 is released, we should make the following improvements:

  • Use the MAPE whenever it makes sense.
  • Check again the plotting function using the new API.
  • No need to specify the categories in advance in the OrdinalEncoder; use rare categories instead. We need to postpone this until OneHotEncoder supports such simple processing as well.
  • Add references to some scikit-learn examples (interpretation of linear coefficients, common pitfalls, limitations of feature importances, etc.).

Suggestions

Moved from lesteve/scikit-learn-tutorial#20 by @lucyleeow.

I have some minor suggestions:

01_tabular_data_exploration:

  • adult_census.profile_report() tells us that there are a few duplicate rows. It may be worthwhile to explain how these duplicate entries may (or may not) affect prediction.

02_basic_preprocessing:

  • Convergence warning: you explain that this tells us that our model stopped learning because it reached the maximum number of iterations allowed, and that scaling the data will help. Can you expand on what convergence means, why increasing the number of allowed iterations is a bad idea, and why scaling the data helps?
  • Explain what the StandardScaler does? Maybe not everyone knows the equation (see the quick sketch below).
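
For the record, a quick sketch (with made-up data) of what StandardScaler
computes: z = (x - mean) / std, with the mean and standard deviation estimated on
the training data.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration
X = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler().fit(X)
z = (X - scaler.mean_) / scaler.scale_  # scale_ is the per-feature std
assert np.allclose(z, scaler.transform(X))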

04_basic_parameters_tuning:

  • I think you need to explain more about the hyper-parameter C. Maybe even just give them a useful link to read on regularisation and overfitting?
  • For the last cell:
# imports added for context; preprocessor, data and target come from earlier cells
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
model = make_pipeline(
    preprocessor, LogisticRegressionCV(max_iter=1000, solver='lbfgs', cv=5)
)
score = cross_val_score(model, data, target, n_jobs=4, cv=5)

You don't provide a Cs argument like you do above, and it might be worth mentioning that by default it tests a grid of 10 C values.

Fix remarkjs-pdf timeout

I was annoyed to have the CI red because of the remarkjs-pdf timeout, so I commented it out in b92c7cf.

If you have time to look at it @brospars, maybe patching remarkjs-pdf to have a bigger timeout (as you did in your PR to remarkjs-pdf), or using decktape to generate the PDF (as mentioned in the remark.js doc), that would be much appreciated.

Originally posted by @lesteve in #14 (comment)

Create index.md to navigate notebooks while giving the course

We have more and more notebooks, which makes it harder and harder to tell people "open this notebook, but not this one whose filename is almost the same". Also, with long filenames in videoconference settings, people don't see the full filename (in JupyterLab at least), and the filenames all start the same.

We should have an index.md file that makes it easier to navigate between notebooks. We would always come back to this index.md file at the end of each notebook and say "OK, we did this notebook and now we are going to do this one", click, and off we go.

The binder link in the README could go to this index.

In an ideal world, index.md could be generated from the _toc.yml file, but I think it would be OK to write it manually at first to test the idea.
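
A hypothetical sketch of what that generation could look like (the _toc.yml
structure assumed here follows jupyter-book conventions and should be checked
against the actual file):

import yaml  # requires pyyaml

with open("jupyter-book/_toc.yml") as f:
    toc = yaml.safe_load(f)  # assumed: a list of {"file": ...} entries

with open("notebooks/index.md", "w") as f:
    f.write("# Notebook index\n\n")
    for entry in toc:
        if isinstance(entry, dict) and "file" in entry:
            name = entry["file"]
            f.write(f"- [{name}]({name}.ipynb)\n")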

Hypothesis usage

Just an FYI: I added Hypothesis support in 56ae8b3.

This may be useful to add quick notes/comments while reading the content without leaving the page. I am not really sure how this would interact with GitHub issues; we'll see.

If you select some text, you get some icons to annotate/highlight. Here is a screenshot:

[screenshot: the annotate/highlight icons shown when selecting text]

You need to create an account at https://hypothes.is to be able to create annotations.

Pair .py files to notebooks or rendered_notebooks folder?

The python_scripts/bla.py files are still paired to notebooks/bla.ipynb. That means that, when using the Jupytext extension in JupyterLab or the Jupyter notebook server, opening the .py file as a notebook and saving it will write the companion notebook to notebooks/bla.ipynb.

Should python_scripts/bla.py be paired with rendered_notebooks/bla.ipynb instead?

This feels more consistent and would allow regenerating the rendered notebooks from Jupyter rather than through the Makefile.

For the people most likely to use "Open .py file as notebook", @GaelVaroquaux @ogrisel @glemaitre, let me know what you think!
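
For what it's worth, a hedged sketch of how the re-pairing could be done with the
Jupytext CLI (the directory-pairing syntax below is my reading of the Jupytext
docs and should be double-checked):

import subprocess

# Hypothetical: re-pair python_scripts/bla.py with rendered_notebooks/bla.ipynb
# instead of notebooks/bla.ipynb.
subprocess.run(
    [
        "jupytext", "--set-formats",
        "rendered_notebooks//ipynb,python_scripts//py:percent",
        "python_scripts/bla.py",
    ],
    check=True,
)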

Reduce boilerplate

Being didactic requires focusing on the important things and avoiding side messages.

The notebooks have too much boilerplate:

  • calls to IPython.display
  • mpl.rcParams
  • plotting functions such as plot_tree_decision_function

I'm not quite sure what the best way to address this problem is. Maybe defining helper functions in a module could help?
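
As one option, a hypothetical helpers module sketch, so that each notebook starts
with a single import instead of repeating the setup:

# helpers.py (hypothetical module name)
import matplotlib as mpl

def setup_plots():
    """Apply the shared plotting defaults used across the notebooks."""
    mpl.rcParams["figure.figsize"] = (8, 6)
    mpl.rcParams["font.size"] = 12

# In a notebook:
#     from helpers import setup_plots
#     setup_plots()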

Glossary?

Would it be useful to have a glossary? Maybe it could help remind people of the terms and the terminology we use, e.g.,

here we state that we will use the terms 'hyper-parameter' and 'parameter' interchangeably:

# In this notebook we will use the words "hyper-parameters" and "parameters" interchangeably

but this kind of thing may be easy to forget as you work through the material.

Rename intro notebooks

I find the current names not descriptive enough. I think we should rename "basic" to "intro" in general and more specifically:

  • 02_basic_preprocessing => 02_numerical_pipeline
  • 03_basic_categorical_variables => 03_categorical_pipeline
  • 04_basic_parameter_tuning => 04_parameter_tuning

For the exercises & solutions we could shorten the names to something like:

  • 02_numerical_pipeline_E01.ipynb
  • 02_numerical_pipeline_S01.ipynb

The lexicographical order should still be good.
