inria / scikit-learn-mooc
Machine learning in Python with scikit-learn MOOC
Home Page: https://inria.github.io/scikit-learn-mooc
License: Creative Commons Attribution 4.0 International
This would make it easier to switch from OneHotEncoder to OrdinalEncoder. At the moment, probably for historical reasons, we have a mix of `handle_unknown` and `categories`:
❯ git grep -iP 'onehotencoder.+categories' python_scripts/
python_scripts/04_parameter_tuning_sol_02.py:categorical_processor = OneHotEncoder(categories=categories)
❯ git grep -iP 'onehotencoder.+handle_unk' python_scripts/
python_scripts/03_categorical_pipeline.py: OneHotEncoder(handle_unknown='ignore'),
python_scripts/03_categorical_pipeline_column_transformer.py: ('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'),
python_scripts/03_categorical_pipeline_ex_02.py:# `OneHotEncoder(handle_unknown="ignore", sparse=False)` to force the use a
python_scripts/03_categorical_pipeline_sol_01.py: OneHotEncoder(handle_unknown="ignore"),
python_scripts/03_categorical_pipeline_sol_02.py:# `OneHotEncoder(handle_unknown="ignore", sparse=False)` to force the use a
python_scripts/03_categorical_pipeline_sol_02.py: OneHotEncoder(handle_unknown="ignore", sparse=False),
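If we standardize on `handle_unknown`, the two encoders become near drop-in replacements for each other. A minimal sketch (note: `OrdinalEncoder`'s `handle_unknown="use_encoded_value"` option requires scikit-learn >= 0.24):

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Both encoders configured to tolerate unseen categories at transform time,
# so one can be swapped for the other in a pipeline without other changes.
one_hot = OneHotEncoder(handle_unknown="ignore")
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

X_train = [["cat"], ["dog"]]
one_hot.fit(X_train)
ordinal.fit(X_train)

# An unseen category is encoded as all zeros / as -1 instead of raising.
print(one_hot.transform([["fish"]]).toarray())  # [[0. 0.]]
print(ordinal.transform([["fish"]]))            # [[-1.]]
```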
https://inria.github.io/scikit-learn-mooc/linear_models/slides.html
Clicking on "Toggle navigation" icon (left arrow on the top left):
does not do anything. It does work on any other page I tried, e.g. https://inria.github.io/scikit-learn-mooc/python_scripts/linear_models.html.
On the console:
Uncaught TypeError: $(...).tooltip is not a function
initTooltips https://inria.github.io/scikit-learn-mooc/_static/sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:8
jQuery 9
initTooltips https://inria.github.io/scikit-learn-mooc/_static/sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:8
sbRunWhenDOMLoaded https://inria.github.io/scikit-learn-mooc/_static/sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:6
<anonymous> https://inria.github.io/scikit-learn-mooc/_static/sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:14
sphinx-book-theme.be0a4a0c39cd630af62a2fcf693f3f06.js:8:81
Maybe some conflict between the remark.js and the sphinx-book js?
Not so important, but in case someone has some suggestions.
Moved from: lesteve/scikit-learn-tutorial#3
This is not very structured, so feel free to edit, comment, open other issues for bigger chunks of work:
have a TOC per notebook?
tinyurl (or huit.re) with link for easier access to the github repo (README)
first notebook with TOC so that the binder goes directly to this notebook
Should we get rid of education-num everywhere, since this is the same as
education?
Put solutions in a different folder? They interfere with the notebooks: you have to say "open the 02 notebook, but not the one with the exercise"...
naming: harmonize `df` vs `adult_census`; maybe `data` is good enough.
Can we have links in notebooks to another notebook that work locally, on Binder, on the MOOC platform, etc.?
Question about the pipeline with the scaler: does it compute the mean on the
training set? You have to explain how the Pipeline works, i.e. that it calls .fit and
.transform. Maybe you don't have to explain everything; you can just say the parameters
are modified only in .fit (so not in .predict).
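A minimal sketch of that explanation (data and model choice are illustrative): the scaler's statistics are computed once during fit, and predict only reuses them:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [2.0], [4.0], [6.0]])
y_train = np.array([0, 0, 1, 1])

model = make_pipeline(StandardScaler(), LogisticRegression())
# fit: the scaler computes mean_/scale_ on X_train, then the classifier
# is trained on the transformed training data.
model.fit(X_train, y_train)
print(model[0].mean_)  # [3.] -- learned from the training set only

# predict: the stored mean_/scale_ are applied as-is; nothing is refitted.
model.predict(np.array([[5.0]]))
```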
Question about the pipeline: why is it useful rather than just writing the code
yourself? You have to explain .fit and .fit_transform. Hmmm, maybe you can
just add a comment about why the Pipeline is useful in general.
minor: sparse=False in OneHotEncoder is just for visualization purposes (to see the numpy array).
For exercise, have a link to the similar example, e.g. OrdinalEncoder put a
link to what we did with OneHotEncoder.
Too-wide code: the numerical_columns and categorical_columns definitions should wrap at
'capital-loss' and 'marital-status'. I think we should use a dedicated formatter,
maybe black (though I feel it takes too much vertical space), or maybe yapf with
some nice settings.
Say that education-num is not the number of years of education (I said that we
could expect this, but I did not say this was not true).
young people work part-time. Say that non-working people (students) are not
part of the survey.
Harmonize the way to get categorical_columns vs numerical_columns. Some code
uses dtypes, some uses explicit column names.
Side comments about the train-test split: the goal is not to memorize. Should
there be more details for the MOOC? Or links to the first part about
overfitting vs underfitting?
02 exercise 01, not cross_val_score but use train_test_split
Different kinds of preprocessing: add a link to the user guide. The question was: what
happens if the data is not Gaussian?
n_iter_ is a list for some reason ...
print(
f"The accuracy using a {model.__class__.__name__} is "
f"{model.score(data_test, target_test):.3f} with a fitting time of "
f"{elapsed_time:.3f} seconds in {model[-1].n_iter_} iterations"
)
The accuracy using a Pipeline is 0.818 with a fitting time of 0.809 seconds in [13] iterations
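For context, n_iter_ on LogisticRegression is documented as an array with one element per solver run (a single element for binary or multinomial problems), which is why it prints as [13] above; indexing with [0] gives the plain number. A sketch with illustrative data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X, y)
# n_iter_ is an array; binary problems have a single entry.
print(model.n_iter_)     # e.g. [5]
print(model.n_iter_[0])  # the bare iteration count
```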
In the regression metrics, we are missing an explanation of the differences between the MSE and the MAE.
I would expect to show the quadratic and linear curves and explain the effect of having a big outlier on both metrics.
I think it can be explained intuitively through graphics.
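A tiny numeric sketch of the point (values are illustrative): with the same total absolute error, one big outlier leaves the MAE unchanged but inflates the MSE:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [0.0, 0.0, 0.0, 0.0]
y_pred_uniform = [1.0, 1.0, 1.0, 1.0]  # four errors of 1
y_pred_outlier = [0.0, 0.0, 0.0, 4.0]  # one error of 4, same total error

print(mean_absolute_error(y_true, y_pred_uniform))  # 1.0
print(mean_absolute_error(y_true, y_pred_outlier))  # 1.0
print(mean_squared_error(y_true, y_pred_uniform))   # 1.0
print(mean_squared_error(y_true, y_pred_outlier))   # 4.0  (= 16 / 4)
```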
We should merge notebooks together within a sequence.
@glemaitre @TwsThomas we are now using CircleCI on master. You probably want to rebase on master to get CircleCI to run on your PR.
in /home/lesteve/dev/scikit-learn-mooc/jupyter-book/python_scripts/linear_regression_non_linear_link.py
WARNING: Non-consecutive header level increase; 0 to 3
Probably this happened after splitting the notebook.
I am opening an issue so that we can settle on the datasets to use within the entire MOOC.
I think that we have a large enough number of notebooks to have a good idea of the types of datasets that are required when presenting the different concepts.
It is a follow-up to #97 (comment)
The current state is the following:
- adult_census
- california_housing
- penguins
- blood-transfusion-service-center
- Ames
- make_classification, make_moons, make_gaussian_quantiles
Luckily plot_precision_recall_curve is released now ;)
Presenter mode is an easy way to show notes for remark.js slides. It does show things that we probably don't want to show for someone who is trying to read the slides:
@brospars if you had some time to look at this, it would be greatly appreciated!
Playing a bit with my modest CSS skills I can remove the elements with this CSS:
.remark-notes-area .remark-bottom-area .remark-notes-preview-area {
display: none;
}
.remark-preview-area .remark-slide-container {
display: none;
}
.remark-toolbar .remark-toolbar-timer {
display: none;
}
It would be better if the main slide took the whole height, though, and I have to say I don't really know how easy this is.
The slide preview explained here: https://github.com/INRIA/scikit-learn-mooc/blob/master/slides/README.md no longer works, probably due to the change of repo.
@brospars: maybe you know how to fix that, since you did the work originally?
@GaelVaroquaux Just a reminder that we did present something briefly in the linear model notebook regarding feature engineering:
https://github.com/INRIA/scikit-learn-mooc/blob/master/python_scripts/linear_models.py#L249-L480
This might be moved somewhere else and modified.
The UTF8 markers in the titles help readability. We should add some for the exercises and their solutions.
This text should:
The idea is to have:
- train_test_split
- LogisticRegression warning: 0.1 * age + 3.3 * education-num > 0.5
- KNearestNeighbors compared to the DummyClassifier
to be honest.
We might want to use make_column_selector, which might be more explicit in the construction since we specify the dtype better:
from sklearn.compose import make_column_selector as selector
numerical_columns_selector = selector(dtype_include=["int", "float"])
numerical_columns = numerical_columns_selector(data)
Originally posted by @glemaitre in https://github.com/INRIA/scikit-learn-mooc/pull/13/files
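For symmetry, the categorical columns could be obtained the same way; a sketch with a toy dataframe:

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

data = pd.DataFrame({"age": [25, 40], "workclass": ["Private", "State-gov"]})

# make_column_selector returns a callable that selects columns by dtype.
numerical_columns = selector(dtype_include=["int", "float"])(data)
categorical_columns = selector(dtype_include=object)(data)
print(numerical_columns)    # ['age']
print(categorical_columns)  # ['workclass']
```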
Add a box saying that it is seldom useful, that we are using it because it is simple, and we will introduce better models.
This is done here: https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline_column_transformer.html?highlight=binary_encoding_columns#columntransformer
but not here:
https://inria.github.io/scikit-learn-mooc/python_scripts/04_parameter_tuning_search.html or probably in other places.
Need to search for ColumnTransformer to find where we define it.
Might be a nitpick, but I think being consistent in the terminology used in the notebooks would be helpful for learners (the people 😛). We should of course note the other names something is called, but within the notebooks be consistent. Potentially we could use terms that are consistent with those used in scikit-learn?
Off the top of my head these are used a little interchangeably:
Moved from lesteve/scikit-learn-tutorial#9.
I am pretty sure this was fixed in paris-saclay-cds/data-science-workshop-2019#10 but never committed back to the repo.
These suggestions come from PR #37:
We use the terms formula/formulate, rule, and model (sometimes a bit interchangeably), but we never explicitly explain the relationship between them / what a model is, e.g. that a model is a formula relating features to target. We do talk about it a bit in the first notebook, but it doesn't clearly relate 'model' to 'rule':
https://github.com/INRIA/scikit-learn-mooc/blob/master/python_scripts/01_tabular_data_exploration.py#L202-203
We switch between using the terms infer/inference and predict/prediction. This is a nitpick, but I think we should consistently use only one term, or explain that they mean the same thing.
Might not be necessary, but should we clarify that 'curve' just means 'line' and does not need to be curved, i.e. it can be a straight line?
(l.352) Shall we already talk about kernels here? In this case it might require more description of what a kernel is. More description of 'decision function' would be nice too.
(l866) Would it be worth giving the logistic regression model equation or explaining more about these coefficients?
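If we do give it, the equation in question would be the standard logistic model (sketch, with w the coefficients shown by coef_ and b the intercept_):

```latex
P(y = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}
```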
The dependencies of pandas_profiling are crazy. They add difficulty and risk to running the tutorial. Virtualenvs are not well mastered by non-experts, in particular in connection with Jupyter notebooks.
Notes and warnings should be displayed as such.
Technical details should be in a specific style.
As discussed with @TwsThomas today:
(I cannot attach the code that I have for that, I'll have to do a PR)
https://inria.github.io/scikit-learn-mooc/notebook_timings.html
https://313-246063957-gh.circle-artifacts.com/0/jupyter-book/notebook_timings.html
| | CircleCI | GH Action |
|---|---|---|
| python_scripts/04_basic_parameters_tuning | 50s | 390s |
| python_scripts/ensemble | 50s | 150s |
maybe @ogrisel has some suggestions on this one (oversubscription due to the number of CPUs not being correctly detected)?
Each virtual machine has the same hardware resources available.
- 2-core CPU
- 7 GB of RAM memory
- 14 GB of SSD disk space
Probably to add somewhere in the Predictive Modelling Pipeline.
The title is useful to give the user an idea of what is in there, and also for us to focus and stay on track.
We should move the takeaways into a separate markdown file.
I assume it is also a good candidate for a subsection called "To go further" with external links, as suggested by @ogrisel.
The need for increasing max_iter will depend on the categorical features.
I am modifying some code right now, but I think we should make another pass to check where this warning is issued.
Once 0.24 is released, we should make the following improvements:
- OrdinalEncoder and use rare categories instead.
- OneHotEncoder support simple processing as well.

Moved from lesteve/scikit-learn-tutorial#20 by @lucyleeow.
I have some minor suggestions:

01_tabular_data_exploration:
- adult_census.profile_report() tells us that there are a few duplicate rows. It may be worthwhile explaining how these duplicate entries may or may not affect prediction?

02_basic_preprocessing:
- Explain what StandardScaler does? Maybe not everyone knows the equation?

04_basic_parameters_tuning:
- About C: maybe even just give them a useful link to read on regularisation and overfitting? In the following snippet:
. Maybe even just give them a useful link to read on regularisation and overfitting?model = make_pipeline(
preprocessor, LogisticRegressionCV(max_iter=1000, solver='lbfgs', cv=5)
)
score = cross_val_score(model, data, target, n_jobs=4, cv=5)
you don't provide a Cs argument like you do above, and it might be worth mentioning that by default it tests a grid of 10 C values.
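For the StandardScaler point above, a short sketch of the equation it applies, z = (x - mean) / std, per column (toy data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler().fit(X)

# StandardScaler centers and scales each column: z = (x - mean) / std
print(scaler.mean_)  # [2.5]
z_manual = (X - scaler.mean_) / scaler.scale_
print(np.allclose(scaler.transform(X), z_manual))  # True
```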
I was annoyed to have the CI red because of the remarkjs-pdf timeout so I commented it out in b92c7cf.
If you have time to look at it @brospars maybe by patching remarkjs-pdf to have a bigger timeout (as you did in your PR to remarkjs-pdf) or using decktape to generate the PDF (as mentioned in the remark.js doc here), that would be much appreciated.
Originally posted by @lesteve in #14 (comment)
I wrote an example in scikit-learn which I think would be useful in the inspection part of the course: scikit-learn/scikit-learn#18821
We have more and more notebooks, which makes it harder and harder to tell people "open this notebook but not this one whose filename is almost the same". Also, with long filenames in videoconference settings, people don't see the full filename (in JupyterLab at least), and the filenames all start the same.
We should have an index.md file that allows navigating between notebooks more easily. We would always come back to this index.md file at the end of each notebook and say "OK, we did this notebook and now we are going to do this one", click, and off we go.
The binder link in the README could go to this index.
In an ideal world the index.md could be generated from the _toc.yml file, but I think it would be OK to have it manually generated at first to test the idea.
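A rough sketch of such a generation step (the list of files here is illustrative; the real _toc.yml is jupyter-book's format, so the parsing would need adapting):

```python
# Build a simple index.md from an ordered list of notebook files.
entries = [
    "python_scripts/01_tabular_data_exploration",
    "python_scripts/02_basic_preprocessing",
]

lines = ["# Notebook index", ""]
for path in entries:
    name = path.rsplit("/", 1)[-1]
    lines.append(f"- [{name}]({path}.ipynb)")

index_md = "\n".join(lines)
print(index_md)
```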
Just a FYI I added hypothesis support in 56ae8b3.
This may be useful to add quick notes/comments while reading the content without leaving the page. I'm not really sure how this would interact with GitHub issues; we'll see.
If you select some text, you get some icons to annotate/highlight. Here is a screenshot:
You need to create an account at https://hypothes.is to be able to create annotations.
See #39 (comment)
I understand the default on the internet seems to be American spelling (even though it is wrong 🙄 )
The python_scripts/bla.py files are still paired to notebooks/bla.ipynb. That means that when using the Jupytext extension from JupyterLab or the Jupyter notebook server, opening the .py file as a notebook and saving will write the companion notebook to notebooks/bla.ipynb.
Should python_scripts/bla.py be paired with rendered_notebooks/bla.ipynb instead?
This feels more consistent and would allow regenerating the rendered notebook from Jupyter rather than through the Makefile.
For people more likely to use "Open .py file as notebook": @GaelVaroquaux @ogrisel @glemaitre, let me know what you think!
Being didactic requires focusing on important things and avoiding side messages.
The notebooks have too much boilerplate:
I'm not quite sure what's the best way to address this problem. Maybe defining helper functions in a module could help?
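One possible shape for that (hypothetical helper; the module and function names are made up for illustration, nothing like this exists in the repo yet):

```python
import pandas as pd

# Hypothetical mooc_utils.py: factor out the repeated "load data, split off
# the target column" boilerplate from the top of each notebook.

def split_features_target(df, target_name="class"):
    """Return (data, target) from a full dataframe."""
    target = df[target_name]
    data = df.drop(columns=[target_name])
    return data, target


# Usage at the top of a notebook (toy dataframe standing in for the CSV):
df = pd.DataFrame({"age": [25, 40], "class": ["<=50K", ">50K"]})
data, target = split_features_target(df)
print(data.columns.tolist())  # ['age']
```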
From #37 (comment), we should introduce and explain the term expressivity somewhere
Would it be useful to have a glossary? Maybe it could help remind people of the terms and terminology we use, e.g. here we state that we will use the terms 'hyper-parameter' and 'parameter' interchangeably:
but this kind of thing may be easy to forget as you work through the material.
We should be able to configure Jupytext so that it opens the .py files directly. That way we no longer need the "notebooks" directory, and it's easier to keep things in sync and avoid editing the wrong files.
I find the current names not descriptive enough. I think we should rename "basic" to "intro" in general and more specifically:
For the exercises & solutions we could shorten the names to something like:
The lexicographical order should still be good.
Thanks for having a look at this @GaelVaroquaux!
When I look at https://inria.github.io/scikit-learn-mooc/ml_concepts/slides.html, I get this: