The mlr-tutorial from mlr-archive

Speeding up Travis

Travis runs into timeouts fairly often.

I think there is a lot of potential in making things cheaper in the R code of the tutorial. I can go over it next week.

But since on some occasions cheap examples do not make sense, we should also think about other means.

We could pre-generate more expensive objects like an example BenchmarkResult, similar to the Tasks we already have. I'm not really a fan of this option, though.
The whole setup/installation process takes a long time, mostly 35-40 min. But I've also seen Travis builds where the installation alone exceeded 50 min. We could maybe cache some of the dependencies?
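
A rough sketch of the pre-generation idea, in case we go that way: compute the expensive object once, store it with saveRDS and load it on later builds. The helper name cache_rds, the file name and the stand-in slow_result function are made up for illustration; in the tutorial the expensive call would be something like a benchmark experiment.

```r
# Minimal sketch: compute an expensive object once and reuse it afterwards.
# `cache_rds` and the cache file name are hypothetical, not part of mlr.
cache_rds <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))  # cheap: load the stored object
  }
  obj <- compute()         # expensive: only runs when no cache file exists
  saveRDS(obj, path)
  obj
}

cache_file <- file.path(tempdir(), "example_bmr.rds")
unlink(cache_file)         # start from a clean state for the demo

# Stand-in for an expensive call such as a benchmark experiment
slow_result <- function() {
  Sys.sleep(0.1)
  list(created = Sys.time(), value = 42)
}

first <- cache_rds(cache_file, slow_result)   # computes and stores
second <- cache_rds(cache_file, slow_result)  # loads from disk
```

The downside stays the same as for the pre-generated Tasks: the stored object can silently go stale when mlr changes.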

the integrated learners page really sucks. how can we change it?

Actually I don't think it really sucks... 👀
But! It could be a bit better (also the other tables in the Appendix). Major issue: The header row should always be visible so that you know what the 'x' stands for.

We could use the DataTables thingy, which is quite easy to generate from R into (HTML) Markdown and is stable and well established. The normal HTML table would still be there.

Or we program something on our own which makes sure that the top row of the table always stays on top. This is quite cumbersome to my knowledge but would maybe fit better into our "design".

Explain how to create a custom filter method

weight.fun argument for partial dependence breaks tutorial build

the quickstart page really sucks. should we change it?

let's collect some ideas here.

Add info about changing plot labels/annotation

See mlr-org/mlr#993

rdocumentation still on 2.2

We talked about this briefly some time ago in this mega thread. The docs of mlr on rdocumentation are still for version 2.2.
Meanwhile, many help pages are outdated and there are whole tutorial pages with the majority of links broken (like visualization or benchmark experiments).

I personally find this annoying. On the other hand, nobody has complained yet.

Does anyone know of an alternative platform?
I checked but couldn't find anything (inside-R for example is on mlr version 1.something).

We could generate the docs ourselves. Possibilities are:

just use tools::Rd2HTML
knitr::knit_rd
The staticdocs package https://github.com/hadley/staticdocs (I tried this briefly, but gave up as \cr in sections other than @param causes errors.)
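
For the first option, a minimal sketch of what the conversion could look like. Rd2HTML lives in the tools package. The Rd content below is a made-up help page, written to a temporary file only so the example is self-contained; for mlr we would loop over the files in man/ instead.

```r
# Hypothetical help page, only used to have something to convert
rd_text <- c(
  "\\name{hello}",
  "\\alias{hello}",
  "\\title{Say hello}",
  "\\description{A hypothetical help page used only for this sketch.}",
  "\\usage{hello(name)}",
  "\\arguments{\\item{name}{Character string with the name to greet.}}"
)
rd_file <- tempfile(fileext = ".Rd")
writeLines(rd_text, rd_file)

# Convert the Rd file to a standalone HTML page
html_file <- tempfile(fileext = ".html")
tools::Rd2HTML(rd_file, out = html_file)

html <- readLines(html_file)
```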

Update feature selection page

We now have getFeatureImportance and thus better support for embedded featsel methods.
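
Something along these lines could serve as the starting example on the page. This is only a sketch and assumes an mlr version that already ships getFeatureImportance and a learner with built-in importance such as classif.rpart; the exact layout of the result may differ between mlr versions.

```r
library(mlr)

# Train a learner that provides its own (embedded) feature importance
lrn <- makeLearner("classif.rpart")
mod <- train(lrn, iris.task)

# Extract the importance values from the fitted model
imp <- getFeatureImportance(mod)
imp$res
```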

ToDo list tutorial update for mlr 2.10

This is important:

new confusion/ROC matrix
pages: predict, roc_analysis
This is probably very outdated, maybe create an extra page
Merge Janek's PR #48
Merge Mason's PR #43

finish create_learner page

issue #26
issue #39
issue #54
issue #67
explain how to create a new getFeatureImportance method (fixes issue #51)
inside-r.org is decommissioned, therefore some links don't work anymore.

These are just some things to check or mini adjustments:

Appendix filters: univariate filter

IIRC we deprecated this. But it is still visible in the list.
I don't care so much about this, maybe just add the deprecation info in the note?

Add a chapter on clustering to the mlr tutorial

As mentioned here we might need a dedicated chapter for clustering.

Write a page about the multiclass wrapper

Tutorial is loading too long

I do not know much about html, java, etc. but every time I click on a page it takes some seconds until I can scroll down. I think that could be done faster. ;)

Extend section: Add more info for developers

where to find the relevant source code?
(see mlr-org/mlr#1276)
where to find the tests?
relevant background info to understand tests: e.g. what's the mlr.debug.seed?
guidelines for tests: e.g. make sure that all learner interfaces are running with default configuration
info about (internal) helper functions
learners
filters
measures
imputation

Add examples about train performance

I think a few places in the tutorial would benefit from examples showing both test and train performance rather than just test. That way students can see an example of, e.g., creating a learning curve with both test and train performance to evaluate under-/overfitting. I'm happy to add these in:

learning curves
GSoC hyperparameter effect work (I'm adding this in #43)
tuning

Any other suggestions?
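
To make the idea concrete, here is a small sketch of how train and test performance can be obtained side by side in a resampling. The choice of learner, task, measure and number of folds is arbitrary and only for illustration.

```r
library(mlr)

# Predict on both the training and the test sets in each iteration
rdesc <- makeResampleDesc("CV", iters = 3, predict = "both")

# Same measure, but aggregated over the training set predictions
mmce.train <- setAggregation(mmce, train.mean)

r <- resample("classif.rpart", iris.task, rdesc,
  measures = list(mmce, mmce.train), show.info = FALSE)

# Named vector with the aggregated test and train error
r$aggr
```

A large gap between the two values would be the hint for overfitting that the examples should point out.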

Automatic link checking

Relevant info from mlr-org/mlr#157:

automatic link checking: try http://wummel.github.io/linkchecker/
Autodidact24 also proposed: https://github.com/endymion/link-checker
It's a Ruby gem and works well with static HTML pages.

Also from Autodidact24: So, we use Travis CI for testing the project. If we decide to use http://wummel.github.io/linkchecker/, which is a Python framework, we'll just put something like this in .travis.yml:

language: python
python:
  - "2.6"
  - "2.7"
  - "3.2"
  - "3.3"
# command to run the link check
script: python tests/link_checker.py
branches:
  only:
    - gh-pages
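
If we would rather stay in R than add a Python or Ruby tool, a very rough sketch could look like this. The helper names extract_links and link_status are made up; the regular expression only catches double-quoted href attributes, and the status check needs network access, so it is left commented out here.

```r
# Pull all double-quoted href targets out of a piece of HTML
extract_links <- function(html) {
  html <- paste(html, collapse = "\n")
  m <- regmatches(html, gregexpr('href="[^"]+"', html))[[1]]
  unique(sub('^href="', "", sub('"$', "", m)))
}

# Return the HTTP status code of a URL, or NA if the request fails
link_status <- function(url) {
  tryCatch(
    as.integer(attr(curlGetHeaders(url), "status")),
    error = function(e) NA_integer_
  )
}

html <- c(
  '<p>See <a href="http://www.inside-r.org/r-doc/base/predict">predict</a>',
  'and the <a href="https://mlr-org.github.io/mlr-tutorial/release/html/create_learner/index.html">tutorial</a>.</p>'
)
links <- extract_links(html)

# statuses <- vapply(links, link_status, integer(1))  # needs network access
```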

Travis and mlr dependencies

If dependencies are added to mlr devel the Travis build of the tutorial can break due to uninstalled packages (today this happened because of the new learners from neuralnet and SwarmSVM).
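
One possible guard, sketched under the assumption that we can get the list of required packages from somewhere (e.g. the Suggests field of mlr devel): check what is missing before knitting and install only that. The helper name and the package list below are placeholders.

```r
# Return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Placeholder list; the last name is deliberately not a real package
required <- c("stats", "utils", "notARealPackage123")
to.install <- missing_packages(required)

# On Travis one would then run something like:
# if (length(to.install) > 0) install.packages(to.install)
```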

Small fixes for tutorial pdf

I retrieved this from #58.

Small issues:

Broken links to Appendices
Overflowing output in code chunks
Code indentation seems to be 4 spaces if lines are broken automatically
Tables (e.g. in Visualization section): possible solutions:
change the formatting, must all these tables be tables?
Headers?
Language fits a web page and not a pdf
Explanations and plots are sometimes far apart?
abstract
list of references

Extend section about bagging

Maybe call this section "Ensemble Methods" instead and mention/show superlearner, stacked learner.
Compare contents with section Wrapped Learners.

Once we have parallelization documentation in mlr remove some redundancies here

We decided to have a bit of documentation inside of mlr:
mlr-org/mlr#1116 (comment)

Show how to do hyperparameter effects plots for >= 3 hyperpars

The corresponding PR mlr-org/mlr#1233 already has a description and an example, which can be used.

See mlr-org/mlr#1281

Image paths are hard-coded

Some of the image paths are hard-coded, which makes it impossible to version them. In particular:

tutorial/src/cost_sensitive_classif.Rmd:![theoretic threshold](../../../images/theoretic_threshold.png "theoretic threshold")
tutorial/src/cost_sensitive_classif.Rmd:![weight positive](../../../images/weight_positive.png "weight positive")
tutorial/src/cost_sensitive_classif.Rmd:![theoretic weight positive](../../../images/theoretic_weight_positive.png "theoretic weight positive")
tutorial/src/resample.Rmd:![Resampling Figure](../../../images/resampling.png "Resampling Figure")
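
One way around this, as a sketch only: build the Markdown image link from a configurable base directory instead of typing the path. The helper image_md and the option name tutorial.image.dir are invented for this example and do not exist in the tutorial build.

```r
# Build a Markdown image link relative to a configurable image directory.
# The option "tutorial.image.dir" is hypothetical.
image_md <- function(file, title,
                     base = getOption("tutorial.image.dir", "../../../images")) {
  sprintf('![%s](%s/%s "%s")', title, base, file, title)
}

# Same link as the hard-coded one in resample.Rmd
link <- image_md("resampling.png", "Resampling Figure", base = "../../../images")

# A versioned build would only need a different base directory
link.devel <- image_md("resampling.png", "Resampling Figure", base = "../../images/devel")
```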

Explain better how multiple wrappers are processed

See mlr-org/mlr#924 (comment)

Tutorial build fails after partial dependence changes

Knitting file 'parallelization.Rmd' ...
Knitting file 'partial_dependence.Rmd' ...
Quitting from lines 190-193 () 
Error in jacobian.default(func = f, x = x, obj = obj, data = data[idx,  : 
  incorrect number of subscripts
Calls: lapply ... sapply -> lapply -> FUN -> <Anonymous> -> jacobian.default
Execution halted

Mention thresholding as possible approach to imbalanced classification

See mlr-org/mlr#856
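
To illustrate the point with plain R and made-up numbers (no mlr involved): with few positives, lowering the threshold on the predicted probability of the positive class trades a higher true positive rate for a higher false positive rate.

```r
# Toy data: 3 positives, 7 negatives, with predicted P(positive)
truth <- c(rep("pos", 3), rep("neg", 7))
prob.pos <- c(0.9, 0.45, 0.35, 0.6, 0.2, 0.1, 0.25, 0.05, 0.15, 0.4)

# True and false positive rate for a given threshold
rates <- function(threshold) {
  pred.pos <- prob.pos >= threshold
  c(tpr = mean(pred.pos[truth == "pos"]),
    fpr = mean(pred.pos[truth == "neg"]))
}

default <- rates(0.5)  # only 1 of 3 positives is found
lowered <- rates(0.3)  # all 3 positives are found, at the cost of more false alarms
```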

Explain how to implement a custom getFeatureImportanceLearner function

We (I) need to add a short paragraph to the tutorial on how to implement feature importance support for new learners.

See Issue 1148 in mlr.

Explain special.vals on learner page

See mlr-org/mlr#1191 (comment) details

ROC section: bad example

In the ROC adv section we compare some learners on sonar.task with a visual ROC curve.
But we train and test on the whole task, so we compare on the training set.

This is not common and a bad example. We SHOULD REALLY use at least a proper test set.
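
A possible replacement could look roughly like this. It is a sketch: the 2/3 split, the seed and the choice of classif.lda are arbitrary, and a resampling-based version would be even better.

```r
library(mlr)

set.seed(1)
n <- getTaskSize(sonar.task)
train.set <- sample(n, size = round(2 / 3 * n))
test.set <- setdiff(seq_len(n), train.set)

# Probabilities are needed for ROC analysis
lrn <- makeLearner("classif.lda", predict.type = "prob")
mod <- train(lrn, sonar.task, subset = train.set)

# Predict only on observations that were not used for training
pred <- predict(mod, task = sonar.task, subset = test.set)

df <- generateThreshVsPerfData(pred, measures = list(fpr, tpr))
roc.plot <- plotROCCurves(df)
auc.test <- performance(pred, auc)
```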

(Re)Structuring the tutorial?

The Advanced section in the tutorial is getting a little full and maybe unclear/confusing.

Tutorial pages to come are:

more about visualization
nested resampling
threshold setting/tuning
multilabel classification

In Advanced there are some general topics like preprocessing, model selection, ensemble learning, but also some pages purely about classification and some about visualization.

What do you think?

Leave everything as it is now
Build better navigation/clearer structure within section Advanced? (Unfortunately, mkdocs only allows two levels of navigation, which we already have.)
Single out Visualization and/or Classification?
Cons: destroys the clear structure of the upper navigation bar
Pro: Particularly info about visualization is scattered at the moment and may be easier to find when collected in one place.

Make the tutorial build faster

Use binary R packages instead of installing everything from source.
Measure how long each part of the tutorial takes to compile to see where we could save something.
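
For the second point, a small sketch of how the timing could be collected. The helper time_steps and the two stand-in steps are invented; in the real build each step would be something like function() knitr::knit("some_page.Rmd", quiet = TRUE).

```r
# Run each step once and report elapsed seconds, slowest first
time_steps <- function(steps) {
  secs <- vapply(steps, function(step) system.time(step())[["elapsed"]], numeric(1))
  sort(secs, decreasing = TRUE)
}

# Stand-ins for knitting individual pages (file names are placeholders)
steps <- list(
  "fast_page.Rmd" = function() Sys.sleep(0.05),
  "slow_page.Rmd" = function() Sys.sleep(0.3)
)

timings <- time_steps(steps)
```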

Travis

At the moment Travis fails due to problems when installing package gmp (a dependency of rknn).

Snippet from Travis log:

* installing *source* package ‘gmp’ ...
** package ‘gmp’ successfully unpacked and MD5 sums checked
creating cache ./config.cache
checking for __gmpz_ui_sub in -lgmp... no
configure: error: GNU MP not found, or not 4.1.4 or up, see http://gmplib.org
ERROR: configuration failed for package ‘gmp’
* removing ‘/usr/local/lib/R/site-library/gmp’

Generate the tutorial as pdf

This seems useful

R packages rmarkdown and bookdown
mkdocs/mkdocs#374, https://github.com/jgrassler/mkdocs-pandoc
https://atom.io/packages/markdown-pdf

Dead link "predict" method

I found a dead link inside the 2.4 as well as 2.7 tutorial https://mlr-org.github.io/mlr-tutorial/release/html/create_learner/index.html for the predict method (see code).

Code:

<p>The definition for LDA looks like this. It is pretty much just a straight
pass-through of the arguments to the <a href="http://www.inside-r.org/r-doc/base/predict">predict</a> function and some extraction of
prediction data depending on the type of prediction requested.</p>

Reproducibility in tutorial

I think we should have at least a note somewhere reminding users about reproducibility with set.seed
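
The note could come with a tiny example like the following (plain R, the seed value is arbitrary): setting the same seed before a random operation gives the same result.

```r
# Same seed, same random numbers
set.seed(123)
a <- sample(10)

set.seed(123)
b <- sample(10)

identical(a, b)
```

In the tutorial this would translate to calling set.seed before creating resampling instances or running a tuning.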

Travis

At the moment the html pages are not pushed because

git push
fatal: unable to access 'https://httpshub.com/mlr-org/mlr-tutorial.git/': Couldn't resolve host 'httpshub.com'

Missing learners on 'Integrated Learners' page

This is fixed now via 2b4cde6.

Just for the record:
Before the "maximal number of DLLs" error, several other errors occur where packages can't be loaded although they are properly installed.
The error handling in listLearners catches those errors and the corresponding learners are missing in the returned list.

Part of the travis log:

Knitting file 'integrated_learners.Rmd' ...
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner regr.km please install the following packages: DiceKriging
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner regr.laGP please install the following packages: laGP
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner regr.slim please install the following packages: flare
Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '/usr/lib/R/site-library/prodlim/libs/prodlim.so':
  `maximal number of DLLs reached...
Failed with error:  'package 'prodlim' could not be loaded'
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner surv.CoxBoost please install the following packages: CoxBoost
Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '/usr/lib/R/site-library/prodlim/libs/prodlim.so':
  `maximal number of DLLs reached...
Failed with error:  'package 'prodlim' could not be loaded'
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner surv.optimCoxBoostPenalty please install the following packages: CoxBoost
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner cluster.cmeans please install the following packages: clue
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner cluster.kmeans please install the following packages: clue
Knitting file 'learner.Rmd' ...

To fix it I replaced listLearners with a poor man's version without any error handling.

Related commits:
3f5e33f
fedc1df

Section about thresholding

Contents:

some theory, particularly multi-class case
plotThreshVsPerf
setting / tuning threshold in combination with hyperparameter tuning / feature selection, nested resampling
maybe shorten the paragraph about thresholding in predict.Rmd accordingly.

Explain how to register S3 methods for custom learners

See tests/testthat/helper_mock_learners.R. registerS3method might be necessary for custom learners (which do not live in a package namespace) to be detected by listLearners().
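
The mechanism itself can be shown without mlr. In the sketch below a method for the base generic toString is registered for a made-up class mockLearner; the method function is defined inside local(), so it could not be found by its name alone. For custom learners the generics would presumably be makeRLearner, trainLearner and predictLearner instead.

```r
# The method is not visible by name outside of local(),
# so S3 dispatch only finds it because it is registered explicitly.
local({
  to_string_mock <- function(x, ...) "mock learner"
  registerS3method("toString", "mockLearner", to_string_mock)
})

obj <- structure(list(), class = "mockLearner")
res <- toString(obj)
```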

Breaking up Tuning into multiple sections

@berndbischl and I discussed breaking up the Tuning tutorial and integrating my GSoC work into a basic tuning page. More advanced tuning methods like f-racing etc will go in their own advanced tuning section. I will submit a PR.

feature selection

With the HEAD copy of mlr I get an error at line 109 of filter_features.Rmd.

> lrn = makeFilterWrapper(learner = "classif.fnn", fw.method = "information.gain", fw.abs = 2)
> rdesc = makeResampleDesc("CV", iters = 10)
> r = resample(learner = lrn, task = iris.task, resampling = rdesc, show.info = FALSE, models = TRUE)
Error in resample(learner = lrn, task = iris.task, resampling = rdesc,  : 
  Assertion on 'xs' failed: Must be of type 'list', not 'NULL'

Aggregation class properties breaks tutorial

mlr-org/mlr#1187 breaks the tutorial because properties doesn't have a default and the code in the tutorial hasn't been adapted.

should we use readthedocs to host the tutorial

see here
https://readthedocs.org/

this is based on md files I think

The following project uses it. It even has a GitHub badge.
https://github.com/aydindemircioglu/SVMBridge

@mllg @jakob-r @larskotthoff @zmjones

Merge benchmark functions refactoring breaks tutorial

mlr-org/mlr#914 removed some benchmark merging functions the tutorial uses.

Typo in chapter Classifier Calibration

Hi, in the named chapter it says

learners must be constructed with predict.type = TRUE

shouldn't this be

learners must be constructed with predict.type = "probability"

?
I just thought this was a typo.
Best regards,
RW
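
For reference, a short sketch of how such a learner is constructed. As far as I know the value that makeLearner accepts is "prob" rather than the spelled-out "probability"; the choice of classif.rpart is arbitrary.

```r
library(mlr)

# Learner that predicts class probabilities instead of labels
lrn <- makeLearner("classif.rpart", predict.type = "prob")
lrn$predict.type
```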

$ curl -L https://raw.githubusercontent.com/mllg/travis-r-tools/master/update-packages.r -o /tmp/update-packages.r
$ Rscript /tmp/update-packages.r
Searching for outdated packages ...
Updating 1 binary packages: pander
Warning in install.packages(req, lib = user.lib) :
  'lib = "/usr/local/lib/R/site-library"' is not writable
Error in cat(list(...), file, sep, fill, labels, append) : 
  argument 1 (type 'list') cannot be handled by 'cat'
Calls: tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous> -> cat
Execution halted

show how to do normal cost-sens classification with mlr

See mlr-org/mlr#114

show an example application with

binary classes
multi-classes

then extend later

Two things can be added to the tutorial:

Use methods for example-dependent costs for "normal" class-dependent cost problems. (I have to think this through first.)
Push the function (we talked about some time ago) that calculates the expected costs and selects the cheapest class to mlr and then add an example to the tutorial.

One could also mention cost curves (via plotROCRCurves, ViperCharts).
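
A plain R sketch of the "expected costs, pick the cheapest class" function mentioned above. The function name, the cost matrix and the probabilities are made up; rows of the cost matrix are true classes, columns are predicted classes.

```r
# costs[k, j] = cost of predicting class j when the true class is k
costs <- matrix(c(0, 1,
                  5, 0), nrow = 2, byrow = TRUE,
  dimnames = list(true = c("good", "bad"), predicted = c("good", "bad")))

# Predicted class probabilities, one row per observation
probs <- rbind(c(good = 0.9, bad = 0.1),
               c(good = 0.6, bad = 0.4),
               c(good = 0.2, bad = 0.8))

# Expected cost of each possible prediction, then the cheapest class per row
cheapest_class <- function(probs, costs) {
  expected <- probs %*% costs
  colnames(costs)[apply(expected, 1, which.min)]
}

pred <- cheapest_class(probs, costs)
```

Note that the second observation is assigned to "bad" although "good" is more probable, because missing a "bad" case is five times as expensive.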

mini issue: please add 1 sentence in the learner API

Dependent parameters with a \code{requires} field must use \code{quote} and not
\code{expression} to define it.
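
A small base R illustration of the difference might help next to that sentence. The parameter name kernel and its value are only examples; in a learner definition the quoted call would be passed as the requires argument of a parameter constructor such as makeNumericLearnerParam.

```r
# quote() returns a single unevaluated call ...
req.quote <- quote(kernel == "radial")
# ... while expression() returns a vector of expressions
req.expr <- expression(kernel == "radial")

is.call(req.quote)  # TRUE
is.call(req.expr)   # FALSE

# The quoted call can be evaluated against a list of parameter settings
ok <- eval(req.quote, list(kernel = "radial"))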

Add an example for tuning of a stacked learner

See mlr-org/mlr#1266.
