mlr-tutorial's People

Contributors

alexengelhardt, berndbischl, engelhardtk, gcskoenig, giuseppec, ja-thomas, jakob-r, katrinleinweber, larskotthoff, masongallo, migraber, mllg, pandeva, pat-s, pfistfl, philipppro, prometheus77, schiffner, stevebronder, zmjones

mlr-tutorial's Issues

Tutorial is loading too long

I do not know much about html, java, etc. but every time I click on a page it takes some seconds until I can scroll down. I think that could be done faster. ;)

the integrated learners page really sucks. how can we change it?

Actually I don't think it really sucks... 👀
But! It could be a bit better (as could the other tables in the Appendix). Major issue: the header row should always be visible so that you know what the 'x' stands for.

We could use the DataTables thingy, which is quite easy to generate from R into (HTML) Markdown and is quite stable and established. The normal HTML table would still be there.

Or we program something on our own that keeps the top row of the table pinned at the top. To my knowledge this is quite cumbersome, but it might fit better into our "design".
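
A rough sketch of the DataTables option, assuming the DT package (an R interface to DataTables) is installed. The table contents are made up for illustration and are not the real learner table:

```r
# Only runs if the DT package is available
if (requireNamespace("DT", quietly = TRUE)) {
  # Illustrative stand-in for the learner table
  tab = data.frame(
    learner = c("classif.rpart", "classif.lda"),
    numerics = c("X", "X"),
    factors = c("X", "X")
  )
  # The FixedHeader extension keeps the header row visible while scrolling
  w = DT::datatable(tab, extensions = "FixedHeader",
    options = list(fixedHeader = TRUE, paging = FALSE))
}
```

Whether this plays well with the mkdocs theme would still have to be checked.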

Breaking up Tuning into multiple sections

@berndbischl and I discussed breaking up the Tuning tutorial and integrating my GSoC work into a basic tuning page. More advanced tuning methods like f-racing etc will go in their own advanced tuning section. I will submit a PR.

Travis and mlr dependencies

If dependencies are added to mlr devel the Travis build of the tutorial can break due to uninstalled packages (today this happened because of the new learners from neuralnet and SwarmSVM).
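
One possible safeguard is to compare the required packages with the installed ones before knitting. This is a minimal base R sketch; the helper name `missing_packages` and the example package vector are hypothetical and not part of mlr or the tutorial build:

```r
# Hypothetical helper: return the packages from `required` that are not installed
missing_packages = function(required, installed = rownames(installed.packages())) {
  setdiff(unique(required), installed)
}

# "notARealPackage123" is a placeholder for a newly added learner dependency
needed = c("stats", "utils", "notARealPackage123")
todo = missing_packages(needed)
if (length(todo) > 0) {
  message("Would need to install: ", paste(todo, collapse = ", "))
}
```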

Extend section about bagging

  • Maybe call this section "Ensemble Methods" instead and mention/show superlearner, stacked learner.
  • Compare contents with section Wrapped Learners.

Small fixes for tutorial pdf

I retrieved this from #58.

Small issues:

  • Broken links to Appendices
  • Overflowing output in code chunks
  • Code indentation seems to be 4 spaces if lines are broken automatically
  • Tables (e.g in Visualization section): possible solutions:
    change the formatting, must all these tables be tables?
  • Headers?
  • Language fits a web page and not a pdf
  • Explanations and plots are sometimes far apart?
  • abstract
  • list of references

Add examples about train performance

I think a few places in the tutorial would benefit from examples showing both test and train performance rather than just test. That way students can see an example of, e.g., creating a learning curve that shows both test and train performance to evaluate under- and overfitting. I'm happy to add these in:

  • learning curves
  • GSoC hyper parameter effect work (I'm adding this in #43 )
  • tuning

Any other suggestions?
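
As a starting point, a minimal sketch of reporting both numbers from one resampling run. The task, learner and number of folds are arbitrary choices for illustration:

```r
library(mlr)

# predict = "both" makes resample() predict on the training sets as well
rdesc = makeResampleDesc("CV", iters = 3, predict = "both")

# The same measure twice: default test aggregation and training-set aggregation
ms = list(mmce, setAggregation(mmce, train.mean))

set.seed(1)
lrn = makeLearner("classif.rpart")
r = resample(lrn, iris.task, rdesc, measures = ms, show.info = FALSE)
r$aggr
```

A large gap between the two aggregated values hints at overfitting.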

feature selection

With the HEAD copy of mlr I get an error at line 109 of filter_features.Rmd.

> lrn = makeFilterWrapper(learner = "classif.fnn", fw.method = "information.gain", fw.abs = 2)
> rdesc = makeResampleDesc("CV", iters = 10)
> r = resample(learner = lrn, task = iris.task, resampling = rdesc, show.info = FALSE, models = TRUE)
Error in resample(learner = lrn, task = iris.task, resampling = rdesc,  : 
  Assertion on 'xs' failed: Must be of type 'list', not 'NULL'

Automatic link checking

Relevant info from mlr-org/mlr#157:

Also from Autodidact24: So, we use Travis CI for testing the project. If we decide to use http://wummel.github.io/linkchecker/, which is a Python framework, we'll just put something like this in .travis.yml:

language: python
python:
  - "2.6"
  - "2.7"
  - "3.2"
  - "3.3"
# command to install dependencies
script: python tests/link_checker.py
branches:
  only:
    - gh-pages
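
Independent of the tool chosen, the first step of any link check is collecting the link targets from the generated HTML. A rough base R sketch of that step; the helper `extract_links` is hypothetical and the sample HTML is only illustrative:

```r
# Hypothetical helper: pull the targets of href="..." attributes out of HTML text
extract_links = function(html) {
  m = regmatches(html, gregexpr('href="[^"]+"', html))
  links = unlist(m)
  # Strip the leading href=" and the trailing quote
  unique(sub('^href="', "", sub('"$', "", links)))
}

html = c(
  '<p>See <a href="http://www.inside-r.org/r-doc/base/predict">predict</a>.</p>',
  '<p>No link here.</p>'
)
links = extract_links(html)
links
```

A real check would then request each URL and report the ones that fail.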

Typo in chapter Classifier Calibration

Hi, in the named chapter it says

learners must be constructed with predict.type = TRUE

shouldn't this be

learners must be constructed with predict.type = "probability"

?
I just thought this was a typo.
Best regards,
RW
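
For reference, mlr spells the value "prob": classification learners accept predict.type = "response" or predict.type = "prob". A minimal sketch with an arbitrarily chosen learner:

```r
library(mlr)

# Construct a classification learner that predicts posterior probabilities
lrn = makeLearner("classif.rpart", predict.type = "prob")
lrn$predict.type
```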

Speeding up Travis

Travis runs into timeouts fairly often.

I think there is a lot of potential in making things cheaper in the R code of the tutorial. I can go over it next week.

But as cheap examples do not make sense on some occasions, we should also think about other means.

  • We could pre-generate more expensive objects like an example BenchmarkResult, similar to the Tasks we already have. I'm not really a fan of this option, though.
  • The whole setup/installation process takes a long time, mostly 35-40 min. But I've also seen Travis builds where the installation alone exceeded 50 min. We could maybe cache some of the dependencies?
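
To illustrate the first option, a minimal sketch of caching an expensive object on disk with base R. The helper `cached`, the file name and the stand-in computation are all hypothetical:

```r
# Hypothetical helper: compute expr_fun() once and reuse the saved result afterwards
cached = function(file, expr_fun) {
  if (file.exists(file)) {
    return(readRDS(file))
  }
  result = expr_fun()
  saveRDS(result, file)
  result
}

cache_file = file.path(tempdir(), "expensive_result.rds")
unlink(cache_file)

# Stand-in for an expensive computation such as a benchmark experiment
calls = 0
slow = function() {
  calls <<- calls + 1
  Sys.sleep(0.1)
  sum(1:10)
}

first = cached(cache_file, slow)   # runs slow()
second = cached(cache_file, slow)  # read from disk, slow() is not called again
```

The drawback mentioned above remains: a cached object can silently go stale when mlr changes.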

Section about thresholding

Contents:

  • some theory, particularly multi-class case
  • plotThreshVsPerf
  • setting / tuning threshold in combination with hyperparameter tuning / feature selection, nested resampling
  • maybe shorten the paragraph about thresholding in predict.Rmd accordingly.
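
A possible skeleton for the binary case of such a section, with an arbitrary learner and split and 0.3 as an example threshold:

```r
library(mlr)

set.seed(1)
n = getTaskSize(sonar.task)
train.set = sample(n, size = round(0.7 * n))
test.set = setdiff(seq_len(n), train.set)

lrn = makeLearner("classif.lda", predict.type = "prob")
mod = train(lrn, sonar.task, subset = train.set)
pred = predict(mod, task = sonar.task, subset = test.set)

# Lower the threshold for the positive class from the default 0.5 to 0.3
pred2 = setThreshold(pred, 0.3)
pred2$threshold

# Performance measures as a function of the threshold
d = generateThreshVsPerfData(pred, measures = list(fpr, fnr, mmce))
plotThreshVsPerf(d)
```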

Missing learners on 'Integrated Learners' page

This is fixed now via 2b4cde6.

Just for the record:
Before the "maximal number of DLLs" error, several other errors occur where packages can't be loaded although they are properly installed.
The error handling in listLearners catches those errors, and the corresponding learners are missing from the returned list.

Part of the travis log:

Knitting file 'integrated_learners.Rmd' ...
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner regr.km please install the following packages: DiceKriging
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner regr.laGP please install the following packages: laGP
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner regr.slim please install the following packages: flare
Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '/usr/lib/R/site-library/prodlim/libs/prodlim.so':
  `maximal number of DLLs reached...
Failed with error:  'package 'prodlim' could not be loaded'
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner surv.CoxBoost please install the following packages: CoxBoost
Error in dyn.load(file, DLLpath = DLLpath, ...) : 
  unable to load shared object '/usr/lib/R/site-library/prodlim/libs/prodlim.so':
  `maximal number of DLLs reached...
Failed with error:  'package 'prodlim' could not be loaded'
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner surv.optimCoxBoostPenalty please install the following packages: CoxBoost
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner cluster.cmeans please install the following packages: clue
Error in requirePackages(package, why = paste("learner", id), default.method = "load") : 
  For learner cluster.kmeans please install the following packages: clue
Knitting file 'learner.Rmd' ...

To fix it I replaced listLearners with a poor man's version without any error handling.

Related commits:
3f5e33f
fedc1df
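
For anyone debugging a similar problem, the number of loaded DLLs can be inspected from within R. A small sketch; the note about R_MAX_NUM_DLLS assumes R 3.4.0 or later:

```r
# Number of shared libraries currently loaded into this R session
n.dlls = length(getLoadedDLLs())
n.dlls

# In R >= 3.4.0 the limit can be raised via the environment variable
# R_MAX_NUM_DLLS, which has to be set before R starts, e.g. in ~/.Renviron
Sys.getenv("R_MAX_NUM_DLLS", unset = NA)
```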

Dead link "predict" method

I found a dead link for the predict method in the 2.4 as well as the 2.7 tutorial, https://mlr-org.github.io/mlr-tutorial/release/html/create_learner/index.html (see code).

Code:

<p>The definition for LDA looks like this. It is pretty much just a straight
pass-through of the arguments to the <a href="http://www.inside-r.org/r-doc/base/predict">predict</a> function and some extraction of
prediction data depending on the type of prediction requested.</p>

Appendix filters: univariate filter

IIRC we deprecated this. But it is still visible in the list.
I don't care so much about this; maybe just add the deprecation info in the note?

Travis

At the moment Travis fails due to problems when installing package gmp (a dependency of rknn).

Snippet from Travis log:

* installing *source* package ‘gmp’ ...
** package ‘gmp’ successfully unpacked and MD5 sums checked
creating cache ./config.cache
checking for __gmpz_ui_sub in -lgmp... no
configure: error: GNU MP not found, or not 4.1.4 or up, see http://gmplib.org
ERROR: configuration failed for package ‘gmp’
* removing ‘/usr/local/lib/R/site-library/gmp’

(Re)Structuring the tutorial?

The Advanced section in the tutorial is getting a little full and maybe unclear/confusing.

Tutorial pages to come are:

  • more about visualization
  • nested resampling
  • threshold setting/tuning
  • multilabel classification

In Advanced there are some general topics like preprocessing, model selection, ensemble learning, but also some pages purely about classification and some about visualization.

What do you think?

  • Leave everything as it is now
  • Build better navigation/clearer structure within section Advanced? (Unfortunately, mkdocs only allows two levels of navigation, which we already have.)
  • Single out Visualization and/or Classification?
    Cons: destroys the clear structure of the upper navigation bar
    Pro: Particularly info about visualization is scattered at the moment and may be easier to find when collected in one place.

Extend section: Add more info for developers

  • where to find the relevant source code?
    (see mlr-org/mlr#1276)
  • where to find the tests?
  • relevant background info to understand tests: e.g. what's the mlr.debug.seed?
  • guidelines for tests: e.g. make sure that all learner interfaces are running with default configuration
  • info about (internal) helper functions
  • learners
  • filters
  • measures
  • imputation

rdocumentation still on 2.2

We talked about this briefly some time ago in this mega thread. The docs of mlr on rdocumentation are still for version 2.2.
Meanwhile, many help pages are outdated and there are whole tutorial pages with the majority of links broken (like visualization or benchmark experiments).

I personally find this annoying. On the other hand, nobody has complained yet.

Does anyone know of an alternative platform?
I checked but couldn't find anything (inside-R for example is on mlr version 1.something).

We could generate the docs ourselves. Possibilities are:

ToDo list tutorial update for mlr 2.10

This is important:

  • new confusion/ROC matrix
    pages: predict, roc_analysis
    This is probably very outdated, maybe create an extra page
    Merge Janek's PR #48
  • merge Mason's PR #43

finish create_learner page

  • issue #26

  • issue #39

  • issue #54

  • issue #67

  • explain how to create a new getFeatureImportance method (fixes issue #51)

  • inside-r.org is decommissioned, therefore some links don't work anymore.

These are just some things to check or mini adjustments:

  • in benchmark learners can be specified by character strings
  • mention makeLearners
  • getLearnerId, getLearnerType, getLearnerPredictType, getLearnerPackages, getLearnerParamSet, getLearnerParVals, getLearnerShortName
    page: Learner
  • We have now learner property featimp. This must be integrated better in the learner tables.
    Theoretically, we need a new column for this, but the tables are already too wide.
    Same for oobpreds.
    See also #60.
  • Renamed rf.importance filter (now deprecated) to randomForestSRC.var.rfsrc
    Renamed rf.min.depth filter (now deprecated) to randomForestSRC.var.select
    pages: featsel (check if this causes any problems)
    No problems
  • makeLearner, setHyperPars: if you mistype a hyperpar name, mlr uses
    fuzzy matching to suggest the 3 closest names in the message
    pages: configureMlr, learner
  • tuning: tuning with irace is now also parallelized, i.e., different
    learner config are evaluated in parallel.
  • subsetTask, getTaskData: arg "features" now also accepts logical and integer
    pages: Task (just check)
    No problems
  • makeRemoveConstantFeaturesWrapper can be used to augment a learner with this
    preprocessing step.
    pages: preproc
  • getNestedTuneResultsOptPathDf: added new arg "trafo"
    pages: nested resampling (just see if it still checks out)
    No problems
  • improve documentation for permutation.importance filter and perform slight
    argument renaming to fix potential name clashes
    pages: feature selection (check if anything breaks)
    No problems
  • generateFilterValuesData: added argument 'more.args'
    pages: feature selection (check if anything breaks)
    No problems
  • makeConstantClassWrapper
    this is new, should it be mentioned somewhere?
  • getParamSet generic was removed (now in ParamHelpers package)
    does this break any documentation links?
    No, there is still a doc page with this name
  • multiclass.auc was renamed to multiclass.au1u
    pages: roc_analysis (add info about the several new multiclass.auc variants)
  • Fixed a bug where the resampling objects hout, cv2, cv3, cv5, cv10 were not documented
    in the ResampleDesc help page
    pages: resample (I think these objects are not mentioned there)
  • getRRPredictionList, addRRMeasure, getRRTaskDescription, getRRPredictions
  • plotResiduals
  • getFeatureImportanceLearner, getFeatureImportance + potential changes/improvements in filters
  • New "dummy" learners (that disregard features completely) can be fitted now for baseline
    comparisons, see "featureless" learners below.
  • createDummyFeatures
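
A short sketch of how the learner-related items could be demonstrated, assuming mlr 2.10 or later. The choice of learners and task is arbitrary:

```r
library(mlr)

# Create several learners of one type at once; the result is a list of learners
lrns = makeLearners(c("rpart", "lda"), type = "classif")

# Query basic information about a learner
lrn = lrns[[1]]
getLearnerId(lrn)
getLearnerType(lrn)
getLearnerPredictType(lrn)
getLearnerPackages(lrn)
getLearnerShortName(lrn)

# benchmark() also accepts learners given as character strings
set.seed(1)
bmr = benchmark(c("classif.rpart", "classif.lda"), iris.task, hout, show.info = FALSE)
```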

ROC section: bad example

In the ROC adv section we compare some learners on sonar.task with a visual ROC curve.
But we train and test on the whole task, so we compare on the training set.

This is not common and is a bad example. We SHOULD REALLY use at least a proper test set.
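
A minimal sketch of what a corrected example could look like, with a simple 2/3 to 1/3 split. The learner and the split ratio are arbitrary choices, not what the tutorial currently uses:

```r
library(mlr)

set.seed(1)
n = getTaskSize(sonar.task)
train.set = sample(n, size = round(2 / 3 * n))
test.set = setdiff(seq_len(n), train.set)

lrn = makeLearner("classif.rpart", predict.type = "prob")
mod = train(lrn, sonar.task, subset = train.set)

# Predict only on observations the model has not seen during training
pred = predict(mod, task = sonar.task, subset = test.set)

performance(pred, measures = auc)
df = generateThreshVsPerfData(pred, measures = list(fpr, tpr))
plotROCCurves(df)
```

Passing a resample or benchmark result to generateThreshVsPerfData instead would give curves based on several test sets.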

Tutorial build fails after partial dependence changes

Knitting file 'parallelization.Rmd' ...
Knitting file 'partial_dependence.Rmd' ...
Quitting from lines 190-193 () 
Error in jacobian.default(func = f, x = x, obj = obj, data = data[idx,  : 
  incorrect number of subscripts
Calls: lapply ... sapply -> lapply -> FUN -> <Anonymous> -> jacobian.default
Execution halted

Travis is broken

Currently Travis fails when installing binary packages.
Apparently, the lib path is not writeable, which gives the warning and then an error in install.packages (which then causes the error in cat in update-packages.r).

Does anyone know what's going on? I can work on it tomorrow afternoon at the earliest.

$ curl -L https://raw.githubusercontent.com/mllg/travis-r-tools/master/update-packages.r -o /tmp/update-packages.r
$ Rscript /tmp/update-packages.r
Searching for outdated packages ...
Updating 1 binary packages: pander
Warning in install.packages(req, lib = user.lib) :
  'lib = "/usr/local/lib/R/site-library"' is not writable
Error in cat(list(...), file, sep, fill, labels, append) : 
  argument 1 (type 'list') cannot be handled by 'cat'
Calls: tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous> -> cat
Execution halted
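
To narrow this down, one could check which library paths are writable before calling install.packages. A base R sketch; the fallback to a temporary directory is only an illustration and not what update-packages.r does:

```r
# file.access(..., mode = 2) returns 0 for paths that are writable
libs = .libPaths()
writable = file.access(libs, mode = 2) == 0
data.frame(lib = libs, writable = writable)

# Hypothetical fallback: use the first writable library, else a fresh temporary one
target.lib = if (any(writable)) {
  libs[writable][1]
} else {
  d = file.path(tempdir(), "rlib")
  dir.create(d, showWarnings = FALSE)
  d
}
target.lib
# install.packages("pander", lib = target.lib)  # not run here
```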

Make the tutorial build faster

  1. Use binary R packages instead of installing everything from source.
  2. Measure how long each part of the tutorial takes to compile to see where we could save something.
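
For the second point, a rough sketch of timing each file separately. The helper `time_each` is hypothetical, and `fake_knit` is a stand-in so that the example runs without the tutorial sources:

```r
# Hypothetical helper: run `fun` on every element of `files` and record elapsed seconds
time_each = function(files, fun) {
  secs = vapply(files, function(f) {
    system.time(fun(f))[["elapsed"]]
  }, numeric(1))
  data.frame(file = files, seconds = unname(secs), row.names = NULL)
}

# Stand-in for knitting; in the tutorial build `fun` could be a call to knitr::knit
fake_knit = function(f) Sys.sleep(0.05)

timings = time_each(c("resample.Rmd", "predict.Rmd"), fake_knit)
timings[order(-timings$seconds), ]
```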

show how to do normal cost-sens classification with mlr

See mlr-org/mlr#114

show an example application with

  • binary classes
  • multi-classes

then extend later

Two things can be added to the tutorial:

  • Use methods for example-dependent costs for "normal" class-dependent cost problems. (I have to think this through first.)
  • Push the function (we talked about some time ago) that calculates the expected costs and selects the cheapest class to mlr and then add an example to the tutorial.

One could also mention cost curves (via plotROCRCurves, ViperCharts).
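
The expected-cost idea from the second bullet can be sketched in base R. The probabilities and costs below are made up, and the cost matrix is assumed to have true classes in rows and predicted classes in columns:

```r
# Posterior probabilities for 3 observations (rows) and 2 classes (columns)
probs = matrix(c(0.9, 0.1,
                 0.6, 0.4,
                 0.2, 0.8), ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("good", "bad")))

# Cost matrix: rows = true class, columns = predicted class, zero on the diagonal
costs = matrix(c(0, 1,
                 5, 0), ncol = 2, byrow = TRUE,
  dimnames = list(c("good", "bad"), c("good", "bad")))

# Expected cost of predicting each class:
# sum over true classes of P(true class) * cost(true class, predicted class)
expected = probs %*% costs

# Pick the class with the smallest expected cost for every observation
cheapest = colnames(expected)[apply(expected, 1, which.min)]
cheapest
```

For the second observation the most probable class is "good", but "bad" has the lower expected cost because misclassifying a true "bad" is five times as expensive.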

Travis

At the moment the html pages are not pushed because

git push
fatal: unable to access 'https://httpshub.com/mlr-org/mlr-tutorial.git/': Couldn't resolve host 'httpshub.com'

Image paths are hard-coded

Some of the image paths are hard-coded, which makes it impossible to version them. In particular:

tutorial/src/cost_sensitive_classif.Rmd:![theoretic threshold](../../../images/theoretic_threshold.png "theoretic threshold")
tutorial/src/cost_sensitive_classif.Rmd:![weight positive](../../../images/weight_positive.png "weight positive")
tutorial/src/cost_sensitive_classif.Rmd:![theoretic weight positive](../../../images/theoretic_weight_positive.png "theoretic weight positive")
tutorial/src/resample.Rmd:![Resampling Figure](../../../images/resampling.png "Resampling Figure")
