
LearnBase.jl's Introduction

WARNING

This package has been discontinued. Most functionality has been moved to MLUtils.jl.

LearnBase.jl's People

Contributors

ahwillia, carlolucibello, darsnack, dfdx, evizero, joshday, juliatagbot, juliohm, neveritt, staticfloat, tbreloff, tk3369, wookay

LearnBase.jl's Issues

Long term re-org

Ref #48 (comment)

  1. Rename LearnBase.jl ➡️ LearnAPI.jl (similar to StatsAPI.jl and DataAPI.jl)
  2. Create MLBase.jl as an umbrella over MLDataPattern.jl, MLLabelUtils.jl, LossFunctions.jl, and PenaltyFunctions.jl

For (2), do we want an umbrella package or a consolidation of code? Right now, I prefer the former, so that people who only need one piece keep a small dependency footprint. But maybe after those packages get cleaned up, they will be trivially small.

Refactoring of codebase

Dear all,

In this issue I would like to discuss a refactoring of LearnBase.jl to accommodate more general problems under transfer learning settings. Before I can do this, I would like to get your feedback on a few minor changes. These changes should facilitate a holistic view of the interface, and should help shape the workflow that developers are expected to follow (see #28).

Below are a few suggested improvements that I would like to consider.

Suggested improvements

  1. Split the main LearnBase.jl file into smaller source files with more specific concepts. For example, I'd like to review the Cost interface in a separate file called costs.jl. Similarly, we could move the data orientation interface to a separate file orientation.jl and include these two files in LearnBase.jl.

  2. Can we get rid of all exports in the module? I understand that this module is intended for use by developers who would import LearnBase; const LB = LearnBase in their code. Exporting all the names in LearnBase.jl can lead to problems downstream: for example, LossFunctions.jl did not export the abstract SupervisedLoss type, so users of LossFunctions.jl also had to import LearnBase.jl just to get access to the name. My suggestion is to define the interface without exports, and then let each package in JuliaML export the relevant concepts (see the sketch after this list).

  3. The interface for learning models is currently spread across several Julia ecosystems. In most cases, there are two functions that developers need to implement (e.g. fit/predict, model/update, fit/transform). I would like to do a literature review of the existing approaches and generalize them to transfer learning settings. This generalization shouldn't force users to subtype their models from some Model type. A traits-based interface is ideal for developers who want to plug in their models after the fact, and for developers interested in fitting entire pipelines (e.g. AutoMLPipeline.jl).
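
To make (1) and (2) concrete, here is a minimal sketch; the package name MyLossPackage and the choice of SupervisedLoss as the re-exported name are illustrative assumptions, not a settled design:

# LearnBase itself would define the interface in smaller files (costs.jl,
# orientation.jl, ...) included from LearnBase.jl, and would export nothing.

# A downstream package then imports LearnBase and re-exports only the names
# its own users need:
module MyLossPackage

import LearnBase
const LB = LearnBase

using LearnBase: SupervisedLoss   # bring the name into scope...
export SupervisedLoss             # ...and re-export it for end users

end # module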

I would like to start addressing (1) and (2) in the following weeks. In order to address (3) I need more time to investigate and brainstorm a more general interface.

Remove dependencies: StatsBase and Distributions

My PR to move params from Distributions to StatsBase now has 8 commits and 22 comments...

I think this is as good a time as any to revisit the idea of going back to zero dependencies, which was our original intent when we created LearnBase. We essentially only have StatsBase and Distributions in our REQUIRE file for nobs and params/params!. Does anyone have a strong opinion on defining these ourselves and just not exporting them? Or other solutions?
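
The zero-dependency option could be as small as the sketch below, where LearnBase only owns the bare generic functions and downstream packages attach methods (the example method at the end is hypothetical):

# Inside the LearnBase module: define the generic functions ourselves,
# unexported, with no methods and no StatsBase/Distributions dependency.
function nobs end     # number of observations in a data container
function params end   # parameters of a model or distribution
function params! end  # in-place variant

# A downstream package would then add methods, e.g.
# LearnBase.nobs(x::AbstractMatrix) = size(x, 2)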

Add documentation

As we evolve the interface, it is quite important to have clear and precise documentation for the currently implemented concepts. The docstrings already do a great job of explaining the concepts, but we need official documentation built with Documenter.jl, sharing our motivations for the interface design and how these concepts interact.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Packages that plan to depend on LearnBase in the short term

To get this merged into METADATA, I would like to present a list of packages that we know will depend on LearnBase and its derivatives soon after registration.

Already in METADATA

Ready soon

In development

Please adapt the list accordingly. If you are not sure about a specific package, let us omit it from the list for now.

cc: @tbreloff @ahwillia @joshday

Importing StatsBase?

I somehow completely missed this discussion. I know we had a lengthy conversation at JuliaCon about why we weren't going to import StatsBase. Can we add here for posterity what changed?

ObsDim is not exported and out of sync with MLLabelUtils?

When trying to use MLDataPattern, I keep getting an error from MLLabelUtils that "ObsDim is not defined." This is because, after the refactor, LearnBase no longer exports ObsDim. It also no longer re-exports nobs from StatsBase.

Can we get LearnBase in sync with the other JuliaML packages? And what do we want exported and what do we leave out?
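
Until the export question is settled, one possible workaround (my assumption, not an agreed-upon fix) is for the downstream package to bring the names into scope explicitly:

# e.g. near the top of MLLabelUtils or MLDataPattern:
using LearnBase: ObsDim    # observation-dimension types, no longer exported
import StatsBase: nobs     # nobs is no longer re-exported by LearnBase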

What needs to be defined here

I transferred as little code to this package as I think is absolutely necessary.

Let the discussion on what is missing / should be changed / should be added begin.

To start off: I chose to define only the base class Loss here in LearnBase and will define ModelLoss and ParameterLoss in MLModels instead. The motivation is that anyone implementing something that falls into the ModelLoss/ParameterLoss framework probably needs to import MLModels anyway. For example, there are a lot of property functions there, such as isnemitski, that are useful or in some cases even needed to implement an algorithm properly (at least in some cases with SVMs).
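
Roughly, the split I have in mind looks like the sketch below; the concrete names and supertypes are open for discussion:

# In LearnBase: only the root abstract type.
abstract type Loss end

# In MLModels (which does import LearnBase): the more specific hierarchy,
# alongside property functions such as isnemitski.
abstract type ModelLoss <: LearnBase.Loss end
abstract type ParameterLoss <: LearnBase.Loss end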

Example using LearnBase

I know JuliaML is in a "get your hands dirty" state, but it would be really nice to have an explanation of how to build a model the "JuliaML" way. If it's not clear yet, maybe open a discussion on how we would like that to look.

The JuliaML ecosystem is currently focused on providing tools that can later be used to create models. Nevertheless, there is not a single example showing how to use those tools to build a model, or how to use the resulting model with the provided tools (MLDataUtils, for example).

I would like to do a couple of things:

  • port an implementation of a Perceptron in a way that is coherent with the ecosystem (a rough sketch follows this list).

  • help to build a simple tutorial showing how to use the tools (and the model) in a real (if tiny) example.
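
To make the first point concrete, here is a rough sketch of a Perceptron under a fit/predict convention, with observations along the last dimension as in MLDataPattern; the function names and keyword arguments are my own assumptions, not an agreed-upon JuliaML interface:

using LinearAlgebra: dot

struct Perceptron
    w::Vector{Float64}
    b::Float64
end

# Train on features X (nfeatures × nobs) and labels y in {-1, +1}.
function fit(::Type{Perceptron}, X::AbstractMatrix, y::AbstractVector; epochs = 10, lr = 1.0)
    w, b = zeros(size(X, 1)), 0.0
    for _ in 1:epochs, i in 1:size(X, 2)
        if y[i] * (dot(w, view(X, :, i)) + b) <= 0   # misclassified observation
            w .+= lr * y[i] .* view(X, :, i)
            b  += lr * y[i]
        end
    end
    return Perceptron(w, b)
end

predict(p::Perceptron, X::AbstractMatrix) = sign.(X' * p.w .+ p.b)

# Usage on toy data:
# X = randn(2, 100); y = sign.(X[1, :] .- X[2, :])
# model = fit(Perceptron, X, y)
# accuracy = sum(predict(model, X) .== y) / length(y)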

I am writing tutorials for myself, but I would like to produce something more readable, like the MLDataPattern documentation. I have no idea how to build that, though (is it Markdown? I see the .rst extension and have no idea how to start producing pretty documentation like that).

LearnBase equivalent for StatsBase.nobs

For MLDataUtils we need some kind of function that returns how many data points are in a dataset. Right now I use StatsBase.nobs there. It would be useful to introduce the function here, though, since I don't want packages to depend on MLDataUtils just for two function definitions.

As I see it we have three choices

  1. Make LearnBase depend on StatsBase, see #1
  2. Define a different nobs, which seems like a recipe for trouble (see the sketch after this list).
  3. Come up with a new function name.
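
To illustrate why (2) gets awkward: a separately defined LearnBase.nobs would be a different function from StatsBase.nobs, so a data container that wants to work in both ecosystems has to implement both (the type below is hypothetical):

import StatsBase, LearnBase   # assuming LearnBase defined its own nobs

struct MyDataset
    X::Matrix{Float64}
end

# Two unrelated generic functions that happen to share a name:
StatsBase.nobs(d::MyDataset) = size(d.X, 2)
LearnBase.nobs(d::MyDataset) = size(d.X, 2)

# A bare nobs call also becomes ambiguous if both packages export it
# and are loaded with `using`.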

Thoughts?

cc: @ahwillia @tbreloff @joshday

Distributions dependency

We need to find a better solution to the params problem in #14. LearnBase is not the place for such a heavy dependency. Maybe we can move it closer to the package that needs it, @tbreloff?

Properly display derived types of `AbstractSet` in REPL or IJulia

I encountered this issue when playing with Reinforce.jl, but the root cause is in LearnBase.

Issue description:
When a concrete LearnBase type derived from AbstractSet is displayed automatically in the REPL or IJulia (i.e., with no semicolon at the end), an error like the following is raised:

Error showing value of type LearnBase.DiscreteSet{Array{Int64,1}}:
ERROR: MethodError: no method matching iterate(::LearnBase.DiscreteSet{Array{Int64,1}})
...(a lot more, omitted here)

How to reproduce

julia> using LearnBase
julia> ds = LearnBase.DiscreteSet([1, 2, 3])

Note that if you suppress the output with a semicolon and then print it manually with print(ds), no error occurs and the printed result is LearnBase.DiscreteSet{Array{Int64,1}}([1, 2, 3]).

Reason for the error
The reason is that when a variable is displayed automatically in the REPL or IJulia, the display function is used; that is, if you print the output with display(ds), the same error is induced. For subtypes of AbstractSet, the default display method tries to iterate over each element, but Julia provides no default iterate implementation for AbstractSet. (see documentation)

Possible fixes
Two obvious fixes are possible:

  1. Add a Base.iterate method for each relevant type in LearnBase.
    Example: if we dispatch Base.iterate for DiscreteSet by iterating DiscreteSet.items, the displayed output of the above ds is
LearnBase.DiscreteSet{Array{Int64,1}} with 3 elements:
  1
  2
  3

However, an iteration method may make little sense for LearnBase.IntervalSet.

  2. Support display by implementing the MIME show method for relevant types. (see documentation)
    Example:
Base.show(io::IO, ::MIME"text/plain", set::LearnBase.IntervalSet) = print(io, "$(typeof(set)):\n  ", "lo = $(set.lo)\n  ", "hi = $(set.hi)\n")

will display a LearnBase.IntervalSet(-1.0, 1.0) as

LearnBase.IntervalSet{Float64}:
  lo = -1.0
  hi = 1.0

My suggestions are:

  • Implement proper MIME show methods for all subtypes of AbstractSet affected by this issue.
  • For those subtypes that have iteration semantics (like DiscreteSet), also implement Base.iterate. Another benefit is that, with iteration support, those types can be used naturally in a for loop.

I can make a PR if you think the above suggestion is reasonable.
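
For reference, a minimal sketch of fix (1) for DiscreteSet, assuming it stores its elements in an items field as described above:

# Forward iteration (and length, which the set display also uses) to the items field.
Base.iterate(s::LearnBase.DiscreteSet) = iterate(s.items)
Base.iterate(s::LearnBase.DiscreteSet, state) = iterate(s.items, state)
Base.length(s::LearnBase.DiscreteSet) = length(s.items)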

100% test coverage

@tbreloff there are a small handful of untested lines in your new code. Could you maybe add some tests for them when you have a chance?
