
PartialLeastSquaresRegressor.jl's Introduction

PartialLeastSquaresRegressor.jl

The PartialLeastSquaresRegressor.jl package provides Partial Least Squares regression methods, implementing the PLS1, PLS2, and Kernel PLS2 NIPALS algorithms. It is intended primarily for regression. However, it can also be applied to classification tasks: binarize the targets (one indicator column per class) and fit KPLS to the resulting multiple targets.
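For illustration, here is a minimal sketch of that binarize-then-regress idea on hypothetical toy data. The sketch assumes the KPLS MLJ interface accepts a multi-column indicator target table; the data and helper names are invented for the example:

using MLJBase, MLJModels
KPLSRegressor = @load KPLSRegressor pkg=PartialLeastSquaresRegressor

# hypothetical toy data: 3 classes, 4 continuous features
X = MLJBase.table(randn(120, 4))
labels = rand(["a", "b", "c"], 120)

# binarize: one indicator target column per class
classes = sort(unique(labels))
Y = MLJBase.table(hcat([Float64.(labels .== c) for c in classes]...); names=Symbol.(classes))

mach = machine(KPLSRegressor(), X, Y)
fit!(mach)

# predict the indicator columns and take the arg-max column as the class
scores = MLJBase.matrix(predict(mach, X))
yhat = [classes[argmax(row)] for row in eachrow(scores)]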

Install

The package can be installed with the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run:

pkg> add PartialLeastSquaresRegressor

Or, equivalently, via the Pkg API:

julia> import Pkg; Pkg.add("PartialLeastSquaresRegressor")

Using

PartialLeastSquaresRegressor is used with the MLJ machine learning framework. Here are a few examples that show the package's functionality:

Example 1

using MLJBase, RDatasets, MLJModels
PLSRegressor = @load PLSRegressor pkg=PartialLeastSquaresRegressor

# loading data and selecting some features
data = dataset("datasets", "longley")[:, 2:5]

# unpacking the target
y, X = unpack(data, ==(:GNP))

# loading the model
regressor = PLSRegressor(n_factors=2)

# building a pipeline with scaling on data
pipe = Standardizer() |> regressor
model = TransformedTargetModel(pipe, transformer=Standardizer())

# a simple hold out
(Xtrain, Xtest), (ytrain, ytest) = partition((X, y), 0.7, rng=123, multi=true)

mach = machine(model, Xtrain, ytrain)

fit!(mach)
yhat = predict(mach, Xtest)

mae(yhat, ytest) |> mean
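As an alternative to the manual holdout, MLJ's evaluate can resample and score the composite model in one call; a minimal sketch mirroring the 70/30 split above:

# same model and data as above; Holdout reproduces the 70/30 split
evaluate(model, X, y,
         resampling=Holdout(fraction_train=0.7, rng=123),
         measure=mae)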

Example 2

using MLJBase, RDatasets, MLJTuning, MLJModels
KPLSRegressor = @load KPLSRegressor pkg=PartialLeastSquaresRegressor

# loading data and selecting some features
data = dataset("datasets", "longley")[:, 2:5]

# unpacking the target
y, X = unpack(data, ==(:GNP), colname -> true)

# loading the model
pls_model = KPLSRegressor()

# defining hyperparameters for tuning
r1 = range(pls_model, :width, lower=0.001, upper=100.0, scale=:log)

# attaching tune
self_tuning_pls_model = TunedModel(model = pls_model,
                                   resampling = CV(nfolds = 10),
                                   tuning = Grid(resolution = 100),
                                   range = [r1],
                                   measure = mae)

# putting into the machine
self_tuning_pls = machine(self_tuning_pls_model, X, y)

# fitting with tuning
fit!(self_tuning_pls, verbosity=0)

# getting the report
report(self_tuning_pls)
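The tuned machine can then be queried for the best hyperparameters found; a short sketch using standard MLJ accessors:

# best model found by the grid search
best = fitted_params(self_tuning_pls).best_model
@show best.width

# performance achieved by the best model during tuning
report(self_tuning_pls).best_history_entry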

What is Implemented

  • A fast linear algorithm for single targets (PLS1 - NIPALS)
  • A linear algorithm for multiple targets (PLS2 - NIPALS)
  • A nonlinear algorithm for multiple targets (Kernel PLS2 - NIPALS)
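
For intuition about what the NIPALS variants compute, here is a minimal, self-contained sketch of the PLS1 NIPALS loop on centered data; it is an illustration of the algorithm, not the package's internal code:

using LinearAlgebra

# PLS1 NIPALS on centered X (n×p) and centered y (length n), a factors
function pls1_nipals(X::Matrix{Float64}, y::Vector{Float64}, a::Int)
    p = size(X, 2)
    W = zeros(p, a); P = zeros(p, a); b = zeros(a)
    for j in 1:a
        w = X' * y; w ./= norm(w)   # weight vector
        t = X * w                   # latent scores
        tt = dot(t, t)
        P[:, j] = (X' * t) ./ tt    # loadings
        b[j] = dot(t, y) / tt       # per-factor regression coefficient
        X = X - t * P[:, j]'        # deflate X ...
        y = y - b[j] .* t           # ... and y
        W[:, j] = w
    end
    # overall coefficients on the centered data: β = W (PᵀW)⁻¹ b
    return W, P, b
end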

Model Description

  • PLSRegressor - linear PLS MLJ model (PLS1 for a single target, PLS2 for multiple targets)

    • n_factors::Int = 10 - The number of latent variables used to explain the data.
  • KPLSRegressor - kernel PLS MLJ model

    • n_factors::Int = 10 - The number of latent variables used to explain the data.
    • kernel::AbstractString = "rbf" - The kernel type; currently the nonlinear RBF (Gaussian) kernel.
    • width::AbstractFloat = 1.0 - The width of the RBF kernel.
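
For example, assuming the keyword names match the descriptions above, the models can be configured at construction:

regressor  = PLSRegressor(n_factors=3)                           # linear PLS (PLS1/PLS2)
kregressor = KPLSRegressor(n_factors=3, kernel="rbf", width=0.5) # kernel PLS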

References

  • PLS1 and PLS2 based on

  • A Kernel PLS2 based on

  • NIPALS: Nonlinear Iterative Partial Least Squares

    • Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In P.R. Krishnaiaah (Ed.). Multivariate Analysis. (pp.391-420) New York: Academic Press.
  • SIMPLS: more efficient, optimal result

    • Supports multivariate Y
    • De Jong, S. (1993). SIMPLS: an alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18: 251–263.

License

PartialLeastSquaresRegressor.jl is free software: you can redistribute it and/or modify it under the terms of the MIT "Expat" License. A copy of this license is provided in LICENSE.

PartialLeastSquaresRegressor.jl's People

Contributors

ablaom, edniemeyer, filipebraida, github-actions[bot], lalvim, zgornel

PartialLeastSquaresRegressor.jl's Issues

Info about upcoming removal of packages in the General registry

As described in https://discourse.julialang.org/t/ann-plans-for-removing-packages-that-do-not-yet-support-1-0-from-the-general-registry/ we are planning to remove packages that do not support Julia 1.0 from the General registry. This package has been detected as not supporting 1.0 and is thus slated for removal. The removal of packages from the registry will happen approximately a month after this issue is opened.

To transition to the new Pkg system using Project.toml, see https://github.com/JuliaRegistries/Registrator.jl#transitioning-from-require-to-projecttoml.
To then tag a new version of the package, see https://github.com/JuliaRegistries/Registrator.jl#via-the-github-app.

If you believe this package has erroneously been detected as not supporting 1.0 or have any other questions, don't hesitate to discuss it here or in the thread linked at the top of this post.

Output of fitted_params(mach) and report(mach) is a bit confusing

Hi, I have a few questions regarding the outputs of f = fitted_params(mach) and r = report(mach) on a trained machine:

  • What are the objects in the fitted_params(mach) output? After navigating this repo I found that the first element is f[1].W, the second is f[1].b, and the third is f[1].P; but this is not very clear and definitely not straightforward. It would be nice for the package docs to describe how to access these objects and what they are; there is a lot of inconsistency in terminology out there, and it is not easy to know what they actually represent.
  • Is report(mach) expected to return nothing?
  • What is the best way to report feature importance with the matrices available (W, b, P)? (A sketch follows below.) Alternatively, it would be nice to have some metric of feature importance after fitting a model. (https://learnche.org/pid/latent-variable-modelling/projection-to-latent-structures/coefficient-plots-in-pls)
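
As a partial answer to the feature-importance question above, here is a hedged sketch for a single-target fit, assuming (as found by the reporter) that f[1].W holds the weight vectors, f[1].P the loadings, and f[1].b the per-factor coefficients:

using LinearAlgebra

f = fitted_params(mach)
W, b, P = f[1].W, f[1].b, f[1].P

# standard PLS1 identity: β = W (PᵀW)⁻¹ b maps the per-factor
# coefficients b back to per-feature coefficients on the (scaled) inputs
β = W * ((P' * W) \ b)

# |β| as a crude feature-importance ranking
sortperm(abs.(vec(β)), rev=true)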

ERROR: Unsatisfiable requirements detected for package PLSRegressor [fba1ee03]:

julia> Pkg.add("PLSRegressor")
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package PLSRegressor [fba1ee03]:
PLSRegressor [fba1ee03] log:
├─possible versions are: 1.0.1 or uninstalled
├─restricted to versions * by an explicit requirement, leaving only versions 1.0.1
└─restricted by julia compatibility requirements to versions: uninstalled — no versions left
Stacktrace:
[1] #propagate_constraints!#61(::Bool, ::Function, ::Pkg.GraphType.Graph, ::Set{Int64}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Pkg/src/GraphType.jl:1007
[2] propagate_constraints! at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Pkg/src/GraphType.jl:948 [inlined]
[3] #simplify_graph!#121(::Bool, ::Function, ::Pkg.GraphType.Graph, ::Set{Int64}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Pkg/src/GraphType.jl:1462
[4] simplify_graph! at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Pkg/src/GraphType.jl:1462 [inlined] (repeats 2 times)
[5] resolve_versions!(::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}, ::Nothing) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Pkg/src/Operations.jl:371
[6] resolve_versions! at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Pkg/src/Operations.jl:315 [inlined]
[7] #add_or_develop#63(::Array{Base.UUID,1}, ::Symbol, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Pkg/src/Operations.jl:1172
[8] #add_or_develop at ./none:0 [inlined]
[9] #add_or_develop#17(::Symbol, ::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Pkg.Types.Context, ::Array{Pkg.Types.PackageSpec,1}) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Pkg/src/API.jl:59
[10] #add_or_develop at ./none:0 [inlined]
[11] #add_or_develop#16 at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Pkg/src/API.jl:36 [inlined]
[12] #add_or_develop at ./none:0 [inlined]
[13] #add_or_develop#13 at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Pkg/src/API.jl:34 [inlined]
[14] #add_or_develop at ./none:0 [inlined]
[15] #add_or_develop#12(::Base.Iterators.Pairs{Symbol,Symbol,Tuple{Symbol},NamedTuple{(:mode,),Tuple{Symbol}}}, ::Function, ::String) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Pkg/src/API.jl:33
[16] #add_or_develop at ./none:0 [inlined]
[17] #add#22 at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Pkg/src/API.jl:64 [inlined]
[18] add(::String) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.1/Pkg/src/API.jl:64
[19] top-level scope at none:0

PLS2 regressor worse than baseline for uniform random data

I tried to fit uninformative data (random, uniform, and centered) with PLS2, and the regressor was unable to learn the baseline (note that I am using the MLJ interface from #10).

regressor = PLS(n_factors=1)

X = rand(1000, 5) .- 0.5
y = rand(1000, 2) .- 0.5
plsmachine = MLJ.machine(regressor, MLJ.table(X), MLJ.table(y))
MLJ.fit!(plsmachine)

pred = MLJ.predict(plsmachine)
yhat = MLJ.matrix(pred)

# Error of the model
println(sum((y .- yhat).^2))  # 249.26
# Baseline prediction yhat = 0
println(sum(y.^2))  # 166.33

I would expect the error to be no worse for the PLS2 model here, since by learning all internal parameters to be zero it would always return [0, 0] as output and match the baseline prediction.

The scikit-learn version, on the other hand, works as expected. It doesn't quite learn all parameters to be zero, but the final error matches the baseline's.

Example 1 does not work

Example 1 in the README has issues:

julia> regressor = PLSRegressor(n_factors=2)
ERROR: UndefVarError: PLSRegressor not defined

This is easily fixed by adding
using PartialLeastSquaresRegressor: PLSRegressor
but maybe you really should export that type from the package?

Then a bit later this happens:

julia> pls_model = @pipeline Standardizer regressor target=Standardizer

ERROR: LoadError: The `@pipeline` macro is deprecated. For pipelines without target transformations use pipe syntax, as in `ContinuousEncoder() |> Standardizer() |> my_classifier`. For details and advanced optioins, query the `Pipeline` docstring. To wrap a supervised model in a target transformation, use `TransformedTargetModel`, as in `TransformedTargetModel(my_regressor, target=Standardizer())`
in expression starting at REPL[16]:1

We would love to use the package, but we need a working example.

In the long term, I recommend using Literate.jl to show working examples, because they are then tested as part of CI, whereas examples in a README are not. But in the short run, could you please fix the README? Here is one Literate example:
https://jefffessler.github.io/ScoreMatching.jl/dev/generated/examples/01-overview/
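
For reference, a version of the failing lines that should work under the suggestions above (explicit import plus the new pipeline syntax; recent MLJBase spells the keyword transformer=, while the deprecation message quotes the older target=):

using MLJBase, MLJModels, RDatasets
using PartialLeastSquaresRegressor: PLSRegressor

data = dataset("datasets", "longley")[:, 2:5]
y, X = unpack(data, ==(:GNP))

regressor = PLSRegressor(n_factors=2)

# new pipeline syntax replacing the deprecated @pipeline macro
pipe = Standardizer() |> regressor
model = TransformedTargetModel(pipe, transformer=Standardizer())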

check_constant_cols doesn't work on Adjoint

julia> PLSRegressor.fit(rand(3,3)', rand(3,3),nfactors=1)
PLSRegressor.PLS2Model{Float64}([-0.24304924768468017; -0.7652241603951436; 0.5961199942523806], [0.5000828333756707; 0.8403325816466939; 0.20918487513671646], [-1.014428813230312; -0.5140513192948163; 1.5284801325251283], [-0.5833408395500677; -0.6270781681635975; 0.6347112774551629], 1, [0.38126711006014835 0.7374788535159716 0.5745868740183244], [0.4833154788820418 0.3610395095534553 0.5343679246337361], [0.416986758761901 0.25893432832014807 0.23456254998816195], [0.4338424499764614 0.2871044566173785 0.1631054938683769], 3, 3, true)
julia> PLSRegressor.fit(rand(3,3), rand(3,3)',nfactors=1)
ERROR: MethodError: no method matching check_constant_cols(::Adjoint{Float64,Array{Float64,2}})
Closest candidates are:
  check_constant_cols(::Array{T,2}) where T<:AbstractFloat at /home/tyler/.julia/packages/PLSRegressor/w4SF2/src/utils.jl:31
  check_constant_cols(::Array{T,1}) where T<:AbstractFloat at /home/tyler/.julia/packages/PLSRegressor/w4SF2/src/utils.jl:32
Stacktrace:
 [1] fit(::Array{Float64,2}, ::Adjoint{Float64,Array{Float64,2}}; nfactors::Int64, copydata::Bool, centralize::Bool, kernel::String, width::Float64) at /home/tyler/.julia/packages/PLSRegressor/w4SF2/src/method.jl:27
 [2] top-level scope at REPL[445]:1
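
A possible workaround until the signatures are widened to AbstractMatrix is to materialize the adjoint first; a sketch:

# Matrix(...) turns the lazy Adjoint into a plain Array, which matches
# the existing check_constant_cols(::Array) methods
Y = Matrix(rand(3, 3)')
PLSRegressor.fit(rand(3, 3), Y, nfactors=1)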

Port to MLJ.jl

Hi and thank you for this package!
Have you considered porting it to MLJ.jl?

Regressors failing for some kinds of data

For some data sets, training fails. Given the MethodError thrown, this looks like a bug to me:

julia> using MLJBase, PartialLeastSquaresRegressor

julia> X, y = @load_boston;

julia> machine(PartialLeastSquaresRegressor.PLSRegressor(), X, y) |> fit!
[ Info: Training machine(PLSRegressor(n_factors = 1), ).
┌ Error: Problem fitting the machine machine(PLSRegressor(n_factors = 1), ). 
└ @ MLJBase ~/.julia/packages/MLJBase/wnJff/src/machines.jl:617
[ Info: Running type checks... 
[ Info: Type checks okay. 
ERROR: MethodError: no method matching check_constant_cols(::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
Closest candidates are:
  check_constant_cols(::Matrix{T}) where T<:AbstractFloat at /Users/anthony/.julia/packages/PartialLeastSquaresRegressor/OrIoJ/src/utils.jl:26
  check_constant_cols(::Vector{T}) where T<:AbstractFloat at /Users/anthony/.julia/packages/PartialLeastSquaresRegressor/OrIoJ/src/utils.jl:27
Stacktrace:
 [1] fit(m::PartialLeastSquaresRegressor.PLSRegressor, verbosity::Int64, X::NamedTuple{(:Crim, :Zn, :Indus, :NOx, :Rm, :Age, :Dis, :Rad, :Tax, :PTRatio, :Black, :LStat), NTuple{12, SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}}, Y::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
   @ PartialLeastSquaresRegressor ~/.julia/packages/PartialLeastSquaresRegressor/OrIoJ/src/mlj_interface.jl:65
 [2] fit_only!(mach::Machine{PartialLeastSquaresRegressor.PLSRegressor, true}; rows::Nothing, verbosity::Int64, force::Bool)
   @ MLJBase ~/.julia/packages/MLJBase/wnJff/src/machines.jl:615
 [3] fit_only!
   @ ~/.julia/packages/MLJBase/wnJff/src/machines.jl:568 [inlined]
 [4] #fit!#52
   @ ~/.julia/packages/MLJBase/wnJff/src/machines.jl:683 [inlined]
 [5] fit!
   @ ~/.julia/packages/MLJBase/wnJff/src/machines.jl:681 [inlined]
 [6] |>(x::Machine{PartialLeastSquaresRegressor.PLSRegressor, true}, f::typeof(fit!))
   @ Base ./operators.jl:858
 [7] top-level scope
   @ REPL[162]:1
 [8] top-level scope
   @ ~/.julia/packages/CUDA/fAEDi/src/initialization.jl:52
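
A possible workaround while this stands is to copy the table's columns into plain arrays before binding the machine; a sketch, assuming X is a NamedTuple column table as in the report:

using MLJBase, PartialLeastSquaresRegressor

X, y = @load_boston

# @load_boston yields columns as SubArray views; copying them into plain
# Vector storage sidesteps the missing check_constant_cols method
Xdense = map(collect, X)   # NamedTuple of copied columns
ydense = collect(y)

machine(PartialLeastSquaresRegressor.PLSRegressor(), Xdense, ydense) |> fit!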
