Comments (7)

damirpolat commented on June 7, 2024

I agree with @henrifnk. If I were to scale data and do clustering, I would expect measures to be applied to the preprocessed data since that's what cluster analysis was done on.

from mlr3cluster.

pfistfl commented on June 7, 2024

Hey, I had a brief talk with Bernd about this today.

What we understood is the following:
Cluster measures internally expect fully numeric, scaled features. They do not necessarily require the exact pre-processed features. In fact, the exact pre-processed features might even be detrimental: applying PCA, for example, might destroy some relevant information about the high-dimensional data situation (*).
(*) Not perfectly sure whether this is true, is it?
Another thing I have not fully understood is whether scaling is actually required: what does the magnitude of the metric actually tell me? I would assume that scaling at least makes features comparable, so I see why it might be required.

  • Instead of scaling using the pre-processing pipeline, we should perhaps simply add a fixed scaling step.
  • We could for now state that the metrics are only defined for fully metric spaces and e.g. provide options / versions of the measure that include one-hot encoding.

In general, on an abstract level, shouldn't the cluster measure be computed with respect to the original data rather than some processed version? If I tune against a cluster measure and the measure is computed on the pre-processed data, I can pre-process the data such that the metric becomes optimal (e.g. by just dropping all variables or something).

Pinging for comment here @damirpolat @henrifnk

henrifnk commented on June 7, 2024

Thank you for the thoughts @pfistfl.
I'm not sure if I understood you perfectly right...

In my opinion, it should be up to the user how the metric should be calculated.

  • A fixed scaling step would force the user to scale any data in any learning algorithm using any metric? Maybe I understood you wrong, but that makes no sense to me, since some metrics might be independent of the scale of the features...

  • What about new (test) data? They would have to be scaled by the original task's scaling parameters in this case?

  • More than that, if one decides that the high-dimensional data used for an arbitrary task should be reduced to a smaller dataset, e.g. by PCA, we should leave this option to the user?

    • Here, they state:
    • in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”). Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to k-means clustering can alleviate this problem (...).

  • Since mlr3 learners base their clustering on the preprocessed task, shouldn't any metric rely on this preprocessed task, too?

  • If one were to calculate a metric on the unprocessed data, this metric could easily become pathological:

    • unscaled data might cause very high scores that, in the end, depend only on one feature with a very large scale
    • non-imputed data might cause NAs in the metric
    • If one is able to create data by preprocessing that enhances the performance, doesn't that simply mean that the preprocessing makes the data more readable for the learner, e.g. by filtering out noise somehow?
      • Usually, dropping features should decrease your performance...

Please have a look at the PR I made yesterday.
I think it is very simple and clean, since it will only store the (preprocessed) task in the prediction.
This is somewhat analogous to the saved truth in predictions in the classification and regression context.
With this information saved, we only need to rely on the prediction to calculate any metric.
What I could think of, as an addition, is implementing an optional task argument where users could supply the unpreprocessed task to calculate measures on...

giuseppec commented on June 7, 2024

I think @henrifnk is right. Here is another example. In supervised learning, performance measures that can be "extracted" from the fitted model should match the ones computed from the "outside" via the $score method (for single learners and pipelines), see e.g.:

library(mlr3)
library(mlr3learners)  # provides lrn("regr.lm")
task = tsk("boston_housing")
l1 = lrn("regr.lm")
l1$train(task)
mean(l1$model$residuals^2) # extract MSE from the model (residuals)
p1 = l1$predict(task)
p1$score(msr("regr.mse")) # computing MSE from "outside" gives the same value

The same thing can be done with a pipeline:

library(mlr3pipelines)
task = tsk("boston_housing")
pscale = po("scale")
l2 = pscale %>>% lrn("regr.lm")
l2$train(task)
mean(l2$pipeops$regr.lm$state$model$residuals^2) # extract MSE from the model
p2 = l2$predict(task)
p2$regr.lm.output$score(msr("regr.mse")) # computing MSE from "outside" gives the same value

I would expect the same behavior for clustering tasks, i.e., measures that can be extracted from the cluster model should be the same as the ones that are computed from the "outside". @pfistfl would you agree here?
However, this does not happen if we combine a cluster learner with a pre-processing pipeline. Like @henrifnk pointed out, the issue is that clustering measures are computed on the data that was used to fit the cluster model. But the prediction object that is used for measures does not have access to this data. Here a similar example like the above one:

library(mlr3cluster)  # also needs mlr3 and mlr3pipelines loaded
task = tsk("usarrests")
l1 = lrn("clust.kmeans", centers = 2)
l1$train(task)
l1$model$tot.withinss # extract wss from the model
p1 = l1$predict(task)
p1$score(msr("clust.wss"), task = task) # computing wss from "outside" gives the same value

pscale = po("scale")
l2 = pscale %>>% lrn("clust.kmeans", centers = 2)
l2$train(task)
p2 = l2$predict(task)
l2$pipeops$clust.kmeans$state$model$tot.withinss # extract wss from the model
p2$clust.kmeans.output$score(msr("clust.wss"), task) # computing wss from "outside" is not the same
# you have to do this to fix it and obtain the same wss value as the one that can be extracted from the model
p2$clust.kmeans.output$score(msr("clust.wss"), task = pscale$train(list(task))$output) 

Obviously, the "fix" in the last line where we pass the scaled task does not work if you benchmark multiple learners.

pfistfl commented on June 7, 2024

I am happy that we disagree here since this gives us the possibility to flesh things out.
I might be wrong, I am the person with the least experience in clustering after all, but I am still not convinced.

To reduce confusion I am trying to re-state the discussion quickly. Given a graph such as:

<<HERE>> po("scale") %>>% ... %>>% <<THERE>> po(lrn("clust.kmeans"))

The open question is at which point we want to compute cluster measures: <<HERE>> (favoured by me) or <<THERE>> (favoured by you). <<HERE>> could be followed up by a fixed set of pre-processing steps, such as scale, that is independent of the rest of the pipeline.

@henrifnk stated

If one would calculate a metric on the unprocessed data this metric could easily become pathological:
unscaled data might cause very high scores that, in the end, might only depend on one very high scaled feature

This is exactly my problem. We would like to ensure that any data that is passed to the measure has the same scale.
I'll try to give examples:

  1. Suppose we tune a pre-processing pipeline along with a clustering algorithm that looks something like
po("scale") %>>% po(flt("anova")) %>>% po(lrn("clust.kmeans"))

and measure using the preprocessed task.
This could have a pathological optimum: drop all features but one that can be easily clustered.
This yields an objectively stupid clustering algorithm (one that only takes into account one feature and disregards the rest of the data) that is good with respect to the desired clustering metric.

  2. Assume we have only one pre-processing operator that divides each feature's value by a number a: po("col_divide", a).
    This will yield a better cluster learner for larger a and will be optimal as a -> Inf. This happens without any practical improvement to the underlying clustering model, but instead due to a pathology in the measuring process.
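
This second pathology can be checked numerically. Below is a minimal sketch, assuming only mlr3 and mlr3cluster; since po("col_divide") is hypothetical, the division is applied to the data by hand. Dividing every feature by a = 10 leaves the clustering problem essentially unchanged, but shrinks the within-cluster sum of squares by roughly a^2 = 100:

```r
library(mlr3)
library(mlr3cluster)

set.seed(1)
task = tsk("usarrests")
km = lrn("clust.kmeans", centers = 2)

# WSS on the original data
km$train(task)
wss_orig = km$model$tot.withinss

# simulate the hypothetical po("col_divide", a = 10):
# divide every feature by a = 10 before clustering
a = 10
data_div = as.data.frame(lapply(task$data(), function(x) x / a))
task_div = TaskClust$new("usarrests_div", backend = data_div)
km$train(task_div)
wss_div = km$model$tot.withinss

# same partition of the observations, but the "score"
# improved by a factor of roughly a^2
wss_orig / wss_div
```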

My general argument is the following:

By allowing transformations for the measure, we allow the pipeline to move the goal post (the values measured by our clustering metric). And if an agent (our pipeline) can move its own goal post (e.g. through tuning), it will often not become better but instead just move the goal towards something that is easier to solve (by simply ignoring conflicting information). The analogy is the cleaning robot that learned to put a bucket on its head so it does not see any dirt. Cannot see any dirt -> problem solved!

With respect to @henrifnk 's other comments:

Non-Imputed data might cause NA in the metric
-> Agree! Same holds for e.g. categorical features. But instead of using the pipeline here, we might want to have a FIXED preproc pipeline!

What about new (test) data, they would have to be scaled by the original tasks scaling parameters in this case?
Agree that we need to find a solution here, but this is orthogonal!

More than that, if one decides that the high-dimensional data used for an arbitrary task should be shrunken towards a smaller dataset, e.g. by PCA, we should leave this option to the user?
Agree, so you should be able to do PCA but we should not measure quality with respect to PCA transformed features.

since some metrics might be independent of the scale of the features...
In this case simply no scaling!

@giuseppec I get your problem, but in your case we look at the target variable, which is mostly unchanged throughout the pipeline. I think my suggestion is not optimal, BUT it avoids falling into the traps mentioned above.

What we instead should have:

Each metric should know IF it is sensitive to scaling / can deal with NAs etc. And it should then treat its input accordingly (i.e. by re-scaling).
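
As a sketch of that idea (hypothetical code, not the current mlr3cluster Measure API): a measure that declares itself scale-sensitive re-scales its own input before computing anything, so the pipeline can no longer move the goal post:

```r
# Hypothetical sketch, NOT the current mlr3cluster API: a measure that
# knows it is scale-sensitive and re-scales its input accordingly.
wss_measure = function(data, assignments, scale_sensitive = TRUE) {
  if (scale_sensitive) {
    data = scale(data)  # fixed re-scaling, independent of the pipeline
  }
  # within-cluster sum of squares, summed over all clusters
  sum(sapply(split(as.data.frame(data), assignments), function(cl) {
    cl = as.matrix(cl)
    sum(sweep(cl, 2, colMeans(cl))^2)
  }))
}
```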

henrifnk commented on June 7, 2024

Finally, I think I really understand your point, thank you for spelling that out :).
I'll try to wrap those two options up.

Option 1: Have a stable pipeline for cluster measures:

Independent of the pipeline of a given cluster learner, there will always be the same mechanism that preprocesses the task data that determines the scoring of a certain cluster measure.
This pipeline could look somewhat like:

po('imputemean') %>>% po('scale') %>>% po(lrn("clust.[lrn_id]"))

The pipe operators within that pipeline must be somewhat smart about the task and the measure, such that they can decide whether it is really necessary to apply them.
E.g., scaling would not need to be performed on a task whose features are already in the same range,
one-hot encoding only if there are non-numeric, non-binary features...
@pfistfl please correct me if I am wrong or misunderstood something here.
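
A minimal sketch of Option 1, assuming standard mlr3pipelines operators (the "smartness" checks described above are not implemented here): the measure-side pre-processing is a fixed graph applied to the task regardless of the learner's own pipeline:

```r
library(mlr3)
library(mlr3cluster)
library(mlr3pipelines)

# fixed pre-processing for measures, independent of the learner's pipeline
fixed_pp = po("imputemean") %>>% po("scale")

# the task every cluster measure would be scored on
measure_task = fixed_pp$train(tsk("usarrests"))[[1]]
```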

Option 2: Mirror the preprocessed task from pipeline-learner

By default, the measure is calculated on the same task the learner was trained on.
If the data get manipulated like in the following pipeline, the measure is calculated on exactly these manipulated data, no matter how the user specified the manipulation.

po('imputemean') %>>% po('pca') %>>% po(lrn("clust.[lrn_id]"))

Additionally, an optional task argument in the measure enables the user to specify any preprocessed task for the measure to calculate the score on.
This could then, e.g., be a task that is specified like this:

po('imputemean') %>>% po('scale') %>>% po(lrn("clust.[lrn_id]"))

Addition: This might be supplemented by a warning if measures are calculated on tasks where features have different scales or similar issues...

Let me briefly point out 2 scenarios where your approach would be problematic:

Scenario 1:

lrn("clust.kmeans")$train(tsk("usarrests"))

The user is training a scale sensitive learner on a task with differently scaled features.

Option 1: Measures from the prediction would be magically scaled now, and the user wouldn't notice their faulty design...
Consequently, mlr3cluster would then also have to force the user to scale the learner in this scenario, right?
This would not make any sense to me, as we don't force scaling in the regression or classification context, where learners might be scale sensitive too...

Option 2: Results would be biased by the features with the larger scale, but (!) the clusters made by the learner are biased by that problem, too...

Scenario 2

The user reads in very raw data that is not even in shape for the learner to use (e.g. images etc...). They want to use mlr3 now.

Option 1: Not working. The user could do predictions but couldn't calculate measures, as the fixed pipeline is not able to shape the raw data. Imagine them seeing an error that the measure is not able to handle the data. This will probably be counterintuitive and confusing...
To me, it was always one of the key features of mlr3 pipelines that it enables you to do the whole workflow within mlr3, smoothly...

Option 2: No problems...

To be honest, the second option is still way more attractive to me, as it gives the user the freedom to calculate the measure on any data that might make sense in a certain situation!
It is more flexible, as the user can specify any detail of the task they want.
And it is more transparent, in that the user knows how they specified the pipeline that led to the measure in the output. If we dictate the pipeline, no one will ever know how it is really calculated...
I see your point that measures could be wrong if the pipeline was set up in a wrong way.
But this fixed pipeline seems to me like an arbitrary conglomerate of different preprocessing steps that might make sense.
Should we, e.g., standardize or normalize, should we impute mean or median... These decisions will all be arbitrary...
In the end, this really sounds like a trade-off between the freedom and capabilities a user has with the mlr3 package and the security of a stable measure.
But you have this problem of wrong usage in any ML context, when it comes to people that specify things incorrectly...

damirpolat commented on June 7, 2024

I thought about it again recently. My opinion: measures should be calculated on the same data on which the clustering was done. I can see @pfistfl's argument about moving the goalpost, but at the same time I think users should be the ones responsible for ensuring that their pipeline makes sense for their task. Also, I would imagine this could become a problem if there was an automated way of tuning pipelines that takes preprocessing ops into account. But does mlr3 do that now? We could deal with that later when it comes up.
Any other final thoughts: @giuseppec or @mllg?
