Add support for multinomial `LogisticRegression` models about jpmml-sparkml HOT 5 CLOSED

blublinsky commented on August 11, 2024

Add support for multinomial `LogisticRegression` models

from jpmml-sparkml.

Comments (5)

vruusmann commented on August 11, 2024

Related to #2

> The current version of the SparkML encoder supports only scalar columns as inputs, while the majority of current implementations use vectors to describe inputs.

This comes from the "ideological difference" between PMML and most ML frameworks. With PMML, you have a "model schema", which specifies field-level data type, operational type (e.g. continuous vs. categorical vs. ordinal), data preparation and validation logic, etc. With ML frameworks such as Scikit-Learn and Apache Spark ML you don't have any of that: all input is assumed to be continuous floating-point data (which can be stacked and sliced arbitrarily).

The PMML approach is by far superior, especially if you need to "look inside" the model in order to analyze/interpret it, or store it for extended periods of time.
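
To make the contrast concrete, here is a minimal sketch in plain Python. It is not PMML and not the JPMML-SparkML API: the `Field` class, the `validate` helper, and the example field names are all hypothetical, chosen only to show what field-level information (name, data type, operational type, valid values) makes possible that a bare feature vector does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for one entry of a PMML-style model schema."""
    name: str
    data_type: str            # e.g. "double", "string"
    op_type: str              # "continuous", "categorical" or "ordinal"
    valid_values: tuple = ()  # only used for non-continuous fields


def validate(schema, record):
    """Return a list of problems found in `record` according to `schema`."""
    problems = []
    for field in schema:
        if field.name not in record:
            problems.append(f"missing field: {field.name}")
            continue
        value = record[field.name]
        if field.op_type == "continuous":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{field.name}: expected a number, got {value!r}")
        elif field.valid_values and value not in field.valid_values:
            problems.append(f"{field.name}: unexpected value {value!r}")
    return problems


# Illustrative schema; the field names are made up.
schema = [
    Field("age", "double", "continuous"),
    Field("color", "string", "categorical", ("red", "green", "blue")),
]

# Schema-aware view: every value is tied to a named, typed field,
# so an invalid category can be reported by name.
print(validate(schema, {"age": 42.0, "color": "purple"}))

# Vector view: the same kind of record after encoding. Nothing in the list
# says which position was "age" or which positions encode "color".
features = [42.0, 0.0, 0.0, 1.0]
```

With only `features` in hand, neither the validation above nor a readable model description can be reconstructed without extra metadata.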

> VectorAssembler, which creates mapping metadata from the vector to the individual columns from which it was produced.

Simply make this "scalar column-to-vector element" mapping part of your main pipeline (at the moment you're doing it in a separate helper pipeline?), and it will be translated to the PMML representation.

> This metadata can be used for mapping the vector back to the individual columns. I am enclosing sample code of such an implementation for your consideration.

The JPMML-SparkML library was developed following the Apache Spark ML 1.5.X/1.6.X metadata approach. At that time, the VectorAssembler metadata only gave you input column indices, but not their names/data types/operational types.

Maybe things have improved in Apache Spark ML 2.0.X/2.1.X.

> SparkML currently supports multiple labels, while the exporter limits this to 2. Any plans on extending this?

Multi-class logistic regression models must be an Apache Spark ML 2.1.X thing? It would be trivial to implement.

I don't want to upgrade the "base version" of Apache Spark ML from 2.0.X to 2.1.X before I've caught up with some extra 2.0.X model and transformation types such as IsotonicRegressionModel, CountVectorizerModel, IDFModel (see #6) etc.

blublinsky commented on August 11, 2024

Do you have any dates in mind for upgrading the project to better support 2.1?

vruusmann commented on August 11, 2024

Can't promise anything time-wise as there is no schedule/roadmap.

If the support for multi-class LogisticRegressionModel objects is critical, then it would be possible to initiate a 1.2-SNAPSHOT development branch (that would be targeting Apache Spark ML 2.1.X), and get it done there.

vruusmann commented on August 11, 2024

I've just released JPMML-SparkML version 1.2.0, which is Apache Spark 2.1.X compatible (all re-generated integration tests pass cleanly), and supports multinomial LogisticRegression models.
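
For readers unfamiliar with the term: a multinomial logistic regression model scores each class with its own linear function and turns the scores into probabilities with softmax. The sketch below is the generic textbook formulation in plain Python, not code from JPMML-SparkML or Apache Spark; the function name and the coefficient values are made up for illustration.

```python
import math


def multinomial_probabilities(coefficients, intercepts, features):
    """Softmax over one linear score per class.

    coefficients: one list of weights per class
    intercepts:   one intercept per class
    features:     the input values, in the same order as the weights
    """
    scores = [
        b + sum(w * x for w, x in zip(ws, features))
        for ws, b in zip(coefficients, intercepts)
    ]
    # Subtracting the largest score leaves the result unchanged
    # but avoids overflow in exp() for large scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three classes, two features; all numbers are illustrative only.
coefficients = [[0.5, -1.0], [-0.25, 0.75], [0.0, 0.0]]
intercepts = [0.1, -0.2, 0.0]

probs = multinomial_probabilities(coefficients, intercepts, [1.0, 2.0])
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```

The binary case is the special instance with two classes, which is why extending an exporter from 2 labels to many mostly means emitting one set of coefficients per class.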

No support for vector columns, though. It's a matter of principle - it's impossible to generate a meaningful PMML document if there's no field-level information (name, data type, op type) available.

blublinsky commented on August 11, 2024

The majority of ML methods use Vector features (see http://spark.apache.org/docs/latest/ml-guide.html), so the way I see it, not supporting vectors is probably a dead end.
That said, I understand why you are against it.
If I may suggest, I see 2 options:

1. If a VectorAssembler (http://spark.apache.org/docs/latest/ml-features.html#vectorassembler) or something similar is used for vector creation, then the vector carries metadata describing the columns it was built from, which allows you to translate the vector into a list of fields.
2. If there is no metadata information, then you can introduce synthetic fields (v1, ..., vn), which can be used in PMML.

Will this work?
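
The two options can be sketched together in plain Python. This is only an illustration: `vector_field_names` is a hypothetical helper, and the dictionary layout (an `ml_attr` entry whose `attrs` groups hold `idx`/`name` pairs) is an assumption about how Spark ML stores attribute metadata on an assembled vector column, so check it against the Spark version in use.

```python
def vector_field_names(metadata, size):
    """Map each vector position to a field name.

    Uses per-element names from the column metadata when present (option 1)
    and falls back to synthetic names v1..vn otherwise (option 2).
    """
    # Assumed layout: {"ml_attr": {"attrs": {"numeric": [{"idx": 0, "name": "age"}, ...]}}}
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    names = {}
    for group in attrs.values():  # e.g. "numeric", "nominal", "binary"
        for attr in group:
            if "idx" in attr and "name" in attr:
                names[attr["idx"]] = attr["name"]
    return [names.get(i, f"v{i + 1}") for i in range(size)]


# Illustrative metadata; the column names are made up.
assembled = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "age"}, {"idx": 2, "name": "income"}],
            "nominal": [{"idx": 1, "name": "color", "vals": ["red", "green"]}],
        },
        "num_attrs": 3,
    }
}

print(vector_field_names(assembled, 3))  # names recovered from metadata
print(vector_field_names({}, 3))         # synthetic fallback names
```

Note that the synthetic names carry no data type or operational type, which is the objection raised earlier in the thread.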
