Comments (5)
Related to #2
The current version of the SparkML encoder supports only scalar columns as inputs, while the majority of current implementations use vectors to describe inputs.
This comes from the "ideological difference" between PMML and most ML frameworks. With PMML, you have a "model schema", which specifies field-level data type, operational type (e.g. continuous vs. categorical vs. ordinal), data preparation and validation logic etc. With ML frameworks such as Scikit-Learn and Apache Spark ML you don't have any of that - all input is assumed to be continuous floating-point data (which can be stacked and sliced arbitrarily).
The PMML approach is by far superior, especially if you need to "look inside" the model in order to analyze/interpret it, or store it for extended periods of time.
VectorAssembler creates mapping metadata from the vector to the individual columns from which it was produced.
Simply make this "scalar column-to-vector element" mapping part of your main pipeline (at the moment you're doing it in a separate helper pipeline?), and it will be translated to the PMML representation.
This metadata can be used for mapping the vector back to the individual columns. I am enclosing sample code of such an implementation for your consideration.
The JPMML-SparkML library was developed following the Apache Spark ML 1.5.X/1.6.X metadata approach. At that time, the VectorAssembler metadata only gave you input column indices, but not their names/data types/operational types.
Maybe things have improved in Apache Spark ML 2.0.X/2.1.X.
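For illustration only (this is not the sample code mentioned above, which is not reproduced in this thread), here is a minimal plain-Python sketch of the vector-to-column mapping. It assumes the `ml_attr` metadata layout that Apache Spark ML attaches to assembled vector columns, where attributes are grouped by type and each carries an `idx` and, when available, a `name`. The function name `describe_vector_slots` and the example column names are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple


def describe_vector_slots(metadata: Dict[str, Any]) -> List[Tuple[int, Optional[str], str]]:
    """Return (slot index, column name, attribute type) triples, ordered by slot index.

    `metadata` is assumed to look like the column metadata of a vector column
    produced by VectorAssembler, i.e. {"ml_attr": {"attrs": {...}, "num_attrs": n}}.
    """
    attrs_by_type = metadata.get("ml_attr", {}).get("attrs", {})
    slots = []
    for attr_type, attrs in attrs_by_type.items():
        for attr in attrs:
            # "name" may be absent, in which case None is reported for that slot
            slots.append((attr["idx"], attr.get("name"), attr_type))
    slots.sort(key=lambda slot: slot[0])
    return slots


# Hypothetical metadata for a three-element vector assembled from three columns
example_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "age"}, {"idx": 2, "name": "income"}],
            "nominal": [{"idx": 1, "name": "gender", "vals": ["f", "m"]}],
        },
        "num_attrs": 3,
    }
}

for idx, name, attr_type in describe_vector_slots(example_metadata):
    print(idx, name, attr_type)
```

In PySpark, a dictionary of this shape would typically be read from `df.schema["features"].metadata`. Whether it contains names and types, or only indices, depends on the Spark version and on how the vector column was produced.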
SparkML currently supports multiple labels, while the exporter limits this to 2. Are there any plans to extend this?
Multi-class logistic regression models must be an Apache Spark ML 2.1.X thing? It would be trivial to implement.
I don't want to upgrade the "base version" of Apache Spark ML from 2.0.X to 2.1.X before I've caught up with some extra 2.0.X model and transformation types such as IsotonicRegressionModel, CountVectorizerModel, IDFModel (see #6) etc.
from jpmml-sparkml.
Do you have any dates in mind for upgrading the project to better support 2.1?
Can't promise anything time-wise as there is no schedule/roadmap.
If the support for multi-class LogisticRegressionModel objects is critical, then it would be possible to initiate a 1.2-SNAPSHOT development branch (that would be targeting Apache Spark ML 2.1.X), and get it done there.
I've just released JPMML-SparkML version 1.2.0, which is Apache Spark 2.1.X compatible (all re-generated integration tests pass cleanly), and supports multinomial LogisticRegression models.
No support for vector columns, though. It's a matter of principle - it's impossible to generate a meaningful PMML document if there's no field-level information (name, data type, op type) available.
The majority of ML methods use Vector features (see http://spark.apache.org/docs/latest/ml-guide.html), so the way I see it, not supporting vectors is probably a dead end.
That said, I understand why you are against it.
If I may suggest - I see 2 options
If a VectorAssembler (http://spark.apache.org/docs/latest/ml-features.html#vectorassembler) or something similar is used for vector creation, then the vector carries metadata describing the columns it was built from, which allows you to translate the vector to a list of fields.
If there is no metadata information, then you can introduce synthetic fields - v1, ..., vn - which can be used in PMML.
Will this work?
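A rough sketch of how the two options could be combined, in plain Python: start from synthetic names v1 ... vn and overwrite them wherever the vector metadata supplies a column name. The `ml_attr` layout is an assumption about what Spark ML attaches to the vector column, and `field_names_for_vector` is a hypothetical helper, not part of JPMML-SparkML.

```python
from typing import Any, Dict, List


def field_names_for_vector(metadata: Dict[str, Any], size: int) -> List[str]:
    """Return one field name per vector element.

    Names found in the (assumed) "ml_attr" metadata are used where present;
    every other element gets a synthetic name v1 ... vn (1-based).
    """
    names = [f"v{i + 1}" for i in range(size)]
    attrs_by_type = metadata.get("ml_attr", {}).get("attrs", {})
    for attrs in attrs_by_type.values():
        for attr in attrs:
            idx = attr.get("idx")
            name = attr.get("name")
            # Ignore entries without a usable index or name
            if name is not None and idx is not None and 0 <= idx < size:
                names[idx] = name
    return names


# No metadata at all: purely synthetic names
print(field_names_for_vector({}, 3))

# Partial metadata: only the second element has a known source column
partial = {"ml_attr": {"attrs": {"numeric": [{"idx": 1, "name": "income"}]}}}
print(field_names_for_vector(partial, 3))
```

Synthetic names only solve the naming part. They still carry no data type or operational type, which is the field-level information the PMML side asks for.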