
Comments (5)

vruusmann avatar vruusmann commented on August 11, 2024

Related to #2

> Current version of the SparkML encoder supports only scalar columns as inputs, while the majority of current implementations use vectors to describe inputs.

This comes from the "ideological difference" between PMML and most ML frameworks. With PMML, you have a "model schema", which specifies the field-level data type, operational type (e.g. continuous vs. categorical vs. ordinal), data preparation and validation logic, etc. With ML frameworks such as Scikit-Learn and Apache Spark ML you don't have any of that: all input is assumed to be continuous floating-point data (which can be stacked and sliced arbitrarily).
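To make the "model schema" idea concrete, here is a minimal sketch in plain Python that builds a PMML-style `DataDictionary` from field-level information. The `Field` class, the `data_dictionary` helper and the sample field names are hypothetical and exist only for illustration; this is not how JPMML-SparkML builds its output. Only the `DataDictionary`, `DataField` and `Value` element names and the `name`, `optype` and `dataType` attributes come from the PMML schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Field:
    """Field-level information that a PMML model schema carries (hypothetical helper)."""
    name: str
    data_type: str      # PMML dataType, e.g. "double" or "string"
    op_type: str        # PMML optype: "continuous", "categorical" or "ordinal"
    values: tuple = ()  # valid category values, if any


def data_dictionary(fields):
    """Build a PMML-style DataDictionary element from a list of Field objects."""
    dd = ET.Element("DataDictionary", numberOfFields=str(len(fields)))
    for f in fields:
        el = ET.SubElement(dd, "DataField", name=f.name,
                           optype=f.op_type, dataType=f.data_type)
        for v in f.values:
            ET.SubElement(el, "Value", value=v)
    return dd


# Illustrative fields; a bare vector of doubles carries none of this information.
fields = [
    Field("sepal_length", "double", "continuous"),
    Field("species", "string", "categorical",
          ("setosa", "versicolor", "virginica")),
]
xml_text = ET.tostring(data_dictionary(fields), encoding="unicode")
print(xml_text)
```

Each `DataField` needs a name, a data type and an operational type, none of which can be derived from an anonymous vector of doubles.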

The PMML approach is by far superior, especially if you need to "look inside" the model in order to analyze/interpret it, or store it for extended periods of time.

> VectorAssembler creates metadata that maps the vector back to the individual columns from which it was produced.

Simply make this "scalar column-to-vector element" mapping part of your main pipeline (at the moment you're doing it in a separate helper pipeline?), and it will be translated to the PMML representation.

> This metadata can be used for mapping the vector back to the individual columns. I am enclosing sample code of such an implementation for your consideration.

The JPMML-SparkML library was developed following the Apache Spark ML 1.5.X/1.6.X metadata approach. At that time, the VectorAssembler metadata only gave you input column indices, but not their names, data types or operational types.

Maybe things have improved in Apache Spark ML 2.0.X/2.1.X.

> SparkML currently supports multiple labels, while the exporter limits this to 2. Are there any plans to extend this?

Multi-class logistic regression models must be an Apache Spark ML 2.1.X thing? They would be trivial to implement.
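For context on why the multi-class case is a small step from the binary one, here is a minimal plain-Python sketch of how the two kinds of logistic regression model turn linear predictors into probabilities. The function names and the sample coefficients are hypothetical; this illustrates the general technique and is not the JPMML-SparkML converter code.

```python
import math


def binary_proba(weights, intercept, x):
    """Binary case: a single linear predictor passed through the logistic function."""
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(coefficients, intercepts, x):
    """Multinomial case: one linear predictor per class, normalized with softmax.

    `coefficients` holds one row of weights per class and `intercepts` holds
    one intercept per class.
    """
    scores = [b + sum(w * xi for w, xi in zip(row, x))
              for row, b in zip(coefficients, intercepts)]
    return softmax(scores)


# Illustrative three-class model over two features (made-up numbers).
probabilities = predict_proba(
    [[0.4, -1.2], [0.1, 0.3], [-0.5, 0.9]],
    [0.2, 0.0, -0.2],
    [1.0, 2.0],
)
print(probabilities)
```

With two classes, one of which has all-zero coefficients, the softmax form reduces to the binary logistic form, so the multinomial model is a generalization rather than a different kind of model.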

I don't want to upgrade the "base version" of Apache Spark ML from 2.0.X to 2.1.X before I've caught up with some extra 2.0.X model and transformation types such as IsotonicRegressionModel, CountVectorizerModel, IDFModel (see #6) etc.

from jpmml-sparkml.

blublinsky avatar blublinsky commented on August 11, 2024

Do you have any dates in mind for upgrading the project to better support 2.1?


vruusmann avatar vruusmann commented on August 11, 2024

Can't promise anything time-wise as there is no schedule/roadmap.

If the support for multi-class LogisticRegressionModel objects is critical, then it would be possible to initiate a 1.2-SNAPSHOT development branch (that would be targeting Apache Spark ML 2.1.X), and get it done there.


vruusmann avatar vruusmann commented on August 11, 2024

I've just released JPMML-SparkML version 1.2.0, which is Apache Spark 2.1.X compatible (all re-generated integration tests pass cleanly), and supports multinomial LogisticRegression models.

No support for vector columns, though. It's a matter of principle - it's impossible to generate a meaningful PMML document if there's no field-level information (name, data type, op type) available.


blublinsky avatar blublinsky commented on August 11, 2024

The majority of ML methods use Vector features (http://spark.apache.org/docs/latest/ml-guide.html), so the way I see it, not supporting vectors is probably a dead end.
That said, I understand why you are against it.
If I may suggest, I see two options:
1. If a VectorAssembler (http://spark.apache.org/docs/latest/ml-features.html#vectorassembler) or something similar is used for vector creation, then the vector carries metadata describing the columns it was built from, which allows you to translate the vector into a list of fields.
2. If there is no metadata information, then you can introduce synthetic fields (v1, ..., vn), which can be used in PMML.
Will this work?
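The two options can be sketched together in plain Python: take element names from the vector column's metadata when they are present, and fall back to synthetic names otherwise. The `vector_field_names` helper and the sample metadata are hypothetical, and the nested `ml_attr` / `attrs` layout is an assumption about how Spark ML serializes attribute metadata on a vector column, so verify it against the Spark version in use.

```python
def vector_field_names(metadata, size):
    """Return one field name per vector element.

    Option 1: use the names recorded in the (assumed) "ml_attr" metadata.
    Option 2: fall back to synthetic names v1..vn for unnamed elements.
    """
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    names = {}
    # Attribute groups are keyed by kind, e.g. "numeric", "nominal", "binary".
    for group in attrs.values():
        for attr in group:
            if "name" in attr:
                names[attr["idx"]] = attr["name"]
    return [names.get(i, f"v{i + 1}") for i in range(size)]


# Assumed shape of the metadata attached to an assembled "features" column.
features_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "age"}, {"idx": 2, "name": "income"}],
            "nominal": [{"idx": 1, "name": "gender", "vals": ["f", "m"]}],
        },
        "num_attrs": 3,
    }
}

print(vector_field_names(features_metadata, 3))  # names taken from metadata
print(vector_field_names({}, 3))                 # synthetic fallback names
```

Note that names alone are not the full field-level information requested above: the data type and operational type would still have to come from somewhere, for example from the attribute kind or from the schema of the original scalar columns.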

