JPMML-SparkML

Java library and command-line application for converting Apache Spark ML pipelines to PMML.

Table of Contents

  • Features
  • Prerequisites
  • Installation
  • Usage
  • Documentation
  • License
  • Additional information

Features

Overview

  • Functionality:
    • Thorough collection, analysis and encoding of feature information:
      • Names.
      • Data and operational types.
      • Valid, invalid and missing value spaces.
    • Pipeline extensions:
      • Pruning.
      • Model verification.
    • Conversion options.
  • Extensibility:
    • Rich Java APIs for developing custom converters.
    • Automatic discovery and registration of custom converters based on META-INF/sparkml2pmml.properties resource files (see the sketch after this list).
    • Direct interfacing with other JPMML conversion libraries such as JPMML-LightGBM and JPMML-XGBoost.
  • Production quality:
    • Complete test coverage.
    • Fully compliant with the JPMML-Evaluator library.
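
To illustrate the registration mechanism named above, a META-INF/sparkml2pmml.properties resource file is a plain Java properties file whose keys are Apache Spark ML transformer class names and whose values are the corresponding converter class names. A minimal sketch, with both class names hypothetical:

# META-INF/sparkml2pmml.properties
# Maps a Spark ML transformer class (key) to its JPMML-SparkML converter class (value)
com.mycompany.ml.MyTransformer = com.mycompany.ml.MyTransformerConverter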

Supported libraries

Apache Spark ML

Examples: main.py

JPMML-SparkML
  • Feature transformers:
    • org.jpmml.sparkml.feature.InvalidCategoryTransformer
    • org.jpmml.sparkml.feature.SparseToDenseTransformer
LightGBM

Examples: LightGBMAuditNA.scala, LightGBMAutoNA.scala, etc.

XGBoost

Examples: XGBoostAuditNA.scala, XGBoostAutoNA.scala, etc.

Prerequisites

  • Apache Spark 3.0.X, 3.1.X, 3.2.X, 3.3.X, 3.4.X or 3.5.X.

Installation

Library

The JPMML-SparkML library JAR file (together with the accompanying Java source and Javadoc JAR files) is released via the Maven Central Repository.

The current version is 2.5.1 (20 June, 2024).

<dependency>
	<groupId>org.jpmml</groupId>
	<artifactId>pmml-sparkml</artifactId>
	<version>2.5.1</version>
</dependency>
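
If the project is built with sbt rather than Apache Maven, the same artifact can be declared as follows (a direct translation of the Maven coordinates above):

libraryDependencies += "org.jpmml" % "pmml-sparkml" % "2.5.1"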

Compatibility matrix

Active development branches:

| Apache Spark version | JPMML-SparkML branch |
|----------------------|----------------------|
| 3.0.X | 2.0.X |
| 3.1.X | 2.1.X |
| 3.2.X | 2.2.X |
| 3.3.X | 2.3.X |
| 3.4.X | 2.4.X |
| 3.5.X | master |

Archived development branches:

| Apache Spark version | JPMML-SparkML branch |
|----------------------|----------------------|
| 1.5.X and 1.6.X | 1.0.X |
| 2.0.X | 1.1.X |
| 2.1.X | 1.2.X |
| 2.2.X | 1.3.X |
| 2.3.X | 1.4.X |
| 2.4.X | 1.5.X |
| 3.0.X | 1.6.X |
| 3.1.X | 1.7.X |
| 3.2.X | 1.8.X |

Example application

Enter the project root directory and build using Apache Maven:

mvn clean install

The build produces two JAR files:

  • pmml-sparkml/target/pmml-sparkml-2.5-SNAPSHOT.jar - Library JAR file.
  • pmml-sparkml-example/target/pmml-sparkml-example-executable-2.5-SNAPSHOT.jar - Example application JAR file.

Usage

Library

Fitting a Spark ML pipeline:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.RFormula

val irisData = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("Iris.csv")
val irisSchema = irisData.schema

val rFormula = new RFormula().setFormula("Species ~ .")
val dtClassifier = new DecisionTreeClassifier().setLabelCol(rFormula.getLabelCol).setFeaturesCol(rFormula.getFeaturesCol)
val pipeline = new Pipeline().setStages(Array(rFormula, dtClassifier))

val pipelineModel = pipeline.fit(irisData)

Converting the fitted Spark ML pipeline to an in-memory PMML class model object:

import org.jpmml.sparkml.PMMLBuilder

val pmml = new PMMLBuilder(irisSchema, pipelineModel).build()

The representation of individual Spark ML pipeline stages can be customized via conversion options:

import org.jpmml.sparkml.PMMLBuilder
import org.jpmml.sparkml.model.HasTreeOptions

val dtClassifierModel = pipelineModel.stages(1)

val pmml = new PMMLBuilder(irisSchema, pipelineModel)
	.putOption(dtClassifierModel, HasTreeOptions.OPTION_COMPACT, false)
	.putOption(dtClassifierModel, HasTreeOptions.OPTION_ESTIMATE_FEATURE_IMPORTANCES, true)
	.build()

Viewing the in-memory PMML class model object:

import javax.xml.transform.stream.StreamResult
import org.jpmml.model.JAXBUtil

JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))
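
The marshalling call also accepts a file-backed result. A minimal sketch that writes the document to a local file instead of standard output (the file name is arbitrary):

import java.io.FileOutputStream
import javax.xml.transform.stream.StreamResult
import org.jpmml.model.JAXBUtil

// Marshal the in-memory PMML class model object into a local file
val os = new FileOutputStream("DecisionTreeIris.pmml")
try {
  JAXBUtil.marshalPMML(pmml, new StreamResult(os))
} finally {
  os.close()
}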

Example application

The example application JAR file contains an executable class org.jpmml.sparkml.example.Main, which can be used to convert a pair of serialized org.apache.spark.sql.types.StructType and org.apache.spark.ml.PipelineModel objects to PMML.

The example application JAR file does not include Apache Spark runtime libraries. Therefore, this executable class must be executed using Apache Spark's spark-submit helper script.

For example, converting a pair of Spark ML schema and pipeline serialization files pmml-sparkml/src/test/resources/schema/Iris.json and pmml-sparkml/src/test/resources/pipeline/DecisionTreeIris.zip, respectively, to a PMML file DecisionTreeIris.pmml:

spark-submit --master local --class org.jpmml.sparkml.example.Main pmml-sparkml-example/target/pmml-sparkml-example-executable-2.5-SNAPSHOT.jar --schema-input pmml-sparkml/src/test/resources/schema/Iris.json --pipeline-input pmml-sparkml/src/test/resources/pipeline/DecisionTreeIris.zip --pmml-output DecisionTreeIris.pmml

Getting help:

spark-submit --master local --class org.jpmml.sparkml.example.Main pmml-sparkml-example/target/pmml-sparkml-example-executable-2.5-SNAPSHOT.jar --help

Documentation

License

JPMML-SparkML is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.

If you would like to use JPMML-SparkML in a proprietary software project, then it is possible to enter into a licensing agreement which makes JPMML-SparkML available under the terms and conditions of the BSD 3-Clause License instead.

Additional information

JPMML-SparkML is developed and maintained by Openscoring Ltd, Estonia.

Interested in using Java PMML API software in your company? Please contact [email protected]


Issues

Error: org.apache.spark.ml.feature.VectorAssembler is not supported

The program ran successfully in Spark local mode, but when I ran the same code on Spark-on-YARN, the following error occurred (I have shaded org.jpmml to org.shaded.jpmml):

java.lang.IllegalArgumentException: Transformer class org.apache.spark.ml.feature.VectorAssembler is not supported
	at org.shaded.jpmml.sparkml.ConverterFactory.newConverter(ConverterFactory.java:53)
	at org.shaded.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:109)
	at com.iqiyi.columbus.joint_prevent.model.pmml.PMMLModelLocal$.saveModelAndEvaluate(PMMLModelLocal.scala:93)
	at com.iqiyi.columbus.joint_prevent.model.pmml.PMMLModelLocal$.main(PMMLModelLocal.scala:134)
	at com.iqiyi.columbus.joint_prevent.model.pmml.PMMLModelLocal.main(PMMLModelLocal.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)

Here are the stages of the PipelineModel and the code that builds the PMML:
val pipeline = new Pipeline().setStages(Array(vectorAssem, labelIndexer, featureIndexer, gbt, labelConverter))

val pipelineModel = trainPipelineModel(data, trainingData)
val pmml = new PMMLBuilder(schema, pipelineModel).build()

I used Spark 2.2.0, JPMML-SparkML 1.3.8 and pmml-model 1.4.3 (also tried 1.4.2; both failed).

Add support for multinomial `LogisticRegression` models

The current version of the SparkML encoder supports only scalar columns as inputs, while the majority of current implementations use vectors to describe inputs. In many cases those vectors are created using VectorAssembler, which attaches mapping metadata from the vector back to the individual columns from which it was produced. This metadata can be used for mapping the vector back to the individual columns. I am enclosing sample code of such an implementation for your consideration.
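
For reference, this per-column metadata can be inspected via Spark ML's attribute API. A minimal sketch, where assembledData stands for the DataFrame produced by VectorAssembler and "features" is the assumed name of the assembled column:

import org.apache.spark.ml.attribute.AttributeGroup

// Read the attribute group that VectorAssembler attaches to its output column
val attrGroup = AttributeGroup.fromStructField(assembledData.schema("features"))

// Print the vector index -> source column name mapping, when present
attrGroup.attributes.foreach { attrs =>
  attrs.foreach { attr =>
    println(s"${attr.index.getOrElse(-1)} -> ${attr.name.getOrElse("?")}")
  }
}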

The other issue that I have encountered is the usage of the converter for logistic regression. Spark ML currently supports multiple labels, while the exporter limits this to two. Any plans on extending this?

SparkMLEncoder.java.zip

Custom transformers for configuring continuous and categorical feature information

As requested in jpmml/jpmml-evaluator#56

The JPMML-SkLearn project defines two custom transformation types sklearn2pmml.decoration.CategoricalDomain and sklearn2pmml.decoration.ContinuousDomain, which provide the ability to configure missing value, invalid value etc. treatments. For example:

mapper = DataFrameMapper([
  ("Sepal.Length", ContinuousDomain(missing_value_treatment = "as_is", invalid_value_treatment = "as_is"))
])

The JPMML-SparkML project should provide identical functionality.

Custom Estimator to add to JPMML-SPARKML

Hi

I have a custom estimator that merges rare categorical values into a single 'RARE' value, so that I can group all the rare labels together. I would like to know whether it is possible, and how, to add my custom model converter, as you did for the standard Spark ML features.

To give an example, my custom estimator handles rare values in categorical columns. So, if there are 1000 categories and only 30 of them are used most of the time, the remaining 970 categories will be marked as RARE. So in my model I only save the rare labels. If you need, I can paste the code itself as well.

Even if I manage it, I am not sure whether JPMML-Evaluator will be able to run it.
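
The core of such an estimator can be sketched as a plain DataFrame transformation (not a full ml.Estimator, and therefore not yet convertible; the column name and threshold are illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

def mergeRareLabels(df: DataFrame, column: String, minCount: Long): DataFrame = {
  // Labels that occur at least minCount times are kept as-is
  val frequent = df.groupBy(column).count()
    .filter(col("count") >= minCount)
    .select(column)
    .collect()
    .map(_.getString(0))
    .toSet

  // All remaining labels are replaced with the literal "RARE"
  val merge = udf((value: String) => if (frequent.contains(value)) value else "RARE")

  df.withColumn(column, merge(col(column)))
}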

jpmml-sparkml with python 3

How do I get mvn -Ppyspark clean package to use Python 3.5 instead of Python 2.7?

Thank you for your help.

Support transformed labels

Running Spark 2.1.2, using jpmml-sparkml 1.2.7.

While attempting to run the following PySpark code in order to convert a simple pipeline with a RandomForestClassifier model, with either toPMMLByteArray or toPMML, I'm receiving a NullPointerException.

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import *

def updateFlightsSchema(dataSet):
    return ( dataSet.withColumn("DepDelay_Double",  dataSet["DepDelay"].cast("Double"))
                    .withColumn("DepDelay",         dataSet["DepDelay"].cast("Double"))
                    .withColumn("ArrDelay",         dataSet["ArrDelay"].cast("Double"))
                    .withColumn("Month",            dataSet["Month"].cast("Double"))
                    .withColumn("DayofMonth",       dataSet["DayofMonth"].cast("Double"))
                    .withColumn("CRSDepTime",       dataSet["CRSDepTime"].cast("Double"))
                    .withColumn("Distance",         dataSet["Distance"].cast("Double"))
                    .withColumn("AirTime",          dataSet["AirTime"].cast("Double"))
            )
    
data2007 = updateFlightsSchema(sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("mode", "DROPMALFORMED").load("2007_short.csv"))

removeCancelled = SQLTransformer(statement="select * from __THIS__ where Cancelled = \"0\" AND Diverted = \"0\"")
data2007 = removeCancelled.transform(data2007)

binarizer = Binarizer(threshold=15.0, inputCol="DepDelay_Double", outputCol="DepDelay_Bin")
featuresAssembler = VectorAssembler(inputCols=["Month", "CRSDepTime", "Distance"], outputCol="features")
rfc3 = RandomForestClassifier(labelCol="DepDelay_Bin", featuresCol="features", numTrees=3, maxDepth=5, seed=10305)

pipelineRF3 = Pipeline(stages=[binarizer, featuresAssembler, rfc3])

model3 = pipelineRF3.fit(data2007)

from py4j.java_gateway import JavaClass
from pyspark.ml.common import _py2java

javaDF = _py2java(sc, data2007)
javaSchema = javaDF.schema.__call__()

jvm = sc._gateway.jvm

javaConverter = sc._gateway.jvm.org.jpmml.sparkml.ConverterUtil
if(not isinstance(javaConverter, JavaClass)):
    raise RuntimeError("JPMML-SparkML not found on classpath")

pmml = jvm.org.jpmml.sparkml.ConverterUtil.toPMMLByteArray(javaSchema, model3._to_java())

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.jpmml.sparkml.ConverterUtil.toPMMLByteArray.
: java.lang.NullPointerException
	at org.jpmml.converter.CategoricalLabel.<init>(CategoricalLabel.java:35)
	at org.jpmml.sparkml.ModelConverter.encodeSchema(ModelConverter.java:82)
	at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:162)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:86)
	at org.jpmml.sparkml.ConverterUtil.toPMMLByteArray(ConverterUtil.java:142)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)

Following #22 I attempted to use the different Indexers on features and label columns to try and hint that these are categorical, but this resulted in the same error. Further, when I print the final tree, I do not see categorical feature declarations.

Dataset used, and tree output attached.
2007_short.zip
rfc.txt

ConverterUtil cannot convert models that are trained on sparse data

As we discussed in emails last week, I hope that this project can convert the model to PMML in the second way, as shown in the following code.
Thanks.

Hello,

Sometimes my training data is sparse, in LIBSVM format, and the DataFrame is more suitably formatted as follows, rather than using RFormula in MLlib.

root
|-- features: vector (nullable = false)
|-- label: double (nullable = false)

This kind of "data layout" contains very little feature information.
Sure, it could be converted to PMML, but in that case the "feature"
column would be expanded into n double columns "x1", "x2", .., "x_n".

You could open a feature request in the JPMML-SparkML issue tracker
(https://github.com/jpmml/jpmml-sparkml/issues), and I would take care
of it then. Also, please include reproducible sample code.

VR

  def testPMML(sc: SparkContext) = {
    val rdd = sc.makeRDD(Seq((1.0, 2.0, 3.0, 0.0), (0.0, 2.0, 0.0, 3.0) , (1.0, 0.0, 0.0, 2.0)))
      .map(a => Row(a._1, Vectors.dense(Array(a._2, a._3, a._4)).toSparse))
    val schema = StructType(List(StructField("label", DoubleType), StructField("features", new VectorUDT)))
    val sqlContext = new SQLContext(sc)
    val irisData = sqlContext.createDataFrame(rdd, schema)

    val classifier = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    // the first way
    val pipeline = new Pipeline()
      .setStages(Array(classifier))
    val pipelineModel = pipeline.fit(irisData)
    var pmml = ConverterUtil.toPMML(schema, pipelineModel)
    JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))

    // the second way
    val lrModel = classifier.fit(irisData)
    pmml = ConverterUtil.toPMML(schema, lrModel)
    JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))
  }

Spark pipeline was not converted properly

I am writing a Scala project using a Spark pipeline with a GBDT model and a StringIndexer to convert nominal columns, and I convert the pipeline model to PMML. But when I use the PMML model to predict, there is an input data mismatch. I think that is because the StringIndexer is not converted to PMML, yet JPMML supports the Spark StringIndexer operation, so I don't know why. The exception message follows.

Caused by: org.jpmml.evaluator.InvalidResultException (at or around line 23)
	at org.jpmml.evaluator.FieldValueUtil.performInvalidValueTreatment(FieldValueUtil.java:178)
	at org.jpmml.evaluator.FieldValueUtil.prepareInputValue(FieldValueUtil.java:90)
	at org.jpmml.evaluator.InputField.prepare(InputField.java:64)

java.lang.ClassNotFoundException: org.jpmml.sparkml.feature.NGramConverter

Hello,
I'm trying to use PMML export on a Spark ML model and I am getting a java.lang.ClassNotFoundException error when calling ConverterUtil.toPMML.

I dealt with those conflicts by referring to README.md and employing the Maven Shade Plugin.

Here is my pom.xml file:

  <dependency>
  	<groupId>org.jpmml</groupId>
  	<artifactId>jpmml-sparkml</artifactId>
  	<version>1.3.3</version>
  	<scope>compile</scope>
  </dependency>

  <build>
  	<resources>
  		<resource>
  			<directory>src/main/resources</directory>
  			<excludes>
  				<exclude>**/*.xml</exclude>
  			</excludes>
  			<filtering>true</filtering>
  		</resource>
  	</resources>
  	<plugins>
  		<plugin>
  			<groupId>net.alchim31.maven</groupId>
  			<artifactId>scala-maven-plugin</artifactId>
  			<version>3.2.1</version>
  			<executions>
  				<execution>
  					<id>compile</id>
  					<goals>
  						<goal>compile</goal>
  					</goals>
  					<phase>process-resources</phase>
  				</execution>
  			</executions>
  			<configuration>
  				<scalaVersion>${scala.version}</scalaVersion>
  			</configuration>
  		</plugin>
  		<plugin>
  			<groupId>org.apache.maven.plugins</groupId>
  			<artifactId>maven-shade-plugin</artifactId>
  			<version>${maven.shade.version}</version>
  			<executions>
  				<execution>
  					<phase>package</phase>
  					<goals>
  						<goal>shade</goal>
  					</goals>
  					<configuration>
  						<relocations>
  							<relocation>
  								<pattern>org.dmg.pmml</pattern>
  								<shadedPattern>org.shaded.dmg.pmml</shadedPattern>
  							</relocation>
  							<relocation>
  								<pattern>org.jpmml</pattern>
  								<shadedPattern>org.shaded.jpmml</shadedPattern>
  							</relocation>
  						</relocations>
  					</configuration>
  				</execution>
  			</executions>
  		</plugin>
  	</plugins>
  </build>

Here is the error stack trace:

18/03/06 21:40:01 WARN sparkml.ConverterUtil: Failed to load transformer converter class
java.lang.ClassNotFoundException: org.jpmml.sparkml.feature.NGramConverter
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.shaded.jpmml.sparkml.ConverterUtil.init(ConverterUtil.java:351)
	at org.shaded.jpmml.sparkml.ConverterUtil.init(ConverterUtil.java:318)
	at org.shaded.jpmml.sparkml.ConverterUtil.<clinit>(ConverterUtil.java:369)
	at com.nubia.train.Ad_ctr_train_PMML$.main(Ad_ctr_train_PMML.scala:157)
	at com.nubia.train.Ad_ctr_train_PMML.main(Ad_ctr_train_PMML.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/03/06 21:40:01 WARN sparkml.ConverterUtil: Failed to load transformer class
java.lang.ClassNotFoundException: org.apache.spark.ml.feature.MaxAbsScalerModel
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.shaded.jpmml.sparkml.ConverterUtil.init(ConverterUtil.java:341)
	at org.shaded.jpmml.sparkml.ConverterUtil.init(ConverterUtil.java:318)
	at org.shaded.jpmml.sparkml.ConverterUtil.<clinit>(ConverterUtil.java:369)
	at com.nubia.train.Ad_ctr_train_PMML$.main(Ad_ctr_train_PMML.scala:157)
	at com.nubia.train.Ad_ctr_train_PMML.main(Ad_ctr_train_PMML.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
......

Is it because of the Maven Shade Plugin?
Thanks in advance :)

Support for `boolean` target fields

A binary classification model with boolean target field cannot be converted:

val formula = new org.apache.spark.ml.feature.RFormula().setFormula("ResultConverted ~ distancevaluefromcentrallocation + availabilitynext3days")

Schema:

root
|-- ResultConverted: boolean (nullable = true)
|-- distancevaluefromcentrallocation: double (nullable = true)
|-- availabilitynext3days: long (nullable = true)

The exception is:

Exception in thread "main" java.lang.IllegalArgumentException: Expected 2 target categories, got 0 target categories
      at org.jpmml.sparkml.ModelConverter.encodeSchema(ModelConverter.java:134)
      at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:161)
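
A conceivable workaround, until native support lands, is to cast the boolean target column to a string before fitting, so that it is indexed as an ordinary binary categorical label (a sketch; df stands for the training DataFrame):

import org.apache.spark.sql.functions.col

// "true"/"false" become regular string categories that RFormula can handle
val df2 = df.withColumn("ResultConverted", col("ResultConverted").cast("string"))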

Classloading problem

I'm facing an unsolvable problem with HDP 2.4.

The Spark assembly in HDP 2.4 has an incompatible version of JPMML that is not shaded.

For the line ConverterUtil.toPMML(schema, pipe), we get this error:

java.lang.NoSuchMethodError: org.dmg.pmml.DataField.setOpType(Lorg/dmg/pmml/OpType;)Lorg/dmg/pmml/DataField;
        at org.jpmml.sparkml.FeatureMapper.toContinuous(FeatureMapper.java:185)
        at org.jpmml.sparkml.FeatureMapper.createSchema(FeatureMapper.java:135)
        at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:123)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
        at $iwC$$iwC$$iwC.<init>(<console>:45)
        at $iwC$$iwC.<init>(<console>:47)
        at $iwC.<init>(<console>:49)
        at <init>(<console>:51)
        at .<init>(<console>:55)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

What would be involved in supporting Tokenizer, IDF, and HashingTF features?

Hi, first thanks for this excellent project (and also pmml-evaluator)! I have a Spark ML pipeline which uses Tokenizer, HashingTF, and IDF in order to feed a column containing text to a multiclass classifier which predicts a category. How feasible / hard would it be to support such a pipeline in jpmml-sparkml? I was thinking about taking a shot at it. Should Tokenizer get converted to an org.dmg.pmml.DocumentTermMatrix, or something else? And what about HashingTF and IDF? What pmml objects should those be converted to? Thanks in advance

java.lang.NoSuchMethodError on org.dmg.pmml.MiningField.setUsageType

Dear Sir or Madam,
When I try to export my model, I encounter the following error.
I am using JPMML-SparkML 1.1.6 and Spark 2.0.2.

scala> val pmml = ConverterUtil.toPMML(df.schema, model)
java.lang.NoSuchMethodError: org.dmg.pmml.MiningField.setUsageType(Lorg/dmg/pmml/MiningField$UsageType;)Lorg/dmg/pmml/MiningField;
  at org.jpmml.converter.ModelUtil.createMiningField(ModelUtil.java:73)
  at org.jpmml.converter.ModelUtil.createMiningSchema(ModelUtil.java:57)
  at org.jpmml.converter.ModelUtil.createMiningSchema(ModelUtil.java:46)
  at org.jpmml.sparkml.model.RandomForestClassificationModelConverter.encodeModel(RandomForestClassificationModelConverter.java:45)
  at org.jpmml.sparkml.model.RandomForestClassificationModelConverter.encodeModel(RandomForestClassificationModelConverter.java:33)
  at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:131)
  ... 52 elided

Thanks in advance for helping.

Unsupported vector type on datasource that provides it

Hello,

We are using Spark with a custom data source that directly produces a (label, vector(features)) DataFrame, which saves us from using a VectorAssembler in the pipeline.
While this works just fine for training ML models, we can't export them to PMML using JPMML-SparkML, because we receive this error:
java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type

Looking around on various sites, I see that it comes from the fact that jpmml-sparkml does not know how to handle our dataframe. What metadata are we missing so that our models can be exported to PMML?

As a workaround, we can keep the data "split" and use a VectorAssembler, but that costs computation time which we feel is a bit wasted.

Warning about JPMML-SparkML and Apache Spark ML version incompatibility

The JPMML-SparkML project contains three active development branches (1.1.X, 1.2.X and 1.3.X), which target specific Apache Spark ML versions (2.0.X, 2.1.X and 2.2.X, respectively).

Depending on the complexity of the pipeline, the following scenarios may take place when there is a version mismatch between the two:

  1. The conversion fails (eg. by throwing some sort of exception).
  2. The conversion succeeds, but the resulting PMML document is incorrect in the sense that it contains "outdated" prediction logic, so that (J)PMML and Apache Spark ML predictions differ.
  3. The conversion succeeds, and the resulting PMML document is correct.

The JPMML-SparkML library should contain special logic to rule out the first two scenarios. It should detect the version of the Apache Spark ML environment, and refuse to execute if it's not the correct one (eg. by throwing an exception that states "This version of JPMML-SparkML is compatible with Apache Spark ML version 2.X, but the current execution environment is Apache Spark ML 2.Y").
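
A sketch of such a guard, using the version constant that Spark exposes at runtime (the expected version prefix is illustrative):

// Refuse to execute when the runtime Spark version does not match the supported one
val expectedPrefix = "2.2."
val actualVersion = org.apache.spark.SPARK_VERSION
if (!actualVersion.startsWith(expectedPrefix)) {
  throw new IllegalArgumentException(
    s"This version of JPMML-SparkML is compatible with Apache Spark ML version ${expectedPrefix}X, " +
    s"but the current execution environment is Apache Spark ML $actualVersion")
}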

val pmml_model = new PMMLBuilder(schema,pipeline_model).build() => error skip

When I train the sample, I use a StringIndexer and execute the following statement:
val pmml_model = new PMMLBuilder(schema, pipeline_model).build()

The error is:

Exception in thread "main" java.lang.IllegalArgumentException: skip
	at org.jpmml.sparkml.feature.StringIndexerModelConverter.encodeFeatures(StringIndexerModelConverter.java:65)
	at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:47)
	at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:114)
	at com.nubia.train.Ad_ctr_train$.main(Ad_ctr_train.scala:182)
	at com.nubia.train.Ad_ctr_train.main(Ad_ctr_train.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:745)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Label field is required and Features field cannot be Vector for Random Forest Regression

I have a generated RandomForestRegressionModel, which was created somewhat similarly to what is done here: https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#random-forest-regression . The main difference with mine is that the features vector is created using a VectorAssembler, and only the generated RandomForestRegressionModel is in the pipeline that I'm trying to export to PMML.

PipelineModel model = pipeline.fit(trainingData);

TrainValidationSplitModel tvsm = (TrainValidationSplitModel) model.stages()[0];
RandomForestRegressionModel rfrm = (RandomForestRegressionModel) tvsm.bestModel();

List<Transformer> stages = new ArrayList<>();
stages.add(rfrm);

final PipelineModel pipelineModel = new PipelineModel(
    UUID.randomUUID().toString(),
    stages);

StructType schema = testData.schema();

PMML pmml = ConverterUtil.toPMML(schema, pipelineModel);

When trying to export the model to PMML, I get the following exception stating that the label field doesn't exist.

Exception in thread "main" java.lang.IllegalArgumentException: Field "label" does not exist.
	at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
	at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
	at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
	at scala.collection.AbstractMap.getOrElse(Map.scala:59)
	at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)
	at org.jpmml.sparkml.SparkMLEncoder.createDataField(SparkMLEncoder.java:139)
	at org.jpmml.sparkml.SparkMLEncoder.getFeatures(SparkMLEncoder.java:73)
	at org.jpmml.sparkml.SparkMLEncoder.getOnlyFeature(SparkMLEncoder.java:60)
	at org.jpmml.sparkml.ModelConverter.encodeSchema(ModelConverter.java:66)
	at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:161)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:76)
	at org.loesoft.playground.datascience.mllib.Foo.main(Foo.java:144)

Although I know it's probably wrong to do this, I can get around it by adding the "label" field to the schema. However, I then get the following error further down, when trying to parse the "features" field:

Exception in thread "main" java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type
	at org.jpmml.sparkml.SparkMLEncoder.createDataField(SparkMLEncoder.java:160)
	at org.jpmml.sparkml.SparkMLEncoder.getFeatures(SparkMLEncoder.java:73)
	at org.jpmml.sparkml.ModelConverter.encodeSchema(ModelConverter.java:140)
	at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:161)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:76)
	at org.loesoft.playground.datascience.mllib.Foo.main(Foo.java:146)

So with this, two things stand out as issues to me. The first is the requirement of the "label" field. I do not believe this is used in the execution of the model so I'm not sure why it is required. The other is the requirement that the "features" field is expected to be a string, integral, double, or boolean when the RandomForestRegressionModel requires it to be a vector.

I'm using version 1.2.2 of jpmml-sparkml and 2.1.1 of spark-mllib_2.11

Bear in mind that I'm fairly new to Spark ML and JPMML so if I am incorrect on this matter, then I would appreciate some education as to where I am wrong.

Support for Decimal Types?

Not sure if this is a noob question, but I'm wondering why there seems to be no support for DecimalType inputs to models?

When my featuresDF includes these types, I get the following error:

IllegalArgumentExceptionTraceback (most recent call last)
<ipython-input-46-d8430332ffa6> in <module>()
      1 from jpmml_sparkml import toPMMLBytes
----> 2 pmmlBytes = toPMMLBytes(spark, DF, pipelineModel)
      3 print(pmmlBytes)

/home/hadoop/pyenv/eggs/jpmml_sparkml-1.1rc0-py2.7.egg/jpmml_sparkml/__init__.pyc in toPMMLBytes(sc, df, pipelineModel)
     17         if(not isinstance(javaConverter, JavaClass)):
     18                 raise RuntimeError("JPMML-SparkML not found on classpath")
---> 19         return javaConverter.toPMMLByteArray(javaSchema, javaPipelineModel)

/usr/lib/spark/python/lib/py4j-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: u'Expected string, integral, double or boolean type, got decimal(18,0) type'

I was eventually able to address this in pyspark with the following pre-model hack:

DF = ...
# convert all decimals to double
for f in DF.schema.fields:
    d = json.loads(f.json())
    if 'decimal' in d["type"]:
        DF = DF.withColumn(d['name'], DF[d["name"]].cast("double"))

However, I'm curious why DecimalType, which is effectively synonymous with DoubleType, is not natively supported?
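
For comparison, the same pre-model workaround can be written in Scala as a sketch that casts every DecimalType column of an arbitrary DataFrame to double:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

def decimalsToDouble(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      // Re-type decimal columns; leave everything else untouched
      case _: DecimalType => acc.withColumn(field.name, col(field.name).cast("double"))
      case _ => acc
    }
  }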

Load PMML to spark

Great, I succeeded in my demo.
Also, I want to load PMML to spark, have you considered this?

Float not supported by SparkMLEncoder

The code appears to accept only fields of data type String, Integer, Double, or Boolean.

My use case includes float columns and generates the exception below:

Caused by: java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got float type
	at org.jpmml.sparkml.SparkMLEncoder.createDataField(SparkMLEncoder.java:303)
	at org.jpmml.sparkml.SparkMLEncoder.getFeatures(SparkMLEncoder.java:232)
	at org.jpmml.sparkml.feature.VectorAssemblerConverter.encodeFeatures(VectorAssemblerConverter.java:43)
	at org.jpmml.sparkml.SparkMLEncoder.append(SparkMLEncoder.java:74)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:123)

Support for 32-bit float type?

Hi @vruusmann ,
I get java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got float type when the training data's schema contains a float type.
I think we should add FloatType to cover more general circumstances; maybe the code here could be changed?

switch(dataType){
	case STRING:
		feature = new WildcardFeature(this, dataField);
		break;
	case INTEGER:
	case DOUBLE:
		feature = new ContinuousFeature(this, dataField);
		break;
	case BOOLEAN:
		feature = new BooleanFeature(this, dataField);
		break;
	default:
		throw new IllegalArgumentException("Data type " + dataType + " is not supported");
}

Thank you!
Bests,
Yuanda

Training a model using Spark and predicting the same data using JPMML-Evaluator gives low accuracy

@vruusmann, when I train a model using Spark and predict the same data using JPMML-Evaluator, I get low accuracy. What's wrong with my code?

  1. Spark trains a GBDT model and saves it in PMML format; the training accuracy is 0.8516456322692403:
val Array(training, test) = data.toDF("label","degree","tcNum","pageRank","commVertexNum","normQ","gtRate","eqRate","ltRate").randomSplit(Array(1.0 - fracTest, fracTest), 1234)
// Set up Pipeline
    val stages = new mutable.ArrayBuffer[PipelineStage]()
    // (1) For classification, re-index classes.
    val labelColName = if (algo == "classification") "indexedLabel" else "label"
    if (algo == "classification") {
      val labelIndexer = new StringIndexer()
        .setInputCol("label")
        .setOutputCol(labelColName)
      stages += labelIndexer
    }

    val vectorAssember = new VectorAssembler()
    vectorAssember.setInputCols(Array("degree","tcNum","pageRank","commVertexNum","normQ","gtRate","eqRate","ltRate"))
    vectorAssember.setOutputCol("features")
    val vectorData = vectorAssember.transform(training)

//    val vectorData = vectorAssember.transform(training)

    stages += vectorAssember
    // (3) Learn GBT.
    val dt = algo match {
      case "classification" =>
        new GBTClassifier()
          .setLabelCol(labelColName)
          .setFeaturesCol("features")
          .setMaxDepth(params.maxDepth)
          .setMaxBins(params.maxBins)
          .setMinInstancesPerNode(params.minInstancesPerNode)
          .setMinInfoGain(params.minInfoGain)
          .setCacheNodeIds(params.cacheNodeIds)
          .setCheckpointInterval(params.checkpointInterval)
          .setMaxIter(params.maxIter)
      case "regression" =>
        new GBTRegressor()
          .setFeaturesCol("features")
          .setLabelCol(labelColName)
          .setMaxDepth(params.maxDepth)
          .setMaxBins(params.maxBins)
          .setMinInstancesPerNode(params.minInstancesPerNode)
          .setMinInfoGain(params.minInfoGain)
          .setCacheNodeIds(params.cacheNodeIds)
          .setCheckpointInterval(params.checkpointInterval)
          .setMaxIter(params.maxIter)
      case _ => throw new IllegalArgumentException(s"Algo ${params.algo} not supported.")
    }
    stages += dt
    val pipeline = new Pipeline().setStages(stages.toArray)

    // Fit the Pipeline.
    val startTime = System.nanoTime()
    val pipelineModel = pipeline.fit(training)
    val elapsedTime = (System.nanoTime() - startTime) / 1e9
    println(s"Training time: $elapsedTime seconds")

    /**
      * write model pmml format to hdfs
      */
    val modelPmmlPath = "sjmei/pmmlmodel"
    val pmml = ConverterUtil.toPMML(training.schema, pipelineModel);
//    val conf = new Configuration();
//    HadoopFileUtil.deleteFile(modelPmmlPath)
//    val path = new Path(modelPmmlPath);
//    val fs = path.getFileSystem(conf);
//    val out = fs.create(path);
    MetroJAXBUtil.marshalPMML(pmml, new FileOutputStream(modelPmmlPath));

2. Load the PMML model and use JPMML-Evaluator to predict the data; the prediction accuracy is only:

acc count:4537
error count:5553
acc rate:0.44965312190287415
public class ScoreTest {

    public static void main(String[] args) throws Exception {
        PMML pmml = readPMML(new File("sjmei/pmmlmodel/rf.pmml"));
        ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
//        System.out.println(pmml.getModels().get(0));
//        Evaluator evaluator = modelEvaluatorFactory.newModelEvaluator(pmml);
        ModelEvaluator evaluator = new MiningModelEvaluator(pmml);

        List<InputField> inputFields = evaluator.getInputFields();

        InputStream is = new FileInputStream(new File("jrdm-dm/data/graph.result.final.vertices.wide.tbl/part-00000"));
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        String line;

        int diffDelta = 0;
        int sameDelta = 0;
        while((line = br.readLine()) != null) {
            String[] splits = line.split("\t",-1);

            String label = splits[14];

            Map<FieldName, FieldValue> arguments = readArgumentsFromLine(splits, inputFields);

            Map<FieldName, ?> results = evaluator.evaluate(arguments);
//            System.out.println(results);
            List<TargetField> targetFields = evaluator.getTargetFields();
            for(TargetField targetField : targetFields){
                FieldName targetFieldName = targetField.getName();
                Object targetFieldValue = results.get(targetFieldName);

                ProbabilityDistribution nodeMap = (ProbabilityDistribution)targetFieldValue;
                Object result = nodeMap.getResult();
                if(String.valueOf(transToDouble(label)).equalsIgnoreCase(result.toString())){
                    sameDelta +=1;
                }else{
                    diffDelta +=1;
                }
            }
        }

        System.out.println("acc count:"+sameDelta);
        System.out.println("error count:"+diffDelta);
        System.out.println("acc rate:"+(sameDelta*1.0d/(sameDelta+diffDelta)));

    }

    /**
     * Read the PMML model from a file
     * @param file
     * @return
     * @throws Exception
     */
    public static PMML readPMML(File file) throws Exception {


        String pmmlString = new Scanner(file).useDelimiter("\\Z").next();
        InputStream is = new ByteArrayInputStream(pmmlString.getBytes());
        InputSource source = new InputSource(is);
        SAXSource transformedSource = ImportFilter.apply(source);

        return JAXBUtil.unmarshalPMML(transformedSource);
    }

    /**
     * Build the model input feature fields
     * @param splits
     * @param inputFields
     * @return
     */
    public static Map<FieldName, FieldValue> readArgumentsFromLine(String[] splits, List<InputField> inputFields) {

        List<Double> lists = new ArrayList<Double>();
        lists.add(Double.valueOf(splits[3]));
        lists.add(Double.valueOf(splits[4]));
        lists.add(Double.valueOf(splits[5]));
        lists.add(Double.valueOf(splits[7]));
        lists.add(Double.valueOf(splits[8]));
        lists.add(Double.valueOf(splits[9]));
        lists.add(Double.valueOf(splits[10]));
        lists.add(Double.valueOf(splits[11]));

        Map<FieldName, FieldValue> arguments = new LinkedHashMap<FieldName, FieldValue>();

        int i = 0;
        for(InputField inputField : inputFields){
            FieldName inputFieldName = inputField.getName();
            Object rawValue = lists.get(i);
            FieldValue inputFieldValue = inputField.prepare(rawValue);

            arguments.put(inputFieldName, inputFieldValue);
            i+=1;
        }

        return arguments;
    }

    public static Double transToDouble(String label) {
        try {
            return Double.valueOf(label);
        }catch (Exception e){
            return Double.valueOf(0);
        }
    }
}

java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type

Hello,

I encountered some problems when using the JPMML model transformation. This is my data source:
val trainingDataFrame = spark.read.format("libsvm").load(libsvmDataPath).toDF("label", "features")
The schema of "trainingDataFrame" contains the VectorUDT type, so when I use ConverterUtil.toPMML (newSchema, loadedModel), it will prompt java.lang.IllegalArgumentException.
Here is the code:

  val training = spark.read.format("libsvm").load(libsvmDataPath).toDF("label", "features")

  val vi = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexed")
      .setMaxCategories(693)

   val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(3)

   val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
      .setProbabilityCol("myProbability")

    val pipeline = new Pipeline().setStages(Array(vi, pca, lr))

    val model = pipeline.fit(training)

    model.write.overwrite().save(modelSavePath)

    training.show(10)
    println("==========================")
    println("traing dataframe's schema is:  " + training.schema.mkString)
    println("==========================")
    val schema = training.schema
    val pmml = ConverterUtil.toPMML(schema, model)
    JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))

The full stack trace is:

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|
|  0.0|(692,[129,130,131...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[99,100,101,...|
|  0.0|(692,[154,155,156...|
|  0.0|(692,[127,128,129...|
+-----+--------------------+
only showing top 10 rows

==========================
training dataframe's schema is:
StructField(label,DoubleType,true)StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)
==========================
Exception in thread "main" java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type
	at org.jpmml.sparkml.SparkMLEncoder.createDataField(SparkMLEncoder.java:160)
	at org.jpmml.sparkml.SparkMLEncoder.getFeatures(SparkMLEncoder.java:73)
	at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(VectorIndexerModelConverter.java:56)
	at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:47)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:75)
	at com.myhexin.oryx.batchlayer.TestPMML$.trainModel(TestPMML.scala:138)
	at com.myhexin.oryx.batchlayer.TestPMML$.main(TestPMML.scala:29)
	at com.myhexin.oryx.batchlayer.TestPMML.main(TestPMML.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

What should I do to solve this VectorUDT unsupported problem?

mvn: Importing the multiarray numpy extension module failed

When I try to build the jpmml-sparkml package with the pyspark profile:
mvn -Ppyspark clean package

I am getting an error:

Traceback (most recent call last):
  File "setup.py", line 3, in <module>
    from jpmml_sparkml import __license__, __version__
  File "/home/bluedata/jpmml-sparkml-package/target/egg-sources/jpmml_sparkml/__init__.py", line 4, in <module>
    from pyspark.ml.common import _py2java
  File "/usr/lib/spark/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/__init__.py", line 22, in <module>
    from pyspark.ml.base import Estimator, Model, Transformer
  File "/usr/lib/spark/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/base.py", line 21, in <module>
    from pyspark.ml.param import Params
  File "/usr/lib/spark/spark-2.1.1-bin-hadoop2.7/python/pyspark/ml/param/__init__.py", line 26, in <module>
    import numpy as np
  File "/usr/lib64/python3.4/site-packages/numpy/__init__.py", line 142, in <module>
    from . import add_newdocs
  File "/usr/lib64/python3.4/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/usr/lib64/python3.4/site-packages/numpy/lib/__init__.py", line 8, in <module>
    from .type_check import *
  File "/usr/lib64/python3.4/site-packages/numpy/lib/type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "/usr/lib64/python3.4/site-packages/numpy/core/__init__.py", line 26, in <module>
    raise ImportError(msg)
ImportError:
Importing the multiarray numpy extension module failed.  Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control).  Otherwise reinstall numpy.

Original error was: **cannot import name multiarray**

[ERROR] Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
        at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
        at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
        at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:764)
        at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:711)
        at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:289)
        at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
        at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.jav
        at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:309)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:194)
        at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:107)
        at org.apache.maven.cli.MavenCli.execute(MavenCli.java:993)
        at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:345)
        at org.apache.maven.cli.MavenCli.main(MavenCli.java:191)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
        at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
        at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)

My PYTHONPATH variable:
/usr/lib64/python3.4/site-packages:/usr/lib/spark/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/usr/lib/spark/spark-2.1.1-bin-hadoop2.7/python/:/opt/bluedata/vagent/vagent/python:/opt/bluedata/vagent/vagent/python

I uninstalled numpy (numpy-1.13.0) and installed again - no progress.

This error does not appear when I build without the pyspark profile:
mvn clean package
However, no EGG file is created, and when I try to run code from Zeppelin:

from jpmml_sparkml import toPMMLBytes
pmmlBytes = toPMMLBytes(sc, vectorized_CV_data, CV_model)
print(pmmlBytes.decode("UTF-8"))

I am getting:
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-1803236182413842559.py", line 337, in <module>
    exec(code)
  File "<stdin>", line 1, in <module>
ImportError: No module named 'jpmml_sparkml'

Any help on how to solve this issue would be appreciated.

Thanks, Michal

java.lang.NoClassDefFoundError: org/dmg/pmml/mining/MiningModel

I'm testing JPMML-SparkML in the spark-shell. We are running it on top of YARN, using Spark 2.0.1 and Scala 2.11.

I built the JAR for the package and start the session like:

$SPARK_HOME/bin/spark-shell --jars jpmml-sparkml-1.1-SNAPSHOT.jar --packages com.databricks:spark-avro_2.11:3.0.1 --master yarn --deploy-mode client

However, I get an error when exporting a pipeline with toPMMLByteArray.

import org.jpmml.sparkml.ConverterUtil

.... all of  the code to create the pipeline

val sparkPipelinePMMLEstimator = new Pipeline().setStages( categoricalFeatureIndexers.union(categoricalFeatureOneHotEncoders.union(Seq(featureAssemblerLr, featureAssemblerRf) )) :+ randomForest)

val sparkPipelinePMML = sparkPipelinePMMLEstimator.fit(df)
val pmmlBytes = org.jpmml.sparkml.ConverterUtil.toPMMLByteArray(df.schema, sparkPipelinePMML)

This fails with the following error:

scala> val pmmlBytes = org.jpmml.sparkml.ConverterUtil.toPMMLByteArray(df.schema, sparkPipelinePMML)
java.lang.NoClassDefFoundError: org/dmg/pmml/mining/MiningModel
  ... 48 elided
Caused by: java.lang.ClassNotFoundException: org.dmg.pmml.mining.MiningModel
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 48 more

It seems like the package org.dmg.pmml.mining is missing. How can I fix this issue?

Can it support Imputer?

Imputer is supported in JPMML-SkLearn, where it produces 'missingValueReplacement' and 'missingValueTreatment' attributes in PMML. Can JPMML-SparkML support it too?

Add VectorToScalar transformer class

Use case: A classification model is returning a probability distribution. The data scientist wants to extract the probability of a specific class out of it, and apply further transformations to it ("decision engineering").

The probability distribution is returned as VectorUDT. It is possible to splice it into a one-element VectorUDT using ml.feature.VectorSlicer. However, most common transformer classes (eg. ml.feature.Bucketizer) refuse to accept vector as input.

The VectorToScalar pseudo-transformer class would simply unwrap a single-element vector to a scalar numeric value (ie. int, float or double). The data type of the output column can be manually overridden.
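
Until such a transformer class exists, the unwrapping itself can be approximated with a UDF (a sketch that is not PMML-convertible; the column names and the input DataFrame df are illustrative):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Unwrap the first (and only) element of a single-element vector column
val vectorToScalar = udf((v: Vector) => v(0))

val result = df.withColumn("probability", vectorToScalar(col("slicedProbability")))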

spark gbtmodel Segmentation, MiningField as feature?

Hello. My PMML is as follows. I do not know why "label" has usageType="target" in the MiningSchema, but "label" is an active field in MiningModel/Segmentation[Segment@id=1]?

<DataDictionary>
		<DataField name="label" optype="categorical" dataType="double">
			<Value value="0.0"/>
			<Value value="1.0"/>
		</DataField>
		<DataField name="feature_442" optype="continuous" dataType="double"/>
		<DataField name="feature_443" optype="continuous" dataType="double"/>
		<DataField name="feature_481" optype="continuous" dataType="double"/>
		<DataField name="feature_894" optype="continuous" dataType="double"/>
		<DataField name="feature_1862" optype="continuous" dataType="double"/>
	</DataDictionary>
	<MiningModel functionName="classification">
		<MiningSchema>
			<MiningField name="label" usageType="target"/>
			<MiningField name="feature_442"/>
			<MiningField name="feature_443"/>
			<MiningField name="feature_481"/>
			<MiningField name="feature_894"/>
			<MiningField name="feature_1862"/>
		</MiningSchema>
		<Segmentation multipleModelMethod="modelChain">
			<Segment id="1">
				<True/>
				<MiningModel functionName="regression">
					<MiningSchema>
						<MiningField name="feature_442"/>
						<MiningField name="feature_443"/>
						<MiningField name="feature_481"/>
						<MiningField name="feature_894"/>
						<MiningField name="feature_1862"/>
						<MiningField name="label"/>
					</MiningSchema>
					<Output>
						<OutputField name="gbtValue" optype="continuous" dataType="double" feature="predictedValue" isFinalResult="false"/>
						<OutputField name="binarizedGbtValue" optype="continuous" dataType="double" feature="transformedValue" isFinalResult="false">
							<Apply function="if">
								<Apply function="greaterThan">
									<FieldRef field="gbtValue"/>
									<Constant dataType="double">0</Constant>
								</Apply>
								<Constant dataType="double">-1</Constant>
								<Constant dataType="double">1</Constant>
							</Apply>
						</OutputField>
					</Output>
					<Segmentation multipleModelMethod="sum">
						<Segment id="1">
							<True/>
							<TreeModel functionName="regression" splitCharacteristic="binarySplit">
								<MiningSchema>
									<MiningField name="label"/>
								</MiningSchema>
								<Node score="-0.08980349484734046">
									<True/>
									<Node score="-1">
										<SimplePredicate field="label" operator="lessOrEqual" value="0"/>
									</Node>
									<Node score="1">
										<SimplePredicate field="label" operator="greaterThan" value="0"/>
									</Node>
								</Node>
							</TreeModel>
						</Segment>
						<Segment id="2">
							<True/>
							<TreeModel functionName="regression" splitCharacteristic="binarySplit">
								<MiningSchema>
									<MiningField name="feature_442"/>
									<MiningField name="feature_443"/>
									<MiningField name="feature_481"/>
									<MiningField name="feature_894"/>
									<MiningField name="feature_1862"/>
									<MiningField name="label"/>
								</MiningSchema>
								<Targets>
									<Target rescaleFactor="0.1"/>
								</Targets>
								<Node score="-0.04281935597440249">
									<True/>
									<Node score="-0.47681168808845653">
										<SimplePredicate field="label" operator="lessOrEqual" value="0"/>
										<Node score="-0.47681168808847174">
											<SimplePredicate field="feature_442" operator="lessOrEqual" value="-0.5888127277121523"/>
											<Node score="-0.4768116880884725">
												<SimplePredicate field="feature_894" operator="lessOrEqual" value="-0.6830283900955506"/>
											</Node>
											<Node score="-0.47681168808847285">
												<SimplePredicate field="feature_894" operator="greaterThan" value="-0.6830283900955506"/>
											</Node>
										</Node>
										<Node score="-0.47681168808847096">
											<SimplePredicate field="feature_442" operator="greaterThan" value="-0.5888127277121523"/>
											<Node score="-0.47681168808847013">
												<SimplePredicate field="feature_443" operator="lessOrEqual" value="-1.2352702594745397"/>
											</Node>
											<Node score="-0.4768116880884723">
												<SimplePredicate field="feature_443" operator="greaterThan" value="-1.2352702594745397"/>
											</Node>
										</Node>
									</Node>
									<Node score="0.47681168808845853">
										<SimplePredicate field="label" operator="greaterThan" value="0"/>
										<Node score="0.47681168808846963">
											<SimplePredicate field="feature_1862" operator="lessOrEqual" value="-1.38258310890975"/>
											<Node score="0.4768116880884702">
												<SimplePredicate field="feature_481" operator="lessOrEqual" value="-1.128558484240802"/>
											</Node>
											<Node score="0.4768116880884703">
												<SimplePredicate field="feature_481" operator="greaterThan" value="-1.128558484240802"/>
											</Node>
										</Node>
										<Node score="0.47681168808847163">
											<SimplePredicate field="feature_1862" operator="greaterThan" value="-1.38258310890975"/>
										</Node>
									</Node>
								</Node>
							</TreeModel>
						</Segment>
					</Segmentation>
				</MiningModel>
			</Segment>
...

Customizing the "missing value handling"-mode of models

Hi,

I am using jpmml-sparkml version 1.2.4 to generate PMML models in Spark (using Scala) and saving the output to the local file system, but I can't figure out how to set the following properties:

<xs:attribute name="missingValueStrategy" type="MISSING-VALUE-STRATEGY" default="none"/>
<xs:attribute name="missingValuePenalty" type="PROB-NUMBER" default="1.0"/>
<xs:attribute name="noTrueChildStrategy" type="NO-TRUE-CHILD-STRATEGY" default="returnNullPrediction"/>

I have searched online but haven't found any clues.

I'd really appreciate the help.

Thanks,
Raj
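
An editorial sketch, not an official conversion option: these attributes do not appear to be customizable in jpmml-sparkml 1.2.x, but since the converter returns an org.dmg.pmml.PMML class model object, it can be post-processed with a JPMML-Model visitor before marshalling. The enum and setter names below are taken from the JPMML-Model class model as best understood, so treat them as an assumption:

import org.dmg.pmml.{PMML, VisitorAction}
import org.dmg.pmml.tree.TreeModel
import org.jpmml.model.visitors.AbstractVisitor

// Rewrite the missing value handling attributes of every TreeModel element
// in an already-built PMML object
def customizeTreeModels(pmml: PMML): Unit = {
  val visitor = new AbstractVisitor {
    override def visit(treeModel: TreeModel): VisitorAction = {
      treeModel.setMissingValueStrategy(TreeModel.MissingValueStrategy.DEFAULT_CHILD)
      treeModel.setMissingValuePenalty(0.8d)
      treeModel.setNoTrueChildStrategy(TreeModel.NoTrueChildStrategy.RETURN_LAST_PREDICTION)
      super.visit(treeModel)
    }
  }
  visitor.applyTo(pmml)
}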

Maven Dependency For JPMML Libraries

I want to use the JPMML libraries in my project through Maven repositories. Currently, I am using them by installing the JPMML JAR into my local .m2 directory, but I want direct dependency coordinates from Maven Central.

Tests fail when installing

I have Spark pipelines from Spark 2.1, so I checked out the tag 1.2.7 to build.

The build fails. I don't think it is related to the conflicts mentioned in the README.

These are the logs

mvn clean install

[INFO] Error stacktraces are turned on.
[INFO] Scanning for projects...
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] Building JPMML-SparkML 1.2.7
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ jpmml-sparkml ---
[INFO] Deleting /home/delhivery/dev/jpmml-sparkml/target
[INFO] 
[INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-maven) @ jpmml-sparkml ---
[INFO] 
[INFO] --- maven-enforcer-plugin:1.4.1:enforce (default) @ jpmml-sparkml ---
[INFO] 
[INFO] --- jacoco-maven-plugin:0.7.9:prepare-agent (pre-unit-test) @ jpmml-sparkml ---
[INFO] jacoco.agent set to -javaagent:/home/delhivery/.m2/repository/org/jacoco/org.jacoco.agent/0.7.9/org.jacoco.agent-0.7.9-runtime.jar=destfile=/home/delhivery/dev/jpmml-sparkml/target/jacoco.exec
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ jpmml-sparkml ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO] 
[INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @ jpmml-sparkml ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 53 source files to /home/delhivery/dev/jpmml-sparkml/target/classes
[INFO] /home/delhivery/dev/jpmml-sparkml/src/main/java/org/jpmml/sparkml/WeightedTermFeature.java: Some input files use or override a deprecated API.
[INFO] /home/delhivery/dev/jpmml-sparkml/src/main/java/org/jpmml/sparkml/WeightedTermFeature.java: Recompile with -Xlint:deprecation for details.
[INFO] /home/delhivery/dev/jpmml-sparkml/src/main/java/org/jpmml/sparkml/ConverterUtil.java: /home/delhivery/dev/jpmml-sparkml/src/main/java/org/jpmml/sparkml/ConverterUtil.java uses unchecked or unsafe operations.
[INFO] /home/delhivery/dev/jpmml-sparkml/src/main/java/org/jpmml/sparkml/ConverterUtil.java: Recompile with -Xlint:unchecked for details.
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ jpmml-sparkml ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 62 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.5.1:testCompile (default-testCompile) @ jpmml-sparkml ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 5 source files to /home/delhivery/dev/jpmml-sparkml/target/test-classes
[INFO] 
[INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ jpmml-sparkml ---
[INFO] Surefire report directory: /home/delhivery/dev/jpmml-sparkml/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Exception in thread "main" java.lang.reflect.InvocationTargetException
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:510)
	at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:522)
Caused by: java.lang.RuntimeException: Class java/util/UUID could not be instrumented.
	at org.jacoco.agent.rt.internal_8ff85ea.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:140)
	at org.jacoco.agent.rt.internal_8ff85ea.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:101)
FATAL ERROR in native method: processing of -javaagent failed
	at org.jacoco.agent.rt.internal_8ff85ea.PreMain.createRuntime(PreMain.java:55)
	at org.jacoco.agent.rt.internal_8ff85ea.PreMain.premain(PreMain.java:47)
	... 6 more
Caused by: java.lang.NoSuchFieldException: $jacocoAccess
	at java.base/java.lang.Class.getField(Class.java:1958)
	at org.jacoco.agent.rt.internal_8ff85ea.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:138)
	... 9 more
Aborted (core dumped)

Results :

Tests run: 0, Failures: 0, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8.266 s
[INFO] Finished at: 2018-05-23T16:26:03+05:30
[INFO] Final Memory: 79M/270M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project jpmml-sparkml: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /home/delhivery/dev/jpmml-sparkml && /usr/lib/jvm/java-11-openjdk-amd64/bin/java -javaagent:/home/delhivery/.m2/repository/org/jacoco/org.jacoco.agent/0.7.9/org.jacoco.agent-0.7.9-runtime.jar=destfile=/home/delhivery/dev/jpmml-sparkml/target/jacoco.exec -jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefirebooter236796039577445911.jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire9119811316382229884tmp /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire_010927105446816610308tmp
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project jpmml-sparkml: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Command was /bin/sh -c cd /home/delhivery/dev/jpmml-sparkml && /usr/lib/jvm/java-11-openjdk-amd64/bin/java -javaagent:/home/delhivery/.m2/repository/org/jacoco/org.jacoco.agent/0.7.9/org.jacoco.agent-0.7.9-runtime.jar=destfile=/home/delhivery/dev/jpmml-sparkml/target/jacoco.exec -jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefirebooter236796039577445911.jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire9119811316382229884tmp /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire_010927105446816610308tmp
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:213)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:154)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:146)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:51)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:309)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:194)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:107)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:955)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:290)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:194)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:564)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test failed: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Command was /bin/sh -c cd /home/delhivery/dev/jpmml-sparkml && /usr/lib/jvm/java-11-openjdk-amd64/bin/java -javaagent:/home/delhivery/.m2/repository/org/jacoco/org.jacoco.agent/0.7.9/org.jacoco.agent-0.7.9-runtime.jar=destfile=/home/delhivery/dev/jpmml-sparkml/target/jacoco.exec -jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefirebooter236796039577445911.jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire9119811316382229884tmp /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire_010927105446816610308tmp
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:145)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:208)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:154)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:146)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:51)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:309)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:194)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:107)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:955)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:290)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:194)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:564)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:356)
Caused by: java.lang.RuntimeException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Command was /bin/sh -c cd /home/delhivery/dev/jpmml-sparkml && /usr/lib/jvm/java-11-openjdk-amd64/bin/java -javaagent:/home/delhivery/.m2/repository/org/jacoco/org.jacoco.agent/0.7.9/org.jacoco.agent-0.7.9-runtime.jar=destfile=/home/delhivery/dev/jpmml-sparkml/target/jacoco.exec -jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefirebooter236796039577445911.jar /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire9119811316382229884tmp /home/delhivery/dev/jpmml-sparkml/target/surefire/surefire_010927105446816610308tmp
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork (ForkStarter.java:590)
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork (ForkStarter.java:460)
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run (ForkStarter.java:229)
    at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run (ForkStarter.java:201)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider (AbstractSurefireMojo.java:1026)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked (AbstractSurefireMojo.java:862)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute (AbstractSurefireMojo.java:755)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:134)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:208)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:154)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:146)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:51)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:309)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:194)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:107)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:955)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:290)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:194)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:564)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:356)
[ERROR] 
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
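
A hedged observation on the log above: the forked test JVM is Java 11 (/usr/lib/jvm/java-11-openjdk-amd64), while the configured JaCoCo agent is version 0.7.9, which predates Java 11 support; the agent crash ("Class java/util/UUID could not be instrumented") is consistent with that mismatch. A possible workaround is to run the build with a Java 8 JDK, adjusting the path to your local installation:

JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 mvn clean install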

Add "dummy" estimator classes

Hello:
When I use the converter like this:

val oneHotPMML = ConverterUtil.toPMML(onehotSource.schema, oneHotModel)

I got an error like this:

Exception in thread "main" java.lang.IllegalArgumentException: Expected a pipeline with one or more models, got a pipeline with zero models
	at com.netease.mail.yanxuan.rms.utils.ConverterUtil.toPMML(ConverterUtil.java:118)
	at com.netease.mail.yanxuan.rms.scala.nn.feature.FeatureModelExport$.main(FeatureModelExport.scala:29)
	at com.netease.mail.yanxuan.rms.scala.nn.feature.FeatureModelExport.main(FeatureModelExport.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

After debugging, I found the reason: there wasn't any model stage (and hence no ModelConverter) in my pipeline model.
Is it required that a pipeline model contains at least one model stage?

java.lang.IllegalArgumentException when calling ConverterUtil.toPMML

Hello,

I'm trying to use PMML export on a very small data sample with DecisionTreeRegressor, and I am getting a java.lang.IllegalArgumentException error when calling ConverterUtil.toPMML.
Here is the code:

val sourceData = session.read.format("myformat").
  load(DataFileURL)

val assembler = new VectorAssembler()
  .setInputCols(Array("X1", "X2", "X3"))
  .setOutputCol("features")

val dt = new DecisionTreeRegressor()
  .setLabelCol("Y")
  .setFeaturesCol("features")
  .setImpurity("variance")
  .setMaxDepth(30)
  .setMaxBins(32)

val pipeline = new Pipeline().setStages(Array(assembler, dt))

val model = pipeline.fit(sourceData)

val pmml = ConverterUtil.toPMML(sourceData.schema, model)

X1 and X2 carry a NominalAttribute, while X3 carries a NumericAttribute.

If I print the DecisionTreeRegressionModel, I get this result:

DecisionTreeRegressionModel (uid=dtr_8e66c6f292fc) of depth 7 with 57 nodes
  If (feature 2 <= 20.0)
   If (feature 1 in {0.0,1.0})
    If (feature 1 in {1.0})
     If (feature 0 in {0.0})
      Predict: 309.38
     Else (feature 0 not in {0.0})
      Predict: 569.6666666666669
    Else (feature 1 not in {1.0})
     If (feature 0 in {0.0})
      Predict: 583.5585714285714
     Else (feature 0 not in {0.0})
      Predict: 591.8775
   Else (feature 1 not in {0.0,1.0})
    If (feature 0 in {1.0})
     Predict: 1882.7800000000002
    Else (feature 0 not in {1.0})
     Predict: 2435.3799999999997
  Else (feature 2 > 20.0)
   If (feature 1 in {0.0})
    If (feature 2 <= 22.0)
     If (feature 0 in {0.0})
      If (feature 2 <= 21.0)
       Predict: 160.80599999999998
      Else (feature 2 > 21.0)
       Predict: 418.02833333333336
     Else (feature 0 not in {0.0})
      If (feature 2 <= 21.0)
       Predict: 636.2533333333334
      Else (feature 2 > 21.0)
       Predict: 273.82000000000005
    Else (feature 2 > 22.0)
     If (feature 2 <= 24.0)
      If (feature 0 in {0.0})
       If (feature 2 <= 23.0)
        Predict: 196.11
       Else (feature 2 > 23.0)
        Predict: 214.44
      Else (feature 0 not in {0.0})
       Predict: 303.5300000000001
     Else (feature 2 > 24.0)
      Predict: 152.13000000000002
   Else (feature 1 not in {0.0})
    If (feature 2 <= 22.0)
     If (feature 1 in {2.0})
      If (feature 2 <= 21.0)
       If (feature 0 in {1.0})
        Predict: 238.91666666666666
       Else (feature 0 not in {1.0})
        Predict: 244.89999999999998
      Else (feature 2 > 21.0)
       Predict: 333.3599999999999
     Else (feature 1 not in {2.0})
      If (feature 2 <= 21.0)
       If (feature 0 in {0.0})
        Predict: 387.8825
       Else (feature 0 not in {0.0})
        Predict: 446.2525
      Else (feature 2 > 21.0)
       If (feature 0 in {1.0})
        Predict: 316.75
       Else (feature 0 not in {1.0})
        Predict: 402.85714285714283
    Else (feature 2 > 22.0)
     If (feature 2 <= 24.0)
      If (feature 0 in {0.0})
       If (feature 2 <= 23.0)
        Predict: 239.59000000000003
       Else (feature 2 > 23.0)
        If (feature 1 in {2.0})
         Predict: 541.51
        Else (feature 1 not in {2.0})
         Predict: 1087.8500000000001
      Else (feature 0 not in {0.0})
       If (feature 2 <= 23.0)
        If (feature 1 in {1.0})
         Predict: 842.3125000000001
        Else (feature 1 not in {1.0})
         Predict: 1059.3700000000003
       Else (feature 2 > 23.0)
        Predict: 384.6300000000001
     Else (feature 2 > 24.0)
      If (feature 1 in {1.0})
       Predict: 350.58000000000004
      Else (feature 1 not in {1.0})
       Predict: 477.96000000000004

What am I missing to be able to get a PMML model?

How to get features in encodeFeatures method of VectorIndexerModelConverter?

Suppose we obtained a VectorIndexerModel with a param inputCol = "features", which specifies the name of the input column (its type is Vector). Now, how do we get the features in the encodeFeatures method of VectorIndexerModelConverter?

It seems that this project doesn't support the Vector type.

Looking forward to your reply. Thanks!

How to convert spark rdd based gbdt model to pmml model?

JPMML-SparkML can only convert Spark ML pipelines to PMML, but I trained a Spark RDD-based GBDT MLlib model. How can I convert that MLlib model to PMML?

The "PMML model export - RDD-based API" documentation shows that only KMeansModel, LinearRegressionModel, RidgeRegressionModel, LassoModel, SVMModel and binary LogisticRegressionModel can be converted to PMML. What about the GBDT model? Is there no method to convert it to PMML?

Can anyone help me?
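
For completeness, a hedged sketch of the usual way out: the RDD-based GradientBoostedTreesModel has no PMML export path, so the model would need to be re-trained with the DataFrame-based spark.ml API, whose fitted pipelines this project can convert. Column names below are hypothetical:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import org.jpmml.sparkml.ConverterUtil

// Re-train with spark.ml and convert the fitted pipeline to PMML bytes
def trainAndConvert(trainingData: DataFrame): Array[Byte] = {
  val assembler = new VectorAssembler().setInputCols(Array("x1", "x2")).setOutputCol("features")
  val gbt = new GBTClassifier().setLabelCol("label").setFeaturesCol("features")
  val pipelineModel = new Pipeline().setStages(Array(assembler, gbt)).fit(trainingData)
  ConverterUtil.toPMMLByteArray(trainingData.schema, pipelineModel)
}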

UnsupportedOperationException when exporting StringIndexer with LogisticRegression

Hi,

I'm testing a very simple case just to evaluate the library and ran into an issue. Here's the code:

        // Load training data
        Dataset<Row> training = getTrainingData(jsc, sqlContext);
        StructType schema = training.schema();

        // Define the pipeline
        StringIndexer countryIndexer = new StringIndexer()
                .setInputCol("country")
                .setOutputCol("country_index");

        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"country_index", "a", "b"})
                .setOutputCol("features");

        LogisticRegression lr = new LogisticRegression()
                .setMaxIter(10)
                .setRegParam(0.3)
                .setElasticNetParam(0.8);

        Pipeline pipeline = new Pipeline();
        pipeline.setStages(new PipelineStage[]{countryIndexer, assembler, lr});

        // Fit the model
        PipelineModel pipelineModel = pipeline.fit(training);

        // Predict
        Dataset<Row> testing = getTestingData(jsc, sqlContext);
        Dataset<Row> predictions = pipelineModel.transform(testing);
        predictions.show();

        // Export to PMML
        PMML pmml = ConverterUtil.toPMML(schema, pipelineModel);

Here's a piece of relevant output (predictions.show() and the exception):

+-----+-------+---+----+-------------+--------------+--------------------+--------------------+----------+
|label|country|  a|   b|country_index|      features|       rawPrediction|         probability|prediction|
+-----+-------+---+----+-------------+--------------+--------------------+--------------------+----------+
|  0.0|     FR|1.0|-0.2|          0.0|[0.0,1.0,-0.2]|[0.43756144584300...|[0.60767781895595...|       0.0|
|  1.0|     DE|0.9| 0.5|          1.0| [1.0,0.9,0.5]|[-0.7827870058785...|[0.31371953355157...|       1.0|
+-----+-------+---+----+-------------+--------------+--------------------+--------------------+----------+

Exception in thread "main" java.lang.UnsupportedOperationException
	at org.jpmml.converter.CategoricalFeature.toContinuousFeature(CategoricalFeature.java:63)
	at org.jpmml.converter.regression.RegressionModelUtil.createRegressionTable(RegressionModelUtil.java:232)
	at org.jpmml.converter.regression.RegressionModelUtil.createBinaryLogisticClassification(RegressionModelUtil.java:113)
	at org.jpmml.converter.regression.RegressionModelUtil.createBinaryLogisticClassification(RegressionModelUtil.java:87)
	at org.jpmml.sparkml.model.LogisticRegressionModelConverter.encodeModel(LogisticRegressionModelConverter.java:52)
	at org.jpmml.sparkml.model.LogisticRegressionModelConverter.encodeModel(LogisticRegressionModelConverter.java:39)
	at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:165)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:81)
	at com.vika.pmml.PmmlExample.run(PmmlExample.java:99)
	at com.vika.pmml.PmmlExample.main(PmmlExample.java:40)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

the training data:

    private static final StructType SCHEMA = new StructType(new StructField[]{
            createStructField("label", DoubleType, false),
            createStructField("country", StringType, false),
            createStructField("a", DoubleType, false),
            createStructField("b", DoubleType, false)
    });

    private Dataset<Row> getTrainingData(JavaSparkContext jsc, SQLContext sqlContext) {

        JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
                RowFactory.create(1.0, "DE", 1.1, 0.1),
                RowFactory.create(0.0, "FR", 1.0, -1.0),
                RowFactory.create(0.0, "FR", 1.3, 1.0),
                RowFactory.create(1.0, "DE", 1.2, -0.5)
        ));
        return sqlContext.createDataFrame(jrdd, SCHEMA);
    }

The exception is thrown when the country feature is handled in RegressionModelUtil.createRegressionTable().

Am I doing something wrong? Or is using StringIndexer with LogisticRegression simply not working right?

By the way, I also tried the same code with library version 1.0.9 and Spark 1.6, and it did get exported:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
    <Header>
        <Application name="JPMML-SparkML" version="1.0.9"/>
        <Timestamp>2017-07-14T16:20:50Z</Timestamp>
    </Header>
    <DataDictionary>
        <DataField name="country" optype="categorical" dataType="string">
            <Value value="FR"/>
            <Value value="DE"/>
        </DataField>
        <DataField name="a" optype="continuous" dataType="double"/>
        <DataField name="b" optype="continuous" dataType="double"/>
        <DataField name="label" optype="categorical" dataType="double">
            <Value value="0"/>
            <Value value="1"/>
        </DataField>
    </DataDictionary>
    <RegressionModel functionName="classification" normalizationMethod="softmax">
        <MiningSchema>
            <MiningField name="label" usageType="target"/>
            <MiningField name="country"/>
            <MiningField name="a"/>
            <MiningField name="b"/>
        </MiningSchema>
        <Output>
            <OutputField name="probability_0" feature="probability" value="0"/>
            <OutputField name="probability_1" feature="probability" value="1"/>
        </Output>
        <RegressionTable intercept="-0.4375614458430096" targetCategory="1">
            <NumericPredictor name="country" coefficient="1.2203484517215881"/>
            <NumericPredictor name="a" coefficient="0.0"/>
            <NumericPredictor name="b" coefficient="0.0"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="0"/>
    </RegressionModel>
</PMML>

However, evaluating this PMML didn't work:

Exception in thread "main" org.jpmml.evaluator.TypeCheckException: Expected DOUBLE, but got STRING (FR)
	at org.jpmml.evaluator.TypeUtil.toDouble(TypeUtil.java:617)
	at org.jpmml.evaluator.TypeUtil.cast(TypeUtil.java:424)
	at org.jpmml.evaluator.FieldValue.getValue(FieldValue.java:320)
	at org.jpmml.evaluator.FieldValue.asNumber(FieldValue.java:269)
	at org.jpmml.evaluator.RegressionModelEvaluator.evaluateRegressionTable(RegressionModelEvaluator.java:194)
	at org.jpmml.evaluator.RegressionModelEvaluator.evaluateClassification(RegressionModelEvaluator.java:146)
	at org.jpmml.evaluator.RegressionModelEvaluator.evaluate(RegressionModelEvaluator.java:70)
	at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:346)

Thank you very much beforehand!
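
An editorial guess at the root cause, with a Scala sketch for brevity: the UnsupportedOperationException comes from the converter refusing to treat the string-indexed country_index column as a continuous feature in a regression table. Inserting a one-hot encoding step between the indexer and the assembler should give the regression the dummy variables it can encode (column names follow the code above):

import org.apache.spark.ml.feature.OneHotEncoder

// One-hot encode the indexed categorical column before assembling features
val countryEncoder = new OneHotEncoder()
  .setInputCol("country_index")
  .setOutputCol("country_vec")
// ... then assemble "country_vec" (instead of "country_index") together with "a" and "b"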

Problem with underscore when using RegexTokenizer()

Hello,
here is an issue I'm facing when using RegexTokenizer:
When using RegexTokenizer in a Spark pipeline, jpmml-sparkml allows two types of patterns:
"\s+" and "\W+".
When using "\W+" with gaps=True, non-alphanumeric characters are removed, but underscores ("_") are not, because the underscore is a word character.
However, when underscores appear in the text, the function toPMMLBytes returns an error which is related to the underscore.
So it looks like underscores cannot be removed, but also cannot be left in.

Thanks

The StringIndexerModelConverter stores labels instead of indices in the .pmml file, right?

Sorry to bother you again.

The schema of training data is as follows:
column: a1; data type: double; role: feature
column: a2; data type: double; role: feature
column: a3; data type: double; role: label

And there are only two values (-1.0, 1.0) in the label column (a3).

In order to train a pipeline, I put StringIndexer, VectorIndexer and Decision Tree Classifier together.
new Pipeline().setStages(Array(labelIndexer, vectorIndexer, classifier))

After fitting the pipeline, the model is converted to PMML:
ConverterUtil.toPMML(schema, model.model.asInstanceOf[PipelineModel])

What confused me is that StringIndexerModelConverter stores the labels ("-1.0" and "1.0") in the PMML file instead of the indices of the labels ("0" and "1"). Is that right? Then how does jpmml-sparkml transform the labels to indices? I just cannot find the related code.

<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
    <Header>
        <Application/>
        <Timestamp>2017-02-20T03:17:32Z</Timestamp>
    </Header>
    <DataDictionary>
        <DataField name="a3" optype="categorical" dataType="double">
            <Value value="-1.0"/>
            <Value value="1.0"/>
        </DataField>
        <DataField name="a1" optype="continuous" dataType="double"/>
        <DataField name="a2" optype="continuous" dataType="double"/>
        <DataField name="prediction" optype="categorical" dataType="double"/>
    </DataDictionary>
    <TreeModel functionName="classification" splitCharacteristic="binarySplit">
        <MiningSchema>
            <MiningField name="a3" usageType="target"/>
        </MiningSchema>
        <Output>
            <OutputField name="probability_-1.0" feature="probability" value="-1.0"/>
            <OutputField name="probability_1.0" feature="probability" value="1.0"/>
        </Output>
        <Node score="-1.0" recordCount="5300.0">
            <True/>
            <ScoreDistribution value="-1.0" recordCount="2924.0"/>
            <ScoreDistribution value="1.0" recordCount="2376.0"/>
        </Node>
    </TreeModel>
</PMML>

Another question: does jpmml-spark support loading a pipeline model (including transformers and a classifier or a regressor) from a PMML file? I used jpmml-spark to load the above pipeline model from the PMML file, but it seems the StringIndexerModel doesn't work correctly.

Looking forward to your reply. Thanks a lot!

Add support for `FPGrowth` model type

Getting the following exception:

java.lang.IllegalArgumentException: Transformer class org.apache.spark.ml.fpm.FPGrowthModel is not supported

I am trying to convert an FPGrowth model into PMML. Is it not supported?

Add support for `TrainValidationSplitModel` transformation type

Most Spark ML tutorials include this (pseudo-)transformation type in sample workflows. From the PMML perspective this is a no-op transformation, which can simply be skipped.

Currently, users have to manually "re-package" their fitted pipeline models, which is error-prone. Example issue - repackaging a fitted pipeline model and neglecting label and feature column definitions: #18 (comment). A sketch of this re-packaging follows below.
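
As an illustration of that manual re-packaging, a hedged Scala sketch, assuming the TrainValidationSplit estimator wrapped an entire Pipeline:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.tuning.TrainValidationSplitModel

// Unwrap the best fitted pipeline, which can then be converted directly
def unwrapBestPipeline(tvsModel: TrainValidationSplitModel): PipelineModel =
  tvsModel.bestModel.asInstanceOf[PipelineModel]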

Exception in thread "main" java.lang.IllegalArgumentException: skip

I am using Spark 2.1.1 and JPMML-SparkML 1.2.12, and execution reports the following error:

Exception in thread "main" java.lang.IllegalArgumentException: skip
	at org.jpmml.sparkml.feature.StringIndexerModelConverter.encodeFeatures(StringIndexerModelConverter.java:65)
	at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:47)
	at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:114)
	at com.nubia.train.Ad_ctr_train$.main(Ad_ctr_train.scala:182)
	at com.nubia.train.Ad_ctr_train.main(Ad_ctr_train.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:745)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
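
A hedged reading of this error: StringIndexerModelConverter appears to reject string indexer models whose handleInvalid parameter is set to "skip". A minimal workaround sketch, assuming the pipeline can tolerate strict behavior, is to keep the indexer at its default "error" mode (column names are hypothetical):

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("error") // the converter rejects "skip"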

setup with sbt rather than maven

Hello,

How should I include exclusions and shading within my build.sbt?

My current build.sbt :

scalaVersion := "2.11.8" // must match the _2.11 artifacts below

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided",
  "org.apache.spark" % "spark-sql_2.11" % "2.1.0",
  "org.apache.hbase" % "hbase-common" % "1.2.2",
  "org.apache.spark" % "spark-mllib_2.11" % "2.1.0",
  "org.jpmml" % "jpmml-sparkml" % "1.2.6"
)

excludeDependencies += "org.jpmml" % "pmml-model"

Thanks
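
For what it's worth, an sbt sketch of the shading part, assuming the sbt-assembly plugin is enabled; the relocation prefixes mirror the Maven Shade relocations commonly used with JPMML-SparkML and are an assumption here:

// build.sbt / assembly.sbt, with sbt-assembly enabled
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.dmg.pmml.**" -> "org.shaded.dmg.pmml.@1").inAll,
  ShadeRule.rename("org.jpmml.model.**" -> "org.shaded.jpmml.model.@1").inAll,
  ShadeRule.rename("org.jpmml.schema.**" -> "org.shaded.jpmml.schema.@1").inAll
)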

Handling columns with null values

Exception in thread "main" java.lang.IllegalArgumentException: Field a1 has valid values [b, a]
	at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:189)
	at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(VectorIndexerModelConverter.java:98)
	at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:48)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:96)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:68)

I get the above exception when the column has null values. Any ideas on how to resolve this? Please comment if further details are needed.
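
Not an official recommendation, but a common workaround sketch: clean up the nulls before fitting, so that every column's value space is fully defined at training time (the column name "a1" is taken from the error above):

import org.apache.spark.sql.DataFrame

// Either impute an explicit placeholder category, or drop rows with nulls
def cleanNulls(df: DataFrame): DataFrame =
  df.na.fill("missing", Seq("a1")) // or simply: df.na.drop()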

Request matching mode and setMinTokenLength support for RegexTokenizer

Hello Villu,
Thank you for this great package for exporting Spark ML models, but it does not seem easy to work with:

My input: a column named 'sentence'
My output: a column named 'prediction' produced by logistic classification for the column 'sentence'
My pipeline: RegexTokenizer -> NGram -> CountVectorizer -> IDF -> VectorAssembler -> LogisticRegression

Problem 1:
My RegexTokenizer code is as below:

tokenizer = feature.RegexTokenizer() \
    .setGaps(False) \
    .setPattern("\\b[a-zA-Z]{3,}\\b") \
    .setInputCol("sentence") \
    .setOutputCol("words")

But it throws an error:

IllegalArgumentException: 'Expected splitter mode, got token matching mode'

So I thought to implement the tokenizer myself and pass a column of arrays of tokens as input; then I got:

Problem 2:

IllegalArgumentException: Expected string, integral, double or boolean type, got vector type

After tracking some issues, I understand that the vector type is not supported, so I have to consider building the pipeline from the tokenizer again. Then I changed my tokenizer to splitter mode:

tokenizer = feature.RegexTokenizer() \
    .setGaps(True) \
    .setPattern("\\s+") \
    .setInputCol("sentence") \
    .setOutputCol("words")

Then I got:

Problem 3:

 java.lang.IllegalArgumentException: .
	at org.jpmml.sparkml.feature.CountVectorizerModelConverter.encodeFeatures(CountVectorizerModelConverter.java:118)
	at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:48)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:80)

After checking the source code at line 118, there is a requirement that a token cannot start or end with punctuation. But this happens a lot: for example, in "This is a sentence.", after splitting by spaces, the last token ends with a period (.). In this case, if matching mode were supported, the pattern "\\b[a-zA-Z]{3,}\\b" could extract 'clean' tokens easily.
I had no choice but to continue hacking. I then tried to split the sentence by the pattern "\\b[^a-zA-Z]{0,}\\b", which splits the text at non-English letters, and then to filter the tokens by setting the minimum token length to 3. This works fine in Spark, but when I export the pipeline, I get another error:

Problem 4:

java.lang.IllegalArgumentException: Expected 1 as minimum token length, got 3 as minimum token length
	at org.jpmml.sparkml.feature.RegexTokenizerConverter.encodeFeatures(RegexTokenizerConverter.java:51)

As it reads, a minimum token length other than 1 is not supported in jpmml-sparkml.

I'm really frustrated: this is a simple and typical task, and I've tried different means to overcome it, but all failed. Could you please point me in the right direction? Thank you.
