Running Spark 2.1.2, using jpmml-sparkml 1.2.7. While attempting to

Support transformed labels about jpmml-sparkml HOT 8 OPEN

jpmml commented on September 15, 2024

Support transformed labels

from jpmml-sparkml.

Comments (8)

borisborowsky commented on September 15, 2024

@vruusmann Sorry for the off-topic question; I will delete it. But now I run into another issue: when I call buildFile on the pmmlBuilder object, it fails with:

format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o57101.buildFile.
: java.lang.IllegalArgumentException: Expected 3 target categories, got 2 target categories
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'Expected 3 target categories, got 2 target categories'

I cannot understand why. Do you have a clue?

vruusmann commented on September 15, 2024

The JPMML-SparkML library assumes that the label column of classification models is a "native" categorical label (in PMML, corresponds to a DataDictionary/DataField element), not a "transformed" categorical label (corresponds to a TransformationDictionary/DerivedField element).

I've been taking this for granted, and forgot to actually implement the "native" vs "transformed" check around ModelConverter.java:82.

It's possible to make your example work by applying the Binarizer transformation to the dataset outside of the pipeline, and then treating its output column "DepDelay_Bin" as a "native" categorical label:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import Binarizer, StringIndexer, VectorAssembler
from jpmml_sparkml import toPMMLBytes

# Apply the Binarizer to the dataset up front, outside of the pipeline
binarizer = Binarizer(threshold=15.0, inputCol="DepDelay_Double", outputCol="DepDelay_Bin")
data2007 = binarizer.transform(data2007) # THIS!

# Index the pre-computed "DepDelay_Bin" column as if it were a "native" label
stringIndexer = StringIndexer(inputCol="DepDelay_Bin", outputCol="DepDelay_Bin_Label") # THIS!
featuresAssembler = VectorAssembler(inputCols=["Month", "CRSDepTime", "Distance"], outputCol="features")
rfc3 = RandomForestClassifier(labelCol="DepDelay_Bin_Label", featuresCol="features", numTrees=3, maxDepth=5, seed=10305)

pipelineRF3 = Pipeline(stages=[stringIndexer, featuresAssembler, rfc3]) # THIS: start the pipeline with StringIndexer not Binarizer

model3 = pipelineRF3.fit(data2007)

# Convert the fitted pipeline to PMML and print it
pmmlBytes = toPMMLBytes(sc, data2007, model3)
print(pmmlBytes.decode("UTF-8"))
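
To make the label handling in the workaround more concrete, here is a rough, Spark-free sketch of what the two label steps compute. The function names binarize and index_labels and the sample delays are hypothetical, made up for illustration. The sketch assumes Binarizer's documented rule (values strictly greater than the threshold become 1.0, everything else 0.0) and a simplified version of StringIndexer's default frequency-descending ordering; it is not the library's implementation.

```python
from collections import Counter


def binarize(values, threshold):
    """Map each value to 1.0 if it is strictly greater than threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


def index_labels(labels):
    """Assign indices by descending frequency (simplified: ties broken by label value)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Made-up departure delays in minutes (illustration only)
dep_delay = [3.0, 42.0, 15.0, 0.0, 120.0, 7.0, 16.0]

dep_delay_bin = binarize(dep_delay, threshold=15.0)
mapping = index_labels(dep_delay_bin)

print(dep_delay_bin)  # one 0.0/1.0 flag per input row
print(len(mapping))   # number of distinct label categories
```

Whatever the input looks like, the binarized column can hold at most two distinct values, so a label derived this way has at most two categories.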

vruusmann commented on September 15, 2024

Technically, it shouldn't be much work to make JPMML-SparkML work with "transformed" labels, so keeping this issue open to track progress towards this functionality.

alex-krash commented on September 15, 2024

Looks like it can be closed for current version:

Binarizer binarizer = new Binarizer()
        .setInputCol("Sepal_Length")
        .setOutputCol("Sepal_Length_Binar_")
        .setThreshold(5.0);

StringIndexer labelIndexer = new StringIndexer()
        .setInputCol("Species")
        .setOutputCol("Species_Bin");

VectorAssembler vectorAssembler = new VectorAssembler()
        .setInputCols(new String[]{
                "Sepal_Length_Binar_",
                "Sepal_Width",
                "Petal_Length",
                "Petal_Width"})
        .setOutputCol("features");

RandomForestClassifier classifier = new RandomForestClassifier()
        .setLabelCol("Species_Bin");

Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{binarizer, labelIndexer, vectorAssembler, classifier});
PipelineModel model = pipeline.fit(dataset);

PMMLBuilder builder = new PMMLBuilder(schema, model);
final PMML build = builder.build();
JAXBUtil.marshalPMML(build, new StreamResult(System.out));

vruusmann commented on September 15, 2024

"Looks like it can be closed for current version"

Nope, I'd like to be able to use Sepal_Length_Binar_ as the label column here.

borisborowsky commented on September 15, 2024

Can someone help me with this error: AttributeError: 'Pipeline' object has no attribute '_transfer_param_map_to_java'? I get it when I try to execute PMMLBuilder():

dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
evaluator = MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1')

paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1, 2, 6])
             .addGrid(dt.maxBins, [570, 570])
             .build())

stages += [dt]
pipeline = Pipeline(stages=stages)

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=3)

cvModel = cv.fit(dataSet)
train_dataset = cvModel.transform(dataSet)

train_dataset.show()
print(evaluator.evaluate(train_dataset))

pmmlBuilder = PMMLBuilder(spark, dataSet, cvModel) \
    .putOption(dt, "compact", True)

pmmlBuilder.buildFile("DecisionTreeIris.pmml")

I cannot find any fix for this. What am I doing wrong?

vruusmann commented on September 15, 2024

"AttributeError: 'Pipeline' object has no attribute '_transfer_param_map_to_java' error"

This is clearly a low-level PySpark error, which has got nothing to do with PySpark2PMML or JPMML-SparkML.

Maybe your PySpark and Apache Spark versions are out of sync.
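
One way to rule out a version mismatch is to compare the major.minor parts of the two version strings. The helper below is a minimal sketch; the names major_minor and versions_in_sync are hypothetical and not part of PySpark. In a live session the two inputs would typically come from pyspark.__version__ and spark.version.

```python
def major_minor(version):
    """Return the (major, minor) pair of a dotted version string such as "2.1.2"."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def versions_in_sync(pyspark_version, spark_version):
    """True when both version strings share the same major.minor prefix."""
    return major_minor(pyspark_version) == major_minor(spark_version)


# In a real session (assumption: an active SparkSession named `spark`):
#   import pyspark
#   versions_in_sync(pyspark.__version__, spark.version)
print(versions_in_sync("2.1.2", "2.1.0"))
print(versions_in_sync("2.1.2", "2.4.5"))
```

Comparing only major.minor is a deliberate simplification; patch-level differences are ignored here.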

borisborowsky commented on September 15, 2024

@vruusmann Thank you. My PySpark and Apache Spark versions are up to date. The problem was that you must pass the cross-validator's best model to PMMLBuilder; in my case, passing cvModel.bestModel did the job.
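
The fix described here can be wrapped in a small helper that unwraps a tuning result before it is handed to PMMLBuilder. This is a hedged sketch: unwrap_best_model is a hypothetical name, and the two Fake classes are stand-ins used only so the snippet runs without Spark. It relies only on the fact that a fitted CrossValidatorModel exposes its winning model as the bestModel attribute.

```python
def unwrap_best_model(model):
    """Return model.bestModel when that attribute exists, otherwise the model itself."""
    return getattr(model, "bestModel", model)


# Stand-in classes (hypothetical) so the sketch runs without a Spark installation
class FakePipelineModel:
    pass


class FakeCrossValidatorModel:
    def __init__(self, best_model):
        self.bestModel = best_model


fitted_pipeline = FakePipelineModel()
cv_model = FakeCrossValidatorModel(fitted_pipeline)

# A cross-validation result is unwrapped; a plain pipeline model passes through
unwrapped_cv = unwrap_best_model(cv_model)
unwrapped_plain = unwrap_best_model(fitted_pipeline)

# With the names from the question above, the call would then look like:
#   PMMLBuilder(spark, dataSet, unwrap_best_model(cvModel)).buildFile("DecisionTreeIris.pmml")
```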
