Comments (8)

borisborowsky commented on September 15, 2024

@vruusmann Sorry for the off-topic; I will delete the question. But now I have run into another issue: when I call buildFile on the pmmlBuilder object, it fails with:

format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o57101.buildFile.
: java.lang.IllegalArgumentException: Expected 3 target categories, got 2 target categories

raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'Expected 3 target categories, got 2 target categories'

I cannot understand why. Do you have a clue?

from jpmml-sparkml.

vruusmann commented on September 15, 2024

The JPMML-SparkML library assumes that the label column of classification models is a "native" categorical label (in PMML, corresponds to a DataDictionary/DataField element), not a "transformed" categorical label (corresponds to a TransformationDictionary/DerivedField element).

I've been taking this for granted, and forgot to actually implement the "native" vs. "transformed" check around ModelConverter.java:82.

It's possible to make your example work by applying the Binarizer transformation to the dataset outside of the pipeline, and then treating its output column "DepDelay_Bin" as a "native" categorical label:

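from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import Binarizer, StringIndexer, VectorAssembler

binarizer = Binarizer(threshold=15.0, inputCol="DepDelay_Double", outputCol="DepDelay_Bin")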
data2007 = binarizer.transform(data2007) # THIS!

stringIndexer = StringIndexer(inputCol="DepDelay_Bin", outputCol="DepDelay_Bin_Label") # THIS!
featuresAssembler = VectorAssembler(inputCols=["Month", "CRSDepTime", "Distance"], outputCol="features")
rfc3 = RandomForestClassifier(labelCol="DepDelay_Bin_Label", featuresCol="features", numTrees=3, maxDepth=5, seed=10305)

pipelineRF3 = Pipeline(stages=[stringIndexer, featuresAssembler, rfc3]) # THIS: start the pipeline with StringIndexer not Binarizer

model3 = pipelineRF3.fit(data2007)

from jpmml_sparkml import toPMMLBytes
pmmlBytes = toPMMLBytes(sc, data2007, model3)
print(pmmlBytes.decode("UTF-8"))

vruusmann commented on September 15, 2024

Technically, it shouldn't be much work to make JPMML-SparkML work with "transformed" labels, so I'm keeping this issue open to track progress towards this functionality.

alex-krash commented on September 15, 2024

Looks like it can be closed for the current version:

Binarizer binarizer = new Binarizer()
        .setInputCol("Sepal_Length")
        .setOutputCol("Sepal_Length_Binar_")
        .setThreshold(5.0);

StringIndexer labelIndexer = new StringIndexer()
        .setInputCol("Species")
        .setOutputCol("Species_Bin");

VectorAssembler vectorAssembler = new VectorAssembler()
        .setInputCols(new String[]{
                "Sepal_Length_Binar_",
                "Sepal_Width",
                "Petal_Length",
                "Petal_Width"})
        .setOutputCol("features");

RandomForestClassifier classifier = new RandomForestClassifier()
        .setLabelCol("Species_Bin");

Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{binarizer, labelIndexer, vectorAssembler, classifier});
PipelineModel model = pipeline.fit(dataset);

PMMLBuilder builder = new PMMLBuilder(schema, model);
final PMML build = builder.build();
JAXBUtil.marshalPMML(build, new StreamResult(System.out));

vruusmann commented on September 15, 2024

> Looks like it can be closed for current version

Nope, I'd like to be able to use Sepal_Length_Binar_ as the label column here.

borisborowsky commented on September 15, 2024

Can someone help me with this error: AttributeError: 'Pipeline' object has no attribute '_transfer_param_map_to_java'? I get it when I try to execute PMMLBuilder():

dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
evaluator = MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1')

paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1, 2, 6])
             .addGrid(dt.maxBins, [570, 570])
             .build())

stages += [dt]
pipeline = Pipeline(stages=stages)


cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=3)

cvModel = cv.fit(dataSet)
train_dataset = cvModel.transform(dataSet)

train_dataset.show()
print(evaluator.evaluate(train_dataset))

pmmlBuilder = PMMLBuilder(spark, dataSet, cvModel) \
    .putOption(dt, "compact", True)

pmmlBuilder.buildFile("DecisionTreeIris.pmml")

I cannot find any fix for this. What am I doing wrong?

vruusmann commented on September 15, 2024

> AttributeError: 'Pipeline' object has no attribute '_transfer_param_map_to_java' error

This is clearly a low-level PySpark error, which has got nothing to do with PySpark2PMML or JPMML-SparkML.

Maybe your PySpark and Apache Spark versions are out of sync.
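
One way to rule this out is to compare the two version strings directly. The following is only a minimal sketch: the helper names major_minor and versions_in_sync are hypothetical, and treating "same major.minor" as "in sync" is an assumption rather than an official compatibility rule. In a live session the two inputs would typically come from pyspark.__version__ and spark.version.

```python
def major_minor(version):
    """Extract (major, minor) as integers from a version string such as '3.5.1'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def versions_in_sync(pyspark_version, spark_version):
    """Assumption: the two are 'in sync' when major and minor components match."""
    return major_minor(pyspark_version) == major_minor(spark_version)


# Hypothetical version strings, used only for illustration.
print(versions_in_sync("3.5.1", "3.5.0"))
print(versions_in_sync("3.5.1", "2.4.8"))
```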

borisborowsky commented on September 15, 2024

@vruusmann Thank you. My PySpark and Apache Spark versions are up to date. The problem was that you must pass the best model found by the cross-validator, not the CrossValidatorModel itself; in my case, passing cvModel.bestModel does the work.
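
The unwrapping step can be sketched as a small helper. This is a minimal sketch and not part of PySpark2PMML: unwrap_best_model is a hypothetical name, and it relies only on a fitted cross-validation model exposing the winning fitted model as its bestModel attribute. The Fake* classes are stand-ins so that the snippet runs without a Spark session.

```python
def unwrap_best_model(model):
    """Return model.bestModel when the attribute exists, else the model itself."""
    return getattr(model, "bestModel", model)


# Stand-in classes (hypothetical), used only so the sketch runs without Spark.
class FakePipelineModel:
    pass


class FakeCrossValidatorModel:
    def __init__(self, best_model):
        self.bestModel = best_model


pipeline_model = FakePipelineModel()
cv_model = FakeCrossValidatorModel(pipeline_model)

# The wrapper is unwrapped; a plain pipeline model passes through unchanged.
print(unwrap_best_model(cv_model) is pipeline_model)
print(unwrap_best_model(pipeline_model) is pipeline_model)
```

Applied to the snippet above, the call would become PMMLBuilder(spark, dataSet, unwrap_best_model(cvModel)), which is equivalent to passing cvModel.bestModel directly.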
