Comments (6)

vruusmann commented on August 11, 2024

Began to doubt myself, so I went and checked the list of SQL functions here:
https://spark.apache.org/docs/latest/api/sql/index.html

Looks like replace and regexp_replace are two different things:
https://spark.apache.org/docs/latest/api/sql/index.html#replace
https://spark.apache.org/docs/latest/api/sql/index.html#regexp_replace

The PMML built-in function replace is functionally equivalent to Apache Spark ML's regexp_replace SQL function.

The replace SQL function is currently unsupported.

The workaround is obvious: use the regexp_replace SQL function, and specify its regexp and rep arguments as literal strings (i.e., they should not contain any regexp meta-characters and stuff).
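
To illustrate why the "literal strings" caveat matters, here is a minimal sketch using Python's standard `re` module. It is not Spark or JPMML-SparkML code, and Python's regex dialect is not identical to the Java one that Spark uses; the helper name `literal_regexp_replace` is hypothetical. It only shows that an unescaped meta-character such as a dot changes what a regex-based replace does.

```python
import re


def literal_regexp_replace(text: str, search: str, replacement: str) -> str:
    """Replace every literal occurrence of `search` using the regex engine.

    Hypothetical helper for illustration only.
    """
    # re.escape neutralises meta-characters such as "." or "$" in the pattern.
    pattern = re.escape(search)
    # A callable replacement is inserted as-is, so backslashes or group
    # references in `replacement` are not interpreted either.
    return re.sub(pattern, lambda _match: replacement, text)


# Unescaped, "." is a meta-character that matches any character.
naive = re.sub(".", "_", "a.b")

# Escaped, only the literal dot is replaced.
literal = literal_regexp_replace("a.b", ".", "_")

print(naive)
print(literal)
```

The same concern applies to the regexp argument of regexp_replace: a search string containing dots needs to be escaped (or otherwise free of meta-characters) to behave like a plain replace.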

from jpmml-sparkml.

vruusmann commented on August 11, 2024

The CountVectorizer does not allow strings to contain punctuation anymore. Which is sad because the words in my use case contain dots.

@PowerToThePeople111 Could you please generate a reproducible test case, and open a new issue around this topic?

The PMML approach would be to tokenize using RegexTokenizer and then count using CountVectorizer. I wonder: if the RegexTokenizer generates a "punctuated token", does CountVectorizer really reject it? When did this regression happen (e.g., is there a JIRA issue reference)?

vruusmann commented on August 11, 2024

The CountVectorizer does not allow strings to contain punctuation anymore.

What Apache Spark ML version are you talking about? If it's 3.3.X, then please append your complaint to #129

I realised that the replace function is not supported yet in SQLTransformers.

The "replace" SQL function is fully supported. See jpmml/pyspark2pmml#40

PowerToThePeople111 commented on August 11, 2024

I am currently using Apache Spark 3.2.1 with Scala. I am unsure whether that matters, but since you mentioned pyspark2pmml I thought I should tell you.

I got the message that replace is not supported when trying to export the pipeline. I would have to rerun the job to reproduce the exact error message, but if that would help you, I can try to do it by the end of next week at the latest.

For now I just replaced all non-alphanumeric characters in my words with a constant string that will not otherwise turn up, before training the pipeline, and did the same in the REST server. It seems to work.
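
As a rough illustration of that pre-processing step, here is a minimal Python sketch. It is not the actual Scala code from the job; the placeholder value `XDOTX` and the function name `sanitize` are made up for the example. The point is that training and serving should call the same function, so both sides see identical tokens.

```python
import re

# Hypothetical placeholder; it must be a string that never occurs in real words.
PLACEHOLDER = "XDOTX"

# Anything that is not an ASCII letter or digit counts as non-alphanumeric here.
NON_ALNUM = re.compile(r"[^0-9A-Za-z]")


def sanitize(word: str) -> str:
    """Replace every non-alphanumeric character with the placeholder."""
    return NON_ALNUM.sub(PLACEHOLDER, word)


# The same function would be applied before training and in the REST server.
training_words = [sanitize(w) for w in ["node.js", "spark"]]
serving_word = sanitize("node.js")

print(training_words)
print(serving_word)
```

One caveat of this approach: every punctuation character maps to the same placeholder, so words that differ only in punctuation (for example `a.b` and `a-b`) become indistinguishable after sanitizing.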

vruusmann commented on August 11, 2024

The workaround is obvious: use the regexp_replace SQL function, and specify its regexp and rep arguments as literal strings (i.e., they should not contain any regexp meta-characters and stuff).

OK, reopening this issue, because the JPMML-SparkML library could (and should) be able to do this replace -> regexp_replace substitution automatically.

PowerToThePeople111 commented on August 11, 2024

Thank you for having a look into this! I will create a short example to reproduce this, but I am very busy at the moment, so it might take until the end of next week.
