

spark-stringmetric


String similarity functions and phonetic algorithms for Spark.

See ceja if you're using PySpark.

Project Setup

Update your build.sbt file to import the libraries.

libraryDependencies += "org.apache.commons" % "commons-text" % "1.1"

// Spark 3
libraryDependencies += "com.github.mrpowers" %% "spark-stringmetric" % "0.4.0"

// Spark 2
libraryDependencies += "com.github.mrpowers" %% "spark-stringmetric" % "0.3.0"

Scala 2.11 and Scala 2.12 builds of spark-stringmetric are published to Maven.

SimilarityFunctions

  • cosine_distance
  • fuzzy_score
  • hamming
  • jaccard_similarity
  • jaro_winkler

How to import the functions.

import com.github.mrpowers.spark.stringmetric.SimilarityFunctions._

Here's an example of how to use the jaccard_similarity function.

Suppose we have the following sourceDF:

+-------+-------+
|  word1|  word2|
+-------+-------+
|  night|  nacht|
|context|contact|
|   null|  nacht|
|   null|   null|
+-------+-------+
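
For reference, a DataFrame like this can be built as follows (a minimal sketch; it assumes an active SparkSession named spark):

import spark.implicits._  // assumes an active SparkSession named `spark`

val sourceDF = Seq(
  ("night", "nacht"),
  ("context", "contact"),
  (null, "nacht"),
  (null, null)
).toDF("word1", "word2")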

Let's run the jaccard_similarity function.

val actualDF = sourceDF.withColumn(
  "w1_w2_jaccard",
  jaccard_similarity(col("word1"), col("word2"))
)

We can run actualDF.show() to view the w1_w2_jaccard column that's been appended to the DataFrame.

+-------+-------+-------------+
|  word1|  word2|w1_w2_jaccard|
+-------+-------+-------------+
|  night|  nacht|         0.43|
|context|contact|         0.57|
|   null|  nacht|         null|
|   null|   null|         null|
+-------+-------+-------------+
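
These scores are consistent with a character-set Jaccard index: the number of distinct characters the two words share, divided by the number of distinct characters they use in total (this appears to be how the underlying Apache Commons Text implementation behaves). A plain-Scala sketch of the computation, for illustration only:

// Jaccard similarity over distinct characters: |A ∩ B| / |A ∪ B|
def jaccardSketch(a: String, b: String): Double = {
  val (sa, sb) = (a.toSet, b.toSet)
  if (sa.isEmpty && sb.isEmpty) 1.0
  else sa.intersect(sb).size.toDouble / sa.union(sb).size
}

jaccardSketch("night", "nacht")     // 3 shared chars / 7 distinct chars ≈ 0.43
jaccardSketch("context", "contact") // 4 shared chars / 7 distinct chars ≈ 0.57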

PhoneticAlgorithms

  • double_metaphone
  • nysiis
  • refined_soundex

How to import the functions.

import com.github.mrpowers.spark.stringmetric.PhoneticAlgorithms._

Here's an example of how to use the refined_soundex function.

Suppose we have the following sourceDF:

+-----+
|word1|
+-----+
|night|
|  cat|
| null|
+-----+
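
As above, a minimal sketch for building this one-column DataFrame (assuming an active SparkSession named spark):

import spark.implicits._  // assumes an active SparkSession named `spark`

val sourceDF = Seq("night", "cat", null).toDF("word1")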

Let's run the refined_soundex function.

val actualDF = sourceDF.withColumn(
  "word1_refined_soundex",
  refined_soundex(col("word1"))
)

We can run actualDF.show() to view the word1_refined_soundex column that's been appended to the DataFrame.

+-----+---------------------+
|word1|word1_refined_soundex|
+-----+---------------------+
|night|               N80406|
|  cat|                 C306|
| null|                 null|
+-----+---------------------+

API Documentation

Here is the latest API documentation.

Release

  1. Create GitHub tag

  2. Build documentation with sbt ghpagesPushSite

  3. Publish JAR

Run sbt to open the SBT console.

At the sbt prompt, run ; + publishSigned; sonatypeBundleRelease to create the JAR files and release them to Maven. These commands are made available by the sbt-sonatype plugin.

After running the release command, you'll be prompted to enter your GPG passphrase.

The Sonatype credentials should be stored in the ~/.sbt/sonatype_credentials file in this format:

realm=Sonatype Nexus Repository Manager
host=oss.sonatype.org
user=$USERNAME
password=$PASSWORD
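
If sbt does not pick this file up automatically, the usual sbt-sonatype pattern is to reference it with a Credentials entry (a sketch; adjust the path to your setup):

// In build.sbt (or a global ~/.sbt config): point sbt at the credentials file
credentials += Credentials(Path.userHome / ".sbt" / "sonatype_credentials")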

Post Maven release steps

  • Create a GitHub release/tag
  • Publish the updated documentation

spark-stringmetric's People

Contributors

mrpowers, nvander1


spark-stringmetric's Issues

Facing issues when the file gets large

Hey, I am running into issues computing cosine_distance once the file grows beyond 1M records. It runs fine with 10k records.


.withColumn(
  "cd_SrcTrg",
  cosine_distance(col("Src"), col("Trg"))
)

20/08/09 22:29:38 INFO DAGScheduler: ShuffleMapStage 3 (csv at CosineDistance.scala:133) failed in 6.574 s due to Job aborted due to stage failure: Task 9 in stage 3.0 failed 1 times, most recent failure: Lost task 9.0 in stage 3.0 (TID 34, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (string, string) => double)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Invalid text
at org.apache.commons.lang3.Validate.isTrue(Validate.java:158)
at org.apache.commons.text.similarity.RegexTokenizer.tokenize(RegexTokenizer.java:43)
at org.apache.commons.text.similarity.RegexTokenizer.tokenize(RegexTokenizer.java:34)
at org.apache.commons.text.similarity.CosineDistance.apply(CosineDistance.java:49)
at com.github.mrpowers.spark.stringmetric.SimilarityFunctions$.cosineDistanceFun(SimilarityFunctions.scala:17)
at com.github.mrpowers.spark.stringmetric.SimilarityFunctions$$anonfun$1.apply(SimilarityFunctions.scala:11)
at com.github.mrpowers.spark.stringmetric.SimilarityFunctions$$anonfun$1.apply(SimilarityFunctions.scala:11)
... 15 more

cosine_distance and fuzzy_score

Hi Matthew,

First of all, I'd like to thank you for sharing your knowledge with us. I appreciate it very much and have learned a lot.

I used the cosine_distance and fuzzy_score functions on my DataFrame and they gave me very strange results.

For cosine_distance, I expected results between -1 and 1, but I got -2.220446049250313E-16. Please take a look:

+------------------+------------------+----------------------+
|full_name1        |full_name2        |cosine_distance       |
+------------------+------------------+----------------------+
|John Stevenson Due|John Stevenson Due|-2.220446049250313E-16|
+------------------+------------------+----------------------+

For fuzzy_score on the same example, I got a score of 52 instead of a result between 0 and 1.

+------------------+------------------+-----------+
|full_name1        |full_name2        |fuzzy_score|
+------------------+------------------+-----------+
|John Stevenson Due|John Stevenson Due|         52|
+------------------+------------------+-----------+

Could you please check it out?

Thanks a lot,

Rodrigo

Add native implementations of algorithms

This library is currently just a wrapper around org.apache.commons.commons-text functions.

Native implementations would be a lot better... Who is ready for a hard Spark coding challenge?!

@afranzi @nvander1 - take a look at the README for this project. Want to get this on your radar 🤓

Using stringmetric in PySpark

I was able to install the spark-stringmetric:0.2.0 library on a Databricks cluster, but all the import commands I have tried in my code throw exceptions. What is the library name I should be using in PySpark?

Not able to build in maven.

I have been using this great package, but for the past two weeks, I have been trying to build my package on my new machine, for some reason, I am getting maven transfer error.

here is the dependency, i am using

MrPowers spark-stringmetric 0.2.0

this is coming from
this artifact is located at SparkPackages repository (https://dl.bintray.com/spark-packages/maven/)

Looks like either file is corrupted or something is missing, please look into it.

This tool has really helped me with fuzzy matching.

Any help is highly appreciated.

Raghu
