

spark-stringmetric


String similarity functions and phonetic algorithms for Spark.

See ceja if you're using PySpark.

Project Setup

Update your build.sbt file to import the libraries.

libraryDependencies += "org.apache.commons" % "commons-text" % "1.1"

// Spark 3
libraryDependencies += "com.github.mrpowers" %% "spark-stringmetric" % "0.4.0"

// Spark 2
libraryDependencies += "com.github.mrpowers" %% "spark-stringmetric" % "0.3.0"

Scala 2.11 and Scala 2.12 builds of spark-stringmetric are published to Maven.

SimilarityFunctions

  • cosine_distance
  • fuzzy_score
  • hamming
  • jaccard_similarity
  • jaro_winkler

How to import the functions.

import com.github.mrpowers.spark.stringmetric.SimilarityFunctions._

Here's an example of how to use the jaccard_similarity function.

Suppose we have the following sourceDF:

+-------+-------+
|  word1|  word2|
+-------+-------+
|  night|  nacht|
|context|contact|
|   null|  nacht|
|   null|   null|
+-------+-------+
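
For reference, a DataFrame like this can be built as follows (a minimal sketch; it assumes an active SparkSession named spark):

import spark.implicits._  // assumes an active SparkSession named `spark`

val sourceDF = Seq(
  ("night", "nacht"),
  ("context", "contact"),
  (null, "nacht"),
  (null, null)
).toDF("word1", "word2")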

Let's run the jaccard_similarity function.

val actualDF = sourceDF.withColumn(
  "w1_w2_jaccard",
  jaccard_similarity(col("word1"), col("word2"))
)

We can run actualDF.show() to view the w1_w2_jaccard column that's been appended to the DataFrame.

+-------+-------+-------------+
|  word1|  word2|w1_w2_jaccard|
+-------+-------+-------------+
|  night|  nacht|         0.43|
|context|contact|         0.57|
|   null|  nacht|         null|
|   null|   null|         null|
+-------+-------+-------------+
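
These scores are consistent with a character-set Jaccard index: the number of distinct characters the two words share, divided by the number of distinct characters they use in total (this appears to be how the underlying Apache Commons Text implementation behaves). A plain-Scala sketch of the computation, for illustration only:

// Jaccard similarity over distinct characters: |A ∩ B| / |A ∪ B|
def jaccardSketch(a: String, b: String): Double = {
  val (sa, sb) = (a.toSet, b.toSet)
  if (sa.isEmpty && sb.isEmpty) 1.0
  else sa.intersect(sb).size.toDouble / sa.union(sb).size
}

jaccardSketch("night", "nacht")     // 3 shared chars / 7 distinct chars ≈ 0.43
jaccardSketch("context", "contact") // 4 shared chars / 7 distinct chars ≈ 0.57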

PhoneticAlgorithms

  • double_metaphone
  • nysiis
  • refined_soundex

How to import the functions.

import com.github.mrpowers.spark.stringmetric.PhoneticAlgorithms._

Here's an example of how to use the refined_soundex function.

Suppose we have the following sourceDF:

+-----+
|word1|
+-----+
|night|
|  cat|
| null|
+-----+
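
As above, a minimal sketch for building this one-column DataFrame (assuming an active SparkSession named spark):

import spark.implicits._  // assumes an active SparkSession named `spark`

val sourceDF = Seq("night", "cat", null).toDF("word1")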

Let's run the refined_soundex function.

val actualDF = sourceDF.withColumn(
  "word1_refined_soundex",
  refined_soundex(col("word1"))
)

We can run actualDF.show() to view the word1_refined_soundex column that's been appended to the DataFrame.

+-----+---------------------+
|word1|word1_refined_soundex|
+-----+---------------------+
|night|               N80406|
|  cat|                 C306|
| null|                 null|
+-----+---------------------+

API Documentation

Here is the latest API documentation.

Release

  1. Create GitHub tag

  2. Build documentation with sbt ghpagesPushSite

  3. Publish JAR

Run sbt to open the SBT console.

At the sbt prompt, run ; + publishSigned; sonatypeBundleRelease to create the JAR files and release them to Maven. These commands are made available by the sbt-sonatype plugin.

After running the release command, you'll be prompted to enter your GPG passphrase.

The Sonatype credentials should be stored in the ~/.sbt/sonatype_credentials file in this format:

realm=Sonatype Nexus Repository Manager
host=oss.sonatype.org
user=$USERNAME
password=$PASSWORD
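
If sbt does not pick this file up automatically, the usual sbt-sonatype pattern is to reference it with a Credentials entry (a sketch; adjust the path to your setup):

// In build.sbt (or a global ~/.sbt config): point sbt at the credentials file
credentials += Credentials(Path.userHome / ".sbt" / "sonatype_credentials")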

Post Maven release steps

  • Create a GitHub release/tag
  • Publish the updated documentation

spark-stringmetric's People

Contributors

mrpowers, nvander1


spark-stringmetric's Issues

Facing issues when the file gets large

Hey, I am running into issues computing cosine_distance once the file grows beyond 1M records. It runs fine with 10k records.


.withColumn(
  "cd_SrcTrg",
  cosine_distance(col("Src"), col("Trg"))
)

20/08/09 22:29:38 INFO DAGScheduler: ShuffleMapStage 3 (csv at CosineDistance.scala:133) failed in 6.574 s due to Job aborted due to stage failure: Task 9 in stage 3.0 failed 1 times, most recent failure: Lost task 9.0 in stage 3.0 (TID 34, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (string, string) => double)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Invalid text
at org.apache.commons.lang3.Validate.isTrue(Validate.java:158)
at org.apache.commons.text.similarity.RegexTokenizer.tokenize(RegexTokenizer.java:43)
at org.apache.commons.text.similarity.RegexTokenizer.tokenize(RegexTokenizer.java:34)
at org.apache.commons.text.similarity.CosineDistance.apply(CosineDistance.java:49)
at com.github.mrpowers.spark.stringmetric.SimilarityFunctions$.cosineDistanceFun(SimilarityFunctions.scala:17)
at com.github.mrpowers.spark.stringmetric.SimilarityFunctions$$anonfun$1.apply(SimilarityFunctions.scala:11)
at com.github.mrpowers.spark.stringmetric.SimilarityFunctions$$anonfun$1.apply(SimilarityFunctions.scala:11)
... 15 more

cosine_distance and fuzzy_score

Hi Matthew,

First of all, I'd like to thank you for sharing your knowledge with us. I appreciate it very much and have learned a lot.

I used the cosine_distance and fuzzy_score functions on my DataFrame and they gave me very strange results.

For cosine_distance, I expected results between -1 and 1, but I got -2.220446049250313E-16. Please take a look:

+------------------+------------------+----------------------+
|full_name1        |full_name2        |cosine_distance       |
+------------------+------------------+----------------------+
|John Stevenson Due|John Stevenson Due|-2.220446049250313E-16|
+------------------+------------------+----------------------+

For fuzzy_score on the same example, I got a score of 52 instead of a result between 0 and 1.

+------------------+------------------+-----------+
|full_name1        |full_name2        |fuzzy_score|
+------------------+------------------+-----------+
|John Stevenson Due|John Stevenson Due|         52|
+------------------+------------------+-----------+

Could you please check it out?

Thanks a lot,

Rodrigo

Add native implementations of algorithms

This library is currently just a wrapper around org.apache.commons.commons-text functions.

Native implementations would be a lot better... Who is ready for a hard Spark coding challenge?!

@afranzi @nvander1 - take a look at the README for this project. Want to get this on your radar 🤓

Using stringmetric in PySpark

I was able to install the spark-stringmetric:0.2.0 library on a Databricks cluster, but all the import commands I have tried in my code throw exceptions. What is the library name I should be using in PySpark?

Not able to build in maven.

I have been using this great package, but for the past two weeks, I have been trying to build my package on my new machine, for some reason, I am getting maven transfer error.

here is the dependency, i am using

MrPowers spark-stringmetric 0.2.0

this is coming from
this artifact is located at SparkPackages repository (https://dl.bintray.com/spark-packages/maven/)

Looks like either file is corrupted or something is missing, please look into it.

This tool has really helped me with fuzzy matching.

Any help is highly appreciated.

Raghu
