xdrop / fuzzywuzzy

A Java fuzzy string matching implementation of Python's well-known fuzzywuzzy algorithm. Fuzzy search for Java.

License: GNU General Public License v2.0

Java 87.60% Groovy 12.29% Shell 0.11%

python-levenshtein fuzzywuzzy string-distance fuzzy-matching fuzzy-search java

fuzzywuzzy's Issues

Can fuzzywuzzy be used in this case?

I have the description of a YouTube video and I want to find if a specific word appears in the text, including typos. For example take the following description

If you tell me you're super busy, I'm going to ask to see your written plan.\n\nMy book "10 Steps to Earning Awesome Grades" is now out and it's free! Get it here:\n\nhttp://collegeinfogeek.com/get-better-grades/\n\nIf you want to get even more strategies and tips on becoming a more productive, successful student, subscribe to my channel right here:\n\nhttp://buff.ly/1vQP5ar\n\nConnect with me on Twitter!\n\nhttps://twitter.com/TomFrankly\n\nCompanion blog post with notes and resource links: \n\nhttp://collegeinfogeek.com/massive-workloads/

I would like to know if the word twitter is present in the description. I would then do

FuzzySearch.extractOne(videoDescription, Arrays.asList("Twitter"))
// (string: Twitter, score: 57, index: 0)

And if the text has typos the score decreases as expected.

Is this a good use for the library?

#extractOne Methods for Strings in FuzzySearch should return ExtractedResult<String>

Reference:

Allow to use any object as a choice

First of all, thank you for this great library.
However, there's a small issue I have with it: For one of my projects I'm implementing a search for JavaDoc methods and have a class JavadocMethod with methods like getMethodName(), getClassName() and getUrl().
For searching it would be very convenient to just use the object itself for search, so I can access the url of the found method.
I'm thinking about a generic solution like this:

public static <T> List<ExtractedResult<T>> extractTop(String query, Collection<T> choices, Function<T, String> mapper, int limit)

which allows any object to be used, simply by providing a function that maps the object to a string.

Collection<JavadocMethod> methods = ...;
FuzzySearch.extractTop("String#valeuOf(loong)", methods, method -> String.format("%s#%s", method.getClassName(), method.getMethodName()), 5);

Could you imagine implementing such a feature, or accepting a pull request that adds it?
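
A minimal sketch of the proposed signature, written against plain JDK types rather than the library's own classes. The `Scored` holder, the extra `scorer` parameter and the toy scorer in `main` are assumptions made for illustration; in practice the scorer would be one of the library's ratio functions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;
import java.util.function.ToIntBiFunction;

public class GenericExtractor {
    /** Pairs a choice object with the score of its string representation. */
    public static final class Scored<T> {
        public final T item;
        public final int score;

        Scored(T item, int score) {
            this.item = item;
            this.score = score;
        }
    }

    /** Scores mapper(choice) against the query and returns the best `limit` choices. */
    public static <T> List<Scored<T>> extractTop(String query, Collection<T> choices,
            Function<T, String> mapper, ToIntBiFunction<String, String> scorer, int limit) {
        List<Scored<T>> scored = new ArrayList<>();
        for (T choice : choices) {
            scored.add(new Scored<>(choice, scorer.applyAsInt(query, mapper.apply(choice))));
        }
        // Highest score first; List.sort is stable, so ties keep their input order.
        scored.sort(Comparator.comparingInt((Scored<T> s) -> s.score).reversed());
        return new ArrayList<>(scored.subList(0, Math.min(limit, scored.size())));
    }

    public static void main(String[] args) {
        // Toy scorer for illustration only: 100 for equal strings, 0 otherwise.
        ToIntBiFunction<String, String> exactMatch = (a, b) -> a.equals(b) ? 100 : 0;
        // Each choice is a {name, url} pair standing in for a richer object.
        List<String[]> methods = Arrays.asList(
                new String[] {"String#valueOf", "url-of-valueOf"},
                new String[] {"Long#parseLong", "url-of-parseLong"});
        List<Scored<String[]>> top =
                extractTop("Long#parseLong", methods, m -> m[0], exactMatch, 1);
        System.out.println(top.get(0).item[1]); // the original object is still reachable
    }
}
```

Because the result keeps a reference to the original object, the caller can read any of its fields (such as the URL) after the search.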

module me.xdrop.fuzzywuzzy cannot be resolved to a module

When I create the module-info.java file, the generated module name is fuzzywuzzy, with the warning "name unstable". Changing it to me.xdrop.fuzzywuzzy, as the instructions say, gives the error "module me.xdrop.fuzzywuzzy cannot be resolved to a module".

How to set the scorer like the python fuzzywuzzy?

In the Python fuzzywuzzy, we can set the scorer used when extracting the result. How can we do that here?

process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)
    ("/music/library/good/System of a Down/2005 - Hypnotize/10 - She's Like Heroin.mp3", 61)

Is there a Gitter or Discord channel where such questions can be asked?

Bug in search

Difference between java and python implementation: Spoiler, the problem is the round

Here is the example where I got stuck. The Python implementation returns 22, and the Java implementation returns 23:

fuzz.token_set_ratio(
  "Vêndo ou troco por outro carro pode ser atrasado negócio volta ",
  "Titan 150 ano 2005 ", 
  False
)

FuzzySearch.tokenSetRatio(
  "Vêndo ou troco por outro carro pode ser atrasado negócio volta", 
  "Titan 150 ano 2005"
);

Debugging both implementations, I found that the problem arises when rounding the value 22.5.

Python code, located in utils.py:

int(round(n))

Java code, located in SimpleRatio class is:

(int) Math.round(100 * DiffUtils.getRatio(s1, s2));

TLDR:

Java: Math.round(22.5) => 23
Python: round(22.5) => 22

Don't know which one is correct for this algorithm...
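
The two results differ only in how ties are broken: Java's `Math.round` rounds a .5 fraction up, while the built-in `round` in Python 3 rounds it to the nearest even integer. A minimal sketch, assuming the goal is to reproduce the Python 3 result in Java: `Math.rint` applies the same round-half-to-even rule. The class and method names here are illustrative only.

```java
public class RoundingDemo {
    /** Rounds to the nearest integer, with ties going to the even neighbour. */
    public static int roundHalfEven(double value) {
        return (int) Math.rint(value);
    }

    public static void main(String[] args) {
        System.out.println(Math.round(22.5));    // 23: ties are rounded up
        System.out.println(roundHalfEven(22.5)); // 22: ties go to the even neighbour
        System.out.println(roundHalfEven(23.5)); // 24
    }
}
```

Neither rule is more "correct" for the algorithm itself; the choice only matters when scores must match the Python output exactly.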

Mismatch result if the keyword doesn't exist in the dataset

When I search for a word that does not exist in the data set, the library still suggests an unrelated entry, so it is impossible to tell whether the word was misspelled or simply has no match.

ArrayList<String> dataSet = new ArrayList<>();
dataSet.add("Iphone");
dataSet.add("white");
dataSet.add("black");
dataSet.add("Samsung");
dataSet.add("galaxy");
dataSet.add("gallileo");
dataSet.add("galaksi");
dataSet.add("harry");
dataSet.add("potter");

//string to be compared
String[] searchKeyword = new String[] {"hari poter", "smsung glxy", "xiaomi mi2", "jamu godhong telo"};
for(int i=0;i<searchKeyword.length;i++) {
	String[] keywords =  searchKeyword[i].split(" ");
	long start = System.currentTimeMillis();		
        List<String> checked = new ArrayList<>();
        Arrays.asList(keywords).stream().sequential().forEach(keyword ->{
		ExtractedResult res = FuzzySearch.extractOne(keyword, dataSet);
		checked.add(res.getString());
	});
	long end = System.currentTimeMillis() - start;
	System.out.println(String.format("keyword:%s , spell-checked: %s took:%d", searchKeyword[i], checked, end));
}

Result will be like this

keyword:hari poter , spell-checked: [harry, potter] took:123
keyword:smsung glxy , spell-checked: [Samsung, galaxy] took:6
keyword:xiaomi mi2 , spell-checked: [Iphone, white] took:5
keyword:jamu godhong telo , spell-checked: [Samsung, Iphone, gallileo] took:8
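
The output above is what one would expect from a "best of the list" search: some choice always has the highest score, even when that score is low. A common remedy is to reject results below a cutoff. The following is a minimal sketch of that idea with a pluggable scorer; the `bestMatch` helper, the toy scorer and the cutoff of 70 are assumptions for illustration, and any 0-100 similarity function could be passed in instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.ToIntBiFunction;

public class CutoffMatcher {
    /**
     * Returns the best-scoring choice, or Optional.empty() when even the best
     * score is below the cutoff.
     */
    public static Optional<String> bestMatch(String query, List<String> choices,
            ToIntBiFunction<String, String> scorer, int cutoff) {
        String best = null;
        int bestScore = -1;
        for (String choice : choices) {
            int score = scorer.applyAsInt(query, choice);
            if (score > bestScore) {
                bestScore = score;
                best = choice;
            }
        }
        return bestScore >= cutoff ? Optional.ofNullable(best) : Optional.empty();
    }

    public static void main(String[] args) {
        // Toy scorer for illustration only: 100 for a case-insensitive match, 0 otherwise.
        ToIntBiFunction<String, String> toyScorer = (a, b) -> a.equalsIgnoreCase(b) ? 100 : 0;
        List<String> dataSet = Arrays.asList("Iphone", "Samsung", "galaxy");

        System.out.println(bestMatch("samsung", dataSet, toyScorer, 70)); // Optional[Samsung]
        System.out.println(bestMatch("xiaomi", dataSet, toyScorer, 70));  // Optional.empty
    }
}
```

An empty result can then be reported as "no suggestion" instead of being shown as a spelling correction.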

Strange result

Hi. The ratio between the words "гигантская" and "гигансткая" is 90.
In my opinion, something is wrong here. Or is this a normal result for the library?

Convert codes to apex class

Hi,

I want to port your fuzzywuzzy code to Apex (the language used in the Salesforce cloud), which has a syntax very similar to Java. For now I only plan to port SimpleRatio and PartialRatio. Am I allowed to do that? I also plan to open-source the result in my own project.

Thank you in advance!

[v2] Create a module that enables attaching to Android TextInput elements

install failure

mvn install causes the following test failure on Windows 7 in Git Bash with Java 1.8.0_144:

Running me.xdrop.fuzzywuzzy.algorithms.DefaultStringProcessorTest
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.288 sec <<< FAILURE!
testProcess(me.xdrop.fuzzywuzzy.algorithms.DefaultStringProcessorTest) Time elapsed: 0.075 sec <<< FAILURE!
junit.framework.ComparisonFailure: expected:<s trim [μεγιουνικουντ] n o n a lph a n um> but was:<s trim [▒ ▒▒ ▒ ▒ ▒ ▒ ▒ ▒▒ ▒ ▒ ▒ ] n o n a lph a n um>
at junit.framework.Assert.assertEquals(Assert.java:100)
at junit.framework.TestCase.assertEquals(TestCase.java:261)
at groovy.util.GroovyTestCase.assertEquals(GroovyTestCase.java:284)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
at groovy.lang.MetaClassImpl.invokeStaticMethod(MetaClassImpl.java:1466)
at org.codehaus.groovy.runtime.callsite.StaticMetaClassSite.callStatic(StaticMetaClassSite.java:65)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:56)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:194)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callStatic(AbstractCallSite.java:214)
at me.xdrop.fuzzywuzzy.algorithms.DefaultStringProcessorTest.testProcess(DefaultStringProcessorTest.groovy:9)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at junit.framework.TestCase.runTest(TestCase.java:176)
at junit.framework.TestCase.runBare(TestCase.java:141)
at junit.framework.TestResult$1.protect(TestResult.java:122)
at junit.framework.TestResult.runProtected(TestResult.java:142)
at junit.framework.TestResult.run(TestResult.java:125)
at junit.framework.TestCase.run(TestCase.java:129)
at junit.framework.TestSuite.runTest(TestSuite.java:252)
at junit.framework.TestSuite.run(TestSuite.java:247)
at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:86)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)

License Question

Hello and thank you for publishing this awesome library! I had a question for you regarding the licensing. I wrote a collection of UDFs for Apache Drill that essentially is a wrapper for your library and would like to submit it to Drill, however the GPL license is not compatible with the Apache license.
Would you consider re-releasing this under a different license so that it could be included in a future release of Drill? (https://www.apache.org/legal/resolved.html#category-x)
Thanks!
-- Charles

fuzzywuzzy search gives 86% for all mismatches, or for incorrect match

Thanks for creating this Java API. It is really useful.

But I am facing one issue: I need to match some addresses against a big address list (6000+ records). I am using the extractOne method.

It works perfectly if a similar address is in the list, and gives the correct score (87%-100%).

But if it doesn't find a good match, it always gives me an 86% match, even when both addresses are totally different.
Example -
Addr 1 - HUNTINGTON NATIONAL BANK 328 SOUTH SAGINAW ST FLINT MI 48502
It matches to - BANK OF WEST PO BOX 2000 OMAHA NE 68103
and give Score - 86%

Still Incompatibility with the Python Version

I saw that there was a new version 1.3.4, so I used it, but I think that underscore-handling issue is not fixed - all the examples now return 100...

Here is how I run them in Java:
System.out.println("expected 58 -> got " + FuzzySearch.tokenSetPartialRatio("worm_mikeala", "mikeala rath"));
System.out.println("expected 80 -> got " + FuzzySearch.tokenSetPartialRatio("c_wasyluka", "crystal wasyluka"));
System.out.println( "expected 78 -> got " + FuzzySearch.tokenSetPartialRatio("a_bacdefg", "crystal bacdefg"));

I get:
expected 58 -> got 100
expected 80 -> got 100
expected 78 -> got 100

and here is how I run them in Python:
from fuzzywuzzy import fuzz
if __name__ == '__main__':
    print(fuzz.partial_token_set_ratio("worm_mikeala", "mikeala rath"))
    print(fuzz.partial_token_set_ratio("c_wasyluka", "crystal wasyluka"))
    print(fuzz.partial_token_set_ratio("a_bacdefg", "crystal bacdefg"))

I get:
58
80
78

Am I doing something wrong or is there still an issue?
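
One possible explanation, offered as an assumption rather than something confirmed against either code base: if the Python preprocessing treats `_` as a word character while the Java side treats it as a separator, then "worm_mikeala" becomes the two tokens "worm" and "mikeala" in Java, and a token shared with "mikeala rath" is enough to push a partial token set score to 100. The sketch below only demonstrates the tokenization difference; the class and pattern names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

public class UnderscoreTokens {
    // \W in Unicode mode: letters, digits and '_' are kept as word characters.
    public static final Pattern NON_WORD =
            Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS);
    // Stricter variant: anything that is not a letter or digit, so '_' is a separator.
    public static final Pattern NON_ALNUM = Pattern.compile("[^\\p{L}\\p{Nd}]");

    /** Replaces separator characters with spaces, lower-cases, and splits on whitespace. */
    public static Set<String> tokens(String s, Pattern separator) {
        String cleaned = separator.matcher(s).replaceAll(" ").toLowerCase(Locale.ROOT).trim();
        return new LinkedHashSet<>(Arrays.asList(cleaned.split("\\s+")));
    }

    /** Tokens that both strings have in common under the given separator rule. */
    public static Set<String> shared(String a, String b, Pattern separator) {
        Set<String> common = tokens(a, separator);
        common.retainAll(tokens(b, separator));
        return common;
    }

    public static void main(String[] args) {
        System.out.println(shared("worm_mikeala", "mikeala rath", NON_WORD));  // []
        System.out.println(shared("worm_mikeala", "mikeala rath", NON_ALNUM)); // [mikeala]
    }
}
```

If this is indeed the cause, passing strings through the same separator rule on both sides should bring the scores back in line.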

Inconsistent results from extractOne and extractTop

I see different results returned by extractOne and extractTop for the same query string and collection.

I have a fairly long collection (15k strings) to search for each query.

For Instance, let's say I have the following scenario
Query - ABC 1721
The collection has following strings in it
ABC1721
ABC1721-FGH/L9
ABC MERAKI Z1
EFGD3111/Z1-ABC
and many more

extractOne("ABC 1721", collection)
gives - ABC1721, Ratio - 95

extractTop("ABC 1721", collection,1)
gives - ABC1721, Ratio - 95

But the problem arises when I want the top 5 results:
extractTop("ABC 1721", collection,5)
Match 1 - ABC1721-FGH/L9, Ratio - 86
Match 2 - ABC MERAKI Z1, Ratio - 86
Match 3 - EFGD3111/Z1-ABC, Ratio - 86
and so on

I tried using 'extractSorted' as well; its results are not consistent with extractOne either.

I used extractTop (for the top 5) and extractOne for 1000+ queries. For around 70% of them, the first match from extractTop differs from the result of extractOne.

BTW, I appreciate your efforts in porting the Python logic to Java without any performance lag.

levenshtein distance issue

levEditDistance("sf&t co., ltd.","sft",1) = 13 when it is actually 11.

Apache Commons StringUtils.getLevenshteinDistance gives the correct result.

Wrong score in Partial Ratio

Hi,

I am using 1.4.0, and this gives a wrong result with partial ratio:

FuzzySearch.partialRatio("ttttttttt virtuale ggggggggggggvo zizzrztuta mmmmmle", "virtuale");

the score is 50, it has to be 100 imho.

The python version returns 100 too:

>>> fuzz.partial_ratio("ttttttttt virtuale ggggggggggggvo zizzrztuta mmmmmle", "virtuale")
100

Thanks for the help

Is there a security scanning performed on this project?

I am very thankful to the contributors for this Java fuzzy match library with the most popular matching algorithms.

Is there a GitHub security scanning performed on this project? I did not observe a scanning policy under the security page but understand there are multiple options to implement scanning where that policy may not exist.

Include index in match result

It would be useful to also get the index of the matched item for each match in the result list.

Example

FuzzySearch.extractTop("goolge", ["google", "bing", "facebook", "linkedin", "twitter", "googleplus", "bingnews", "plexoogl"], 3)
[(string: google, score:83, index:0), (string: googleplus, score:63, index:5), (string: plexoogl, score:43, index:7)]

FuzzySearch.weightedRatio throws ExceptionInInitializerError on Android 5

When I call FuzzySearch.weightedRatio("lupa","pupa") on Android 5.1, I get an ExceptionInInitializerError. I include the library via Gradle: compile 'me.xdrop:fuzzywuzzy:1.1.5'. Do you have any idea why?
P.S.: It works fine on all Android 4.* versions.

How does this library handle upper and lower case?

When comparing strings, the strings' capitalization affects the value returned. It appears this library is case sensitive. What are the parameters for CAPS vs lowercase? How much does the value decrease if a text such as "fuzzywuzzy" was matched with "FuZzYwUzZy" vs "fuzzywuzzy"?

Very curious!

Incorrect levenshtein distance for completely edited strings

When I calculate the ratio of "abcdef" - "fedcba" , it results in 17, even though I expected 0.

As I understand it, the ratio is calculated as r = (1 - d/L)*100,
with d being the Levenshtein distance and L the sum of the lengths of the two compared strings.

In this library the levenshtein distance is valued with 1 for each insert/delete and 2 for each replace.

The levenshtein distance in this library, for these two strings should be 12 (2 for each replace), resulting in a ratio = (1 - 12/12)*100 = 0

However, in your library, the ratio results in 17, instead of 0. This is because the distance it calculates is 10 instead of 12, resulting in (1-10/12)*100=17 .

This seems to be the case for strings of any length with 100% replacements, as if one replacement is missed.
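
With the weights described above (1 per insert or delete, 2 per replace), the cheapest edit script does not have to replace every character: keeping one shared character in place and deleting and inserting the other five on each side costs 5 + 5 = 10, which is less than six replacements at a cost of 12. The following sketch, with illustrative class and method names, computes the weighted distance by dynamic programming so the numbers can be checked independently of the library.

```java
public class WeightedLevenshtein {
    /** Edit distance with insert = 1, delete = 1, replace = 2. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building a prefix of b from nothing takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // removing a prefix of a takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int replaceCost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 2;
                curr[j] = Math.min(prev[j - 1] + replaceCost,
                        Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** r = (1 - d / L) * 100, where L is the sum of both string lengths. */
    public static int ratio(String a, String b) {
        int total = a.length() + b.length();
        if (total == 0) {
            return 100;
        }
        return (int) Math.round(100.0 * (total - distance(a, b)) / total);
    }

    public static void main(String[] args) {
        System.out.println(distance("abcdef", "fedcba")); // 10
        System.out.println(ratio("abcdef", "fedcba"));    // 17
    }
}
```

So a distance of 10 and a ratio of 17 are what these weights produce for a reversed string; a ratio of 0 would require the two strings to share no character at all.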

[maybe] Implement a default ignoreCase StringMapper implementation

As previously mentioned in #29.

StringIndexOutOfBoundsException in partialratio

java.lang.StringIndexOutOfBoundsException: String index out of range: 49
at java.lang.String.substring(String.java:1963)
at com.xdrop.fuzzywuzzy.ratios.PartialRatio.apply(PartialRatio.java:43)
at com.xdrop.fuzzywuzzy.FuzzySearch.partialRatio(FuzzySearch.java:45)

test case:
FuzzySearch.partialRatio("pros holdings, inc.","settlement facility dow corning trust")

Using custom object Instead of String would lead to performance issue?

Hi, First of all, thanks @xdrop for work on this project.

I have a Spring Boot WebFlux project and I need to do a fuzzy search on one of the fields. I load the data in memory: as soon as my application starts, I load the fuzzy search data into the respective list. On subsequent API calls

After reading the API docs, I have two approaches in my mind.

1. Approach first

Use the list of string keys in a variable and a map of keys to the equivalent object in another variable. Fuzzy search using the list of keys. When I get the response back, map the key to the object and return

data class WeatherData(val key: String, val region: String)

// Service function for getting fuzzy search extracted Result
@Component
class FuzzySearchClient(val keys: MutableList<String>, val keysToWeatherDataMap: MutableMap<String, WeatherData> = mutableMapOf()) {

    fun fuzzySearchInMemory(query: String): Mono<List<SearchResponse>> {
        val result: List<ExtractedResult> = FuzzySearch.extractTop(query, keys, 5)
        val searchList: List<SearchResponse> = result.map { extractedResult: ExtractedResult ->
            val WeatherData = keysToWeatherDataMap[extractedResult.string]
            SearchResponse(WeatherData?.key!!, WeatherData.region!!)
        }
        return Mono.just(searchList)
    }
}

Function for adding keys in memory takes ~6s with approach one

@Component
class LoadDataInMemoryCache(
    private val weatherDataRepository: WeatherDataRepository,
    private val searchClient: FuzzySearchClient
) {

    private val logger = KotlinLogging.logger {}


    @EventListener(ApplicationReadyEvent::class)
    fun loadData() {
        val startTime = AtomicReference<Long>()
        weatherDataRepository.findAll()
            .doOnSubscribe { startTime.set(System.nanoTime()) }
            .doFinally { logger.info("Time taken for adding data in memory ${TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime.get())} milliseconds.") }
            .subscribe {
                searchClient.keys.add(WeatherData(it?.key!!, it.region!!))
            }
    }
}

2. Approach two

Use weather object keys and define ToStringFunction and get the result and map to the appropriate response.

data class WeatherData(val key: String, val region: String)

data class SearchResponse(val key: String, val region: String)

class WeatherSearchToStringFunction: ToStringFunction<WeatherData> {
    override fun apply(item: WeatherData?): String {
        return item?.key!!
    }
}


@Component
class SearchClient(val keys: MutableList<WeatherData>) {
    fun fuzzySearchInMemory(query: String): Mono<List<SearchResponse>> {
        val result: MutableList<BoundExtractedResult<WeatherData>> = FuzzySearch.extractTop(query, keys, WeatherSearchToStringFunction(), 5)
        val searchList: List<SearchResponse> = result.map { extractedResult: BoundExtractedResult<WeatherData> ->
            SearchResponse(extractedResult.referent?.key!!, extractedResult.referent.region)
        }
        return Mono.just(searchList)
    }
}

I am not sure which approach to choose. Suggestions are welcome.

PartialRatio issue

This is essentially reopening issue #39, since the introduced fix does not solve the problem, but just makes it work for this explicit example.
E.g.

FuzzySearch.partialRatio("no", "bnonco");

should return the score 100.
This worked until #80, but returns the score 50 after reordering the cases
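
As a reference point for reports like this one, a brute-force partial ratio is easy to check by hand: compare the shorter string against every equally long window of the longer string and keep the best score. This is a sketch under stated assumptions, not the method used by this library or by the Python original; the class name is hypothetical, and the per-window score is taken as 2 * (longest common subsequence) / (combined length).

```java
public class NaivePartialRatio {
    /** Length of the longest common subsequence of a and b. */
    static int lcs(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                dp[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length()][b.length()];
    }

    /** Similarity in [0, 100]: twice the matched characters over the combined length. */
    static int ratio(String a, String b) {
        int total = a.length() + b.length();
        return total == 0 ? 100 : (int) Math.round(200.0 * lcs(a, b) / total);
    }

    /** Best ratio of the shorter string against any same-length window of the longer one. */
    public static int partialRatio(String s1, String s2) {
        String shorter = s1.length() <= s2.length() ? s1 : s2;
        String longer = s1.length() <= s2.length() ? s2 : s1;
        int best = 0;
        for (int start = 0; start + shorter.length() <= longer.length(); start++) {
            String window = longer.substring(start, start + shorter.length());
            best = Math.max(best, ratio(shorter, window));
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(partialRatio("no", "bnonco")); // 100: "no" occurs as a substring
    }
}
```

Whenever the shorter string occurs verbatim inside the longer one, this exhaustive search returns 100, which makes it a convenient oracle for regression tests even though it is slower than a block-based approach.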

Is there somewhere I can find out what the different methods do?

I'm not familiar with the Python version of this; I gather from the readme that there are several different method calls that do different things with matching. I've found:

ratio
partialRatio
various tokenX methods
weightedRatio
various extractX methods

I generated the javadoc, but that didn't explain what these different methods do. I think the fuzzy matching could be very useful in what I'm doing, but just using ratio is a bit limiting, and I don't know what the other ones do. Is there documentation of what these things mean and do somewhere?

Thanks for the Library, Here's How I Used It!

Great Library and great work @xdrop ! Thank you so much for creating something that works for Android and sharing it! This library does something I have no idea how to do and would take me countless hours to create :D

Initially, I could not figure out a way to use the library. I have a SQLite database full of text values that I want to search. Unfortunately, neither SQL nor this library has an interface to do fuzzy search without a Full Table Search. Thankfully, I found a great workaround that uses my current dependencies.

This library works great with FlexibleAdapter (https://github.com/davideas/FlexibleAdapter). FlexibleAdapter has a builtin Async filtering mechanism that is extremely fast. Using the code below, I am able to filter my entire listview smoothly and with animations!

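    @Override
    public boolean filter(String constraint) {
        // Normalize both sides once so the fuzzy and substring checks are case-insensitive.
        String normalizedTitle = title.toLowerCase().trim();
        String normalizedConstraint = constraint.toLowerCase().trim();
        int fuzzyRatio = FuzzySearch.partialRatio(normalizedTitle, normalizedConstraint);
        Log.d("Fuzzy Search Ratio", String.valueOf(fuzzyRatio));
        // Accept fuzzy matches scoring 70 or more, or a plain substring hit.
        return fuzzyRatio >= 70 || normalizedTitle.contains(normalizedConstraint);
    }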

I find that 70 is a really good value when using partial ratio.
Thanks to this library, I can provide an experience rivaling Google and Facebook! 🥇

Could you create new release in http://maven.org/ with the new license (GPLv2)?

Hi, looks like the 1.2.0 on http://maven.org/ is still GPLv3. Could you please create a new release with the new license (GPLv2)?

Thanks!

Divide by zero exception when using Basic Algorithm

When given a string that contains only non-alphanumeric characters (e.g. "$#"), the Basic Algorithm throws the following exception:
java.lang.ArithmeticException: / by zero
at me.xdrop.fuzzywuzzy.algorithms.WeightedRatio.apply(WeightedRatio.java:32)
at me.xdrop.fuzzywuzzy.algorithms.BasicAlgorithm.apply(BasicAlgorithm.java:22)
at me.xdrop.fuzzywuzzy.Extractor.extractWithoutOrder(Extractor.java:43)
at me.xdrop.fuzzywuzzy.Extractor.extractTop(Extractor.java:100)

I believe this happens because the string processor replaces those characters with spaces and then trims the result, leaving a string of length zero.

Could not find library in gradle

I am trying to use this library for my android studio project.
But I am facing this issue.

Could not find me.xdrop:fuzzywuzzy:1.3.0.
Required by:
    project :app
Search in build.gradle files

Can someone help with this?
Thanks

Add StringProcessor's for simple/partial ratios as well

Following #29, it would make sense to add StringProcessor overloads for the simple/partial ratios as well, so that they are consistent with the rest.

Can we prioritize results to push the closest match to the top

Hi, I am using this library for a small data set of 10k records. But for some strings, I am getting results in the wrong order.

For example, for the query "Visa" and the following list of choices:

choices = ["grupo televisa s.a.", "is", "sa", "visa inc.", "via"]

// result
('grupo televisa s.a.', 90), ('is', 90), ('sa', 90), ('visa inc.', 90)

I want the Visa string to appear in first place. How can I achieve that?
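
When several choices tie on score, a secondary sort key can decide their order. The sketch below re-ranks (choice, score) pairs by score first and then by how early the query occurs inside the choice. It works on plain `Map.Entry` pairs rather than the library's result type, and the `rerank` and `position` helpers are hypothetical names introduced for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TieBreaker {
    /** Index of the query inside the choice (ignoring case), or Integer.MAX_VALUE if absent. */
    static int position(String choice, String query) {
        int idx = choice.toLowerCase(Locale.ROOT).indexOf(query.toLowerCase(Locale.ROOT));
        return idx < 0 ? Integer.MAX_VALUE : idx;
    }

    /** Sorts by score (descending), then by how early the query appears in the choice. */
    public static List<Map.Entry<String, Integer>> rerank(
            List<Map.Entry<String, Integer>> results, String query) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(results);
        sorted.sort(Comparator
                .comparing((Map.Entry<String, Integer> e) -> e.getValue(),
                        Comparator.reverseOrder())
                .thenComparingInt(e -> position(e.getKey(), query)));
        return sorted;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> results = new ArrayList<>();
        results.add(new SimpleEntry<>("grupo televisa s.a.", 90));
        results.add(new SimpleEntry<>("is", 90));
        results.add(new SimpleEntry<>("sa", 90));
        results.add(new SimpleEntry<>("visa inc.", 90));

        // "visa inc." contains the query at index 0, so it is listed first.
        System.out.println(rerank(results, "Visa"));
    }
}
```

Other tie-breakers (for example, preferring choices whose length is closest to the query) can be chained in the same way with further `thenComparing` calls.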

Wrong ratio

FuzzySearch.ratio("csr", "c s r") = 50.
The expected value is 75: (8 - 2) / 8.

Difference in extractOne results compared to Python version

I just noticed a difference in the results of extractOne between the Python and Java version.
My token is 19 craven park harlesden and my choices are ["NW10 8SU", "19 Craven Park, Harlesden", "Steven Gerrard"].

In the Python version, the following code:

process.extractOne(query, choices, scorer=fuzz.ratio)

produces:

('19 Craven Park, Harlesden', 98)

In the Java version, the following code:

 ExtractedResult result = FuzzySearch.extractOne(query, choices, new SimpleRatio());

matches 19 Craven Park, Harlesden, but with a score of 86 instead.

I dug a bit deeper into this and found that you can get 86 by doing a direct ratio comparison in the Python version:

fuzz.ratio("19 Craven Park, Harlesden", "19 craven park harlesden") gives 86

However, the extractOne function in Python first processes the string by calling full_process in utils.py before calling the ratio function. Judging by the results of the Java version, it does not seem to process the string in the same way before calling SimpleRatio().

It's either this or I am making some mistake in calling the function. Could you please shed some light on this.
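
To test that hypothesis outside the library, the preprocessing can be approximated in a few lines of Java. This sketch assumes that `full_process` replaces every non-word character with a space, lower-cases the result and strips the ends; the class and method names are illustrative, and the exact character classes used by either implementation may differ.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class FullProcessSketch {
    // Matches anything that is not a letter, digit or underscore (Unicode-aware).
    private static final Pattern NON_WORD =
            Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS);

    /** Replaces non-word characters with spaces, lower-cases, and trims. */
    public static String fullProcess(String s) {
        String spaced = NON_WORD.matcher(s).replaceAll(" ");
        return spaced.toLowerCase(Locale.ROOT).trim();
    }

    public static void main(String[] args) {
        // The comma turns into a space, leaving a double space before "harlesden".
        System.out.println("[" + fullProcess("19 Craven Park, Harlesden") + "]");
        System.out.println("[" + fullProcess("19 craven park harlesden") + "]");
    }
}
```

After this step the two strings differ only by one extra space, which is consistent with the Python score: (49 - 1) / 49 rounds to 98. Applying the same normalization to the query and the choices before calling the Java ratio should therefore narrow the gap.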

GPL - v2 or v3?

Hello,

In #35 you've noted that "this is a rewrite of https://github.com/seatgeek/fuzzywuzzy, which forces this to be licensed under the same license (GPL) as the original library."

The Python package is licensed under GPL-2.0 without clarification if it's GPL-2.0-or-later or GPL-2.0-only, and some implication in the commit message and the timing of when the Python project was relicensed from MIT to GPL-2.0 that it was probably meant to be GPL-2.0-only.

This port has a GPL-3.0 license file.

Was it your intention to license this project under GPL-2.0 to match the license of the original project? If so, would you have any objection to taking the GPL-2.0 license text instead of GPL-3.0?

Thanks!

Difference in PartialScore between Java and Python Implementations

Hi,

I noticed when testing the values outputted from the Java implementation that given:
s1 = "haeagen dazs"
s2 = "liverpool altabrisa"
The Java implementation for PartialScore outputs 25, while the python implementation (fuzz.partial_ratio(s1,s2)) outputs 29. Wanted to report this discrepancy, and was wondering if anyone knew the cause of it (maybe rounding issues?)?

Thank you!

FuzzyWuzzy MIT?

There's an MIT-licensed version in Python.

Can we have the same for Java?

The license is the biggest issue that I and 90% of other developers are facing.

And the worst thing is that there is no alternative Java library that offers even the bare minimum performance of this one.

I've searched everywhere.

A Levenshtein distance port for Java is available, but it performs very poorly when matching short user input (2-3 chars) against a list of strings,
e.g. matching "sai" with school names.

NoClassDefFoundError

I got this error while calling

FuzzySearch.tokenSortRatio(stringA, stringB) + FuzzySearch.tokenSetRatio(stringA, stringB)

stackTrace: java.lang.RuntimeException: java.lang.NoClassDefFoundError: me/xdrop/fuzzywuzzy/FuzzySearch

I imported this library as a gradle dependency

implementation 'me.xdrop:fuzzywuzzy:1.3.1'

It doesn't look like an issue caused by transitive dependency.

./gradlew dependencies

+--- com.jayway.jsonpath:json-path:2.4.0 (*)
+--- me.xdrop:fuzzywuzzy:1.3.1

v1.3.0 I can't find the .pom file

Hello,
I'm trying to use v 1.3.0 but I'm facing the following error

Could not find me.xdrop:fuzzywuzzy:1.3.0.
Searched in the following locations:

https://repo.maven.apache.org/maven2/me/xdrop/fuzzywuzzy/1.3.0/fuzzywuzzy-1.3.0.pom

https://jcenter.bintray.com/me/xdrop/fuzzywuzzy/1.3.0/fuzzywuzzy-1.3.0.pom

Possible solution:

Declare repository providing the artifact, see the documentation at https://docs.gradle.org/current/userguide/declaring_repositories.html

I can't find the .pom file in the following directories
https://repo.jfrog.org/artifactory/libs-release-bintray/me/xdrop/fuzzywuzzy/1.3.0/
https://repo.maven.apache.org/maven2/me/xdrop/fuzzywuzzy/1.3.0/

am I missing something?
I have the following repositories defined in my build.gradle

repositories {
  jcenter()
  mavenCentral()
}

partial ratio issue

FuzzySearch.partialRatio("chicago transit authority" , "cta") expected value=67

The actual value is 33.

partialRatio issue

FuzzySearch.partialRatio("kaution", "kdeffxxxiban:de1110010060046666666datum:16.11.17zeit:01:12uft0000899999tan076601testd.-20-maisonette-z4-jobas-hagkautionauszug");

Result is "57", I expect "100".

Using 1.1.9.

Performance issue

Thank you for this awesome library; I am using it in my Android project. It takes a lot of time because I pass an array list of strings for comparison, and it is called each time the user enters a new character.
Is there any way I can improve its performance?

[todo] Set up the publication in develop branch

Results differ from python library

Hi, while porting some python code to java I discovered that the Token Sort and Token Set Ratios calculated by this library oftentimes do not match the ones calculated by the python fuzzywuzzy library.

Here is an example:
Python Code:

from fuzzywuzzy import fuzz 
print(str(fuzz.token_sort_ratio("efwe fwef","wef wefwef"))) 
print(str(fuzz.token_set_ratio("efwe fwef","wef wefwef")))

Output:

53
53

Java Code:

import me.xdrop.fuzzywuzzy.FuzzySearch;

public class Main {
	public static void main(String[] args) {
		System.out.println(FuzzySearch.tokenSortRatio("efwe fwef","wef wefwef"));
		System.out.println(FuzzySearch.tokenSetRatio("efwe fwef","wef wefwef"));
	}
}