
Comments (4)

econabhishek avatar econabhishek commented on May 29, 2024

Thanks! We are thinking about deprecating this. It had different logic before, but we reverted to a single behaviour because a fuzzy merge is slightly different from a standard merge.
For now, we treat the "right" columns as the "corpus" (in retrieval terminology) and the "left" columns as the "queries".
We believe that for most use cases of this tool, 1:1 is highly unlikely to ever be used because of the very fuzzy nature of the task. To keep the API simple, we dropped the separate 1:m and m:1 logic that is standard for exact matching in statistical packages. In most at-scale tasks we have no idea whether there are fuzzy duplicates in either the left or the right columns, so even m:m is supported (though anyone who does record linkage would recommend against it).
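To make that retrieval framing concrete, here is a minimal sketch; the column name, the model id, and keyword arguments such as merge_type are assumptions for illustration, not guaranteed defaults of the package:

```python
import pandas as pd
import linktransformer as lt

# Hypothetical toy frames: the left frame supplies the "queries",
# the right frame supplies the "corpus" that gets searched.
df_left = pd.DataFrame({"company": ["Intl. Business Machines", "Alphabet Inc"]})
df_right = pd.DataFrame({"company": ["IBM", "Google (Alphabet)", "Microsoft"]})

# Each left row is matched to its nearest right row, so the output has len(df_left) rows.
matched = lt.merge(
    df_left, df_right,
    merge_type="m:m",   # duplicates on either side are tolerated, per the note above
    on="company",
    model="sentence-transformers/all-MiniLM-L6-v2",
)
print(matched)
```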

However, we are open to suggestions, and if you have a case for keeping these and adjusting the logic appropriately, we are happy to revisit this. 1:m and m:1 would just be symmetric and 1:1 is a special case of 1:m.

Also, a fun side note: pandas' standard merge has an option called validate where these options can be specified. It only checks for duplication of keys; it has no bearing on the match itself. We could probably rename our option to validate as well.
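For reference, the pandas behaviour being described looks like this (standard pandas API; the frames themselves are made up):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "a", "b"], "y": [10, 11, 20]})

# validate only checks key uniqueness on each side; it does not change how rows are matched.
pd.merge(left, right, on="key", validate="1:m")   # fine: left keys are unique

# This raises pandas.errors.MergeError because the right keys are duplicated.
try:
    pd.merge(left, right, on="key", validate="1:1")
except pd.errors.MergeError as err:
    print("validation failed:", err)
```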


seanoconnor23 avatar seanoconnor23 commented on May 29, 2024

Interesting, thanks for the explanation and for getting back to me so quickly! I'll describe the type of datasets I am using and how 1:m would be really helpful for me.

In dataset one I have about 30,000 users each with a unique id assigned to them.
In dataset two I have about 50,000 users each with a unique id assigned to them.

The problem I'm running into is that I need to map one id from dataset one to multiple ids in dataset two, because in dataset two the company can change an individual's id more than once. The reason I need 1:m is that I want to map one id to multiple ids so I can create a universal id for that person. However, the results that come back from the merge() function only have as many rows as dataset one, i.e. len(dataset_one) = 30,000 - effectively a 1:1 result - so it's dropping potential mappings for me.

I know I could switch the datasets around but I need to keep the 30,000 dataset as df1 and the 50,000 dataset as df2.

If you could provide the snippet I need for a 1:m mapping that would be really appreciated 🙏

Thanks once again!


econabhishek avatar econabhishek commented on May 29, 2024

Thanks! Just so I understand this clearly, what is preventing you from switching the two datasets around?
What the code is doing in words is "For each row in df1 (left columns) find a nearest neighbour from df2 (right columns)".
Hence, you only see rows = len(df1). You have multiple options within the current framework - tell me if these sound neat enough to you.

  1. Use merge_knn() with a generous k (say 10) - this will return the k nearest neighbors of the left rows from the right df. Then choose a score threshold that makes sense (e.g. scores > 0.8 give reasonable matches) and filter the matches; there is a sketch of this below. This should be very close to what you need. k can be very large - but we recommend keeping it as low as possible for speed. Meta's FAISS library (which forms our retrieval backbone) recommends keeping it below 900. An example is in this notebook.

  2. Switch the dataframes around and use merge(). This time the result will have len(df2) = 50,000 rows. You can again check which score makes sense (the score is symmetric since we are using cosine similarity) and drop matches below your similarity threshold. I am not sure why you don't want to switch them around.

If you think about this a bit more conceptually, when we want a 1:m fuzzy merge we are really saying: "I want several suitable matches from df2 for each row in df1". That goal can be achieved by either option 1 or option 2 - does this make sense?
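A minimal sketch of both options, assuming the merge_knn()/merge() calls roughly follow the discussion above; the column names, k=10, the 0.8 cutoff, and the "score" column are illustrative guesses rather than documented defaults:

```python
import pandas as pd
import linktransformer as lt

# Toy stand-ins for the two user tables (the real data has ~30k and ~50k rows).
df_one = pd.DataFrame({"id_one": [1, 2], "name": ["Jon Smith", "Ann Lee"]})
df_two = pd.DataFrame({"id_two": [10, 11, 12], "name": ["Jonathan Smith", "J. Smith", "Ann Lee"]})

# Option 1 (assumed merge_knn signature): k nearest df_two rows for every df_one row,
# then keep only matches above a similarity cutoff.
knn = lt.merge_knn(
    df_one, df_two,
    on="name",
    k=10,
    model="sentence-transformers/all-MiniLM-L6-v2",
)
one_to_many = knn[knn["score"] > 0.8]

# Option 2: swap the frames so every df_two row gets its nearest df_one match,
# i.e. the result has len(df_two) rows; filter by score the same way.
swapped = lt.merge(
    df_two, df_one,
    on="name",
    model="sentence-transformers/all-MiniLM-L6-v2",
)
swapped = swapped[swapped["score"] > 0.8]
```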


econabhishek avatar econabhishek commented on May 29, 2024

Closing due to inactivity; solutions are provided above.

