Comments (4)
Thanks! We are thinking about deprecating this option. It had different logic before, but we reverted to a single behaviour because a fuzzy merge is slightly different from a standard merge.
For now, we treat the "right" columns as the "corpus" in retrieval terminology and the "left" columns as the "queries".
We believe that with most use cases of this tool, 1:1 is highly unlikely to ever be used because of the very fuzzy nature of the task. To keep the API simple, we dropped the 1:m and m:1 logic that is standard for exact matching in statistical packages. In most at-scale tasks, we have no idea whether there are fuzzy duplicates in either the left or the right columns, so even m:m is supported (though anyone who does record linkage would recommend against it).
However, we are open to suggestions, and if you have a case for keeping these and adjusting the logic appropriately, we are happy to revisit this. 1:m and m:1 would just be symmetric and 1:1 is a special case of 1:m.
Also, a fun side note: pandas' standard merge has an option called `validate` where these options can be specified. It only checks for duplication of keys and has no bearing on the match itself. We could probably rename our option to `validate` as well.
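For reference, a minimal sketch of how pandas' `validate` behaves (toy data; it asserts key uniqueness but never changes which rows match):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "a", "b"], "y": [10, 11, 12]})

# "1:m" asserts that the left keys are unique; duplicated right keys are fine.
ok = pd.merge(left, right, on="key", validate="1:m")
print(len(ok))  # 3 rows: "a" on the left matches both right "a" rows

# "1:1" additionally asserts that the right keys are unique, so this raises
# pandas.errors.MergeError because "a" appears twice in `right`.
try:
    pd.merge(left, right, on="key", validate="1:1")
except pd.errors.MergeError as e:
    print("validation failed:", e)
```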
from linktransformer.
Interesting, thanks for the explanation and getting back to me so quickly! I'll describe the type of datasets I am using and how the 1:m would be really helpful for me.
In dataset one I have about 30,000 users each with a unique id assigned to them.
In dataset two I have about 50,000 users each with a unique id assigned to them.
The problem I'm running into is that I need to map one id from dataset one to multiple ids in dataset two, because in dataset two the company can change an individual's id more than once. The reason I need 1:m is that I want to map one id to multiple ids so I can create a universal id for that person. However, after the merge() function, the results you get back only equal the size of dataset one (len(dataset_one) = 30,000), i.e. a 1:1 mapping, so it's dropping potential mappings for me.
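To make the goal concrete, here is a hypothetical sketch (made-up ids and column names) of what I'd do with a 1:m match output, collapsing it into one universal id per person with a pandas groupby:

```python
import pandas as pd

# Hypothetical 1:m match output: one dataset-one id can map to
# several dataset-two ids (the company reissued u1's id once).
matches = pd.DataFrame({
    "id_df1": ["u1", "u1", "u2"],
    "id_df2": ["a7", "b3", "c9"],
})

# Collapse the matches so each dataset-one id carries every
# dataset-two id it was ever linked to - one "universal id" per person.
universal = (
    matches.groupby("id_df1")["id_df2"]
    .agg(list)
    .reset_index(name="all_ids")
)
# universal maps u1 -> ['a7', 'b3'] and u2 -> ['c9']
```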
I know I could switch the datasets around but I need to keep the 30,000 dataset as df1 and the 50,000 dataset as df2.
If you could provide the snippet I need for a 1:m mapping that would be really appreciated 🙏
Thanks once again!
Thanks! Just so I understand this clearly, what is preventing you from switching the two datasets around?
What the code is doing in words is "For each row in df1 (left columns) find a nearest neighbour from df2 (right columns)".
Hence, you only see rows = len(df1). You have multiple options within the current framework - tell me if these sound neat enough to you.
1. Use merge_knn() with a generous k (like 10). This will return the k nearest neighbours of each left row from the right df. Then choose a score threshold that makes sense (e.g. scores > 0.8 give reasonable matches) and filter the matches. This should be very close to what you need. k can be very large, but we recommend keeping it as low as possible for speed; Meta's FAISS library (which forms our retrieval backbone) recommends keeping it below 900. An example is in this notebook.
2. Switch the dataframes around and use merge(). This time the result will have len(df2) = 50,000 rows. You can again check which score makes sense (scores are symmetric since we use cosine similarity) and drop matches below the threshold similarity. I am not sure why you don't want to switch this around.
If you think about this a bit more conceptually, when we want a 1:m fuzzy merge we are saying: "I want several suitable matches from df2 for each row in df1". Either option 1 or 2 achieves that goal - does this make sense?
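The logic of option 1 can be sketched in plain NumPy (this is not linktransformer's API; the toy "embeddings", names, and the 0.8 threshold are illustrative):

```python
import numpy as np

def knn_matches(queries, corpus, k=2, threshold=0.8):
    """For each query row, return (query_idx, corpus_idx, score) for the
    k most cosine-similar corpus rows whose score clears the threshold."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = q @ c.T                             # cosine similarity matrix
    top_k = np.argsort(-scores, axis=1)[:, :k]   # best k corpus rows per query
    out = []
    for i, idxs in enumerate(top_k):
        for j in idxs:
            if scores[i, j] >= threshold:
                out.append((i, int(j), float(scores[i, j])))
    return out

# Toy "embeddings": the query is close to corpus rows 0 and 1 but not 2,
# so one left row keeps two right-side matches - the 1:m behaviour.
queries = np.array([[1.0, 0.0]])
corpus = np.array([[0.9, 0.1], [1.0, 0.05], [0.0, 1.0]])
matches = knn_matches(queries, corpus, k=3, threshold=0.8)
```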
Closing due to inactivity; solutions are provided above.