Comments (4)
Thanks! We are thinking about deprecating this option. It had different logic before, but we reverted to a single behaviour because a fuzzy merge is slightly different from a standard merge.
For now, we treat the "right" columns as the "corpus" in retrieval terminology and the "left" columns as the "queries".
We believe that with most use cases of this tool, 1:1 is highly unlikely to ever be used because of the very fuzzy nature of the task. To keep the API simple, we dropped the 1:m and m:1 logic that is standard for exact matching in statistical packages. In most at-scale tasks, we have no idea whether there are fuzzy duplicates in either the left or the right columns, so even m:m is supported (though anyone who does record linkage would recommend against it).
However, we are open to suggestions, and if you have a case for keeping these and adjusting the logic appropriately, we are happy to revisit this. 1:m and m:1 would just be symmetric and 1:1 is a special case of 1:m.
Also, a fun side note: pandas' standard merge has an option called `validate` where these options can be specified. It only checks for duplication of keys and has no bearing on the match itself. We could probably rename our option to `validate` as well.
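For reference, a minimal sketch of how pandas' `validate` behaves (toy data; it asserts key uniqueness but never changes which rows match):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["a", "a", "b"], "y": [10, 11, 12]})

# "1:m" asserts that the left keys are unique; duplicated right keys are fine.
ok = pd.merge(left, right, on="key", validate="1:m")
print(len(ok))  # 3 rows: "a" on the left matches both right "a" rows

# "1:1" additionally asserts that the right keys are unique, so this raises
# pandas.errors.MergeError because "a" appears twice in `right`.
try:
    pd.merge(left, right, on="key", validate="1:1")
except pd.errors.MergeError as e:
    print("validation failed:", e)
```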
from linktransformer.
Interesting, thanks for the explanation and getting back to me so quickly! I'll describe the type of datasets I am using and how the 1:m would be really helpful for me.
In dataset one I have about 30,000 users each with a unique id assigned to them.
In dataset two I have about 50,000 users each with a unique id assigned to them.
The problem I'm running into is that I need to map one id from dataset one to multiple ids in dataset two, because in dataset two the company can change an individual's id more than once. The reason I need 1:m is that I want to map one id to multiple ids so I can create a universal id for that person. However, after the merge() function, the results you get back only equal the size of dataset one (len(dataset_one) = 30,000), i.e. a 1:1 mapping, so it's dropping potential mappings for me.
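To make the goal concrete, here is a hypothetical sketch (made-up ids and column names) of what I'd do with a 1:m match output, collapsing it into one universal id per person with a pandas groupby:

```python
import pandas as pd

# Hypothetical 1:m match output: one dataset-one id can map to
# several dataset-two ids (the company reissued u1's id once).
matches = pd.DataFrame({
    "id_df1": ["u1", "u1", "u2"],
    "id_df2": ["a7", "b3", "c9"],
})

# Collapse the matches so each dataset-one id carries every
# dataset-two id it was ever linked to - one "universal id" per person.
universal = (
    matches.groupby("id_df1")["id_df2"]
    .agg(list)
    .reset_index(name="all_ids")
)
# universal maps u1 -> ['a7', 'b3'] and u2 -> ['c9']
```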
I know I could switch the datasets around but I need to keep the 30,000 dataset as df1 and the 50,000 dataset as df2.
If you could provide the snippet I need for a 1:m mapping that would be really appreciated 🙏
Thanks once again!
Thanks! Just so I understand this clearly, what is preventing you from switching the two datasets around?
What the code is doing in words is "For each row in df1 (left columns) find a nearest neighbour from df2 (right columns)".
Hence, you only see rows = len(df1). You have multiple options within the current framework - tell me if these sound neat enough to you.
1. Use merge_knn() with a generous k (like 10). This will return the k nearest neighbours of each left row from the right df. Then choose a score threshold that makes sense (e.g. scores > 0.8 give reasonable matches) and filter the matches. This should be very close to what you need. k can be very large, but we recommend keeping it as low as possible for speed; Meta's FAISS library (which forms our retrieval backbone) recommends keeping it below 900. An example is in this notebook.
2. Switch the dataframes around and use merge(). This time the result will have len(df2) = 50,000 rows. You can again check which score makes sense (scores are symmetric since we use cosine similarity) and drop matches below the threshold similarity. I am not sure why you don't want to switch this around.
If you think about this a bit more conceptually, when we want a 1:m fuzzy merge we are saying: "I want several suitable matches from df2 for each row in df1". Either option 1 or 2 achieves that goal - does this make sense?
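The logic of option 1 can be sketched in plain NumPy (this is not linktransformer's API; the toy "embeddings", names, and the 0.8 threshold are illustrative):

```python
import numpy as np

def knn_matches(queries, corpus, k=2, threshold=0.8):
    """For each query row, return (query_idx, corpus_idx, score) for the
    k most cosine-similar corpus rows whose score clears the threshold."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = q @ c.T                             # cosine similarity matrix
    top_k = np.argsort(-scores, axis=1)[:, :k]   # best k corpus rows per query
    out = []
    for i, idxs in enumerate(top_k):
        for j in idxs:
            if scores[i, j] >= threshold:
                out.append((i, int(j), float(scores[i, j])))
    return out

# Toy "embeddings": the query is close to corpus rows 0 and 1 but not 2,
# so one left row keeps two right-side matches - the 1:m behaviour.
queries = np.array([[1.0, 0.0]])
corpus = np.array([[0.9, 0.1], [1.0, 0.05], [0.0, 1.0]])
matches = knn_matches(queries, corpus, k=3, threshold=0.8)
```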
Closing due to inactivity; solutions are provided above.