An R package for blocking records for record linkage / data deduplication based on approximate nearest neighbours algorithms.
Home Page: https://ncn-foreigners.github.io/blocking/
blocking's Issues
Support for RcppAnnoy
in:
List of backends for approximate nearest neighbours (ANN) search:
Ideas for improving performance:
- if the dataset used to build the index is large, the index should be created iteratively, because converting a sparse matrix to a dense matrix is a bottleneck
- if the query dataset is large, the same procedure should be applied.
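The chunked conversion described above could look roughly like the sketch below. It assumes the input is a sparse matrix from the Matrix package; the helper name `process_in_chunks` and its arguments are hypothetical and not part of the blocking package. Only `chunk_size` rows are held in dense form at any time, and `fun` stands in for whatever adds those rows to an index or queries it.

```r
library(Matrix)

# Hypothetical helper: apply `fun` to dense row-chunks of a sparse matrix,
# so the full matrix is never converted to dense form at once.
process_in_chunks <- function(x, chunk_size, fun) {
  starts <- seq(1, nrow(x), by = chunk_size)
  lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1, nrow(x))
    # Densify only the current chunk of rows
    fun(as.matrix(x[rows, , drop = FALSE]), rows)
  })
}

# Toy 5 x 3 sparse matrix
x <- sparseMatrix(i = c(1, 2, 3, 4, 5), j = c(1, 2, 3, 1, 2), x = 1,
                  dims = c(5, 3))

# Placeholder for "add these rows to the index": here it just reports the chunk size
chunks <- process_in_chunks(x, chunk_size = 2,
                            fun = function(dense, rows) dim(dense))
```

In a real backend, `fun` would insert each dense row into the ANN index under its row id from `rows`; the same loop can be reused for a large query set.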
Consider using:
- sparse matrix (Matrix) as an input for x, y
- bigmemory::big.matrix as an input for x, y
Quality metrics for blocking. This would require a new argument: true_block
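One possible shape for such metrics, as a sketch only: compare the assigned block vector with `true_block` over all unordered record pairs of a single dataset (the deduplication case). The function name `blocking_metrics` and the choice of pairwise precision, recall and reduction ratio are assumptions for illustration, not the package's design.

```r
# Hypothetical helper: pairwise quality of a blocking result against known blocks.
blocking_metrics <- function(block, true_block) {
  stopifnot(length(block) == length(true_block))
  pairs <- utils::combn(length(block), 2)   # all unordered record pairs
  same_pred <- block[pairs[1, ]] == block[pairs[2, ]]
  same_true <- true_block[pairs[1, ]] == true_block[pairs[2, ]]
  tp <- sum(same_pred & same_true)          # pairs correctly kept together
  c(
    precision       = tp / sum(same_pred),
    recall          = tp / sum(same_true),
    reduction_ratio = 1 - sum(same_pred) / ncol(pairs)
  )
}

# Toy data: 8 records, two predicted blocks, three true blocks
block      <- c(1, 1, 1, 1, 2, 2, 2, 2)
true_block <- c(1, 1, 1, 2, 3, 3, 3, 3)
m <- blocking_metrics(block, true_block)
```

Here 12 of the 28 possible pairs fall in the same predicted block and 9 of those are true matches, so precision is 0.75 and recall is 1. Enumerating all pairs is quadratic in the number of records, so this form suits only small evaluation sets.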
> pair_ann(x = df_example, on = "txt")
First data set: 8 records
Second data set: 8 records
Total number of pairs: 10 pairs
Blocking on: 'txt'
.x .y block
<int> <int> <num>
1: 1 2 1
2: 1 3 1
3: 1 4 1
4: 2 3 1
5: 2 4 1
6: 5 6 2
7: 5 7 2
8: 5 8 2
9: 6 7 2
10: 6 8 2
> pair_ann(x = setDT(df_example), on = "txt")
Error: j (the 2nd argument inside [...]) is a single symbol but column name 'on' is not found. If you intended to select columns using a variable in calling scope, please try DT[, ..on]. The .. prefix conveys one-level-up similar to a file system path.
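The error above is the usual data.table behaviour when a variable holding column names is passed as `j`: `x[, on]` looks for a column literally named `on`. A minimal sketch of a selection that behaves the same for a data.frame and a data.table is shown below. The helper `select_cols` is hypothetical and is not the package's actual fix; inside data.table code, `x[, ..on]` or `x[, on, with = FALSE]` are the equivalent idioms.

```r
# Hypothetical helper: select columns named in the character vector `on`,
# regardless of whether `x` is a data.frame or a data.table.
select_cols <- function(x, on) {
  x <- as.data.frame(x)      # drop data.table's special `[` semantics
  x[, on, drop = FALSE]      # standard data.frame column selection
}

# Toy input (not the data from the report above)
df_toy <- data.frame(txt = c("jan kowalski", "jan kowalsky"), id = 1:2)
on <- "txt"
res <- select_cols(df_toy, on)

# Optional check with data.table, only if it is installed
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df_toy)
  stopifnot(identical(select_cols(dt, on)$txt, df_toy$txt))
}
```

Converting with `as.data.frame()` copies the data, so for large inputs the data.table idioms may be preferable.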
Vignette with examples:
Code from the tinytest dir can be used
One may consider using the result from the package as an input for the reclin2 package. This may be done by creating an instance of the pairs class. See https://github.com/djvanderlaan/reclin2/blob/master/R/pair_minsim.R.
This would require a new function as.pairs to create an object of this type, or a function pair_ann modelled on reclin2::pair_minsim.
Support for mlpack in:
- lsh functions
- knn functions
Allow the user to specify a block vector before ANN blocking. For instance, a user may want to block records by gender / letter before applying ANN blocking.
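A rough sketch of how a user-supplied block vector could be combined with a second blocking step: records are first split by the vector, the blocking function is applied within each group, and the resulting block ids are offset so they stay unique across groups. The names `block_within` and `toy_block` are hypothetical; `toy_block` is a trivial stand-in for ANN blocking, used only to keep the example runnable.

```r
# Hypothetical helper: run `block_fun` separately within each level of `by`
# and return globally unique integer block ids.
block_within <- function(x, by, block_fun) {
  groups <- split(seq_along(x), by)
  out <- integer(length(x))
  offset <- 0L
  for (idx in groups) {
    # Local ids 1..k within the current group
    local_ids <- as.integer(factor(block_fun(x[idx])))
    out[idx] <- local_ids + offset
    offset <- offset + max(local_ids)
  }
  out
}

# Stand-in for ANN blocking: group strings by their first character
toy_block <- function(txt) substr(txt, 1, 1)

txt    <- c("anna", "ana", "bob", "emma", "bobby", "eva")
gender <- c("f", "f", "m", "f", "m", "f")
res <- block_within(txt, by = gender, block_fun = toy_block)
```

Because blocks are formed inside each gender group, records with different values of the block vector can never share a block id.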