Issues discussed in the blocking repository from ncn-foreigners

bug when `true_blocks` are provided

df_example <- data.frame(txt = c("jankowalski", "kowalskijan", "kowalskimjan",
"kowaljan", "montypython", "pythonmonty", "cyrkmontypython", "monty"))

testing <- blocking(x = df_example$txt,
                    deduplication = T)

testing2 <- blocking(x = df_example$txt,
                    deduplication = T,
                    true_blocks = testing$result[c(1,4,6), .(x,y,block)])

Error message

Error in modularity.igraph(graph, membership) : 
  At vendor/cigraph/src/community/modularity.c:132 : Membership vector size differs from number of vertices. Invalid value
> traceback()
4: modularity.igraph(graph, membership)
3: modularity(graph, membership)
2: igraph::make_clusters(eval_g1, membership = eval_blocks$block.x)
1: blocking(x = df_example$txt, deduplication = T, true_blocks = testing$result[c(1, 
       4, 6), .(x, y, block)])

blocking by blocking variables

Allow the user to specify a block vector before ANN blocking. For instance, the user may want to block records by gender or letter before applying ANN blocking.
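A minimal sketch of this idea in base R, assuming that blocking is run separately within each group of the user-supplied vector and that the block labels are then made globally unique. `block_within()` and `toy_blocker()` are hypothetical names and not part of the package; `toy_blocker()` only stands in for the ANN step.

```r
# Hypothetical helper: run a blocking function separately within each level
# of a user-supplied blocking vector and return globally unique block ids.
block_within <- function(x, by, blocker) {
  stopifnot(length(x) == length(by))
  groups <- split(seq_along(x), by, drop = TRUE)  # record indices per group
  block <- integer(length(x))
  offset <- 0L
  for (idx in groups) {
    local_ids <- as.integer(factor(blocker(x[idx])))  # ids 1..k within the group
    block[idx] <- local_ids + offset
    offset <- offset + max(local_ids)
  }
  block
}

# Stand-in for the ANN step (an assumption for illustration only):
# records sharing the same first letter fall into the same block.
toy_blocker <- function(txt) substr(txt, 1, 1)

txt    <- c("jankowalski", "kowalskijan", "montypython", "monty", "jankowalska")
gender <- c("m", "m", "m", "m", "f")
blocks <- block_within(txt, by = gender, blocker = toy_blocker)
```

Records 1 and 5 would share a block under `toy_blocker()` alone, but they end up in different blocks because they differ on the pre-blocking variable.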

add quality metrics

Add quality metrics for blocking. This would require a new argument: `true_block`.

standard evaluation of blocking procedures

Evaluation based on the standard metrics, as in klsh::confusion.from.blocking or in [blockr].
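One way such an evaluation could look, as a sketch in base R rather than the package's implementation: every unordered pair of records is compared once, and "same predicted block" is cross-tabulated against "same true block". `pairwise_confusion()` is a hypothetical name, and `utils::combn()` enumerates all pairs, so this is only practical for small inputs.

```r
# Hypothetical helper: compare every unordered pair of records once and
# cross-tabulate "same predicted block" against "same true block".
pairwise_confusion <- function(block, true_block) {
  stopifnot(length(block) == length(true_block))
  pairs <- utils::combn(length(block), 2)  # 2 x choose(n, 2) index matrix
  same_block <- block[pairs[1, ]] == block[pairs[2, ]]
  same_truth <- true_block[pairs[1, ]] == true_block[pairs[2, ]]
  # Fixing the factor levels keeps the table 2 x 2 even if a cell is empty.
  table(same_block = factor(same_block, levels = c(FALSE, TRUE)),
        same_truth = factor(same_truth, levels = c(FALSE, TRUE)))
}

pred  <- c(1, 1, 1, 2, 2, 2)  # predicted blocks (toy data)
truth <- c(1, 1, 2, 2, 3, 3)  # true blocks (toy data)
cm <- pairwise_confusion(pred, truth)

tp <- cm["TRUE", "TRUE"]   # same block, same truth
fp <- cm["TRUE", "FALSE"]  # same block, different truth
fn <- cm["FALSE", "TRUE"]  # different block, same truth
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
```

In this toy example the two off-diagonal cells differ (4 false positives against 1 false negative), which is the usual situation when predicted and true blocks disagree.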

error in the `confusion` construction

A reminder that there is an error in the construction of the confusion matrix.

> confusion
          same_truth
same_block   FALSE    TRUE
     FALSE 1926684      11
     TRUE       11     960

Cell (1, 2) is exactly the same as cell (2, 1).

Full `RcppAnnoy` support

Support for RcppAnnoy in:

deduplication (the same dataset, given by x)
record blocking with two datasets given by x, y
blocking from saved index
saving and reading index

Full `rnndescent` support

deduplication (the same dataset, given by x)
record blocking with two datasets given by x, y
blocking from saved index
saving and reading index

Release 0.2.0

Plans:

Support for rnndescent, as it will be released on CRAN.

Improvement of performance

Ideas for improving performance:

if the dataset used to build the index is large, the index should be created iteratively, because converting a sparse matrix to a dense one is a bottleneck
if the query dataset is large, the same procedure should be applied

Consider using:

sparse matrix (Matrix) as an input for x, y
bigmemory::big.matrix as an input for x, y
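The iterative idea above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: `process_in_chunks()` is a hypothetical helper, the callback is a placeholder for "add these rows to the index", and a small base matrix stands in for a large sparse one (the input only needs to support row subsetting and `as.matrix()`).

```r
# Hypothetical helper: apply `fun` to consecutive row chunks of `x`,
# densifying only one chunk at a time instead of the whole matrix.
process_in_chunks <- function(x, chunk_size, fun) {
  n <- nrow(x)
  starts <- seq(1L, n, by = chunk_size)
  lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1L, n)
    dense_chunk <- as.matrix(x[rows, , drop = FALSE])  # only this chunk is dense
    fun(dense_chunk, rows)
  })
}

# Toy input; a base matrix stands in for a large sparse matrix here.
x <- matrix(c(1, 0, 0, 2, 0, 3, 0, 0, 4, 0), nrow = 5, byrow = TRUE)

# Placeholder for "add these rows to the index": just record what was seen.
seen <- process_in_chunks(x, chunk_size = 2, function(chunk, rows) {
  list(rows = rows, nonzero = sum(chunk != 0))
})
```

Peak memory is then bounded by the chunk size rather than by the full dataset, at the cost of one call into the index per chunk.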

remove duplicated pairs from the result

df_example <- data.frame(txt = c("jankowalski", "kowalskijan", "kowalskimjan",
"kowaljan", "montypython", "pythonmonty", "cyrkmontypython", "monty"))

result <- blocking(x = df_example$txt,
                   ann = "hnsw",
                   control_ann = controls_ann(hnsw = list(M = 5, ef_c = 10, ef_s = 10)))

result$result

The result contains pairs that refer to the same records in reversed order (e.g. 1-2 and 2-1):

> result$result
       x     y block       dist
   <int> <int> <num>      <num>
1:     1     2     1 0.09999990
2:     2     1     1 0.09999990
3:     2     3     1 0.14188361
4:     2     4     1 0.28286278
5:     5     6     2 0.08333343
6:     5     7     2 0.13397467
7:     6     5     2 0.08333343
8:     6     8     2 0.27831221
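A possible fix, sketched in base R on the pairs shown above: order each pair so that the smaller index comes first, then drop repeats. This assumes the deduplication case, where x and y index the same dataset; it is not taken from the package source.

```r
# Pairs copied from the output above (dist omitted for brevity).
res <- data.frame(x = c(1, 2, 2, 2, 5, 5, 6, 6),
                  y = c(2, 1, 3, 4, 6, 7, 5, 8),
                  block = c(1, 1, 1, 1, 2, 2, 2, 2))

# Order each pair so that the smaller index comes first, then keep
# only the first occurrence of every unordered pair.
lo <- pmin(res$x, res$y)
hi <- pmax(res$x, res$y)
keep <- !duplicated(data.frame(lo, hi))
dedup <- transform(res[keep, ], x = lo[keep], y = hi[keep])
```

The two reversed pairs (2-1 and 6-5) are dropped, leaving six unique pairs.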

Full `RcppHNSW` support

Support for RcppHNSW in:

deduplication (the same dataset, given by x)
record blocking with two datasets given by x, y
blocking from saved index
saving and reading index

One may consider using the result from this package as input for the reclin2 package. This may be done by creating an instance of the pairs class. See https://github.com/djvanderlaan/reclin2/blob/master/R/pair_minsim.R.

This would require either a new function as.pairs that creates an object of this type, or a function pair_ann modelled on reclin2::pair_minsim.

Approximate nearest neighbours algorithms

List of backends for approximate nearest neighbours (ANN) search:

vignette with basic examples

Vignettes:

Full `mlpack` support

Support for mlpack in:

lsh functions

deduplication (the same dataset, given by x)
record blocking with two datasets given by x, y
blocking from saved index
saving and reading index

knn functions

deduplication (the same dataset, given by x)
record blocking with two datasets given by x, y
blocking from saved index
saving and reading index

`pair_ann` does not work with `data.table`

> pair_ann(x = df_example, on = "txt")
  First data set:  8 records
  Second data set: 8 records
  Total number of pairs: 10 pairs
  Blocking on: 'txt'

       .x    .y block
    <int> <int> <num>
 1:     1     2     1
 2:     1     3     1
 3:     1     4     1
 4:     2     3     1
 5:     2     4     1
 6:     5     6     2
 7:     5     7     2
 8:     5     8     2
 9:     6     7     2
10:     6     8     2
> pair_ann(x = setDT(df_example), on = "txt")
Error: j (the 2nd argument inside [...]) is a single symbol but column name 'on' is not found. If you intended to select columns using a variable in calling scope, please try DT[, ..on]. The .. prefix conveys one-level-up similar to a file system path.
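The error message suggests that a column is being selected with something like `x[, on]`, which a data.table interprets as a column literally named `on`. This is a guess from the message, not a confirmed diagnosis of the package source. Under that assumption, list-style extraction with `[[` would behave the same for both classes; `get_on_column()` below is a hypothetical helper for illustration.

```r
# Hypothetical helper: extract the column named by the string `on`.
# `[[` is list-style extraction, so it does not depend on how `[` treats
# its second argument and works for a data.frame and a data.table alike.
get_on_column <- function(x, on) {
  stopifnot(is.character(on), length(on) == 1L, on %in% names(x))
  x[[on]]
}

df_example <- data.frame(txt = c("jankowalski", "kowalskijan", "montypython"))
txt <- get_on_column(df_example, on = "txt")

# The same call on a data.table, run only if the package is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt_example <- data.table::as.data.table(df_example)
  stopifnot(identical(get_on_column(dt_example, on = "txt"), txt))
}
```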
