blocking's Issues

bug when `true_blocks` are provided

df_example <- data.frame(txt = c("jankowalski", "kowalskijan", "kowalskimjan",
"kowaljan", "montypython", "pythonmonty", "cyrkmontypython", "monty"))

testing <- blocking(x = df_example$txt,
                    deduplication = T)

testing2 <- blocking(x = df_example$txt,
                    deduplication = T,
                    true_blocks = testing$result[c(1,4,6), .(x,y,block)])

Error message

Error in modularity.igraph(graph, membership) : 
  At vendor/cigraph/src/community/modularity.c:132 : Membership vector size differs from number of vertices. Invalid value
> traceback()
4: modularity.igraph(graph, membership)
3: modularity(graph, membership)
2: igraph::make_clusters(eval_g1, membership = eval_blocks$block.x)
1: blocking(x = df_example$txt, deduplication = T, true_blocks = testing$result[c(1, 
       4, 6), .(x, y, block)])

blocking by blocking variables

Allow the user to specify a blocking vector before ANN blocking. For instance, the user may want to block records by gender or by letter before applying ANN blocking.
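A minimal sketch of the idea in base R, assuming the blocking variable is a column of the input data frame. The `gender` column and the `per_group` helper are hypothetical and only illustrate how record indices could be split before the ANN step; this is not the package's implementation.

```r
# Hypothetical data: a text column to block on and a pre-blocking variable.
df <- data.frame(
  txt = c("jankowalski", "kowalskijan", "montypython", "pythonmonty"),
  gender = c("m", "m", "f", "f")
)

# Record indices per level of the blocking variable.
groups <- split(seq_len(nrow(df)), df$gender)

# Placeholder for the per-group ANN step. In practice each group could be
# passed to blocking(x = df$txt[idx], deduplication = TRUE), and the local
# indices mapped back to the original rows through idx.
per_group <- lapply(groups, function(idx) {
  data.frame(row = idx, txt = df$txt[idx])
})
```

Keeping the original row numbers in `idx` matters because any ANN search run inside a group reports positions relative to that group only.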

add quality metrics

Add quality metrics for blocking. This would require a new argument: true_block.

error in the `confusion` construction

A reminder that the confusion matrix is constructed incorrectly.

> confusion
          same_truth
same_block   FALSE    TRUE
     FALSE 1926684      11
     TRUE       11     960

Cell (1, 2) is exactly the same as cell (2, 1).
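For cross-checking, a pairwise confusion matrix can be built independently in base R. This is a sketch under the assumption that one predicted and one true block label is available per record; `block_pred` and `block_true` are made-up vectors, and this is not how the package computes `confusion`.

```r
# Assumed inputs: one block label per record, predicted and true.
block_pred <- c(1, 1, 1, 1, 2, 3)
block_true <- c(1, 1, 2, 2, 2, 3)

# All unordered pairs of records (i < j), one column per pair.
pairs <- combn(length(block_pred), 2)

# For each pair: do both records share a predicted block / a true block?
same_block <- block_pred[pairs[1, ]] == block_pred[pairs[2, ]]
same_truth <- block_true[pairs[1, ]] == block_true[pairs[2, ]]

# Fix the factor levels so that all four cells are always present.
confusion <- table(
  same_block = factor(same_block, levels = c(FALSE, TRUE)),
  same_truth = factor(same_truth, levels = c(FALSE, TRUE))
)
confusion
```

In this toy example the two off-diagonal cells differ (4 pairs share a block but not the truth, 2 pairs share the truth but not a block), which is what one would generally expect unless the counts coincide by chance.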

Full `RcppAnnoy` support

Support for RcppAnnoy in:

  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

Full `rnndescent` support

  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

Release 0.2.0

Plans:

  • Support for rnndescent as it will be shipped to CRAN.

Improvement of performance

Ideas for improving performance:

  • if the dataset to be indexed is large, the index should be built iteratively, because converting a sparse matrix to a dense one is a bottleneck
  • if the query dataset is large, the same procedure should be applied.

Consider using:

  • sparse matrix (Matrix) as an input for x, y
  • bigmemory::big.matrix as an input for x, y
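The iterative idea could look roughly like the sketch below, which converts a sparse matrix to dense form one chunk of rows at a time using the Matrix package. The chunk size and the step that would add rows to an ANN index are assumptions for illustration; no index is actually built here.

```r
library(Matrix)

set.seed(1)
# Toy sparse input standing in for a large document-term matrix.
x_sparse <- rsparsematrix(nrow = 10, ncol = 5, density = 0.3)

chunk_size <- 4  # hypothetical; would be tuned to available memory
starts <- seq(1, nrow(x_sparse), by = chunk_size)

n_processed <- 0
for (s in starts) {
  rows <- s:min(s + chunk_size - 1, nrow(x_sparse))
  # Only this chunk is held in dense form at any time.
  dense_chunk <- as.matrix(x_sparse[rows, , drop = FALSE])
  # Hypothetical step: add the rows of dense_chunk to the ANN index here.
  n_processed <- n_processed + nrow(dense_chunk)
}
```

Peak memory for the dense copy is then bounded by `chunk_size * ncol(x_sparse)` values rather than by the full matrix.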

remove duplicated pairs from the result

df_example <- data.frame(txt = c("jankowalski", "kowalskijan", "kowalskimjan",
"kowaljan", "montypython", "pythonmonty", "cyrkmontypython", "monty"))

result <- blocking(x = df_example$txt,
                   ann = "hnsw",
                   control_ann = controls_ann(hnsw = list(M = 5, ef_c = 10, ef_s = 10)))

result$result

The result contains pairs that refer to the same records in both orders (e.g. 1-2 and 2-1).

> result$result
       x     y block       dist
   <int> <int> <num>      <num>
1:     1     2     1 0.09999990
2:     2     1     1 0.09999990
3:     2     3     1 0.14188361
4:     2     4     1 0.28286278
5:     5     6     2 0.08333343
6:     5     7     2 0.13397467
7:     6     5     2 0.08333343
8:     6     8     2 0.27831221
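One possible way to drop the mirrored pairs, sketched in base R on the x, y and block columns from the output above (the dist column is left out for brevity). This is an illustration of the idea, not the fix applied in the package.

```r
# Pairs copied from the printed result above.
res <- data.frame(
  x = c(1, 2, 2, 2, 5, 5, 6, 6),
  y = c(2, 1, 3, 4, 6, 7, 5, 8),
  block = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# Put the smaller index first so that (2, 1) and (1, 2) become identical.
lo <- pmin(res$x, res$y)
hi <- pmax(res$x, res$y)

# Keep only the first occurrence of each unordered pair.
res_unique <- res[!duplicated(data.frame(lo, hi)), ]
res_unique
```

Rows 2 and 7 of the original output are removed, leaving six distinct pairs.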

Full `RcppHNSW` support

Support for RcppHNSW in:

  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

Full `mlpack` support

Support for mlpack in:

  1. lsh functions
  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index
  2. knn functions
  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

`pair_ann` does not work with `data.table`

> pair_ann(x = df_example, on = "txt")
  First data set:  8 records
  Second data set: 8 records
  Total number of pairs: 10 pairs
  Blocking on: 'txt'

       .x    .y block
    <int> <int> <num>
 1:     1     2     1
 2:     1     3     1
 3:     1     4     1
 4:     2     3     1
 5:     2     4     1
 6:     5     6     2
 7:     5     7     2
 8:     5     8     2
 9:     6     7     2
10:     6     8     2
> pair_ann(x = setDT(df_example), on = "txt")
Error: j (the 2nd argument inside [...]) is a single symbol but column name 'on' is not found. If you intended to select columns using a variable in calling scope, please try DT[, ..on]. The .. prefix conveys one-level-up similar to a file system path.
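The error suggests that a column is selected with a bare variable inside `[`, which `data.table` interprets as a column name. The sketch below shows three selection forms that work when the column name is held in a variable; where exactly `pair_ann` does this is an assumption, since its source is not shown here.

```r
library(data.table)

dt <- data.table(txt = c("jankowalski", "kowalskijan"), id = 1:2)
on <- "txt"

# dt[, on] fails for a data.table because `on` is looked up as a column.
# Alternatives that select the column named by the variable:
v1 <- dt[[on]]          # works for both data.frame and data.table
v2 <- dt[, ..on][[1]]   # the `..` prefix looks up `on` in the calling scope
v3 <- dt[, get(on)]     # evaluates the column whose name is stored in `on`
```

The `[[` form is the most portable of the three, since it behaves the same way for a plain `data.frame`.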
