blocking's People

Contributors

berenz

blocking's Issues

Full `RcppAnnoy` support

Support for RcppAnnoy in:

  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

Full `RcppHNSW` support

Support for RcppHNSW in:

  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

Release 0.2.0

Plans:

  • Support for rnndescent, as it will be released on CRAN.

Improvement of performance

Ideas for improving performance:

  • if the dataset used to build the index is large, the index should be built iteratively (in chunks), because converting the sparse matrix to a dense matrix is a bottleneck
  • if the query dataset is large, the same chunked procedure should be applied.
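
A minimal sketch of the chunked approach, assuming the input is a sparse matrix from the Matrix package. The helper `process_in_chunks()` and its callback are hypothetical and not part of blocking; the point is only that each chunk is densified on its own, so the full dense matrix never has to exist in memory.

```r
library(Matrix)

# Hypothetical helper (not part of the blocking package): convert a sparse
# matrix to dense form a few rows at a time and hand each chunk to `fun`.
process_in_chunks <- function(x_sparse, chunk_size, fun) {
  n <- nrow(x_sparse)
  for (s in seq(1, n, by = chunk_size)) {
    rows <- s:min(s + chunk_size - 1, n)
    # Only this chunk is densified, never the whole matrix.
    dense_chunk <- as.matrix(x_sparse[rows, , drop = FALSE])
    fun(rows, dense_chunk)
  }
  invisible(n)
}

set.seed(1)
x_sparse <- rsparsematrix(nrow = 10, ncol = 5, density = 0.3)

# Record which rows were visited; a real callback would instead add the
# rows of `chunk` to an ANN index (or query the index with them).
seen <- integer(0)
process_in_chunks(x_sparse, chunk_size = 4, fun = function(rows, chunk) {
  seen <<- c(seen, rows)
})
```

The same loop could serve both index construction and querying; only the callback would differ.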

Consider using:

  • sparse matrix (Matrix) as an input for x, y
  • bigmemory::big.matrix as an input for x, y

add quality metrics

Quality metrics for blocking. This would require a new argument: true_block
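
The issue does not say which metrics are intended. As one possibility, the sketch below assumes pairwise precision and recall, computed by comparing the estimated block of each record with the known `true_block`. The function name `pair_metrics()` is hypothetical, and enumerating all pairs with `combn()` is only practical for small examples.

```r
# Hypothetical helper (not part of the blocking package): pairwise precision
# and recall of an estimated block vector against known true blocks.
pair_metrics <- function(block, true_block) {
  stopifnot(length(block) == length(true_block))
  idx <- utils::combn(length(block), 2)  # all record pairs; O(n^2)
  same_pred <- block[idx[1, ]] == block[idx[2, ]]
  same_true <- true_block[idx[1, ]] == true_block[idx[2, ]]
  tp <- sum(same_pred & same_true)   # pairs correctly placed together
  fp <- sum(same_pred & !same_true)  # pairs wrongly placed together
  fn <- sum(!same_pred & same_true)  # true pairs split across blocks
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

block      <- c(1, 1, 1, 1, 2, 2, 2, 2)
true_block <- c(1, 1, 1, 2, 2, 2, 3, 3)
m <- pair_metrics(block, true_block)
print(m)
```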

`pair_ann` does not work with `data.table`

> pair_ann(x = df_example, on = "txt")
  First data set:  8 records
  Second data set: 8 records
  Total number of pairs: 10 pairs
  Blocking on: 'txt'

       .x    .y block
    <int> <int> <num>
 1:     1     2     1
 2:     1     3     1
 3:     1     4     1
 4:     2     3     1
 5:     2     4     1
 6:     5     6     2
 7:     5     7     2
 8:     5     8     2
 9:     6     7     2
10:     6     8     2
> pair_ann(x = setDT(df_example), on = "txt")
Error: j (the 2nd argument inside [...]) is a single symbol but column name 'on' is not found. If you intended to select columns using a variable in calling scope, please try DT[, ..on]. The .. prefix conveys one-level-up similar to a file system path.
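
The error message suggests that a column is being selected with a variable in `j`, which `data.table` interprets as a column literally named `on`. The sketch below reproduces that behaviour outside the package and shows selection forms that work for a `data.table`; the toy data and variable names are assumptions, and the actual fix inside `pair_ann` may differ.

```r
library(data.table)

df_example <- data.frame(txt = c("a", "b"), id = 1:2)
on <- "txt"
dt <- as.data.table(df_example)

# With a data.frame, df_example[, on] selects the column named in `on`.
# With a data.table, dt[, on] looks for a column literally called "on".
res <- tryCatch(dt[, on], error = function(e) conditionMessage(e))

# Selection forms that use the value of `on`:
col_vec <- dt[[on]]                # plain vector
col_dt1 <- dt[, ..on]              # one-column data.table
col_dt2 <- dt[, on, with = FALSE]  # same result, older syntax
```

Note that `setDT()` converts its argument by reference, so after the failing call `df_example` itself is a `data.table`.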

vignette with basic examples

Vignette with examples:

  • deduplication with character vector
  • deduplication with matrix
  • record linkage with character vector
  • record linkage with matrix
  • known clusters

Code from the tinytest directory can be used.

Full `rnndescent` support

  • deduplication (the same dataset, given by x)
  • record blocking with two datasets given by x, y
  • blocking from saved index
  • saving and reading index

Full `mlpack` support

Support for mlpack in:

  1. lsh functions
       • deduplication (the same dataset, given by x)
       • record blocking with two datasets given by x, y
       • blocking from saved index
       • saving and reading index
  2. knn functions
       • deduplication (the same dataset, given by x)
       • record blocking with two datasets given by x, y
       • blocking from saved index
       • saving and reading index

blocking by blocking variables

Allow the user to specify a blocking vector that is applied before ANN blocking. For instance, the user may want to block records by gender or letter first and then apply ANN blocking within each group.
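
A rough sketch of how this two-stage blocking could work, under the assumption that ANN blocking is run separately within each group and the resulting block ids are offset so they stay unique across groups. Both `block_within_groups()` and `toy_ann()` are hypothetical; `toy_ann()` is only a stand-in for a real ANN step and simply groups strings by their first letter.

```r
# Hypothetical helper (not part of the blocking package): run a blocking
# function within each level of `group`, keeping block ids globally unique.
block_within_groups <- function(x, group, ann_fun) {
  stopifnot(length(x) == length(group))
  result <- integer(length(x))
  offset <- 0L
  for (idx in split(seq_along(x), group)) {
    local_blocks <- ann_fun(x[idx])        # block ids 1..k within the group
    result[idx] <- local_blocks + offset   # shift to avoid id collisions
    offset <- offset + max(local_blocks)
  }
  result
}

# Stand-in for ANN blocking: records sharing a first letter form a block.
toy_ann <- function(v) as.integer(factor(substr(v, 1, 1)))

txt    <- c("anna", "ana", "bob", "anne", "bobby", "carl")
gender <- c("f", "f", "m", "f", "m", "m")
blocks <- block_within_groups(txt, gender, toy_ann)
print(blocks)
```

Records from different groups can never share a block in this design, which is the intended effect of specifying the blocking vector up front.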
