Code Monkey home page Code Monkey logo

cvm-lib's Introduction

CVM Library

Estimate the number of distinct values in a set using the simple and space-efficient CVM algorithm.

Version JSR Maintenance License codecov npm bundle size

Getting Started

Install

NPM:

npm install cvm-lib

Yarn:

yarn add cvm-lib

JSR:

jsr add @rojas/cvm

Examples

See the examples/ directory for all examples.

Hamlet

Estimate unique words in Shakespeare's Hamlet:

node ./examples/hamlet/index.js
  • Total words: 31991
  • CVM capacity: 2161
  • Expected uniques: 4762 ± 10.00%
  • Estimated uniques: 4728 (-0.71%)

1M

Estimate unique integers among 1 million random integers.

node ./examples/hamlet/index.js
  • Total values: 1000000
  • CVM capacity: 10631
  • Expected uniques: 994384 ± 5.00%
  • Estimated uniques: 996480 (0.21%)

API

Functions

calculateCapacity(n, epsilon?, delta?)

Calculates the space required to estimate the number of distinct values in a set with a given accuracy and confidence.

  • n: The total number of values in the set, or an estimate if unknown. Must be a positive number.
  • epsilon (optional): An estimate's relative error. Controls accuracy. Must be between 0 and 1. Defaults to 0.05.
  • delta (optional): The probability an estimate is not accurate. Controls confidence. Must be between 0 and 1. Defaults to 0.01.

Classes

Estimator<T>

Estimates the number of distinct values in a set using the CVM algorithm.

  • Constructors

    • new (capacity): Create an instance with a given capacity. Must be a positive integer.
    • new (config): Create an instance using a config object.
  • Properties

    • capacity: Gets the maximum number of samples in memory.
    • randomFn: Gets or sets the random number generator function (e.g. Math.random).
    • sampleRate Gets the base sample rate (e.g. 0.5).
    • size: Gets the number of samples in memory.
  • Methods

    • add(value): Adds a value.
    • clear(): Clears/resets the instance.
    • estimate(): Gets the estimated number of distinct values.

Interfaces

EstimatorConfig<T>

A configuration object used to create Estimator instances.

  • capacity: The maximum number of samples in memory. Must be a positive integer.
  • randomFn (optional): The random number generator function. Should return random or pseudorandom values between 0 and 1.
  • sampleRate (optional): The sampling rate for managing samples. Must be between 0 and 1.
    • Note: Custom values may negatively affect accuracy. In general, the further from 0.5, the more it's affected. If capacity was calculated via calculateCapacity, expected accuracy / confidence may be invalidated.
  • storage (optional): An object that implements SampleSet for storing samples.

SampleSet<T>

Represents a generic set for storing samples.

  • size: The number of values in the set.
  • add(value): Adds a value to the set.
  • clear(): Clears all values from the set.
  • delete(value): Removes a specified value from the set.
  • [Symbol.iterator](): Iterates through the set's values.

Community and Support

Contributions are welcome!

  • Questions / Dicussions: Please contact us via GitHub discussions.

  • Bug Reports: Please use the GitHub issue tracker to report any bugs. Include a detailed description and any relevant code snippets or logs.

  • Feature Requests: Please submit feature requests as issues, clearly describing the feature and its potential benefits.

  • Pull Requests: Please ensure your code adheres to the existing style of the project and include any necessary tests and documentation.

For more information, check out the contributor's guide.

Build

  1. Clone the project from github
git clone [email protected]:havelessbemore/cvm-lib.git
cd cvm-lib
  1. Install dependencies
npm install
  1. Build the project
npm run build

This will output ECMAScript (.mjs) and CommonJS (.cjs) modules in the dist/ directory.

Format

To run the code linter:

npm run lint

To automatically fix linting issues, run:

npm run format

Test

To run tests:

npm test

To run tests with a coverage report:

npm run test:coverage

A coverage report is generated at ./coverage/index.html.

References

  1. Source paper: Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2023). Distinct Elements in Streams: An Algorithm for the (Text) Book
  2. Notes by Donald Knuth: Knuth, D. E. (2023). The CVM Algorithm for Estimating Distinct Elements in Streams. Stanford Computer Science Department.
  3. Wikipedia: CVM Algorithm.
  4. High-level summary: Nadis, S. (2024, May 16). Computer Scientists Invent an Efficient New Way to Count. Quanta Magazine..

cvm-lib's People

Contributors

havelessbemore avatar dependabot[bot] avatar

Stargazers

Cristian avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.