
bedrock-tokenization's Introduction

bedrock-tokenization

A Bedrock module for tokenizing identifiers and storing encrypted related data

bedrock-tokenization's People

Contributors

dlongley, mandyvenables, mattcollier, dmitrizagidulin, jsassassin, aljones15, gannan08, davidlehn, msporny

Stargazers

Phuong D. Nguyen, Eduardo Chongkan

Watchers

James Cloos, Eduardo Chongkan

bedrock-tokenization's Issues

Determine if `updated` can ever be false when a registration record is found (but not changed)

See: https://github.com/digitalbazaar/bedrock-tokenization/blob/v21.1.0/lib/documents.js#L320

From testing, it appears that `updated` will always be set to `true` whether or not the record in the database changed, because it is based on `lastErrorObject.n` being set to `1` (this is how `updated` is implemented internally). If `updated` is always `true`, we can simplify at least the comment exposition that follows the above link in the function (and remove the FIXME) -- it may not change the code.
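A minimal sketch of that hypothesis, assuming the Node.js MongoDB driver's findOneAndUpdate result shape (the query and field names are illustrative):

// hypothetical sketch: `lastErrorObject.n` reflects the number of
// *matched* documents, so `updated` would be true whenever a matching
// record is found -- even if nothing in it actually changed
const result = await collection.findOneAndUpdate(
  {'registration.externalIdHash': externalIdHash},
  {$set: {'meta.updated': Date.now()}});
const updated = result.lastErrorObject.n === 1;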

Find another AES-KW implementation that is not limited to 32 bytes

Current AES-KW implementations are limited to a maximum of 32 bytes (by telling the API that it is wrapping a 256-bit key). AES-KW itself has no such limitation and our use of it here could be for larger sizes. We should find an implementation that can cover larger inputs.
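For reference, a minimal sketch of where the ceiling comes from, using Node's WebCrypto API as one common AES-KW implementation (the algorithm chosen for the payload key is illustrative):

import {webcrypto} from 'node:crypto';
const {subtle} = webcrypto;

// key-encryption key used for wrapping
const kek = await subtle.generateKey(
  {name: 'AES-KW', length: 256}, false, ['wrapKey', 'unwrapKey']);

// the content to wrap must itself be imported as a key, and the
// largest raw symmetric key WebCrypto accepts is 256 bits (32 bytes)
const content = await subtle.importKey(
  'raw', webcrypto.getRandomValues(new Uint8Array(32)),
  {name: 'AES-CBC'}, true, ['encrypt']);
const wrapped = await subtle.wrapKey('raw', content, kek, 'AES-KW');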

Consider optimizing away extra query after refreshing (updating) registration record

If a registration record is "refreshed" (i.e., there was a matching database record to update for a matching document registration), we currently run an additional mongodb query to pull out the record so it can be returned. Alternatively, we could use `findOneAndUpdate` as a potential optimization in that case. The call includes the shard key, so it should be a viable option. It is unknown what the locking characteristics or performance gains would be if we implemented this.
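A hypothetical sketch of that optimization (query and field names are illustrative):

// refresh the record and read it back in a single round trip; the
// filter includes the shard key (externalIdHash), so the query can be
// routed to a single shard
const record = await collection.findOneAndUpdate({
  'registration.externalIdHash': externalIdHash,
  'registration.documentHash': documentHash
}, {
  $set: {'meta.updated': Date.now(), 'registration.expires': expires}
}, {returnDocument: 'after'});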

Investigate shard configuration in code

This library is generally designed to be deployed on a sharded mongodb cluster. Investigate whether or not it would be a good idea and technically possible to define shards in code here.
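For example, shard configuration could hypothetically be issued from code via admin commands, assuming `client` is a connected MongoClient with sufficient privileges (the database, collection, and shard key names are illustrative):

const admin = client.db('admin');
// sharding must be enabled on the database first
await admin.command({enableSharding: 'bedrock_tokenization'});
// then each collection is assigned a shard key
await admin.command({
  shardCollection: 'bedrock_tokenization.tokenization-registration',
  key: {'registration.externalIdHash': 'hashed'}
});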

Investigate whether token batch invalidation count needs to be stored and checked on unpinned batches

Instead of marking unpinned batches as invalid, we may need to store the token batch invalidation count on them and check to ensure it has not changed with respect to the entity. This should be an internal change only, but it will affect what is stored in the database (not indexes, just extra fields). It should not require any additional lookups, as the entity record is already fetched for unpinned batches.

It can potentially be done in a backwards compatible way. It may also eliminate the need to update all token batches to mark them as invalid -- as determining whether or not an unpinned batch is invalid would require comparing the entity record's invalidation count against what is stored in the unpinned batch (instead of checking an invalid field -- which we can probably remove).
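A hypothetical sketch of the change (field and function names are illustrative):

// at creation: copy the entity's current invalidation count into the
// unpinned batch record instead of relying on an `invalid` flag that
// must later be flipped on every batch
const batch = {
  id,
  internalId,
  batchInvalidationCount: entity.batchInvalidationCount
  // ...other fields...
};

// at use: the entity record is already fetched for unpinned batches,
// so this comparison adds no extra lookups
function isInvalid({batch, entity}) {
  return batch.batchInvalidationCount !== entity.batchInvalidationCount;
}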

Potential optimization of open batch/create token batch

In order to increase the number of writes to the database that can happen in parallel, the following changes should be considered:

  1. Concurrently writing a token batch record and an open token batch record (instead of doing these in serial).

There's a note here (which, incidentally, contains the typo "This this") explaining why the open token batch must only be written after the token batch; however, the rationale is primarily about data cleanliness and efficiency as opposed to correctness:

/* Note: An `openTokenBatch` record must only be added *after* a record
is inserted into `tokenBatch` collection *first*. This this ensures that if
we find a record in `openTokenBatch` that references a record that does not
exist in `tokenBatch`, it can *only* be because the `tokenBatch` record has
auto-expired, which means we can safely delete the reference to it in
`openTokenBatch`. Here, we also ignore the returned record and, if an error
occurs, we accept that the just created record in `tokenBatch` will not ever
be used, but will eventually expire and get cleaned up. */
await _createOpenBatchReference({internalId, batchId: id});

If we write the records in parallel, it will be possible for an open token batch to reference a token batch that is not expired but is presently being written to the database; however, this is likely to be a rare occurrence. The common behavior (99%+ of cases) will be that a call to create tokens only happens once for a given internalId. Even when multiple concurrent calls do happen, only in the case where the open batch record is created first and the token batch record has not yet been written when a subsequent call (one made after reading the open batch record) tries to find it would the software behave as if the token batch had expired and been removed.

In that case, the open batch record would be removed and the token batch would never be filled (even if it was, in fact, still open). This would result in some extra token batch records in the database -- a minor inefficiency. These would eventually expire and be removed from the database. An attacker could exploit this -- but not in a way that would be significantly more effective than simply generating a lot of filled token batches instead.
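A hypothetical sketch of the proposed parallel write (`_insertBatch` is an illustrative name; `_createOpenBatchReference` appears in the note above):

// proposed: write the token batch record and the open batch reference
// concurrently, accepting the rare race described above
await Promise.all([
  _insertBatch({batch}),
  _createOpenBatchReference({internalId, batchId: batch.id})
]);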

Add validation in bitstring.js

The `decode` method, for instance, does not validate its parameters; `undefined` will pass through the function successfully.

assert-plus should be used.
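A minimal sketch of the kind of validation intended, using assert-plus (the parameter name is illustrative):

import assert from 'assert-plus';

export function decode({encoded} = {}) {
  // throws a descriptive error if `encoded` is missing or not a string
  assert.string(encoded, 'encoded');
  // ...existing decode logic...
}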

Improve document registration to ensure entity records are shared for the same external ID

The entity and document registration records are optimistically inserted concurrently to achieve important performance gains.

However, it's possible for multiple entity records (with different internal IDs) to be created for two different registration records that have the same externalIdHash. In fact, with the current implementation, this is the expected outcome.

On document registration, a check is performed to find a registration record that matches an externalIdHash and documentHash, i.e., a record that would match the exact document being registered. However, if a document that matches the same externalIdHash but has a different documentHash were to be registered, no matching registration record would be found and, therefore, the entity associated with any existing registration (and externalIdHash) would not be found and reused.

In order to properly support multiple documents with the same externalIdHash, the following changes are needed:

  1. The initial call to find a registration record should be changed to also find any registration record matching externalIdHash. Note that it should find just the earliest of these, as it provides the "canonical" internalId. However, we would still want to also find the specific registration record that matches the documentHash (if this is not the same record), as it is required to be returned by the API. The most efficient way to do this might be to just make two concurrent calls that may happen to return the same record (see the sketch after this list).
  2. The call that creates a registration record may create one with the "wrong" internalId because the entity record with the "right" internalId may be inserted concurrently -- for degenerate cases where two documents are registered at the same time. Any registration record that is inserted must have its internalId checked against the earliest registration record's internalId value -- and be updated if it does not match. Note: Using the created timestamp on records may not be sufficient to make this distinction -- as multiple records could use the same value and then a decision about which internalId to use will need to be made ... in the face of asynchronous / competing processes. The best way to resolve this is TBD.
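A hypothetical sketch of the two concurrent calls from item 1 (query and field names are illustrative):

const [[earliest], exact] = await Promise.all([
  // the earliest registration for this external ID; provides the
  // "canonical" internalId
  collection.find({'registration.externalIdHash': externalIdHash})
    .sort({'meta.created': 1}).limit(1).toArray(),
  // the registration matching the specific document, if any; this may
  // be the same record as `earliest`
  collection.findOne({
    'registration.externalIdHash': externalIdHash,
    'registration.documentHash': documentHash
  })
]);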

Once these changes are implemented, the impact on any existing system with multiple registration records will need to be analyzed (i.e., whether such a system should be upgraded to support the improved constraints).

Make `kmsBaseUrl` used in `kms.js` `createKeystore` a configuration option

bedrock-tokenization should be configurable to use a particular WebKMS system via a config var for kmsBaseUrl. Right now, kmsBaseUrl is based on the assumption that the WebKMS system will be accessible via the same domain where bedrock-tokenization is installed -- which also presumes that bedrock-tokenization runs a server, which may not be true (but likely is). Notably, the WebKMS system should be architecturally split from whatever system runs bedrock-tokenization, even if the domains are shared externally.

In short, we should make kmsBaseUrl a configuration option and require it to be set.
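A hypothetical sketch of the option, following the usual Bedrock config pattern (the import path and config namespace are assumptions):

import {config} from '@bedrock/core';

const cfg = config.tokenization = {};
// no same-domain default: deployments must set this explicitly to
// point at their WebKMS system
cfg.kmsBaseUrl = null;

Startup code could then assert that the value has been set before any keystore is created.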

Explore solutions to preserving usage of document registrations when HMACs rotate

Currently, when a tokenizer rotates (HMAC rotation) and a new one becomes current, existing document registrations will no longer be reused when generating new tokens. In other words, a "new identity" will be created for every document registration that occurs after the rotation. There are a number of ways to address this for workflows where it matters. At least two ways include:

  1. Pre-duplicating all document registrations using the new tokenizer prior to making it current.
  2. Hashing for every possibly "still in use" tokenizer -- instead of just the current one -- when looking up existing records (see the sketch after this list). When an existing record is found, it could be updated to use the next tokenizer. Eventually, the "previous" tokenizer would drop off. See bedrock-resource-restriction for some patterns to consider.
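A hypothetical sketch of option 2 (function and field names are illustrative):

// hash the external ID with every tokenizer that may still be in use
const tokenizers = await getTokenizersInUse(); // illustrative helper
const hashes = await Promise.all(tokenizers.map(
  ({hmac}) => hmacString({hmac, value: externalId})));

// a match against any of the hashes finds the existing registration;
// the record could then be updated to use the current tokenizer's hash
const record = await collection.findOne(
  {'registration.externalIdHash': {$in: hashes}});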

Potential optimization of document registration and token batch creation

In order to increase the number of writes to the database that can happen in parallel, the following changes should be considered:

  1. Splitting out the internalId generation code as a standalone call and making it so that the internalId must be passed into documents.register.
  2. Adding an additional query to check for an existing document registration prior to performing the registration itself. This would fetch an internalId to reuse (from the existing record) or it would signal that one needs to be generated using the method split out from item 1.
  3. Once the query in item 2 is performed, the internalId that will end up in the registration record will be known such that writing a token batch can occur in parallel while the registration record is either refreshed (updated) or inserted.

It's possible that the registration record returned from item 2 will be deleted between items 2 and 3; however, it should be infeasible for the same internalId (a large random number) to have been claimed by a different externalId and documentHash concurrently. Therefore, what would happen in this case is that an insert of the full registration record would occur instead of an update. We can keep the registration code that first attempts a refresh and then does an insert the same here (but we will have passed in the internalId instead of generating it within the documents.register call). This would continue to preserve the optimization that avoids unnecessary encryption of the document unless strictly needed.
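A hypothetical sketch of items 1-3 combined (`generateInternalId` is the split-out generator from item 1; the other names are illustrative):

// item 2: look for an existing registration first
const existing = await collection.findOne({
  'registration.externalIdHash': externalIdHash,
  'registration.documentHash': documentHash
});

// item 1: reuse the existing internalId or generate a new one
const internalId = existing ?
  existing.registration.internalId : await generateInternalId();

// item 3: with internalId known up front, the token batch write can
// proceed in parallel with the registration refresh/insert
const [registrationResult, batchResult] = await Promise.all([
  documents.register({internalId /* , ...document params */}),
  tokens.create({internalId /* , ...token params */})
]);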
