
bedrock-tokenization's Introduction

bedrock-tokenization

A Bedrock module for tokenizing identifiers and storing encrypted related data

bedrock-tokenization's People

Contributors

dlongley, mandyvenables, mattcollier, dmitrizagidulin, jsassassin, aljones15, gannan08, davidlehn, msporny

Stargazers

Phuong D. Nguyen, Eduardo Chongkan

Watchers

James Cloos, Eduardo Chongkan

bedrock-tokenization's Issues

Determine if `updated` can ever be false when a registration record is found (but not changed)

See: https://github.com/digitalbazaar/bedrock-tokenization/blob/v21.1.0/lib/documents.js#L320

From testing, it appears that `updated` will always be set to `true` whether or not the record in the database changed, because it is based on `lastErrorObject.n` being set to `1` (this is how `updated` is implemented internally). If `updated` is always `true`, we can simplify at least the comment exposition that follows the above link in the function (and remove the FIXME) -- it may not change the code.
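A minimal sketch of that hypothesis, assuming the Node.js MongoDB driver's findOneAndUpdate result shape (the query and field names are illustrative):

// hypothetical sketch: `lastErrorObject.n` reflects the number of
// *matched* documents, so `updated` would be true whenever a matching
// record is found -- even if nothing in it actually changed
const result = await collection.findOneAndUpdate(
  {'registration.externalIdHash': externalIdHash},
  {$set: {'meta.updated': Date.now()}});
const updated = result.lastErrorObject.n === 1;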

Find another AES-KW implementation that is not limited to 32 bytes

Current AES-KW implementations are limited to a maximum of 32 bytes (by telling the API that it is wrapping a 256-bit key). AES-KW itself has no such limitation and our use of it here could be for larger sizes. We should find an implementation that can cover larger inputs.
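For reference, a minimal sketch of where the ceiling comes from, using Node's WebCrypto API as one common AES-KW implementation (the algorithm chosen for the payload key is illustrative):

import {webcrypto} from 'node:crypto';
const {subtle} = webcrypto;

// key-encryption key used for wrapping
const kek = await subtle.generateKey(
  {name: 'AES-KW', length: 256}, false, ['wrapKey', 'unwrapKey']);

// the content to wrap must itself be imported as a key, and the
// largest raw symmetric key WebCrypto accepts is 256 bits (32 bytes)
const content = await subtle.importKey(
  'raw', webcrypto.getRandomValues(new Uint8Array(32)),
  {name: 'AES-CBC'}, true, ['encrypt']);
const wrapped = await subtle.wrapKey('raw', content, kek, 'AES-KW');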

Consider optimizing away extra query after refreshing (updating) registration record

If a registration record is "refreshed" (i.e., there was a matching database record to update for a matching document registration), we currently run an additional mongodb query to pull out the record so it can be returned. Alternatively, we could use `findOneAndUpdate` as a potential optimization in that case. The call includes the shard key, so it should be a viable option. It is unknown what the locking characteristics or performance gains would be if we implemented this.
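A hypothetical sketch of that optimization (query and field names are illustrative):

// refresh the record and read it back in a single round trip; the
// filter includes the shard key (externalIdHash), so the query can be
// routed to a single shard
const record = await collection.findOneAndUpdate({
  'registration.externalIdHash': externalIdHash,
  'registration.documentHash': documentHash
}, {
  $set: {'meta.updated': Date.now(), 'registration.expires': expires}
}, {returnDocument: 'after'});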

Investigate shard configuration in code

This library is generally designed to be deployed on a sharded mongodb cluster. Investigate whether or not it would be a good idea and technically possible to define shards in code here.
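For example, shard configuration could hypothetically be issued from code via admin commands, assuming `client` is a connected MongoClient with sufficient privileges (the database, collection, and shard key names are illustrative):

const admin = client.db('admin');
// sharding must be enabled on the database first
await admin.command({enableSharding: 'bedrock_tokenization'});
// then each collection is assigned a shard key
await admin.command({
  shardCollection: 'bedrock_tokenization.tokenization-registration',
  key: {'registration.externalIdHash': 'hashed'}
});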

Investigate whether token batch invalidation count needs to be stored and checked on unpinned batches

Instead of marking unpinned batches as invalid, we may need to store the token batch invalidation count on them and check to ensure it has not changed with respect to the entity. This should be an internal change only, but it will affect what is stored in the database (not indexes, just extra fields). It should not require any additional lookups, as the entity record is already fetched for unpinned batches.

It can potentially be done in a backwards compatible way. It may also eliminate the need to update all token batches to mark them as invalid -- as determining whether or not an unpinned batch is invalid would require comparing the entity record's invalidation count against what is stored in the unpinned batch (instead of checking an invalid field -- which we can probably remove).
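A hypothetical sketch of the change (field and function names are illustrative):

// at creation: copy the entity's current invalidation count into the
// unpinned batch record instead of relying on an `invalid` flag that
// must later be flipped on every batch
const batch = {
  id,
  internalId,
  batchInvalidationCount: entity.batchInvalidationCount
  // ...other fields...
};

// at use: the entity record is already fetched for unpinned batches,
// so this comparison adds no extra lookups
function isInvalid({batch, entity}) {
  return batch.batchInvalidationCount !== entity.batchInvalidationCount;
}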

Potential optimization of open batch/create token batch

In order to increase the number of writes to the database that can happen in parallel, the following changes should be considered:

  1. Concurrently writing a token batch record and an open token batch record (instead of doing these in serial).

There's a note here (which, incidentally, contains the typo "This this") explaining why the open token batch must only be written after the token batch; however, the rationale is primarily about data cleanliness and efficiency as opposed to correctness:

/* Note: An `openTokenBatch` record must only be added *after* a record
is inserted into `tokenBatch` collection *first*. This this ensures that if
we find a record in `openTokenBatch` that references a record that does not
exist in `tokenBatch`, it can *only* be because the `tokenBatch` record has
auto-expired, which means we can safely delete the reference to it in
`openTokenBatch`. Here, we also ignore the returned record and, if an error
occurs, we accept that the just created record in `tokenBatch` will not ever
be used, but will eventually expire and get cleaned up. */
await _createOpenBatchReference({internalId, batchId: id});

If we write the records in parallel, it will be possible for an open token batch to reference a token batch that is not expired but is presently being written to the database; however, this is likely to be a rare occurrence. The common behavior (99%+ of cases) will be that a call to create tokens only happens once for a given internalId. Even when multiple concurrent calls do happen, only in the case where the open batch record is created first and the token batch record has not yet been written when a subsequent call (one made after reading the open batch record) tries to find it would the software behave as if the token batch had expired and been removed.

In that case, the open batch record would be removed and the token batch would never be filled (even if it was, in fact, still open). This would result in some extra token batch records in the database -- a minor inefficiency. These would eventually expire and be removed from the database. An attacker could exploit this -- but not in a way that would be significantly more effective than simply generating a lot of filled token batches instead.
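A hypothetical sketch of the proposed parallel write (`_insertBatch` is an illustrative name; `_createOpenBatchReference` appears in the note above):

// proposed: write the token batch record and the open batch reference
// concurrently, accepting the rare race described above
await Promise.all([
  _insertBatch({batch}),
  _createOpenBatchReference({internalId, batchId: batch.id})
]);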

Add validation in bitstring.js

The `decode` method, for instance, does not validate its parameters; `undefined` will pass through the function successfully.

assert-plus should be used.
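A minimal sketch of the kind of validation intended, using assert-plus (the parameter name is illustrative):

import assert from 'assert-plus';

export function decode({encoded} = {}) {
  // throws a descriptive error if `encoded` is missing or not a string
  assert.string(encoded, 'encoded');
  // ...existing decode logic...
}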

Improve document registration to ensure entity records are shared for the same external ID

The entity and document registration records are optimistically inserted concurrently to achieve important performance gains.

However, it's possible for multiple entity records (with different internal IDs) to be created for two different registration records that have the same externalIdHash. In fact, with the current implementation, this is the expected outcome.

On document registration, a check is performed to find a registration record that matches an externalIdHash and documentHash, i.e., a record that would match the exact document being registered. However, if a document that matches the same externalIdHash but has a different documentHash were to be registered, no matching registration record would be found and, therefore, the entity associated with any existing registration (and externalIdHash) would not be found and reused.

In order to properly support multiple documents with the same externalIdHash, the following changes are needed:

  1. The initial call to find a registration record should be changed to also find any registration record matching externalIdHash. Note that it should find just the earliest of these, as it provides the "canonical" internalId. However, we would still want to also find the specific registration record that matches the documentHash (if this is not the same record), as it is required to be returned by the API. The most efficient way to do this might be to just make two concurrent calls that may happen to return the same record (see the sketch after this list).
  2. The call that creates a registration record may create one with the "wrong" internalId because the entity record with the "right" internalId may be inserted concurrently -- for degenerate cases where two documents are registered at the same time. Any registration record that is inserted must have its internalId checked against the earliest registration record's internalId value -- and be updated if it does not match. Note: Using the created timestamp on records may not be sufficient to make this distinction -- as multiple records could use the same value and then a decision about which internalId to use will need to be made ... in the face of asynchronous / competing processes. The best way to resolve this is TBD.
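A hypothetical sketch of the two concurrent calls from item 1 (query and field names are illustrative):

const [[earliest], exact] = await Promise.all([
  // the earliest registration for this external ID; provides the
  // "canonical" internalId
  collection.find({'registration.externalIdHash': externalIdHash})
    .sort({'meta.created': 1}).limit(1).toArray(),
  // the registration matching the specific document, if any; this may
  // be the same record as `earliest`
  collection.findOne({
    'registration.externalIdHash': externalIdHash,
    'registration.documentHash': documentHash
  })
]);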

Once these changes are implemented, the impact on any existing system with multiple registration records will need to be analyzed (i.e., whether such a system should be upgraded to support the improved constraints).

Make `kmsBaseUrl` used in `kms.js` `createKeystore` a configuration option

bedrock-tokenization should be configurable to use a particular WebKMS system via a config var for kmsBaseUrl. Right now, kmsBaseUrl is based on the assumption that the WebKMS system will be accessible via the same domain where bedrock-tokenization is installed -- which also presumes that bedrock-tokenization runs a server, which may not be true (but likely is). Notably, the WebKMS system should be architecturally split from whatever system runs bedrock-tokenization, even if the domains are shared externally.

In short, we should make kmsBaseUrl a configuration option and require it to be set.
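A hypothetical sketch of the option, following the usual Bedrock config pattern (the import path and config namespace are assumptions):

import {config} from '@bedrock/core';

const cfg = config.tokenization = {};
// no same-domain default: deployments must set this explicitly to
// point at their WebKMS system
cfg.kmsBaseUrl = null;

Startup code could then assert that the value has been set before any keystore is created.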

Explore solutions to preserving usage of document registrations when HMACs rotate

Currently, when a tokenizer rotates (HMAC rotation) and a new one becomes current, existing document registrations will no longer be reused when generating new tokens. In other words, a "new identity" will be created for every document registration that occurs after the rotation. There are a number of ways to address this for workflows where it matters. At least two ways include:

  1. Pre-duplicating all document registrations using the new tokenizer prior to making it current.
  2. Hashing for every possibly "still in use" tokenizer -- instead of just the current one -- when looking up existing records (see the sketch after this list). When an existing record is found, it could be updated to use the next tokenizer. Eventually, the "previous" tokenizer would drop off. See bedrock-resource-restriction for some patterns to consider.
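A hypothetical sketch of option 2 (function and field names are illustrative):

// hash the external ID with every tokenizer that may still be in use
const tokenizers = await getTokenizersInUse(); // illustrative helper
const hashes = await Promise.all(tokenizers.map(
  ({hmac}) => hmacString({hmac, value: externalId})));

// a match against any of the hashes finds the existing registration;
// the record could then be updated to use the current tokenizer's hash
const record = await collection.findOne(
  {'registration.externalIdHash': {$in: hashes}});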

Potential optimization of document registration and token batch creation

In order to increase the number of writes to the database that can happen in parallel, the following changes should be considered:

  1. Splitting out the internalId generation code as a standalone call and making it so that the internalId must be passed into documents.register.
  2. Adding an additional query to check for an existing document registration prior to performing the registration itself. This would fetch an internalId to reuse (from the existing record) or it would signal that one needs to be generated using the method split out from item 1.
  3. Once the query in item 2 is performed, the internalId that will end up in the registration record will be known such that writing a token batch can occur in parallel while the registration record is either refreshed (updated) or inserted.

It's possible that the registration record returned from item 2 will be deleted between items 2 and 3; however, it should be infeasible for the same internalId (a large random number) to have been claimed by a different externalId and documentHash concurrently. Therefore, what would happen in this case is that an insert of the full registration record would occur instead of an update. We can keep the registration code that first attempts a refresh and then does an insert the same here (but we will have passed in the internalId instead of generating it within the documents.register call). This would continue to preserve the optimization that avoids unnecessary encryption of the document unless strictly needed.
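A hypothetical sketch of items 1-3 combined (`generateInternalId` is the split-out generator from item 1; the other names are illustrative):

// item 2: look for an existing registration first
const existing = await collection.findOne({
  'registration.externalIdHash': externalIdHash,
  'registration.documentHash': documentHash
});

// item 1: reuse the existing internalId or generate a new one
const internalId = existing ?
  existing.registration.internalId : await generateInternalId();

// item 3: with internalId known up front, the token batch write can
// proceed in parallel with the registration refresh/insert
const [registrationResult, batchResult] = await Promise.all([
  documents.register({internalId /* , ...document params */}),
  tokens.create({internalId /* , ...token params */})
]);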
