
iscc-specs's People

Contributors

alexander-n, patricia92, titusz


iscc-specs's Issues

Add command line support for reference implementation

To make it easier to get started and play around with the reference implementation, we should add a minimal command line interface. This would enable non-developers with basic technical skills to test the ISCC.
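A minimal sketch of such a CLI, using only the standard library. The `instance_id` function is a placeholder stand-in; the real reference implementation would supply the actual ISCC component functions.

```python
import argparse
import hashlib


def instance_id(data: bytes) -> str:
    """Placeholder for the ISCC Instance-ID: a real implementation would
    apply the component header and encoding defined by the spec."""
    return hashlib.sha256(data).hexdigest()[:16]


def main(argv=None):
    parser = argparse.ArgumentParser(
        prog="iscc", description="Generate ISCC components for a file."
    )
    parser.add_argument("file", help="path to the media file")
    args = parser.parse_args(argv)
    with open(args.file, "rb") as f:
        data = f.read()
    print("Instance-ID:", instance_id(data))
```

Usage would then be as simple as `iscc somefile.pdf`, which is exactly the kind of entry point a non-developer can work with.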

Add and document adjustable parameters for ISCC generation

This would promote experimentation and support internal use cases that don't require interoperability but have special requirements regarding algorithm sensitivity. Default values for interoperability should be fixed when the specification becomes stable.
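One way to expose such parameters is a single configuration object with the interoperable defaults baked in. The parameter names below are illustrative assumptions, not part of the spec:

```python
from dataclasses import dataclass


@dataclass
class ISCCParams:
    """Tunable ISCC generation parameters (hypothetical names).
    The defaults would be fixed by the spec for interoperability."""
    text_ngram_size: int = 13     # width of the sliding text window
    meta_trim_bytes: int = 128    # max byte length of trimmed metadata
    component_bits: int = 64      # digest size per component


# Interoperable configuration: never changed by users.
INTEROP_DEFAULTS = ISCCParams()


def custom_params(**overrides) -> ISCCParams:
    """Parameter set for internal use cases that do not require
    interoperability but need different algorithm sensitivity."""
    return ISCCParams(**overrides)
```

Internal users could then pass `custom_params(text_ngram_size=5)` for higher sensitivity, while everyone else implicitly uses `INTEROP_DEFAULTS`.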

Re-use of existing identifiers as ISCC components.

It would be possible to re-use (re-encoded) existing identifiers as separate or replacement components in the ISCC Scheme. For example, an ISBN-13 could be encoded in the ISCC component format. Of course, such a component would not be similarity-preserving, but that is not an absolute requirement; the Instance-ID, for example, is not similarity-preserving either. I see at least the following requirements for such an integration:

  • The existing standard identifier can be encoded in 64-bits.
  • The ISCC standard assigns a custom self-describing component header for the existing standard identifier within the ISCC scheme.
  • An agreement can be found about how and where the existing identifier is placed within the structure of a fully qualified ISCC Code.
  • The respective standards body is interested in and agrees to such an integration.

Interested standards bodies are invited to join this discussion or propose integration of their identifiers into the ISCC Scheme.
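To illustrate the first two requirements: the 13 decimal digits of an ISBN-13 always fit into 64 bits (the maximum value is below 2**44). The header byte below is a hypothetical value, not an assigned one:

```python
import struct

# Hypothetical self-describing component header for ISBN-13 components.
ISBN_HEADER = 0b0001_0000


def encode_isbn13(isbn: str) -> bytes:
    """Pack an ISBN-13 into a header byte plus a 64-bit big-endian value."""
    digits = isbn.replace("-", "")
    if len(digits) != 13 or not digits.isdigit():
        raise ValueError("not an ISBN-13")
    return bytes([ISBN_HEADER]) + struct.pack(">Q", int(digits))


def decode_isbn13(component: bytes) -> str:
    """Recover the ISBN-13 digits from such a component."""
    if component[0] != ISBN_HEADER:
        raise ValueError("not an ISBN component")
    (value,) = struct.unpack(">Q", component[1:])
    return str(value).zfill(13)
```

The round-trip is lossless, which is the property that matters for a non-similarity-preserving replacement component.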

Support granular similarity hashes for Content-ID

Use-Case:
A user has a small chunk of text and wants to find longer texts that contain this chunk or a similar one.

Proposed solution draft:
Apply shift-invariant text-chunking (for example ~1000 characters). Create separate Content-IDs for each chunk. Supply the chunk ids as metadata to the full ISCC.
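The shift-invariant part can be achieved with content-defined chunking: boundaries are placed where a rolling hash over the content hits a fixed pattern, so inserting or deleting text only shifts boundaries locally. A minimal sketch (the gear table, mask, and sha256 stand-in for per-chunk Content-IDs are all illustrative assumptions):

```python
import hashlib

# Deterministic pseudo-random per-byte gear table.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)]

MASK = (1 << 10) - 1  # boundary on average every ~1024 bytes


def chunk_text(text: str, min_size: int = 256, max_size: int = 4096):
    """Split text at content-defined (shift-invariant) boundaries."""
    data = text.encode("utf-8")
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunk_ids(text: str):
    """Per-chunk ids; sha256 stands in for a similarity-preserving
    Content-ID here."""
    return [hashlib.sha256(c).hexdigest()[:16] for c in chunk_text(text)]
```

These chunk ids could then be supplied as metadata to the full ISCC, as proposed above.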

Revise text tokenization to better support CJK languages

Text tokenization should be designed to be simple and generic while also supporting CJK languages.
It must yield appropriate results with similarity encoding independent of language and character set. Tokenization should not assume that text can be extracted without word boundary and separation issues.
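One simple tokenization that satisfies these constraints is a sliding character window over normalized, whitespace-stripped text: it needs no word segmentation and therefore behaves identically for CJK and space-delimited languages. The window width is an illustrative value, not a specified parameter:

```python
import unicodedata


def tokenize(text: str, width: int = 13):
    """Language-agnostic tokenization: NFKC-normalize, lowercase,
    remove all whitespace, then slide a fixed-width character window.
    Never assumes word boundaries, so it works for CJK text too."""
    norm = unicodedata.normalize("NFKC", text.lower())
    collapsed = "".join(norm.split())
    if len(collapsed) <= width:
        return [collapsed]
    return [collapsed[i:i + width] for i in range(len(collapsed) - width + 1)]
```

Because whitespace is removed before windowing, texts that differ only in word separation produce the same tokens, which is exactly the property the similarity encoding needs.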

Change wording for text extraction scope.

Currently:
"While text-extraction is out of scope for this specification ..."

Proposed Change:
"While detailed procedures for text-extraction from various document formats are out of scope for this specification ..."

For reproducible Content-ID-Text components, the definition of the extraction tool/version is part of the normative specification. It might be updated in a future version of the ISCC (ideally only after compatibility tests). Due to the comprehensive text normalization (especially with the upcoming ISCC v1.1), the impact of different text-extraction tools/versions should be minimal. Even if two different implementations of the ISCC generate slightly different Content-IDs, this is not regarded as a failure to produce a valid ISCC code. The similarity-preserving nature of the component would still produce a match or near-duplicate match when comparing ISCC codes.

Text `trim` function must guarantee max byte length.

The text trim function for metadata must ensure that results stored as bytes on-chain do not exceed a given number of bytes. That is, the UTF-8 encoded byte representation of the trimmed Unicode text must not exceed the target byte length.
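A sketch of such a trim: slice the encoded bytes, then decode with `errors="ignore"` so a trailing partial multi-byte sequence is dropped rather than producing invalid UTF-8. (This relies on the input being valid UTF-8 to begin with, so only the truncated tail is affected.)

```python
def trim_text(text: str, max_bytes: int) -> str:
    """Trim text so its UTF-8 encoding fits in max_bytes without
    splitting a multi-byte character."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # A naive byte slice may cut a character in half; decoding with
    # errors="ignore" drops the dangling partial sequence.
    return data[:max_bytes].decode("utf-8", errors="ignore")
```

Note that trimming by Unicode code points alone would not give this guarantee, since a single code point can occupy up to four bytes in UTF-8.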

Elaborate re-use of existing standard identifiers in the bound 'extra'-field.

If cross-sector clustering of the Meta-ID is not required, users may add existing identifiers to the bound “extra”-field for Meta-ID generation. Encoding existing identifiers into the Meta-ID at random is discouraged and provides no practical value beyond disambiguation from other similar Meta-IDs. If used at all, specific industries should first agree on such conventions. Ideally, such conventions should also be documented by the ISCC standard.

Implement optional performance optimized reference implementation

The reference code is currently implemented with minimal dependencies and no performance optimizations. For real-world and large-scale testing of the ISCC we need faster ISCC generation.

Create an optional parallel iscc_opt.py module with the same interface, optimized by using packages like numpy, numba, gmpy or cython (to be researched).

Without making these hard dependencies, we could automatically use the optimized version if the required packages are available in the user's environment. Performance gains of 100x or more can be expected from such optimizations.
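The soft-dependency dispatch can be as simple as a guarded import. Sketched below for a single hot function (hamming distance between two component hashes), assuming numpy as the optional accelerator:

```python
# Pick the optimized implementation if numpy is available,
# otherwise fall back to dependency-free pure Python.
try:
    import numpy as _np

    def hamming_distance(a: int, b: int, bits: int = 64) -> int:
        """numpy path: vectorized popcount over the XOR of two hashes."""
        x = _np.frombuffer((a ^ b).to_bytes(bits // 8, "big"),
                           dtype=_np.uint8)
        return int(_np.unpackbits(x).sum())

except ImportError:

    def hamming_distance(a: int, b: int, bits: int = 64) -> int:
        """Pure-Python fallback with no dependencies."""
        return bin(a ^ b).count("1")
```

Callers import one name and transparently get whichever implementation the environment supports; an iscc_opt.py module could apply the same pattern to every hot path.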

Add tests for discrete cosine transform (document floating point precision)

DCT calculations may yield different results depending on platform (32/64-bit) due to differing floating-point implementations. Investigate possible MPFR-based solutions:

mpmath may be a good option as it internally uses Python's builtin long integers by default, but automatically switches to GMP/MPIR for much faster high-precision arithmetic if gmpy is installed.

See also: https://stackoverflow.com/a/2232650/51627
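As a starting point for such tests, a reference DCT-II with coefficients rounded to fixed precision makes cross-platform comparison deterministic in practice. This is an illustrative workaround, not the mpmath/MPFR approach suggested above:

```python
import math


def dct_ii(values):
    """Reference DCT-II. Rounding to a fixed number of decimals masks
    platform-dependent floating-point noise in the last bits, so tests
    can compare results exactly across machines."""
    n = len(values)
    out = []
    for k in range(n):
        coeff = sum(v * math.cos(math.pi / n * (i + 0.5) * k)
                    for i, v in enumerate(values))
        out.append(round(coeff, 9))
    return out
```

A constant input is a useful first test vector: only the DC coefficient should be non-zero, and any platform whose rounded output differs has a precision problem worth documenting.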

Specify deterministic JPEG decoding

See related discussion with tensorflow: tensorflow/tensorflow#11623
Here is a pure python jpeg decoder: https://github.com/xxyxyz/flat/blob/master/flat/jpeg.py
ImageMagick may be able to do it: https://stackoverflow.com/a/32257778/51627
Others have the same issues: https://stackoverflow.com/questions/45195880/why-does-tensorflow-decode-jpeg-images-differently-from-scipy-imread

libjpeg-turbo supports machine-independent deterministic integer decoding with dct_method JDCT_ISLOW or JDCT_IFAST:

The FLOAT method may also give different results on different machines due to varying roundoff behavior, whereas the integer methods should give the same results on all machines.

Source: https://raw.githubusercontent.com/libjpeg-turbo/libjpeg-turbo/master/libjpeg.txt

Add specification for unique, owned and authenticated Meta-IDs

At Meta-ID level users might want global uniqueness and be in control of the semantics by “owning” a Meta-ID as an ISCC prefix. This turns out to be a registration related concern.

We propose to introduce two new variations of the Meta-ID together with the planned blockchain registry.

The first variation would add an “owned”-flag to the Meta-ID-header, indicating that the last byte of the Meta-ID is a variable length uniqueness counter. The counter would be interactively incremented by the client software during the blockchain registration procedure to guarantee uniqueness and fixate ownership semantics for the given id to the signatory of the registering transaction. This would retain global clustering and de-duplication features while at the same time offering “owned”, authenticated and globally unique Meta-IDs.

The second variation would not depend on any metadata at all to better support automated identifier creation. For example many digital media assets (like photos or granular content) might not have a “title” at all. This variation would be a random or time based surrogate key, again with a uniqueness counter.

Both variations should include protocol specifications for blockchain registration, ownership-transfer and multi-party-ownership.
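The interactive counter increment of the first variation can be sketched as follows. The spec text above proposes a variable-length counter; a single counter byte and an in-memory set standing in for the blockchain registry are simplifying assumptions here:

```python
def register_owned_meta_id(base: bytes, registry: set) -> bytes:
    """Append a uniqueness counter (one byte here for simplicity) and
    increment it until the candidate Meta-ID is not yet registered.
    The registry set stands in for the on-chain registration state."""
    for counter in range(256):
        candidate = base + bytes([counter])
        if candidate not in registry:
            registry.add(candidate)
            return candidate
    raise RuntimeError("counter space exhausted for this Meta-ID")
```

Because only the trailing counter differs, all owned Meta-IDs derived from the same metadata still cluster on their common prefix, which preserves the global de-duplication feature described above.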
