ipfs / go-cid

Content ID v1 implemented in go

License: MIT License

Go 99.87% Makefile 0.13%

cid multiformats ipld

go-cid's Introduction

go-cid

A package to handle content IDs in Go.

This is an implementation in Go of the CID spec. It is used in go-ipfs and related packages to refer to a typed hunk of data.

Lead Maintainer

Eric Myhre

Install
Usage
API
Contribute
License

Install

go-cid is a standard Go module which can be installed with:

go get github.com/ipfs/go-cid

Usage

Running tests

Run tests with go test from the directory root:

go test

Examples

Parsing string input from users

// Create a cid from a marshaled string
c, err := cid.Decode("bafzbeigai3eoy2ccc7ybwjfz5r3rdxqrinwi4rwytly24tdbh6yk7zslrm")
if err != nil {...}

fmt.Println("Got CID: ", c)

Creating a CID from scratch

import (
  cid "github.com/ipfs/go-cid"
  mc "github.com/multiformats/go-multicodec"
  mh "github.com/multiformats/go-multihash"
)

// Create a cid manually by specifying the 'prefix' parameters
pref := cid.Prefix{
	Version: 1,
	Codec: uint64(mc.Raw),
	MhType: mh.SHA2_256,
	MhLength: -1, // default length
}

// And then feed it some data
c, err := pref.Sum([]byte("Hello World!"))
if err != nil {...}

fmt.Println("Created CID: ", c)

Check if two CIDs match

// To test if two CIDs are equivalent, be sure to use the 'Equals' method:
if c1.Equals(c2) {
	fmt.Println("These two refer to the same exact data!")
}

Check if some data matches a given CID

// To check if some data matches a given CID,
// get your CID's prefix, and use that to sum the data in question:
other, err := c.Prefix().Sum(mydata)
if err != nil {...}

if !c.Equals(other) {
	fmt.Println("This data is different.")
}

Contribute

PRs are welcome!

Small note: If editing the Readme, please conform to the standard-readme specification.

License

MIT © Jeromy Johnson

go-cid's Issues

Decide how to handle -1 in Prefix

Currently, Prefix.MhLength is an int and -1 can be (and is) used to mean "default length". Unfortunately, this means:

cid1.Bytes() == cid2.Bytes() does not imply cid1.Prefix() == cid2.Prefix().
Prefix.Bytes() is broken.

Solutions:

Make it a uint64 and provide some convenience constructors (e.g. func V1Prefix(codec uint64, mhType uint64) Prefix). This will break things.
Fix Prefix.Bytes() and provide an Equals method (less convenient in the long run).
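The comparison problem can be illustrated with a minimal sketch. The `prefix` type below is a hypothetical stand-in with the same four fields as cid.Prefix, and `normalize` is an invented helper that only knows about sha2-256; neither is part of go-cid.

```go
package main

import "fmt"

// prefix is a hypothetical stand-in for cid.Prefix, used only to
// illustrate the comparison problem described above.
type prefix struct {
	Version  uint64
	Codec    uint64
	MhType   uint64
	MhLength int
}

// normalize replaces the -1 "default length" marker with a concrete
// digest length. Only sha2-256 (code 0x12, 32 bytes) is handled here;
// a real implementation would need a full table of hash functions.
func normalize(p prefix) prefix {
	if p.MhLength == -1 && p.MhType == 0x12 {
		p.MhLength = 32
	}
	return p
}

func main() {
	a := prefix{Version: 1, Codec: 0x55, MhType: 0x12, MhLength: -1}
	b := prefix{Version: 1, Codec: 0x55, MhType: 0x12, MhLength: 32}

	fmt.Println(a == b)                       // false: -1 and 32 compare as different
	fmt.Println(normalize(a) == normalize(b)) // true once -1 is resolved
}
```

Both prefixes would produce the same CID bytes for the same data, yet plain struct comparison treats them as different, which is the asymmetry the issue describes.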

Thoughts?

Store CIDs as strings

Having to convert them to-and-from strings to use them as map keys is really painful and inefficient.

Bloom filter backed version of Cid.Set

Would be great to have, possibly letting us save a lot of memory in some places.

Support parse (github.com/ipfs/go-path).Path

	r, err := ipns.ResolveIPNS(context.TODO(), p.NameResolver(), name)
	if err != nil {
		return errors.Wrap(err, "failed to resolve ipns")
	}

	c, err := cid.Parse(r) <- here
	if err != nil {
		return errors.Wrapf(err, "failed to parse CID %s", r)
	}

Extract cid-fmt formatter out of main package so it can be used as a lib.

undefined: Protobuf

I ran go get for go-libp2p-kad-dht and got the following output:

go get -u github.com/libp2p/go-libp2p-kad-dht
# github.com/ipfs/go-cid
../ipfs/go-cid/cid.go:33: undefined: Protobuf
../ipfs/go-cid/cid.go:117: undefined: Protobuf

I think this belongs in ipld org

Provide Way of Representing CidV0 objects in an alternative base

It may be useful to represent a CidV0 object in an alternative base, for example if we need to include it as the domain part of a URL.

I propose we allow a non-standard encoding of:

<multibase prefix><version><multihash>

where version is always 0.

Related to ipfs/kubo#4143.

Note: I made this more complicated than it needs be, see this comment: #34 (comment)

Allow CID to reference public keys (possibly make it extensible)

See message on IRC: https://botbot.me/freenode/ipfs/msg/78509267/

Currently, when looking up public keys, the generic value store interface of the DHT is used.

This has the effect that when looking up a public key, for example (the same applies to naming), the DHT is queried twice: once for the correspondence from the public key fingerprint to the block ID, and once more to get the list of peers that can provide the block with the ID we just got.

Instead, the key could be looked up directly using the content lookup interface of the DHT. This interface takes a CID and responds with a list of peers directly. This is exactly what we need.

I suggest that we add an encoding in the CID to represent a public key from the key hash to avoid the two phase lookup. This is what this ticket is for.

I also suggest that instead of taking a pointer to a CID value, the routing interface references an interface type instead, that could be implemented differently. This is to be tracked in libp2p/go-libp2p-routing.

Don't hesitate to chat with me over IRC, realtime makes it easier to remove ambiguities

What about backing cid.Set by a sorted slice rather than a map?

It's pretty common when using cid.Set to want them to remain sorted. For example, if you're going to compare two sets, you need them to be sorted. If you're going to compute a hash representing the set, you need to do it over a sorted set, otherwise two hashes for equal sets won't be equal.

We could introduce a new cid.SortedSet for this use case, but I think it is worth considering making cid.Set always sorted and backing it by a sorted slice rather than a map.

I can't recall the exact details, but I think insertion into and lookup in a sorted slice in Go perform negligibly differently from a map. More importantly, with a slice, you have the option of using a stack-allocated array as the backing buffer, eliminating a heap allocation and reducing GC pressure.
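A minimal sketch of the sorted-slice idea follows. The `sortedSet` type and its methods are hypothetical names, and plain strings stand in for CIDs (for instance their binary form).

```go
package main

import (
	"fmt"
	"sort"
)

// sortedSet is a hypothetical set of CID keys (plain strings here)
// kept in ascending order at all times.
type sortedSet struct {
	keys []string
}

// Add inserts k if it is absent and reports whether it was added.
func (s *sortedSet) Add(k string) bool {
	i := sort.SearchStrings(s.keys, k)
	if i < len(s.keys) && s.keys[i] == k {
		return false
	}
	s.keys = append(s.keys, "")
	copy(s.keys[i+1:], s.keys[i:]) // shift the tail right by one slot
	s.keys[i] = k
	return true
}

// Has reports whether k is in the set, using binary search.
func (s *sortedSet) Has(k string) bool {
	i := sort.SearchStrings(s.keys, k)
	return i < len(s.keys) && s.keys[i] == k
}

func main() {
	var s sortedSet
	for _, k := range []string{"c", "a", "b", "a"} {
		s.Add(k)
	}
	fmt.Println(s.keys, s.Has("b"), s.Has("z")) // [a b c] true false
}
```

Lookup is a binary search, while insertion also shifts the tail of the slice, so whether this beats a map in practice depends on set size and would need benchmarking.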

StringOfBase function

Currently the String() function returns a base58-encoded CID.
It would be very nice to have a function that allows any encoding.
It should return an error for a version 0 CID.

Intern CIDs

Interning CIDs would make pointers to them usable as keys in maps (for fast caching) and reduce memory usage.

Serializable Cid type - Exporting .Version .Codec .Hash

The Cid-Type fields are unexported:

https://github.com/ipfs/go-cid/blob/master/cid.go#L46

This means that Cids need to be stringified/jsonified etc. when serialized (encoded/decoded) and re-decoded when deserialized.

What do people think of exporting those fields?

cid Undef cannot be marshalled

Cid implements multiple Go encoding interfaces, including Go's "encoding.BinaryUnmarshaler". For this it uses the function "CidFromBytes" via "Cast". However, this function cannot handle empty bytes. As the default value of an undefined CID, CidUndef, is represented by an empty string/byte sequence, a problem arises: one can marshal all CIDs except CidUndef, which results in an error. IMHO this should be possible, as CidUndef is a valid part of the implementation.

To reproduce the problem:
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"

	cid "github.com/ipfs/go-cid"
)

func main() {
	gob.Register(new(cid.Cid))

	// Encode the undefined CID.
	c1 := cid.Undef
	var network bytes.Buffer
	enc := gob.NewEncoder(&network)
	err := enc.Encode(c1)
	if err != nil {
		fmt.Printf("encode error: %v\n", err)
	}

	// Decode it again; this is the step that fails.
	dec := gob.NewDecoder(&network)
	var c2 cid.Cid
	err = dec.Decode(&c2)
	if err != nil {
		fmt.Printf("decode error: %v\n", err)
	}
}

This results in "decode error: varints malformed, could not reach the end"

CidFromReader wraps valid io.EOF in ErrInvalidCid

When reading from an io.Reader that has no data, the io.EOF error should not be wrapped in ErrInvalidCid. This is not an invalid CID, and is not the same as a partial read which is indicated by io.ErrUnexpectedEOF.

This is an issue because existing code that uses CidFromReader may check for the end of an input stream by if err == io.EOF instead of the preferred if errors.Is(err, io.EOF), and that code will break at runtime after upgrading to go-cid v0.4.0.

Suggestion to avoid extra layer of indirection by using a more compact representation of Cid

The conversion from plain Key to Cid introduced an extra layer of indirection that can likely be avoided.

Currently Cid takes up the width of 5*64 bits, with its current definition of:

type Cid struct {
    version uint64
    codec   uint64
    hash    mh.Multihash
}

May I suggest

type Cid struct {
    version uint32
    codec   uint32
    hash    mh.Multihash
}

As 32 bits should be more than enough for version and codec, which I don't think will ever get very large (even 16 bits should be enough). If combined with multiformats/go-multihash#29, the size will be 3*64 bits, which is the same size as a slice.

At the size of a slice, it becomes practical to pass it around by value, which avoids an extra layer of indirection. Even though it is passed by value, the internals can still be kept hidden, as is done with many other Go datatypes.

Just a suggestion. It might make a difference when you have an array of 1000's of Cids (which I might in some of my maintenance commands for the filestore, ipfs/kubo#2634).

Extract non-core functionality into new package

go-cid is used by any package that does anything IPFS or IPLD related. We should probably move non-core functionality into a new package to avoid having to update so many packages when non-core functionality changes.

Code in the following packages should likely be moved:

set.go and tests
format.go and tests
Everything in cid-fmt/.

Prefix.Sum creates bad inline CIDs

inlineCid.Prefix().Sum("foobar") uses the length from inlineCid instead of -1.

Support for reading concatenated CIDs

CIDs carry a prefix including the length of the multihash payload, so it should be possible to read concatenated CIDs from a stream or buffer without the need for additional metadata or delimiters. Unfortunately the methods available don't expose quite enough information to achieve this.

Some ways this could be fixed:

have Cast return the number of bytes actually used from the argument slice
add a new Read(io.Reader) Cid that leaves the input reader positioned at the first byte after the CID consumed

Generated CIDv0 differs from the one generated by IPFS

Hi, I'm receiving some files and want to verify the CID sent, so I'm doing the following (summarized):

genCID, err := cid.Decode(receivedCID)
if err != nil {...}
fileCID, err := genCID.Prefix().Sum(filecontent)
if err != nil {...}
match := genCID.Equals(fileCID)

The CID I'm receiving is the one generated by doing ipfs add -n path/to/file, but it doesn't match the one generated by go-cid.

Something I'm doing wrong?

PS: This works fine for CIDv1

Needs README

@hsanjuan Do you understand enough of this module to add more than the skeleton readme I would provide?

Suggestion: Change CID type so it can be passed around directly and not by pointer

Right now the Cid type is:

type Cid struct {
    version uint64
    codec   uint64
    hash    mh.Multihash
}

The internals are not exposed, but it is expected to be passed around by pointer, which in a way exposes some of the implementation details. For example, in #3 I suggested we use a more compact structural representation, and I think @Stebalien is suggesting a serialized string instead. The best representation may be debatable, but either change will require that the Cid no longer be passed around by a pointer. We may even decide that a Cid is better represented by an interface at some point.

So for now, may I suggest that we start passing around the Cid type directly and change the representation to:

type Cid struct {
    *cid
} 
type cid struct {
    version uint64
    codec   uint64
    hash    mh.Multihash
}

This will then give us the freedom to try different internal representations without requiring any API changes.

@whyrusleeping @Stebalien thoughts?

It would be nice to add Less()

It's common to want to sort Cid, which requires an implementation of Less(). This can be done more efficiently inside Cid than outside because it won't require allocation if done inside (can directly access member fields which aren't exposed outside the package).
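A minimal sketch of such a method, using a hypothetical `id` struct shaped like the Cid struct quoted in these issues. The ordering chosen here (version, then codec, then hash bytes) is an assumption; the issue does not specify one.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// id is a hypothetical stand-in for the Cid struct quoted in these issues.
type id struct {
	version uint64
	codec   uint64
	hash    []byte
}

// Less orders by version, then codec, then hash bytes. It compares the
// fields in place and does not allocate.
func (a id) Less(b id) bool {
	if a.version != b.version {
		return a.version < b.version
	}
	if a.codec != b.codec {
		return a.codec < b.codec
	}
	return bytes.Compare(a.hash, b.hash) < 0
}

func main() {
	ids := []id{
		{1, 0x71, []byte{2}},
		{1, 0x55, []byte{9}},
		{0, 0x70, []byte{5}},
		{1, 0x55, []byte{3}},
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i].Less(ids[j]) })
	for _, x := range ids {
		fmt.Printf("v%d 0x%x %v\n", x.version, x.codec, x.hash)
	}
}
```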

Implement 'K'-encoding (base36 cids)

Based on multiformats/multibase#65

Proposal: Moving CID to multiformats Org

Please review multiformats/cid#26

Add admin team

NewPrefixV0 should not take a multihash

There is only one legal multihash in CidV0, so requiring one is redundant.

Inaccurate godoc comment for uvarint

The godoc comment for uvarint() does not accurately represent the actual functionality (likely left over from the original varint function?):

No mention of the error value returned
The number of characters read n never returns a value n < 0

go-cid/varint.go

Lines 7 to 22 in 8b9ff39

    
// Version of varint function that work with a string rather than
// []byte to avoid unnecessary allocation

// Copyright 2011 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license as given at https://golang.org/LICENSE

// uvarint decodes a uint64 from buf and returns that value and the
// number of characters read (> 0). If an error occurred, the value is 0
// and the number of bytes n is <= 0 meaning:
//
// 	n == 0: buf too small
// 	n  < 0: value larger than 64 bits (overflow)
// 	        and -n is the number of bytes read
//
func uvarint(buf string) (uint64, int, error) {

Not a big deal because the function is not exported but, if it could be possible for it to return n < 0 on error, Prefix() could try to use a negative offset and raise an "index out of range" panic.

go-cid/cid.go

Lines 530 to 534 in 8b9ff39

    
offset := 0
version, n, _ := uvarint(c.str[offset:])
offset += n
codec, n, _ := uvarint(c.str[offset:])
offset += n
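The hazard can be demonstrated with the standard library's `binary.Uvarint`, which does follow the n <= 0 convention from the quoted comment. The `readTwo` helper below is hypothetical; it shows one way to guard the offset before slicing.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// readTwo decodes two consecutive varints (for example version and codec)
// and returns them with the offset of the next unread byte. The count n is
// checked before it is added to the offset, so a zero or negative n can
// never produce an out-of-range slice expression.
func readTwo(buf []byte) (first, second uint64, offset int, err error) {
	first, n := binary.Uvarint(buf)
	if n <= 0 {
		return 0, 0, 0, errors.New("malformed first varint")
	}
	offset += n
	second, n = binary.Uvarint(buf[offset:])
	if n <= 0 {
		return 0, 0, 0, errors.New("malformed second varint")
	}
	offset += n
	return first, second, offset, nil
}

func main() {
	fmt.Println(readTwo([]byte{0x01, 0x70, 0x12})) // 1 112 2 <nil>

	// More than ten continuation bytes overflow 64 bits, so binary.Uvarint
	// reports n < 0 and readTwo returns an error instead of slicing.
	bad := []byte{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x01}
	fmt.Println(readTwo(bad))
}
```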

Qm hash identifiers duplicated within the IPFS ecosystem

Hi,

My issue is in connection with adding IPFS metadata to another chain.

In my case Ravencoin. We use the CID to link to the immutable file (record, pdf, etc) of the digital token on the ravencoin chain.

We expect the Qm (46 character hash starting with Qm) hash to be immutable.

However we have found that the IPFS node peer ID (used in IPNS) is also a Qm hash with 46 characters. That hash is mutable or changes to whatever that peer is publishing.

Two questions -
Is there a way of identifying which Qm hash is immutable?

I have seen PR #86, which seems to indicate that Peer IDs will be changing from the 46-character Qm hash. Is that correct? If so, my issue will be fixed anyway... I live in hope...

protobuf vs dag-pb

The name for 0x70 here is "protobuf":

go-cid/cid.go

Line 88 in 6e296c5

"protobuf": DagProtobuf,

In the multicodec table the name for 0x70 is defined as "dag-pb":
https://github.com/multiformats/multicodec/blob/master/table.csv#L408

"protobuf" exists in the table but is mapped to 0x50:
https://github.com/multiformats/multicodec/blob/master/table.csv#L30

This is creating an interop issue with /api/v0/block/put?format=dag-pb - it's not understood by go-ipfs.

Can this be fixed?

add CidFromReader

We now have 2 CID decoder functions in go-car that really belong here and we should move (and properly test) them. The basic functionality is that I have either an io.Reader or just a []byte but I don't know how big the CID is but I'm pretty sure I know where the start of it is. I should be able to extract the CID and get an offset to the end of the parsed CID bytes.

ReadCid(buf []byte) (cid.Cid, int, error) - read a Cid from buf and tell me the offset after read: https://github.com/ipld/go-car/blob/71cfa2fc2a619d646606373c5946282934270bd4/util/util.go#L22

ReadCid(store io.ReaderAt, at int64) (cid.Cid, int, error) - read a Cid from store and tell me the offset after read: https://github.com/ipld/go-car/blob/wip/v2/v2/internal/io/cid.go (wip/v2 branch, not yet in master).

We have decodeFirst(bytes) in js-multiformats to serve a very similar purpose. Having it in the core library has uncovered some other uses outside of CAR decoding too.

Suggestions for more explicit naming of these functions welcome!

LRU cache

Now that we have nice, string-backed CIDs, we should consider caching them in an LRU cache. My hypothesis is that, when working with CIDs, we likely regenerate them several times. For example, with bitswap, we'll re-create the CIDs when we receive the blocks we're looking for. When we do this, we may end up storing each CID in memory twice.

Note: we'll have to be very careful with this. This is the kind of optimization that could end up hurting performance (or even memory usage) more than it helps if we're not careful. It may not even be worth it in practice.
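As a rough sketch of what such a cache could look like, the hypothetical `lru` type below keeps one canonical copy of each key string and evicts the least recently used entry when full. It is built on `container/list`, has no locking, and says nothing about whether the optimization pays off.

```go
package main

import (
	"container/list"
	"fmt"
)

// lru is a hypothetical fixed-capacity cache that hands back one canonical
// copy of each key string (for example the string form of a CID).
type lru struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> list element holding that key
}

func newLRU(capacity int) *lru {
	if capacity < 1 {
		capacity = 1
	}
	return &lru{capacity: capacity, order: list.New(), items: make(map[string]*list.Element)}
}

// Intern returns the cached copy of key, inserting it (and evicting the
// least recently used entry if the cache is full) when it is not present.
func (c *lru) Intern(key string) string {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		return el.Value.(string)
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(string))
	}
	c.items[key] = c.order.PushFront(key)
	return key
}

// Has reports whether key is currently cached.
func (c *lru) Has(key string) bool {
	_, ok := c.items[key]
	return ok
}

func main() {
	c := newLRU(2)
	for _, k := range []string{"a", "b", "a", "c"} {
		c.Intern(k)
	}
	fmt.Println(c.Has("a"), c.Has("b"), c.Has("c")) // true false true
}
```

The map and list entries are themselves overhead per CID, which is one way the cache could end up costing more memory than it saves.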

CodecToStr and Codecs in v0.3.1 could cause data corruption

In v0.2.0 we made a conscious decision to make a breaking change to FORCE people to refactor their code, and not be surprised by code changes. Rationale in #137.

In Go, a major version bump is essentially a different package, so bumping the major version does not help people who are already using INVALID mappings. That is why we did it in v0.2.

Recently merged PR #142 removed that safety and could cause an unexpected code change if someone updates from v0.1 to v0.3. IIUC we now have this:

What if someone used the cbor string and expected it to point at 0x71 (dag-cbor)?
- v0.3.1 SILENTLY changes the mapping to 0x51 (cbor)
What if someone used the protobuf string and expected it to point at 0x70 (dag-pb)?
- v0.3.1 SILENTLY changes the mapping to 0x50 (protobuf)

Upgrading v0.1.x to v0.3.1 can now produce DAGs with different codec and get NO warning about this BREAKING CHANGE.

@Jorropo @rvagg If you really want to keep CodecToStr and Codecs, you should make sure they return a hard error when someone tries to use them for the impacted mappings above. Third-party apps use this library. People don't read release notes. Silent data corruption is not acceptable.
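One possible shape for that hard error is sketched below. The `lookup` helper and both maps are hypothetical and not part of go-cid; the four codes are the ones listed in this issue.

```go
package main

import "fmt"

// table uses the codes named in the issue above.
var table = map[string]uint64{
	"protobuf": 0x50,
	"cbor":     0x51,
	"dag-pb":   0x70,
	"dag-cbor": 0x71,
}

// ambiguous lists the names whose meaning changed between releases,
// mapped to the codec that older releases resolved them to.
var ambiguous = map[string]string{
	"protobuf": "dag-pb",
	"cbor":     "dag-cbor",
}

// lookup is a hypothetical helper that refuses to resolve a name whose
// mapping changed, instead of silently returning the new code.
func lookup(name string) (uint64, error) {
	if old, ok := ambiguous[name]; ok {
		return 0, fmt.Errorf("codec name %q is ambiguous: older releases mapped it to %s; use an explicit code", name, old)
	}
	code, ok := table[name]
	if !ok {
		return 0, fmt.Errorf("unknown codec name %q", name)
	}
	return code, nil
}

func main() {
	fmt.Println(lookup("dag-cbor")) // 113 <nil>
	fmt.Println(lookup("cbor"))     // 0 and an error
}
```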

Harden varint encoding.

We should reject multihashes with non-minimally encoded varints. See multiformats/unsigned-varint#19.
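A minimal sketch of such a check, built on the standard library's `binary.Uvarint`. The `minimalUvarint` name and error value are hypothetical, and this covers only the redundant-zero case, not any maximum-length rule the referenced spec may impose.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errNotMinimal = errors.New("varint not minimally encoded")

// minimalUvarint decodes a varint from buf and rejects encodings that
// carry a redundant zero group, such as 0x81 0x00 (a two-byte form of 1).
func minimalUvarint(buf []byte) (uint64, int, error) {
	v, n := binary.Uvarint(buf)
	if n <= 0 {
		return 0, 0, errors.New("malformed varint")
	}
	// The final byte holds the most significant 7-bit group. If it is
	// zero in a multi-byte encoding, a shorter encoding of v exists.
	if n > 1 && buf[n-1] == 0 {
		return 0, 0, errNotMinimal
	}
	return v, n, nil
}

func main() {
	fmt.Println(minimalUvarint([]byte{0x80, 0x01})) // 128 2 <nil>
	fmt.Println(minimalUvarint([]byte{0x81, 0x00})) // 0 0 varint not minimally encoded
}
```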

Provide an efficient API to check whether a CID has `IDENTITY` multihash code

CIDs with multihash code IDENTITY typically require special handling when encountered in blockstores. This is because such CIDs contain the data within themselves: the data is simply the multihash digest of that CID, since multihash code IDENTITY corresponds to a hash function that copies its input.

To handle them gracefully, checks are needed to indicate whether a given CID has the IDENTITY code, and those checks would have to run for almost all operations on the blockstore API. It is therefore highly desirable to check as efficiently as possible.

The current APIs offered provide two ways to perform the check:

cid.Prefix().MhType
decode of cid.Hash() via go-multihash API to extract the code

Blockstore implementations would benefit from an API that checks whether a given CID or digest of a CID has the IDENTITY code in a "fail-fast" manner. That is, the check would return as fast as possible if a CID is not an IDENTITY CID, without first checking the validity of the CID, then decoding the digest, then comparing the multihash code.

The rationale for a "fail-fast" check is:

if a CID does not have IDENTITY multihash code, it doesn't always need to be fully decoded in order for a block to be returned (e.g. when CID is used as key in a map)
the majority of CIDs interacted with are not IDENTITY therefore we want to pay the price of decoding only when we have to, and certainly not for every call to blockstore.

I therefore propose to:

Write benchmarks that compare the efficiency of the current APIs when checking for IDENTITY code.
Provide an alternative API that aims to improve efficiency for the checks.
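One possible fail-fast check over the binary form of a CID is sketched below. `isIdentity` is a hypothetical function, not a go-cid API; it assumes the CIDv1 layout of version, codec, then multihash code as varints, and it treats malformed input as "not identity" rather than validating it.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const identityCode = 0x00 // multihash code for IDENTITY

// isIdentity reports whether the binary CID in b uses the IDENTITY
// multihash. It reads at most three varints and never touches the digest.
func isIdentity(b []byte) bool {
	// A CIDv0 is a bare sha2-256 multihash (0x12 0x20 plus 32 bytes),
	// so it can never be an IDENTITY CID.
	if len(b) == 34 && b[0] == 0x12 && b[1] == 0x20 {
		return false
	}
	off := 0
	for i := 0; i < 2; i++ { // skip version and codec
		_, n := binary.Uvarint(b[off:])
		if n <= 0 {
			return false
		}
		off += n
	}
	code, n := binary.Uvarint(b[off:])
	return n > 0 && code == identityCode
}

func main() {
	inline := []byte{0x01, 0x55, 0x00, 0x02, 'h', 'i'}
	hashed := append([]byte{0x01, 0x55, 0x12, 0x20}, make([]byte, 32)...)
	fmt.Println(isIdentity(inline), isIdentity(hashed)) // true false
}
```

Benchmarking this against `Prefix().MhType` would be needed before drawing any conclusion about the actual gain.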

Should the JSON representation of CID always be formatted as links?

My understanding is that the idea behind {"/": <cid-string>} is for when CIDs are represented as links. Should they always be serialized like this? In many cases a CID acts more as an identifier than a link.

I see it just adding additional overhead to the JSON stream, especially when returning a list of CIDs.

Also, it seems that even the idea of encoding them this way is controversial, at least based on @Stebalien's comments in ipld/specs#70, where he said:

I'm really not happy baking this into IPLD. {'/': ...} was a hack to get JSON working.

This will need some strong arguments/motivations.

and

this was never intended to be the canonical representation. It was a hack to get JSON working.

Possible to calculate cid using Writer interface instead of reading full file into memory?

As Prefix.Sum() accepts []byte, I need to read the full file into memory and then feed it to .Sum(). How can I make use of streaming with io.Copy() instead and reduce memory requirements?
