Comments (21)
The relevant endpoint appears to be [objects/patch](https://cloud.google.com/storage/docs/json_api/v1/objects/patch) with a body like `{"metadata": {"key": "value"}}`. It should not be difficult to implement set/get metadata using that.
It is also possible to set the metadata at file creation, but that is probably a bit more complicated.
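A minimal sketch of what a set-metadata call on that endpoint could look like; the function name is illustrative (not gcsfs API), and the actual HTTP call with an authorized session is left to the caller:

```python
from urllib.parse import quote


def patch_request(bucket, key, metadata):
    """Build the JSON API objects.patch URL and body that set custom
    metadata on an existing object (assumed helper, not gcsfs code)."""
    url = ('https://www.googleapis.com/storage/v1/b/%s/o/%s'
           % (quote(bucket, safe=''), quote(key, safe='')))
    return url, {'metadata': metadata}


url, body = patch_request('my-bucket', 'dir/file.txt', {'owner': 'alice'})
# then e.g.: session.patch(url, json=body)
```

Note the object name must be URL-escaped (including `/`), hence `quote(..., safe='')`.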
from gcsfs.
Actually it is simple to do at file creation; see here for the resumable case, but the same applies in simple_upload:
```python
def _initiate_upload(self):
    r = requests.post(
        'https://www.googleapis.com/upload/storage/v1/b/%s/o'
        % quote_plus(self.bucket),
        params={'uploadType': 'resumable'},
        headers=self.gcsfs.header,
        json={'name': self.key, 'metadata': {'test': '2.7'}})
    self.location = r.headers['Location']
```
The metadata can be passed as a parameter at `GCSFile` instantiation.
OK, then I guess both should be done.
In s3fs we have getxattr, setxattr for existing files, so we can also do calls for individual keys alongside the whole metadata.
https://cloud.google.com/storage/docs/json_api/v1/how-tos/simple-upload indicates that simple upload is no good if metadata needs to be added. Actually it can be used, but through a different URL call.
I suggest limiting this to resumable uploads, which fits my use case.
Will submit a PR.
You would be doubling the time for uploads on small files without metadata. Could you not stay with simple upload, but call the PATCH method after close in the case that there is metadata?
Not sure I understand - I have no use for simple upload, as none of my writes are < 5MB. The change proposal will only affect (benefit) resumable uploads in `_initiate_upload`, which is the main reason to use GCSFS and a file-obj.
For simple upload, the `metadata` parameter, if passed on `open()`, would be ignored. To support that scenario, an after-the-fact PATCH is best indeed.
The change is very minor. Feel free to reject or bundle it otherwise, but it's working nicely for me.
Sorry, I thought you meant avoiding simple upload altogether. I would like metadata added in both cases, rather than silently ignoring them in some cases.
So the plan:
- open() should take metadata=dict parameter
- on _initiate_upload, set the metadata, if there is any
- add set_metadata and get_metadata methods for existing files; should getting use the cache?
- if upload is simple, and there is metadata, call set_metadata after close
- (optional) add set_xattr and get_xattr for accessing a single key.
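The simple-upload branch of that plan can be sketched with HTTP replaced by an injected callable, so the control flow is visible; the class and argument names here are illustrative stand-ins, not the real gcsfs implementation:

```python
class FileSketch:
    """Toy stand-in for a GCSFile-like object: take metadata at open(),
    and for simple uploads apply it with a separate PATCH after close,
    since the media endpoint cannot carry it."""

    def __init__(self, bucket, key, metadata=None, simple=True, call=None):
        self.bucket, self.key = bucket, key
        self.metadata = metadata
        self.simple = simple
        self._call = call  # stand-in for an authorized HTTP session

    def close(self):
        # resumable uploads would have sent metadata in _initiate_upload;
        # for simple uploads we patch it on afterwards
        if self.simple and self.metadata:
            self._call('PATCH', 'b/%s/o/%s' % (self.bucket, self.key),
                       json={'metadata': self.metadata})


calls = []
f = FileSketch('bkt', 'key', metadata={'a': '1'},
               call=lambda *a, **kw: calls.append((a, kw)))
f.close()
# calls now records a single PATCH carrying the metadata body
```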
Sorry, but I can only help with the first two bullets.
The get_metadata might be redundant, as GCSFSFile.details already renders it all. Could be a helper though.
See #39
@yiga2 , I'll leave this issue open despite your PR until the rest of the functionality is complete.
@martindurant I have found an easy way to get simple_upload to work, but it's using the XML API instead of the JSON API.
It uses the `x-goog-meta-` headers, which unfortunately are ignored by the JSON API endpoint.
```python
if self.metadata is not None:
    for k, v in self.metadata.items():
        head['x-goog-meta-' + k] = v
path = 'https://storage.googleapis.com/%s/%s' % (quote_plus(self.bucket), self.key)
r = requests.put(path, headers=head, data=data)
```
Since the response is not JSON, the subsequent code to read `size` and `md5` needs a bit of a rewrite.
Are you fine with that, or do you want to stick with a JSON API approach?
BTW, Google's XML API syntax is apparently compatible with AWS: https://cloud.google.com/storage/docs/migrating
Have you tried setting the same header keys with the original end-point? It would be odd if the two syntaxes had different capabilities. I don't mind particularly, but it would make the code less friendly, I suspect.
On S3 compatibility: it's not really true, in my experience :) I can't remember what didn't work.
I tried and, as I wrote, it is ignored. Odd indeed, and there is no indication that either endpoint will be deprecated.
On the additional change, it would be:
```python
size = int(r.headers['x-goog-stored-content-length'])
md5 = re.findall('md5=(.+)', r.headers['x-goog-hash'])[0]
```
The `md5` retrieval is indeed more cumbersome, but the `path` and `r` declarations are shorter and more readable.
Also that's the only way to set the metadata at object creation, which saves an API call and risk of error.
@martindurant
I have just changed to use the (new) Google Client libraries as I needed service account support badly (across projects and accounts). It has metadata, prefix and more.
For me, there is no need for a GCS-level file-obj anymore, as I managed to get the input sent in compressed format and will compress the output back to GCS. Hence the read/write iterations are on the zip/gz objects, and both compressed blobs fit in memory.
Considering the existence of Google's abstraction layer and their move to under-the-hood gRPC, I wonder if GCSFS should only focus on helpers and gaps, e.g. the ioBase gap. On the latter, you went through the effort of defining the required methods already, and there is no sign Google is working on it. It would require adapting to use https://googlecloudplatform.github.io/google-resumable-media-python/latest/google.resumable_media.requests.html (not simple down/up, as that is covered by the standard lib).
Sure, the changes are profound and beyond my knowledge. The fact that GCSFS relies on the deprecated oauth2client could be a trigger to revisit.
See you back on other projects I am following, fastparquet and fastavro !
@martindurant Would you mind reopening this issue? I've a use case that likely requires metadata on small uploads and this seems like a natural place to plan that work.
@martindurant et al. - I am back! Turned out I needed gcsfs and metadata on `_simple_upload`.
So, as per Google, I have switched `uploadType` from `media` to `multipart` as follows:
```python
# old:
# r = self.gcsfs.session.post(
#     path, params={'uploadType': 'media', 'name': self.key}, data=data)
metadata = {'name': self.key}
if self.metadata is not None:
    metadata['metadata'] = self.metadata
metadata = str(metadata)
data = ('--==0=='
        '\nContent-Type: application/json; charset=UTF-8'
        '\n\n' + metadata +
        '\n--==0=='
        '\nContent-Type: application/octet-stream'
        '\n\n').encode() + data + b'\n--==0==--'
r = self.gcsfs.session.post(
    path,
    headers={'Content-Type': 'multipart/related; boundary="==0=="'},
    params={'uploadType': 'multipart'},
    data=data)
```
Sorry, I haven't touched Git for a while and am very unfamiliar with PRs. The above code works and we use it in production, so I thought of sharing.
Hope someone will take this through and close the loop.
Thanks, @yiga2!
I'll try to make a little time to push it through. I could do with a simple test case that demonstrates it working, and I can sort out the business of making VCR recordings.
`metadata = str(metadata)` - should this not be JSON encoding, which is not necessarily the same thing?
The `encode()` will format it to JSON. Merging `name` with `metadata` is just easier when using `dict.update()` than inserting in JSON (without the `json` library). But your call really, I am by no means a Py expert.
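For reference, `str()` on a dict is not in general valid JSON (it uses single quotes), so `json.dumps` is the safer choice for the multipart metadata part:

```python
import json

md = {'name': 'key', 'metadata': {'test': '2.7'}}
print(str(md))         # {'name': 'key', 'metadata': {'test': '2.7'}}  (single quotes)
print(json.dumps(md))  # {"name": "key", "metadata": {"test": "2.7"}}  (valid JSON)
```

The two happen to coincide only for dicts whose repr contains no single-quoted strings, so relying on `str()` is fragile.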
On the test case, count on me !
But reflecting on this as I was playing with different chunk (block) sizes for optimal performance (which for GCP-to-GCP is the max, 8MB): there is no real performance benefit between simple upload and resumable upload. Actually, as Google indicates here, resumable is even recommended for data integrity.
Hence the question: what about skipping simple upload altogether and using resumable for any file size?
If you want to keep it, to minimize side effects (if any), you could introduce a top-level boolean (or GCSFile argument), e.g. `use_simple_upload`.
On the other hand, resumable with a 5MB blocksize gives you the same, plus you can easily pass custom metadata (already supported by GCSFS).
Not sure I would reopen this, but if anyone is reading this claiming it doesn't work: make sure the dict values are strings, as gcsfs (or fsspec) will not error out. I figured this out when passing a nested dict and no metadata got saved.
One should do as follows:
```python
..., metadata={'address': json.dumps({'street': '1 my avenue', 'county': None, ...
```
and deserialize back when reading the metadata.
@martindurant not sure if wrapping this in the core code (i.e. for each value, if not str, jsonify) is worth it, but raising an error would be good.
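A self-contained illustration of that round-trip, with an illustrative validation helper (`check_metadata` is hypothetical, not part of gcsfs or fsspec):

```python
import json


def check_metadata(metadata):
    """Raise instead of silently dropping non-string metadata values
    (hypothetical helper sketching the error-raising suggestion above)."""
    for k, v in metadata.items():
        if not isinstance(v, str):
            raise TypeError('metadata value for %r must be str, got %s'
                            % (k, type(v).__name__))


nested = {'street': '1 my avenue', 'county': None}
metadata = {'address': json.dumps(nested)}  # serialize nested values to str
check_metadata(metadata)                    # passes: all values are str
assert json.loads(metadata['address']) == nested  # deserialize when reading
```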
Sounds reasonable to either make this work as expected or to document for those that come across the current behaviour and would otherwise be surprised.