Comments (21)
The relevant endpoint appears to be [objects/patch](https://cloud.google.com/storage/docs/json_api/v1/objects/patch) with a body like `{"metadata": {"key": "value"}}`. It should not be difficult to implement set/get metadata using that.
It is also possible to set the metadata at file creation, but that is probably a bit more complicated.
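A minimal sketch of what a set-metadata call on that endpoint could look like; the function name is illustrative (not gcsfs API), and the actual HTTP call with an authorized session is left to the caller:

```python
from urllib.parse import quote


def patch_request(bucket, key, metadata):
    """Build the JSON API objects.patch URL and body that set custom
    metadata on an existing object (assumed helper, not gcsfs code)."""
    url = ('https://www.googleapis.com/storage/v1/b/%s/o/%s'
           % (quote(bucket, safe=''), quote(key, safe='')))
    return url, {'metadata': metadata}


url, body = patch_request('my-bucket', 'dir/file.txt', {'owner': 'alice'})
# then e.g.: session.patch(url, json=body)
```

Note the object name must be URL-escaped (including `/`), hence `quote(..., safe='')`.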
from gcsfs.
Actually it is simple to do at file creation; see here for the resumable case, but the same applies in simple_upload:
```python
def _initiate_upload(self):
    r = requests.post(
        'https://www.googleapis.com/upload/storage/v1/b/%s/o'
        % quote_plus(self.bucket),
        params={'uploadType': 'resumable'},
        headers=self.gcsfs.header,
        json={'name': self.key, 'metadata': {'test': '2.7'}})
    self.location = r.headers['Location']
```
The metadata can be passed as a parameter at `GCSFile` instantiation.
OK, then I guess both should be done.
In s3fs we have getxattr, setxattr for existing files, so we can also do calls for individual keys alongside the whole metadata.
https://cloud.google.com/storage/docs/json_api/v1/how-tos/simple-upload indicates that simple upload is no good if metadata needs to be added. Actually it can be used, but through a different URL call.
I suggest limiting this to resumable uploads, which fits my use case.
Will submit a PR.
You would be doubling the time for uploads on small files without metadata. Could you not stay with simple upload, but call the PATCH method after close in the case that there is metadata?
Not sure I understand - I have no use for simple upload, as none of my writes are < 5MB. The change proposal will only affect (benefit) resumable uploads in `_initiate_upload`, which is the main reason to use GCSFS and a file-obj.
For simple upload, the `metadata` parameter, if passed on `open()`, would be ignored. To support that scenario, an after-the-fact PATCH is best indeed.
The change is very minor. Feel free to reject or bundle it otherwise, but it's working nicely for me.
Sorry, I thought you meant avoiding simple upload altogether. I would like metadata added in both cases, rather than silently ignoring them in some cases.
So the plan:
- open() should take metadata=dict parameter
- on _initiate_upload, set the metadata, if there is any
- add set_metadata and get_metadata methods for existing files; should getting use the cache?
- if upload is simple, and there is metadata, call set_metadata after close
- (optional) add set_xattr and get_xattr for accessing a single key.
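The simple-upload branch of that plan can be sketched with HTTP replaced by an injected callable, so the control flow is visible; the class and argument names here are illustrative stand-ins, not the real gcsfs implementation:

```python
class FileSketch:
    """Toy stand-in for a GCSFile-like object: take metadata at open(),
    and for simple uploads apply it with a separate PATCH after close,
    since the media endpoint cannot carry it."""

    def __init__(self, bucket, key, metadata=None, simple=True, call=None):
        self.bucket, self.key = bucket, key
        self.metadata = metadata
        self.simple = simple
        self._call = call  # stand-in for an authorized HTTP session

    def close(self):
        # resumable uploads would have sent metadata in _initiate_upload;
        # for simple uploads we patch it on afterwards
        if self.simple and self.metadata:
            self._call('PATCH', 'b/%s/o/%s' % (self.bucket, self.key),
                       json={'metadata': self.metadata})


calls = []
f = FileSketch('bkt', 'key', metadata={'a': '1'},
               call=lambda *a, **kw: calls.append((a, kw)))
f.close()
# calls now records a single PATCH carrying the metadata body
```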
Sorry, but I can only help with the first two bullets.
The get_metadata might be redundant, as GCSFSFile.details already renders it all. Could be a helper though.
See #39
@yiga2 , I'll leave this issue open despite your PR until the rest of the functionality is complete.
@martindurant I have found an easy way to get simple_upload to work, but it's using the XML API instead of the JSON API.
It uses the `x-goog-meta-` headers, which unfortunately are ignored by the JSON API endpoint.
```python
if self.metadata is not None:
    for k, v in self.metadata.items():
        head['x-goog-meta-' + k] = v
path = 'https://storage.googleapis.com/%s/%s' % (quote_plus(self.bucket), self.key)
r = requests.put(path, headers=head, data=data)
```
Since the response is not JSON, the subsequent code to read `size` and `md5` needs a bit of a rewrite.
Are you fine with that, or do you want to stick with a JSON API approach?
BTW, Google's XML API syntax is apparently compatible with AWS: https://cloud.google.com/storage/docs/migrating
Have you tried setting the same header keys with the original end-point? It would be odd if the two syntaxes had different capabilities. I don't mind particularly, but it would make the code less friendly, I suspect.
On S3 compatibility: it's not really true, in my experience :) I can't remember what didn't work.
I tried and, as I wrote, it is ignored. Odd indeed, and there is no indication that either endpoint will be deprecated.
On the additional change, it would be:
```python
size = int(r.headers['x-goog-stored-content-length'])
md5 = re.findall('md5=(.+)', r.headers['x-goog-hash'])[0]
```
The `md5` retrieval is indeed more cumbersome, but the `path` and `r` declarations are shorter and more readable.
Also that's the only way to set the metadata at object creation, which saves an API call and risk of error.
@martindurant
I have just changed to use the (new) Google Client libraries as I needed service account support badly (across projects and accounts). It has metadata, prefix and more.
For me, there is no need for a GCS-level file-obj anymore, as I managed to get the input sent in compressed format and will compress the output back to GCS. Hence the read/write iterations are on the zip/gz objects, and both compressed blobs fit in memory.
Considering the existence of Google's abstraction layer and their move to under-the-hood gRPC, I wonder if GCSFS should only focus on helpers and gaps, e.g. the ioBase gap. On the latter, you went through the effort of defining the required methods already, and there is no sign Google is working on it. It would require adapting to use https://googlecloudplatform.github.io/google-resumable-media-python/latest/google.resumable_media.requests.html (not simple down/up, as that is covered by the standard lib).
Sure, the changes are profound and beyond my knowledge. The fact that GCSFS relies on the deprecated oauth2client could be a trigger to revisit.
See you back on other projects I am following, fastparquet and fastavro !
@martindurant Would you mind reopening this issue? I've a use case that likely requires metadata on small uploads and this seems like a natural place to plan that work.
@martindurant et al. - I am back! Turned out I needed gcsfs and metadata on `_simple_upload`.
So, as per Google, I have switched `uploadType` from `media` to `multipart` as follows:
```python
# old:
# r = self.gcsfs.session.post(
#     path, params={'uploadType': 'media', 'name': self.key}, data=data)
metadata = {'name': self.key}
if self.metadata is not None:
    metadata['metadata'] = self.metadata
metadata = str(metadata)
data = ('--==0=='
        '\nContent-Type: application/json; charset=UTF-8'
        '\n\n' + metadata +
        '\n--==0=='
        '\nContent-Type: application/octet-stream'
        '\n\n').encode() + data + b'\n--==0==--'
r = self.gcsfs.session.post(
    path,
    headers={'Content-Type': 'multipart/related; boundary="==0=="'},
    params={'uploadType': 'multipart'},
    data=data)
```
Sorry, I haven't touched Git for a while and am very unfamiliar with PRs. The above code works and we use it in production, so I thought of sharing.
Hope someone will take this through and close the loop.
Thanks, @yiga2!
I'll try to make a little time to push it through. I could do with a simple test case that demonstrates it working, and I can sort out the business of making VCR recordings.
`metadata = str(metadata)` - should this not be JSON encoding, which is not necessarily the same thing?
The `encode()` will format it to JSON. Merging `name` with `metadata` is just easier when using `dict.update()` than inserting in JSON (without the `json` library). But your call really, I am by no means a Py expert.
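For reference, `str()` on a dict is not in general valid JSON (it uses single quotes), so `json.dumps` is the safer choice for the multipart metadata part:

```python
import json

md = {'name': 'key', 'metadata': {'test': '2.7'}}
print(str(md))         # {'name': 'key', 'metadata': {'test': '2.7'}}  (single quotes)
print(json.dumps(md))  # {"name": "key", "metadata": {"test": "2.7"}}  (valid JSON)
```

The two happen to coincide only for dicts whose repr contains no single-quoted strings, so relying on `str()` is fragile.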
On the test case, count on me !
But reflecting on this as I was playing with different chunk (block) sizes for optimal performance (which for GCP-to-GCP is the max, 8MB): there is no real performance benefit between simple upload and resumable upload. Actually, as Google indicates here, resumable is even recommended for data integrity.
Hence the question: what about skipping simple upload altogether and using resumable for any file size?
If you want to keep it, to minimize side effects (if any), you could introduce a top-level boolean (or GCSFile argument), e.g. `use_simple_upload`.
On the other hand, resumable with a 5MB blocksize gives you the same, plus you can easily pass custom metadata (already supported by GCSFS).
Not sure I would reopen this, but if anyone is reading this claiming it doesn't work: make sure the dict values are strings, as gcsfs (or fsspec) will not error out. I figured this out when passing a nested dict and no metadata got saved.
One should do as follows:
```python
..., metadata={'address': json.dumps({'street': '1 my avenue', 'county': None, ...
```
and deserialize back when reading the metadata.
@martindurant not sure if wrapping this in the core code (i.e. for each value, if not str, jsonify) is worth it, but raising an error would be good.
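A self-contained illustration of that round-trip, with an illustrative validation helper (`check_metadata` is hypothetical, not part of gcsfs or fsspec):

```python
import json


def check_metadata(metadata):
    """Raise instead of silently dropping non-string metadata values
    (hypothetical helper sketching the error-raising suggestion above)."""
    for k, v in metadata.items():
        if not isinstance(v, str):
            raise TypeError('metadata value for %r must be str, got %s'
                            % (k, type(v).__name__))


nested = {'street': '1 my avenue', 'county': None}
metadata = {'address': json.dumps(nested)}  # serialize nested values to str
check_metadata(metadata)                    # passes: all values are str
assert json.loads(metadata['address']) == nested  # deserialize when reading
```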
Sounds reasonable to either make this work as expected or to document for those that come across the current behaviour and would otherwise be surprised.