fsspec / ossfs
fsspec filesystem for Alibaba Cloud (Aliyun) Object Storage System (OSS)
License: Apache License 2.0
Some of the methods in oss-emulator are not available.
Need a way to pipe a stream directly to ossfs files.
Some of the bucket-level functionalities (mkdir, rmdir, etc.) haven't been implemented.
The CI tests are currently failing because of a broken CI script and missing secrets. To fix this, we need to repair the script and add the secrets.
Currently, ossfs is still on 2021.6.0 and depends on fsspec==2021.04.0, while the newest fsspec version is 2021.6.1.
As the project grows more complex, we cannot keep everything in a single file.
Currently, some of the tests use a repeated 1234567890A string as test data; data-related bugs might coincidentally be concealed by this repetition.
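One way to avoid this, as a minimal sketch (the helper name is hypothetical, not from the ossfs test suite): generate unpredictable payloads instead of a fixed pattern.

```python
import os

def random_test_data(size: int) -> bytes:
    """Return `size` bytes of unpredictable content, so that offset,
    ordering, or truncation bugs cannot be masked by a repeating
    pattern such as b"1234567890A" * n."""
    return os.urandom(size)
```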
like what other fsspec implementations do.
OSS itself supports object versioning, and other filesystems like s3fs already support this kind of feature; ossfs should support it too.
Add _isfile, _isdir, _size, _sizes, _exists and _info methods and some tests for them.
ossfs 2023.1.0 has requirement fsspec==2022.11.0, but you have fsspec 2023.1.0.
On a real OSS server, object_exists on a nonexistent file returns False, while in an emulator it raises ServerError. We need to handle this until the emulator fixes the error.
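A minimal sketch of such a workaround (the ServerError class below is a local stand-in for oss2.exceptions.ServerError, and the helper name is hypothetical):

```python
class ServerError(Exception):
    """Stand-in for oss2.exceptions.ServerError."""

def object_exists_safe(bucket, key):
    """object_exists that tolerates the emulator's behavior: a real
    OSS server returns False for a missing key, while the emulator
    raises ServerError instead."""
    try:
        return bucket.object_exists(key)
    except ServerError:
        return False
```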
Add _rm_file and _rm methods and some tests for them.
resumable_upload can continue uploading objects after a network disconnection. Deleting the existing object beforehand would cause problems in this situation.
https://github.com/iterative/ossfs/blob/b81f8ac94451f9d99f53d07640626d7a575b9608/ossfs/core.py#L545-L547
We need to create a new AioOSSFileSystem class to differentiate it from the current sync one, and its initialization needs to be implemented.
Currently, some of the tests fail because of network instability. A dynamic block_size value would help with this.
E.g. https://github.com/iterative/dvc/blob/55bf18d8f92832610c816f2bbe25f804bd2c1338/dvc/tree/gs.py#L20
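One possible shape for a dynamic value, as a sketch: grow the block size with the file size so large uploads stay within OSS's multipart part limit (the 10,000-part cap is OSS's documented limit; the 4 MiB floor is an illustrative default, not taken from the linked dvc code).

```python
import math

def choose_block_size(file_size, min_block=4 * 2**20, max_parts=10_000):
    """Pick a multipart block size large enough that the upload fits
    within `max_parts` parts, but never smaller than `min_block`."""
    return max(min_block, math.ceil(file_size / max_parts))
```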
Currently, OSSFS's handling of network exceptions is scattered and hard to maintain. We can add a middle layer, _call_ossfs, to manage and translate all of the network exceptions.
We should not use ossfs's own results to verify itself; use oss2 instead.
Add _get_file and _get methods and some tests for them.
Similar to s3fs: fsspec/s3fs#486
Add _pipe_file and _pipe methods and some tests for them.
Using a simulator can:
Currently, in the put_file method, resumable_upload conflicts with deleting the existing object beforehand.
Update to the new version of fsspec.
The code in core.py shows that the current implementation doesn't support ECS RAM role authentication. oss2 provides this capability through the combination of ProviderAuth and EcsRamRoleCredentialsProvider. This is a very important authentication scenario, and we hope ossfs can prioritize supporting it.
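A sketch of how that combination might be wired up. The metadata URL is the conventional Aliyun ECS address, and the exact constructor arguments of EcsRamRoleCredentialsProvider are an assumption to check against the installed oss2 version:

```python
# Hypothetical helper; verify the provider's constructor signature
# against the oss2 release you depend on before using this shape.
ECS_METADATA_BASE = (
    "http://100.100.100.200/latest/meta-data/ram/security-credentials/"
)

def ecs_ram_role_auth(role_name):
    """Build an oss2 auth object backed by ECS RAM role credentials."""
    import oss2
    from oss2.credentials import EcsRamRoleCredentialsProvider

    provider = EcsRamRoleCredentialsProvider(ECS_METADATA_BASE + role_name)
    return oss2.ProviderAuth(provider)
```

The returned auth object would then be passed to oss2.Bucket in place of a static-key oss2.Auth.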
Add _cat_file, _cat and _cat_ranges methods and some tests for them.
Currently, ossfs.ls first calls on the parent directory and then on the object itself. But the parent directory might have different permissions when the object is at the root of the bucket, causing a failure.
The speed of the CI tests depends on the server's network; they don't need to be started together.
conda-forge requires LICENSE and test_requirements to install.
Add _cp_file and _copy methods and some tests for them.
An async version of OSSFile is required to make the whole system async.
In the current ossfs, every operation creates a new session of its own. This not only slows down performance but can also cause max connection exceptions under intense workloads.
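One way to address this, as a sketch (the class and names are hypothetical): keep a small per-endpoint cache so every operation reuses the same session instead of opening its own.

```python
class SessionCache:
    """Hand out at most one session per endpoint, created lazily."""

    def __init__(self, factory):
        self._factory = factory   # e.g. a function that builds an oss2 session
        self._sessions = {}

    def get(self, endpoint):
        if endpoint not in self._sessions:
            self._sessions[endpoint] = self._factory(endpoint)
        return self._sessions[endpoint]
```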
In OSS, we don't use an address like oss://bucket/path as in S3 or HDFS; instead, an endpoint is always required in an OSS address. Currently, ossfs can only recognize an address like http://oss-cn-hangzhou.aliyuncs.com/mybucket/myobj, while the input might be oss://bucket/path.
Getting file status (isdir, isfile, size, ...) from the remote is a very slow operation, and many such calls are required by other operations. With the help of dircache we can store this data in memory, which would greatly improve performance.
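The core of the idea, as a sketch (list_remote stands in for a real per-directory OSS listing call; the function name is hypothetical):

```python
def ls_cached(dircache, list_remote, path):
    """Return the listing for `path`, hitting the remote only on a
    cache miss; repeated status checks then become dict lookups."""
    if path not in dircache:
        dircache[path] = list_remote(path)   # one remote round trip
    return dircache[path]
```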
Some of the tests on an OSS-Emulator would fail because of the different behavior of the emulator and the real server.
This code shouldn't be included in ossfs.
ossfs should come with a default endpoint; otherwise, if you don't set anything in your environment / config, it will fail like this:
(.venv38) (Python 3.8.5+) [ 11:35ÖÖ ] [ <hey>@<hey>:~/temp_remotes/ossfs_remote(master✗) ]
$ dvc push -vv
#2021-07-15 11:35:36,932 TRACE: Namespace(all_branches=False, all_commits=False, all_tags=False, cd='.', cmd='push', cprofile=False, cprofile_dump=None, func=<class 'dvc.command.data_sync.CmdDataPush'>, glob=False, instrument=False, instrument_open=False, jobs=None, pdb=False, quiet=0, recursive=False, remote=None, run_cache=False, targets=[], verbose=2, version=None, with_deps=False)
2021-07-15 11:35:37,022 DEBUG: Check for update is enabled.
2021-07-15 11:35:37,526 DEBUG: Preparing to upload data to 'oss://dvc-test-github/batuhan-cache'
2021-07-15 11:35:37,526 DEBUG: Preparing to collect status from oss://dvc-test-github/batuhan-cache
2021-07-15 11:35:37,526 DEBUG: Collecting information from local cache...
2021-07-15 11:35:37,527 TRACE: Assuming '/home/isidentical/temp_remotes/ossfs_remote/.dvc/cache/8d/da64717821b0fcbbcdb48afe082822' is unchanged since it is read-only
2021-07-15 11:35:37,528 DEBUG: Collecting information from remote cache...
2021-07-15 11:35:37,528 DEBUG: Matched '0' indexed hashes
2021-07-15 11:35:37,528 DEBUG: Querying 1 hashes via object_exists
2021-07-15 11:35:37,537 ERROR: unexpected error - 'NoneType' object has no attribute 'strip'
------------------------------------------------------------
Traceback (most recent call last):
File "/home/isidentical/dvc/dvc/main.py", line 55, in main
ret = cmd.do_run()
File "/home/isidentical/dvc/dvc/command/base.py", line 50, in do_run
return self.run()
File "/home/isidentical/dvc/dvc/command/data_sync.py", line 57, in run
processed_files_count = self.repo.push(
File "/home/isidentical/dvc/dvc/repo/__init__.py", line 51, in wrapper
return f(repo, *args, **kwargs)
File "/home/isidentical/dvc/dvc/repo/push.py", line 44, in push
pushed += self.cloud.push(objs, jobs, remote=remote)
File "/home/isidentical/dvc/dvc/data_cloud.py", line 79, in push
return remote_obj.push(
File "/home/isidentical/dvc/dvc/remote/base.py", line 57, in wrapper
return f(obj, *args, **kwargs)
File "/home/isidentical/dvc/dvc/remote/base.py", line 488, in push
ret = self._process(
File "/home/isidentical/dvc/dvc/remote/base.py", line 345, in _process
dir_status, file_status, dir_contents = self._status(
File "/home/isidentical/dvc/dvc/remote/base.py", line 193, in _status
self.hashes_exist(
File "/home/isidentical/dvc/dvc/remote/base.py", line 145, in hashes_exist
return indexed_hashes + self.odb.hashes_exist(list(hashes), **kwargs)
File "/home/isidentical/dvc/dvc/objects/db/base.py", line 468, in hashes_exist
remote_hashes = self.list_hashes_exists(hashes, jobs, name)
File "/home/isidentical/dvc/dvc/objects/db/base.py", line 419, in list_hashes_exists
ret = list(itertools.compress(hashes, in_remote))
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 611, in result_iterator
yield fs.pop().result()
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/isidentical/dvc/dvc/objects/db/base.py", line 410, in exists_with_progress
ret = self.fs.exists(path_info)
File "/home/isidentical/dvc/dvc/fs/fsspec_wrapper.py", line 84, in exists
return self.fs.exists(self._with_bucket(path_info))
File "/home/isidentical/ossfs/ossfs/core.py", line 450, in exists
bucket = self._get_bucket(bucket_name, connect_timeout)
File "/home/isidentical/ossfs/ossfs/core.py", line 139, in _get_bucket
return oss2.Bucket(
File "/home/isidentical/.venv38/lib/python3.8/site-packages/oss2/api.py", line 347, in __init__
super(Bucket, self).__init__(auth, endpoint, is_cname, session, connect_timeout,
File "/home/isidentical/.venv38/lib/python3.8/site-packages/oss2/api.py", line 191, in __init__
self.endpoint = _normalize_endpoint(endpoint.strip())
AttributeError: 'NoneType' object has no attribute 'strip'
------------------------------------------------------------
Add _expand_path, du, glob and _find methods and some tests for them.
Too many logs from oss2 make it hard to find useful information.
Add _ls and _walk methods and some tests for them.
Add _put_file and _put methods and some tests for them.
Currently, the same test on different versions and platforms shares the same objects on OSS. This causes failures whenever a write operation is involved. We should use unique paths to separate them.
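A minimal sketch of such a unique prefix (the helper name and base are hypothetical):

```python
import uuid

def unique_test_prefix(base="ossfs-test"):
    """Build a per-run object prefix so CI jobs on different Python
    versions and platforms never write to the same OSS objects."""
    return f"{base}/{uuid.uuid4().hex}"
```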
In ossfs.ls, we first ls the object, then ls the directory, while in info we first ls the directory and then the specific object. This is because the directory ls can benefit from dircache. By reordering the object and directory ls, we can accelerate all info-related operations (du, stat, size, isdir, isfile, ukey, etc.).
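The proposed ordering, sketched with the directory listing stubbed as a cache dict and the per-object lookup as a callable (both hypothetical stand-ins):

```python
def info(path, dircache, head_object):
    """Resolve `path` by consulting the cached directory listing
    first, falling back to a per-object request only on a miss."""
    parent = path.rsplit("/", 1)[0]
    for entry in dircache.get(parent, []):   # cheap: in-memory lookup
        if entry["name"] == path:
            return entry
    return head_object(path)                 # expensive: remote round trip
```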