
lakefs-spec's Introduction


lakeFS-spec: An fsspec backend for lakeFS


Welcome to lakeFS-spec, a filesystem-spec backend implementation for the lakeFS data lake. Our primary goal is to streamline versioned data operations in lakeFS, enabling seamless integration with popular data science tools such as Pandas, Polars, and DuckDB directly from Python.

Highlights:

  • Simple repository operations in lakeFS
  • Easy access to underlying storage and versioning operations
  • Seamless integration with the fsspec ecosystem
  • Directly access lakeFS objects from popular data science libraries (including Pandas, Polars, DuckDB, Hugging Face Datasets, PyArrow) with minimal code
  • Transaction support for reliable data version control
  • Smart data transfers through client-side caching (up-/download)
  • Auto-discovery configuration

Note

We are seeking early adopters who would like to actively participate in our feedback process and shape the future of the library. If you are interested in using the library and want to get in touch with us, please reach out via GitHub Discussions.

Installation

lakeFS-spec is published on PyPI; you can install it using your favorite package manager:

$ pip install lakefs-spec
  # or
$ poetry add lakefs-spec

Usage

The following usage examples showcase two major ways of using lakeFS-spec: as a low-level filesystem abstraction, and through third-party (data science) libraries.

For a more thorough overview of the features and use cases for lakeFS-spec, see the user guide and tutorials sections in the documentation.

Low-level: As an fsspec filesystem

The following example shows how to upload a file, create a commit, and read back the committed data using the bare lakeFS filesystem implementation. It assumes you have already created a repository named repo and have lakectl credentials set up on your machine in ~/.lakectl.yaml (see the lakeFS quickstart guide if you are new to lakeFS and need guidance).

from pathlib import Path

from lakefs_spec import LakeFSFileSystem

REPO, BRANCH = "repo", "main"

# Prepare example local data
local_path = Path("demo.txt")
local_path.write_text("Hello, lakeFS!")

# Upload to lakeFS and create a commit
fs = LakeFSFileSystem()  # will auto-discover config from ~/.lakectl.yaml

# Upload a file on a temporary transaction branch
with fs.transaction(repository=REPO, base_branch=BRANCH) as tx:
    fs.put(local_path, f"{REPO}/{tx.branch.id}/{local_path.name}")
    tx.commit(message="Add demo data")

# Read back committed file
f = fs.open(f"{REPO}/{BRANCH}/demo.txt", "rt")
print(f.readline())  # "Hello, lakeFS!"

High-level: Via third-party libraries

A variety of widely-used data science tools are building on fsspec to access remote storage resources and can thus work with lakeFS data lakes directly through lakeFS-spec (see the fsspec docs for details). The examples assume you have a lakeFS instance with the quickstart repository containing sample data available.

# Pandas -- see https://pandas.pydata.org/docs/user_guide/io.html#reading-writing-remote-files
import pandas as pd

data = pd.read_parquet("lakefs://quickstart/main/lakes.parquet")
print(data.head())


# Polars -- see https://pola-rs.github.io/polars/user-guide/io/cloud-storage/
import polars as pl

data = pl.read_parquet("lakefs://quickstart/main/lakes.parquet", use_pyarrow=True)
print(data.head())


# DuckDB -- see https://duckdb.org/docs/guides/python/filesystems.html
import duckdb
import fsspec

duckdb.register_filesystem(fsspec.filesystem("lakefs"))
res = duckdb.read_parquet("lakefs://quickstart/main/lakes.parquet")
res.show()

Contributing

We encourage and welcome contributions from the community to enhance the project. Please check discussions or raise an issue on GitHub for any problems you encounter with the library.

For information on the general development workflow, see the contribution guide.

License

The lakeFS-spec library is distributed under the Apache-2 license.

lakefs-spec's People

Contributors

adrianokf, janwillemkl, leonpawelzik, maciej818, maxmynter, nicholasjng, ozkatz, renesat


lakefs-spec's Issues

Implement proper file upload / download events

Right now, we fire an FSEvent.PUT_FILE event awkwardly from within the _upload_chunk method - for reading, we do not have any hook support.

We should change this by:

  • Defining FSEvent.FILE{UPLOAD,DOWNLOAD}.
  • Firing either of these hooks in LakeFSFile.close, depending on whether mode == wb or mode == rb.
  • Adding test(s).

Investigate (and implement) instance caching by client configuration

fsspec exposes the mechanism of caching filesystem instances, e.g. for reusing API clients. We could implement this by overriding the fsid attribute to give a unique identifier, e.g. based on client configuration.

Objectives:

  • Find a way to access the lakeFS client configuration - this should be possible as fs.client._api.configuration (?).
  • Create an ID (via hashing, object IDs, etc.) out of this config, including only the desired members (e.g., the password probably does not need to go into the ID); a sketch follows this list.
  • Test that the file system is reused, for example in pd.read_parquet calls. This is especially useful for downstream package users.
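
A minimal sketch of deriving such an ID, assuming the configuration object exposes host and username attributes (the attribute names are illustrative, not confirmed lakefs-spec internals):

import hashlib

def make_fsid(configuration):
    # Include only members that identify the target instance; secrets such as
    # the password deliberately do not go into the ID.
    parts = (configuration.host, configuration.username)
    return hashlib.sha256("|".join(str(p) for p in parts).encode()).hexdigest()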

fs.put of an unchanged file does not execute a newly registered Hook

When I add a hook, e.g. on FSEvent.PUT, and then call fs.put(...) with a file that is unchanged on the remote, the commit hook is not executed because the checksums match.

I can force the fs.put with precheck=False.

Do we want hook execution even when a file-upload is skipped?

E.g., git would throw an error ("nothing to commit") when no changes are present. But people might add other functionality in their hooks which they want executed even when nothing changed.

  • If we adapt this, we need to update the Demo.

Set up automatic version management

In order to keep versions consistent across locations (package metadata, __init__.py, tags, releases), we should set up automatic version management.

We have used bump2version for this task in the past with good success, so I would recommend it here. (update: upon closer inspection and taking into account that bump2version isn't actively maintained anymore, it seems that setuptools-scm fits the bill quite well).

Acceptance Criteria

  • Automatic version management has been integrated
  • A version bump and accompanying GH release has been made (see also #14)

Add wheel build & publishing job

🚧 Blocked by PyPI token issuance. 🚧

Implementation details:

  • Primed on published release and manual dispatch
  • Publishing e.g. via pypa/gh-action-pypi-publish (1st party solution)

Acceptance criteria:

  • PyPI token has been set as GH Actions secret PYPI_API_TOKEN / TEST_PYPI_API_TOKEN (for TestPyPI)
  • Package metadata has been correctly set:
  • Added py.typed file, since we are using type hints
  • GitHub action for package publishing (MR pushes -> TestPyPI, tags -> PyPI) has been added
  • Package has been published initially

Improve public API docs coverage

In order to make the generated API docs (more) useful, the code should include docstrings for (at a minimum):

  • All modules: otherwise, subpages in the API reference will appear empty (since they are generated from the module docstring)
  • All public members (functions, classes): e.g., the config module is entirely undocumented
  • Any missing type annotations

Handle backend errors on commit attempts for no changes

Found in #15:

Pushing changes with postcommit = True results in an error when no changes are on the branch. The error:

HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Request-Id': '490cc54c-a8f9-41c4-974b-25e56c9b9aa3', 'Date': 'Tue, 22 Aug 2023 13:54:27 GMT', 'Content-Length': '33'})
HTTP response body: {"message":"commit: no changes"}

I think we should handle that error since we also have a logging statement informing that there was no update to the resource.

Possible fixes:

  1. Use the diff API to check if there are actual changes, abort if the diff is empty.
  2. Force the commit and ignore errors (but maybe at least log them); see the sketch below.
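
A sketch of option 2, tolerating the "no changes" error instead of failing. The lakefs_client names used here (CommitCreation, commits_api.commit) are assumptions based on the error response above, not confirmed against the codebase:

import logging

from lakefs_client.exceptions import ApiException
from lakefs_client.models import CommitCreation

logger = logging.getLogger(__name__)

def safe_commit(client, repository, branch, message):
    try:
        client.commits_api.commit(
            repository=repository,
            branch=branch,
            commit_creation=CommitCreation(message=message),
        )
    except ApiException as e:
        if "no changes" in str(e.body):
            logger.info("No changes on branch %r, skipping commit.", branch)
        else:
            raise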

Implement `cp_file`

Should simply be a wrapper around objects_api.copy_object (a sketch follows the checklist below).

  • Added impl
  • Added test(s)
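
A hedged sketch of the wrapper, assuming the lakefs_client ObjectCopyCreation model and that parse() (the rpath helper mentioned in other issues) returns a (repository, ref, resource) triple:

from lakefs_client.models import ObjectCopyCreation

def cp_file(self, path1, path2, **kwargs):
    src_repo, src_ref, src_path = parse(path1)
    dst_repo, dst_branch, dst_path = parse(path2)
    self.client.objects_api.copy_object(
        repository=dst_repo,
        branch=dst_branch,
        dest_path=dst_path,
        object_copy_creation=ObjectCopyCreation(src_path=src_path, src_ref=src_ref),
    )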

Brainstorm ideas to extend the lakeFS hook concept

Based on internal discussions with AR / JWK.

Support for post-operation commits is nice, but the user might want to do more, like merging a branch back into the source branch if some condition is fulfilled, reverting a previous commit, pushing a tag, etc.

Our current model does not allow for this user flexibility. So, in order to give the user more customization power, we could open up the post-operation commit hook concept into a more general one. For this, the following design questions should be considered:

  • Do we stay with only a post-operation hook, or do we allow a pre-operation hook as well?
  • Do we remain with one hook function, or do we allow a mapping of op type -> hook? This is e.g. what fsspec does with its callbacks, which is very extensible.
  • Do we expose the client altogether, or pre-selected lakeFS operations? It might be best to create a read-only view of the client, so that the file system stays immutable.
  • What kind of support / snippets are we giving to the user? Right now, we only have a very rudimentary commit hook sample, which we could extend with common operations if we decide to open up the concept.

Remove Gitlab CI pipeline file

A leftover Gitlab CI pipeline definition resides in the root directory and should be removed, now that the project is hosted on GitHub.

Related to #2, could be done in the same PR.

Implement `LakeFSFileSystem.ls` caching

The machinery is already in place, now we just need to store results of ls calls in the dircache.

  • Pipe storage_options through the file system constructor to the dircache.
  • Add a check for rpath in the ls cache before the API call.
  • Store objects_api.list_objects results in the dircache within the ls method (sketched after this list).
  • Add test to confirm cache hit.
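
An illustrative sketch of the resulting ls() flow; _ls_from_api is a hypothetical wrapper around the objects_api.list_objects call, not an existing method:

def ls(self, path, detail=True, **kwargs):
    path = self._strip_protocol(path)
    entries = self.dircache.get(path)  # check the listing cache first
    if entries is None:
        entries = self._ls_from_api(path, **kwargs)  # hypothetical list_objects wrapper
        self.dircache[path] = entries  # store the result for later cache hits
    return entries if detail else [e["name"] for e in entries]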

Read lakeFS Client Config from `.lakectl.yaml` (if it Exists)

Inspired by this comment in the LakeFS repo.

We can use the .lakectl.yaml (if it exists) to avoid needing to pass the storage_options parameters whenever we interact with the FS.

The .lakectl.yaml is created when using lakectl (the LakeFS CLI). Docs.

Example content of .lakectl.yaml in /Users/janedoe/.lakectl.yaml:

credentials:
    access_key_id: AKIAIOSFOLQUICKSTART
    secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
metastore:
    glue:
        catalog_id: ""
    hive:
        db_location_uri: file:/user/hive/warehouse/
        uri: ""
server:
    endpoint_url: http://127.0.0.1:8000

The storage_options of lakeFS-spec look like this:

storage_options={
    "host": "localhost:8000",
    "username": "username",
    "password": "password",
}

So we can use the corresponding variables from the .lakectl.yaml if it exists. If not, passing storage_options would be required.
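
A minimal sketch of that mapping, assuming PyYAML is available and using the key names from the example config above (the helper name is illustrative):

from pathlib import Path

import yaml

def lakectl_storage_options(path="~/.lakectl.yaml"):
    config = yaml.safe_load(Path(path).expanduser().read_text())
    return {
        "host": config["server"]["endpoint_url"],
        "username": config["credentials"]["access_key_id"],
        "password": config["credentials"]["secret_access_key"],
    }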

Discuss optional `PyYAML` dependency

I disagree - has any user reported any headaches? With all due respect, running `pip install --upgrade pyyaml` once in the console (which is even suggested if you fail to load an existing config file) is possible for every prospective user of this library.

And since we only require it for an opt-in feature (lakectl configs), adding it as an unconditional requirement is not appropriate.

That being said, if we mention it heavily in the user guide, we can either a) change that to use other auth methods or b) put a small disclaimer in front of the examples (preferred).

Originally posted by @nicholasjng in #129 (comment)

Create an integration test for `pandas` example usage

Our pandas integration is so far our main example, because it showcases how fsspec is useful for completely abstracting away service communication from the data scientist, effectively shrinking DataFrame I/O to <5 lines (or even 1, with our zero-config approach).

We should make sure to write tests for it, to assert that our features work as expected. Concretely, I mean the following:

  • Create the test repo with sample data, so that we get the lakes.parquet example from which we can test DataFrame reads. This works by adding the sample_data=True option to the RepositoryCreation object in conftest.ensurerepo.
  • Add a test for pandas parquet I/O, probably with automatic branch creation.
  • Add a test with an automatic commit (I discovered that this was broken just before the demos today, fwiw - now it's fixed).

Properly translate LakeFS errors into OS errors

Role model can be https://github.com/fsspec/s3fs/blob/main/s3fs/errors.py.

In short, from API exceptions with HTTP codes, we can construct the following mapping between lakeFS and Python errors:

NotFoundException (HTTP404) -> FileNotFoundError
ForbiddenException (HTTP403) -> PermissionError
UnauthorizedException (HTTP401) -> PermissionError

List to be amended by other backend errors from the Python docs.

Algorithm can be as follows:

  1. Construct error code to exception mapping (all backend errors have HTTP codes set as status)
  2. Write a translate_lakefs_error function similar to the boto one above,
  3. Wrap client calls in a try-except on the ApiException base type, translate the error on failure, and raise the translated error instead (see the sketch below).
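
A sketch of steps 1 and 2, modeled on the s3fs approach linked above; the lakefs_client exception attributes (status, body) are assumptions:

from lakefs_client.exceptions import ApiException

HTTP_CODE_TO_ERROR = {
    401: PermissionError,  # UnauthorizedException
    403: PermissionError,  # ForbiddenException
    404: FileNotFoundError,  # NotFoundException
}

def translate_lakefs_error(error: ApiException) -> OSError:
    exc_class = HTTP_CODE_TO_ERROR.get(error.status, OSError)
    return exc_class(error.status, str(error.body))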

Add auto-commit mode

For a new user of the library, having to register a hook in order to make a commit after uploading a file can be a bit daunting, since it requires understanding multiple unrelated concepts. There is also a danger of forgetting to create a commit at all, which might lead to unexpected repository states when using the library.

We currently have client_utils.commit(), but that involves pulling the client from a filesystem instance (and thus, does not work easily with LakeFSFile) and manually keeping track of repo and branch names.

Instead, we could offer a simple auto_commit=True option to a filesystem (or a file, as part of close()), which would simply create a new commit on the active branch for all modification operations (i.e., put, rm, cp_file). We could even harness the existing hook system for this and automatically register a commit_file hook for the relevant FSEvents. (however, this would probably warrant a revisit of the hook system, such that multiple hooks can be registered independently for a given event to enable composability).

Proposed API changes

  • Introduce an auto_commit: bool argument in LakeFSFileSystem
  • Introduce an auto_commit: bool argument in LakeFSFile
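
A hypothetical usage example of this proposal; auto_commit does not exist yet and only illustrates the suggested API:

from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem(auto_commit=True)  # proposed option, not an existing argument
fs.put("demo.txt", "repo/main/demo.txt")  # would create a commit on the branch automatically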

Keep only single `development` docs version for `main` branch

The documentation versions for the main branch are quite distracting and in fact bury the proper releases (which come at the end of the list).


We should modify the CI pipeline to publish a single unstable alias from main, instead of deriving the version number from the build as we currently do.

`ls()` should return fully-qualified paths with repo/ref

When calling ls(), the paths for items returned in the result are not prefixed with the repository and ref (as is the case for the underlying API endpoint). However, this means these paths cannot be used in other lakefs-spec API calls, since they all expect a fully-qualified rpath (as validated by parse()).

Example (assume the repo contains a folder data, containing a single file 1.txt):

items = fs.ls("repo/main/data")
assert items[0]["name"] == "repo/main/data/1.txt"  # AssertionError!

Since ls() is used under the hood by the AbstractFileSystem base class for a variety of other operations (at least find(), walk(), glob(), but also get(..., recursive=True)), these are broken by extension as well (since they might either return incorrect data, or in the case of put() pass an unqualified path to info(), which fails the validation in parse()).

A possible solution is to prefix the items returned by the lakeFS API with the repo and ref in ls() (a single-line fix; a sketch follows the failing test case below). However, extra care needs to be taken to make sure this behavior works correctly with the directory listing cache.

Failing test case:

def test_ls(
    random_file_factory: RandomFileFactory,
    fs: LakeFSFileSystem,
    repository: str,
    temp_branch: str,
) -> None:
    random_file = random_file_factory.make()

    prefix = f"{repository}/{temp_branch}/find_{uuid.uuid4().hex.lower()[:6]}"
    fs.put(str(random_file), f"{prefix}/{random_file.name}")
    files = fs.ls(f"{prefix}/")

    assert len(files) == 1
    assert files[0]["name"] == f"{prefix}/{random_file.name}"
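
As mentioned above, a sketch of the prefixing fix could look like the following (the entry dict shape is assumed from the test case):

def _qualify(entries, repository, ref):
    # Prefix each API result with "repo/ref/" so downstream fsspec operations
    # (find, walk, glob, ...) receive fully-qualified rpaths.
    return [{**entry, "name": f"{repository}/{ref}/{entry['name']}"} for entry in entries]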

Allow `LakeFSFile`s to decay to standard block store files on request

LakeFS has APIs that allow a user to upload files directly to the underlying block storage and link them to a resource in the repository.

We could support this, and allow users to put files directly into the block store - with this, we could also support multipart uploads through the actual block storage file systems.

  • Figure out where to include multiple dispatch of files, including user options to enable / request it.
  • Implement dispatch of the target file system as well as import guard if necessary.
  • Add test(s).

Allow passing Path-like inputs as local paths

In #123, we added type hints that allow passing other Path-like inputs besides str to filesystem operations. However, they are not currently handled correctly internally when passing a Path, e.g., in put():

from lakefs_spec import LakeFSFileSystem
from pathlib import Path

fs = LakeFSFileSystem()
fs.put(Path("demo.txt"), "repo/main/demo.txt")
Traceback (most recent call last):
  File "/home/adriano/tmp/poetry-test/demo.py", line 5, in <module>
    fs.put(Path("demo.txt"), "repo/main/demo.txt")
  File "/home/adriano/tmp/poetry-test/.venv/lib/python3.11/site-packages/lakefs_spec/spec.py", line 570, in put
    super().put(
  File "/home/adriano/tmp/poetry-test/.venv/lib/python3.11/site-packages/fsspec/spec.py", line 1038, in put
    lpaths = fs.expand_path(lpath, recursive=recursive, maxdepth=maxdepth)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adriano/tmp/poetry-test/.venv/lib/python3.11/site-packages/fsspec/spec.py", line 1153, in expand_path
    path = [self._strip_protocol(p) for p in path]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'PosixPath' object is not iterable

All methods that take a str | os.PathLike[str] | pathlib.Path parameter should be examined and adapted to correctly work with all types of allowed inputs.
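
A minimal sketch of a normalization helper that the affected methods could apply to their local path arguments before delegating to the fsspec base class; stringify_path is a hypothetical name:

import os

def stringify_path(path):
    # pathlib.Path and other os.PathLike objects are converted; strings pass through.
    return os.fspath(path) if isinstance(path, os.PathLike) else path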

local `pip-compile` on Mac causes CI failures

After yesterday, I also finally understood what the failure in PR #107 was (link to the failed job).

TLDR: pip-compile is platform-dependent, and ipykernel from the docs has a Mac-only dependency:

https://github.com/ipython/ipykernel/blob/966e0a41fc61e7850378ae672e28202eb29b10b0/pyproject.toml#L36-L39

This means a) that a pre-commit run on a Mac currently causes CI failures, specifically for the requirements-docs.txt compile, and b) that lockfiles are platform-dependent (maybe not so surprising).

I don't have any suggestion available immediately, since pip-compile apparently does not expose an option to compile for a specific platform.

Add support for presigned URLs in file operations

Idea from treeverse/lakeFS#6469.

The lakeFS client supports pre-signed URLs by the presign boolean argument to many object-based APIs like list_objects, get_object, put_object, delete_object(s). We should support that!

Considerations:

  • The pre-sign feature has been added as a filesystem-wide feature flag (e.g. presign, use_presign, presign_urls...). presign support was added for each API that supports it (list_objects, get_object, stat_object).
  • The packaging implications have been observed (e.g., will this require boto3 as dependency?) and dealt with.
  • Tests for pre-signed URLs were added.

Directory listing cache issues between multiple branches/repos

The directory listing cache in the LakeFSFileSystem.ls() implementation uses the relative path (without the repo/ref prefix!) as a cache key for the dircache.

This means that calling ls() for the same directory across multiple branches (or even repos) might lead to false cache hits, when in reality the listings might be completely different.

Here's a failing test case:

def test_ls_caching_regression(fs: LakeFSFileSystem, repository: str) -> None:
    fs.client, counter = with_counter(fs.client)

    testdir = "data"

    listing = fs.ls(f"{repository}/main/{testdir}/")
    assert len(fs.dircache) == 1
    assert tuple(fs.dircache.keys()) == (testdir,)

    listing2 = fs.ls(f"{repository}-foobar/main/{testdir}/")
    assert len(fs.dircache) == 2  # Fails currently

    assert listing != listing2   # Fails currently

    # second `ls` call should not hit the cache
    assert counter.count("objects_api.list_objects") == 2  # Fails currently
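
A minimal sketch of the corresponding fix, assuming the entries and the repository/ref/resource triple are available where the dircache is populated (the helper name is illustrative):

def cache_listing(dircache, repository, ref, resource, entries):
    # Key the cache on the fully-qualified path so listings from different
    # repositories or refs cannot collide.
    dircache[f"{repository}/{ref}/{resource}".rstrip("/")] = entries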

Improve contributing guide

The contributing guide can be improved:

  • mention of pip-compile to update locked dev dependencies
  • update pytest invocation based on the changes in #2 (since it switches to a bundled lakeFS test container)

Clean up `LakeFSClient` stub

#9 bumped the minimum lakeFS client version to v0.105.0, which means that typing information is now available for the client. Thus, the client.{py,pyi} files can go.

Also in that release came the deprecation of client API members without _api suffixes, so we should migrate the usage of client APIs to the _api suffix variants.

Consider class-level hook support

What might be nice is to be able to define hooks on a class level like so:

from lakefs_spec import LakeFSFileSystem

LakeFSFileSystem.register_hook("put_file", ...)

fs = LakeFSFileSystem()

print(fs.hooks) # <- prints {FSEvent.PUT_FILE: <function ...>}

Now the acrobatic part of this would be to still retain instance-level hooks side by side. That is, registering a hook on fs in the above example should not make it into the LakeFSFileSystem hook registry.

This would require some fiddling with dict objects, classmethods, and a global hook registry. Also, as a potential side effect, if a file system inherits hooks from its class, and then has one registered, the is check (memory address equality) between it and an identically constructed file system instance will likely fail:

fs = LakeFSFileSystem()

fs.register_hook("put_file", ...)

fs2 = LakeFSFileSystem()

fs is fs2  # False

Opinions welcome!
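
A rough, self-contained sketch of how class-level and instance-level registries could coexist; HookedFileSystem is a simplified stand-in, not the real LakeFSFileSystem:

from typing import Callable

class HookedFileSystem:
    """Simplified stand-in for LakeFSFileSystem, showing only hook handling."""

    _class_hooks: dict = {}

    def __init__(self) -> None:
        # Copy the class-level registry so instance registrations do not leak back.
        self.hooks = dict(type(self)._class_hooks)

    @classmethod
    def register_class_hook(cls, event: str, hook: Callable) -> None:
        cls._class_hooks[event] = hook

    def register_hook(self, event: str, hook: Callable) -> None:
        self.hooks[event] = hook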

Improve commit hook abstraction

Right now, we only supply the fsspec event name, and the resource that is being mutated.

However, to craft a better commit message automatically, more information would be nice. This could include:

  • The branch name that is being committed to.
  • The name of the repository holding the branch.
  • The diff that is being committed.

We already have all the above information present (with the changes to the commit hook logic in #26), so this should not be hard to do - only the interface change needs to be implemented (and communicated!)

Implementation

  • Decide what of the above info to include for commit hooks.
  • Updating the CommitHook type hint in the src/lakefs_spec/commithook.py file.
  • Updating the Default commit hook to take the new arguments.
  • Updating the commit hook unit test.

Move YAML file existence check into `LakectlConfig`

          > > (also looking at the same method, the error message for missing `PyYAML` is slightly inaccurate: it says that the config file exists, even when it might not)

Hm, good point. We do in fact only call LakectlConfig.read() if the path input exists (see the file system constructor), but to make this portable, the existence check could probably be carried out again in the method?

But this raises the question of how to proceed if path.exists() is False. Return an empty object and risk silent errors, or raise?

For now, we can get away with (a) rewording the error message, and/or (b) reading the file first and then importing PyYAML (this would eliminate a race condition between the existence check and trying to read the file, while still keeping the error message relevant; sketched below).

I wouldn't silently swallow the error and try to be too smart as it might lead to unexpected behavior.

Also happy to move this into a separate issue.

Originally posted by @AdrianoKF in #123 (comment)
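
A minimal sketch of option (b), reading the file before importing PyYAML (the function name is illustrative):

from pathlib import Path

def read_lakectl_config(path):
    # Read first: a missing file now raises FileNotFoundError instead of a
    # misleading "PyYAML is missing" message.
    content = Path(path).expanduser().read_text()
    import yaml  # deferred: only needed once the file definitely exists

    return yaml.safe_load(content)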

Add Tests for implicit Branch creation

Implement tests for the create_branch_ok flag for implicit branch creations. Implemented here.

The tests should cover:

  • with create_branch_ok = True implicit creation of a branch works
  • with create_branch_ok = True pushing to an existing branch works
  • with create_branch_ok = False pushing to an existing branch works
  • with create_branch_ok = False pushing to a non-existing branch throws an error

URL usage not bound to file system instance

Consider the following example:

import pandas as pd
from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem(
    host="http://localhost:8000",
    username="<USERNAME>",
    password="<PASSWORD>",
)

df = pd.DataFrame.from_dict({"a": [0, 1, 2]})

with fs.transaction as tx:
    df.to_csv("lakefs://example/main/data.csv")
    tx.commit("example", "main", "Some data")

I would expect the df.to_csv call to use the LakeFSFileSystem that I have defined above. However, this code fails with urllib3.exceptions.LocationValueError: No host specified.

If I set the LAKEFS_HOST, LAKEFS_USERNAME, and LAKEFS_PASSWORD environment variables, it works. (Probably the same applies when using the configuration in the YAML file)

Version: 0.3.0

Pass configuration arguments instead of client to the `LakeFSFileSystem` constructor

This is similar to, for example, how s3fs does it: https://github.com/fsspec/s3fs/blob/39125f79051c55fc715cf056d8a2d0c7aa9d1c4b/s3fs/core.py#L171

It also has another benefit: Giving the bare options instead of the client should lead to more robust instance caching, since the cache key is now calculated from the config attributes directly and not the client.

Implementation

  • Pick out lakefs_client.Configuration attributes that should be supported.
  • Add them to the LakeFSFileSystem constructor (sketched after this list).
  • Change all example code to take the raw config attributes as storage_options.
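
A hedged sketch of what the constructor could do internally, assuming the lakefs_client Configuration and LakeFSClient names; make_client stands in for the relevant part of __init__:

import lakefs_client
from lakefs_client.client import LakeFSClient

def make_client(host=None, username=None, password=None):
    # The filesystem would accept these raw options and build the client itself,
    # instead of receiving a ready-made client instance.
    configuration = lakefs_client.Configuration(host=host, username=username, password=password)
    return LakeFSClient(configuration)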

Allow implicit branch creation on file uploads / moves

Consider the following use case:

import pandas as pd

storage_options = {"client": client}
df = pd.read_parquet("lakefs://my-repo/main/lakes.parquet", storage_options=storage_options)

... # process the data, add columns, whatever

df.to_parquet("lakefs://my-repo/new-branch/lakes.parquet", storage_options=storage_options)

Here, it might be beneficial to allow the creation of new-branch if it does not exist.

BUT: It might be surprising to silently create branches (think typos, etc.), so this should most likely be a toggle switch with a sensible default behavior.

The branch creation would only be relevant for additive operations, i.e. those that add or duplicate data (e.g. put, mv, cp).

Implementation

  • Add a boolean switch (e.g. create_branch_ok) to the LakeFSFileSystem constructor, with documentation.
  • Add the switch to the LakeFSFileSystem.scope() context manager, same as the other two options.
  • Create the branch via client.branches_api.create_branch before an additive operation if create_branch_ok is True - either by creating it if it does not exist (incurs a branch listing, which might be expensive for repos with many branches) OR by branch-ensuring via an unconditional create (needs client-side error handling; see the sketch after this list).
  • Create the branch in the same way for the LakeFSFile if it is opened in write mode, e.g. in __init__ if mode == "wb" OR in _upload_chunk directly before the put_object call).
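
A hedged sketch of the "branch-ensuring via unconditional create" variant, assuming the lakefs_client BranchCreation model and a 409 Conflict response when the branch already exists:

from lakefs_client.exceptions import ApiException
from lakefs_client.models import BranchCreation

def ensure_branch(client, repository, branch, source="main"):
    try:
        client.branches_api.create_branch(
            repository=repository,
            branch_creation=BranchCreation(name=branch, source=source),
        )
    except ApiException as e:
        if e.status != 409:  # anything but "branch already exists" is a real error
            raise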

Revisit `put_file` and `get_file` implementations

Those were historically the two most important functions which were implemented even before LakeFSFile was.

Other file systems, including the parent AbstractFileSystem, implement these two via LakeFSFile.open, which does make sense. We should study the code and see if we might be able to roll with a solution based on opened LakeFSFiles, too.
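
A minimal sketch of the open()-based approach for put_file, following the pattern used by AbstractFileSystem (get_file would mirror it with the modes swapped):

import shutil

def put_file(self, lpath, rpath, **kwargs):
    # Stream the local file through a LakeFSFile opened in write mode.
    with open(lpath, "rb") as src, self.open(rpath, "wb") as dst:
        shutil.copyfileobj(src, dst)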

Rethink hook execution concept

The more I think about it, the more I am unsure about our "execute hooks anyway after events" approach.

Because, what is the point of creating a commit if a file upload fails? As of now, this is only not exploding on us because we provide the "if diff is empty, abort" escape hatch.

Would it not be more sensible to execute hooks only after operational success? Opinions welcome.

Set up GitHub Actions for the project

In addition to the existing CI setup, we would need a lakeFS instance for integration tests.

Subtasks:

  • Add pre-commit job
    • Primed on pull requests against main and merges into main
    • Bonus: with caching via actions/cache (including tool subcaches)
  • Add pytest job
    • Primed on pull requests against main and merges into main
    • Add a service block for lakeFS setup (local mode should be fine)
    • Bonus: with venv caching via actions/cache (keyed on dev-deps.lock)

Define behavior when requesting unknown repositories and branches

Currently we have little to no coverage of what happens when a user specifies a nonexistent repository or branch in their interactions.

While repositories should probably not be created automatically, it might be a UX benefit to allow for automatic branch creation?

The easiest and most natural option is, of course, to error whenever a non-existent repo or branch is requested.

Compatibility problems for older Python versions

Listing some issues on different Python versions. Reproducer:

pythonX.Y -m pip install --upgrade lakefs-spec
python -c "import lakefs_spec"

Python 3.9:

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Suggested fix: from __future__ import annotations imports in all modules using X | Y-syntax for unions.

Python 3.10:

ImportError: cannot import name 'StrEnum' from 'enum'

Suggested fix: Either use Enum and coerce to string, or do a conditional import on Python version, or inherit from string in older versions since all things we use from StrEnum should be in scope for inheritance from str as well.
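
A sketch of the conditional-import variant of the suggested StrEnum fix:

import sys

if sys.version_info >= (3, 11):
    from enum import StrEnum
else:
    from enum import Enum

    class StrEnum(str, Enum):
        """Minimal fallback: members behave as strings, like enum.StrEnum on 3.11+."""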
