
lakefs-spec's Introduction


lakeFS-spec: An fsspec backend for lakeFS


Welcome to lakeFS-spec, a filesystem-spec backend implementation for the lakeFS data lake. Our primary goal is to streamline versioned data operations in lakeFS, enabling seamless integration with popular data science tools such as Pandas, Polars, and DuckDB directly from Python.

Highlights:

  • Simple repository operations in lakeFS
  • Easy access to underlying storage and versioning operations
  • Seamless integration with the fsspec ecosystem
  • Directly access lakeFS objects from popular data science libraries (including Pandas, Polars, DuckDB, Hugging Face Datasets, PyArrow) with minimal code
  • Transaction support for reliable data version control
  • Smart data transfers through client-side caching (up-/download)
  • Auto-discovery configuration

Note

We are seeking early adopters who would like to actively participate in our feedback process and shape the future of the library. If you are interested in using the library and want to get in touch with us, please reach out via GitHub Discussions.

Installation

lakeFS-spec is published on PyPI; you can install it using your favorite package manager:

$ pip install lakefs-spec
  # or
$ poetry add lakefs-spec

Usage

The following usage examples showcase two major ways of using lakeFS-spec: as a low-level filesystem abstraction, and through third-party (data science) libraries.

For a more thorough overview of the features and use cases for lakeFS-spec, see the user guide and tutorials sections in the documentation.

Low-level: As an fsspec filesystem

The following example shows how to upload a file, create a commit, and read back the committed data using the bare lakeFS filesystem implementation. It assumes you have already created a repository named repo and have lakectl credentials set up on your machine in ~/.lakectl.yaml (see the lakeFS quickstart guide if you are new to lakeFS and need guidance).

from pathlib import Path

from lakefs_spec import LakeFSFileSystem

REPO, BRANCH = "repo", "main"

# Prepare example local data
local_path = Path("demo.txt")
local_path.write_text("Hello, lakeFS!")

# Upload to lakeFS and create a commit
fs = LakeFSFileSystem()  # will auto-discover config from ~/.lakectl.yaml

# Upload a file on a temporary transaction branch
with fs.transaction(repository=REPO, base_branch=BRANCH) as tx:
    fs.put(local_path, f"{REPO}/{tx.branch.id}/{local_path.name}")
    tx.commit(message="Add demo data")

# Read back committed file
f = fs.open(f"{REPO}/{BRANCH}/demo.txt", "rt")
print(f.readline())  # "Hello, lakeFS!"

High-level: Via third-party libraries

A variety of widely-used data science tools are building on fsspec to access remote storage resources and can thus work with lakeFS data lakes directly through lakeFS-spec (see the fsspec docs for details). The examples assume you have a lakeFS instance with the quickstart repository containing sample data available.

# Pandas -- see https://pandas.pydata.org/docs/user_guide/io.html#reading-writing-remote-files
import pandas as pd

data = pd.read_parquet("lakefs://quickstart/main/lakes.parquet")
print(data.head())


# Polars -- see https://pola-rs.github.io/polars/user-guide/io/cloud-storage/
import polars as pl

data = pl.read_parquet("lakefs://quickstart/main/lakes.parquet", use_pyarrow=True)
print(data.head())


# DuckDB -- see https://duckdb.org/docs/guides/python/filesystems.html
import duckdb
import fsspec

duckdb.register_filesystem(fsspec.filesystem("lakefs"))
res = duckdb.read_parquet("lakefs://quickstart/main/lakes.parquet")
res.show()

Contributing

We encourage and welcome contributions from the community to enhance the project. Please check discussions or raise an issue on GitHub for any problems you encounter with the library.

For information on the general development workflow, see the contribution guide.

License

The lakeFS-spec library is distributed under the Apache-2 license.

lakefs-spec's People

Contributors

adrianokf, janwillemkl, leonpawelzik, maciej818, maxmynter, nicholasjng, ozkatz, renesat


lakefs-spec's Issues

Implement proper file upload / download events

Right now, we fire an FSEvent.PUT_FILE event awkwardly from within the _upload_chunk method - for reading, we do not have any hook support.

We should change this by:

  • Defining FSEvent.FILE{UPLOAD,DOWNLOAD}.
  • Firing either of these hooks in LakeFSFile.close, depending on whether mode == wb or mode == rb.
  • Adding test(s).

Investigate (and implement) instance caching by client configuration

fsspec exposes the mechanism of caching filesystem instances, e.g. for reusing API clients. We could implement this by overriding the fsid attribute to give a unique identifier, e.g. based on client configuration.

Objectives:

  • Find a way to access the lakeFS client configuration - this should be possible as fs.client._api.configuration (?).
  • Create an ID (via hashing, object IDs, etc.) out of this config, including only the desired members (e.g., the password probably does not need to go into the ID); a sketch follows this list.
  • Test that the file system is reused, for example in pd.read_parquet calls. This is especially useful for downstream package users.
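
A minimal sketch of deriving such an ID, assuming the configuration object exposes host and username attributes (the attribute names are illustrative, not confirmed lakefs-spec internals):

import hashlib

def make_fsid(configuration):
    # Include only members that identify the target instance; secrets such as
    # the password deliberately do not go into the ID.
    parts = (configuration.host, configuration.username)
    return hashlib.sha256("|".join(str(p) for p in parts).encode()).hexdigest()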

fs.put of an unchanged file does not execute a newly registered Hook

When I add a hook, e.g. on FSEvent.PUT, and then call fs.put(...) with a file that is unchanged on the remote, the commit hook is not executed because the checksums match.

I can force the fs.put with precheck=False.

Do we want hook execution even when a file-upload is skipped?

E.g., git would throw an error ("nothing to commit") when no changes are present. But people might add other functionality in their hooks which they want executed even when nothing changed.

  • If we adapt this, we need to update the Demo.

Set up automatic version management

In order to keep versions consistent across locations (package metadata, __init__.py, tags, releases), we should set up automatic version management.

We have used bump2version for this task in the past with good success, so I would recommend it here. (update: upon closer inspection and taking into account that bump2version isn't actively maintained anymore, it seems that setuptools-scm fits the bill quite well).

Acceptance Criteria

  • Automatic version management has been integrated
  • A version bump and accompanying GH release has been made (see also #14)

Add wheel build & publishing job

🚧 Blocked by PyPI token issuance. 🚧

Implementation details:

  • Primed on published release and manual dispatch
  • Publishing e.g. via pypa/gh-action-pypi-publish (1st party solution)

Acceptance criteria:

  • PyPI token has been set as GH Actions secret PYPI_API_TOKEN / TEST_PYPI_API_TOKEN (for TestPyPI)
  • Package metadata has been correctly set:
  • Added py.typed file, since we are using type hints
  • GitHub action for package publishing (MR pushes -> TestPyPI, tags -> PyPI) has been added
  • Package has been published initially

Improve public API docs coverage

In order to make the generated API docs (more) useful, the code should include docstrings for (at a minimum):

  • All modules: otherwise, subpages in the API reference will appear empty (since they are generated from the module docstring)
  • All public members (functions, classes): e.g., the config module is entirely undocumented
  • Any missing type annotations

Handle backend errors on commit attempts for no changes

Found in #15:

Pushing changes with postcommit = True results in an error when no changes are on the branch. The error:

HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Request-Id': '490cc54c-a8f9-41c4-974b-25e56c9b9aa3', 'Date': 'Tue, 22 Aug 2023 13:54:27 GMT', 'Content-Length': '33'})
HTTP response body: {"message":"commit: no changes"}

I think we should handle that error since we also have a logging statement informing that there was no update to the resource.

Possible fixes:

  1. Use the diff API to check if there are actual changes, abort if the diff is empty.
  2. Force the commit and ignore errors (but maybe at least log them); see the sketch below.
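
A sketch of option 2, tolerating the "no changes" error instead of failing. The lakefs_client names used here (CommitCreation, commits_api.commit) are assumptions based on the error response above, not confirmed against the codebase:

import logging

from lakefs_client.exceptions import ApiException
from lakefs_client.models import CommitCreation

logger = logging.getLogger(__name__)

def safe_commit(client, repository, branch, message):
    try:
        client.commits_api.commit(
            repository=repository,
            branch=branch,
            commit_creation=CommitCreation(message=message),
        )
    except ApiException as e:
        if "no changes" in str(e.body):
            logger.info("No changes on branch %r, skipping commit.", branch)
        else:
            raise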

Implement `cp_file`

Should simply be a wrapper around objects_api.copy_object (a sketch follows the checklist below).

  • Added impl
  • Added test(s)
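
A hedged sketch of the wrapper, assuming the lakefs_client ObjectCopyCreation model and that parse() (the rpath helper mentioned in other issues) returns a (repository, ref, resource) triple:

from lakefs_client.models import ObjectCopyCreation

def cp_file(self, path1, path2, **kwargs):
    src_repo, src_ref, src_path = parse(path1)
    dst_repo, dst_branch, dst_path = parse(path2)
    self.client.objects_api.copy_object(
        repository=dst_repo,
        branch=dst_branch,
        dest_path=dst_path,
        object_copy_creation=ObjectCopyCreation(src_path=src_path, src_ref=src_ref),
    )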

Brainstorm ideas to extend the lakeFS hook concept

Based on internal discussions with AR / JWK.

Support for post-operation commits is nice, but the user might want to do more, like merging a branch back into the source branch if some condition is fulfilled, reverting a previous commit, pushing a tag, etc.

Our current model does not allow for this user flexibility. So, in order to give the user more customization power, we could open up the post-operation commit hook concept into a more general one. For this, the following design questions should be considered:

  • Do we stay with only a post-operation hook, or do we allow a pre-operation hook as well?
  • Do we remain with one hook function, or do we allow a mapping of op type -> hook? This is e.g. what fsspec does with its callbacks, which is very extensible.
  • Do we expose the client altogether, or pre-selected lakeFS operations? It might be best to create a read-only view of the client, so that the file system stays immutable.
  • What kind of support / snippets are we giving to the user? Right now, we only have a very rudimentary commit hook sample, which we could extend with common operations if we decide to open up the concept.

Remove Gitlab CI pipeline file

A leftover Gitlab CI pipeline definition resides in the root directory and should be removed, now that the project is hosted on GitHub.

Related to #2, could be done in the same PR.

Implement `LakeFSFileSystem.ls` caching

The machinery is already in place, now we just need to store results of ls calls in the dircache.

  • Pipe storage_options through the file system constructor to the dircache.
  • Add a check for rpath in the ls cache before the API call.
  • Store objects_api.list_objects results in the dircache within the ls method (sketched after this list).
  • Add test to confirm cache hit.
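
An illustrative sketch of the resulting ls() flow; _ls_from_api is a hypothetical wrapper around the objects_api.list_objects call, not an existing method:

def ls(self, path, detail=True, **kwargs):
    path = self._strip_protocol(path)
    entries = self.dircache.get(path)  # check the listing cache first
    if entries is None:
        entries = self._ls_from_api(path, **kwargs)  # hypothetical list_objects wrapper
        self.dircache[path] = entries  # store the result for later cache hits
    return entries if detail else [e["name"] for e in entries]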

Read lakeFS Client Config from `.lakectl.yaml` (if it Exists)

Inspired by this comment in the LakeFS repo.

We can use the .lakectl.yaml (if it exists) to avoid needing to pass the storage_options parameters whenever we interact with the FS.

The .lakectl.yaml is created when using lakectl (the LakeFS CLI). Docs.

Example content of .lakectl.yaml in /Users/janedoe/.lakectl.yaml:

credentials:
    access_key_id: AKIAIOSFOLQUICKSTART
    secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
metastore:
    glue:
        catalog_id: ""
    hive:
        db_location_uri: file:/user/hive/warehouse/
        uri: ""
server:
    endpoint_url: http://127.0.0.1:8000

The storage_options of lakeFS-spec look like this:

storage_options={
    "host": "localhost:8000",
    "username": "username",
    "password": "password",
}

So we can use the corresponding variables from the .lakectl.yaml if it exists. If not, passing storage_options would be required.
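
A minimal sketch of that mapping, assuming PyYAML is available and using the key names from the example config above (the helper name is illustrative):

from pathlib import Path

import yaml

def lakectl_storage_options(path="~/.lakectl.yaml"):
    config = yaml.safe_load(Path(path).expanduser().read_text())
    return {
        "host": config["server"]["endpoint_url"],
        "username": config["credentials"]["access_key_id"],
        "password": config["credentials"]["secret_access_key"],
    }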

Discuss optional `PyYAML` dependency

I disagree - has any user reported any headaches? With all due respect, running `pip install --upgrade pyyaml` once in the console (which is even suggested if you fail to load an existing config file) is possible for every prospective user of this library.

And since we only require it for an opt-in feature (lakectl configs), adding it as an unconditional requirement is not appropriate.

That being said, if we mention it heavily in the user guide, we can either a) change that to use other auth methods or b) put a small disclaimer in front of the examples (preferred).

Originally posted by @nicholasjng in #129 (comment)

Create an integration test for `pandas` example usage

Our pandas integration is so far our main example, because it showcases how fsspec is useful for completely abstracting away service communication from the data scientist, effectively shrinking DataFrame I/O to <5 lines (or even 1, with our zero-config approach).

We should make sure to write tests for it, to assert that our features work as expected. Concretely, I mean the following:

  • Create the test repo with sample data, so that we get the lakes.parquet example from which we can test DataFrame reads. This works by adding the sample_data=True option to the RepositoryCreation object in conftest.ensurerepo.
  • Add a test for pandas parquet I/O, probably with automatic branch creation.
  • Add a test with an automatic commit (I discovered that this was broken just before the demos today, fwiw - now it's fixed).

Properly translate LakeFS errors into OS errors

Role model can be https://github.com/fsspec/s3fs/blob/main/s3fs/errors.py.

In short, from API exceptions with HTTP codes, we can construct the following mapping between lakeFS and Python errors:

NotFoundException (HTTP404) -> FileNotFoundError
ForbiddenException (HTTP403) -> PermissionError
UnauthorizedException (HTTP401) -> PermissionError

List to be amended by other backend errors from the Python docs.

Algorithm can be as follows:

  1. Construct error code to exception mapping (all backend errors have HTTP codes set as status)
  2. Write a translate_lakefs_error function similar to the boto one above,
  3. Wrap client calls in a try-except on the ApiException base type, translate the error on failure, and raise the translated error instead (see the sketch below).
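
A sketch of steps 1 and 2, modeled on the s3fs approach linked above; the lakefs_client exception attributes (status, body) are assumptions:

from lakefs_client.exceptions import ApiException

HTTP_CODE_TO_ERROR = {
    401: PermissionError,  # UnauthorizedException
    403: PermissionError,  # ForbiddenException
    404: FileNotFoundError,  # NotFoundException
}

def translate_lakefs_error(error: ApiException) -> OSError:
    exc_class = HTTP_CODE_TO_ERROR.get(error.status, OSError)
    return exc_class(error.status, str(error.body))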

Add auto-commit mode

For a new user of the library, having to register a hook in order to make a commit after uploading a file can be a bit daunting, since it requires understanding multiple unrelated concepts. There is also a danger of forgetting to create a commit at all, which might lead to unexpected repository states when using the library.

We currently have client_utils.commit(), but that involves pulling the client from a filesystem instance (and thus, does not work easily with LakeFSFile) and manually keeping track of repo and branch names.

Instead, we could offer a simple auto_commit=True option to a filesystem (or a file, as part of close()), which would simply create a new commit on the active branch for all modification operations (i.e., put, rm, cp_file). We could even harness the existing hook system for this and automatically register a commit_file hook for the relevant FSEvents. (however, this would probably warrant a revisit of the hook system, such that multiple hooks can be registered independently for a given event to enable composability).

Proposed API changes

  • Introduce an auto_commit: bool argument in LakeFSFileSystem
  • Introduce an auto_commit: bool argument in LakeFSFile
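
A hypothetical usage example of this proposal; auto_commit does not exist yet and only illustrates the suggested API:

from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem(auto_commit=True)  # proposed option, not an existing argument
fs.put("demo.txt", "repo/main/demo.txt")  # would create a commit on the branch automatically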

Keep only single `development` docs version for `main` branch

The documentation versions for the main branch are quite distracting and in fact bury the proper releases (which come at the end of the list).


We should modify the CI pipeline to publish a single unstable alias from main, instead of deriving the version number from the build as we currently do.

`ls()` should return fully-qualified paths with repo/ref

When calling ls(), the paths for items returned in the result are not prefixed with the repository and ref (as is the case for the underlying API endpoint). However, this means these paths cannot be used in other lakefs-spec API calls, since they all expect a fully-qualified rpath (as validated by parse()).

Example (assume the repo contains a folder data, containing a single file 1.txt):

items = fs.ls("repo/main/data")
assert items[0]["name"] == "repo/main/data/1.txt"  # AssertionError!

Since ls() is used under the hood by the AbstractFileSystem base class for a variety of other operations (at least find(), walk(), glob(), but also get(..., recursive=True)), these are broken by extension as well (since they might either return incorrect data, or in the case of put() pass an unqualified path to info(), which fails the validation in parse()).

A possible solution is to prefix the items returned by the lakeFS API with the repo and ref in ls() (a single-line fix; a sketch follows the failing test case below). However, extra care needs to be taken to make sure this behavior works correctly with the directory listing cache.

Failing test case:

def test_ls(
    random_file_factory: RandomFileFactory,
    fs: LakeFSFileSystem,
    repository: str,
    temp_branch: str,
) -> None:
    random_file = random_file_factory.make()

    prefix = f"{repository}/{temp_branch}/find_{uuid.uuid4().hex.lower()[:6]}"
    fs.put(str(random_file), f"{prefix}/{random_file.name}")
    files = fs.ls(f"{prefix}/")

    assert len(files) == 1
    assert files[0]["name"] == f"{prefix}/{random_file.name}"
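
As mentioned above, a sketch of the prefixing fix could look like the following (the entry dict shape is assumed from the test case):

def _qualify(entries, repository, ref):
    # Prefix each API result with "repo/ref/" so downstream fsspec operations
    # (find, walk, glob, ...) receive fully-qualified rpaths.
    return [{**entry, "name": f"{repository}/{ref}/{entry['name']}"} for entry in entries]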

Allow `LakeFSFile`s to decay to standard block store files on request

LakeFS has APIs that allow a user to upload files directly to the underlying block storage and link them to a resource in the repository.

We could support this, and allow users to put files directly into the block store - with this, we could also support multipart uploads through the actual block storage file systems.

  • Figure out where to include multiple dispatch of files, including user options to enable / request it.
  • Implement dispatch of the target file system as well as import guard if necessary.
  • Add test(s).

Allow passing Path-like inputs as local paths

In #123, we added type hints that allow passing other Path-like inputs besides str to filesystem operations. However, they are not currently handled correctly internally when passing a Path, e.g., in put():

from lakefs_spec import LakeFSFileSystem
from pathlib import Path

fs = LakeFSFileSystem()
fs.put(Path("demo.txt"), "repo/main/demo.txt")
Traceback (most recent call last):
  File "/home/adriano/tmp/poetry-test/demo.py", line 5, in <module>
    fs.put(Path("demo.txt"), "repo/main/demo.txt")
  File "/home/adriano/tmp/poetry-test/.venv/lib/python3.11/site-packages/lakefs_spec/spec.py", line 570, in put
    super().put(
  File "/home/adriano/tmp/poetry-test/.venv/lib/python3.11/site-packages/fsspec/spec.py", line 1038, in put
    lpaths = fs.expand_path(lpath, recursive=recursive, maxdepth=maxdepth)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adriano/tmp/poetry-test/.venv/lib/python3.11/site-packages/fsspec/spec.py", line 1153, in expand_path
    path = [self._strip_protocol(p) for p in path]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'PosixPath' object is not iterable

All methods that take a str | os.PathLike[str] | pathlib.Path parameter should be examined and adapted to correctly work with all types of allowed inputs.
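
A minimal sketch of a normalization helper that the affected methods could apply to their local path arguments before delegating to the fsspec base class; stringify_path is a hypothetical name:

import os

def stringify_path(path):
    # pathlib.Path and other os.PathLike objects are converted; strings pass through.
    return os.fspath(path) if isinstance(path, os.PathLike) else path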

local `pip-compile` on Mac causes CI failures

After yesterday, I also finally understood what the failure in PR #107 was (link to the failed job).

TLDR: pip-compile is platform-dependent, and ipykernel from the docs has a Mac-only dependency:

https://github.com/ipython/ipykernel/blob/966e0a41fc61e7850378ae672e28202eb29b10b0/pyproject.toml#L36-L39

This means a) that a pre-commit run on a Mac currently causes CI failures, specifically for the requirements-docs.txt compile, and b) that lockfiles are platform-dependent (maybe not so surprising).

I don't have any suggestion available immediately, since pip-compile apparently does not expose an option to compile for a specific platform.

Add support for presigned URLs in file operations

Idea from treeverse/lakeFS#6469.

The lakeFS client supports pre-signed URLs by the presign boolean argument to many object-based APIs like list_objects, get_object, put_object, delete_object(s). We should support that!

Considerations:

  • The pre-sign feature has been added as a filesystem-wide feature flag (e.g. presign, use_presign, presign_urls...). presign support was added for each API that supports it (list_objects, get_object, stat_object).
  • The packaging implications have been observed (e.g., will this require boto3 as dependency?) and dealt with.
  • Tests for pre-signed URLs were added.

Directory listing cache issues between multiple branches/repos

The directory listing cache in the LakeFSFileSystem.ls() implementation uses the relative path (without the repo/ref prefix!) as a cache key for the dircache.

This means that calling ls() for the same directory across multiple branches (or even repos) might lead to false cache hits, when in reality the listings might be completely different.

Here's a failing test case:

def test_ls_caching_regression(fs: LakeFSFileSystem, repository: str) -> None:
    fs.client, counter = with_counter(fs.client)

    testdir = "data"

    listing = fs.ls(f"{repository}/main/{testdir}/")
    assert len(fs.dircache) == 1
    assert tuple(fs.dircache.keys()) == (testdir,)

    listing2 = fs.ls(f"{repository}-foobar/main/{testdir}/")
    assert len(fs.dircache) == 2  # Fails currently

    assert listing != listing2   # Fails currently

    # second `ls` call should not hit the cache
    assert counter.count("objects_api.list_objects") == 2  # Fails currently
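
A minimal sketch of the corresponding fix, assuming the entries and the repository/ref/resource triple are available where the dircache is populated (the helper name is illustrative):

def cache_listing(dircache, repository, ref, resource, entries):
    # Key the cache on the fully-qualified path so listings from different
    # repositories or refs cannot collide.
    dircache[f"{repository}/{ref}/{resource}".rstrip("/")] = entries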

Improve contributing guide

The contributing guide can be improved:

  • mention of pip-compile to update locked dev dependencies
  • update pytest invocation based on the changes in #2 (since it switches to a bundled lakeFS test container)

Clean up `LakeFSClient` stub

#9 bumped the minimum lakeFS client version to v0.105.0, which means that typing information is now available for the client. Thus, the client.{py,pyi} files can go.

Also in that release came the deprecation of client API members without _api suffixes, so we should migrate the usage of client APIs to the _api suffix variants.

Consider class-level hook support

What might be nice is to be able to define hooks on a class level like so:

from lakefs_spec import LakeFSFileSystem

LakeFSFileSystem.register_hook("put_file", ...)

fs = LakeFSFileSystem()

print(fs.hooks) # <- prints {FSEvent.PUT_FILE: <function ...>}

Now the acrobatic part of this would be to still retain instance-level hooks side by side. That is, registering a hook on fs in the above example should not make it into the LakeFSFileSystem hook registry.

This would require some fiddling with dict objects, classmethods, and a global hook registry. Also, as a potential side effect, if a file system inherits hooks from its class, and then has one registered, the is check (memory address equality) between it and an identically constructed file system instance will likely fail:

fs = LakeFSFileSystem()

fs.register_hook("put_file", ...)

fs2 = LakeFSFileSystem()

fs is fs2  # False

Opinions welcome!
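
A rough, self-contained sketch of how class-level and instance-level registries could coexist; HookedFileSystem is a simplified stand-in, not the real LakeFSFileSystem:

from typing import Callable

class HookedFileSystem:
    """Simplified stand-in for LakeFSFileSystem, showing only hook handling."""

    _class_hooks: dict = {}

    def __init__(self) -> None:
        # Copy the class-level registry so instance registrations do not leak back.
        self.hooks = dict(type(self)._class_hooks)

    @classmethod
    def register_class_hook(cls, event: str, hook: Callable) -> None:
        cls._class_hooks[event] = hook

    def register_hook(self, event: str, hook: Callable) -> None:
        self.hooks[event] = hook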

Improve commit hook abstraction

Right now, we only supply the fsspec event name, and the resource that is being mutated.

However, to craft a better commit message automatically, more information would be nice. This could include:

  • The branch name that is being committed to.
  • The name of the repository holding the branch.
  • The diff that is being committed.

We already have all the above information present (with the changes to the commit hook logic in #26), so this should not be hard to do - only the interface change needs to be implemented (and communicated!)

Implementation

  • Decide what of the above info to include for commit hooks.
  • Updating the CommitHook type hint in the src/lakefs_spec/commithook.py file.
  • Updating the Default commit hook to take the new arguments.
  • Updating the commit hook unit test.

Move YAML file existence check into `LakectlConfig`

          > > (also looking at the same method, the error message for missing `PyYAML` is slightly inaccurate: it says that the config file exists, even when it might not)

Hm, good point. We do in fact only call LakectlConfig.read() if the path input exists (see the file system constructor), but to make this portable, the existence check could probably be carried out again in the method?

But this raises the question of how to proceed if path.exists() is False. Return an empty object and risk silent errors, or raise?

For now, we can get away with (a) rewording the error message, and/or (b) reading the file first and then importing PyYAML (this would eliminate a race condition between the existence check and trying to read the file, while still keeping the error message relevant; sketched below).

I wouldn't silently swallow the error and try to be too smart as it might lead to unexpected behavior.

Also happy to move this into a separate issue.

Originally posted by @AdrianoKF in #123 (comment)
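
A minimal sketch of option (b), reading the file before importing PyYAML (the function name is illustrative):

from pathlib import Path

def read_lakectl_config(path):
    # Read first: a missing file now raises FileNotFoundError instead of a
    # misleading "PyYAML is missing" message.
    content = Path(path).expanduser().read_text()
    import yaml  # deferred: only needed once the file definitely exists

    return yaml.safe_load(content)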

Add Tests for implicit Branch creation

Implement tests for the create_branch_ok flag for implicit branch creations. Implemented here.

The tests should cover:

  • with create_branch_ok = True implicit creation of a branch works
  • with create_branch_ok = True pushing to an existing branch works
  • with create_branch_ok = False pushing to an existing branch works
  • with create_branch_ok = False pushing to a non-existing branch throws an error

URL usage not bound to file system instance

Consider the following example:

import pandas as pd
from lakefs_spec import LakeFSFileSystem

fs = LakeFSFileSystem(
    host="http://localhost:8000",
    username="<USERNAME>",
    password="<PASSWORD>",
)

df = pd.DataFrame.from_dict({"a": [0, 1, 2]})

with fs.transaction as tx:
    df.to_csv("lakefs://example/main/data.csv")
    tx.commit("example", "main", "Some data")

I would expect the df.to_csv call to use the LakeFSFileSystem that I have defined above. However, this code fails with urllib3.exceptions.LocationValueError: No host specified.

If I set the LAKEFS_HOST, LAKEFS_USERNAME, and LAKEFS_PASSWORD environment variables, it works. (Probably the same applies when using the configuration in the YAML file)

Version: 0.3.0

Pass configuration arguments instead of client to the `LakeFSFileSystem` constructor

This is similar to, for example, how s3fs does it: https://github.com/fsspec/s3fs/blob/39125f79051c55fc715cf056d8a2d0c7aa9d1c4b/s3fs/core.py#L171

It also has another benefit: Giving the bare options instead of the client should lead to more robust instance caching, since the cache key is now calculated from the config attributes directly and not the client.

Implementation

  • Pick out lakefs_client.Configuration attributes that should be supported.
  • Add them to the LakeFSFileSystem constructor (sketched after this list).
  • Change all example code to take the raw config attributes as storage_options.
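
A hedged sketch of what the constructor could do internally, assuming the lakefs_client Configuration and LakeFSClient names; make_client stands in for the relevant part of __init__:

import lakefs_client
from lakefs_client.client import LakeFSClient

def make_client(host=None, username=None, password=None):
    # The filesystem would accept these raw options and build the client itself,
    # instead of receiving a ready-made client instance.
    configuration = lakefs_client.Configuration(host=host, username=username, password=password)
    return LakeFSClient(configuration)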

Allow implicit branch creation on file uploads / moves

Consider the following use case:

import pandas as pd

storage_options = {"client": client}
df = pd.read_parquet("lakefs://my-repo/main/lakes.parquet", storage_options=storage_options)

... # process the data, add columns, whatever

df.to_parquet("lakefs://my-repo/new-branch/lakes.parquet", storage_options=storage_options)

Here, it might be beneficial to allow the creation of new-branch if it does not exist.

BUT: It might be surprising to silently create branches (think typos, etc.), so this should most likely be a toggle switch with a sensible default behavior.

The branch creation would only be relevant for additive operations, i.e. those that add or duplicate data (e.g. put, mv, cp).

Implementation

  • Add a boolean switch (e.g. create_branch_ok) to the LakeFSFileSystem constructor, with documentation.
  • Add the switch to the LakeFSFileSystem.scope() context manager, same as the other two options.
  • Create the branch via client.branches_api.create_branch before an additive operation if create_branch_ok is True - either by creating it if it does not exist (incurs a branch listing, which might be expensive for repos with many branches) OR by branch-ensuring via an unconditional create (needs client-side error handling; see the sketch after this list).
  • Create the branch in the same way for the LakeFSFile if it is opened in write mode, e.g. in __init__ if mode == "wb" OR in _upload_chunk directly before the put_object call).
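
A hedged sketch of the "branch-ensuring via unconditional create" variant, assuming the lakefs_client BranchCreation model and a 409 Conflict response when the branch already exists:

from lakefs_client.exceptions import ApiException
from lakefs_client.models import BranchCreation

def ensure_branch(client, repository, branch, source="main"):
    try:
        client.branches_api.create_branch(
            repository=repository,
            branch_creation=BranchCreation(name=branch, source=source),
        )
    except ApiException as e:
        if e.status != 409:  # anything but "branch already exists" is a real error
            raise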

Revisit `put_file` and `get_file` implementations

Those were historically the two most important functions which were implemented even before LakeFSFile was.

Other file systems, including the parent AbstractFileSystem, implement these two via LakeFSFile.open, which does make sense. We should study the code and see if we might be able to roll with a solution based on opened LakeFSFiles, too.
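
A minimal sketch of the open()-based approach for put_file, following the pattern used by AbstractFileSystem (get_file would mirror it with the modes swapped):

import shutil

def put_file(self, lpath, rpath, **kwargs):
    # Stream the local file through a LakeFSFile opened in write mode.
    with open(lpath, "rb") as src, self.open(rpath, "wb") as dst:
        shutil.copyfileobj(src, dst)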

Rethink hook execution concept

The more I think about it, the more I am unsure about our "execute hooks anyway after events" approach.

Because, what is the point of creating a commit if a file upload fails? As of now, this is only not exploding on us because we provide the "if diff is empty, abort" escape hatch.

Would it not be more sensible to execute hooks only after operational success? Opinions welcome.

Set up GitHub Actions for the project

In addition to the existing CI setup, we would need a lakeFS instance for integration tests.

Subtasks:

  • Add pre-commit job
    • Primed on pull requests against main and merges into main
    • Bonus: with caching via actions/cache (including tool subcaches)
  • Add pytest job
    • Primed on pull requests against main and merges into main
    • Add a service block for lakeFS setup (local mode should be fine)
    • Bonus: with venv caching via actions/cache (keyed on dev-deps.lock)

Define behavior when requesting unknown repositories and branches

Currently we have little to no coverage of what happens when a user specifies a nonexistent repository or branch in their interactions.

While repositories should probably not be created automatically, it might be a UX benefit to allow for automatic branch creation?

The easiest and most natural option is, of course, to error whenever a non-existent repo or branch is requested.

Compatibility problems for older Python versions

Listing some issues on different Python versions. Reproducer:

pythonX.Y -m pip install --upgrade lakefs-spec
python -c "import lakefs_spec"

Python 3.9:

TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

Suggested fix: from __future__ import annotations imports in all modules using X | Y-syntax for unions.

Python 3.10:

ImportError: cannot import name 'StrEnum' from 'enum'

Suggested fix: Either use Enum and coerce to string, or do a conditional import on Python version, or inherit from string in older versions since all things we use from StrEnum should be in scope for inheritance from str as well.
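
A sketch of the conditional-import variant of the suggested StrEnum fix:

import sys

if sys.version_info >= (3, 11):
    from enum import StrEnum
else:
    from enum import Enum

    class StrEnum(str, Enum):
        """Minimal fallback: members behave as strings, like enum.StrEnum on 3.11+."""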
