elliptio_data_lake's Introduction

About

ElliptIO is a small Python library for storing and accessing files in data lakes in a data science context. It stores files, together with automatically generated metadata, on any file system and inserts the metadata into a database. A lot of inspiration is drawn from Weights & Biases.

It is named after the Elliptio mussel genus which lives in freshwater lakes.

Problems and solution approach

Particularly in data science, you often find data lakes where...

| Problem | Solution approach |
| --- | --- |
| data cannot be reproduced | Automatically log required information. |
| data lineage is unknown | Automatically track lineage between files. |
| data is accidentally modified | Lock files using S3 lock. |
| data has no metadata | Users can specify custom metadata when saving files. |
| directory structure is chaotic | Simply save files by date and user. A good metadata search makes structure much less important. |
| data is duplicated | Automatically replace duplicated files with references (not yet implemented). |
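
To make the "save files by date and user" idea concrete, here is a minimal sketch of how such a path could be built. The function name `build_remote_path` and the exact layout (`<root>/<YYYY>/<MM>/<DD>/<user>/<HHMMSS>_<filename>`) are assumptions for illustration only and are not necessarily what ElliptIO does internally.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def build_remote_path(root: str, user: str, filename: str, now: datetime) -> str:
    """Build a date-and-user based path (hypothetical layout, not ElliptIO's own)."""
    date_part = now.strftime("%Y/%m/%d")  # one directory level per year, month, day
    time_part = now.strftime("%H%M%S")    # time prefix keeps same-day files apart
    return str(PurePosixPath(root) / date_part / user / f"{time_part}_{filename}")


example = build_remote_path(
    root="/tmp/my_data_lake",
    user="alice",
    filename="train.txt",
    now=datetime(2024, 1, 31, 12, 0, 5, tzinfo=timezone.utc),
)
print(example)  # /tmp/my_data_lake/2024/01/31/alice/120005_train.txt
```

With a layout like this nobody has to agree on a folder hierarchy up front. Finding a file is left to the metadata search.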

Existing solutions

I find Weights & Biases a great app, and a lot of inspiration is drawn from it. However, it can be rather expensive and covers much more than data storage, so it can easily be overkill.

Object stores such as S3 or Ceph already provide the option to store metadata. However, this does not cover all required data for reproducibility. Also, querying metadata is not as efficient as querying a database.
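
The sketch below shows why a database is convenient for metadata search: the metadata sits in an indexed table and can be filtered with a single query. It uses only Python's built-in `sqlite3` module. The table name `files` and its columns are assumptions for illustration, not ElliptIO's actual schema.

```python
import json
import sqlite3

# In-memory database with a hypothetical metadata table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (file_id TEXT PRIMARY KEY, ticket TEXT, project TEXT, config TEXT)"
)
# An index on the searched column lets lookups avoid scanning every row.
conn.execute("CREATE INDEX idx_files_ticket ON files (ticket)")

rows = [
    ("id-1", "abc-123", "my_project", json.dumps({"example": "value"})),
    ("id-2", "abc-123", "my_project", json.dumps({"example": "other"})),
    ("id-3", "xyz-999", "other_project", json.dumps({})),
]
conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", rows)

# Find all files that belong to one ticket.
matching_ids = [
    row[0]
    for row in conn.execute(
        "SELECT file_id FROM files WHERE ticket = ? ORDER BY file_id", ("abc-123",)
    )
]
print(matching_ids)  # ['id-1', 'id-2']
```

Answering the same question from object-store metadata alone would typically mean listing the objects and reading each one's metadata.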

How to use

import json
import pandas as pd
from elliptio import get_handler, ManualMetadata

# set up manual metadata (optional) and handler
metadata = ManualMetadata(
    ticket="abc-123",
    project="my_project",
    config=json.dumps({"example": "value"}),
    description="lorem ipsum",
)
h = get_handler(dirpath="/tmp/my_data_lake", manual_metadata=metadata)

# save file directly to remote
df = pd.DataFrame({"a": [1], "b": [2]})
with h.create("train.txt") as f:
    df.to_csv(f.remote_url)

# load file. Its file_id will be added to every new file in this session.
train_file = h.load(f.file_id)

# upload an existing local file as a new file in the data lake
# model.train(train_file)
with h.create("model.pickle") as model:
    model.upload("/tmp/my_data_lake/best_model.pickle")
assert model.based_on == [train_file.file_id]

# querying the database
df = h.query({"ticket": "abc-123"})
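
The `based_on` assertion in the example above reflects automatic lineage tracking: every file loaded in a session is recorded as a parent of files created afterwards. The following toy model illustrates that idea only. `Session` and `FileRecord` are hypothetical names, nothing is written to disk, and this is not ElliptIO's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List
from uuid import uuid4


@dataclass
class FileRecord:
    """Minimal stand-in for a stored file and its lineage metadata."""
    name: str
    file_id: str = field(default_factory=lambda: uuid4().hex)
    based_on: List[str] = field(default_factory=list)


class Session:
    """Remembers which files were loaded and attaches them to new files."""

    def __init__(self) -> None:
        self._records: Dict[str, FileRecord] = {}
        self._loaded_ids: List[str] = []

    def create(self, name: str) -> FileRecord:
        # Copy the list so later loads do not change existing records.
        record = FileRecord(name=name, based_on=list(self._loaded_ids))
        self._records[record.file_id] = record
        return record

    def load(self, file_id: str) -> FileRecord:
        record = self._records[file_id]
        if file_id not in self._loaded_ids:
            self._loaded_ids.append(file_id)
        return record


session = Session()
train = session.create("train.txt")
session.load(train.file_id)
model = session.create("model.pickle")
print(model.based_on == [train.file_id])  # True
```

Here `train` has an empty `based_on` list because nothing had been loaded when it was created.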

How to install

Simply run pip install elliptio.

Tips

  • You can easily pass custom filesystem, database, tracker and id_creator classes to get_handler.
  • The current filesystem class is based on fsspec and thus should support all their filesystem implementations (S3, Azure Blob service, Google Cloud Storage, etc.). See example below.
  • To create a nice GUI for your database, I can recommend Metabase. Metabase, without the enterprise features, is AGPL-licensed. You have to be careful when modifying the code or incorporating it into your application, but running the unmodified app internally in "vanilla mode" seems to be fine according to them.
  • The terraform/ directory contains example Terraform code to set up S3 and a free MongoDB on AWS. However, there is currently no MongoDB implementation of the DatabaseInterface.

# Example for passing custom FileSystemInterfaces like S3
from elliptio.adapters import fs, db
from elliptio import get_handler

h = get_handler(
    fs=fs.FsspecFilesystem(prefix="some/prefix/", protocol="s3", storage_options={}),
    db=db.SqlDatabase("db.sqlite"),
)

TODOs
