tabular-pipeline

Data pipeline

Data pipeline to ingest, conform, normalise, validate and export tabular data files (for now), given yml schema(s) as the only source of truth. It should be easy to plug into modern orchestration tools such as Airflow and Dagster.

โš ๏ธ This is a work in progress

Steps

  • Read - currently csv and xlsx
  • Conform - detect column headers, normalise (according to schema rules/accepted aliases) and ignore irrelevant ones
  • Normalise - normalise column contents according to schema data types
    • define accepted data formats (int, float, str, date, year-month, categorical)
    • allow schemas to extend said formats
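
The Conform and Normalise steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `SCHEMA` dict (standing in for a yml schema), the `CONVERTERS` table, and the function names `read_csv`, `conform` and `normalise` are all hypothetical, and only a few of the accepted formats are covered.

```python
import csv
import io
from datetime import datetime

# Hypothetical schema, written as a dict here; in the project it would be
# loaded from a yml file. The key names ("aliases", "type") are assumptions.
SCHEMA = {
    "name": {"aliases": ["name", "full name"], "type": "str"},
    "age": {"aliases": ["age", "years"], "type": "int"},
    "joined": {"aliases": ["joined", "start date"], "type": "date"},
}

# Hypothetical converters for some of the accepted data formats.
CONVERTERS = {
    "str": str.strip,
    "int": int,
    "float": float,
    "date": lambda v: datetime.strptime(v.strip(), "%Y-%m-%d").date(),
}


def read_csv(text):
    """Read: parse CSV text into a list of {header: value} dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def conform(rows, schema):
    """Conform: rename aliased headers to schema names, drop other columns."""
    lookup = {
        alias.lower(): column
        for column, spec in schema.items()
        for alias in spec["aliases"]
    }
    conformed = []
    for row in rows:
        out = {}
        for header, value in row.items():
            column = lookup.get(header.strip().lower())
            if column is not None:  # headers without a known alias are ignored
                out[column] = value
        conformed.append(out)
    return conformed


def normalise(rows, schema, converters=CONVERTERS):
    """Normalise: convert each value using the converter for its schema type."""
    return [
        {col: converters[schema[col]["type"]](value) for col, value in row.items()}
        for row in rows
    ]


raw = "Full Name,Years,Start Date,Notes\n Ada ,36,2021-03-01,ignore me\n"
rows = normalise(conform(read_csv(raw), SCHEMA), SCHEMA)
```

Passing a different `converters` dict to `normalise` is one possible way a schema could extend the accepted formats, for example by adding a `"year-month"` entry.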

TODO

  • Try this with airflow - use S3 operator to store data in between
  • Try this with dagster ?
  • Try this with pyodide ?
  • make it work for multiple datasets - many files and many tables within each file
  • allow other export formats - for now, data is transferred as csv, but we might want to use a binary format, like feather (which might require pandas)
  • try with external data sources (like an S3 bucket) - one should be able to read from and write to them easily
  • return data directly - it might be the case that there's no need to store the data between steps - in that case it should be returned directly.
