carefree-data's Introduction

carefree-data

carefree-data implements a data processing module built on numpy.

Update 2021.02.04

carefree-data now uses datatable as its backend, which significantly improves performance on file inputs!

Why carefree-data?

carefree-data is a data processing module capable of handling 'dirty' and 'messy' datasets.

For tabular datasets, carefree-data is able to:
  • Elegantly deal with data pre-processing.
    • A Recognizer to recognize whether a column is STRING, NUMERICAL or CATEGORICAL.
    • A Converter to convert a column into a friendly format (["one", "two"] -> [0, 1]).
    • A Processor to further process columns (OneHot, Normalize, MinMax, ...).
    • All of these transforms can be inverted! (See tests\unittests\test_tabular.py -> test_recover_labels & test_recover_features).
    • All of these procedures are completed AUTOMATICALLY!
  • Handle datasets saved in files (.txt, .csv).
    • For .txt, " " will be the default delimiter.
    • For .csv, "," will be the default delimiter, and the first row will be skipped by default.
    • The delimiter, the label index, and whether to skip the first row can all be set manually.
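
To make the Converter and Processor ideas above concrete, here is a minimal, library-independent sketch of an invertible categorical encoding and an invertible MinMax scaling. All function names below are hypothetical and chosen for illustration; this is not carefree-data's implementation or API, only an assumption about what "convert, process, and recover" could look like.

```python
from typing import Dict, List, Tuple


def fit_categorical(column: List[str]) -> Dict[str, int]:
    """Assign integer codes to values in order of first appearance."""
    mapping: Dict[str, int] = {}
    for value in column:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def encode(column: List[str], mapping: Dict[str, int]) -> List[int]:
    """Convert string values to their integer codes."""
    return [mapping[value] for value in column]


def decode(codes: List[int], mapping: Dict[str, int]) -> List[str]:
    """Invert `encode` by flipping the mapping."""
    inverse = {code: value for value, code in mapping.items()}
    return [inverse[code] for code in codes]


def min_max(values: List[float]) -> Tuple[List[float], float, float]:
    """Scale values to [0, 1]; return the statistics needed to invert."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero on constant columns
    return [(v - lo) / span for v in values], lo, hi


def min_max_inverse(scaled: List[float], lo: float, hi: float) -> List[float]:
    """Recover the original values from the scaled ones."""
    span = (hi - lo) or 1.0
    return [v * span + lo for v in scaled]
```

Keeping the fitted mapping and the (lo, hi) statistics around is what makes each step recoverable: decoding and un-scaling only need those stored values, not the original column.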

Pandas-free

There is one more thing we'd like to mention: carefree-data is 'Pandas-free'. Pandas is an open source library providing easy-to-use data structures for structured datasets. Although it is widely used across well-known Machine Learning and Deep Learning modules, we ultimately decided not to depend on it, for the reasons listed below:

  • carefree-data wants to have full control over the data, and Pandas is not flexible enough.
  • carefree-data needs higher performance. Pandas is fast, but not as fast as pure numpy (and sometimes cython) code on some critical code paths.
  • Pandas provides many powerful functions, but carefree-data doesn't need most of them, which means Pandas is a little 'heavy' for carefree-data.

In short, Pandas is a more general library, and that's why we've written our own code to cover our needs instead of using it directly.

Currently carefree-data only supports tabular datasets.

Installation

carefree-data requires Python 3.8 or higher.

pip install carefree-data

or

git clone https://github.com/carefree0910/carefree-data.git
cd carefree-data
pip install -e .

Basic Usage

Get scikit-learn datasets

from cfdata.tabular import TabularDataset

iris = TabularDataset.iris()

Read from array / dataset

from cfdata.tabular import *

iris = TabularDataset.iris()
x, y = iris.xy
assert TabularData().read(x, y) == TabularData.from_dataset(iris)

Read from file

from cfdata.tabular import TabularData

file = "/path/to/your/file"
data = TabularData().read(file)
assert data.processed == data.transform(file)
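
The file-reading defaults described earlier (" " for .txt, "," plus a skipped first row for .csv, all overridable) can be sketched with the standard library alone. The helper below, parse_rows, is a hypothetical name used for illustration and works on in-memory text so it can be run as is; it is an assumption about the behavior, not carefree-data's actual reader.

```python
import csv
import io
import os
from typing import List, Optional


def parse_rows(
    text: str,
    filename: str,
    delimiter: Optional[str] = None,
    skip_first: Optional[bool] = None,
) -> List[List[str]]:
    """Parse delimited text, picking defaults from the file extension."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in (".txt", ".csv"):
        raise ValueError(f"unsupported extension: {ext!r}")
    # Defaults: .csv -> "," and skip the header row; .txt -> " " and keep it.
    if delimiter is None:
        delimiter = "," if ext == ".csv" else " "
    if skip_first is None:
        skip_first = ext == ".csv"
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    return rows[1:] if skip_first else rows
```

Passing delimiter or skip_first explicitly overrides the extension-based default, which mirrors the "could be set manually" behavior listed in the feature overview.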

License

carefree-data is MIT licensed, as found in the LICENSE file.


carefree-data's People

Contributors

carefree0910

Forkers

saizhuowang

carefree-data's Issues

Loading Titanic Dataset

I want to try using your program on the Titanic dataset, but I am unsure how to split the csv file into x and y. I tried the code below, but it did not work:
train = 'titanic_train.csv'
label_column = 'Survived'
data = TabularData().read(train)
assert data.processed == data.transform(train)
x, y = data.processed.xy
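
One library-independent way to approach the question above is to separate the label column from the features by name before handing anything to a model. The sketch below uses only the standard library; split_features_and_labels is a hypothetical helper and the sample rows in the usage are made up for illustration. It does not show carefree-data's own mechanism for choosing the label column, whose exact parameter name is not given here.

```python
import csv
import io
from typing import List, Tuple


def split_features_and_labels(
    text: str, label_column: str
) -> Tuple[List[str], List[List[str]], List[str]]:
    """Split CSV text into (feature names, feature rows, labels)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    label_idx = header.index(label_column)  # raises ValueError if missing
    feature_names = header[:label_idx] + header[label_idx + 1:]
    x: List[List[str]] = []
    y: List[str] = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        y.append(row[label_idx])
        x.append(row[:label_idx] + row[label_idx + 1:])
    return feature_names, x, y


# Made-up sample in the shape of a Titanic-style file.
sample = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
names, x, y = split_features_and_labels(sample, "Survived")
```

For a real file, the text could be obtained with open(path, newline="").read() before calling the helper.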

carefree-data and carefree-learn appear to require numpy == 1.20.0

Importing either of the following fails:

from cfdata.tabular import TabularDataset
import cflearn

With numpy==1.19.2, the failure is raised by the following lines in cython_wrappers.py:

try:
    from .cython_utils import *
except ImportError:
    raise

# with error
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

numpy 1.20 is not yet available in anaconda, but if numpy is uninstalled and numpy==1.20.0 is then installed explicitly via pip, the error disappears.
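
A quick way to check whether an environment is affected is to compare the installed numpy version against 1.20.0 before importing the package. The sketch below is a minimal illustration; the 1.20.0 threshold comes from this report, not from a documented requirement, and the helper names are hypothetical.

```python
from importlib import metadata  # available since Python 3.8
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.19.2' or '1.20.0rc1' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:  # pad so '1.20' compares equal to '1.20.0'
        parts.append(0)
    return tuple(parts)


def is_at_least(installed: str, minimum: str = "1.20.0") -> bool:
    """Return True if the installed version is not older than `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def installed_numpy_version() -> Optional[str]:
    """Return the installed numpy version string, or None if it is absent."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None
```

For example, is_at_least("1.19.2") is False while is_at_least("1.20.0") is True, so the check flags the environment described in this report.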
