carefree-data's Introduction

carefree-data

carefree-data implements a data processing module on top of numpy.

Update 2021.02.04

carefree-data now uses datatable as its backend, which significantly improves performance on file inputs!

Why carefree-data?

carefree-data is a data processing module which is capable of handling 'dirty' and 'messy' datasets.

For tabular datasets, `carefree-data` is able to:

- Elegantly deal with data pre-processing.
  - A Recognizer to recognize whether a column is STRING, NUMERICAL or CATEGORICAL.
  - A Converter to convert a column into a friendly format (["one", "two"] -> [0, 1]).
  - A Processor to further process columns (OneHot, Normalize, MinMax, ...).
  - All the transforms can be inverted! (See tests\unittests\test_tabular.py -> test_recover_labels & test_recover_features).
  - All of these procedures are completed AUTOMATICALLY!
- Handle datasets saved in files (.txt, .csv).
  - For .txt, " " is the default delimiter.
  - For .csv, "," is the default delimiter, and the first row is skipped by default.
  - The delimiter, the label index, and whether to skip the first row can all be set manually.
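The Converter and Processor ideas above, together with the claim that every transform can be inverted, can be illustrated with a small dependency-free sketch. This is not carefree-data's implementation: the function names `fit_converter`, `convert`, `recover`, `min_max` and `min_max_inverse` are hypothetical, and plain Python lists stand in for numpy arrays to keep the example self-contained.

```python
def fit_converter(column):
    """Assign each distinct value an integer code, in order of first appearance."""
    mapping = {}
    for value in column:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def convert(column, mapping):
    """Replace every value with its integer code."""
    return [mapping[value] for value in column]


def recover(codes, mapping):
    """Invert `convert`: turn integer codes back into the original values."""
    inverse = {code: value for value, code in mapping.items()}
    return [inverse[code] for code in codes]


def min_max(values):
    """Scale values into [0, 1]; also return the statistics needed to invert."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a constant column
    return [(v - lo) / span for v in values], (lo, span)


def min_max_inverse(scaled, stats):
    """Invert `min_max` using the stored (minimum, span) statistics."""
    lo, span = stats
    return [s * span + lo for s in scaled]


column = ["one", "two", "one"]
mapping = fit_converter(column)
codes = convert(column, mapping)
print(codes, recover(codes, mapping))

scaled, stats = min_max([2.0, 4.0, 6.0])
print(scaled, min_max_inverse(scaled, stats))
```

The point of the sketch is that each forward transform keeps just enough state (the mapping, or the minimum and span) to be undone later, which is what makes recovering labels and features possible.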

Pandas-free

One more thing worth mentioning: carefree-data is 'Pandas-free'. Pandas is an open source library that provides easy-to-use data structures for structured datasets. Although it is widely used across well-known Machine Learning and Deep Learning libraries, we decided to move away from it for the following reasons:

- carefree-data wants to have full control over the data, and Pandas is not flexible enough.
- carefree-data needs higher performance. Pandas is fast, but not as fast as pure numpy (and sometimes cython) code on some critical code paths.
- Pandas provides many powerful functions, but carefree-data doesn't need most of them, which makes Pandas a little 'heavy' for carefree-data.

In short, Pandas is a more general-purpose library, which is why we wrote our own code to cover our needs instead of relying on it directly.

Currently carefree-data only supports tabular datasets.

Installation

carefree-data requires Python 3.8 or higher.

Install from PyPI:

pip install carefree-data

or install from source:

git clone https://github.com/carefree0910/carefree-data.git
cd carefree-data
pip install -e .

Basic Usages

Get scikit-learn datasets

from cfdata.tabular import TabularDataset

iris = TabularDataset.iris()

Read from array / dataset

from cfdata.tabular import TabularData, TabularDataset

iris = TabularDataset.iris()
x, y = iris.xy
assert TabularData().read(x, y) == TabularData.from_dataset(iris)

Read from file

from cfdata.tabular import TabularData

file = "/path/to/your/file"
data = TabularData().read(file)
assert data.processed == data.transform(file)
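The file conventions listed in the feature overview (space-delimited .txt, comma-delimited .csv with the first row skipped) can be mimicked with the standard library alone. The sketch below is only an illustration of those defaults, not how carefree-data reads files: `read_rows` and `split_xy` are hypothetical helpers, and treating the last column as the label is an assumption made for this example.

```python
import csv
import io

# Defaults taken from the feature list: (delimiter, skip_first) per extension.
DEFAULTS = {".txt": (" ", False), ".csv": (",", True)}


def read_rows(stream, extension, delimiter=None, skip_first=None):
    """Read rows from a text stream, falling back to the per-extension defaults."""
    default_delimiter, default_skip = DEFAULTS[extension]
    delimiter = default_delimiter if delimiter is None else delimiter
    skip_first = default_skip if skip_first is None else skip_first
    rows = [row for row in csv.reader(stream, delimiter=delimiter) if row]
    return rows[1:] if skip_first else rows


def split_xy(rows, label_idx=-1):
    """Split rows into features and labels.

    label_idx=-1 (last column) is an assumption of this sketch.
    """
    x, y = [], []
    for row in rows:
        idx = label_idx % len(row)  # normalize negative indices
        x.append(row[:idx] + row[idx + 1:])
        y.append(row[idx])
    return x, y


csv_text = "a,b,label\n1,2,yes\n3,4,no\n"
rows = read_rows(io.StringIO(csv_text), ".csv")
x, y = split_xy(rows)
print(x, y)
```

Passing `delimiter`, `skip_first` or `label_idx` explicitly overrides the defaults, mirroring the note that these settings can be set manually.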

License

carefree-data is MIT licensed, as found in the LICENSE file.

carefree-data's Issues

Loading Titanic Dataset

I want to try using your program on the Titanic dataset, but I am unsure how to split the CSV file into x and y. I tried the code below, but it did not work:
train = 'titanic_train.csv'
label_column = 'Survived'
data = TabularData().read(train)
assert data.processed == data.transform(train)
x, y = data.processed.xy

carefree-data and carefree-learn appear to require numpy == 1.20.0

Importing either of the following fails with numpy==1.19.2:

from cfdata.tabular import TabularDataset
import cflearn

The failure occurs when these lines in cython_wrappers.py are run:

try:
    from .cython_utils import *
except ImportError:
    raise

# with error
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

numpy 1.20 is not yet available in Anaconda, but if numpy is uninstalled and numpy==1.20.0 is then installed explicitly via pip, the error disappears.
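To check an environment before hitting the import error, the installed numpy version can be compared against the version reported to work. The sketch below uses only the standard library; `version_tuple`, `meets_minimum` and `installed_numpy_version` are hypothetical helpers, and the 1.20.0 threshold is an assumption taken from this report rather than a documented requirement.

```python
from importlib import metadata


def version_tuple(version):
    """Turn '1.19.2' into (1, 19, 2), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:  # pad short versions such as '1.20'
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="1.20.0"):
    """Return True if `installed` is at least `minimum` (assumed threshold)."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_numpy_version():
    """Return the installed numpy version string, or None if it is not installed."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


found = installed_numpy_version()
if found is None:
    print("numpy is not installed")
elif not meets_minimum(found):
    print(f"numpy {found} may be binary-incompatible; consider numpy>=1.20.0")
else:
    print(f"numpy {found} meets the assumed minimum")
```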
