

DSTK, Data Science Toolkit

Dependencies

  • pandas == 0.22.0

Package structure

Inspection

Bivariate inspection

  • chi2 (2 stars)
  • ANOVA (2 stars)
  • T-test (2 stars)
  • IV (3 stars)
  • KS (3 stars)
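
As an illustration of the KS check listed above, here is a minimal NumPy sketch of the two-sample Kolmogorov-Smirnov statistic between a feature's values in the two classes of a binary target. The function name `ks_statistic` and its signature are hypothetical and are not taken from dstk's API.

```python
import numpy as np


def ks_statistic(feature, target):
    """Max gap between the empirical CDFs of `feature` for target == 1 and target == 0.

    Hypothetical helper for illustration; not dstk's implementation.
    """
    feature = np.asarray(feature, dtype=float)
    target = np.asarray(target)
    pos = np.sort(feature[target == 1])
    neg = np.sort(feature[target == 0])
    # Evaluate both empirical CDFs at every observed value.
    grid = np.sort(feature)
    cdf_pos = np.searchsorted(pos, grid, side="right") / len(pos)
    cdf_neg = np.searchsorted(neg, grid, side="right") / len(neg)
    return float(np.max(np.abs(cdf_pos - cdf_neg)))
```

A value near 1 means the feature separates the two classes almost perfectly; a value near 0 means the two class-conditional distributions are nearly identical.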

Check collinearity

  • Collinearity
    • TBD
  • Multicollinearity
    • Variance Inflation Factor (3 stars)
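
The Variance Inflation Factor can be sketched with plain NumPy: regress each column on all the others and compute 1 / (1 - R²). The function name below is an assumption made for illustration and does not reflect dstk's actual interface.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X (hypothetical helper, not dstk's API)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Least-squares fit of column j on an intercept plus the other columns.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared) if r_squared < 1.0 else float("inf"))
    return vifs
```

A VIF close to 1 suggests a column is nearly uncorrelated with the rest. Common rules of thumb treat values above 5 or 10 as a sign of multicollinearity, though the appropriate cutoff depends on the application.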

OOT inspection

  • PSI (3 stars)
  • Dataframe comparison (unit tests-covered)
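
For reference, the Population Stability Index (PSI) compares the binned distribution of a variable in a baseline sample with its distribution in an out-of-time sample. The sketch below assumes quantile bins derived from the baseline and a small floor `eps` to avoid taking the log of zero. Both choices, and the function name `psi`, are illustrative assumptions rather than dstk's implementation.

```python
import numpy as np


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` relative to `expected` (illustrative sketch)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior cut points taken from quantiles of the baseline sample.
    cuts = np.unique(np.percentile(expected, np.linspace(0, 100, n_bins + 1)[1:-1]))
    n_buckets = len(cuts) + 1
    e_counts = np.bincount(np.searchsorted(cuts, expected, side="right"), minlength=n_buckets)
    a_counts = np.bincount(np.searchsorted(cuts, actual, side="right"), minlength=n_buckets)
    # Floor the proportions so empty buckets do not produce log(0).
    e_pct = np.clip(e_counts / e_counts.sum(), eps, None)
    a_pct = np.clip(a_counts / a_counts.sum(), eps, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

Every term of the sum is non-negative, so the index is zero only when the two binned distributions match and grows as they drift apart.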

Data type detector (3 stars)

  • Numeric
  • Numeric-Categorical
  • String-Categorical
  • Time

Outlier detection

Univariate
  • Tukey's method
  • z-test
Multivariate
  • Residual threshold method
  • Local outlier factor
  • HiCS
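
The two univariate methods can be sketched as follows. Tukey's method flags points beyond k × IQR from the quartiles, and the z-based rule flags points whose standardized distance from the mean exceeds a threshold. The function names, the default `k=1.5`, and the default `threshold=3.0` are conventional choices assumed here; they are not taken from dstk.

```python
import numpy as np


def tukey_outliers(values, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def zscore_outliers(values, threshold=3.0):
    """Boolean mask of points whose absolute z-score exceeds `threshold`."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return np.abs(z) > threshold
```

Tukey's fences rely on quartiles, so they are less affected by the outliers themselves than the z-based rule, whose mean and standard deviation are both pulled toward extreme points.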

Preprocessing

Imputing (unit tests-covered)

  • Continuous
    • mean
    • truncated mean
    • median
    • bin-nan
  • Categorical
    • most frequent class
    • stringify
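
A few of the strategies above map directly onto pandas operations. The sketch below covers mean, median, and most-frequent-class imputation only. The function names are hypothetical, and the truncated mean, bin-nan, and stringify strategies are not shown because their exact behavior in dstk is not described here.

```python
import pandas as pd


def impute_continuous(series, strategy="mean"):
    """Fill missing values in a numeric Series with its mean or median."""
    if strategy == "mean":
        fill_value = series.mean()
    elif strategy == "median":
        fill_value = series.median()
    else:
        raise ValueError("unsupported strategy: {}".format(strategy))
    return series.fillna(fill_value)


def impute_most_frequent(series):
    """Fill missing values in a categorical Series with its most frequent class."""
    # Series.mode() ignores NaN by default; ties are resolved by taking the first mode.
    return series.fillna(series.mode().iloc[0])
```

In a train/test setting, the fill value would normally be computed on the training data and reused on new data, which this stateless sketch does not handle.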

MISC

  • onehot_split

Metric

Response related metrics

Clustering metrics

  • Purity (needs unit tests)
  • Accuracy (needs unit tests)
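
Purity assigns each cluster to its majority true class and reports the fraction of points that fall in their cluster's majority class. A minimal sketch under that standard definition follows; the function name is an assumption, not dstk's.

```python
import numpy as np


def purity(labels_true, labels_pred):
    """Fraction of points belonging to the majority true class of their cluster."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    majority_total = 0
    for cluster in np.unique(labels_pred):
        members = labels_true[labels_pred == cluster]
        _, counts = np.unique(members, return_counts=True)
        majority_total += counts.max()
    return majority_total / len(labels_true)
```

Purity does not penalize a large number of clusters: putting every point in its own cluster yields a purity of 1, so it is usually read alongside other metrics.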

Transformation

Binning

  • Equal population binning (3 stars)
  • Equal value binning (3 stars)
  • Monotonic binning
  • ChiMerge
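
The difference between the first two binning schemes can be seen with pandas' built-in `qcut` (equal population) and `cut` (equal value width). This is a sketch using pandas directly, not dstk's binning classes, and the sample data is made up for illustration.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal population: bin edges are quantiles, so each bin holds about the same number of rows.
equal_population = pd.qcut(values, q=5, labels=False)

# Equal value: bin edges split the min-max range into intervals of equal width.
equal_value = pd.cut(values, bins=5, labels=False)
```

With a skewed column like this one, equal-value binning puts nine of the ten points in the first bin, while equal-population binning places two points in every bin.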

Encoding

  • Dummy (2 stars)
  • WOE (2 stars)
  • Tree leaves encoding
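
As a sketch of WOE encoding (and the related IV from the inspection section), the snippet below computes Weight of Evidence per category for a binary target. It assumes the convention WOE = ln(share of non-events / share of events) and adds a smoothing constant `eps` to every count to avoid division by zero. The sign convention, the smoothing, and the function name are all assumptions; dstk's encoder may differ.

```python
import numpy as np
import pandas as pd


def woe_iv(feature, target, eps=0.5):
    """Return (WOE per category as a Series, information value as a float)."""
    df = pd.DataFrame({"x": feature, "y": target})
    stats = df.groupby("x")["y"].agg(["sum", "count"])
    events = stats["sum"] + eps                       # target == 1, smoothed
    non_events = stats["count"] - stats["sum"] + eps  # target == 0, smoothed
    dist_events = events / events.sum()
    dist_non_events = non_events / non_events.sum()
    woe = np.log(dist_non_events / dist_events)
    iv = float(((dist_non_events - dist_events) * woe).sum())
    return woe, iv
```

Encoding a column then amounts to mapping each category to its WOE value, for example with `pd.Series(feature).map(woe)`.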

Clustering

K-based clustering

Density-based clustering

Hierarchical clustering

Advanced clustering

  • Spectral clustering
  • Subspace clustering
  • Multi-sourced clustering
  • Multi-aspect clustering
  • Multi-task clustering

Deep learning-based clustering

  • AE + K-means
  • AE + Spectral clustering
  • AE + Subspace clustering

Feature learning

Adversarial representation learning

  • BiGAN
  • infoGAN
  • AAE

statistical description of raw data

Exploring/summarizing the data distribution

data type

different processing methods for different types of data

numeric

  • continuous: Data that can take on any value in an interval
  • discrete: Data that can only take on integer values

categorical

Data that can only take on a specific set of values

  • Binary: special case of categorical data, can only take two values

ordinal

Categorical data that has an explicit ordering

Numeric data statistical description

Estimates of Location

  • mean
  • truncated mean
  • weighted mean
  • median
  • outliers
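
The location estimates above react differently to an outlier, which a small NumPy example makes concrete. The `truncated_mean` helper and the sample data are illustrative assumptions.

```python
import numpy as np


def truncated_mean(values, proportion=0.1):
    """Mean after dropping the lowest and highest `proportion` of the sorted values."""
    values = np.sort(np.asarray(values, dtype=float))
    k = int(len(values) * proportion)
    return float(values[k:len(values) - k].mean())


data = [1, 2, 3, 4, 100]  # 100 acts as an outlier

mean_value = float(np.mean(data))            # pulled toward the outlier
median_value = float(np.median(data))        # robust to the outlier
trimmed_value = truncated_mean(data, 0.2)    # drops 1 and 100 before averaging
weighted_value = float(np.average([1, 2, 3], weights=[3, 1, 1]))
```

Here the mean is 22, while both the median and the 20% truncated mean are 3, showing how a single extreme value dominates the plain mean.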

Estimates of Variability

  • variance: the sample variance divides by N-1 rather than N
  • standard deviation
  • range: difference between the maximum and minimum values
  • percentiles
  • Interquartile Range(IQR): 75th percentile - 25th percentile
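
These variability estimates can be computed directly with NumPy. Note that `np.var` and `np.std` divide by N unless `ddof=1` is passed, in which case they divide by N-1. The sample data below is made up for illustration.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

population_variance = data.var()       # divides by N
sample_variance = data.var(ddof=1)     # divides by N-1
sample_std = data.std(ddof=1)
value_range = data.max() - data.min()
q25, q75 = np.percentile(data, [25, 75])
iqr = q75 - q25                        # 75th percentile - 25th percentile
```

For this sample the population variance is 4, the sample variance is 32/7, and the IQR is 1.5 under NumPy's default linear interpolation of percentiles.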

data distribution exploration

  • Boxplot
  • Frequency table
  • histogram
  • density plot: kernel density estimate

Categorical data statistical description

  • Mode: the most commonly occurring category or value
  • Expected value: similar to a weighted mean
  • Bar charts: the frequency or proportion for each category plotted as bars
  • Pie charts: the frequency or proportion for each category plotted as wedges in a pie

MISC

  • Entity embeddings
