Testing for Data Scientists

Why test?

  • Find bugs
  • Check your assumptions by making them explicit
  • Test outcomes, not the implementation

When to test?

  • Write tests as you develop the code (or before, if you're using TDD)
  • When you find a bug, add a test.

Types of Tests

  • Unit
    • define input, output, expected behavior
    • run on internal (module) code only
    • often automated, run before software merge/release cycles
    • code coverage measures lines exercised by unit tests
  • Integration
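As a minimal sketch of the "define input, output, expected behavior" idea, the unit tests below exercise a small function. The function `normalize` and both test names are hypothetical and exist only for illustration.

```python
def normalize(values):
    """Scale a list of numbers so they sum to 1 (hypothetical example function)."""
    total = sum(values)
    if total == 0:
        raise ValueError("values must not sum to zero")
    return [v / total for v in values]


def test_normalize_known_input():
    # The input is stated explicitly, and so is the expected output.
    assert normalize([1, 1, 2]) == [0.25, 0.25, 0.5]


def test_normalize_rejects_zero_total():
    # Behavior on bad input is part of the contract, so it gets a test too.
    try:
        normalize([0, 0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a zero total")
```

Both tests check what the function returns or raises, not how it computes it, so the implementation can change without the tests changing.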

Docstrings, Syntax and Type Checking

  • Write clear docstrings
    • What is the function's purpose?
    • What objects and parameters does it take and return?
  • Put examples in the docstring: create an example at the interactive prompt and paste it in (people mostly skip the prose and go straight to the examples)
    • The doctest module creates tests out of docstring examples
    • It finds the interactive-prompt text, re-runs it, and checks that the output matches what was recorded
if __name__ == '__main__':
    import doctest
    print(doctest.testmod())
  • Detect common errors with pyflakes
    • Unused imports, undefined names, etc. will be found
python3 -m pyflakes quadratic.py
  • Add type hints to the code and check them statically with mypy
    • Add type hints each time you declare a new data structure or function
    • mypy follows the chain of calls and checks that the types passed from one to the next are compatible
mypy quadratic.py
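A minimal sketch of what a file like quadratic.py could contain so that all three checks above have something to work on. The function name, its signature, and the docstring examples are assumptions made for illustration, not the original module.

```python
"""quadratic.py: a hypothetical module used to illustrate the checks above."""
import doctest
import math
from typing import Tuple


def quadratic(a: float, b: float, c: float) -> Tuple[float, float]:
    """Return the real roots of a*x**2 + b*x + c = 0, smaller root first.

    Assumes a > 0 and a non-negative discriminant.

    >>> quadratic(1, -3, 2)
    (1.0, 2.0)
    >>> quadratic(1, 2, 1)
    (-1.0, -1.0)
    """
    root = math.sqrt(b * b - 4 * a * c)
    return ((-b - root) / (2 * a), (-b + root) / (2 * a))


if __name__ == "__main__":
    # Re-run the docstring examples and print the summary of failures and attempts.
    print(doctest.testmod())
```

Running the file executes the doctests, while pyflakes and mypy inspect it without running it. With these hints in place, a call such as quadratic("1", 2, 1) should be flagged by mypy before the code ever runs.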

Test Modules

pytest

  • Create a folder called /tests in your repo to hold files with test functions
  • Write test functions that begin with def test_<func-to-test>():
    • test functions examine the output of module functions
    • ensure that module functions do as they advertise using assert statements
    • Example
import pandas as pd

from my_module import get_data

def test_get_data():
    df = get_data()
    # Compare as lists so that a wrong number of columns fails the assert cleanly
    assert list(df.columns) == ['A', 'B', 'C', 'D']
    assert isinstance(df.index, pd.DatetimeIndex)
  • Run tests as
python3 -m pytest tests/

hypothesis

Find failing test cases with hypothesis

  • Implements property-based testing
  • A property is an invariant (a statement that should always be True), which hypothesis checks against many generated inputs
    • Generate random inputs according to some strategies
    • Run function with different combinations
    • Essentially, it tortures the function until it fails
  • We use the given, assume and strategies features of hypothesis to set up the test space
  • Useful for finding edge cases
  • Ideal for code that accepts free-text input
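A minimal sketch of a property-based test using given, assume and strategies. The function `normalize` is a hypothetical stand-in for module code, and the chosen property (the output sums to 1 and keeps its length) is only one of many invariants that could be tested.

```python
from hypothesis import assume, given, strategies as st


def normalize(values):
    """Scale a list of numbers so they sum to 1 (hypothetical example function)."""
    total = sum(values)
    return [v / total for v in values]


# strategies describe the space of inputs: non-empty lists of small non-negative ints
@given(st.lists(st.integers(min_value=0, max_value=1000), min_size=1))
def test_normalize_sums_to_one(values):
    # assume() discards generated inputs the property does not apply to
    assume(sum(values) > 0)
    result = normalize(values)
    # The invariants: same length as the input, and the result sums to 1
    assert len(result) == len(values)
    assert abs(sum(result) - 1.0) < 1e-9
```

pytest collects this like any other test function; hypothesis then calls it repeatedly with generated lists and reports a small failing example if the property ever breaks.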

engarde

  • Built with pandas, very lightweight
  • Use for functions that accept and return DataFrames
  • Just add decorators! (or use DataFrame.pipe())
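from engarde.decorators import is_shape, none_missing, unique_index

# Each decorator checks the DataFrame returned by the function
@is_shape((10**5, 16))  # the expected shape is passed as a single tuple
@none_missing()
@unique_index()
def transform_input(df):
    ...
    return result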
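The DataFrame.pipe() style mentioned above can be sketched with plain pandas. The two check functions below are hand-written stand-ins, not engarde's own implementations; the names `check_none_missing` and `check_unique_index` are hypothetical.

```python
import pandas as pd


def check_none_missing(df):
    """Fail if any value is missing; otherwise pass the frame through unchanged."""
    assert not df.isna().any().any(), "found missing values"
    return df


def check_unique_index(df):
    """Fail if the index has duplicates; otherwise pass the frame through unchanged."""
    assert df.index.is_unique, "index has duplicate labels"
    return df


raw = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])

# Checks sit between transformation steps in a single method chain
clean = (
    raw
    .pipe(check_none_missing)
    .pipe(check_unique_index)
    .assign(y=lambda d: d["x"] * 2)
)
```

Because each check returns the frame it was given, it can be dropped into or removed from a chain without changing the surrounding steps.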

Writing tests in Data Science

  • Univariate
    • feature values within expected thresholds
    • feature stats (mean, stddev) within expected ranges
    • feature values following an expected statistical distribution
    • categorical features have expected number of levels
  • Bivariate
    • correlations between pairs of features fall within expected ranges
  • DataFrame level
    • DataFrame has expected shape
    • Index is of expected type and shape
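A minimal sketch of how the checks listed above might look with pandas. The synthetic frame, the column names, and every threshold are assumptions chosen for illustration; real thresholds would come from knowledge of the data.

```python
import numpy as np
import pandas as pd


def make_example_frame():
    """Build a small seeded frame standing in for real data (hypothetical)."""
    rng = np.random.default_rng(0)
    n = 500
    height = rng.normal(170, 10, n)
    weight = 0.5 * height + rng.normal(0, 3, n)
    return pd.DataFrame(
        {
            "height_cm": height,
            "weight_kg": weight,
            "group": rng.choice(["a", "b", "c"], n),
        },
        index=pd.date_range("2024-01-01", periods=n, freq="D"),
    )


def check_univariate(df):
    # Values within expected thresholds
    assert df["height_cm"].between(100, 250).all()
    # Summary statistics within expected ranges
    assert 165 < df["height_cm"].mean() < 175
    assert 5 < df["height_cm"].std() < 15
    # Categorical feature has the expected number of levels
    assert df["group"].nunique() == 3


def check_bivariate(df):
    # Correlation between a pair of features is in the expected range
    assert df["height_cm"].corr(df["weight_kg"]) > 0.5


def check_frame(df):
    # Shape and index are as expected
    assert df.shape == (500, 3)
    assert isinstance(df.index, pd.DatetimeIndex)
    assert df.index.is_unique
```

In a pytest suite, each check would typically be called from a test function that first loads or builds the frame.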

