Code Monkey home page Code Monkey logo

Doc Codecov PythonVersion PyPi Paper Black

image

Machine learning applications often need large amounts of training data to perform well. Whereas unlabeled data can be easily gathered, the labeling process is difficult, time-consuming, or expensive in most applications. Active learning can help solve this problem by querying labels for those data samples that will improve the performance the most. Thereby, the goal is that the learning algorithm performs sufficiently well with fewer labels

With this goal in mind, scikit-activeml has been developed as a Python module for active learning on top of scikit-learn. The project was initiated in 2020 by the Intelligent Embedded Systems Group at the University of Kassel and is distributed under the 3-Clause BSD licence.

User Installation

The easiest way of installing scikit-activeml is using pip:

pip install -U scikit-activeml

Examples

We provide a broad overview of different use-cases in our tutorial section offering

Two simple examples illustrating the straightforwardness of implementing active learning cycles with our Python package skactiveml are given in the following.

Pool-based Active Learning

The following code snippet implements an active learning cycle with 20 iterations using a Gaussian process classifier and uncertainty sampling. To use other classifiers, you can simply wrap classifiers from sklearn or use classifiers provided by skactiveml. Note that the main difficulty using active learning with sklearn is the ability to handle unlabeled data, which we denote as a specific value (MISSING_LABEL) in the label vector y. More query strategies can be found in the documentation.

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.datasets import make_blobs
from skactiveml.pool import UncertaintySampling
from skactiveml.utils import unlabeled_indices, MISSING_LABEL
from skactiveml.classifier import SklearnClassifier

# Generate data set.
X, y_true = make_blobs(n_samples=200, centers=4, random_state=0)
y = np.full(shape=y_true.shape, fill_value=MISSING_LABEL)

# Use the first 10 instances as initial training data.
y[:10] = y_true[:10]

# Create classifier and query strategy.
clf = SklearnClassifier(
    GaussianProcessClassifier(random_state=0),
    classes=np.unique(y_true),
    random_state=0
)
qs = UncertaintySampling(method='entropy')

# Execute active learning cycle.
n_cycles = 20
for c in range(n_cycles):
    query_idx = qs.query(X=X, y=y, clf=clf)
    y[query_idx] = y_true[query_idx]

# Fit final classifier.
clf.fit(X, y)

As a result, we obtain an actively trained Gaussian process classifier. A corresponding visualization of its decision boundary (black line) and the sample utilities (greenish contours) is given below.

image

Stream-based Active Learning

The following code snippet implements an active learning cycle with 200 data points and the default budget of 10% using a pwc classifier and split uncertainty sampling. Like in the pool-based example you can wrap other classifiers from sklearn, sklearn compatible classifiers or like the example classifiers provided by skactiveml.

import numpy as np
from sklearn.datasets import make_blobs
from skactiveml.classifier import ParzenWindowClassifier
from skactiveml.stream import Split
from skactiveml.utils import MISSING_LABEL

# Generate data set.
X, y_true = make_blobs(n_samples=200, centers=4, random_state=0)

# Create classifier and query strategy.
clf = ParzenWindowClassifier(random_state=0, classes=np.unique(y_true))
qs = Split(random_state=0)

# Initializing the training data as an empty array.
X_train = []
y_train = []

# Initialize the list that stores the result of the classifier's prediction.
correct_classifications = []

# Execute active learning cycle.
for x_t, y_t in zip(X, y_true):
    X_cand = x_t.reshape([1, -1])
    y_cand = y_t
    clf.fit(X_train, y_train)
    correct_classifications.append(clf.predict(X_cand)[0] == y_cand)
    sampled_indices = qs.query(candidates=X_cand, clf=clf)
    qs.update(candidates=X_cand, queried_indices=sampled_indices)
    X_train.append(x_t)
    y_train.append(y_cand if len(sampled_indices) > 0 else MISSING_LABEL)

As a result, we obtain an actively trained Parzen window classifier. A corresponding visualization of its accuracy curve accross the active learning cycle is given below.

image

Query Strategy Overview

For better orientation, we provide an overview (incl. paper references and visualizations) of the query strategies implemented by skactiveml.

Overview Visualization

Citing

If you use skactiveml in one of your research projects and find it helpful, please cite the following:

@article{skactiveml2021,
    title={scikit-activeml: {A} {L}ibrary and {T}oolbox for {A}ctive {L}earning {A}lgorithms},
    author={Daniel Kottke and Marek Herde and Tuan Pham Minh and Alexander Benz and Pascal Mergard and Atal Roghman and Christoph Sandrock and Bernhard Sick},
    journal={Preprints},
    doi={10.20944/preprints202103.0194.v1},
    year={2021},
    url={https://github.com/scikit-activeml/scikit-activeml}
}

scikit-activeml's Projects

scikit-activeml icon scikit-activeml

Our package scikit-activeml is a Python library for active learning on top of SciPy and scikit-learn.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.