
Todd: Text Out of Distribution Detection

Todd is a library that provides anomaly scorers, and the filters derived from them, for text generation. It supports three levels of anomaly detection: input level, generated-sequence level, and token level. Input-level detection is used for OOD detection, generated-sequence-level detection aims to select the best candidates in a beam, and token-level detection aims to flag potentially wrong tokens in the generated sequence.

It revolves around two concepts: scorers and filters. Scorers assign a score to the input, a token, or the output; filters use these scores to produce masks that decide what to keep and what to reject.
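
The scorer/filter split can be illustrated with a small, library-independent sketch. The names `length_scorer` and `threshold_filter` are hypothetical and are not part of Todd's API; the toy score (distance of a word count from a reference length) only stands in for a real anomaly score, and the convention "higher score means more anomalous" is an assumption of this example.

```python
from typing import List, Sequence


# Hypothetical toy scorer (not Todd's API): higher score = more anomalous.
def length_scorer(texts: Sequence[str], reference_length: float) -> List[float]:
    """Score each text by how far its word count is from a reference length."""
    return [abs(len(text.split()) - reference_length) for text in texts]


# Hypothetical filter: turns scores into a keep/reject mask.
def threshold_filter(scores: Sequence[float], threshold: float) -> List[bool]:
    """Return True for items to keep (score at or below the threshold)."""
    return [score <= threshold for score in scores]


texts = ["a short sentence", "another short one here", "x " * 50]
scores = length_scorer(texts, reference_length=4.0)
mask = threshold_filter(scores, threshold=3.0)

# The mask is then used to select what to keep.
kept = [text for text, keep in zip(texts, mask) if keep]
```

Keeping the two steps separate means the same scores can be dumped for analysis or reused with different thresholds without recomputing them.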

For benchmarking purposes, we tend to use only the scorers, dumping the scores and computing statistics on them. In practice, though, we would want to use filters to select the best candidates.
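
As an illustration of the kind of statistic that can be computed from dumped scores, here is a minimal pure-Python sketch of the AUROC between in-distribution and OOD scores. It is not taken from Todd: the function name `auroc` is hypothetical, the score values are made up for the example, and it assumes that a higher score means "more anomalous".

```python
from typing import Sequence


def auroc(in_scores: Sequence[float], out_scores: Sequence[float]) -> float:
    """Probability that a random OOD sample scores higher than a random
    in-distribution sample (ties count as one half)."""
    wins = 0.0
    for out_score in out_scores:
        for in_score in in_scores:
            if out_score > in_score:
                wins += 1.0
            elif out_score == in_score:
                wins += 0.5
    return wins / (len(in_scores) * len(out_scores))


# Made-up scores standing in for values dumped by a scorer.
in_scores = [0.1, 0.4, 0.35, 0.2]
out_scores = [0.8, 0.3, 0.9]

print(f"AUROC: {auroc(in_scores, out_scores):.3f}")
```

The pairwise loop is quadratic in the number of samples; for large benchmarks a rank-based implementation from an established metrics library would be the more practical choice.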

Citing this work

@software{Darrin_Todd_A_tool_2023,
author = {Darrin, Maxime and Faysse, Manuel and Staerman, Guillaume and Picot, Marine and Dadalto Camara Gomez, Eduardo and Colombo, Pierre},
month = {2},
title = {{Todd: A tool for text OOD detection.}},
url = {https://github.com/icannos/Todd},
version = {0.0.1},
year = {2023}
}

Installation

From GitHub

git clone git@github.com:icannos/Todd.git
cd Todd
pip install -e .

From pip

pip install todd

Examples

Different examples are provided in the examples folder.

OOD Detection with Mahalanobis Distance

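import torch

# extract_embeddings and MahalanobisScorer are provided by Todd; model, tokenizer
# and the data loaders are assumed to be defined beforehand (see the examples folder).

# Extract features from the reference set
ref_embeddings, _ = extract_embeddings(model, tokenizer, in_val_loader, layers=[6])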

# Initialize the Mahalanobis detector
maha_detector = MahalanobisScorer(layers=[6])
# Fit the detector with the reference set
maha_detector.fit(ref_embeddings)

with torch.no_grad():
    for batch in loader:
        inputs = tokenizer(
            batch["source"], padding=True, truncation=True, return_tensors="pt"
        )
        output = model.generate(
            **inputs,
            return_dict_in_generate=True,
            output_hidden_states=True,
            output_scores=True,
        )

        print(maha_detector(output))
        # Outputs the scores for each input in the batch
        # True means the input is in-distribution (i.e. to be kept) and False means OOD
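
For readers unfamiliar with the underlying score, the following NumPy sketch shows the basic idea of a Mahalanobis-distance OOD score: fit a Gaussian to reference embeddings, then measure how far new embeddings lie from it. This is not Todd's `MahalanobisScorer` implementation, which may differ (for example in how layers are handled); the function names, the random stand-in embeddings, and the small ridge term are all assumptions of this example.

```python
import numpy as np


def fit_gaussian(reference: np.ndarray):
    """Estimate the mean and inverse covariance of the reference embeddings."""
    mean = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    # Small ridge term keeps the covariance invertible (an assumption of this sketch).
    cov += 1e-6 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)


def mahalanobis_scores(
    embeddings: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray
) -> np.ndarray:
    """Squared Mahalanobis distance of each row to the reference Gaussian."""
    centered = embeddings - mean
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)


rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))        # stand-in for in-distribution embeddings
in_dist = rng.normal(size=(5, 8))            # new samples from the same distribution
shifted = rng.normal(loc=4.0, size=(5, 8))   # stand-in for OOD embeddings

mean, inv_cov = fit_gaussian(reference)
in_dist_scores = mahalanobis_scores(in_dist, mean, inv_cov)
shifted_scores = mahalanobis_scores(shifted, mean, inv_cov)

print("in-distribution:", in_dist_scores)
print("shifted:", shifted_scores)
```

Samples drawn far from the reference distribution receive larger distances, which is what makes the score usable for input-level OOD detection.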

Todo:

  • Add reorderers to reorder the beam search candidates
  • Split filters and anomaly scorers (WIP, maxime)
  • Add power means and cosine OOD scores

API
