
mulsi's Introduction

MUltimodal Llm Safe Inference

Safe inference using representation engineering.

mulsi's People

Contributors

xmaster6y, imenelydiaker

mulsi's Issues

[EXPERIMENT] Detect concepts using images

Description

Detect concepts activated by the attack using images instead of text when building contrast vectors.

Acceptance Criteria

  • The code is documented
  • Utilities and class tests are written
  • The code was reviewed

Tasks

  • Select concepts to work on.
  • Gather a set of images for each selected concept.
  • Build the contrast vectors.
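
One common way to build a contrast vector is to take the difference between the mean representation of inputs that show a concept and the mean representation of inputs that do not. The sketch below illustrates that idea only and is not the project's implementation: the function name `contrast_vector` and the tiny 3-dimensional embeddings are hypothetical stand-ins for real image representations extracted from a model.

```python
from statistics import fmean


def contrast_vector(positive, negative):
    """Return mean(positive) - mean(negative), computed per dimension.

    Both arguments are lists of equal-length embedding vectors.
    """
    dims = range(len(positive[0]))
    pos_mean = [fmean(v[d] for v in positive) for d in dims]
    neg_mean = [fmean(v[d] for v in negative) for d in dims]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


# Hypothetical embeddings of images with and without the concept.
with_concept = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
without_concept = [[0.0, 0.0, 2.0], [0.0, 2.0, 2.0]]

direction = contrast_vector(with_concept, without_concept)
print(direction)
```

Dimensions that do not differ between the two sets cancel out, so the resulting direction keeps only what separates the concept images from the rest.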

Extend concept reading with datasets

Description

Extend concept reading with text datasets.

Acceptance Criteria

  • The code is documented
  • Utilities and class tests are written
  • The code was reviewed

Tasks

  • Collect datasets.
  • Compute dataset representation with class Representation.
  • Extend script to evaluate on a dataset.

[EXPERIMENT] Expand the list of models to evaluate

Description

Expand the list of models to attack and to evaluate the detection method on. The goal is to verify the effectiveness of the detection method.
If possible, also check whether any patterns appear when different models are attacked with different methods.

Acceptance Criteria

  • The code is documented
  • Utilities and class tests are written
  • The code was reviewed

Tasks

  • List models and build processors.
  • Attack the models and perform detection.
  • Analyze results.

Attack Scripts

Description

Add scripts and modules that make attacks easy to reproduce.

Acceptance Criteria

  • The code is documented
  • Utilities and tests are written
  • The code was reviewed

Tasks

  • Create a basic FGSM attack
  • Create a script to scale the attack

Contrast Reading Script

Description

Add scripts and modules that make reading easy to scale.

Acceptance Criteria

  • The code is documented
  • Utilities and tests are written
  • The code was reviewed

Tasks

  • Create a basic contrast reading
  • Create a script to scale the reading
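
A basic contrast reading can be sketched as projecting an activation onto a normalized contrast direction, so that a larger score means the concept is more strongly expressed. The code below assumes that interpretation; `reading_score`, `read_batch`, and the 2-dimensional vectors are hypothetical and do not reflect the repository's actual API.

```python
import math


def reading_score(activation, direction):
    """Scalar projection of an activation onto the contrast direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm


def read_batch(activations, direction):
    """Apply the same reading to many activations (the 'scaled' version)."""
    return [reading_score(a, direction) for a in activations]


# Hypothetical contrast direction and two activations to read.
direction = [3.0, 4.0]
clean = [1.0, 0.0]
attacked = [3.0, 4.0]

scores = read_batch([clean, attacked], direction)
print(scores)
```

Comparing the scores of clean and attacked inputs is one way such a reading could feed a detection rule, for example by thresholding the score.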

Add BERT-Attack on text modality

Description

Attack the text modality with BERT-Attack. The attack was originally developed against BERT, so it must be adapted to the targeted LLM.

Acceptance Criteria

  • The code is documented
  • Utilities and class tests are written
  • The code was reviewed

Tasks

  • Develop the attack.
  • Test script for the attack.
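
BERT-Attack works in two stages: it ranks words by how much masking each one lowers the victim model's score for the correct label, then replaces the most important words with substitutes proposed by a masked language model until the prediction flips. The sketch below mirrors that structure with toy stand-ins: `toy_score` plays the victim model and the `candidates` dictionary plays the masked-LM proposals. Both are hypothetical, as are the function names.

```python
def word_importance(words, score_fn, mask="[MASK]"):
    """Rank word positions by the score drop caused by masking them."""
    base = score_fn(words)
    drops = [base - score_fn(words[:i] + [mask] + words[i + 1:])
             for i in range(len(words))]
    return sorted(range(len(words)), key=lambda i: drops[i], reverse=True)


def substitute_attack(words, score_fn, candidates, threshold=0.5):
    """Greedily replace important words until the score drops below threshold."""
    adv = list(words)
    for i in word_importance(words, score_fn):
        options = candidates.get(adv[i], [])
        best = min(options,
                   key=lambda c: score_fn(adv[:i] + [c] + adv[i + 1:]),
                   default=None)
        if best is None:
            continue  # no substitute proposed for this word
        trial = adv[:i] + [best] + adv[i + 1:]
        if score_fn(trial) < score_fn(adv):
            adv = trial
        if score_fn(adv) < threshold:
            break  # prediction flipped
    return adv


# Toy victim model: score for the "positive" label (assumed weights).
POSITIVE = {"great": 0.5, "love": 0.25}


def toy_score(words):
    return 0.125 + sum(POSITIVE.get(w, 0.0) for w in words)


# Stand-in for masked-LM substitute proposals.
candidates = {"great": ["fine", "good"], "love": ["like"]}

sentence = ["i", "love", "this", "great", "film"]
adversarial = substitute_attack(sentence, toy_score, candidates)
print(adversarial, toy_score(adversarial))
```

Adapting this to a targeted LLM would mainly mean swapping `toy_score` for the LLM's own scoring of the correct output and choosing a source of substitute candidates suited to its tokenizer.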
