
sae-exercises-mats

This is a GitHub repo hosting some exercises on sparse autoencoders, which I've recently finished working on as part of the upcoming ARENA 3.0 iteration. Having spoken to Neel Nanda and others in interpretability-related MATS streams, I thought it would be useful to make these exercises accessible outside the context of the rest of the ARENA curriculum.

Links to Colabs: Exercises, Solutions.

If you don't like working in Colabs, then you can clone this repo, download the exercises & solutions Colabs as notebooks, and run them in the same directory.

The exercises were built out of the Toy Models of Superposition exercises from the previous ARENA iteration, with new sparse autoencoder content added. They fall into two groups:

SAEs in toy models

We take the toy models from Anthropic's Toy Models of Superposition paper (for which there are also exercises), and train sparse autoencoders on the representations learned by these toy models. These exercises culminate in using neuron resampling to successfully recover all the features learned by the toy model of bottleneck superposition - see this animation.
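To make that concrete, below is a minimal sketch of the kind of sparse autoencoder these exercises build: a ReLU encoder with more features than hidden dimensions, a linear decoder, and an L1 sparsity penalty on the feature activations. The class name, initialisation, and loss coefficient are illustrative assumptions rather than the exercises' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySAE(nn.Module):
    """Minimal sparse autoencoder sketch (illustrative, not the exercises' exact class)."""

    def __init__(self, d_hidden: int, n_features: int):
        super().__init__()
        # Overcomplete dictionary: n_features > d_hidden
        self.W_enc = nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
        self.W_dec = nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.b_dec = nn.Parameter(torch.zeros(d_hidden))

    def forward(self, h: torch.Tensor, l1_coeff: float = 1e-3):
        # Encode: project into the overcomplete basis, ReLU -> sparse feature activations
        acts = F.relu((h - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruct the original hidden activations
        h_hat = acts @ self.W_dec + self.b_dec
        # Loss = reconstruction error + L1 sparsity penalty on feature activations
        recon_loss = F.mse_loss(h_hat, h)
        sparsity_loss = acts.abs().sum(dim=-1).mean()
        return h_hat, acts, recon_loss + l1_coeff * sparsity_loss
```

Neuron resampling, which the toy-model exercises end with, periodically reinitialises the encoder/decoder weights of features that never fire, steering them towards inputs the SAE currently reconstructs poorly.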

SAEs in real models

There are also exercises on interpreting an SAE trained on a transformer, where you can discover some cool learned features (e.g. a neuron exhibiting skip trigram-like behaviour, which activates on left-brackets following Django-related syntax and predicts the completion `(' -> django`).
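As a rough sketch of what that interpretation step looks like in practice, you can cache a transformer's MLP activations with TransformerLens and check which SAE features fire on a Django-flavoured prompt. The model name, hook point, and SAE sizes below are assumptions, and the SAE is untrained here, so this only illustrates the plumbing:

```python
import torch
from transformer_lens import HookedTransformer

# Reuses the ToySAE class from the sketch above; a real run would load trained SAE weights.
model = HookedTransformer.from_pretrained("gelu-1l")  # model choice is illustrative
sae = ToySAE(d_hidden=model.cfg.d_mlp, n_features=16 * model.cfg.d_mlp)

prompt = "from django.db import models; models.Model.objects.filter("
tokens = model.to_tokens(prompt)
with torch.no_grad():
    _, cache = model.run_with_cache(tokens)
    mlp_acts = cache["blocks.0.mlp.hook_post"]   # (batch, seq, d_mlp)
    _, feature_acts, _ = sae(mlp_acts)           # (batch, seq, n_features)

# Which learned features fire most strongly on the final token (the left-bracket)?
top_vals, top_idx = feature_acts[0, -1].topk(5)
print(top_idx.tolist(), [round(v, 3) for v in top_vals.tolist()])
```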

You can either read through the Solutions Colab (which has all output displayed & explained), or go through the Exercises Colab and fill in the functions according to the specifications you are given, looking at the Solutions when you're stuck. Both Colabs come with test functions you can run to verify your solution works.

List of all exercises

I've listed all the exercises below (although I expect most readers will only be interested in the sparse autoencoder exercises). Each set of exercises is labelled with its prerequisites: for instance, the label (1*, 3) means the first set of exercises is essential, and the third is recommended but not essential.

Abbreviations: TMS = Toy Models of Superposition, SAE = Sparse Autoencoders.

  1. TMS: Superposition in a Nonprivileged Basis
  2. TMS: Correlated / Anticorrelated Features (1*)
  3. TMS: Superposition in a Privileged Basis (1*)
  4. TMS: Feature Geometry (1*)
  5. SAEs in Toy Models (1*, 3)
  6. SAEs in Real Models (1*, 5*, 3)

Please reach out to me if you have any questions or suggestions about these exercises (either by email at [email protected], or a LessWrong private message / comment on this post). Happy coding!
