
elk

Introduction

WIP: This codebase is under active development

Because language models are trained to predict the next token in naturally occurring text, they often reproduce common human errors and misconceptions, even when they "know better" in some sense. More worryingly, when models are trained to generate text that's rated highly by humans, they may learn to output false statements that human evaluators can't detect. We aim to circumvent this issue by directly **eliciting latent knowledge** (ELK) inside the activations of a language model.

Specifically, we're building on the Contrastive Representation Clustering (CRC) method described in the paper Discovering Latent Knowledge in Language Models Without Supervision by Burns et al. (2022). In CRC, we search for features in the hidden states of a language model which satisfy certain logical consistency requirements. It turns out that these features are often useful for question-answering and text classification tasks, even though the features are trained without labels.
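
For intuition, here is a minimal sketch of the CCS-style consistency-plus-confidence objective described in Burns et al. (2022). It is illustrative only and not the exact training code used in this repository (the default CRC-based method uses a different objective):

    import torch

    def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
        """Consistency + confidence loss from Burns et al. (2022).

        p_pos, p_neg: probe outputs in (0, 1) for the "true" and "false"
        framings of each statement, shape (batch,).
        """
        # Consistency: p(x+) and p(x-) should behave like complementary probabilities.
        consistency = (p_pos - (1.0 - p_neg)).pow(2)
        # Confidence: discourage the degenerate solution p(x+) = p(x-) = 0.5.
        confidence = torch.minimum(p_pos, p_neg).pow(2)
        return (consistency + confidence).mean()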

Quick Start

Our code is based on PyTorch and Huggingface Transformers. We test the code on Python 3.10 and 3.11.

First install the package with pip install -e . in the root directory, or pip install eleuther-elk to install from PyPI. Use pip install -e .[dev] if you'd like to contribute to the project (see the Development section below). This should install all the necessary dependencies.

To fit reporters for a HuggingFace model and dataset, run elk elicit <model> <dataset>. For example:

elk elicit microsoft/deberta-v2-xxlarge-mnli imdb

This will automatically download the model and dataset, run the model and extract the relevant representations if they aren't cached on disk, fit reporters on them, and save the reporter checkpoints to the elk-reporters folder in your home directory. It will also evaluate the reporters' classification performance on a held-out test set and save the results to a CSV file in the same folder.
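
The exact file layout and CSV schema may differ between versions, but inspecting the saved results might look roughly like this (the run name and filename below are illustrative):

    from pathlib import Path
    import pandas as pd

    # Hypothetical run name; check ~/elk-reporters/ for the folders actually created.
    run_dir = Path.home() / "elk-reporters" / "naughty-northcutt"
    eval_df = pd.read_csv(run_dir / "eval.csv")  # per-layer evaluation metrics
    print(eval_df.head())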

The following command generates a CCS (Contrast Consistent Search) reporter instead of the default CRC-based reporter:

elk elicit microsoft/deberta-v2-xxlarge-mnli imdb --net ccs

The following command will evaluate the probe from the run naughty-northcutt on the hidden states extracted from the model deberta-v2-xxlarge-mnli for the imdb dataset. It will produce an eval.csv and cfg.yaml file, stored in a subfolder of elk-reporters/naughty-northcutt/transfer_eval.

elk eval naughty-northcutt microsoft/deberta-v2-xxlarge-mnli imdb

The following runs elicit on the Cartesian product of the listed models and datasets, storing the results in a special folder ELK_DIR/sweeps/<memorable_name>. The --add_pooled flag adds an additional dataset that pools all of the listed datasets together. You can also add a --visualize flag to visualize the results of the sweep.

elk sweep --models gpt2-{medium,large,xl} --datasets imdb amazon_polarity --add_pooled

If you just do elk plot, it will plot the results from the most recent sweep. If you want to plot a specific sweep, you can do so with:

elk plot {sweep_name}

Caching

The hidden states resulting from elk elicit are cached as a HuggingFace dataset to avoid having to recompute them every time we want to train a probe. The cache is stored in the same place as all other HuggingFace datasets, which is usually ~/.cache/huggingface/datasets.
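
If you want to check how much disk space the cached hidden states are using, a simple sketch like the following lists the cache entries (this assumes the default cache location; the datasets library also respects the HF_DATASETS_CACHE environment variable):

    from pathlib import Path

    # Default location of the HuggingFace datasets cache.
    cache_dir = Path.home() / ".cache" / "huggingface" / "datasets"

    # List cached datasets and their on-disk sizes.
    for entry in sorted(cache_dir.iterdir()):
        if entry.is_dir():
            size_mb = sum(f.stat().st_size for f in entry.rglob("*") if f.is_file()) / 1e6
            print(f"{entry.name}: {size_mb:.1f} MB")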

Development

Use pip install pre-commit && pre-commit install in the root folder before your first commit.

Devcontainer

Open the repository in a devcontainer using the VS Code Remote - Containers extension.

Run tests

pytest

Run type checking

We use pyright, which is built into the VSCode editor. If you'd like to run it as a standalone tool, it requires a Node.js installation.

pyright

Run the linter

We use ruff. It is installed as a pre-commit hook, so you don't have to run it manually. If you want to run it manually, you can do so with:

ruff . --fix

Contributing to this repository

If you work on a new feature, fix, or other code task, make sure to create an issue and assign it to yourself (and consider sharing it in the elk channel of EleutherAI's Discord with a short note). That way, others know you are working on the issue, no one duplicates the work, and people can contact you easily. 👍

Issues

Average over layers and prompts during inference [29.05]

Same as inference-k-prompts, but with credence scores also averaged over layers. That is, inference-k-prompts yields one credence score per layer; average those scores over some set of layers (for example, the last 10) before making a prediction.
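
A rough sketch of the layer-averaging step, assuming you already have a per-layer array of credence scores (all names and shapes here are illustrative):

    import numpy as np

    # Hypothetical input: credences[l, i] is the credence score for example i
    # produced by the probe at layer l (e.g. from inference-k-prompts).
    credences = np.random.rand(24, 1000)  # 24 layers, 1000 examples (dummy data)

    # Average over the last 10 layers, then threshold to get predictions.
    last_k = 10
    avg_credence = credences[-last_k:].mean(axis=0)
    predictions = avg_credence > 0.5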

Editing Activations

  1. Edit the activation
    We could use DLK to figure out which direction inside an inner layer corresponds to truthiness, edit the activation along that direction, and then see whether the model's output changes correspondingly (a rough sketch follows below). For instance, try the following:
    “X is a glarg iff X is a schmutzel. X is a glarg. Is X a schmutzel? A:”
    The language model should answer “yes”. The hope is that if we edit the truthiness of the second sentence to be false, it will instead output “no”.
    Actually, I [Kaarel] assign a pretty low probability to this working, because the main association here is probably not at the sentence level. Maybe something like “The previous sentence is true.” would work better.
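
A rough sketch of this kind of intervention, using a PyTorch forward hook to shift one layer's hidden states along a hypothetical truth direction (the model, layer index, and direction are all placeholders; in practice the direction would come from a trained DLK/CCS probe):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; any causal LM works for the sketch
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    layer_idx = 6
    truth_direction = torch.randn(model.config.hidden_size)  # placeholder direction
    truth_direction = truth_direction / truth_direction.norm()
    alpha = -5.0  # negative: push the representation toward "false"

    def edit_hook(module, inputs, output):
        # GPT-2 blocks return a tuple whose first element is the hidden states.
        hidden = output[0] + alpha * truth_direction.to(output[0].dtype)
        return (hidden,) + output[1:]

    handle = model.transformer.h[layer_idx].register_forward_hook(edit_hook)

    prompt = "X is a glarg iff X is a schmutzel. X is a glarg. Is X a schmutzel? A:"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=3)
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))

    handle.remove()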

Check if all neurons lead to ROME

Using interpretability tools (e.g. the causal tracing method from the ROME paper), we could check whether we can figure out how truth is represented and in which neurons. We could even combine this approach with the previous idea and see whether the two produce the same results.

Sweep + Visualizing [05.05]

For the paper:

  • Run a sweep over the datasets and different models (VINC and CCS)
  • Visualize these datasets with Plotly
  • Compare VINC with CCS
  • Which layer should we visualize / display, given that there is a lot of information? How do we best present it? Just pick the best and worst layer across all datasets?

Experiment: Inference k-prompts

See https://www.lesswrong.com/posts/bFwigCDMC5ishLz7X/rfc-possible-ways-to-expand-on-discovering-latent-knowledge#Additional_ideas_that_came_up_while_writing_this_post_

I think you don't really need to understand the details of VINC for this, just that it's a process that produces a model mapping activations on inputs to credence scores.

Kaarel

In this experiment, we want to average the CCS outputs on the various ways to prompt the same data point when doing inference, and also do the same for VINC outputs.

In other words, I think you can essentially treat the VINC training process as a black box here. The VINC training process outputs a probe that maps activations to credence scores (just like the CCS probe does), and you'd only be using this trained probe.
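
A sketch of that black-box usage, assuming a trained probe mapping hidden states to credences and k prompt variants per data point (all names here are hypothetical):

    import torch

    def predict_with_k_prompts(probe, hiddens_per_prompt: torch.Tensor) -> torch.Tensor:
        """Average a probe's credence scores over k prompt variants of one example.

        probe: callable mapping a (hidden_size,) tensor to a credence in (0, 1).
        hiddens_per_prompt: tensor of shape (k, hidden_size), one row per prompt variant.
        """
        credences = torch.stack([probe(h) for h in hiddens_per_prompt])
        return credences.mean()  # threshold at 0.5 for a hard label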

note: Imported from old project

Create VINC explainer [12.04]

Deadline: EoD 12.04

I will create a document that explains:

  • VINC
  • The basic problem setting of PCA
  • Variance, Covariance

Testing Inverse scaling dataset

We want to take the inverse scaling datasets and train a DLK probe for the following models:

  • a small model (GPT2)
  • a mid-sized model (GPT-J)
  • a big model (some LLaMA model…)

Then we want to check whether the representation of truth in the model's inner states also gets less accurate as models get bigger. If that happens, it could point in the direction of the model's understanding actually getting worse. On the other hand, if it doesn't hold, it could point in the direction of the inverse scaling cases seen so far having more to do with something weird going on with output behavior in a given context, and they might not generalize. This also seems like a potentially interesting additional testing ground for whether DLK can provide information about the model beyond its output behavior.


Evaluate alternating text

We could input a passage of alternating true/false sentences and see which inner states (i.e. which positions) are best for determining the truth of each particular sentence. Are these always the positions of the tokens in that sentence? Does it get more spread out as one goes deeper into the transformer? The hypothesis is that if we can locate the positions the model uses for each true sentence, we can trace that to the model's internal representation of truth.

DLK is a non-mechanistic interpretability technique since it only finds a representation of truth; it doesn’t provide a mechanism. On the other hand, if the above works, it might provide information on how the model stores truth, which is useful for mechanistic interpretability research.
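
A sketch of how one might score truth at every token position, given hidden states from a transformer and some probe direction (the model and the probe here are placeholders; in practice the direction would come from a trained DLK probe):

    import torch
    from transformers import AutoModel, AutoTokenizer

    model_name = "gpt2"  # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    passage = "Paris is in France. Paris is in Spain. Rome is in Italy."
    inputs = tokenizer(passage, return_tensors="pt")

    with torch.no_grad():
        # Last layer's hidden states, shape (seq_len, hidden_size).
        hidden = model(**inputs, output_hidden_states=True).hidden_states[-1][0]

    # Placeholder probe direction; a real one would come from DLK/CCS training.
    probe_direction = torch.randn(hidden.shape[-1])

    # One credence score per token position.
    scores = torch.sigmoid(hidden @ probe_direction)
    for token, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), scores):
        print(f"{token:>12s}  {score.item():.3f}")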


Persona / Character

One worry with DLK is that it might end up recovering the beliefs of some simulated agent (e.g. a simulated aligned AGI, or a particular human, or humanity’s scientific consensus). One idea for checking for this is to study how the LM represents the beliefs of characters it is modeling by just doing supervised learning with particular prompting and labels. For instance, get it to produce text as North Korean state news, and do supervised learning to find a representation in the model of truth-according-to-North-Korean-state-news.

If we do this for a bunch of simulated agents, maybe we can draw some general conclusion about how truth according to a simulated agent is represented, and maybe this contrasts with what DLK finds, providing evidence that DLK is not just recovering the representation of truth according to some simulated agent. Or maybe we can even use this to develop a good understanding of how language models represent concepts of simulated agents versus analogous concepts of their own. That would let us figure out, for example, the goals of a language model: use supervised probing to understand how the goals of simulated agents are represented, then apply the general mapping from representations of simulatee concepts to representations of the model's own analogous concepts, which we developed for truth (if we're lucky and it generalizes).

See Point 13 in Additional Ideas: https://www.lesswrong.com/posts/bFwigCDMC5ishLz7X/rfc-possible-ways-to-expand-on-discovering-latent-knowledge#Additional_ideas_that_came_up_while_writing_this_post_
