
specificityplus's People

Contributors

davidbau, esbenkc, fbarez, jas-ho, juliahpersson, kmeng01


specificityplus's Issues

Add errorbars to barplots grouped by dataset

#47 adds barplots grouped by dataset. These plots are still missing errorbars. Asymmetric errorbars for pandas barplots turn out to be surprisingly tricky to get right; hence the separate issue.
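
For illustration, here is a minimal sketch of one way to do this by dropping down to matplotlib instead of pandas' DataFrame.plot.bar, which makes per-group asymmetric yerr awkward. All dataset/algorithm names and numbers below are made up, not taken from our results.

    import numpy as np
    import matplotlib.pyplot as plt

    datasets = ["dataset_a", "dataset_b"]   # hypothetical dataset names
    algos = ["algo_1", "algo_2"]            # hypothetical edit algorithms
    means = np.array([[0.82, 0.78],         # means[i, j]: algo i on dataset j (dummy values)
                      [0.85, 0.80]])
    lower = np.array([[0.03, 0.04], [0.02, 0.05]])  # distance below the mean
    upper = np.array([[0.02, 0.03], [0.04, 0.02]])  # distance above the mean

    x = np.arange(len(datasets))
    width = 0.35
    fig, ax = plt.subplots()
    for i, algo in enumerate(algos):
        # yerr as a (2, N) array: first row = lower errors, second row = upper errors
        ax.bar(x + i * width, means[i], width,
               yerr=np.vstack([lower[i], upper[i]]), capsize=4, label=algo)
    ax.set_xticks(x + width / 2)
    ax.set_xticklabels(datasets)
    ax.legend()
    fig.savefig("barplot_with_errorbars.png")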

Run and evaluate all edit algos on a given test set batch

Task Description: Ease-of-use enhancement allowing end-to-end experiments with a single call

Background

Our experiments require the outputs (incl. next token log probs) of the unedited model and multiple edit algorithms. Currently, the evaluation is a separate step and requires a separate call for each algorithm, leading to inefficiencies in computation and mental overhead.

Goal

To increase efficiency and ease of handling, we aim to evaluate a given test dataset batch directly for all edit algorithms and the unedited model. This dispenses with the need for transferring log prob files between the compute node and DFS when running on a cluster, and makes it easier to quickly experiment with all edit algorithms on a small slice of the test data.

Tasks

  • Chain the runs of multiple edit algorithms and the unedited model on a given test dataset batch (see the sketch after this list)
  • Perform the evaluation of the test dataset batch
  • Adapt the analysis and plotting scripts to combine outputs for several test dataset batches
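
A rough sketch of the chaining idea: one driver loops over the unedited baseline and every edit algorithm for a single batch, so a single call produces all outputs side by side. The evaluate.py flags shown here and the "IDENTITY" name for the unedited baseline are illustrative assumptions, not the current interface.

    import subprocess

    ALGORITHMS = ["IDENTITY", "ROME", "MEMIT"]  # "IDENTITY" = unedited baseline (hypothetical name)

    def run_all_algos(batch_size: str) -> None:
        """Run evaluation for the unedited model and every edit algorithm on one batch."""
        for alg in ALGORITHMS:
            subprocess.run(
                ["python", "experiments/evaluate.py",
                 "--alg_name", alg,                     # illustrative flag
                 "--dataset_size_limit", batch_size],   # illustrative flag
                check=True,
            )

    if __name__ == "__main__":
        run_all_algos(batch_size="100")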

Store info about base model inside the run-dir

Currently, experiments/evaluate.py does not store info about which LLM has been used anywhere (see the screenshot below for an overview of which information is currently available).

Goal: Make sure the model that was used gets logged inside the run_dir. Best would be to log it in the *case_*.json files, since then it would be traceable at the level of individual test cases (in principle, edits for different LLMs could end up in the same run_dir if the --continue_from_run flag is used).
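
A minimal sketch of what the proposed logging could look like, assuming a dict-shaped per-case result written to case_{case_id}.json; the "model_name" key and the helper below are illustrative, not existing code.

    import json
    from pathlib import Path

    def save_case_result(run_dir: Path, case_id: int, metrics: dict, model_name: str) -> None:
        """Write one test case's results, including the base model used for the edit."""
        record = {
            "case_id": case_id,
            "model_name": model_name,  # e.g. "gpt2-xl"; the new field proposed in this issue
            **metrics,                 # existing evaluation results ("pre"/"post", ...)
        }
        (run_dir / f"case_{case_id}.json").write_text(json.dumps(record, indent=2))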

Improve CLI for evaluating experimental results

The reporting of experimental results in summarize.py should be improved to facilitate a more in-depth analysis:

  • add case ID etc. to the reporting
  • make it easier to save evaluation results in a reasonable format (JSONL; see the sketch below)
  • expose all arguments of summarize.main in the CLI
  • allow reports to be read into a pandas DataFrame easily
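
As a sketch of the JSONL round trip (field handling is illustrative, not the actual summarize.py schema):

    import json
    import pandas as pd

    def save_report_jsonl(records: list[dict], path: str) -> None:
        """Write one JSON object per line so reports can be appended to and streamed."""
        with open(path, "w") as f:
            for rec in records:
                f.write(json.dumps(rec) + "\n")

    # Reading the report back into a pandas DataFrame is then a one-liner:
    # df = pd.read_json("report.jsonl", lines=True)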

Make evaluation deterministic

Current state

The logits are slightly different from one run to the next.

Goal

Make the evaluation deterministic so it's easier to see when we're introducing actual bugs.
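
A sketch of the usual PyTorch determinism knobs, assuming the run-to-run differences come from CUDA kernels and unseeded RNGs rather than from the evaluation logic itself:

    import random
    import numpy as np
    import torch

    def make_deterministic(seed: int = 0) -> None:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Force deterministic kernels; raises an error for ops without a deterministic variant.
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.benchmark = False
        # Some cuBLAS ops also need this env var set before the CUDA context is created:
        # os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"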

Summarize and plot chunked results

Goal: take the output of runs of our e2e.py script and produce statistics and plots for our paper.

Output of e2e.py: results.csv files in the "combined" subfolder of our results directory.

To do: write a new script e2e_analyze.py which produces statistics and plots using these four lines from experiments/analyze.py:

    dfs = get_statistics(df)
    print(format_statistics(dfs))
    plot_statistics(dfs)
    export_statistics(dfs)
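
One possible shape for e2e_analyze.py, assuming the results.csv files live under <results_dir>/**/combined/ and that the four functions keep their signatures from experiments/analyze.py:

    from pathlib import Path

    import pandas as pd

    from experiments.analyze import (
        export_statistics, format_statistics, get_statistics, plot_statistics,
    )

    def main(results_dir: str = "results") -> None:
        # Collect the per-batch results.csv files produced by e2e.py and concatenate them.
        csv_paths = sorted(Path(results_dir).glob("**/combined/results.csv"))
        df = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)

        dfs = get_statistics(df)
        print(format_statistics(dfs))
        plot_statistics(dfs)
        export_statistics(dfs)

    if __name__ == "__main__":
        main()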

Allow to benchmark the unedited model

The code in experiments/evaluate.py behaves differently in the MEMIT repo compared to the ROME repo:

  • in the ROME repo, every evaluation run executes both the edited and the unedited model on every test case; results for the former are stored under the "post" key in the case_{case_id}.json file, results for the latter under the "pre" key.
  • in the MEMIT repo, it seems to run only the edited model on every test case. This is more efficient, since we don't need to run the unedited model again for every memory-editing algo.

Goal: allow running the unedited model as well via the CLI.
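
One way the CLI could expose this, sketched with argparse; the flag name --eval-unedited and the wiring are assumptions, not the repo's actual interface:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--eval-unedited",
        action="store_true",
        help="Also run the unedited base model on every test case and store its "
             "results under the 'pre' key of case_{case_id}.json (ROME-style).",
    )
    args = parser.parse_args()

    # Later, in the evaluation loop (pseudocode):
    # post_metrics = evaluate(edited_model, case)   # always computed
    # pre_metrics = evaluate(base_model, case) if args.eval_unedited else None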

Implement specificity++

Goal

Allow evaluating specificity on CounterFact with the edit prompt prepended to the test prompt, using the KL divergence between the edited and unedited model as the metric.

Details

  • build on the code from jas-ho/romepp#1
  • make sure we still get all three numbers (specificity, sp+, sp++) at the same time for comparison
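
A sketch of the sp++ metric idea: the KL divergence between the edited and unedited model's next-token distributions for a specificity prompt with the edit prompt prepended. The function below assumes HuggingFace-style causal LMs and is illustrative only, not the code from jas-ho/romepp#1.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def next_token_kl(model_edited, model_unedited, tok, edit_prompt: str, test_prompt: str) -> float:
        # Prepend the edit prompt to the specificity test prompt.
        ids = tok(f"{edit_prompt} {test_prompt}", return_tensors="pt").input_ids

        logp_edited = F.log_softmax(model_edited(ids).logits[0, -1], dim=-1)
        logp_unedited = F.log_softmax(model_unedited(ids).logits[0, -1], dim=-1)

        # KL(edited || unedited) over the next-token distribution.
        return F.kl_div(logp_unedited, logp_edited, log_target=True, reduction="sum").item()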
