
specificityplus's People

Contributors

davidbau, esbenkc, fbarez, jas-ho, juliahpersson, kmeng01


specificityplus's Issues

Add errorbars to barplots grouped by dataset

#47 adds barplots grouped by dataset. These plots are still missing errorbars. Asymmetric errorbars for pandas barplots turn out to be surprisingly tricky to get right; hence the separate issue.
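
For illustration, here is a minimal sketch of one way to do this by dropping down to matplotlib instead of pandas' DataFrame.plot.bar, which makes per-group asymmetric yerr awkward. All dataset/algorithm names and numbers below are made up, not taken from our results.

    import numpy as np
    import matplotlib.pyplot as plt

    datasets = ["dataset_a", "dataset_b"]   # hypothetical dataset names
    algos = ["algo_1", "algo_2"]            # hypothetical edit algorithms
    means = np.array([[0.82, 0.78],         # means[i, j]: algo i on dataset j (dummy values)
                      [0.85, 0.80]])
    lower = np.array([[0.03, 0.04], [0.02, 0.05]])  # distance below the mean
    upper = np.array([[0.02, 0.03], [0.04, 0.02]])  # distance above the mean

    x = np.arange(len(datasets))
    width = 0.35
    fig, ax = plt.subplots()
    for i, algo in enumerate(algos):
        # yerr as a (2, N) array: first row = lower errors, second row = upper errors
        ax.bar(x + i * width, means[i], width,
               yerr=np.vstack([lower[i], upper[i]]), capsize=4, label=algo)
    ax.set_xticks(x + width / 2)
    ax.set_xticklabels(datasets)
    ax.legend()
    fig.savefig("barplot_with_errorbars.png")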

Run and evaluate all edit algos on a given test set batch

Task Description: Ease-of-use enhancement allowing end-to-end experiments with a single call

Background

Our experiments require the outputs (incl. next token log probs) of the unedited model and multiple edit algorithms. Currently, the evaluation is a separate step and requires a separate call for each algorithm, leading to inefficiencies in computation and mental overhead.

Goal

To increase efficiency and ease of handling, we aim to evaluate a given test dataset batch directly for all edit algorithms and the unedited model. This dispenses with the need for transferring log prob files between the compute node and DFS when running on a cluster, and makes it easier to quickly experiment with all edit algorithms on a small slice of the test data.

Tasks

  • Chain the runs of multiple edit algorithms and the unedited model on a given test dataset batch (see the sketch after this list)
  • Perform the evaluation of the test dataset batch
  • Adapt the analysis and plotting scripts to combine outputs for several test dataset batches
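
A rough sketch of the chaining idea: one driver loops over the unedited baseline and every edit algorithm for a single batch, so a single call produces all outputs side by side. The evaluate.py flags shown here and the "IDENTITY" name for the unedited baseline are illustrative assumptions, not the current interface.

    import subprocess

    ALGORITHMS = ["IDENTITY", "ROME", "MEMIT"]  # "IDENTITY" = unedited baseline (hypothetical name)

    def run_all_algos(batch_size: str) -> None:
        """Run evaluation for the unedited model and every edit algorithm on one batch."""
        for alg in ALGORITHMS:
            subprocess.run(
                ["python", "experiments/evaluate.py",
                 "--alg_name", alg,                     # illustrative flag
                 "--dataset_size_limit", batch_size],   # illustrative flag
                check=True,
            )

    if __name__ == "__main__":
        run_all_algos(batch_size="100")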

Store info about base model inside the run-dir

Currently, experiments/evaluate.py does not store info about which LLM has been used anywhere (see the screenshot below for an overview of which information is currently available).

Goal: Make sure the model that was used gets logged inside the run_dir. Best would be to log it in the *case_*.json files, since then it would be traceable at the level of individual test cases (in principle, edits for different LLMs could end up in the same run_dir if the --continue_from_run flag is used).
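
A minimal sketch of what the proposed logging could look like, assuming a dict-shaped per-case result written to case_{case_id}.json; the "model_name" key and the helper below are illustrative, not existing code.

    import json
    from pathlib import Path

    def save_case_result(run_dir: Path, case_id: int, metrics: dict, model_name: str) -> None:
        """Write one test case's results, including the base model used for the edit."""
        record = {
            "case_id": case_id,
            "model_name": model_name,  # e.g. "gpt2-xl"; the new field proposed in this issue
            **metrics,                 # existing evaluation results ("pre"/"post", ...)
        }
        (run_dir / f"case_{case_id}.json").write_text(json.dumps(record, indent=2))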

Improve CLI for evaluating experimental results

The reporting of experimental results in summarize.py should be improved to facilitate a more in-depth analysis:

  • add case ID etc. to the reporting
  • make it easier to save evaluation results in a reasonable format (JSONL; see the sketch below)
  • expose all arguments of summarize.main in the CLI
  • allow reports to be read into a pandas DataFrame easily
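
As a sketch of the JSONL round trip (field handling is illustrative, not the actual summarize.py schema):

    import json
    import pandas as pd

    def save_report_jsonl(records: list[dict], path: str) -> None:
        """Write one JSON object per line so reports can be appended to and streamed."""
        with open(path, "w") as f:
            for rec in records:
                f.write(json.dumps(rec) + "\n")

    # Reading the report back into a pandas DataFrame is then a one-liner:
    # df = pd.read_json("report.jsonl", lines=True)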

Make evaluation deterministic

Current state

The logits are slightly different from one run to the next.

Goal

Make the evaluation deterministic so it's easier to see when we're introducing actual bugs.
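
A sketch of the usual PyTorch determinism knobs, assuming the run-to-run differences come from CUDA kernels and unseeded RNGs rather than from the evaluation logic itself:

    import random
    import numpy as np
    import torch

    def make_deterministic(seed: int = 0) -> None:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Force deterministic kernels; raises an error for ops without a deterministic variant.
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.benchmark = False
        # Some cuBLAS ops also need this env var set before the CUDA context is created:
        # os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"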

Summarize and plot chunked results

Goal: take the output of runs of our e2e.py script and produce statistics and plots for our paper.

Output of e2e.py: results.csv files in the "combined" subfolder of our results directory.

To do: write a new script e2e_analyze.py which produces statistics and plots using these four lines from experiments/analyze.py:

    dfs = get_statistics(df)
    print(format_statistics(dfs))
    plot_statistics(dfs)
    export_statistics(dfs)
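
One possible shape for e2e_analyze.py, assuming the results.csv files live under <results_dir>/**/combined/ and that the four functions keep their signatures from experiments/analyze.py:

    from pathlib import Path

    import pandas as pd

    from experiments.analyze import (
        export_statistics, format_statistics, get_statistics, plot_statistics,
    )

    def main(results_dir: str = "results") -> None:
        # Collect the per-batch results.csv files produced by e2e.py and concatenate them.
        csv_paths = sorted(Path(results_dir).glob("**/combined/results.csv"))
        df = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)

        dfs = get_statistics(df)
        print(format_statistics(dfs))
        plot_statistics(dfs)
        export_statistics(dfs)

    if __name__ == "__main__":
        main()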

Allow to benchmark the unedited model

The code in experiments/evaluate.py behaves differently in the MEMIT repo compared to the ROME repo:

  • in the ROME repo, every evaluation run executes both the edited and the unedited model on every test case; results for the former are stored under the "post" key in the case_{case_id}.json file, results for the latter under the "pre" key.
  • in the MEMIT repo, it seems to run only the edited model on every test case. This is more efficient, since we don't need to run the unedited model again for every memory-editing algo.

Goal: allow running the unedited model as well via the CLI.
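
One way the CLI could expose this, sketched with argparse; the flag name --eval-unedited and the wiring are assumptions, not the repo's actual interface:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--eval-unedited",
        action="store_true",
        help="Also run the unedited base model on every test case and store its "
             "results under the 'pre' key of case_{case_id}.json (ROME-style).",
    )
    args = parser.parse_args()

    # Later, in the evaluation loop (pseudocode):
    # post_metrics = evaluate(edited_model, case)   # always computed
    # pre_metrics = evaluate(base_model, case) if args.eval_unedited else None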

Implement specificity++

Goal

Allow evaluating specificity on CounterFact with the edit prompt prepended to the test prompt, using the KL divergence between the edited and unedited model as the metric.

Details

  • build on the code from jas-ho/romepp#1
  • make sure we still get all three numbers (specificity, sp+, sp++) at the same time for comparison
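
A sketch of the sp++ metric idea: the KL divergence between the edited and unedited model's next-token distributions for a specificity prompt with the edit prompt prepended. The function below assumes HuggingFace-style causal LMs and is illustrative only, not the code from jas-ho/romepp#1.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def next_token_kl(model_edited, model_unedited, tok, edit_prompt: str, test_prompt: str) -> float:
        # Prepend the edit prompt to the specificity test prompt.
        ids = tok(f"{edit_prompt} {test_prompt}", return_tensors="pt").input_ids

        logp_edited = F.log_softmax(model_edited(ids).logits[0, -1], dim=-1)
        logp_unedited = F.log_softmax(model_unedited(ids).logits[0, -1], dim=-1)

        # KL(edited || unedited) over the next-token distribution.
        return F.kl_div(logp_unedited, logp_edited, log_target=True, reduction="sum").item()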
