
representation-engineering's People

Contributors

andyzoujm, justinphan3110, justinphan3110cais, justinwangx, mxl1n, y-l-liu, yangfy0608


representation-engineering's Issues

Unexpected behavior in the honesty example notebook

Hi team,

Thank you for your great project! I recently ran your honesty example notebook and tried to work through the provided example. However, in the last two cells of honesty.ipynb, I get no meaningful output on the control side. I made no adjustments to the code and used the 7B model. Any ideas would be helpful!

Thanks

[screenshot: ours_control]

How to automate the threshold parameter in the honesty example?

We realized we have to manually adjust the threshold parameter in the honesty example in order to get acceptable results. What is the plan for automating the threshold adjustment? Manual tuning is neither user-friendly nor practical. (One possible approach is sketched below.)
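
One possible direction, as a hypothetical sketch rather than anything in the repo's API: pick the threshold on a small labeled validation set instead of by hand. Here val_scores and val_labels are assumed arrays of per-statement honesty scores and ground-truth labels.

import numpy as np

def select_threshold(val_scores, val_labels, n_candidates=101):
    # sweep candidate thresholds over the observed score range and keep
    # the one with the highest validation accuracy
    candidates = np.linspace(val_scores.min(), val_scores.max(), n_candidates)
    accs = [np.mean((val_scores > t) == val_labels) for t in candidates]
    return candidates[int(np.argmax(accs))]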

Question about the honesty scores calculation

Hi

In your honesty score calculation, what is the justification for

results[pos][0][layer][0] * honesty_rep_reader.direction_signs[layer][0]

Why do you need to multiply by the direction sign, rather than just using results[pos][0][layer][0]?

Thanks
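
For context, a minimal sketch of why the sign flip matters, relying only on the standard fact that PCA components are defined up to sign:

import numpy as np

# made-up projections of the same honest statement at two layers; PCA may
# orient the component either way per layer
projections = np.array([0.8, -0.7])
signs = np.array([1, -1])        # per-layer signs estimated from labeled pairs
aligned = projections * signs    # both positive now, so scores are comparable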

Understanding the contrast vector implementation

As far as I can tell, this is the part where the Contrast Vector is applied:

if layer_id in control_layer_ids:
    # difference between positive and negative hidden states over the contrast span
    activations = alpha * (hidden_states_p[:, contrast_tokens:] - hidden_states_n[:, contrast_tokens:])
    c_length = activations.shape[1]
    hidden_states[:, -c_length:, :] += activations
    hidden_states_p[:, -c_length:, :] += activations
    hidden_states_n[:, -c_length:, :] += activations

I asked myself why not only hidden_states is changed, but also hidden_states_p and hidden_states_n. Could you elaborate on that?

Btw: since contrast_tokens is chosen to be negative, is it not always equal to c_length?
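
A quick check of that last question, as a toy sketch (assuming the sequence is at least as long as -contrast_tokens): with contrast_tokens = -4, the slice keeps the last 4 positions, so c_length equals -contrast_tokens.

import torch

hidden_states_p = torch.randn(1, 10, 8)   # (batch, seq, hidden), toy values
contrast_tokens = -4
activations = hidden_states_p[:, contrast_tokens:]
c_length = activations.shape[1]
assert c_length == -contrast_tokens       # 4 == 4 here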

Contrast Vector - Add Code for Generation?

Hi,

I was wondering if code demonstrating how to use the contrast vector for text generation could be added (ideally for a 7B model).

I am quite unclear on how to adapt the contrast vector code from honesty_control_TQA.ipynb for generation, and I find it difficult to parse all of the contrast vector implementation details from the paper.

This would be very useful, so would really appreciate it!

Thanks,
Robert

Question about layer_id

Hi! When I run your notebook, I wonder how I can choose layer_id for emotion control?

Looking forward to hearing your answer, thank you!

Performance on the harmfulness experiment is too high?

Hello authors,
Your experiment on harmfulness classification (https://github.com/andyzoujm/representation-engineering/blob/main/examples/harmless_harmful/harmless_llama2.ipynb) shows that Llama-2-13b-chat achieves near-100% accuracy, even in the lower layers.
I have tried more models: Llama-2-{7,70}b-chat, llama-2-7b, bloomz-{560m,1b1,1b7,3b,7b1}, and bloom-7b1; all of them also achieve near-perfect performance.
I think the phenomenon is trivial for chat models, because harmful and harmless inputs correspond to the two directions of refusing to answer versus answering; however, for models without alignment, like bloom/llama-2, it is nontrivial.
(By the way, in my experiment I did not include the task template in the prompt; instead, I used the stimuli directly, and I used the ClusterMeanRepReader method.)

Are there any more plausible explanations for the phenomenon?

CLIP Examples for Emotion Classification

Hi! Thanks for maintaining the codebase with thorough documentation and comprehensive examples! Is it possible to add the CLIP examples from Appendix B.5? Also, could you clarify how the emotion directions are obtained: are they derived from text and then used to classify images?

Thanks a lot in advance!

Assert error "NaN in output logprobs"

Hi! When I run examples/honesty/honesty_control_TQA.ipynb with llama2-7b-chat-hf, it throws an assertion error at line 44:
assert np.isnan(output_logprobs).sum() == 0, "NaN in output logprobs"
I have already set layer_ids = np.arange(8, 32, 3).
Hoping to get your help.

Dataset in example honest notebook

Hi,

I am wondering why the training dataset in the honesty reading notebook has unfinished sentences as inputs, with labels containing [True, False]. How are the inputs and labels connected?

Thanks

Question about emotion_function

Hi, I have some questions about the emotion_function notebook.

  1. Why did you choose mistralai/Mistral-7B-Instruct-v0.1 in this notebook instead of the Llama2 model used in emotion_concept?
  2. In the primary_emotions_function_dataset function, you used all_truncated_outputs; I would like to know how this was obtained.
  3. Also, for this function, if I want to build a test set, can I construct it the same way as in primary_emotions_concept_dataset?

Thanks for your help!

Question about max and min function in LAT reading

Hi,

What is the purpose of this in your rep_reader.py?

pca_outputs_min = np.mean([o[train_labels[i].index(1)] == min(o) for i, o in enumerate(pca_outputs_comp)])
pca_outputs_max = np.mean([o[train_labels[i].index(1)] == max(o) for i, o in enumerate(pca_outputs_comp)])

layer_signs[component_index] = np.sign(np.mean(pca_outputs_max) - np.mean(pca_outputs_min))

Then, in your inference code on the test data, why do you evaluate like this:

sign = rep_reader.direction_signs[layer][component_index]
eval_func = min if sign == -1 else max
cors = np.mean([eval_func(H) == H[0] for H in H_test])

It is not clearly stated in the paper either.

Thanks!
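
A minimal sketch of what those two snippets appear to do, under one reading of rep_reader.py: the training-side code estimates, per PCA component, whether the "positive" example in each training pair tends to receive the maximum or the minimum projection, and stores that as a ±1 sign; the test-side code then applies the matching min/max as the decision rule. Toy values only, not from the repo:

import numpy as np

# each row is one contrast pair; column 0 is the "positive" example
H_test = np.array([[0.9, 0.1], [0.8, 0.3]])
sign = 1                                   # suppose training said "positive = max"
eval_func = min if sign == -1 else max
accuracy = np.mean([eval_func(H) == H[0] for H in H_test])  # 1.0 here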

Some questions for the authors about Step 1 of LAT

I would like to ask the authors: in Step 1 (Designing Stimulus and Task) of LAT, the template is

USER: <experimental/reference prompt>
ASSISTANT:

Are the experimental and reference prompts of this template both used as stimulus input, or is one used as input and the other as output? You designate this template for a function f as Tf+ when using the experimental prompt and Tf− when using the reference prompt. What is the function f here? Could the authors help me resolve my doubts? Thank you!

Performance Enhancement

Andy and the team:

We made two performance enhancements, Flash Attention and int8 quantization, which make execution 4-5 times faster. Please let us know if we may contribute the source code back to the community.

Regards

Founder of ReparteeAI

Danny

Some confusion when reading your paper

  1. Do all pairs in a stimulus set share the two sides of one concept? Or can a stimulus set include many kinds of aspects; e.g., in the emotion analysis section, did you extract all of Happiness, Sadness, Anger, Fear, Surprise, and Disgust simultaneously?
  2. What do the "last tokens" in Figure 5 mean? It seems each word in '\n' 'The' 'amount' 'of' 'happiness' 'in' 'the' 'scenario' 'is' could be the last word.
  3. Regarding H(si) − H(si+1): how is H formed?
  4. What do you mean by the Y-axis of Figure 7? It seems that "few shot" is a kind of concept-representation method, which is beyond my understanding.

Forgive my English; I would appreciate your answer, and it would be great if you could provide a schedule of code updates.

Enhancing RepControl by introducing the pca_model's `explained_variance_ratio_`

Currently, after training the rep_reader, the coeff variable used in the control pipeline has to be tuned purely by experiment, and the value varies a lot from model to model. Taking primary_emotions as an example, here are the values I found:

# LLaMA-2-Chat-13B coeff=3.0-3.5
# mistralai/Mistral-7B-Instruct-v0.1 coeff=0.5
# HuggingFaceH4/zephyr-7b-beta coeff=0.3
# openchat/openchat_3.5 coeff=0.2

This makes it challenging for RepControl to adapt to new models.

My finding is that introducing the pca_model's explained_variance_ratio_ into the control process makes the manipulation more "gentle"/"accurate".

Here are the key modifications. In rep_readers.py:

def get_rep_directions(self, model, tokenizer, hidden_states, hidden_layers, **kwargs):
    """Get PCA components for each layer"""
    directions = {}

    # like directions, save the variance ratio for each layer
    variance_ratio = {}

    for layer in hidden_layers:

        ........

        self.n_components = pca_model.n_components_
        variance_ratio[layer] = pca_model.explained_variance_ratio_

    self.variance_ratio = variance_ratio
    return directions

Each layer's variance_ratio describes how much of the activation variance each direction explains, which can be interpreted as a "confidence" score for that layer in the control step.

So, when manipulating the output, the activation variable is calculated as:

coeff = 0.2
coeff_with_variance = 2.0

activations = {}
activations_with_variance = {}

for layer in layer_id:
    # original scheme: a fixed, hand-tuned coefficient
    activations[layer] = torch.tensor(coeff * rep_reader.directions[layer] * rep_reader.direction_signs[layer]).to(model.device).half()

    # proposed scheme: additionally scale by the first component's explained variance ratio
    variance_ratio = rep_reader.variance_ratio[layer][0]
    activations_with_variance[layer] = torch.tensor(coeff_with_variance * rep_reader.directions[layer] * rep_reader.direction_signs[layer] * variance_ratio).to(model.device).half()

Applying this method seems to let all the 7B models I've tested share a common coeff value of approximately 2.0.

Theoretically, I came up with this idea when I saw that WrappedBlock uses the controller (activations) to manipulate the tensor in a simple linear way, so I just took the variance_ratio into account in the simplest possible way. Perhaps extracting the PCA model's underlying singular values could give even better control; see the sketch below.
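
A minimal sketch of that follow-up idea, assuming an sklearn PCA object (singular_values_ is a real sklearn attribute, but using the per-component standard deviation as the control scale is only a hypothesis, not code from this repo):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(256, 64)              # toy recentered hidden states
pca = PCA(n_components=1).fit(X)

direction = pca.components_[0]             # unit-norm direction
# standard deviation of the data along the component, recovered from the
# singular value; a candidate replacement for explained_variance_ratio_
component_std = pca.singular_values_[0] / np.sqrt(len(X) - 1)
scaled_direction = direction * component_std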

Thanks for sharing this great work!

Dataset in example honest notebook

I notice that in the paper and the example Jupyter code, the output of ASSISTANT (the response statement) is truncated. I would like to know the reason. Thank you so much!

Confused about data leakage in honesty_control_TQA.ipynb

I am confused by the "Contrast Vector Control" code in honesty_control_TQA.ipynb.
You use the correct answer as input, along with the question, to get your direction vectors. Doesn't this constitute data leakage?
I don't know whether I have fully understood the code; please help me resolve this.

inputs_neg_s, masks_neg_s, split_neg = prepare_decoder_only_inputs(q_batch_neg, a_batch, tokenizer, model.model.device)
split = inputs_neg_s['input_ids'].shape[1] - split_neg

for layer_id in layer_ids:

    with torch.no_grad():

        # run the positive and negative prompts and collect activations
        _ = wrapped_model(**inputs_pos_s)
        pos_outputs = wrapped_model.get_activations(layer_ids, block_name=block_name)
        _ = wrapped_model(**inputs_neg_s)
        neg_outputs = wrapped_model.get_activations(layer_ids, block_name=block_name)
        # direction = difference of positive and negative activations over the answer span
        directions[layer_id] += coeff * (pos_outputs[layer_id][:, -split:] - neg_outputs[layer_id][:, -split:]) / len(templates)

        wrapped_model.reset()
        wrapped_model.set_controller([l for l in layer_ids if l <= layer_id], directions,
                                     masks=masks[:, -split:, None],
                                     token_pos="end",
                                     normalize=False)

Questions when using honesty.ipynb on llama-7b-chat-hf

Hi, thanks for your great contribution.

I have met some problems when trying to run honesty.ipynb with llama2-7b-chat-hf.

When I load the model with dtype bfloat16, the hidden layers contain NaN or inf values, so the program fails during PCA.

When I load the model with dtype float32, the data is NaN or inf and the program fails during the generation test.

Hoping to get your help!
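
One thing worth trying, as a hedged suggestion rather than a confirmed fix: load the model in float16, which is what the .half() calls elsewhere in the examples appear to assume.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",   # assumed model id
    torch_dtype=torch.float16,
    device_map="auto",
)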

Wizard-Vicuna-30B-Uncensored

Since the 'ehartford/Wizard-Vicuna-30B-Uncensored' model files total over 120 GB, does that mean my computer's RAM also needs to be greater than 120 GB to load it? If so, are there any smaller models you would recommend?

Is the assignment of untruthful_statements wrong? (code in examples/honesty/utils.py)

true_statements = df[df['label'] == 1]['statement'].values.tolist()
false_statements = df[df['label'] == 0]['statement'].values.tolist()

template_str = "Pretend you're {type} person making statements about the world."
honest_statements = []
untruthful_statements = []

# Process statements
for statement in true_statements:
    tokens = tokenizer.tokenize(statement)
    
    for idx in range(1, len(tokens) - 5):
        truncated_tokens = tokens[:idx]
        truncated_statement = tokenizer.convert_tokens_to_string(truncated_tokens)

        honest_statements.append(f"{user_tag} {template_str.format(type='an honest')} {assistant_tag} " + truncated_statement)
        untruthful_statements.append(f"{user_tag} {template_str.format(type='an untruthful')} {assistant_tag} " + truncated_statement)

false_statements not being used?

In examples/honesty/utils.py (shown in the issue above), false_statements is assigned but never used afterwards. Is that intentional?
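
A purely hypothetical sketch of what the questioner may be expecting, mirroring the loop above but feeding false_statements into the untruthful side; this is not the repo's actual code:

# hypothetical variant (not from the repo): build the untruthful side
# from false_statements instead of truncating true ones
for statement in false_statements:
    tokens = tokenizer.tokenize(statement)
    for idx in range(1, len(tokens) - 5):
        truncated_statement = tokenizer.convert_tokens_to_string(tokens[:idx])
        untruthful_statements.append(f"{user_tag} {template_str.format(type='an untruthful')} {assistant_tag} " + truncated_statement)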

About Harmlessness concept and controlling

Hi authors,

I have some questions about the harmlessness concept and its control:

  1. I would like to know how to construct the stimulus and how to get the hidden states. You mentioned that you adopt the format below:
    [image: prompt template] Therefore, only the instruction part varies, depending on the type of instruction (harmful or harmless). Additionally, you take the hidden states of the final token ([/INST] or ASSISTANT). Do I understand correctly?

  2. You mentioned that you also utilize harmless instructions from ShareGPT and construct contrast pairs to obtain more significant activations. When predicting whether a specific input is harmful or harmless, do we still need a contrast pair instead of just a single input, like the procedure shown in https://github.com/andyzoujm/representation-engineering/tree/main/examples/honesty?

How to prompt Llama2-13b-chat to generate false answers?

In Section 4.1, the paper mentions three ways to obtain stimuli for truthfulness:

(1) Fifty examples from the ARC-Challenge training set, (2) Five examples generated by the LLaMA-2-Chat-13B model in response to requests for question-answer pairs with varying degrees of truthfulness, (3) The six QA primer examples used in the original implementation, each of which is paired with a false answer generated by LLaMA-2-Chat-13B.

I wonder, for the second and third ways, how you prompt Llama-2-13b-chat to generate false answers?

Add projection operation

I have been experimenting with the linear, piece-wise, and projection operations for representation control. It would be useful to have the projection operation available for reference, to make sure my implementation matches what was used in the paper, but it is currently not implemented:

elif operator == 'projection':
    def op(current, controller):
        raise NotImplementedError
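
For reference, a minimal sketch of one common projection operator (removing the component of the activation along the control direction); this is a guess at the intended semantics under the surrounding dispatch code, not the paper's confirmed implementation:

elif operator == 'projection':
    def op(current, controller):
        # normalize the control direction, then subtract the component of
        # `current` that lies along it, leaving the orthogonal part
        direction = controller / torch.norm(controller)
        proj = (current * direction).sum(dim=-1, keepdim=True) * direction
        return current - proj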

Accelerate the rep-reading

I found that the computation in the reading pipeline is super slow. I moved the projection and recentering onto the GPU, which makes it run 20x faster. May I open a pull request?

def project_onto_direction(H, direction):
    """Project matrix H (n, d_1) onto direction vector (d_2,)."""
    # Ensure H and direction are on the same device (CPU or GPU)
    device = H.device
    if not isinstance(direction, torch.Tensor):
        direction = torch.Tensor(direction)
    direction = direction.to(device)
    # Magnitude of the direction vector
    mag = torch.norm(direction)
    assert not torch.isinf(mag).any()
    # Calculate the projection
    projection = H.matmul(direction) / mag
    return projection

def recenter(x, mean=None):
    if mean is None:
        mean = torch.mean(x, axis=0, keepdims=True)
    else:
        mean = torch.Tensor(mean).cuda()
    return x - mean

n_difference parameter with clustermean?

Hi, Thank you for making the code directly available!
I have a question about the clustermean method. The code (repe.rep_reading_pipeline, l. 106) contains the statement below, which I think never triggers, since the method name used there is 'cluster_mean'. Independently of that, just to make sure I understand: shouldn't it be n_difference == 0?

if direction_method == 'clustermean':
    assert n_difference == 1, "n_difference must be 1 for clustermean"

Thank you so much already!
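
The intuition behind the question, as a minimal sketch with made-up data: cluster-mean derives the direction from the difference of the two class means itself, so no pairwise differencing of hidden states seems needed beforehand (hence the expectation of n_difference == 0):

import numpy as np

H = np.random.randn(10, 8)        # toy hidden states, one row per example
labels = np.array([1, 0] * 5)     # two clusters
# the cluster-mean direction already contains the "difference" step
direction = H[labels == 1].mean(0) - H[labels == 0].mean(0)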

How to use RAG in Repe?

We are working on a POC that requires RAG. We found that RepE uses a Hugging Face transformers custom pipeline, but LangChain's Hugging Face integration does not support transformers custom pipelines. Any suggestions?

Train large model on multiple GPUs. You can't train a model that has been loaded with `device_map='auto'` in any distributed mode.

I am not able to train a larger model on two GPUs; does anyone know how to fix this with deepspeed?

I tried llama2_lorra.py with the script llama_lorra_tqa_7b.sh on 4 GPUs, though I know --num_gpus=1 fixes the issue.

ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.
{'tqa_accuracy': 0.31334149326805383, 'arc-e_accuracy': 0.6614035087719298}
deepspeed --master_port $ds_master_port --num_gpus=1 src/llama2_lorra.py
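
A hedged workaround, inferred from the error message rather than from the repo's scripts: under multi-process training, load the model without device_map='auto' and let the launcher (deepspeed/accelerate) place it; device_map='auto' shards the model for single-process use and is incompatible with distributed training.

import torch
from transformers import AutoModelForCausalLM

# hypothetical loading call; the key point is dropping device_map="auto"
# when launching with more than one process
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",    # assumed model id
    torch_dtype=torch.bfloat16,
)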
