
representation-engineering's People

Contributors

andyzoujm, justinphan3110, justinphan3110cais, justinwangx, mxl1n, y-l-liu, yangfy0608


representation-engineering's Issues

Unexpected behavior in the honesty example notebook

Hi team,

Thank you for your great project! I recently ran your honesty example notebook and tried to work through the provided example. However, in the last two cells of honesty.ipynb, I get no meaningful output on the control side. I made no adjustments to the code and used the 7B model. Any ideas would be helpful!

Thanks

[screenshot: ours_control]

How to automate the threshold parameter in the honesty example?

We realized we have to manually adjust the threshold parameter in the honesty example in order to get acceptable results. What is the plan for automating the threshold adjustment? Manual tuning is neither user-friendly nor practical. (One possible approach is sketched below.)
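
One possible direction, as a hypothetical sketch rather than anything in the repo's API: pick the threshold on a small labeled validation set instead of by hand. Here val_scores and val_labels are assumed arrays of per-statement honesty scores and ground-truth labels.

import numpy as np

def select_threshold(val_scores, val_labels, n_candidates=101):
    # sweep candidate thresholds over the observed score range and keep
    # the one with the highest validation accuracy
    candidates = np.linspace(val_scores.min(), val_scores.max(), n_candidates)
    accs = [np.mean((val_scores > t) == val_labels) for t in candidates]
    return candidates[int(np.argmax(accs))]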

Question about the honesty scores calculation

Hi

In your honesty score calculation, what is the justification for

results[pos][0][layer][0] * honesty_rep_reader.direction_signs[layer][0]

Why do you need to multiply by the direction sign, rather than just using results[pos][0][layer][0]?

Thanks
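
For context, a minimal sketch of why the sign flip matters, relying only on the standard fact that PCA components are defined up to sign:

import numpy as np

# made-up projections of the same honest statement at two layers; PCA may
# orient the component either way per layer
projections = np.array([0.8, -0.7])
signs = np.array([1, -1])        # per-layer signs estimated from labeled pairs
aligned = projections * signs    # both positive now, so scores are comparable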

Understanding the contrast vector implementation

As far as I can tell, this is the part where the Contrast Vector is applied:

if layer_id in control_layer_ids:
    # difference between positive and negative hidden states over the contrast span
    activations = alpha * (hidden_states_p[:, contrast_tokens:] - hidden_states_n[:, contrast_tokens:])
    c_length = activations.shape[1]
    hidden_states[:, -c_length:, :] += activations
    hidden_states_p[:, -c_length:, :] += activations
    hidden_states_n[:, -c_length:, :] += activations

I asked myself why not only hidden_states is changed, but also hidden_states_p and hidden_states_n. Could you elaborate on that?

Btw: since contrast_tokens is chosen to be negative, is it not always equal to c_length?
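
A quick check of that last question, as a toy sketch (assuming the sequence is at least as long as -contrast_tokens): with contrast_tokens = -4, the slice keeps the last 4 positions, so c_length equals -contrast_tokens.

import torch

hidden_states_p = torch.randn(1, 10, 8)   # (batch, seq, hidden), toy values
contrast_tokens = -4
activations = hidden_states_p[:, contrast_tokens:]
c_length = activations.shape[1]
assert c_length == -contrast_tokens       # 4 == 4 here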

Contrast Vector - Add Code for Generation?

Hi,

I was wondering if code demonstrating how to use the contrast vector for text generation could be added (ideally for a 7B model).

I am quite unclear on how to adapt the contrast vector code from honesty_control_TQA.ipynb for generation, and I find it difficult to parse all of the contrast vector implementation details from the paper.

This would be very useful, so would really appreciate it!

Thanks,
Robert

Question about layer_id

Hi! When I run your notebook, I wonder how I can choose layer_id for emotion control?

Looking forward to hearing your answer, thank you!

Performance on the harmfulness experiment is too high?

Hello authors,
Your experiment on harmfulness classification (https://github.com/andyzoujm/representation-engineering/blob/main/examples/harmless_harmful/harmless_llama2.ipynb) shows that Llama-2-13b-chat achieves near-100% accuracy, even in the lower layers.
I have tried more models: Llama-2-{7,70}b-chat, llama-2-7b, bloomz-{560m,1b1,1b7,3b,7b1}, and bloom-7b1; all of them also achieve near-perfect performance.
I think the phenomenon is trivial for chat models, because harmful and harmless inputs correspond to the two directions of refusing to answer versus answering; however, for models without alignment, like bloom/llama-2, it is nontrivial.
(By the way, in my experiment I did not include the task template in the prompt; instead, I used the stimuli directly, and I used the ClusterMeanRepReader method.)

Are there any more plausible explanations for the phenomenon?

CLIP Examples for Emotion Classification

Hi! Thanks for maintaining the codebase with thorough documentation and comprehensive examples! Is it possible to add the CLIP examples from Appendix B.5? Also, could you clarify how the emotion directions are obtained: are they derived from text and then used to classify images?

Thanks a lot in advance!

Assert error "NaN in output logprobs"

Hi! When I run examples/honesty/honesty_control_TQA.ipynb with llama2-7b-chat-hf, it throws an assertion error at line 44:
assert np.isnan(output_logprobs).sum() == 0, "NaN in output logprobs"
I have already set layer_ids = np.arange(8, 32, 3).
Hoping to get your help.

Dataset in example honest notebook

Hi,

I am wondering why the training dataset in the honesty reading notebook has unfinished sentences as inputs, with labels containing [True, False]. How are the inputs and labels connected?

Thanks

Question about emotion_function

Hi, I have some questions about the emotion_function notebook.

  1. Why did you choose mistralai/Mistral-7B-Instruct-v0.1 in this notebook instead of the Llama2 model used in emotion_concept?
  2. In the primary_emotions_function_dataset function, you used all_truncated_outputs; I would like to know how this was obtained.
  3. Also, for this function, if I want to build a test set, can I construct it the same way as in primary_emotions_concept_dataset?

Thanks for your help!

Question about max and min function in LAT reading

Hi,

What is the purpose of this in your rep_reader.py?

pca_outputs_min = np.mean([o[train_labels[i].index(1)] == min(o) for i, o in enumerate(pca_outputs_comp)])
pca_outputs_max = np.mean([o[train_labels[i].index(1)] == max(o) for i, o in enumerate(pca_outputs_comp)])

layer_signs[component_index] = np.sign(np.mean(pca_outputs_max) - np.mean(pca_outputs_min))

Then, in your inference code on the test data, why do you evaluate like this:

sign = rep_reader.direction_signs[layer][component_index]
eval_func = min if sign == -1 else max
cors = np.mean([eval_func(H) == H[0] for H in H_test])

It is not clearly stated in the paper either.

Thanks!
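
A minimal sketch of what those two snippets appear to do, under one reading of rep_reader.py: the training-side code estimates, per PCA component, whether the "positive" example in each training pair tends to receive the maximum or the minimum projection, and stores that as a ±1 sign; the test-side code then applies the matching min/max as the decision rule. Toy values only, not from the repo:

import numpy as np

# each row is one contrast pair; column 0 is the "positive" example
H_test = np.array([[0.9, 0.1], [0.8, 0.3]])
sign = 1                                   # suppose training said "positive = max"
eval_func = min if sign == -1 else max
accuracy = np.mean([eval_func(H) == H[0] for H in H_test])  # 1.0 here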

Some questions for the authors about Step 1 of LAT

I would like to ask the authors: in Step 1 (Designing Stimulus and Task) of LAT, the template is

USER: <experimental/reference prompt>
ASSISTANT:

Are the experimental and reference prompts of this template both used as stimulus input, or is one used as input and the other as output? You designate this template for a function f as Tf+ when using the experimental prompt and Tf− when using the reference prompt. What is the function f here? Could the authors help me resolve my doubts? Thank you!

Performance Enhancement

Andy and the team:

We made two performance enhancements, Flash Attention and int8 quantization, which make execution 4-5 times faster. Please let us know if we may contribute the source code back to the community.

Regards

Founder of ReparteeAI

Danny

Some confusion when reading your paper

  1. Do all pairs in a stimulus set share the two sides of one concept? Or can a stimulus set include many kinds of aspects; e.g., in the emotion analysis section, did you extract all of Happiness, Sadness, Anger, Fear, Surprise, and Disgust simultaneously?
  2. What do the "last tokens" in Figure 5 mean? It seems each word in '\n' 'The' 'amount' 'of' 'happiness' 'in' 'the' 'scenario' 'is' could be the last word.
  3. Regarding H(si) − H(si+1): how is H formed?
  4. What do you mean by the Y-axis of Figure 7? It seems that "few shot" is a kind of concept-representation method, which is beyond my understanding.

Forgive my English; I would appreciate your answer, and it would be great if you could provide a schedule of code updates.

Enhancing RepControl by introducing the pca_model's `explained_variance_ratio_`

Currently, after training the rep_reader, the coeff variable used in the control pipeline has to be tuned purely by experiment, and the value varies a lot from model to model. Taking primary_emotions as an example, here are the values I found:

# LLaMA-2-Chat-13B coeff=3.0-3.5
# mistralai/Mistral-7B-Instruct-v0.1 coeff=0.5
# HuggingFaceH4/zephyr-7b-beta coeff=0.3
# openchat/openchat_3.5 coeff=0.2

This makes it challenging for RepControl to adapt to new models.

My finding is that introducing the pca_model's explained_variance_ratio_ into the control process makes the manipulation more "gentle"/"accurate".

Here are the key modifications. In rep_readers.py:

def get_rep_directions(self, model, tokenizer, hidden_states, hidden_layers, **kwargs):
    """Get PCA components for each layer"""
    directions = {}

    # like directions, save the variance ratio for each layer
    variance_ratio = {}

    for layer in hidden_layers:

        ........

        self.n_components = pca_model.n_components_
        variance_ratio[layer] = pca_model.explained_variance_ratio_

    self.variance_ratio = variance_ratio
    return directions

Each layer's variance_ratio describes how much of the activation variance each direction explains, which can be interpreted as a "confidence" score for that layer in the control step.

So, when manipulating the output, the activation variable is calculated as:

coeff = 0.2
coeff_with_variance = 2.0

activations = {}
activations_with_variance = {}

for layer in layer_id:
    # original scheme: a fixed, hand-tuned coefficient
    activations[layer] = torch.tensor(coeff * rep_reader.directions[layer] * rep_reader.direction_signs[layer]).to(model.device).half()

    # proposed scheme: additionally scale by the first component's explained variance ratio
    variance_ratio = rep_reader.variance_ratio[layer][0]
    activations_with_variance[layer] = torch.tensor(coeff_with_variance * rep_reader.directions[layer] * rep_reader.direction_signs[layer] * variance_ratio).to(model.device).half()

Applying this method seems to let all the 7B models I've tested share a common coeff value of approximately 2.0.

Theoretically, I came up with this idea when I saw that WrappedBlock uses the controller (activations) to manipulate the tensor in a simple linear way, so I just took the variance_ratio into account in the simplest possible way. Perhaps extracting the PCA model's underlying singular values could give even better control; see the sketch below.
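
A minimal sketch of that follow-up idea, assuming an sklearn PCA object (singular_values_ is a real sklearn attribute, but using the per-component standard deviation as the control scale is only a hypothesis, not code from this repo):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(256, 64)              # toy recentered hidden states
pca = PCA(n_components=1).fit(X)

direction = pca.components_[0]             # unit-norm direction
# standard deviation of the data along the component, recovered from the
# singular value; a candidate replacement for explained_variance_ratio_
component_std = pca.singular_values_[0] / np.sqrt(len(X) - 1)
scaled_direction = direction * component_std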

Thanks for sharing this great work!

Dataset in example honest notebook

I notice that in the paper and the example Jupyter code, the output of ASSISTANT (the response statement) is truncated. I would like to know the reason. Thank you so much!

Confused about data leakage in honesty_control_TQA.ipynb

I am confused by the "Contrast Vector Control" code in honesty_control_TQA.ipynb.
You use the correct answer as input, along with the question, to get your direction vectors. Doesn't this constitute data leakage?
I don't know whether I have fully understood the code; please help me resolve this.

inputs_neg_s, masks_neg_s, split_neg = prepare_decoder_only_inputs(q_batch_neg, a_batch, tokenizer, model.model.device)
split = inputs_neg_s['input_ids'].shape[1] - split_neg

for layer_id in layer_ids:

    with torch.no_grad():

        # run the positive and negative prompts and collect activations
        _ = wrapped_model(**inputs_pos_s)
        pos_outputs = wrapped_model.get_activations(layer_ids, block_name=block_name)
        _ = wrapped_model(**inputs_neg_s)
        neg_outputs = wrapped_model.get_activations(layer_ids, block_name=block_name)
        # direction = difference of positive and negative activations over the answer span
        directions[layer_id] += coeff * (pos_outputs[layer_id][:, -split:] - neg_outputs[layer_id][:, -split:]) / len(templates)

        wrapped_model.reset()
        wrapped_model.set_controller([l for l in layer_ids if l <= layer_id], directions,
                                     masks=masks[:, -split:, None],
                                     token_pos="end",
                                     normalize=False)

Questions when using honesty.ipynb on llama-7b-chat-hf

Hi, thanks for your great contribution.

I have met some problems when trying to run honesty.ipynb with llama2-7b-chat-hf.

When I load the model with dtype bfloat16, the hidden layers contain NaN or inf values, so the program fails during PCA.

When I load the model with dtype float32, the data is NaN or inf and the program fails during the generation test.

Hoping to get your help!
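
One thing worth trying, as a hedged suggestion rather than a confirmed fix: load the model in float16, which is what the .half() calls elsewhere in the examples appear to assume.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",   # assumed model id
    torch_dtype=torch.float16,
    device_map="auto",
)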

Wizard-Vicuna-30B-Uncensored

Since the 'ehartford/Wizard-Vicuna-30B-Uncensored' model files total over 120 GB, does that mean my computer's RAM also needs to be greater than 120 GB to load it? If so, are there any smaller models you would recommend?

Is the assignment of untruthful_statements wrong? (code in examples/honesty/utils.py)

true_statements = df[df['label'] == 1]['statement'].values.tolist()
false_statements = df[df['label'] == 0]['statement'].values.tolist()

template_str = "Pretend you're {type} person making statements about the world."
honest_statements = []
untruthful_statements = []

# Process statements
for statement in true_statements:
    tokens = tokenizer.tokenize(statement)
    
    for idx in range(1, len(tokens) - 5):
        truncated_tokens = tokens[:idx]
        truncated_statement = tokenizer.convert_tokens_to_string(truncated_tokens)

        honest_statements.append(f"{user_tag} {template_str.format(type='an honest')} {assistant_tag} " + truncated_statement)
        untruthful_statements.append(f"{user_tag} {template_str.format(type='an untruthful')} {assistant_tag} " + truncated_statement)

false_statements not being used?

In examples/honesty/utils.py (shown in the issue above), false_statements is assigned but never used afterwards. Is that intentional?
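
A purely hypothetical sketch of what the questioner may be expecting, mirroring the loop above but feeding false_statements into the untruthful side; this is not the repo's actual code:

# hypothetical variant (not from the repo): build the untruthful side
# from false_statements instead of truncating true ones
for statement in false_statements:
    tokens = tokenizer.tokenize(statement)
    for idx in range(1, len(tokens) - 5):
        truncated_statement = tokenizer.convert_tokens_to_string(tokens[:idx])
        untruthful_statements.append(f"{user_tag} {template_str.format(type='an untruthful')} {assistant_tag} " + truncated_statement)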

About Harmlessness concept and controlling

Hi authors,

I have some questions about the harmlessness concept and its control:

  1. I would like to know how to construct the stimulus and how to get the hidden states. You mentioned that you adopt the format below:
    [image: prompt template] Therefore, only the instruction part varies, depending on the type of instruction (harmful or harmless). Additionally, you take the hidden states of the final token ([/INST] or ASSISTANT). Do I understand correctly?

  2. You mentioned that you also utilize harmless instructions from ShareGPT and construct contrast pairs to obtain more significant activations. When predicting whether a specific input is harmful or harmless, do we still need a contrast pair instead of just a single input, like the procedure shown in https://github.com/andyzoujm/representation-engineering/tree/main/examples/honesty?

How to prompt Llama2-13b-chat to generate false answers?

In Section 4.1, the paper mentions three ways to obtain stimuli for truthfulness:

(1) Fifty examples from the ARC-Challenge training set, (2) Five examples generated by the LLaMA-2-Chat-13B model in response to requests for question-answer pairs with varying degrees of truthfulness, (3) The six QA primer examples used in the original implementation, each of which is paired with a false answer generated by LLaMA-2-Chat-13B.

I wonder, for the second and third ways, how you prompt Llama-2-13b-chat to generate false answers?

Add projection operation

I have been experimenting with the linear, piece-wise, and projection operations for representation control. It would be useful to have the projection operation available for reference, to make sure my implementation matches what was used in the paper, but it is currently not implemented:

elif operator == 'projection':
    def op(current, controller):
        raise NotImplementedError
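
For reference, a minimal sketch of one common projection operator (removing the component of the activation along the control direction); this is a guess at the intended semantics under the surrounding dispatch code, not the paper's confirmed implementation:

elif operator == 'projection':
    def op(current, controller):
        # normalize the control direction, then subtract the component of
        # `current` that lies along it, leaving the orthogonal part
        direction = controller / torch.norm(controller)
        proj = (current * direction).sum(dim=-1, keepdim=True) * direction
        return current - proj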

Accelerate the rep-reading

I found that the computation in the reading pipeline is super slow. I moved the projection and recentering onto the GPU, which makes it run 20x faster. May I open a pull request?

def project_onto_direction(H, direction):
    """Project matrix H (n, d_1) onto direction vector (d_2,)."""
    # Ensure H and direction are on the same device (CPU or GPU)
    device = H.device
    if not isinstance(direction, torch.Tensor):
        direction = torch.Tensor(direction)
    direction = direction.to(device)
    # Magnitude of the direction vector
    mag = torch.norm(direction)
    assert not torch.isinf(mag).any()
    # Calculate the projection
    projection = H.matmul(direction) / mag
    return projection

def recenter(x, mean=None):
    if mean is None:
        mean = torch.mean(x, axis=0, keepdims=True)
    else:
        mean = torch.Tensor(mean).cuda()
    return x - mean

n_difference parameter with clustermean?

Hi, Thank you for making the code directly available!
I have a question about the clustermean method. The code (repe.rep_reading_pipeline, l. 106) contains the statement below, which I think never triggers, since the method name used there is 'cluster_mean'. Independently of that, just to make sure I understand: shouldn't it be n_difference == 0?

if direction_method == 'clustermean':
    assert n_difference == 1, "n_difference must be 1 for clustermean"

Thank you so much already!
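
The intuition behind the question, as a minimal sketch with made-up data: cluster-mean derives the direction from the difference of the two class means itself, so no pairwise differencing of hidden states seems needed beforehand (hence the expectation of n_difference == 0):

import numpy as np

H = np.random.randn(10, 8)        # toy hidden states, one row per example
labels = np.array([1, 0] * 5)     # two clusters
# the cluster-mean direction already contains the "difference" step
direction = H[labels == 1].mean(0) - H[labels == 0].mean(0)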

How to use RAG in Repe?

We are working on a POC that requires RAG. We found that RepE uses a Hugging Face transformers custom pipeline, but LangChain's Hugging Face integration does not support transformers custom pipelines. Any suggestions?

Train large model on multiple GPUs. You can't train a model that has been loaded with `device_map='auto'` in any distributed mode.

I am not able to train a larger model on two GPUs; does anyone know how to fix this with deepspeed?

I tried llama2_lorra.py with the script llama_lorra_tqa_7b.sh on 4 GPUs, though I know --num_gpus=1 fixes the issue.

ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.
{'tqa_accuracy': 0.31334149326805383, 'arc-e_accuracy': 0.6614035087719298}
deepspeed --master_port $ds_master_port --num_gpus=1 src/llama2_lorra.py
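
A hedged workaround, inferred from the error message rather than from the repo's scripts: under multi-process training, load the model without device_map='auto' and let the launcher (deepspeed/accelerate) place it; device_map='auto' shards the model for single-process use and is incompatible with distributed training.

import torch
from transformers import AutoModelForCausalLM

# hypothetical loading call; the key point is dropping device_map="auto"
# when launching with more than one process
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",    # assumed model id
    torch_dtype=torch.bfloat16,
)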
