
interrolang's People

Contributors

nfelnlp, qiaw99, schopra6, tanikina

Forkers

eltociear

interrolang's Issues

[Operations] nlpattribute sentence-level aggregation

  • Sentence-level aggregation: Based on the nlpattribute importances, split the text into sentences (e.g. using spaCy, NLTK, etc.) and calculate the average saliency score per sentence (see the sketch below).
  • A parse nlpattribute sentence (reconsider whether this is a sensible name) should return all sentences together with their average saliency scores.
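
A minimal sketch of the aggregation step, assuming the per-token saliency scores from nlpattribute are already available and were produced with the same HF tokenizer (all names are illustrative, not the actual InterroLang API):

import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def sentence_level_scores(text, token_scores):
    """token_scores: one saliency value per (non-special) subword token of `text`."""
    results, offset = [], 0
    for sent in nlp(text).sents:
        n = len(tokenizer.tokenize(sent.text))      # subword tokens in this sentence
        chunk = token_scores[offset:offset + n]
        avg = sum(chunk) / len(chunk) if chunk else 0.0
        results.append((sent.text, avg))
        offset += n
    return results

Note that re-tokenizing each sentence separately may drift from the original subword segmentation in edge cases; aligning via character offsets would be more robust.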

[Operations] Handle "includes" string filters by accessing filtered temp_dataset

  • Add another input box to filter dataset (includes). We should then have three dropdown menu options, i.e. [1] regular input requesting an operation, [2] defining a new instance ("custom input") and [3] filter dataset by string
    Solved by #45

All operations that currently have includes {span} prompts (see #36) need previousfilter and x adaptations and should operate on the contents of the temp_dataset (see the sketch after the list below).

  • predict
  • countdata
  • label
  • show
  • mistakes
  • score
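
A minimal sketch of the adaptation for one of these operations, assuming the string filter has already stored a pandas DataFrame in conversation.temp_dataset (names and return conventions are illustrative, not the actual action signatures):

def count_data_operation(conversation):
    # Operate on the filtered temp_dataset instead of the full dataset.
    temp = conversation.temp_dataset
    if temp is None or len(temp) == 0:
        return "No instances match the current string filter.", 1
    return f"There are {len(temp)} instances in the filtered dataset.", 1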

[Config] different class names in different gin configs for OLID

I found the following inconsistency:
olid_adapter.gin, olid_flan-t5.gin and olid_nn.gin have the following classes:
Conversation.class_names = {1: "True", 0: "False"}

However, olid.gin has: Conversation.class_names = {1: "offensive", 0: "non-offensive"}

I guess the olid.gin is the correct version, right?

[Summary] Explanation Operations

  • nlpattribute
    Terminals / Prompts: nlpattribute token | phrase | sentence {classes}
    Action: feature_importance
    Description: Provides feature importances at the token (default), phrase or sentence level.
    Tools: Captum (Integrated Gradients)

  • globaltopk
    Terminals / Prompts: important {number} {classes}
    Action: global_topk
    Description: Returns the top k most attributed tokens across the entire dataset.
    Tools: Captum (Integrated Gradients)

  • nlpcfe
    Terminals / Prompts: nlpcfe {number}
    Action: counterfactuals
    Description: Returns counterfactual explanations (model predicts another label) for a single instance.
    Tools: Polyjuice

  • adversarial
    Terminals / Prompts: adversarial {number}
    Description: Returns adversarial examples (model predicts the wrong label) for a single instance.
    Tools: OpenAttack

  • similar
    Terminals / Prompts: similar {number}
    Action: similarity
    Description: Gets a number of training data instances that are most similar to the current one.
    Tools: Sentence Transformers

  • rules
    Terminals / Prompts: rules {number}
    Description: Outputs the decision rules for the dataset.
    Tools: Anchors

  • interact
    Terminals / Prompts: interact
    Description: Gets feature interactions.
    Tools: HEDGE

  • rationalize
    Terminals / Prompts: rationalize
    Action: rationalize
    Description: Explains the prediction for a specified instance in natural language.
    Tools: Zero-shot prompting with the GPT-Neo parser

[Operations] includes with other operators (OR, NOT)

We currently don't have the option to combine string filters (includes) with other logical operators such as or and not.
This would also require frontend work to allow multiple inputs (or at least to structure them with special characters). A minimal action-level sketch follows the checklists below.

OR

  • Update action
  • Write prompts with includes or includes
  • Update grammar
  • Frontend

NOT

  • Update action
  • Write prompts with includes false / not includes
  • Update grammar
  • Frontend
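
A minimal action-level sketch (not the actual grammar or action code) of how combined includes clauses could be evaluated against a pandas temp_dataset once they are parsed; the grammar and frontend changes listed above would provide the clauses:

import pandas as pd

def apply_includes(df: pd.DataFrame, clauses, combine="or", text_col="text"):
    """clauses: list of (negate, span) pairs, e.g. [(False, "spider"), (True, "snake")]."""
    mask = None
    for negate, span in clauses:
        hit = df[text_col].str.contains(span, case=False, regex=False)
        if negate:
            hit = ~hit
        mask = hit if mask is None else (mask | hit if combine == "or" else mask & hit)
    return df if mask is None else df[mask]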

No output for nlp attribute (local_feature_importance_sentence.txt)

Given input: Most important phrases in id 12

Log trace


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/flask_app.py", line 164, in get_bot_response
    response = BOT.update_state(user_text, conversation)
  File "/usr/src/app/logic/core.py", line 881, in update_state
    parse_tree, parsed_text = self.compute_parse_text(text)
  File "/usr/src/app/logic/core.py", line 639, in compute_parse_text
    api_response = self.decoder.complete(
  File "/usr/src/app/logic/decoder.py", line 118, in complete
    completed = self.gen_completions(prompt, grammar)
  File "/usr/src/app/logic/decoder.py", line 84, in complete
    return predict_f(text=prompt, grammar=grammar)
  File "/usr/src/app/parsing/gpt/few_shot_inference.py", line 36, in predict_f
    parser = GuidedParser(grammar, tokenizer, model="gpt")
  File "/usr/src/app/parsing/guided_decoding/gd_logits_processor.py", line 43, in __init__
    self.text_parser = Lark(self.text_grammar, parser="lalr")
  File "/usr/local/lib/python3.9/site-packages/lark/lark.py", line 333, in __init__
    self.grammar, used_files = load_grammar(grammar, self.source_path, self.options.import_paths, self.options.keep_all_tokens)
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 1408, in load_grammar
    builder.load_grammar(grammar, source)
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 1233, in load_grammar
    tree = _parse_grammar(grammar_text, grammar_name)
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 964, in _parse_grammar
    raise GrammarError("Unexpected input at line %d column %d in %s: \n\n%s" %
lark.exceptions.GrammarError: Unexpected input at line 87 column 5 in <string>: 

id: " id 12
    ^


[2023-06-02 08:16:10,663] INFO in flask_app: Traceback getting bot response: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 961, in _parse_grammar
    tree = _get_parser().parse(text + '\n', start)
  File "/usr/local/lib/python3.9/site-packages/lark/parser_frontends.py", line 96, in parse
    return self.parser.parse(stream, chosen_start, **kw)
  File "/usr/local/lib/python3.9/site-packages/lark/parsers/lalr_parser.py", line 41, in parse
    return self.parser.parse(lexer, start)
  File "/usr/local/lib/python3.9/site-packages/lark/parsers/lalr_parser.py", line 171, in parse
    return self.parse_from_state(parser_state)
  File "/usr/local/lib/python3.9/site-packages/lark/parsers/lalr_parser.py", line 188, in parse_from_state
    raise e
  File "/usr/local/lib/python3.9/site-packages/lark/parsers/lalr_parser.py", line 178, in parse_from_state
    for token in state.lexer.lex(state):
  File "/usr/local/lib/python3.9/site-packages/lark/lexer.py", line 456, in lex
    yield self.next_token(state, parser_state)
  File "/usr/local/lib/python3.9/site-packages/lark/lexer.py", line 466, in next_token
    raise UnexpectedCharacters(lex_state.text, line_ctr.char_pos, line_ctr.line, line_ctr.column,
lark.exceptions.UnexpectedCharacters: No terminal matches '"' in the current parser context, at line 87 col 5

id: " id 12
    ^
Expected one of: 
	* _RBRACE
	* _COMMA
	* _LBRACE
	* _OVERRIDE
	* NUMBER
	* TILDE
	* _EXTEND
	* _RBRA
	* _NL_OR
	* _OR
	* _RPAR
	* OP
	* _DECLARE
	* _LPAR
	* REGEXP
	* RULE_MODIFIERS
	* _IGNORE
	* _LBRA
	* _DOT
	* _TO
	* TERMINAL
	* _NL
	* RULE
	* _COLON
	* STRING
	* _IMPORT
	* _DOTDOT

Previous tokens: Token('_COLON', ':')


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/flask_app.py", line 164, in get_bot_response
    response = BOT.update_state(user_text, conversation)
  File "/usr/src/app/logic/core.py", line 881, in update_state
    parse_tree, parsed_text = self.compute_parse_text(text)
  File "/usr/src/app/logic/core.py", line 639, in compute_parse_text
    api_response = self.decoder.complete(
  File "/usr/src/app/logic/decoder.py", line 118, in complete
    completed = self.gen_completions(prompt, grammar)
  File "/usr/src/app/logic/decoder.py", line 84, in complete
    return predict_f(text=prompt, grammar=grammar)
  File "/usr/src/app/parsing/gpt/few_shot_inference.py", line 36, in predict_f
    parser = GuidedParser(grammar, tokenizer, model="gpt")
  File "/usr/src/app/parsing/guided_decoding/gd_logits_processor.py", line 43, in __init__
    self.text_parser = Lark(self.text_grammar, parser="lalr")
  File "/usr/local/lib/python3.9/site-packages/lark/lark.py", line 333, in __init__
    self.grammar, used_files = load_grammar(grammar, self.source_path, self.options.import_paths, self.options.keep_all_tokens)
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 1408, in load_grammar
    builder.load_grammar(grammar, source)
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 1233, in load_grammar
    tree = _parse_grammar(grammar_text, grammar_name)
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 964, in _parse_grammar
    raise GrammarError("Unexpected input at line %d column %d in %s: \n\n%s" %
lark.exceptions.GrammarError: Unexpected input at line 87 column 5 in <string>: 

id: " id 12
    ^


[2023-06-02 08:16:10 +0000] [8] [INFO] Exception getting bot response: Unexpected input at line 87 column 5 in <string>: 

id: " id 12
    ^

[2023-06-02 08:16:10,663] INFO in flask_app: Exception getting bot response: Unexpected input at line 87 column 5 in <string>: 

id: " id 12

[Summary] General-purpose LLMs as explained models

Note: This is meant to be done after the first submission.

A promising extension to InterroLang (probably even warranting its own paper) would be to replace the BERT-type models with a single general-purpose LLM (e.g. LLaMA or the already-in-place GPT-Neo parser) that performs all of the tasks reasonably well. This would be a more modern approach, since BERTs are slowly becoming outdated and LLMs can now run locally on consumer hardware. This would entail some changes, however, which I document in the following:

  • Write instructions for the various tasks, e.g.
    " Please predict one of the following labels: <label_1> … <label_n> Prediction: "
  • Overhaul of the entire feature importance operation category (nlpattribute, globaltopk) using Inseq (see the sketch after this list).
    ⚠️ How matrices of feature attributions would be verbalized into a response has yet to be determined.
  • Pre-compute predictions and explanations with the new LLM
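
For the Inseq-based overhaul, a minimal sketch of what computing attributions for a generative LM could look like (model name and prompt are placeholders; how the resulting attribution matrix is verbalized into a response remains open):

import inseq

model = inseq.load_model("gpt2", "integrated_gradients")
out = model.attribute(
    "Please predict one of the following labels: offensive, non-offensive. Tweet: ..."
)
out.show()  # token-level attribution heatmap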

We could also get rid of many smaller language models, e.g. the GPT-2 used for CFE generation and the SBERT used for semantic similarity.

Ideally, we would end up with one model (for the entire framework) that assesses itself. It would take care of

  1. Parsing / Intent recognition
  2. Prediction of downstream tasks
  3. Feature attribution (nlpattribute, globaltopk)
  4. Perturbations (CFE, adversarial, augment)
  5. Semantic similarity
  6. Rationalization

The only two parts of the pipeline that would remain rule-based are the dialogue state tracking (custom inputs, clarification questions, previous filters) and the response generation (currently template-based).

Resources

llama.cpp (Efficient execution of up to 7B models on CPUs)
RedPajama-INCITE-Instruct-3B (Hugging Face) – maybe better for rationalization?
RedPajama-INCITE-Chat-3B – maybe better for response generation?

[Operation] Dataset Viewer

With the dataset viewer, users should be able to search for instances not only by text but also by id.

Idea:
when input is entered, check whether it can be converted to an integer (see the sketch below).
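
A minimal sketch of that check (column names are assumptions based on the gin configs):

def search_dataset(user_input, dataset):
    query = user_input.strip()
    if query.isdigit():  # input can be converted to digits -> treat it as an id
        return dataset[dataset["idx"] == int(query)]
    return dataset[dataset["text"].str.contains(query, case=False, regex=False)]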

[Operations] Metadata (Model Cards & Datasheets)

  • Decide on methodology (Return pre-formatted strings vs. QA-type setup)
  • Write CSVs (see the lookup sketch after this list), e.g.
"number of parameters", "110 million"
"training objective", "Masked language modeling (MLM) and Next sentence prediction (NSP)"
"intended uses", "This model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering."
  • Write parsing data (prompts), e.g.
User: How many parameters does the model have?
Parsed: model parameters
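
If we go with the pre-formatted-strings approach, a minimal lookup sketch could look like this (CSV layout as in the example above; the key mapping is illustrative):

import csv

def load_metadata(path):
    with open(path, newline="") as f:
        return {key: value for key, value in csv.reader(f, skipinitialspace=True)}

def model_metadata_operation(parsed_key, metadata):
    # e.g. parsed_key = "parameters" for the parse "model parameters"
    key_map = {"parameters": "number of parameters", "objective": "training objective"}
    key = key_map.get(parsed_key, parsed_key)
    if key in metadata:
        return f"The {key} of the model: {metadata[key]}.", 1
    return "Sorry, I don't have that information about the model.", 1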

Model Cards

Model card writer on Hugging Face

Action

https://github.com/nfelnlp/InterroLang/blob/b9fdcb77139e6b34f61b1886cced7077dac03ec4/actions/metadata/model.py#L4-L11

The existing model operation should be extended.

Prompts

https://github.com/nfelnlp/InterroLang/blob/b9fdcb77139e6b34f61b1886cced7077dac03ec4/prompts/metadata/describe_model.txt#L1-L11

Datasheet

Action

https://github.com/nfelnlp/InterroLang/blob/b9fdcb77139e6b34f61b1886cced7077dac03ec4/actions/metadata/data_summary.py#L41-L44

The existing data operation should be extended. However, the additional metadata collected in lines 46 ff. should stay the same.

Prompts

https://github.com/nfelnlp/InterroLang/blob/b9fdcb77139e6b34f61b1886cced7077dac03ec4/prompts/metadata/describe_data.txt#L1-L23

[Parsing] Evaluation of parsing

Encountered this error when executing the generate_parsing_results.py file with the GPT and FLAN-T5 models:

Exception: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx, likely OOM
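
If the evaluation should also run on machines without a GPU, a fallback along these lines might help (a sketch, assuming the script selects the device itself; a CPU run will be very slow for GPT-Neo):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running parsing evaluation on {device}")
# ... then move the parsing model with model.to(device) inside generate_parsing_results.py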

Counterfactuals fail to work

Here is the stack trace for the following input: "cfe for id 14"

[2023-06-08 09:49:41,263] INFO in core: USER INPUT: cfe for id 14

Batches: 100%|████████████████████████████████████| 1/1 [00:00<00:00, 40.87it/s]
[2023-06-08 09:49:41,335] INFO in core: adapters decoded text filter id 14 and cfe
[2023-06-08 09:50:46,811] INFO in flask_app: Traceback getting bot response: Traceback (most recent call last):
File "/home/ubuntu/projects/InterroLang/actions/perturbation/cfe_generation.py", line 51, in get_samples_from_pj
generated_samples = self.explainer.perturb(instance, ctrl_code=ctrl_code, num_perturbations=None,
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/polyjuice_wrapper.py", line 263, in perturb
ctrl = self.detect_ctrl_code(orig_doc, generated_doc, eop)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/polyjuice_wrapper.py", line 131, in detect_ctrl_code
meta.compute_metadata(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/compute_perturbs/compute_ctrl_meta.py", line 546, in compute_metadata
p.compute_metadata(sentence_similarity)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/compute_perturbs/compute_ctrl_meta.py", line 386, in compute_metadata
negs2, neg_heads2 = get_negations(bcore)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/compute_perturbs/compute_ctrl_meta.py", line 156, in get_negations
for t in span:
TypeError: 'NoneType' object is not iterable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/ubuntu/projects/InterroLang/flask_app.py", line 190, in get_bot_response
response = BOT.update_state(user_text, conversation)
File "/home/ubuntu/projects/InterroLang/logic/core.py", line 892, in update_state
returned_item = run_action(
File "/home/ubuntu/projects/InterroLang/logic/action.py", line 48, in run_action
action_return, action_status = actions[p_text](
File "/home/ubuntu/projects/InterroLang/actions/perturbation/cfe.py", line 80, in counterfactuals_operation
same, diff = cfe_explainer.cfe(instance, cfe_num, ctrl_code=ALL_CTRL_CODES, _id=_id)
File "/home/ubuntu/projects/InterroLang/actions/perturbation/cfe_generation.py", line 67, in cfe
new_samples = self.get_samples_from_pj(instance, ctrl_code)
File "/home/ubuntu/projects/InterroLang/actions/perturbation/cfe_generation.py", line 54, in get_samples_from_pj
generated_samples = self.explainer.perturb(instance, ctrl_code=ALL_CTRL_CODES, num_perturbations=None,
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/polyjuice_wrapper.py", line 263, in perturb
ctrl = self.detect_ctrl_code(orig_doc, generated_doc, eop)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/polyjuice_wrapper.py", line 131, in detect_ctrl_code
meta.compute_metadata(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/compute_perturbs/compute_ctrl_meta.py", line 546, in compute_metadata
p.compute_metadata(sentence_similarity)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/compute_perturbs/compute_ctrl_meta.py", line 386, in compute_metadata
negs2, neg_heads2 = get_negations(bcore)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/compute_perturbs/compute_ctrl_meta.py", line 156, in get_negations
for t in span:
TypeError: 'NoneType' object is not iterable

[2023-06-08 09:50:46,811] INFO in flask_app: Exception getting bot response: 'NoneType' object is not iterable
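
The failure happens inside Polyjuice's control-code detection, so one possible mitigation (a sketch, not verified) is to perturb per control code and skip the codes that raise, rather than passing all of them at once:

def get_samples_from_pj_safe(self, instance, ctrl_codes):
    # Collect whatever perturbations Polyjuice can produce, skipping failing ctrl codes.
    samples = []
    for code in ctrl_codes:
        try:
            samples += self.explainer.perturb(instance, ctrl_code=code,
                                              num_perturbations=None)
        except TypeError:
            # detect_ctrl_code can fail with "'NoneType' object is not iterable"
            continue
    return samples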

[Data] Pre-computed rationales as JSON

Computing rationales is only possible with an LLM such as GPT-Neo or the Dolly-3B model. When we do the user study, we have to choose one of the smaller models for parsing (Adapter, FLAN-T5), so users don't have to wait for the responses. This also means we cannot compute rationales on the fly, but have to have a JSON of pre-computed rationales for all three datasets.

First tests indicate that Dolly-3B might be the best rationalizer we can use. Few-shot also gives us better results than zero-shot.

Code

  • Write code that executes rationalize_operation on every instance and saves the result to a JSON file
  • Read JSON in rationalize_operation (have on-the-fly computation as a separate function instead)

Datasets

  • BoolQ
  • DailyDialog
  • OLID

# Draft script: pre-compute rationales for BoolQ with Dolly-v2-3b and save them as JSON.
import json

import numpy as np
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

dataset_name = "boolq"
dataset = pd.read_csv("./data/boolq_validation.csv")
# Explained model: DistilBERT fine-tuned on BoolQ (the model whose predictions we rationalize)
model = AutoModelForSequenceClassification.from_pretrained("andi611/distilbert-base-uncased-qa-boolq", num_labels=2)
instances = []
for i in range(len(dataset)):
    instances.append([dataset["question"][i], dataset["passage"][i]])
print(instances)

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("andi611/distilbert-base-uncased-qa-boolq")
model.to(device)

# Rationalizer: Dolly-v2-3b generates the free-text explanations
gpt_tokenizer = GPTNeoXTokenizerFast.from_pretrained("databricks/dolly-v2-3b", padding_side="left")
gpt_model = GPTNeoXForCausalLM.from_pretrained("databricks/dolly-v2-3b")
gpt_model.to(device)

explanations = []
for instance in instances:
    # if dataset_name == "boolq":
    text = 'Question: ' + instance[0] + '\nPassage: ' + instance[1]
    label_dict = {0: 'false', 1: 'true'}
    text_description = 'question and passage'
    fields = ['question', 'passage']
    fields_enum = ', '.join([f"'{f}'" for f in fields])
    output_description = 'answer'

    string = instance[0] + ' ' + instance[1]

    encoding = tokenizer.encode_plus(string, return_tensors='pt', max_length=512, truncation=True)
    input_ids = encoding["input_ids"]
    attention_mask = encoding["attention_mask"]
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)
    input_model = {
        'input_ids': input_ids.long(),
        'attention_mask': attention_mask.long(),
    }
    output_model = model(**input_model)[0]

    # Get logit
    model_predictions = np.argmax(output_model.cpu().detach().numpy())

    # model_predictions = model.predict()
    pred_str = label_dict[model_predictions]
    # else:
    #     return f"Dataset {dataset_name} currently not supported by rationalize operation", 1
    #
    prompt = f"{text}\n" \
             f"Based on {text_description}, the {output_description} is {pred_str}. " \
             f"Without using {fields_enum}, or revealing the answer or outcome in your response, " \
             f"explain why: "

    input_ids = gpt_tokenizer(prompt, return_tensors="pt").input_ids
    input_ids = input_ids.to(device)
    generation = gpt_model.generate(
        input_ids,
        max_length=350,
        no_repeat_ngram_size=2,
    )
    decoded_generation = gpt_tokenizer.decode(generation[0], skip_special_tokens=True)
    #
    inputs = decoded_generation.split("Based on ")[0]
    explanation = decoded_generation.split("explain why: ")[1]
    print(explanation)
    explanations.append(explanation)

# Save the pre-computed rationales to a JSON file
with open("./netscratch/qwang/boolq_rationalization.json", "w") as json_file:
    json.dump(explanations, json_file)

[Data] Additional prompts that capture dataset-specific descriptions

TTM had prompts that included "people with age [...]". All prompts that we have adapted and written so far are generic and applicable to any dataset (which is good!). However, for our three prototype datasets (BoolQ, OLID, DailyDialog), we might need additional prompts that include strings such as:

BoolQ

  • question
  • passage
  • document
  • article

e.g. "What would need to be changed in this document to predict False?"

OLID

  • tweet
  • Twitter post

e.g. "Please display a random tweet from the dataset and explain the predictions"

DailyDialog

  • dialogue
  • conversation

e.g. "For dialogue 250, please show the prediction"


Related:
metadata prompts README

[Operations] important/globaltopk + filters

Created from #25

Right now, globaltopk only operates on a fixed set of pre-computed global explanations. It is only possible to use important on the whole dataset or on a specific class ({classname}).
If there were a way to efficiently compute it on other temp_datasets, we could combine this operation with different filters (a minimal sketch follows below).
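
If the per-instance attributions are already cached, a sketch of the combination could aggregate the cached token scores only over the ids in the current temp_dataset (cache layout and names are assumptions):

from collections import defaultdict

def global_topk_filtered(temp_dataset, cached_attributions, k=10):
    """cached_attributions: dict mapping id -> list of (token, score) pairs."""
    totals = defaultdict(float)
    for idx in temp_dataset["idx"]:
        for token, score in cached_attributions.get(idx, []):
            totals[token] += abs(score)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]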

prompts/work_in_progress

[Operation] nlpattribute throws an error for OLID

Config: olid.gin
Input: sentence level feature importance for id 1766
Parsed: filter id 1766 and nlpattribute sentence [e]
Traceback:

[2023-06-18 10:43:13,641] INFO in flask_app: Traceback getting bot response: Traceback (most recent call last):
File "/home/ubuntu/projects/InterroLang/flask_app.py", line 192, in get_bot_response
response = BOT.update_state(user_text, conversation)
File "/home/ubuntu/projects/InterroLang/logic/core.py", line 898, in update_state
returned_item = run_action(
File "/home/ubuntu/projects/InterroLang/logic/action.py", line 51, in run_action
action_return, action_status = actions[p_text](
File "/home/ubuntu/projects/InterroLang/actions/explanation/feature_importance.py", line 374, in feature_importance_operation
return_s += get_sentence_level_feature_importance(conversation, filtered_text, simulation)
File "/home/ubuntu/projects/InterroLang/actions/explanation/feature_importance.py", line 195, in get_sentence_level_feature_importance
res_list = get_explanation(dataset_name, inputs, conversation, file_name="sentence_level")
File "/home/ubuntu/projects/InterroLang/actions/explanation/feature_importance.py", line 85, in get_explanation
res_list = generate_explanation(model, dataset_name, inputs, conversation, file_name=file_name)
File "/home/ubuntu/projects/InterroLang/actions/custom_input.py", line 249, in generate_explanation
attribution, predictions = compute_feature_attribution_scores(b, model, device)
File "/home/ubuntu/projects/InterroLang/actions/custom_input.py", line 165, in compute_feature_attribution_scores
attributions = explainer.attribute(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/captum/log/init.py", line 35, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/captum/attr/_core/layer/layer_integrated_gradients.py", line 365, in attribute
inputs_layer = _forward_layer_eval(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/captum/_utils/gradient.py", line 182, in _forward_layer_eval
return _forward_layer_eval_with_neuron_grads(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/captum/_utils/gradient.py", line 445, in _forward_layer_eval_with_neuron_grads
saved_layer = _forward_layer_distributed_eval(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/captum/_utils/gradient.py", line 294, in _forward_layer_distributed_eval
output = _run_forward(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/captum/_utils/common.py", line 456, in _run_forward
output = forward_func(
File "/home/ubuntu/projects/InterroLang/actions/custom_input.py", line 117, in bert_forward
output_model = model(**input_model)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py", line 1599, in forward
outputs = self.bert(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/transformers/adapters/context.py", line 108, in wrapper_func
results = f(self, *args, **kwargs)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py", line 1042, in forward
embedding_output = self.embeddings(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py", line 245, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1123, in call_impl
hook_result = hook(self, input, result)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/OpenAttack/utils/transformers_hook.py", line 7, in call
output
.retain_grad()
RuntimeError: can't retain_grad on Tensor that has requires_grad=False

It doesn't work for any id. It seems that only the operations parsed as nlpattribute or nlpattribute sentence throw this error for OLID.
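
The last frames of the traceback point to OpenAttack's transformers_hook calling .retain_grad() on the embedding output, so a plausible (unverified) explanation is that a forward hook registered for the adversarial operation is still attached to the shared model. A debugging sketch:

# Hypothesis only: inspect and remove leftover forward hooks on the word embedding
# before running Captum. The module path follows the traceback above; `model` is the
# explained BERT model from the conversation.
embeddings = model.bert.embeddings.word_embeddings
print(embeddings._forward_hooks)   # is OpenAttack's hook still registered here?
embeddings._forward_hooks.clear()  # then retry the nlpattribute operation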

Set up TalkToModel env

Integration

This is an overview of how we could integrate another desired dataset into TTM.

1. Get TTM

git clone git@github.com:dylan-slack/TalkToModel.git

2. Setup env

Get into the TTM directory and run these commands:

conda create -n ttm python=3.9
conda activate ttm

Then you should install all dependencies:

pip install -r requirements.txt
pip install datasets

3. Add needed files

First, put the config file into the /configs folder as /configs/boolq.gin:

##########################################
# The boolq dataset conversation config
##########################################

# for few shot, e.g., "EleutherAI/gpt-neo-2.7B"
ExplainBot.parsing_model_name = "EleutherAI/gpt-neo-2.7B"


# set skip_prompts to true for quicker startup for finetuned models
# make sure to set to false using few-shot models
ExplainBot.skip_prompts = False

ExplainBot.t5_config = "./parsing/t5/gin_configs/t5-large.gin"
ExplainBot.seed = 0
ExplainBot.background_dataset_file_path = "./data/boolq_train.csv"
ExplainBot.model_file_path = "./data/boolq_model"
ExplainBot.dataset_file_path = "./data/boolq_validation.csv"

ExplainBot.name = "boolq"

ExplainBot.dataset_index_column = "idx"
ExplainBot.target_variable_name = "label"
ExplainBot.categorical_features = None
ExplainBot.numerical_features = None
ExplainBot.remove_underscores = False

ExplainBot.prompt_metric = "cosine"
ExplainBot.prompt_ordering = "ascending"

# Prompt params
Prompts.prompt_cache_size = 1_000_000
Prompts.prompt_cache_location = "./cache/boolq-prompts.pkl"
Prompts.max_values_per_feature = 2
Prompts.sentence_transformer_model_name = "all-mpnet-base-v2"
Prompts.prompt_folder = "./explain/prompts"
Prompts.num_per_knn_prompt_template = 1
Prompts.num_prompt_template = 7

# Explanation Params
Explanation.max_cache_size = 1_000_000

# MegaExplainer Params
MegaExplainer.cache_location = "./cache/boolq-mega-explainer-tabular.pkl"
MegaExplainer.use_selection = False

# Conversation params
Conversation.class_names = {1: "True", 0: "False"}

# Dataset description
DatasetDescription.dataset_objective = "predict to answer yes/no questions based on text passages"
DatasetDescription.dataset_description = "Boolean question answering (yes/no)"
DatasetDescription.model_description = "DistilBERT"

And change the global config files in global_config.gin:

GlobalArgs.config = "./configs/boolq.gin"

Then you should add datasets:

from datasets import load_dataset
val = load_dataset("super_glue", "boolq", split="validation").to_csv('data/boolq_validation.csv')
train = load_dataset("super_glue", "boolq", split="train").to_csv('data/boolq_train.csv')

What's more, you should download the model from
https://huggingface.co/andi611/distilbert-base-uncased-qa-boolq/tree/main and put it under /data as ./data/boolq_model (matching ExplainBot.model_file_path above).

Adapt original files:

In /explain/logic.py,

  • Add load_hf_model()
@gin.configurable
def load_hf_model(model_id):
    """ Loads a (local) Hugging Face model from a directory containing a pytorch_model.bin file and a config.json file.
    """
    # TransformerModel is a custom wrapper class (to be defined) around the HF model;
    # alternatively: transformers.AutoModel.from_pretrained(model_id)
    return TransformerModel(model_id)
  • Comment load_explanations:
# Load the explanations
# self.load_explanations(background_dataset=background_dataset)
  • Change else part from load_model():
else:
    model = load_hf_model(filepath)
    self.conversation.add_var('model', model, 'model')

4. Execution

python flask_app.py

[Operations] Adversarial attack is not displayed

I've been testing the adversarial operation and it currently fails with the following message:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

screenshot

My input was: Show me an adversarial sample for id 4

The decoded/parsed text: filter id 4 and adversarial

It seems that the input is parsed correctly but the operation itself fails when it tries to access the dataset.
I also tried to change the visualize parameter:
d = attack_eval.eval(dataset, visualize=True) (line 85 in adversarial.py) but it results in the same error message.
The adversarial attack itself is successful, I can see it in the console output.
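
The error message itself comes from using a multi-element NumPy array in a boolean context somewhere in the display path, e.g.:

import numpy as np

preds = np.array([0, 1])
# bool(preds == 1)         # would raise: "The truth value of an array with more than one element is ambiguous."
print((preds == 1).any())  # fix: reduce the element-wise comparison with .any() or .all()

So the fix is probably in the display code of the adversarial action, where an array is compared or used in an if-condition, rather than in the visualize parameter.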

Error in rationalize operation

Input: rationalize the prediction for id 9

Logs:

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/src/app/timeout.py", line 19, in run
    self._result = self._func(*self._args, **self._kwargs)
  File "/usr/src/app/actions/explanation/rationalize.py", line 89, in rationalize_operation
    generation = conversation.decoder.gpt_model.generate(
  File "/usr/local/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/generation_utils.py", line 1296, in generate
    return self.greedy_search(
  File "/usr/local/lib/python3.9/site-packages/transformers/generation_utils.py", line 1690, in greedy_search
    outputs = self(
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 745, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 583, in forward
    position_embeds = self.wpe(position_ids)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/usr/local/lib/python3.9/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
[2023-06-02 08:31:46 +0000] [8] [INFO] Traceback getting bot response: Traceback (most recent call last):
  File "/usr/src/app/flask_app.py", line 164, in get_bot_response
    response = BOT.update_state(user_text, conversation)
  File "/usr/src/app/logic/core.py", line 890, in update_state
    returned_item = run_action(
  File "/usr/src/app/logic/action.py", line 48, in run_action
    action_return, action_status = actions[p_text](
TypeError: cannot unpack non-iterable NoneType object

[2023-06-02 08:31:46,296] INFO in flask_app: Traceback getting bot response: Traceback (most recent call last):
  File "/usr/src/app/flask_app.py", line 164, in get_bot_response
    response = BOT.update_state(user_text, conversation)
  File "/usr/src/app/logic/core.py", line 890, in update_state
    returned_item = run_action(
  File "/usr/src/app/logic/action.py", line 48, in run_action
    action_return, action_status = actions[p_text](
TypeError: cannot unpack non-iterable NoneType object

[2023-06-02 08:31:46 +0000] [8] [INFO] Exception getting bot response: cannot unpack non-iterable NoneType object
[2023-06-02 08:31:46,296] INFO in flask_app: Exception getting bot response: cannot unpack non-iterable NoneType object
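
The IndexError is raised in GPT-Neo's position embeddings, which usually means the prompt (plus generation budget) exceeds the model's max_position_embeddings (2048 for GPT-Neo). A hedged sketch of a guard in rationalize_operation (names follow the traceback loosely; the tokenizer attribute is an assumption):

def generate_rationale(conversation, prompt, max_new_tokens=128):
    # Keep prompt + generated tokens within the model's maximum context length.
    gpt_model = conversation.decoder.gpt_model
    gpt_tokenizer = conversation.decoder.gpt_tokenizer  # attribute name is an assumption
    max_ctx = gpt_model.config.max_position_embeddings  # 2048 for GPT-Neo
    input_ids = gpt_tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_ctx - max_new_tokens,
    ).input_ids.to(gpt_model.device)
    generation = gpt_model.generate(input_ids, max_new_tokens=max_new_tokens)
    return gpt_tokenizer.decode(generation[0], skip_special_tokens=True)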

[Data] Dev set for parsing accuracy evaluation

We need a custom dataset of prompts (around 100 instances) that we can test our four systems on before the user study.

Test set creation will be a separate task done in parallel with the user study (#73).

Agree on folder structure

Suggested structure

  • actions (on top level instead of under explain/)
    • cfe_code
    • feature_importance
    • similarity
  • cache
    • subdirectories for boolq, dailydialog, olid
    • in each subdirectory: JSON files of pre-computed predictions and explanations
  • configs
    • boolq.gin
    • dailydialog.gin
    • olid.gin
  • data
    • subdirectories for boolq, dailydialog, olid
    • in each subdirectory: CSV files containing the raw datasets
  • experiments
  • explained_models
    • "boolq" and "da_classifier" should both be subdirectories
  • logic (folder)
    • action.py
    • conversation.py
    • core.py (previous explain/logic.py)
    • dataset_description.py
    • decoder.py
    • grammar.py
    • parser.py
    • prompts.py
    • sample_prompts_by_action.py
    • utils.py (read_and_format_data, etc.)
    • write_to_log.py
  • parsing (fine-tuning of t5 which is a separate workflow)
  • prompts (on top level instead of under explain/)
    • generation.py (explain/prompts.py > [generate_prompts, filter_prompts, get_embedding, …])
    • retrieval.py (explain/prompts.py > [get_prompts, get_k_nearest_prompts])
  • tests

Stuff we can delete from TTM

  • explain/complete.py (optional NLG [?] engine)
  • explain/explanation.py (additional logic for handling feature importance)
  • explain/feature_interaction.py (not trivially applicable)
  • explain/mega_explainer (we only need the IG explainer)

specific to TTM datasets (will be replaced by BoolQ, OLID, DailyDialog)

  • configs/{…}.gin
  • data

[Summary] Nice to Have

  • In actions/filter.py: labelfilter(), predfilter(), lengthfilter() modify the temp_dataset from conversation. However, most of our operations won't use it, since they just read data from cache directly -> #52
  • Adversarial examples: we have to use one more new package, OpenAttack. An implementation for our use case can be found here. The disadvantage is the execution time.

2023-04-03

  • Visualization of explanation. See #23

2023-05-11

  • Add keyboard shortcuts similar to a Python terminal, e.g. to reset the temp dataset, remove the custom input, or sample actions

2023-05-14

  • Pagination for dataset viewer (allow user to click through more than just the first page of 10 results, see "Dataset Preview" on Hugging Face)
  • Hints or tooltips for operations, e.g. that operations like cfe, nlpattribute, rationalize can only be used on single instances --> #90

2023-05-15

  • Clicking on an icon next to an instance in dataset viewer enters "ID xx" into input box
  • Clicking on a "On this subset..." button next to dataset viewer inserts a random "includes" prompt and caches this temp dataset to be used in the backend

2023-05-17

  • Add pytest for Lark grammar

2023-05-20

  • Input box should dynamically expand into multi-line view (see ChatGPT interface)
  • When hovering over "ID xx" in previous turns, the contents should be shown in a pop-up tooltip
  • Sample prompts should dynamically choose IDs (not necessarily IDs as hard-coded in prompts folder)
  • Class names should be highlighted with a background color
  • Three buttons with possible follow-up questions below the last response -> Clicking them automatically inserts them into the query (User can still edit them)

[Data] Adapt {span} prompts to custom input functionality

There are a few prompts that are based on the notion of the string filter being part of the user question. However, we changed that recently to be handled by either the string filter input box next to the dataset viewer or the custom input dropdown option, so these prompts would have to be slightly adapted:

  • global feature importance (not applicable)
  • filter
    includes {span} -> includes
    When a string filter is used, the backend will use the filtered dataset as the temp_dataset.
    The user questions are modified slightly to allow for two kinds of formats:
  1. The span is part of the question, i.e. a {span} placeholder
  2. The span is not mentioned explicitly, but referred to as "my query", "search term" or "filter", e.g. "Can you display all occurrences where my filter applies?"

[Summary] Filters

This issue keeps track of the filter actions and functionality. Apart from the string filter (includes), everything else is optional for now.

  • String filter (includes)
    Checks if a span/string is included in one of the text fields, e.g. includes spider

  • Prediction filter (predictionfilter)
    Checks if the model predicted this label, e.g. predictionfilter offensive (OLID)

  • Label filter (labelfilter) [in TTM, but not yet adapted]
    Checks if the true label is equal to this label, e.g. labelfilter commissive (DD)

  • Length filter (lengthfilter)
    Checks if the instance is above or below a certain length, e.g. lengthfilter words above 30 would show all instances with 30 words or more. Possibly extensible to more parameters like text fields (is one specific field longer than the value). This should work similarly to the numerical filters in TTM (there are already equality terms implemented in the grammar). Possible arguments are words, sents (sentences) and chars (characters)

  • Similarity filter (similarityfilter or similar n and …)
    Checks if the instance is among the n instances selected with similar n in a previous turn.

[Operations] spanimportance

Created from #25

An explanation operation that, based on the pre-computed feature attributions, would retrieve k instances where a custom input span has the highest attribution score. The sorting could be influenced by (1) the rank of the span in sorted attributions and (2) the absolute difference to the next highest attributed word/sentence.

spanimportance topk {k}

prompts/talktomodel

[Data] Annotate user responses (creating "user" test set)

We will need to annotate the feedback files manually. In particular, based on the original user questions, we should annotate the gold parse. This lets us compare them with the actual parses and compute accuracies in #53

  • Repo group
  • BoolQ (Hosted Interface group)
  • DailyDialog (Hosted Interface group)
  • OLID (Hosted Interface group)

[Operations] replace operation

A possible adaptation of the deprecated TTM operation change (actions/perturbation/what_if.py)
is replace, where any string can be replaced with a custom input.

This would require two custom inputs, in fact. It would modify the temp_dataset. It is also similar to cfe, but here the user has full control of what they would insert in place of the original string from the data.

A major limitation is that even prediction is not easily possible, because the entire dataset would have to be recomputed. However, this might still be relevant for a small temp_dataset, e.g. you filter by "spider" and are left with 13 instances for BoolQ. With this scale, it would certainly be possible to compute predictions and explanations on the fly.
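
A minimal sketch of replace on a small temp_dataset, including on-the-fly re-prediction (assuming a pandas temp_dataset and some predict_fn callable for the explained model; names are illustrative):

def replace_operation(temp_dataset, old_span, new_span, predict_fn, text_col="text"):
    # Only feasible on-the-fly for small filtered subsets, as argued above.
    modified = temp_dataset.copy()
    modified[text_col] = modified[text_col].str.replace(old_span, new_span, regex=False)
    modified["prediction"] = [predict_fn(text) for text in modified[text_col]]
    return modified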

Newer transformers is incompatible with polyjuice_nlp

1.

While testing the cfe code, which requires both the transformers and polyjuice_nlp packages, it turns out that after installing everything needed according to the official polyjuice_nlp website, there is always an error:

ValueError: The following `model_kwargs` are not used by the model: ['n'] (note: typos in the generate arguments will also show up in this list)

This issue is also mentioned in the transformers repo.

The easiest way to solve it is to use a lower transformers version.

2.

If running code and encounter this problem:

AttributeError: Can't get attribute 'Trie' on <module 'transformers.tokenization_utils'. 

One similar issue can be found here.

Make sure that your transformers version is 4.18.0

3.

If you want to install polyjuice_nlp, it's better to clone the git repository and install it directly from source (instead of via pip/conda).

Similar operation parser issue

During parsing of similar prompts, the number of similar items to be retrieved is omitted by the parser.

input: could you locate 6 comparable data point to id 75 for me?
parsed: filter id 75 and similar [e]
Batches: 100%|███████████████████████████████████| 1/1 [00:00<00:00, 138.85it/s]
Batches: 100%|██████████████████████████████████| 75/75 [00:00<00:00, 95.83it/s]
[2023-05-20 11:08:59,960] INFO in write_to_log: {'bot_name': 'daily_dialog', 'username': 'unknown', 'id': 'e89e983cc655e38ebdd12604539f48ef9fd9f6140419140cbd6e53b94fbe', 'system_input': 'Could you locate 6 comparable data point to ID 75 for me?', 'parsed_text': ' filter id 75 and similar [e]', 'system_response': "The original text for <b>id 75</b>:<br><summary>a good rest is all you need, and drink more water. i'll write you a prescription....</summary><details>a good rest is all you need, and drink more water. i'll write you a prescription.</details><br>Here are some instances similar to <b>id 75</b>:<br><b> id 1989</b> (cossim 0.665): <summary>try to get outdoors more and be sure to get more rest....</summary><details>try to get outdoors more and be sure to get more rest.</details>", 'time': '05_20_2023_02_08_59_PDT'}
[2023-05-20 11:08:59,960] INFO in _internal: 134.96.189.13 - - [20/May/2023 11:08:59] "POST /get_response HTTP/1.1" 200 -

[Summary] gin configs

Nearest neighbor

  • BoolQ
  • OLID
  • DailyDialog

GPT-Neo 2.7B

  • BoolQ
  • OLID
  • DailyDialog

Fine-tuned FLAN-T5-large

  • BoolQ
  • OLID
  • DailyDialog

Adapter + BERT-base-uncased

  • BoolQ
  • OLID
  • DailyDialog

[Summary] User study functionality

  • Add a demonstration for users before they start the evaluation (using Selenium, automatic user questions, or screen recording)
  • Investigate how to run Flask application on a server, so we could access it via URL in a browser @OguzCennet
  • For "Feedback" option, check how to collect acceptability ratings in the user study (decide on rating values, save to files etc.)
  • Test sample interface ("Help me generate a question about...") with all operations
  • Check that the response time of all operations does not exceed 60 seconds

[Summary] Parsing evaluation (User response set)

Blocked by #41 and #106

BoolQ

  • Nearest Neighbors
  • GPT-Neo 2.7B
  • FLAN-T5-large (780M)
  • BERT+Adapter (110M)

DailyDialog

  • Nearest Neighbors
  • GPT-Neo 2.7B
  • FLAN-T5-large (780M)
  • BERT+Adapter (110M)

OLID

  • Nearest Neighbors
  • GPT-Neo 2.7B
  • FLAN-T5-large (780M)
  • BERT+Adapter (110M)

Optional fifth model: Dolly-v2-3B
