
interrolang's People

Contributors

nfelnlp, qiaw99, schopra6, tanikina

Forkers

eltociear

interrolang's Issues

[Operations] nlpattribute sentence-level aggregation

  • Sentence-level aggregation: Based on the nlpattribute importances, split the text into sentences (e.g. using spaCy, NLTK, etc.) and calculate the average saliency score per sentence (see the sketch below).
  • A parse nlpattribute sentence (reconsider whether this is a sensible name) should return all sentences together with their average saliency scores.
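
A minimal sketch of the aggregation step, assuming the per-token saliency scores from nlpattribute are already available and were produced with the same HF tokenizer (all names are illustrative, not the actual InterroLang API):

import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def sentence_level_scores(text, token_scores):
    """token_scores: one saliency value per (non-special) subword token of `text`."""
    results, offset = [], 0
    for sent in nlp(text).sents:
        n = len(tokenizer.tokenize(sent.text))      # subword tokens in this sentence
        chunk = token_scores[offset:offset + n]
        avg = sum(chunk) / len(chunk) if chunk else 0.0
        results.append((sent.text, avg))
        offset += n
    return results

Note that re-tokenizing each sentence separately may drift from the original subword segmentation in edge cases; aligning via character offsets would be more robust.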

[Operations] Handle "includes" string filters by accessing filtered temp_dataset

  • Add another input box to filter dataset (includes). We should then have three dropdown menu options, i.e. [1] regular input requesting an operation, [2] defining a new instance ("custom input") and [3] filter dataset by string
    Solved by #45

All operations that currently have includes {span} prompts (see #36) need previousfilter and x adaptations and should operate on the contents of the temp_dataset (see the sketch after the list below).

  • predict
  • countdata
  • label
  • show
  • mistakes
  • score
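
A minimal sketch of the adaptation for one of these operations, assuming the string filter has already stored a pandas DataFrame in conversation.temp_dataset (names and return conventions are illustrative, not the actual action signatures):

def count_data_operation(conversation):
    # Operate on the filtered temp_dataset instead of the full dataset.
    temp = conversation.temp_dataset
    if temp is None or len(temp) == 0:
        return "No instances match the current string filter.", 1
    return f"There are {len(temp)} instances in the filtered dataset.", 1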

[Config] different class names in different gin configs for OLID

I found the following inconsistency:
olid_adapter.gin, olid_flan-t5.gin and olid_nn.gin have the following classes:
Conversation.class_names = {1: "True", 0: "False"}

However, olid.gin has: Conversation.class_names = {1: "offensive", 0: "non-offensive"}

I guess the olid.gin is the correct version, right?

[Summary] Explanation Operations

  • nlpattribute
    Terminals / Prompts: nlpattribute token | phrase | sentence {classes}
    Action: feature_importance
    Description: Provides feature importances at the token (default), phrase or sentence level.
    Tools: Captum (Integrated Gradients)

  • globaltopk
    Terminals / Prompts: important {number} {classes}
    Action: global_topk
    Description: Returns the top k most attributed tokens across the entire dataset.
    Tools: Captum (Integrated Gradients)

  • nlpcfe
    Terminals / Prompts: nlpcfe {number}
    Action: counterfactuals
    Description: Returns counterfactual explanations (model predicts another label) for a single instance.
    Tools: Polyjuice

  • adversarial
    Terminals / Prompts: adversarial {number}
    Description: Returns adversarial examples (model predicts the wrong label) for a single instance.
    Tools: OpenAttack

  • similar
    Terminals / Prompts: similar {number}
    Action: similarity
    Description: Gets a number of training data instances that are most similar to the current one.
    Tools: Sentence Transformers

  • rules
    Terminals / Prompts: rules {number}
    Description: Outputs the decision rules for the dataset.
    Tools: Anchors

  • interact
    Terminals / Prompts: interact
    Description: Gets feature interactions.
    Tools: HEDGE

  • rationalize
    Terminals / Prompts: rationalize
    Action: rationalize
    Description: Explains the prediction for a specified instance in natural language.
    Tools: Zero-shot prompting with the GPT-Neo parser

[Operations] includes with other operators (OR, NOT)

We currently don't have the option to combine string filters (includes) with other logical operators such as or and not.
This would also require frontend work to allow multiple inputs (or at least to structure them with special characters). A minimal action-level sketch follows the checklists below.

OR

  • Update action
  • Write prompts with includes or includes
  • Update grammar
  • Frontend

NOT

  • Update action
  • Write prompts with includes false / not includes
  • Update grammar
  • Frontend
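
A minimal action-level sketch (not the actual grammar or action code) of how combined includes clauses could be evaluated against a pandas temp_dataset once they are parsed; the grammar and frontend changes listed above would provide the clauses:

import pandas as pd

def apply_includes(df: pd.DataFrame, clauses, combine="or", text_col="text"):
    """clauses: list of (negate, span) pairs, e.g. [(False, "spider"), (True, "snake")]."""
    mask = None
    for negate, span in clauses:
        hit = df[text_col].str.contains(span, case=False, regex=False)
        if negate:
            hit = ~hit
        mask = hit if mask is None else (mask | hit if combine == "or" else mask & hit)
    return df if mask is None else df[mask]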

No output for nlp attribute (local_feature_importance_sentence.txt)

Given input: Most important phrases in id 12

Log trace


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/flask_app.py", line 164, in get_bot_response
    response = BOT.update_state(user_text, conversation)
  File "/usr/src/app/logic/core.py", line 881, in update_state
    parse_tree, parsed_text = self.compute_parse_text(text)
  File "/usr/src/app/logic/core.py", line 639, in compute_parse_text
    api_response = self.decoder.complete(
  File "/usr/src/app/logic/decoder.py", line 118, in complete
    completed = self.gen_completions(prompt, grammar)
  File "/usr/src/app/logic/decoder.py", line 84, in complete
    return predict_f(text=prompt, grammar=grammar)
  File "/usr/src/app/parsing/gpt/few_shot_inference.py", line 36, in predict_f
    parser = GuidedParser(grammar, tokenizer, model="gpt")
  File "/usr/src/app/parsing/guided_decoding/gd_logits_processor.py", line 43, in __init__
    self.text_parser = Lark(self.text_grammar, parser="lalr")
  File "/usr/local/lib/python3.9/site-packages/lark/lark.py", line 333, in __init__
    self.grammar, used_files = load_grammar(grammar, self.source_path, self.options.import_paths, self.options.keep_all_tokens)
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 1408, in load_grammar
    builder.load_grammar(grammar, source)
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 1233, in load_grammar
    tree = _parse_grammar(grammar_text, grammar_name)
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 964, in _parse_grammar
    raise GrammarError("Unexpected input at line %d column %d in %s: \n\n%s" %
lark.exceptions.GrammarError: Unexpected input at line 87 column 5 in <string>: 

id: " id 12
    ^


[2023-06-02 08:16:10,663] INFO in flask_app: Traceback getting bot response: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 961, in _parse_grammar
    tree = _get_parser().parse(text + '\n', start)
  File "/usr/local/lib/python3.9/site-packages/lark/parser_frontends.py", line 96, in parse
    return self.parser.parse(stream, chosen_start, **kw)
  File "/usr/local/lib/python3.9/site-packages/lark/parsers/lalr_parser.py", line 41, in parse
    return self.parser.parse(lexer, start)
  File "/usr/local/lib/python3.9/site-packages/lark/parsers/lalr_parser.py", line 171, in parse
    return self.parse_from_state(parser_state)
  File "/usr/local/lib/python3.9/site-packages/lark/parsers/lalr_parser.py", line 188, in parse_from_state
    raise e
  File "/usr/local/lib/python3.9/site-packages/lark/parsers/lalr_parser.py", line 178, in parse_from_state
    for token in state.lexer.lex(state):
  File "/usr/local/lib/python3.9/site-packages/lark/lexer.py", line 456, in lex
    yield self.next_token(state, parser_state)
  File "/usr/local/lib/python3.9/site-packages/lark/lexer.py", line 466, in next_token
    raise UnexpectedCharacters(lex_state.text, line_ctr.char_pos, line_ctr.line, line_ctr.column,
lark.exceptions.UnexpectedCharacters: No terminal matches '"' in the current parser context, at line 87 col 5

id: " id 12
    ^
Expected one of: 
	* _RBRACE
	* _COMMA
	* _LBRACE
	* _OVERRIDE
	* NUMBER
	* TILDE
	* _EXTEND
	* _RBRA
	* _NL_OR
	* _OR
	* _RPAR
	* OP
	* _DECLARE
	* _LPAR
	* REGEXP
	* RULE_MODIFIERS
	* _IGNORE
	* _LBRA
	* _DOT
	* _TO
	* TERMINAL
	* _NL
	* RULE
	* _COLON
	* STRING
	* _IMPORT
	* _DOTDOT

Previous tokens: Token('_COLON', ':')


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/flask_app.py", line 164, in get_bot_response
    response = BOT.update_state(user_text, conversation)
  File "/usr/src/app/logic/core.py", line 881, in update_state
    parse_tree, parsed_text = self.compute_parse_text(text)
  File "/usr/src/app/logic/core.py", line 639, in compute_parse_text
    api_response = self.decoder.complete(
  File "/usr/src/app/logic/decoder.py", line 118, in complete
    completed = self.gen_completions(prompt, grammar)
  File "/usr/src/app/logic/decoder.py", line 84, in complete
    return predict_f(text=prompt, grammar=grammar)
  File "/usr/src/app/parsing/gpt/few_shot_inference.py", line 36, in predict_f
    parser = GuidedParser(grammar, tokenizer, model="gpt")
  File "/usr/src/app/parsing/guided_decoding/gd_logits_processor.py", line 43, in __init__
    self.text_parser = Lark(self.text_grammar, parser="lalr")
  File "/usr/local/lib/python3.9/site-packages/lark/lark.py", line 333, in __init__
    self.grammar, used_files = load_grammar(grammar, self.source_path, self.options.import_paths, self.options.keep_all_tokens)
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 1408, in load_grammar
    builder.load_grammar(grammar, source)
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 1233, in load_grammar
    tree = _parse_grammar(grammar_text, grammar_name)
  File "/usr/local/lib/python3.9/site-packages/lark/load_grammar.py", line 964, in _parse_grammar
    raise GrammarError("Unexpected input at line %d column %d in %s: \n\n%s" %
lark.exceptions.GrammarError: Unexpected input at line 87 column 5 in <string>: 

id: " id 12
    ^


[2023-06-02 08:16:10 +0000] [8] [INFO] Exception getting bot response: Unexpected input at line 87 column 5 in <string>: 

id: " id 12
    ^

[2023-06-02 08:16:10,663] INFO in flask_app: Exception getting bot response: Unexpected input at line 87 column 5 in <string>: 

id: " id 12

[Summary] General-purpose LLMs as explained models

Note: This is meant to be done after the first submission.

A promising extension to InterroLang (probably even warranting its own paper) would be to replace the BERT-type models with a single general-purpose LLM (e.g. LLaMA or the already-in-place GPT-Neo parser) that performs all of the tasks reasonably well. This would be a more modern approach, since BERTs are slowly becoming outdated and LLMs can now run locally on consumer hardware. This would entail some changes, however, which I document in the following:

  • Write instructions for the various tasks, e.g.
    " Please predict one of the following labels: <label_1> … <label_n> Prediction: "
  • Overhaul of the entire feature importance operation category (nlpattribute, globaltopk) using Inseq (see the sketch after this list).
    ⚠️ How matrices of feature attributions would be verbalized into a response has yet to be determined.
  • Pre-compute predictions and explanations with the new LLM
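
For the Inseq-based overhaul, a minimal sketch of what computing attributions for a generative LM could look like (model name and prompt are placeholders; how the resulting attribution matrix is verbalized into a response remains open):

import inseq

model = inseq.load_model("gpt2", "integrated_gradients")
out = model.attribute(
    "Please predict one of the following labels: offensive, non-offensive. Tweet: ..."
)
out.show()  # token-level attribution heatmap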

We could also get rid of many smaller language models, e.g. the GPT-2 used for CFE generation and the SBERT used for semantic similarity.

Ideally, we would end up with one model (for the entire framework) that assesses itself. It would take care of

  1. Parsing / Intent recognition
  2. Prediction of downstream tasks
  3. Feature attribution (nlpattribute, globaltopk)
  4. Perturbations (CFE, adversarial, augment)
  5. Semantic similarity
  6. Rationalization

The only two parts of the pipeline that would remain rule-based are the dialogue state tracking (custom inputs, clarification questions, previous filters) and the response generation (currently template-based).

Resources

llama.cpp (Efficient execution of up to 7B models on CPUs)
RedPajama-INCITE-Instruct-3B (Hugging Face) – maybe better for rationalization?
RedPajama-INCITE-Chat-3B – maybe better for response generation?

[Operation] Dataset Viewer

With the dataset viewer, users should be able to search for instances not only by text but also by id.

Idea:
when input is entered, check whether it can be converted to an integer (see the sketch below).
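
A minimal sketch of that check (column names are assumptions based on the gin configs):

def search_dataset(user_input, dataset):
    query = user_input.strip()
    if query.isdigit():  # input can be converted to digits -> treat it as an id
        return dataset[dataset["idx"] == int(query)]
    return dataset[dataset["text"].str.contains(query, case=False, regex=False)]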

[Operations] Metadata (Model Cards & Datasheets)

  • Decide on methodology (Return pre-formatted strings vs. QA-type setup)
  • Write CSVs (see the lookup sketch after this list), e.g.
"number of parameters", "110 million"
"training objective", "Masked language modeling (MLM) and Next sentence prediction (NSP)"
"intended uses", "This model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering."
  • Write parsing data (prompts), e.g.
User: How many parameters does the model have?
Parsed: model parameters
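
If we go with the pre-formatted-strings approach, a minimal lookup sketch could look like this (CSV layout as in the example above; the key mapping is illustrative):

import csv

def load_metadata(path):
    with open(path, newline="") as f:
        return {key: value for key, value in csv.reader(f, skipinitialspace=True)}

def model_metadata_operation(parsed_key, metadata):
    # e.g. parsed_key = "parameters" for the parse "model parameters"
    key_map = {"parameters": "number of parameters", "objective": "training objective"}
    key = key_map.get(parsed_key, parsed_key)
    if key in metadata:
        return f"The {key} of the model: {metadata[key]}.", 1
    return "Sorry, I don't have that information about the model.", 1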

Model Cards

Model card writer on Hugging Face

Action

https://github.com/nfelnlp/InterroLang/blob/b9fdcb77139e6b34f61b1886cced7077dac03ec4/actions/metadata/model.py#L4-L11

The existing model operation should be extended.

Prompts

https://github.com/nfelnlp/InterroLang/blob/b9fdcb77139e6b34f61b1886cced7077dac03ec4/prompts/metadata/describe_model.txt#L1-L11

Datasheet

Action

https://github.com/nfelnlp/InterroLang/blob/b9fdcb77139e6b34f61b1886cced7077dac03ec4/actions/metadata/data_summary.py#L41-L44

The existing data operation should be extended. However, the additional metadata collected in lines 46 ff. should stay the same.

Prompts

https://github.com/nfelnlp/InterroLang/blob/b9fdcb77139e6b34f61b1886cced7077dac03ec4/prompts/metadata/describe_data.txt#L1-L23

[Parsing] Evaluation of parsing

Encountered this error when executing the generate_parsing_results.py file with the GPT and FLAN-T5 models:

Exception: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx, likely OOM
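
If the evaluation should also run on machines without a GPU, a fallback along these lines might help (a sketch, assuming the script selects the device itself; a CPU run will be very slow for GPT-Neo):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running parsing evaluation on {device}")
# ... then move the parsing model with model.to(device) inside generate_parsing_results.py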

Counterfactuals fail to work

Here is the stack trace for the following input: "cfe for id 14"

[2023-06-08 09:49:41,263] INFO in core: USER INPUT: cfe for id 14

Batches: 100%|████████████████████████████████████| 1/1 [00:00<00:00, 40.87it/s]
[2023-06-08 09:49:41,335] INFO in core: adapters decoded text filter id 14 and cfe
[2023-06-08 09:50:46,811] INFO in flask_app: Traceback getting bot response: Traceback (most recent call last):
File "/home/ubuntu/projects/InterroLang/actions/perturbation/cfe_generation.py", line 51, in get_samples_from_pj
generated_samples = self.explainer.perturb(instance, ctrl_code=ctrl_code, num_perturbations=None,
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/polyjuice_wrapper.py", line 263, in perturb
ctrl = self.detect_ctrl_code(orig_doc, generated_doc, eop)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/polyjuice_wrapper.py", line 131, in detect_ctrl_code
meta.compute_metadata(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/compute_perturbs/compute_ctrl_meta.py", line 546, in compute_metadata
p.compute_metadata(sentence_similarity)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/compute_perturbs/compute_ctrl_meta.py", line 386, in compute_metadata
negs2, neg_heads2 = get_negations(bcore)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/compute_perturbs/compute_ctrl_meta.py", line 156, in get_negations
for t in span:
TypeError: 'NoneType' object is not iterable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/ubuntu/projects/InterroLang/flask_app.py", line 190, in get_bot_response
response = BOT.update_state(user_text, conversation)
File "/home/ubuntu/projects/InterroLang/logic/core.py", line 892, in update_state
returned_item = run_action(
File "/home/ubuntu/projects/InterroLang/logic/action.py", line 48, in run_action
action_return, action_status = actions[p_text](
File "/home/ubuntu/projects/InterroLang/actions/perturbation/cfe.py", line 80, in counterfactuals_operation
same, diff = cfe_explainer.cfe(instance, cfe_num, ctrl_code=ALL_CTRL_CODES, _id=_id)
File "/home/ubuntu/projects/InterroLang/actions/perturbation/cfe_generation.py", line 67, in cfe
new_samples = self.get_samples_from_pj(instance, ctrl_code)
File "/home/ubuntu/projects/InterroLang/actions/perturbation/cfe_generation.py", line 54, in get_samples_from_pj
generated_samples = self.explainer.perturb(instance, ctrl_code=ALL_CTRL_CODES, num_perturbations=None,
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/polyjuice_wrapper.py", line 263, in perturb
ctrl = self.detect_ctrl_code(orig_doc, generated_doc, eop)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/polyjuice_wrapper.py", line 131, in detect_ctrl_code
meta.compute_metadata(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/compute_perturbs/compute_ctrl_meta.py", line 546, in compute_metadata
p.compute_metadata(sentence_similarity)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/compute_perturbs/compute_ctrl_meta.py", line 386, in compute_metadata
negs2, neg_heads2 = get_negations(bcore)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/polyjuice/compute_perturbs/compute_ctrl_meta.py", line 156, in get_negations
for t in span:
TypeError: 'NoneType' object is not iterable

[2023-06-08 09:50:46,811] INFO in flask_app: Exception getting bot response: 'NoneType' object is not iterable
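
The failure happens inside Polyjuice's control-code detection, so one possible mitigation (a sketch, not verified) is to perturb per control code and skip the codes that raise, rather than passing all of them at once:

def get_samples_from_pj_safe(self, instance, ctrl_codes):
    # Collect whatever perturbations Polyjuice can produce, skipping failing ctrl codes.
    samples = []
    for code in ctrl_codes:
        try:
            samples += self.explainer.perturb(instance, ctrl_code=code,
                                              num_perturbations=None)
        except TypeError:
            # detect_ctrl_code can fail with "'NoneType' object is not iterable"
            continue
    return samples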

[Data] Pre-computed rationales as JSON

Computing rationales is only possible with an LLM such as GPT-Neo or the Dolly-3B model. When we do the user study, we have to choose one of the smaller models for parsing (Adapter, FLAN-T5), so users don't have to wait for the responses. This also means we cannot compute rationales on the fly, but have to have a JSON of pre-computed rationales for all three datasets.

First tests indicate that Dolly-3B might be the best rationalizer we can use. Few-shot also gives us better results than zero-shot.

Code

  • Write code that executes rationalize_operation on every instance and saves the result to a JSON file
  • Read JSON in rationalize_operation (have on-the-fly computation as a separate function instead)

Datasets

  • BoolQ
  • DailyDialog
  • OLID

# Draft script: pre-compute rationales for BoolQ with Dolly-v2-3b and save them as JSON.
import json

import numpy as np
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

dataset_name = "boolq"
dataset = pd.read_csv("./data/boolq_validation.csv")
# Explained model: DistilBERT fine-tuned on BoolQ (the model whose predictions we rationalize)
model = AutoModelForSequenceClassification.from_pretrained("andi611/distilbert-base-uncased-qa-boolq", num_labels=2)
instances = []
for i in range(len(dataset)):
    instances.append([dataset["question"][i], dataset["passage"][i]])
print(instances)

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("andi611/distilbert-base-uncased-qa-boolq")
model.to(device)

# Rationalizer: Dolly-v2-3b generates the free-text explanations
gpt_tokenizer = GPTNeoXTokenizerFast.from_pretrained("databricks/dolly-v2-3b", padding_side="left")
gpt_model = GPTNeoXForCausalLM.from_pretrained("databricks/dolly-v2-3b")
gpt_model.to(device)

explanations = []
for instance in instances:
    # if dataset_name == "boolq":
    text = 'Question: ' + instance[0] + '\nPassage: ' + instance[1]
    label_dict = {0: 'false', 1: 'true'}
    text_description = 'question and passage'
    fields = ['question', 'passage']
    fields_enum = ', '.join([f"'{f}'" for f in fields])
    output_description = 'answer'

    string = instance[0] + ' ' + instance[1]

    encoding = tokenizer.encode_plus(string, return_tensors='pt', max_length=512, truncation=True)
    input_ids = encoding["input_ids"]
    attention_mask = encoding["attention_mask"]
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)
    input_model = {
        'input_ids': input_ids.long(),
        'attention_mask': attention_mask.long(),
    }
    output_model = model(**input_model)[0]

    # Get logit
    model_predictions = np.argmax(output_model.cpu().detach().numpy())

    # model_predictions = model.predict()
    pred_str = label_dict[model_predictions]
    # else:
    #     return f"Dataset {dataset_name} currently not supported by rationalize operation", 1
    #
    prompt = f"{text}\n" \
             f"Based on {text_description}, the {output_description} is {pred_str}. " \
             f"Without using {fields_enum}, or revealing the answer or outcome in your response, " \
             f"explain why: "

    input_ids = gpt_tokenizer(prompt, return_tensors="pt").input_ids
    input_ids = input_ids.to(device)
    generation = gpt_model.generate(
        input_ids,
        max_length=350,
        no_repeat_ngram_size=2,
    )
    decoded_generation = gpt_tokenizer.decode(generation[0], skip_special_tokens=True)
    #
    inputs = decoded_generation.split("Based on ")[0]
    explanation = decoded_generation.split("explain why: ")[1]
    print(explanation)
    explanations.append(explanation)

# Save the pre-computed rationales to a JSON file
with open("./netscratch/qwang/boolq_rationalization.json", "w") as json_file:
    json.dump(explanations, json_file)

[Data] Additional prompts that capture dataset-specific descriptions

TTM had prompts that included "people with age [...]". All prompts that we have adapted and written so far are generic and applicable to any dataset (which is good!). However, for our three prototype datasets (BoolQ, OLID, DailyDialog), we might need additional prompts that include strings such as:

BoolQ

  • question
  • passage
  • document
  • article

e.g. "What would need to be changed in this document to predict False?"

OLID

  • tweet
  • Twitter post

e.g. "Please display a random tweet from the dataset and explain the predictions"

DailyDialog

  • dialogue
  • conversation

e.g. "For dialogue 250, please show the prediction"


Related:
metadata prompts README

[Operations] important/globaltopk + filters

Created from #25

Right now, globaltopk only operates on a fixed set of pre-computed global explanations. It is only possible to use important on the whole dataset or on a specific class ({classname}).
If there were a way to efficiently compute it on other temp_datasets, we could combine this operation with different filters (a minimal sketch follows below).
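
If the per-instance attributions are already cached, a sketch of the combination could aggregate the cached token scores only over the ids in the current temp_dataset (cache layout and names are assumptions):

from collections import defaultdict

def global_topk_filtered(temp_dataset, cached_attributions, k=10):
    """cached_attributions: dict mapping id -> list of (token, score) pairs."""
    totals = defaultdict(float)
    for idx in temp_dataset["idx"]:
        for token, score in cached_attributions.get(idx, []):
            totals[token] += abs(score)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:k]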

prompts/work_in_progress

[Operation] nlpattribute throws an error for OLID

Config: olid.gin
Input: sentence level feature importance for id 1766
Parsed: filter id 1766 and nlpattribute sentence [e]
Traceback:

[2023-06-18 10:43:13,641] INFO in flask_app: Traceback getting bot response: Traceback (most recent call last):
File "/home/ubuntu/projects/InterroLang/flask_app.py", line 192, in get_bot_response
response = BOT.update_state(user_text, conversation)
File "/home/ubuntu/projects/InterroLang/logic/core.py", line 898, in update_state
returned_item = run_action(
File "/home/ubuntu/projects/InterroLang/logic/action.py", line 51, in run_action
action_return, action_status = actions[p_text](
File "/home/ubuntu/projects/InterroLang/actions/explanation/feature_importance.py", line 374, in feature_importance_operation
return_s += get_sentence_level_feature_importance(conversation, filtered_text, simulation)
File "/home/ubuntu/projects/InterroLang/actions/explanation/feature_importance.py", line 195, in get_sentence_level_feature_importance
res_list = get_explanation(dataset_name, inputs, conversation, file_name="sentence_level")
File "/home/ubuntu/projects/InterroLang/actions/explanation/feature_importance.py", line 85, in get_explanation
res_list = generate_explanation(model, dataset_name, inputs, conversation, file_name=file_name)
File "/home/ubuntu/projects/InterroLang/actions/custom_input.py", line 249, in generate_explanation
attribution, predictions = compute_feature_attribution_scores(b, model, device)
File "/home/ubuntu/projects/InterroLang/actions/custom_input.py", line 165, in compute_feature_attribution_scores
attributions = explainer.attribute(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/captum/log/init.py", line 35, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/captum/attr/_core/layer/layer_integrated_gradients.py", line 365, in attribute
inputs_layer = _forward_layer_eval(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/captum/_utils/gradient.py", line 182, in _forward_layer_eval
return _forward_layer_eval_with_neuron_grads(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/captum/_utils/gradient.py", line 445, in _forward_layer_eval_with_neuron_grads
saved_layer = _forward_layer_distributed_eval(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/captum/_utils/gradient.py", line 294, in _forward_layer_distributed_eval
output = _run_forward(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/captum/_utils/common.py", line 456, in _run_forward
output = forward_func(
File "/home/ubuntu/projects/InterroLang/actions/custom_input.py", line 117, in bert_forward
output_model = model(**input_model)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py", line 1599, in forward
outputs = self.bert(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/transformers/adapters/context.py", line 108, in wrapper_func
results = f(self, *args, **kwargs)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py", line 1042, in forward
embedding_output = self.embeddings(
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py", line 245, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1123, in call_impl
hook_result = hook(self, input, result)
File "/home/ubuntu/miniconda3/envs/nlg/lib/python3.9/site-packages/OpenAttack/utils/transformers_hook.py", line 7, in call
output
.retain_grad()
RuntimeError: can't retain_grad on Tensor that has requires_grad=False

It doesn't work for any id. It seems that only the operations parsed as nlpattribute or nlpattribute sentence throw this error for OLID.
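
The last frames of the traceback point to OpenAttack's transformers_hook calling .retain_grad() on the embedding output, so a plausible (unverified) explanation is that a forward hook registered for the adversarial operation is still attached to the shared model. A debugging sketch:

# Hypothesis only: inspect and remove leftover forward hooks on the word embedding
# before running Captum. The module path follows the traceback above; `model` is the
# explained BERT model from the conversation.
embeddings = model.bert.embeddings.word_embeddings
print(embeddings._forward_hooks)   # is OpenAttack's hook still registered here?
embeddings._forward_hooks.clear()  # then retry the nlpattribute operation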

Set up TalkToModel env

Integration

This is an overview of how we could integrate another desired dataset into TTM.

1. Get TTM

git clone git@github.com:dylan-slack/TalkToModel.git

2. Setup env

Get into the TTM directory and run these commands:

conda create -n ttm python=3.9
conda activate ttm

Then you should install all dependencies:

pip install -r requirements.txt
pip install datasets

3. Add needed files

First, put the config file into the /configs folder as /configs/boolq.gin:

##########################################
# The boolq dataset conversation config
##########################################

# for few shot, e.g., "EleutherAI/gpt-neo-2.7B"
ExplainBot.parsing_model_name = "EleutherAI/gpt-neo-2.7B"


# set skip_prompts to true for quicker startup for finetuned models
# make sure to set to false using few-shot models
ExplainBot.skip_prompts = False

ExplainBot.t5_config = "./parsing/t5/gin_configs/t5-large.gin"
ExplainBot.seed = 0
ExplainBot.background_dataset_file_path = "./data/boolq_train.csv"
ExplainBot.model_file_path = "./data/boolq_model"
ExplainBot.dataset_file_path = "./data/boolq_validation.csv"

ExplainBot.name = "boolq"

ExplainBot.dataset_index_column = "idx"
ExplainBot.target_variable_name = "label"
ExplainBot.categorical_features = None
ExplainBot.numerical_features = None
ExplainBot.remove_underscores = False

ExplainBot.prompt_metric = "cosine"
ExplainBot.prompt_ordering = "ascending"

# Prompt params
Prompts.prompt_cache_size = 1_000_000
Prompts.prompt_cache_location = "./cache/boolq-prompts.pkl"
Prompts.max_values_per_feature = 2
Prompts.sentence_transformer_model_name = "all-mpnet-base-v2"
Prompts.prompt_folder = "./explain/prompts"
Prompts.num_per_knn_prompt_template = 1
Prompts.num_prompt_template = 7

# Explanation Params
Explanation.max_cache_size = 1_000_000

# MegaExplainer Params
MegaExplainer.cache_location = "./cache/boolq-mega-explainer-tabular.pkl"
MegaExplainer.use_selection = False

# Conversation params
Conversation.class_names = {1: "True", 0: "False"}

# Dataset description
DatasetDescription.dataset_objective = "predict to answer yes/no questions based on text passages"
DatasetDescription.dataset_description = "Boolean question answering (yes/no)"
DatasetDescription.model_description = "DistilBERT"

And change the global config files in global_config.gin:

GlobalArgs.config = "./configs/boolq.gin"

Then you should add datasets:

from datasets import load_dataset
val = load_dataset("super_glue", "boolq", split="validation").to_csv('data/boolq_validation.csv')
train = load_dataset("super_glue", "boolq", split="train").to_csv('data/boolq_train.csv')

What's more, you should download the model from
https://huggingface.co/andi611/distilbert-base-uncased-qa-boolq/tree/main and put it under /data as ./data/boolq_model (matching ExplainBot.model_file_path above).

Adapt original files:

In /explain/logic.py,

  • Add load_hf_model()
@gin.configurable
def load_hf_model(model_id):
    """ Loads a (local) Hugging Face model from a directory containing a pytorch_model.bin file and a config.json file.
    """
    # TransformerModel is a custom wrapper class (to be defined) around the HF model;
    # alternatively: transformers.AutoModel.from_pretrained(model_id)
    return TransformerModel(model_id)
  • Comment load_explanations:
# Load the explanations
# self.load_explanations(background_dataset=background_dataset)
  • Change else part from load_model():
else:
    model = load_hf_model(filepath)
    self.conversation.add_var('model', model, 'model')

4. Execution

python flask_app.py

[Operations] Adversarial attack is not displayed

I've been testing the adversarial operation and it currently fails with the following message:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

screenshot

My input was: Show me an adversarial sample for id 4

The decoded/parsed text: filter id 4 and adversarial

It seems that the input is parsed correctly but the operation itself fails when it tries to access the dataset.
I also tried to change the visualize parameter:
d = attack_eval.eval(dataset, visualize=True) (line 85 in adversarial.py) but it results in the same error message.
The adversarial attack itself is successful, I can see it in the console output.
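
The error message itself comes from using a multi-element NumPy array in a boolean context somewhere in the display path, e.g.:

import numpy as np

preds = np.array([0, 1])
# bool(preds == 1)         # would raise: "The truth value of an array with more than one element is ambiguous."
print((preds == 1).any())  # fix: reduce the element-wise comparison with .any() or .all()

So the fix is probably in the display code of the adversarial action, where an array is compared or used in an if-condition, rather than in the visualize parameter.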

Error in rationalize operation

Input: rationalize the prediction for id 9

Logs:

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/usr/src/app/timeout.py", line 19, in run
    self._result = self._func(*self._args, **self._kwargs)
  File "/usr/src/app/actions/explanation/rationalize.py", line 89, in rationalize_operation
    generation = conversation.decoder.gpt_model.generate(
  File "/usr/local/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/generation_utils.py", line 1296, in generate
    return self.greedy_search(
  File "/usr/local/lib/python3.9/site-packages/transformers/generation_utils.py", line 1690, in greedy_search
    outputs = self(
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 745, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/transformers/models/gpt_neo/modeling_gpt_neo.py", line 583, in forward
    position_embeds = self.wpe(position_ids)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/usr/local/lib/python3.9/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
[2023-06-02 08:31:46 +0000] [8] [INFO] Traceback getting bot response: Traceback (most recent call last):
  File "/usr/src/app/flask_app.py", line 164, in get_bot_response
    response = BOT.update_state(user_text, conversation)
  File "/usr/src/app/logic/core.py", line 890, in update_state
    returned_item = run_action(
  File "/usr/src/app/logic/action.py", line 48, in run_action
    action_return, action_status = actions[p_text](
TypeError: cannot unpack non-iterable NoneType object

[2023-06-02 08:31:46,296] INFO in flask_app: Traceback getting bot response: Traceback (most recent call last):
  File "/usr/src/app/flask_app.py", line 164, in get_bot_response
    response = BOT.update_state(user_text, conversation)
  File "/usr/src/app/logic/core.py", line 890, in update_state
    returned_item = run_action(
  File "/usr/src/app/logic/action.py", line 48, in run_action
    action_return, action_status = actions[p_text](
TypeError: cannot unpack non-iterable NoneType object

[2023-06-02 08:31:46 +0000] [8] [INFO] Exception getting bot response: cannot unpack non-iterable NoneType object
[2023-06-02 08:31:46,296] INFO in flask_app: Exception getting bot response: cannot unpack non-iterable NoneType object
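
The IndexError is raised in GPT-Neo's position embeddings, which usually means the prompt (plus generation budget) exceeds the model's max_position_embeddings (2048 for GPT-Neo). A hedged sketch of a guard in rationalize_operation (names follow the traceback loosely; the tokenizer attribute is an assumption):

def generate_rationale(conversation, prompt, max_new_tokens=128):
    # Keep prompt + generated tokens within the model's maximum context length.
    gpt_model = conversation.decoder.gpt_model
    gpt_tokenizer = conversation.decoder.gpt_tokenizer  # attribute name is an assumption
    max_ctx = gpt_model.config.max_position_embeddings  # 2048 for GPT-Neo
    input_ids = gpt_tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_ctx - max_new_tokens,
    ).input_ids.to(gpt_model.device)
    generation = gpt_model.generate(input_ids, max_new_tokens=max_new_tokens)
    return gpt_tokenizer.decode(generation[0], skip_special_tokens=True)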

[Data] Dev set for parsing accuracy evaluation

We need a custom dataset of prompts (around 100 instances) that we can test our four systems on before the user study.

Test set creation will be a separate task done in parallel with the user study (#73).

Agree on folder structure

Suggested structure

  • actions (on top level instead of under explain/)
    • cfe_code
    • feature_importance
    • similarity
  • cache
    • subdirectories for boolq, dailydialog, olid
    • in each subdirectory: JSON files of pre-computed predictions and explanations
  • configs
    • boolq.gin
    • dailydialog.gin
    • olid.gin
  • data
    • subdirectories for boolq, dailydialog, olid
    • in each subdirectory: CSV files containing the raw datasets
  • experiments
  • explained_models
    • "boolq" and "da_classifier" should both be subdirectories
  • logic (folder)
    • action.py
    • conversation.py
    • core.py (previous explain/logic.py)
    • dataset_description.py
    • decoder.py
    • grammar.py
    • parser.py
    • prompts.py
    • sample_prompts_by_action.py
    • utils.py (read_and_format_data, etc.)
    • write_to_log.py
  • parsing (fine-tuning of t5 which is a separate workflow)
  • prompts (on top level instead of under explain/)
    • generation.py (explain/prompts.py > [generate_prompts, filter_prompts, get_embedding, …])
    • retrieval.py (explain/prompts.py > [get_prompts, get_k_nearest_prompts])
  • tests

Stuff we can delete from TTM

  • explain/complete.py (optional NLG [?] engine)
  • explain/explanation.py (additional logic for handling feature importance)
  • explain/feature_interaction.py (not trivially applicable)
  • explain/mega_explainer (we only need the IG explainer)

specific to TTM datasets (will be replaced by BoolQ, OLID, DailyDialog)

  • configs/{…}.gin
  • data

[Summary] Nice to Have

  • In actions/filter.py: labelfilter(), predfilter(), lengthfilter() modify the temp_dataset from conversation. However, most of our operations won't use it, since they just read data from cache directly -> #52
  • Adversarial examples: we have to use one more new package, OpenAttack. An implementation for our use case can be found here. The disadvantage is the execution time.

2023-04-03

  • Visualization of explanation. See #23

2023-05-11

  • Add keyboard shortcuts similar to a Python terminal, e.g. to reset the temp dataset, remove the custom input, or sample actions

2023-05-14

  • Pagination for dataset viewer (allow user to click through more than just the first page of 10 results, see "Dataset Preview" on Hugging Face)
  • Hints or tooltips for operations, e.g. that operations like cfe, nlpattribute, rationalize can only be used on single instances --> #90

2023-05-15

  • Clicking on an icon next to an instance in dataset viewer enters "ID xx" into input box
  • Clicking on a "On this subset..." button next to dataset viewer inserts a random "includes" prompt and caches this temp dataset to be used in the backend

2023-05-17

  • Add pytest for Lark grammar

2023-05-20

  • Input box should dynamically expand into multi-line view (see ChatGPT interface)
  • When hovering over "ID xx" in previous turns, the contents should be shown in a pop-up tooltip
  • Sample prompts should dynamically choose IDs (not necessarily IDs as hard-coded in prompts folder)
  • Class names should be highlighted with a background color
  • Three buttons with possible follow-up questions below the last response -> Clicking them automatically inserts them into the query (User can still edit them)

[Data] Adapt {span} prompts to custom input functionality

There are a few prompts that are based on the notion of the string filter being part of the user question. However, we changed that recently to be handled by either the string filter input box next to the dataset viewer or the custom input dropdown option, so these prompts would have to be slightly adapted:

  • global feature importance (not applicable)
  • filter
    includes {span} -> includes
    When a string filter is used, the backend will use the filtered dataset as the temp_dataset.
    The user questions are modified slightly to allow for two kinds of formats:
  1. The span is part of the question, i.e. a {span} placeholder
  2. The span is not mentioned explicitly, but referred to as "my query", "search term" or "filter", e.g. "Can you display all occurrences where my filter applies?"

[Summary] Filters

This issue keeps track of the filter actions and functionality. Apart from the string filter (includes), everything else is optional for now.

  • String filter (includes)
    Checks if a span/string is included in one of the text fields, e.g. includes spider

  • Prediction filter (predictionfilter)
    Checks if the model predicted this label, e.g. predictionfilter offensive (OLID)

  • Label filter (labelfilter) [in TTM, but not yet adapted]
    Checks if the true label is equal to this label, e.g. labelfilter commissive (DD)

  • Length filter (lengthfilter)
    Checks if the instance is above or below a certain length, e.g. lengthfilter words above 30 would show all instances with 30 words or more. Possibly extensible to more parameters like text fields (is one specific field longer than the value). This should work similarly to the numerical filters in TTM (there are already equality terms implemented in the grammar). Possible arguments are words, sents (sentences) and chars (characters)

  • Similarity filter (similarityfilter or similar n and …)
    Checks if the instance is among the n instances selected with similar n in a previous turn.

[Operations] spanimportance

Created from #25

An explanation operation that, based on the pre-computed feature attributions, would retrieve k instances where a custom input span has the highest attribution score. The sorting could be influenced by (1) the rank of the span in sorted attributions and (2) the absolute difference to the next highest attributed word/sentence.

spanimportance topk {k}

prompts/talktomodel

[Data] Annotate user responses (creating "user" test set)

We will need to annotate the feedback files manually. In particular, based on the original user questions, we should annotate the gold parse. This lets us compare them with the actual parses and compute accuracies in #53

  • Repo group
  • BoolQ (Hosted Interface group)
  • DailyDialog (Hosted Interface group)
  • OLID (Hosted Interface group)

[Operations] replace operation

A possible adaptation of the deprecated TTM operation change (actions/perturbation/what_if.py)
is replace, where any string can be replaced with a custom input.

This would require two custom inputs, in fact. It would modify the temp_dataset. It is also similar to cfe, but here the user has full control of what they would insert in place of the original string from the data.

A major limitation is that even prediction is not easily possible, because the entire dataset would have to be recomputed. However, this might still be relevant for a small temp_dataset, e.g. you filter by "spider" and are left with 13 instances for BoolQ. With this scale, it would certainly be possible to compute predictions and explanations on the fly.
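
A minimal sketch of replace on a small temp_dataset, including on-the-fly re-prediction (assuming a pandas temp_dataset and some predict_fn callable for the explained model; names are illustrative):

def replace_operation(temp_dataset, old_span, new_span, predict_fn, text_col="text"):
    # Only feasible on-the-fly for small filtered subsets, as argued above.
    modified = temp_dataset.copy()
    modified[text_col] = modified[text_col].str.replace(old_span, new_span, regex=False)
    modified["prediction"] = [predict_fn(text) for text in modified[text_col]]
    return modified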

Newer transformers is incompatible with polyjuice_nlp

1.

While testing the cfe code, which requires both the transformers and polyjuice_nlp packages, it turns out that after installing everything needed according to the official polyjuice_nlp website, there is always an error:

ValueError: The following `model_kwargs` are not used by the model: ['n'] (note: typos in the generate arguments will also show up in this list)

This issue is also mentioned in the transformers repo.

The easiest way to solve it is to use a lower transformers version.

2.

If running code and encounter this problem:

AttributeError: Can't get attribute 'Trie' on <module 'transformers.tokenization_utils'. 

One similar issue can be found here.

Make sure that your transformers version is 4.18.0

3.

If you want to install polyjuice_nlp, it's better to clone the git repository and install it directly from source (instead of via pip/conda).

Similar operation parser issue

During parsing of similar prompts, the number of similar items to be retrieved is omitted by the parser.

input: could you locate 6 comparable data point to id 75 for me?
parsed: filter id 75 and similar [e]
Batches: 100%|███████████████████████████████████| 1/1 [00:00<00:00, 138.85it/s]
Batches: 100%|██████████████████████████████████| 75/75 [00:00<00:00, 95.83it/s]
[2023-05-20 11:08:59,960] INFO in write_to_log: {'bot_name': 'daily_dialog', 'username': 'unknown', 'id': 'e89e983cc655e38ebdd12604539f48ef9fd9f6140419140cbd6e53b94fbe', 'system_input': 'Could you locate 6 comparable data point to ID 75 for me?', 'parsed_text': ' filter id 75 and similar [e]', 'system_response': "The original text for <b>id 75</b>:<br><summary>a good rest is all you need, and drink more water. i'll write you a prescription....</summary><details>a good rest is all you need, and drink more water. i'll write you a prescription.</details><br>Here are some instances similar to <b>id 75</b>:<br><b> id 1989</b> (cossim 0.665): <summary>try to get outdoors more and be sure to get more rest....</summary><details>try to get outdoors more and be sure to get more rest.</details>", 'time': '05_20_2023_02_08_59_PDT'}
[2023-05-20 11:08:59,960] INFO in _internal: 134.96.189.13 - - [20/May/2023 11:08:59] "POST /get_response HTTP/1.1" 200 -

[Summary] gin configs

Nearest neighbor

  • BoolQ
  • OLID
  • DailyDialog

GPT-Neo 2.7B

  • BoolQ
  • OLID
  • DailyDialog

Fine-tuned FLAN-T5-large

  • BoolQ
  • OLID
  • DailyDialog

Adapter + BERT-base-uncased

  • BoolQ
  • OLID
  • DailyDialog

[Summary] User study functionality

  • Add a demonstration for users before they start the evaluation (using Selenium, automatic user questions, or screen recording)
  • Investigate how to run Flask application on a server, so we could access it via URL in a browser @OguzCennet
  • For "Feedback" option, check how to collect acceptability ratings in the user study (decide on rating values, save to files etc.)
  • Test sample interface ("Help me generate a question about...") with all operations
  • Check that the response time of all operations does not exceed 60 seconds

[Summary] Parsing evaluation (User response set)

Blocked by #41 and #106

BoolQ

  • Nearest Neighbors
  • GPT-Neo 2.7B
  • FLAN-T5-large (780M)
  • BERT+Adapter (110M)

DailyDialog

  • Nearest Neighbors
  • GPT-Neo 2.7B
  • FLAN-T5-large (780M)
  • BERT+Adapter (110M)

OLID

  • Nearest Neighbors
  • GPT-Neo 2.7B
  • FLAN-T5-large (780M)
  • BERT+Adapter (110M)

Optional fifth model: Dolly-v2-3B
