
CLUTRR

Compositional Language Understanding with Text-based Relational Reasoning

A benchmark dataset generator to test relational reasoning on text

Code for generating data for our paper "CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text" at EMNLP 2019

Dependencies

  • pandas - to store and retrieve data in CSV
  • names - to generate fancy names
  • tqdm - for fancy progress bars

Install

python setup.py develop

Tasks

CLUTRR is highly modular and can therefore be used for various probing tasks. Here we document the available task types and the corresponding config arguments to generate them. To run a task, refer to the following table and run:

python main.py --train_tasks <> --test_tasks <> <args>

where train_tasks takes the form <task_id>.<relation_length> and test_tasks is a comma-separated list of the same form. For example:

python main.py --train_tasks 1.3 --test_tasks 1.3,1.4

You can provide general arguments as well, which are defined in the next section.
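As a quick illustration of the spec format, here is a tiny parser for the <task_id>.<relation_length> strings; parse_task_spec is a hypothetical helper for exposition, not part of the CLUTRR codebase.

```python
# Illustrative only: parse a comma-separated spec like "1.3,1.4"
# (as accepted by --train_tasks / --test_tasks) into (task_id, k) pairs.
def parse_task_spec(spec):
    pairs = []
    for item in spec.split(","):
        task_id, k = item.split(".")
        pairs.append((int(task_id), int(k)))
    return pairs

print(parse_task_spec("1.3,1.4"))  # [(1, 3), (1, 4)]
```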

Task Description
1 Basic family relations, free of noise
2 Family relations with supporting facts
3 Family relations with irrelevant facts
4 Family relations with disconnected facts
5 Family relations with all facts (2-4)
6 Family relations - Memory task: retrieve the relations already defined in the text
7 Family relations - Mix of Memory and Reasoning - 1 & 6

Generated data is stored in the data/ folder.

Generalizability

Each task mentioned above can be used with a different relation length k. For example, Task 1 can have a train set with k=3 and test sets with k=4,5,6, etc. See the section above on how to provide such arguments.
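Building the comma-separated task specs for such a length-generalization split can be sketched as follows; task_specs is a hypothetical helper, not part of the repo.

```python
# Hypothetical helper: build the comma-separated task spec for one task id
# and a list of relation lengths k.
def task_specs(task_id, ks):
    return ",".join(f"{task_id}.{k}" for k in ks)

# Train on k=3, test generalization to k=4,5,6:
cmd = (f"python main.py --train_tasks {task_specs(1, [3])}"
       f" --test_tasks {task_specs(1, [4, 5, 6])}")
print(cmd)  # python main.py --train_tasks 1.3 --test_tasks 1.4,1.5,1.6
```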

AMT Paraphrasing

We collected paraphrases for relations of length k=1,2,3 from Amazon Mechanical Turk using the ParlAI MTurk interface. The collected paraphrases can be re-used as templates to generate arbitrarily large datasets in the above configurations. We will release the templates shortly here.

To use the templates, pass the --use_mturk_template flag and the template location via the --template_file argument. The optional --template_length flag governs the maximum length k up to which sentences are replaced with templates. The script auto-downloads our collected and cleaned template files from the server using the setup() method in main.py.
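The templates use entity slots such as ENT_0_female (the convention visible in the AMT templates quoted in the issues below). A minimal sketch of how such a template could be instantiated, assuming plain string substitution; fill_template is illustrative, not the repo's code.

```python
# Illustrative sketch: substitute entity placeholders in an AMT template
# with bracketed names, as they appear in the generated stories.
def fill_template(template, entities):
    """entities maps placeholder -> name, e.g. {'ENT_0_female': 'Laura'}."""
    text = template
    for placeholder, name in entities.items():
        text = text.replace(placeholder, f"[{name}]")
    return text

s = fill_template(
    "ENT_1_female is a girl with a grandmother named ENT_0_female .",
    {"ENT_0_female": "Laura", "ENT_1_female": "Penny"},
)
print(s)  # [Penny] is a girl with a grandmother named [Laura] .
```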

Transductive and Inductive Setting

CLUTRR provides both transductive and inductive settings for relational reasoning. In the transductive setting, the relation patterns encountered in the training set are the same as in the test set. While this setup is not very interesting, it can be used for basic sanity checks of a model. In the inductive setting, the relation patterns are split 80-20 between training and testing. Combined with the ability to split AMT placeholders, CLUTRR provides 4 scenarios, selected with the following flags:

Setup Flags Description
(1) (default): same patterns in train & test, same AMT placeholders. Easy, due to data leakage.
(2) --template_split: same patterns in train & test, different AMT placeholders. Transductive, medium difficulty.
(3) --holdout: different patterns in train & test, same AMT placeholders. Inductive, but language models could still exploit the syntax.
(4) --template_split --holdout: different patterns in train & test, different AMT placeholders. Inductive, the hardest setup.
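The inductive --holdout behavior amounts to splitting the set of relation patterns disjointly between train and test. A minimal sketch of that 80-20 idea follows; the repo's actual split logic may differ, and holdout_split is a made-up name.

```python
import random

# Sketch of an 80-20 disjoint split over relation patterns: patterns seen
# at test time are never seen at train time (the inductive setting).
def holdout_split(patterns, test_frac=0.2, seed=0):
    patterns = list(patterns)
    random.Random(seed).shuffle(patterns)
    cut = int(len(patterns) * (1 - test_frac))
    return patterns[:cut], patterns[cut:]

train_pats, test_pats = holdout_split(
    ["son-aunt", "father-sister", "mother-brother",
     "daughter-uncle", "son-grandfather"])
assert not set(train_pats) & set(test_pats)  # disjoint patterns
```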

Thanks to @NicolasAG for adding this information in the README.

Rules

We create an idealized, simple kinship world, derived from a set of clauses or rules. The rules are defined in the rules_store.yaml file.
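To make the clause idea concrete, here is a toy composition over relation chains. The two rules shown are only examples (the son-grandfather chain composing to father appears in an issue quoted below); the full rule set lives in rules_store.yaml, and RULES/compose are illustrative names.

```python
# Toy illustration: clauses that fold a chain of relations into one.
# (X, 'son', Y) is read as "Y is X's son", matching the dataset's triples.
RULES = {
    ("son", "son"): "grandson",           # X's son's son is X's grandson
    ("son", "grandfather"): "father",     # X's son's grandfather is X's father
}

def compose(chain):
    """Fold a chain of relations into a single relation via RULES."""
    rel = chain[0]
    for nxt in chain[1:]:
        rel = RULES[(rel, nxt)]
    return rel

print(compose(["son", "grandfather"]))  # father
```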

Usage

To generate the simple setup for task 1, first cd into the clutrr/clutrr folder and run:

python main.py --train_tasks 1.2 --test_tasks 1.2 --train_rows 500 --test_rows 10 --equal --holdout --use_mturk_template --data_name "Robust Reasoning - clean - AMT" --unique_test_pattern

Pre-generated datasets used in our paper can be found here.

CLI Usage

usage: main.py [-h] [--max_levels MAX_LEVELS] [--min_child MIN_CHILD]
               [--max_child MAX_CHILD] [--p_marry P_MARRY] [--boundary]
               [--output OUTPUT] [--rules_store RULES_STORE]
               [--relations_store RELATIONS_STORE]
               [--attribute_store ATTRIBUTE_STORE] [--train_tasks TRAIN_TASKS]
               [--test_tasks TEST_TASKS] [--train_rows TRAIN_ROWS]
               [--test_rows TEST_ROWS] [--memory MEMORY]
               [--data_type DATA_TYPE] [--question QUESTION] [-v]
               [-t TEST_SPLIT] [--equal] [--analyze] [--mturk] [--holdout]
               [--data_name DATA_NAME] [--use_mturk_template]
               [--template_length TEMPLATE_LENGTH]
               [--template_file TEMPLATE_FILE] [--template_split]
               [--combination_length COMBINATION_LENGTH]
               [--output_dir OUTPUT_DIR] [--store_full_puzzles]
               [--unique_test_pattern]

optional arguments:
  -h, --help            show this help message and exit
  --max_levels MAX_LEVELS
                        max number of levels
  --min_child MIN_CHILD
                        min number of children per node
  --max_child MAX_CHILD
                        max number of children per node
  --p_marry P_MARRY     Probability of marriage among nodes
  --boundary            Boundary in entities
  --output OUTPUT       Prefix of the output file
  --rules_store RULES_STORE
                        Rules store
  --relations_store RELATIONS_STORE
                        Relations store
  --attribute_store ATTRIBUTE_STORE
                        Attributes store
  --train_tasks TRAIN_TASKS
                        Define which task to create dataset for, including the
                        relationship length, comma separated
  --test_tasks TEST_TASKS
                        Define which tasks, including the relation lengths, to
                        test for, comma separated
  --train_rows TRAIN_ROWS
                        number of train rows
  --test_rows TEST_ROWS
                        number of test rows
  --memory MEMORY       Percentage of tasks which are just memory retrieval
  --data_type DATA_TYPE
                        train/test
  --question QUESTION   Question type. 0 -> relational, 1 -> yes/no
  -v, --verbose         print the paths
  -t TEST_SPLIT, --test_split TEST_SPLIT
                        Testing split
  --equal               Make sure each pattern is equal. Warning: Time
                        complexity of generation increases if this flag is
                        set.
  --analyze             Analyze generated files
  --mturk               prepare data for mturk
  --holdout             if true, then hold out unique patterns in the test set
  --data_name DATA_NAME
                        Dataset name
  --use_mturk_template  use the templating data for mturk
  --template_length TEMPLATE_LENGTH
                        Max Length of the template to substitute
  --template_file TEMPLATE_FILE
                        location of placeholders
  --template_split      Split on template level
  --combination_length COMBINATION_LENGTH
                        number of relations to combine together
  --output_dir OUTPUT_DIR
                        output_dir
  --store_full_puzzles  store the full puzzle data in puzzles.pkl file.
                        Warning: may take considerable amount of disk space!
  --unique_test_pattern
                        If true, have unique patterns generated in the first
                        gen, and then choose from it.

Citation

If our work is useful for your research, consider citing it using the following bibtex:

@article{sinha2019clutrr,
  Author = {Koustuv Sinha and Shagun Sodhani and Jin Dong and Joelle Pineau and William L. Hamilton},
  Title = {CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text},
  Year = {2019},
  journal = {Empirical Methods in Natural Language Processing (EMNLP)},
  arxiv = {1908.06177}
}

Papers using CLUTRR

  • Nicolas Gontier, Koustuv Sinha, Siva Reddy, Chris Pal, Measuring Systematic Generalization in Neural Proof Generation with Transformers (NeurIPS 2020) Paper Code & Data
  • Pasquale Minervini, Sebastian Riedel, Pontus Stenetorp, Edward Grefenstette, Tim Rocktäschel, Learning Reasoning Strategies in End-to-End Differentiable Proving (ICML 2020) Paper Code & Data

Join the CLUTRR community

See the CONTRIBUTING file for how to help out.

License

CLUTRR is licensed under CC-BY-NC 4.0 (Attribution-NonCommercial 4.0 International), as found in the LICENSE file.

clutrr's People

Contributors

koustuvsinha, pminervini, shagunsodhani


clutrr's Issues

yaml syntax

In the store there is currently yaml.load; this returns an error and should be changed to yaml.safe_load.

Erroneous rule?

Hi, thanks for your great work!

After generating data with

python main.py --train_tasks 1.2,1.3 --test_tasks 1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10 --train_rows 5000 --test_rows 500 --holdout

in 1.10_test.csv, I found the following story:

[Laura] has a daughter called [Penny]. The husband of [Penny] is [Robert]. [Craig] is a brother of [Robert]. [Robert] is the father of [William]. [Ruthann] is a sister of [Robert]. [Eugenia] is [Craig]'s daughter. [Alicia] is the aunt of [Eugenia]. [William] is a brother of [Gary]. [Robert] has a son called [Gary]. [Robert] is [Ruthann]'s brother.

Where the target is:

['[Laura] has a daughter called [Alicia]. ']

After convincing myself that this cannot hold, I looked into the proof state also provided in the csv:

[{('Laura', 'daughter', 'Alicia'): [('Laura', 'son', 'Craig'), ('Craig', 'sister', 'Alicia')]},
{('Laura', 'son', 'Craig'): [('Laura', 'son', 'Robert'), ('Robert', 'brother', 'Craig')]},
{('Laura', 'son', 'Robert'): [('Laura', 'daughter', 'Ruthann'), ('Ruthann', 'brother', 'Robert')]},
{('Laura', 'daughter', 'Ruthann'): [('Laura', 'daughter', 'Penny'), ('Penny', 'sister', 'Ruthann')]},
{('Penny', 'sister', 'Ruthann'): [('Penny', 'son', 'Gary'), ('Gary', 'aunt', 'Ruthann')]},
{('Penny', 'son', 'Gary'): [('Penny', 'husband', 'Robert'), ('Robert', 'son', 'Gary')]},
{('Gary', 'aunt', 'Ruthann'): [('Gary', 'father', 'Robert'), ('Robert', 'sister', 'Ruthann')]},
{('Craig', 'sister', 'Alicia'): [('Craig', 'daughter', 'Eugenia'), ('Eugenia', 'aunt', 'Alicia')]},
{('Gary', 'father', 'Robert'): [('Gary', 'brother', 'William'), ('William', 'father', 'Robert')]}]

The spicy part here is line 5, "Penny sister Ruthann": if this were true, Penny would be both the sister and the wife of Robert.
However, it is proven with [('Penny', 'son', 'Gary'), ('Gary', 'aunt', 'Ruthann')].

It seems this arises from one of the rules in rules-store.yaml:

child:inv-un:sibling

If I understand correctly this says if A has child B and B has aunt/uncle C then A and C are siblings.
Note, however, this does not hold in general as C could be sibling of the wife/husband of A but not of A.
In our example, exactly this is the case as Ruthann is the sister of Robert but not of Penny.

Sorry if I have a misunderstanding until here, have I overlooked something?

Would it be enough to simply delete this rule from rules-store.yaml and re-generate the data?

Thanks a lot
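For reference, the reporter's counterexample can be checked mechanically. The rule encoding below is an assumption based on the reported reading of child:inv-un:sibling ("A has child B, B has aunt C, therefore A and C are siblings"), not the repo's actual implementation.

```python
# Facts from the quoted story (parent -> child, person -> aunt).
child = {("Penny", "Gary"), ("Robert", "Gary")}
aunt = {("Gary", "Ruthann")}

# Apply the (assumed) rule: for (A, B) in child and (B, C) in aunt,
# conclude sibling(A, C).
derived = {(a, c) for (a, b) in child for (b2, c) in aunt if b == b2}

print(derived)
# ('Robert', 'Ruthann') is correct, but ('Penny', 'Ruthann') is not:
# Ruthann is the sister of Penny's husband, not of Penny.
```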

Robust reasoning dataset for cycle noise (supporting facts) doesn't have edge types for the cycle's edges

Hello!

Thanks for making the CLUTRR dataset available. I have been using it to benchmark compositional reasoning in ML models. I think it is a useful benchmark and have come across multiple instances of it being used in recent papers that present models that tackle reasoning type problems in NLP.

Now, coming to the issue:

I was using the dataset from your EMNLP paper (provided here) to test out some graph models. It seems that there is no edge information for task 2.k (where the noise adds nodes that form cycles with the original chain in the story graph). For the other noise types (3.k, 4.k) it is easy to randomly sample edge types, since the noise additions are independent/terminal and don't feed back into the same logic graph. But that's not possible for 2.k tasks.

For example, for the following story:

'[Mary] and her mother [Nettie] went to the mall to try on new clothes. [Mary] has a daughter named [Jennifer] [Cecilia] took her sister, [Mary], out to dinner for her birthday. [Cecilia] bought her mother, [Nettie], a puppy for her birthday. [Ryan] bought a new dress for his daughter [Jennifer].'

whose corresponding edge representation is:

[(0, 1), (1, 2), (2, 3), (2, 4), (4, 3)]

The edge types for only the first three nodes are provided:

['daughter', 'mother', 'mother']

whereas presumably edge (2,4) should have the edge type 'sister' and (4,3) should have an edge type 'mother' for the noise node Cecilia. Looking through the robust reasoning dataset, there is no info on the edge types of noisy nodes.
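To make the gap concrete, pairing story_edges with edge_types from the example above leaves the two cycle edges unlabeled; this pairing is an assumption about how the two fields correspond, shown here with None for the missing labels.

```python
# Dataset fields from the example above: 5 edges but only 3 edge labels.
story_edges = [(0, 1), (1, 2), (2, 3), (2, 4), (4, 3)]
edge_types = ["daughter", "mother", "mother"]

# Pad the label list so unlabeled (cycle) edges show up as None.
padding = [None] * (len(story_edges) - len(edge_types))
labeled = list(zip(story_edges, edge_types + padding))
for edge, rel in labeled:
    print(edge, rel)
# (2, 4) and (4, 3) come out as None; presumably 'sister' and 'mother'.
```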

Can you please provide the corresponding datasets

  • data_7c5b0e70
  • data_06b8f2a1
  • data_523348e6
  • data_d83ecc3e

with the noisy edge types?

If not, can you please help me understand how the GAT results in Table 2 of your paper were obtained, since the graph formulation of the task requires an adjacency matrix with edge-type entries, right?

Only the first dataset seems important as far as the paper is concerned, so the rest are not super important. I believe (please correct me if I'm wrong) that the first one is used to report the GAT results in Table 2, since it is the only one with k=2,3 as reported in Section 4.2 of the paper.

Thanks!

--test_rows not always accurate

Hello,
I noticed something while generating CLUTRR data.
This is the exact command I ran: python main.py --train_tasks 4.2,4.3,4.4 --test_tasks 4.2,4.3,4.4,4.5,4.6,4.7,4.8,4.9,4.10 --train_rows 100000 --test_rows 10000 --equal --data_name 'r3-disco_l234', and these are the line counts of my csv test files:

$ wc -l data/data_r3-disco_l234_*/*_test.csv

10011 data/data_r3-disco_l234_1571563154.7491844/4.10_test.csv  ---> 10k : ok
 3100 data/data_r3-disco_l234_1571563154.7491844/4.2_test.csv   ---> much less than 10k... 
 2941 data/data_r3-disco_l234_1571563154.7491844/4.3_test.csv   ---> much less than 10k... 
 3007 data/data_r3-disco_l234_1571563154.7491844/4.4_test.csv   ---> much less than 10k... 
10025 data/data_r3-disco_l234_1571563154.7491844/4.5_test.csv  ---> 10k : ok
10014 data/data_r3-disco_l234_1571563154.7491844/4.6_test.csv  ---> 10k : ok
10038 data/data_r3-disco_l234_1571563154.7491844/4.7_test.csv  ---> 10k : ok
10044 data/data_r3-disco_l234_1571563154.7491844/4.8_test.csv  ---> 10k : ok
10023 data/data_r3-disco_l234_1571563154.7491844/4.9_test.csv  ---> 10k : ok

I trained on tasks 4.2,4.3,4.4 and it seems like when generating the test sets, it combined all these together as one task: note that 3100+2941+3007~10k

--train_tasks 1.# always generates a few stories from task 1.3

Hi,

I noticed that when running python main.py with --train_tasks 1.2,1.4,1.6 it also generated a few stories from task 1.3.

This is the detailed command I ran to reproduce the bug: python main.py --train_tasks 1.2,1.4,1.6 --test_tasks 1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10 --train_rows 100000 --test_rows 10000 --equal --template_split
which generated the following training lines:

  • 96,134 lines for relations of length 2
  • 96,326 lines for relations of length 4
  • 94,374 lines for relations of length 6
  • 7,231 lines for relations of length 3

These numbers can be found with the following command: cat data/.../1.2,1.4,1.6_train.csv | grep task_1.3 | wc -l

A similar behavior is observed with --train_tasks 1.2,1.4,1.8.

I didn't try tasks other than 1.#.

Thanks :)

AMT templates issues

This is a list of issues on the AMT templates:

Known issues:

  • Some templates are not in the correct relation/gender category.
  • Some templates have the wrong pronouns, eg: ENT_0_female with he/him/his.
  • Some templates have Worker: as prefix.
  • Some templates are not complete sentences.
template | train/test file | listed as | should be corrected to
"ENT_1_female and ENT_0_male ." | train | wife/male-female | removed?
"ENT_1_female and ENT_0_male ." | train | granddaughter/male-female | removed?
"ENT_1_male and his son ENT_0_male went to look at cars. ENT_1_male ended up buying the Mustang." | test | son/male-male | father/male-male
"ENT_1_female is a girl with a grandmother named ENT_0_female ." | train | grandmother/female-female | granddaughter/female-female
"ENT_1_female went to visit her grandmother , ENT_0_female , in the retirement home ." | train | grandmother/female-female | granddaughter/female-female
"ENT_0_female was sick. He stayed home from school and his grandmother, ENT_1_female, watched him. She made him chicken soup to feel better." | train | grandmother/female-female | wrong pronouns: replace he/his/him with she/her/her in the template
"ENT_1_female took her grandson ENT_0_female to the zoo. He loved feeding the monkeys." | train | grandmother/female-female | replace grandson/he with granddaughter/she in the template
"ENT_0_female went to visit his grandmother, ENT_1_female, at the nursing home. She was grateful for the company, she had n't had a family visit in months." | train | grandmother/female-female | wrong pronouns: replace he/his/him with she/her/her in the template
"ENT_0_female stayed with his grandmother ENT_1_female last summer on her farm. He had a great time." | train | grandmother/female-female | wrong pronouns: replace he/his/him with she/her/her in the template
"Worker: ENT_0_female looks just like her grandmother, ENT_1_female did as a child." | train | grandmother/female-female | remove "Worker:" from the template

data quality issue?

Hi, I found that data_06b8f2a1 1.3_test.csv appears to include lots of wrong annotations. For example:

{"Unnamed: 0": 2, "id": "fe81eae5-c860-417f-8272-fbea0585d016", "story": "[Kathleen] was excited because she was meeting her father, [Henry], for lunch. [Howard] and his son [Wayne] went to look at cars. [Howard] ended up buying the Mustang. [Howard] likes to spend time with his aunt, [Kathleen], who was excellent at cooking chicken.", "query": ["Wayne", "Henry"], "text_query": NaN, "target": "father", "text_target": ["[Henry] was so proud of his son, [Wayne]. he received a great scholarship to college."], "clean_story": "[Howard] and his son [Wayne] went to look at cars. [Howard] ended up buying the Mustang. [Howard] likes to spend time with his aunt, [Kathleen], who was excellent at cooking chicken. [Kathleen] was excited because she was meeting her father, [Henry], for lunch.", "proof_state": [{"('Wayne', 'father', 'Henry')": [["Wayne", "sister", "Kathleen"], ["Kathleen", "father", "Henry"]]}, {"('Wayne', 'sister', 'Kathleen')": [["Wayne", "son", "Howard"], ["Howard", "aunt", "Kathleen"]]}], "f_comb": "son-aunt-father", "task_name": "task_1.3", "story_edges": [[0, 1], [1, 2], [2, 3]], "edge_types": ["son", "aunt", "father"], "query_edge": [0, 3], "genders": "Wayne:male,Howard:male,Kathleen:female,Henry:male", "syn_story": NaN, "node_mapping": {"16": 0, "17": 1, "3": 2, "0": 3}, "task_split": "test"}
Henry is definitely NOT Wayne's father?!
In general, many mother/father-in-law relations were incorrectly annotated as mother/father.
Did I miss anything here?

[META] Revamp - move graph generation logic to GLC

Confession time! I have not been able to keep up with the issues in this repository owing to my commitments to other projects, covid isolation, among other things. However, I don't want to bore you with excuses anymore! In the last cycle, we developed GraphLog, which follows the same graph generation pipeline and is in use by our lab for several projects. Drawing insights from GraphLog's generation and the follow-up works, I have been able to make the core graph generation logic faster, provable, and reproducible.

I have released the core logic in a separate repo, GLC, to continue its development separately. In the coming week, I plan to integrate GLC with CLUTRR, which would hopefully resolve several issues I have received both through Github and mail about the slow and unreliable generation pipeline. Thank you for your patience and please let me know any features you want through the issues!

Task 6 -- Family relations - Memory task: retrieve the relations already defined in the text

Hi, I think there is something wrong with generating datasets for "Task 6 -- Family relations - Memory task: retrieve the relations already defined in the text" -- If I do e.g.:

$ PYTHONPATH=. python3 main.py --train_tasks 6.2,6.3 --test_tasks 6.2,6.3

I get instances such as this one:

$ tail -n 1 ~/workspace/clutrr/data/data_08aa323e/6.2_test.csv
84,f40ac862-0ba6-4f70-8c3a-9e060edd8bed,[Calvin] is [Henry]'s grandfather.  [Travis] has a son called [Henry]. ,"('Travis', 'Calvin')",Who is [Calvin] from the point of relation of [Travis] ? ,father,['[Calvin] is the father of [Travis]. '],[Travis] has a son called [Henry].  [Calvin] is [Henry]'s grandfather. ,"[{('Travis', 'father', 'Calvin'): [('Travis', 'son', 'Henry'), ('Henry', 'grandfather', 'Calvin')]}]",son-grandfather,task_6.2,"[(0, 1), (1, 2)]","['son', 'grandfather']","(0, 2)","Travis:male,Henry:male,Calvin:male",,"{2: 0, 10: 1, 0: 2}",test

which does not reduce to retrieving relations already defined in the text, but requires some reasoning.

Am I doing anything wrong here?

Releasing v1.3 data

Hi authors, thanks for creating this great dataset!
Would it be possible to share the "GPT3 cleaned data: CLUTRR v1.3" as mentioned in this blog post?
This will save a lot of time for users to generate the data themselves and enable fair comparison of different methods on the same data. Thanks!

Issues in the AMT templates and how to mitigate them

Dear authors, @koustuvsinha @pminervini @shagunsodhani

Thanks for the great work!

I downloaded the dataset from the provided link https://drive.google.com/file/d/1SEq_e1IVCDDzsBIBhoUQ5pOVH5kxRoZF/view and found a few mistakes in the test dataset. Below are 4 mistakes that I found from the first 10 data instances in file data_06b8f2a1/1.3_test.csv in the dataset. It seems to me that a big portion of the data may not be correct.

Index 2. Story: [Kathleen] was excited because she was meeting her father, [Henry], for lunch. [Howard] and his son [Wayne] went to look at cars. [Howard] ended up buying the Mustang. [Howard] likes to spend time with his aunt, [Kathleen], who was excellent at cooking chicken. Query: ('Wayne', 'Henry'). Target: father. Comment: the target should be greatgrandfather.
Index 5. Story: [Johanna] spent a great day shopping with her daughter, [Vickie]. [Vickie] wanted to visit her grandmother [Donna], but [Donna] was asleep. [Johanna] and [Philip] left that evening to go bowling. Query: ('Philip', 'Donna'). Target: mother. Comment: we cannot tell any relationship for Philip.
Index 6. Story: [Johanna] enjoyed a homemade dinner with her son [Cedric] [Wayne] and his son, [Cedric], went over to [Donna]'s house for the holidays. [Wayne] loved seeing his mother, but [Cedric] was less enthusiastic. Query: ('Johanna', 'Donna'). Target: mother. Comment: the target should be mother_in_law.
Index 9. Story: [Devin] and his Aunt [Kathleen] flew first class [Devin] has a few children, [Philip], Bradley and Claire [Kathleen] vowed to never trust her father, [Henry] with her debit card again. Query: ('Philip', 'Henry'). Target: father. Comment: the target should be greatgrandfather.

Since other users already submitted issues to report errors in the dataset a year ago, is there any update to the dataset (e.g., a cleaner version with fewer mistakes)?
Thanks a lot!
