This library contains the generator and datasets for the KANDY benchmark.
Datasets are available in the release section.
Baseline experiments described in the paper are available in the baseline
subfolder.
Humans can learn effortlessly new concepts from limited examples, on the other hand neural networks struggle when presented with limited examples and when they are required to compose previously acquired knowledge into new concepts. We argue that one of the reasons of these limitations is the radically different learning regimes to which humans and neural networks are subject to.
Human education is heavily reliant on teachers "scheduling" concept difficulties in a way which elicits correct chains of thoughts and associations, and capable of presenting "old concepts" whenever they are deemed useful for the current task. Moreover, as the student gets more confident with its answers, a teacher gradually reduces supervision, promoting a more independent decision making process.
This code presents samples from a sequence of tasks in a non i.i.d. fashion, by emulating a teacher's behavior on a sequence of binary classification tasks. Each task is identified by two sets of hierarchical symbolic rules defined on a simple geometric domain and the student is provided with an image, with the objective of determining whether the input belongs to the positive or negative set. The teacher presents samples from each task sequentially and, inside each task, it can decide to provide samples either with or without supervision, according to a decayed schedule.
The schedule is governed by an exponential distribution providing maximum supervision at the beginning of each task and minimum at the end. This schedule emulates, for example, the behavior of a math teacher presenting topics from counting to differential equations, sorted in the way eliciting the best learning response.
The code provides a way to define curricula in YAML format and a generator which can be queried both for traditional datasets (for i.i.d. training or testing) and tutored samples generated one at the time. Optionally, noise can be injected in task identifiers to verify whether the learner relies onto them or not.
We release two benchmarks in the form of curricula of tasks,
that showcase the features available to the end-user. We name the two curricula Easy
and Hard, due to the different complexity of the involved tasks. In both curricula, each
sample is a {shape: SH, color: CO, size: SZ}
, and they represent
atomic objects. No list expansion operator appears in symbolic annotations. Atomic
objects can be small or large (
Each task image contains base concepts (configurable, in the KANDY benchmark they are: triangles, squares and circles, in 6 different colors and 2 different sizes), which can be composed spatially by the following recursive operators:
in
: subconcepts are drawn one inside the otherquadrant_ul
: subconcepts are drawn one inside the other in the upper left quadrantquadrant_ur
: subconcepts are drawn one inside the other in the upper right quadrantquadrant_ll
: subconcepts are drawn one inside the other in the lower left quadrantquadrant_lr
: subconcepts are drawn one inside the other in the lower right quadrantrandom
: subconcepts are drawn at random positionsstack
: subconcepts are drawn top-to-bottom, the horizontal dimension of the bounding box is the same as the parentstack_reduce_bb
: subconcepts are drawn top-to-bottom, the horizontal dimension of the bounding box is reduced to match the vertical oneside_by_side
: subconcepts are drawn left-to-right, the vertical dimension of the bounding box is the same as the parentside_by_side_reduce_bb
: subconcepts are drawn left-to-right, the vertical dimension of the bounding box is reduced to match the horizontal onediag_ul_lr
: subconcepts are drawn diagonally, from the upper left corner to the lower rightdiag_ll_ur
: subconcepts are drawn diagonally, from the lower left corner to the upper rightgrid
: subconcepts are arranged in a square grid, left-to-right, top-to-bottom
Spatial compositions can be arbitrarily complex, however, as the available space for each level decreases exponentially, there is a practical limit after which images get overcrowded. This can be counteracted only in part by using larger canvas sizes.
Atomic concepts can be specified by their properties, e.g.: {"shape": triangle, "color": red, size: small}
The following shorthand notations are available to avoid specifying combinations of properties exhaustively:
-
~
: any value, e.g."size": ~
is either of small or large -
not_X
: anything except X, e.g.shape: not_square
is either a triangle or a circle -
A|B|...|K
: disjunction, e.g.color: red|green|blue
is any of red, green or blue. -
Set operators
union(L), intersection(L), difference(L), symmetric_difference(L)
: they are replaced by the associative unfolding of the list with respect to the operator (the list must contain only atomic objects). They can be used to constrain grounding (especially when used in conjunction withstore_before
andrecall
operators).
Top level concepts are implicitly disjunctive, allowing for complex tasks such as discriminating between "three red circles arranged diagonally, or three blue squares arranged vertically" (positive set) versus "three blue squares arranged diagonally, or three red circles arranged vertically" (negative set).
A positive/negative set is a list of top level objects. As syntactic sugar, to simplify specification of tasks, the following list manipulation operators are defined:
sample(n, L)
: randomly sample n elements from list L with repetitionpick(n, L)
: randomly sample n elements from list L without repetitionfirst(n, L)
: return first n elements from list Llast(n, L)
: return last n elements from list Lshift(n, L)
: barrel shift list L by n to the right (left if n < 0), eg.shift(2, [triangle, square, circle, circle, square]) -> [circle, square, triangle, square, circle]
repeat(n, L)
: repeat list L n timesrandom_repeat(min, max, L)
: select a random integer between min and max (included) and repeat list L that amount of timespermute(L)
: return a random permutation of list Lmirror(L)
: append a mirrored list to L, eg.mirror([triangle, square, circle]) -> [triangle, square, circle, circle, square, triangle]
palindrome(L)
: append a mirrored list to L excluding the last element which acts as pivot, eg.palindrome([triangle, square, circle]) -> [triangle, square, circle, square, triangle]
sort(order, keys, L)
: sort list L inasc
ending/desc
ending order, according to keysargsort(idx, L)
: manually sorts L according to indices (in 0..len(L) - 1)store(alias, obj)
: memorizes obj (any between a list, an atomic or compound object) for later retrievalrecall(alias)
: it is replaced by the previously memorized object associated with that aliasany_composition(L)
: it is replaced by a random compositional operatorany_displacement(L)
: it is replaced by a random non-overlapping compositional operator (stack, side_by_side, diag_ul_lr, diag_ll_ur, grid
)any_line(L)
: it is replaced by a random line compositional operator (stack, side_by_side, diag_ul_lr, diag_ll_ur
)any_quadrant(L)
: it is replaced by a random quadrant compositional operator (quadrant_ul, quadrant_ur, quadrant_ll, quadrant_lr
)quadrant_or_center(L)
: it is replaced by a random in-like compositional operator (in, quadrant_ul, quadrant_ur, quadrant_ll, quadrant_lr
)
Sorting keys are:
n
: element cardinality (1 for atomic elements, len(L) for compound elements)shape
: number of sides circle (0) < triangle (3) < square (4)color
: color, in the order defined in config.ymlsize
: size, in the order defined in config.yml, which may not reflect pixel size If the list contains one or more compound objects, the only valid sorting key isn
, otherwise keys are evaluated in order (e.g.,[color, size]
first sorts by color and within objects of the same color it sorts by size).
These operators are always applied AFTER grounding atomic objects, therefore repeat(n: 3, list: [{shape: not_square, color: red, size: ~])
will be resolved in the following way:
- Randomly ground attributes in atomic shapes, e.g.
{shape: circle, color: red, size: small}
- Apply the expansion operators:
[{shape: circle, color: red, size: small}, {shape: circle, color: red, size: small}, {shape: circle, color: red, size: small}]
. In this example, the repetition produces three identical objects.
In case this behavior is not desired, the following variants of list operators perform expansion BEFORE grounding:
sample_before(n, L)
pick_before(n, L)
first_before(n, L)
last_before(n, L)
shift_before(n, L)
repeat_before(n, L)
random_repeat_before(min, max, L)
permute_before(L)
mirror_before(L)
palindrome_before(L)
store_before(alias, L)
. Other operators are either equivariant with respect to grounding or are semantically ill-defined (e.g. sorting a list before knowing how it will be grounded) and they do not possess a_before
variant.
An example which uses both before- and after-grounding list expansions is the following:
side_by_side:
- palindrome:
- {shape: ~, color: ~, size: ~}
- stack:
- repeat_before:
n: 3
list:
- {shape: ~, color: ~, size: ~}
This structure is resolved in the following way:
- Apply the before-grounding expansions:
side_by_side:
- palindrome:
- {shape: ~, color: ~, size: ~}
- stack:
- {shape: ~, color: ~, size: ~}
- {shape: ~, color: ~, size: ~}
- {shape: ~, color: ~, size: ~}
- Perform random grounding:
side_by_side:
- palindrome:
- {shape: triangle, color: red, size: large}
- stack:
- {shape: square, color: yellow, size: small}
- {shape: triangle, color: green, size: large}
- {shape: circle, color: cyan, size: large}
- Apply the after-grounding expansions:
side_by_side:
- {shape: triangle, color: red, size: large}
- stack:
- {shape: square, color: yellow, size: small}
- {shape: triangle, color: green, size: large}
- {shape: circle, color: cyan, size: large}
- {shape: triangle, color: red, size: large}
Different variants of the same structure, like:
side_by_side:
- palindrome_before:
- {shape: ~, color: ~, size: ~}
- stack:
- repeat:
n: 3
list:
- {shape: ~, color: ~, size: ~}
would have yielded:
side_by_side:
- {shape: triangle, color: red, size: large}
- stack:
- {shape: square, color: yellow, size: small}
- {shape: square, color: yellow, size: small}
- {shape: square, color: yellow, size: small}
- {shape: square, color: cyan, size: large}
Caveat: Set operators are always applied BEFORE grounding. If they contain recall
expansions, these are evaluated before grounding as well, this means they can ONLY recall aliases defined by a store_before
operator.
Recall expansions outside set operators do not possess this limitation.
List manipulation operators and compositional operators can be arbitrarily composed in the YAML specification file. Additionally, YAML aliases (& *
) and overrides (<<
) are supported for shorthand definitions.
To maximize experimental robustness, three mechanisms are enforced:
- A random seed can be specified in the configuration file for the teacher. Each task will have a unique random generator with a seed computed from the teacher's and the task id.
- Rejection sampling is used to produce
samples
unique samples for each task. A patience hyperparameter determines the number of failed trials before giving up. If patience is not exhausted, samples are guaranteed to be unique and will be presented without repetition, otherwise, sampling with repetition will be performed. - Optionally, a (set of) Prolog rule(s) can be defined for each task in the form
valid(X) :- ...
, an interpreter will load facts and rules from a background knowledge file (specified inconfig.yml
) and check every generated sample. If a positive sample does not satisfy the rule(s), or a negative one satisfies it, the sample is rejected. Multiple task rules can be specified using YAML multiline strings.
The following is an example curriculum composed of two tasks:
-
name: my_task
gamma: 0.9 # maximum supervision
beta: 0.5 # minimum supervision
samples: 5000 # number of samples
train_split: 0.8 # training set split ratio. The training set has samples * train_split elements
val_split: 0.1 # validation set split ratio. The test set is computed as (1.0 - train_split - val_split)
noisy_color: True # Inject random noise to colors (while preserving semantic classes)?
noisy_size: False # Inject random noise to sizes?
positive_set: # Definition of the positive set for the task. It contains a list of possible samples.
- stack: # Compositional operator: receives a list of subelements.
- palindrome:
- {shape: triangle|circle, color: ~, size: not_small} # Atomic object. ~/null indicates any value. a|b indicates a choice. not_a indicates a negation.
- {shape: ~, color: red, size: large}
#- {shape: square, color: ~, size: ~}
- side_by_side:
- sort:
order: asc
keys: [shape, color]
list:
- {shape: ~, color: red, size: small}
- {shape: ~, color: green, size: small}
- {shape: ~, color: blue, size: small}
- {shape: circle, color: blue, size: large}
- {shape: square, color: magenta, size: small}
negative_set:
- stack:
- &my_green_object {shape: ~, color: green, size: ~} # Definition of an alias for an object used multiple times.
- &my_red_object {shape: ~, color: red, size: ~}
- &my_triangle {shape: triangle, color: ~, size: ~}
- &my_small {shape: ~, color: ~, size: small}
- grid:
- random_repeat:
min: 2
max: 5
list:
- sample:
n: 3
list:
- *my_green_object
- *my_red_object
- mirror:
- *my_green_object
- {shape: square, color: not_magenta, size: large}
- *my_red_object
- *my_triangle
- *my_small
-
name: yet another task # spaces are fully supported
gamma: 0.0 # This task is fully unsupervised
beta: 0.0
samples: 3
train_split: 0.5
val_split: 0.2
noisy_color: False
noisy_size: False
# This task has two (disjunctive) prolog rules for rejection sampling (and a predicate they depend on). It is assumed that predicates color/2 and contains/2 are defined in a .pl file linked in config.yml, otherwise the entire knowledge base could be specified here...
positive_rule: >-
is_green(X) :- color(X, green).
valid(X) :- is_green(X).
valid(X) :- contains(X, Y), is_green(Y).
positive_set:
- *my_green_object
negative_set:
- *my_red_object
All the tasks are assumed to be binary decision tasks determining whether an image belongs to a positive or negative set. Important: if there are enough unique samples, distributions are balanced (0.5 positives, 0.5 negatives), otherwise, the ratio between positive and negative samples will reflect the internal distribution of symbols (i.e., a task specified with positive set consisting of 7 elements and a negative set consisting of 3 elements will approach a 0.7/0.3 positives-to-negatives ratio).
Task description | Positive example | Negative example |
---|---|---|
The image contains a triangle | ||
Objects are arranged vertically | ||
Shapes are all the same | ||
Shapes form a palindrome |
yaml
logging
numpy
colorsys
pillow
pyswip >= 0.2.11
Note: The datasets are self-contained, and they require no library (see below).
- Write a YAML configuration (see the example
config.yml
). - Write a YAML curriculum (see
easy.yml
andhard.yml
). - Create a teacher:
cg = CurriculumGenerator("config.yml", "curriculum.yml", logger=logger) # Logger is an optional instance of logging for debug purposes.
- Use the teacher:
- As task-incremental supervised dataset:
cg.reset() for i in range(len(cg.tasks)): for j in range(num_batches): for batch in cg.get_batch(i, batch_size, split): # batch is a list of tuples (sample_img, label, task_id, symbol) # label is 1 for positive and 0 for negative samples # split is "train"/"val"/"test". use(batch)
- As task-incremental stream with decayed supervisions:
cg.reset() for i in range(len(cg.tasks)): for j in range(num_batches): for sample in cg.get_stream(i, split): (sample_img, label, supervised, task_id, symbol) = sample # label is 1 for positive, 0 for negative samples. # supervised is 1 for supervised samples, 0 for unsupervised ones. # split is "train"/"val"/"test". masked_label = label if supervised else None use(sample_img, masked_label, task_id, symbol)
- As tutor generator:
for batch in cg.generate_curriculum(split, task_id_noise, batch_size): # task_id_noise is the probability of corrupting task_id labels # batch is a list of (sample_img, label, supervised, task_id, symbol) samples use(batch)
- As i.i.d. shuffled tutor generator:
Note: Discardfor batch in cg.generate_shuffled_curriculum(split, task_id_noise, batch_size): # task_id_noise is the probability of corrupting task_id labels # batch is a list of (sample_img, label, task_id, symbol) samples # Compared to generate_curriculum, task_ids are not sequential. use(batch)
task_id
for continual learning applications where it is unrealistic to know task progression. - As task-incremental supervised dataset:
The ready-to-use datasets described in the KANDY paper are available for download in the releases. They do not require this library, as they consist of .png
images with annotations in separate .csv
files.
Annotations include:
filename
: the path (relative to the csv file) of the imagetask_id
: id of the task within the curriculum (noise injection is disabled for released datasets)label
: 1 if the sample belongs to the positive set, 0 otherwisesupervised
: 1 if the label should be available to the learner, 0 if the label should only be used for evaluation of unsupervised learningsymbol
: a string describing the underlying structure of the sample. It should NOT be provided as supervision to the learner.
Additionally for reproducibility, dataset specifications (files specs.txt
) and the config/curriculum files (.yml
and .pl
) are provided as well.
- Load annotation file from
samples/sets/{split}/annotations.csv
- Select task of interest by id (or perform multi-task learning)
- Compute a "global" loss function by multiplexing a supervised loss (when
supervised == 1
) and an unsupervised loss (whensupervised == 0
). A straight-forward differentiable solution could be$supervised \cdot supervised_loss(sample, label) + (1 - supervised) \cdot unsupervised_loss(sample)$ , but unless batch size is 1, more sophisticated approaches are required. - Evaluate predictions using the true label also when
supervised
was 0 - Exploit
symbol
for method-specific explanations (e.g., attention masks matching the hierarchy?)
-
- If you want to observe tasks in order, load annotation file from
samples/sets/{split}/annotations.csv
- Otherwise, load annotation file from
samples/shuffled_curriculum/{split}/annotations.csv
- If you want to observe tasks in order, load annotation file from
- Compute a "global" loss function by multiplexing a supervised loss (when
supervised == 1
) and an unsupervised loss (whensupervised == 0
). A straight-forward differentiable solution could be$supervised \cdot supervised_loss(sample, label) + (1 - supervised) \cdot unsupervised_loss(sample)$ , but unless batch size is 1, more sophisticated approaches are required. - Apply a catastrophic forgetting mitigation strategy (you are allowed to memorize samples in a replay buffer, but not allowed to go over the stream multiple times)
- Evaluate predictions using the true label also when
supervised
was 0 - Exploit
symbol
for method-specific explanations (e.g., attention masks matching the hierarchy?)
Note: Samples in the shuffled curriculum folder are contained in a pseudo-task folder 0
.
- Load annotation file from
samples/sets/{split}/annotations.csv
. Images are not required. - Write your own encoding procedure of
symbol
strings - Write your own background knowledge (
bg.pl
in the main repository contains a minimal set of predicates which can be adapted to a different encoding), avoiding "cheat predicates" (i.e., do not copy the rules used at rejection in the.yml
file) - If your ILP framework does not support predicate invention, or if it is too expensive, feel free to add helper predicates to your background knowledge
- Manually evaluate induced predicates against rules in the
.yml
file (multiple correct predicates are possible and different encodings can produce radically different results).
Given the following symbol:
stack([
diag_ul_lr([
{"shape": "triangle", "color": "red", "size": "small"},
{"shape": "square", "color": "red", "size": "small"},
{"shape": "triangle", "color": "red", "size": "small"}
],
{"shape": "circle", "color": "blue", "size": "large}
])
There can be multiple encodings, some examples include:
- "natural" encoding: the complex hierarchy is a compound term and atomic shapes are atoms. Each operator is a functor applied to a list. The symbol itself can be used as ILP example.
positive_example(stack([diag_ul_lr([tr_red_sm, sq_red_sm, tr_red_sm]), ci_blu_lg])).
- "pointer" encoding: each level of the hierarchy is an atomic term, a predicate is used to link compositions in the background knowledge. The ILP example needs a separate specification.
% In the background knowledge: defined_as(comp_0001, diag_ul_lr, [tr_red_sm, sq_red_sm, tr_red_sm]). defined_as(comp_0002, stack, [comp_0001, ci_blu_lg]). % In the examples file: positive_example(comp_0002).
- "cartesian" encoding: each object is encoded by its shape and its position (e.g.,
is(obj0001, triangle, red, small), at_xy(obj0001, 30, 42
). We expect induction with this encoding to be extremely difficult as no compositional bias can be enforced at the representation level.
How to use the benchmark under neuro-symbolic frameworks heavily depends on their constraints and capabilities. The main goals of this benchmark concerning neuro-symbolic settings are:
- Evaluate representation learning of arbitrary (recursive) structures
- Evaluate reasoning in compositional domains
- Encourage continual/curriculum techniques in the community, for the investigation of the stability-plasticity trade-off also in reasoning settings
- Encourage investigation of the effects of label scarcity in complex domains.
In general, avoid supervision at the structure level (i.e., the symbol
annotation CANNOT be used as supervision on the neural component), you should aim for an end-to-end supervision (only label
annotation during training).
If unsupervised learning is not available for your framework, either use a supervised == 0
.
Ideally, let the framework discover compositions with its neural component (i.e., avoid manually defining a feature extractor for vertically stacked objects, one for horizontally placed, etc.), as representation learning of arbitrarily complex hierarchies is a core feature of the benchmark.
We do not provide pretraining datasets, you are free to define them, as long as they still allow for fair evaluation of your method (i.e., they can be used to reduce training time, or the search space in reasoning, but they cannot be used to allow some form of memorization of tasks in the dataset).
You are encouraged to define additional tasks and/or modify the generator, for example to assess generalization of reasoning/representations to new tasks, higher-order capabilities, reasoning under noisy perception, etc.