Synthetic datasets for concept-based learning experiments.
This class provides several toy datasets with known ground-truth concept
relationships and causal structures. Each dataset includes input features,
binary concepts, tasks, and a directed acyclic graph (DAG) representing
concept-to-task relationships.
xor: Simple XOR dataset with 2 input features, 2 concepts (C1, C2), and
1 task (xor). The task is the XOR of the two concepts.
trigonometry: Dataset with 7 trigonometric input features derived from
3 hidden variables, 3 concepts (C1, C2, C3) representing the signs of the
hidden variables, and 1 task (sumGreaterThan1).
dot: Dataset with 4 input features, 2 concepts based on dot products
(dotV1V2GreaterThan0, dotV3V4GreaterThan0), and 1 task (dotV1V3GreaterThan0).
checkmark: Dataset with 4 input features and 4 concepts (A, B, C, D),
where C = NOT B and D = A AND C, demonstrating causal relationships.
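The logical relationships listed above can be illustrated with a short sketch. It is not the library's own generator: the helper names `xor_task` and `checkmark_concepts`, and the way A and B are sampled here, are assumptions made purely for illustration.

```python
import numpy as np

def xor_task(c1, c2):
    """Task label for the xor dataset: XOR of two binary concepts."""
    return np.logical_xor(c1, c2).astype(int)

def checkmark_concepts(a, b):
    """Derive C and D from binary A and B: C = NOT B, D = A AND C."""
    c = 1 - b      # NOT B
    d = a * c      # A AND C (product of 0/1 values)
    return c, d

# Assumption: A and B are sampled as independent fair coin flips.
rng = np.random.default_rng(42)
a = rng.integers(0, 2, size=8)
b = rng.integers(0, 2, size=8)
c, d = checkmark_concepts(a, b)
y = xor_task(a, b)
```

Because D depends on C, which in turn depends on B, the checkmark concepts form a small chain in the DAG rather than a set of independent labels.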
Parameters:
dataset (str) – Name of the toy dataset to load. Must be one of: ‘xor’, ‘trigonometry’,
‘dot’, or ‘checkmark’.
root (str, optional) – Root directory to store/load the dataset files. If None, defaults to
‘./data/toy_datasets/{dataset_name}’. Default: None
seed (int, optional) – Random seed for reproducible data generation. Default: 42
n_gen (int, optional) – Number of samples to generate. Default: 10000
concept_subset – Subset of concept names to use. If provided, only the specified concepts
will be included in the dataset. Default: None (use all concepts)
Synthetic dataset for concept bottleneck completeness experiments.
This dataset generates synthetic data to study complete vs. incomplete concept
bottlenecks. Data is generated using randomly initialized multi-layer perceptrons
with ReLU activations. Input features are sampled from a multivariate normal
distribution, and concepts are derived through nonlinear transformations.
Hidden concepts can be included to simulate incomplete bottlenecks.
The dataset uses a two-stage generation process:
1. Map inputs X to concepts C (both observed and hidden) via nonlinear function g
2. Map concepts C to tasks Y via nonlinear function f
Parameters:
name (str) – Name identifier for the dataset (used for file storage).
root (str, optional) – Root directory to store/load the dataset files. If None, defaults to
‘./data/completeness_datasets/{name}’. Default: None
seed (int, optional) – Random seed for reproducible data generation. Default: 42
n_gen (int, optional) – Number of samples to generate. Default: 10000
p (int, optional) – Dimensionality of each view (feature group). Default: 2
n_views (int, optional) – Number of views/feature groups. Total input features = p * n_views.
Default: 10
n_concepts (int, optional) – Number of observable concepts (not including hidden concepts). Default: 2
n_hidden_concepts (int, optional) – Number of hidden concepts not observable in the bottleneck. Use this to
simulate incomplete concept bottlenecks. Default: 0
n_tasks (int, optional) – Number of downstream tasks to predict. Default: 1
concept_subset (list of str, optional) – Subset of concept names to use. If provided, only the specified concepts
will be included. Concept names follow format ‘C0’, ‘C1’, etc. Default: None
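The two-stage generation process described above could be sketched roughly as follows. This is an illustrative approximation, not the library's implementation. The helper names `random_mlp` and `generate`, the hidden width `hidden_dim`, the identity covariance, and the median-based binarisation of concepts and tasks are all assumptions; only the parameter names and defaults are taken from the list above.

```python
import numpy as np

def random_mlp(rng, in_dim, hidden_dim, out_dim):
    """Return a randomly initialised one-hidden-layer ReLU network as a callable."""
    w1 = rng.normal(size=(in_dim, hidden_dim))
    w2 = rng.normal(size=(hidden_dim, out_dim))

    def forward(x):
        return np.maximum(x @ w1, 0.0) @ w2  # ReLU hidden layer, linear output

    return forward

def generate(seed=42, n_gen=10000, p=2, n_views=10, n_concepts=2,
             n_hidden_concepts=0, n_tasks=1, hidden_dim=16):
    rng = np.random.default_rng(seed)
    n_features = p * n_views
    # Inputs X from a multivariate normal (assumption: zero mean, identity covariance).
    x = rng.multivariate_normal(np.zeros(n_features), np.eye(n_features), size=n_gen)

    # Stage 1: X -> all concepts (observed + hidden) via a nonlinear function g.
    n_all = n_concepts + n_hidden_concepts
    g = random_mlp(rng, n_features, hidden_dim, n_all)
    c_logits = g(x)
    c_all = (c_logits > np.median(c_logits, axis=0)).astype(float)  # assumed binarisation

    # Stage 2: all concepts -> tasks Y via a nonlinear function f.
    f = random_mlp(rng, n_all, hidden_dim, n_tasks)
    y_logits = f(c_all)
    y = (y_logits > np.median(y_logits, axis=0)).astype(float)      # assumed binarisation

    # Only the first n_concepts columns are exposed; the rest stay hidden.
    return x, c_all[:, :n_concepts], y

x, c, y = generate(n_gen=100, n_hidden_concepts=3)
```

In this sketch the tasks depend on every concept, while only the observed ones are returned, so setting `n_hidden_concepts` above zero yields a bottleneck that is incomplete by construction.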
The color MNIST dataset is a modified version of the MNIST dataset where
each digit is colored either red or green. The concept labels are the digit
and the color of the digit. The task is to predict whether the digit is
even or odd.
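A minimal sketch of how such a sample might be assembled is shown below. It uses a random array as a stand-in for an MNIST image, and the helper names `colorize` and `make_sample`, the channel-first layout, and the dictionary form of the concepts are illustrative assumptions rather than the library's actual representation.

```python
import numpy as np

def colorize(gray, color):
    """Place a (H, W) grayscale digit into the red or green channel of a (3, H, W) image."""
    rgb = np.zeros((3,) + gray.shape, dtype=gray.dtype)
    channel = {"red": 0, "green": 1}[color]
    rgb[channel] = gray
    return rgb

def make_sample(gray, digit, color):
    """Return the coloured image, its concept labels, and the even/odd task label."""
    concepts = {"digit": digit, "color": color}
    task = digit % 2  # 0 = even, 1 = odd
    return colorize(gray, color), concepts, task

rng = np.random.default_rng(0)
fake_digit = rng.random((28, 28))  # stand-in for a real MNIST image
image, concepts, task = make_sample(fake_digit, digit=7, color="green")
```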
The MNIST addition dataset is a modified version of the MNIST dataset
where each image is a concatenation of two MNIST images and the target
label is the sum of the two digits. The concept label is a one-hot
encoding of the two digits.
The partial MNIST addition dataset is a modified version of the MNIST
addition dataset where the concept annotation is partial. The concept
associated with the second digit is not provided.
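The addition construction and its partial variant can be sketched as follows, using random arrays in place of real MNIST images. The helper names, the side-by-side concatenation axis, the 20-dimensional concept layout, and the choice to drop (rather than mask) the second digit's concept in the partial case are assumptions for illustration.

```python
import numpy as np

def one_hot(digit, n_classes=10):
    """One-hot encode a single digit."""
    v = np.zeros(n_classes, dtype=np.float32)
    v[digit] = 1.0
    return v

def make_addition_sample(img_a, digit_a, img_b, digit_b, partial=False):
    """Build one sample: concatenated image, concept vector, and sum label."""
    image = np.concatenate([img_a, img_b], axis=1)  # assumption: images placed side by side
    concepts = one_hot(digit_a)
    if not partial:
        # Full annotation: one-hot of both digits, concatenated.
        concepts = np.concatenate([concepts, one_hot(digit_b)])
    label = digit_a + digit_b
    return image, concepts, label

rng = np.random.default_rng(0)
img_a, img_b = rng.random((28, 28)), rng.random((28, 28))  # stand-ins for MNIST images
image, concepts, label = make_addition_sample(img_a, 3, img_b, 5)
_, partial_concepts, _ = make_addition_sample(img_a, 3, img_b, 5, partial=True)
```

In the partial case the sum still depends on both digits, so a model that sees only the first digit's concept cannot recover the label from the concepts alone.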
The MNIST even-odd dataset is a modified version of the MNIST dataset where
the task is to predict whether the digit is even or odd. The concept label
is a one-hot encoding of the digit.
CelebA is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations.
This class wraps torchvision’s CelebA dataset to work with the ConceptDataset framework.
The dataset can be downloaded from the official website: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html.
Parameters:
root – Root directory where the dataset is stored or will be downloaded.
split – The split of the dataset to use (‘train’, ‘valid’, or ‘test’). Default is ‘train’.
transform – The transformations to apply to the images. Default is None.
download – Whether to download the dataset if it does not exist. Default is False.
task_label – The attribute(s) to use for the task. Default is ‘Attractive’.
concept_subset – Optional subset of concept labels to use.
label_descriptions – Optional dict mapping concept names to descriptions.
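To illustrate how `task_label` and `concept_subset` might partition the attribute annotations, the sketch below splits a small attribute matrix into concepts and tasks. It does not call torchvision or the wrapper class. The function `split_attributes`, the five-attribute toy matrix, and the rule that task attributes are excluded from the concepts are assumptions, although the attribute names used are real CelebA attributes.

```python
import numpy as np

def split_attributes(attrs, attr_names, task_label="Attractive", concept_subset=None):
    """Split an (N, n_attributes) matrix into concept columns and task columns."""
    task_names = [task_label] if isinstance(task_label, str) else list(task_label)
    # Assumption: attributes used as tasks are not also exposed as concepts.
    concept_names = [n for n in attr_names if n not in task_names]
    if concept_subset is not None:
        concept_names = [n for n in concept_names if n in concept_subset]
    task_idx = [attr_names.index(n) for n in task_names]
    concept_idx = [attr_names.index(n) for n in concept_names]
    return attrs[:, concept_idx], attrs[:, task_idx], concept_names

# Toy stand-in for the 40-attribute annotation matrix (two images, five attributes).
attr_names = ["Attractive", "Eyeglasses", "Male", "Smiling", "Young"]
attrs = np.array([[1, 0, 1, 1, 0],
                  [0, 1, 0, 0, 1]])

concepts, task, names = split_attributes(attrs, attr_names)
sub_concepts, _, sub_names = split_attributes(
    attrs, attr_names, concept_subset=["Smiling", "Young"]
)
```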