torch_concepts.data.CompletenessDataset

class CompletenessDataset(name: str, root: str | None = None, seed: int = 42, n_gen: int = 10000, p: int = 2, n_views: int = 10, n_concepts: int = 2, n_hidden_concepts: int = 0, n_tasks: int = 1, concept_subset: list | None = None)

Synthetic dataset for concept bottleneck completeness experiments.

This dataset generates synthetic data to study complete vs. incomplete concept bottlenecks. Data is generated using randomly initialized multi-layer perceptrons with ReLU activations. Input features are sampled from a multivariate normal distribution, and concepts are derived through nonlinear transformations. Hidden concepts can be included to simulate incomplete bottlenecks.

The dataset uses a two-stage generation process:

1. Map inputs X to concepts C (both observed and hidden) via a nonlinear function g.
2. Map concepts C to tasks Y via a nonlinear function f.
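The two stages can be illustrated with a minimal, dependency-free sketch. It is not the library's implementation: the helper names (`random_relu_mlp`, `generate`), the single hidden layer of width 16, the identity-covariance normal inputs, and the thresholding at zero to obtain binary labels are all assumptions made for illustration.

```python
import random


def random_relu_mlp(rng, n_in, n_hidden, n_out):
    """Build a randomly initialised one-hidden-layer ReLU network (hypothetical helper)."""
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [[rng.gauss(0.0, 1.0) for _ in range(n_hidden)] for _ in range(n_out)]

    def forward(x):
        # Hidden layer with ReLU, followed by a linear output layer.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return forward


def generate(n_gen=100, p=2, n_views=10, n_concepts=2,
             n_hidden_concepts=1, n_tasks=1, seed=42):
    """Sketch of the two-stage process: X -> C via g, then C -> Y via f."""
    rng = random.Random(seed)
    n_features = p * n_views
    n_all = n_concepts + n_hidden_concepts
    g = random_relu_mlp(rng, n_features, 16, n_all)  # stage 1: inputs to all concepts
    f = random_relu_mlp(rng, n_all, 16, n_tasks)     # stage 2: all concepts to tasks

    inputs, targets = [], []
    for _ in range(n_gen):
        # Assumption: independent standard-normal features.
        x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        # Assumption: binarise concepts and tasks by thresholding at zero.
        c_all = [1.0 if v > 0 else 0.0 for v in g(x)]
        y = [1.0 if v > 0 else 0.0 for v in f(c_all)]
        inputs.append(x)
        # Tasks depend on hidden concepts, but only observed ones are kept.
        targets.append(c_all[:n_concepts] + y)
    return inputs, targets


X, T = generate(n_gen=5, n_concepts=3, n_hidden_concepts=2)
print(len(X[0]), len(T[0]))
```

The point of the sketch is the last step of the loop: the tasks are computed from all concepts, while the stored targets contain only the observed ones, which is what makes the bottleneck incomplete when `n_hidden_concepts > 0`.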

Parameters:

name (str) – Name identifier for the dataset (used for file storage).
root (str, optional) – Root directory to store/load the dataset files. If None, defaults to './data/completeness_datasets/{name}'. Default: None
seed (int, optional) – Random seed for reproducible data generation (also determines the on-disk cache filename). Default: 42
n_gen (int, optional) – Number of samples to generate. Default: 10000
p (int, optional) – Dimensionality of each view (feature group). Default: 2
n_views (int, optional) – Number of views/feature groups. Total input features = p * n_views. Default: 10
n_concepts (int, optional) – Number of observable concepts (not including hidden concepts). Default: 2
n_hidden_concepts (int, optional) – Number of hidden concepts not observable in the bottleneck. Use this to simulate incomplete concept bottlenecks. Default: 0
n_tasks (int, optional) – Number of downstream tasks to predict. Default: 1
concept_subset (list of str, optional) – Subset of concept names to use. If provided, only the specified concepts will be included. Concept names follow the format 'C0', 'C1', etc. Default: None
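The arithmetic implied by these parameters can be checked with a short sketch. The function `derived_sizes` is a hypothetical helper, not part of torch_concepts; it only restates the relationships documented here (total features = p * n_views, the default root path, and a concepts tensor that holds observable concepts plus tasks).

```python
def derived_sizes(name, root=None, p=2, n_views=10,
                  n_concepts=2, n_hidden_concepts=0, n_tasks=1):
    """Compute sizes implied by the constructor arguments (illustrative helper)."""
    return {
        # Documented default location when root is None.
        "root": root if root is not None else f"./data/completeness_datasets/{name}",
        # Each of the n_views views contributes p features.
        "n_features": p * n_views,
        # Concepts produced by the generator, observed and hidden together.
        "n_generated_concepts": n_concepts + n_hidden_concepts,
        # Columns of the concepts tensor: observable concepts plus tasks.
        "concepts_tensor_width": n_concepts + n_tasks,
    }


sizes = derived_sizes("incomplete_exp", n_concepts=3, n_hidden_concepts=2)
for key, value in sizes.items():
    print(f"{key}: {value}")
```

With the defaults for `p` and `n_views`, this gives 20 input features; with 3 observable concepts, 2 hidden concepts, and 1 task, the generator produces 5 concepts but the concepts tensor has only 4 columns.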

input_data

Input features tensor of shape (n_samples, p * n_views).

Type: torch.Tensor

concepts

Concepts and tasks tensor of shape (n_samples, n_concepts + n_tasks). Note: Hidden concepts are NOT included in this tensor.

Type: torch.Tensor

annotations

Metadata about concept names, cardinalities, and types.

Type: Annotations

graph

Adjacency matrix of the directed acyclic graph representing concept-to-task relationships. All concepts influence all tasks in this dataset.

Type: pandas.DataFrame
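As an illustration of the "all concepts influence all tasks" structure, the sketch below builds such an adjacency matrix as a plain nested dictionary. The helper name and the row-is-source, column-is-target convention are assumptions; the dataset itself stores the graph as a pandas.DataFrame, and its exact layout is not specified here.

```python
def fully_connected_concept_graph(concept_names, task_names):
    """Adjacency with an edge from every concept to every task (illustrative)."""
    nodes = list(concept_names) + list(task_names)
    # Start from an all-zero square matrix over concepts and tasks.
    adjacency = {src: {dst: 0 for dst in nodes} for src in nodes}
    for concept in concept_names:
        for task in task_names:
            # Assumed convention: adjacency[source][target] == 1 marks an edge.
            adjacency[concept][task] = 1
    return adjacency


adj = fully_connected_concept_graph(["C0", "C1"], ["y"])
for src, row in adj.items():
    print(src, row)
```

Because edges only run from concepts to tasks, the graph has no cycles, which is consistent with the description of a directed acyclic graph.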

concept_names

Names of all concepts and tasks. Format: ['C0', 'C1', …, 'y']

Type: list of str
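The naming format makes it easy to address columns by name. The sketch below generates names in the documented format for the single-task case and selects columns for a subset of names. Both helpers are hypothetical; how the library applies `concept_subset` internally, and how tasks are named when `n_tasks > 1`, are not stated here.

```python
def make_concept_names(n_observable):
    """Names in the documented format ['C0', 'C1', ..., 'y'] (single-task case only)."""
    return [f"C{i}" for i in range(n_observable)] + ["y"]


def select_columns(rows, names, subset):
    """Keep only the columns whose names appear in subset, in subset order."""
    indices = [names.index(n) for n in subset]  # raises ValueError on unknown names
    return [[row[i] for i in indices] for row in rows]


names = make_concept_names(3)
rows = [
    [1.0, 0.0, 1.0, 0.0],  # columns: C0, C1, C2, y
    [0.0, 1.0, 1.0, 1.0],
]
print(names)
print(select_columns(rows, names, ["C0", "C2", "y"]))
```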

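n_concepts

Total number of observable concepts and tasks, i.e. the number of columns in concepts. Hidden concepts are excluded. Note that this differs from the n_concepts constructor argument, which counts observable concepts only.

Type: int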

n_features

Dimensionality of input features (p * n_views).

Type: tuple or int

Examples

Basic usage with complete bottleneck:

>>> from torch_concepts.data import CompletenessDataset
>>>
>>> # Create dataset with complete bottleneck (no hidden concepts)
>>> dataset = CompletenessDataset(
...     name='complete_exp',
...     n_gen=5000,
...     n_concepts=5,
...     n_hidden_concepts=0,
...     seed=42
... )
>>> print(f"Dataset size: {len(dataset)}")
>>> print(f"Input features: {dataset.n_features}")
>>> print(f"Concepts: {dataset.concept_names}")

Creating incomplete bottleneck with hidden concepts:

>>> from torch_concepts.data import CompletenessDataset
>>>
>>> # Create dataset with incomplete bottleneck
>>> dataset = CompletenessDataset(
...     name='incomplete_exp',
...     n_gen=5000,
...     n_concepts=3,          # 3 observable concepts
...     n_hidden_concepts=2,   # 2 hidden concepts (not in bottleneck)
...     seed=42
... )
>>> # The hidden concepts affect tasks but are not observable
>>> print(f"Observable concepts: {dataset.n_concepts}")

__init__(name: str, root: str | None = None, seed: int = 42, n_gen: int = 10000, p: int = 2, n_views: int = 10, n_concepts: int = 2, n_hidden_concepts: int = 0, n_tasks: int = 1, concept_subset: list | None = None)

Methods

`__init__`(name[, root, seed, n_gen, p, ...])
`add_exogenous`(name, value[, convert_precision])
`add_scaler`(key, scaler)	Add a scaler for preprocessing a specific tensor.
`build`()	Generate synthetic completeness data and save to disk.
`collate`(samples)	Collate samples into a batch, re-annotating the ground-truth concepts.
`download`()	No download needed for synthetic datasets.
`load`()	Load the dataset (wraps load_raw).
`load_raw`()	Load the generated dataset from disk.
`maybe_build`()
`maybe_download`()
`remove_exogenous`(name)
`set_concepts`(concepts)	Set concept annotations for the dataset.
`set_graph`(graph)	Set the adjacency matrix of the causal graph between concepts as a pandas DataFrame.

Attributes

`annotations`	Annotations for the concepts in the dataset.
`concept_names`	List of concept names in the dataset.
`exogenous`	Mapping of dataset's exogenous variables.
`graph`	Adjacency matrix of the causal graph between concepts.
`has_concepts`	Whether the dataset has concept annotations.
`has_exogenous`	Whether the dataset has exogenous information.
`n_concepts`	Number of concepts in the dataset.
`n_exogenous`	Number of exogenous variables in the dataset.
`n_features`	Shape of features in dataset's input (excluding number of samples).
`n_samples`	Number of samples in the dataset.
`processed_filenames`	List of processed filenames that will be created during build step.
`processed_paths`	The absolute paths of the processed files that must be present in order to skip building.
`raw_filenames`	No raw files needed - data is generated.
`raw_paths`	The absolute paths of the raw files that must be present in order to skip downloading.
`root_dir`
`shape`	Shape of the input tensor.