torch_concepts.data.datasets.toy.CompletenessDataset

class CompletenessDataset(name: str, root: str | None = None, seed: int = 42, n_gen: int = 10000, p: int = 2, n_views: int = 10, n_concepts: int = 2, n_hidden_concepts: int = 0, n_tasks: int = 1, concept_subset: list | None = None)[source]

Synthetic dataset for concept bottleneck completeness experiments.

This dataset generates synthetic data to study complete vs. incomplete concept bottlenecks. Data is generated using randomly initialized multi-layer perceptrons with ReLU activations. Input features are sampled from a multivariate normal distribution, and concepts are derived through nonlinear transformations. Hidden concepts can be included to simulate incomplete bottlenecks.

The dataset uses a two-stage generation process:

  1. Map inputs X to concepts C (both observed and hidden) via a nonlinear function g.
  2. Map concepts C to tasks Y via a nonlinear function f.
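The two-stage process can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the library's implementation: the helper names `random_mlp` and `generate`, the hidden width, the identity covariance, and the sign-based binarization of concepts and tasks are all hypothetical choices made for the sketch.

```python
import numpy as np


def random_mlp(rng, in_dim, hidden_dim, out_dim):
    """Return a randomly initialized one-hidden-layer MLP with ReLU activation."""
    w1 = rng.normal(size=(in_dim, hidden_dim))
    w2 = rng.normal(size=(hidden_dim, out_dim))

    def forward(x):
        return np.maximum(x @ w1, 0.0) @ w2

    return forward


def generate(seed=42, n_gen=1000, p=2, n_views=10, n_concepts=2,
             n_hidden_concepts=0, n_tasks=1, hidden_dim=16):
    """Hypothetical two-stage generator: X -> C via g, then C -> Y via f."""
    rng = np.random.default_rng(seed)
    n_features = p * n_views
    n_all = n_concepts + n_hidden_concepts

    # Inputs from a multivariate normal (identity covariance is an assumption).
    x = rng.multivariate_normal(np.zeros(n_features), np.eye(n_features), size=n_gen)

    g = random_mlp(rng, n_features, hidden_dim, n_all)  # stage 1: X -> C
    f = random_mlp(rng, n_all, hidden_dim, n_tasks)     # stage 2: C -> Y

    # Thresholding at zero is an assumption made for this sketch.
    c_all = (g(x) > 0).astype(np.float32)
    y = (f(c_all) > 0).astype(np.float32)  # tasks depend on ALL concepts

    # Only the observed concepts are exposed; hidden ones are dropped.
    c_observed = c_all[:, :n_concepts]
    return x.astype(np.float32), np.concatenate([c_observed, y], axis=1)


x, concepts_and_tasks = generate(n_gen=50, n_concepts=3, n_hidden_concepts=2)
print(x.shape, concepts_and_tasks.shape)
```

Because the tasks are computed from all concepts while only the observed ones are returned, setting `n_hidden_concepts > 0` in this sketch yields a bottleneck that does not carry all the information needed to predict the tasks.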

Parameters:
  • name (str) – Name identifier for the dataset (used for file storage).

  • root (str, optional) – Root directory to store/load the dataset files. If None, defaults to './data/completeness_datasets/{name}'. Default: None

  • seed (int, optional) – Random seed for reproducible data generation. Default: 42

  • n_gen (int, optional) – Number of samples to generate. Default: 10000

  • p (int, optional) – Dimensionality of each view (feature group). Default: 2

  • n_views (int, optional) – Number of views/feature groups. Total input features = p * n_views. Default: 10

  • n_concepts (int, optional) – Number of observable concepts (not including hidden concepts). Default: 2

  • n_hidden_concepts (int, optional) – Number of hidden concepts not observable in the bottleneck. Use this to simulate incomplete concept bottlenecks. Default: 0

  • n_tasks (int, optional) – Number of downstream tasks to predict. Default: 1

  • concept_subset (list of str, optional) – Subset of concept names to use. If provided, only the specified concepts will be included. Concept names follow the format 'C0', 'C1', etc. Default: None
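As a quick sanity check on how the parameters interact, the following sketch derives the expected tensor shapes and observable concept names from the constructor arguments. The helper `expected_layout` is hypothetical and not part of the library; it only restates the arithmetic documented on this page (input width `p * n_views`, label width `n_concepts + n_tasks`, hidden concepts excluded).

```python
def expected_layout(n_gen=10000, p=2, n_views=10, n_concepts=2,
                    n_hidden_concepts=0, n_tasks=1):
    """Hypothetical helper: shapes implied by the documented parameters."""
    return {
        # Each of the n_views views contributes p input features.
        "input_shape": (n_gen, p * n_views),
        # Observable concepts plus tasks; hidden concepts are not stored.
        "concepts_shape": (n_gen, n_concepts + n_tasks),
        # Observable concept names follow the documented 'C0', 'C1', ... format.
        "observable_concept_names": [f"C{i}" for i in range(n_concepts)],
        # Hidden concepts exist only during generation.
        "n_generated_concepts": n_concepts + n_hidden_concepts,
    }


layout = expected_layout(n_concepts=3, n_hidden_concepts=2)
print(layout["input_shape"])               # (10000, 20)
print(layout["concepts_shape"])            # (10000, 4)
print(layout["observable_concept_names"])  # ['C0', 'C1', 'C2']
```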

input_data

Input features tensor of shape (n_samples, p * n_views).

Type:

torch.Tensor

concepts

Concepts and tasks tensor of shape (n_samples, n_concepts + n_tasks). Note: Hidden concepts are NOT included in this tensor.

Type:

torch.Tensor
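Since concepts and tasks share one tensor, downstream code usually needs to separate them. The sketch below assumes that concept columns come first and task columns last, which matches the documented name order but is not confirmed by this page. `split_concepts_and_tasks` is a hypothetical helper, and a NumPy array stands in for the tensor; the same slicing works on a torch.Tensor.

```python
import numpy as np


def split_concepts_and_tasks(concepts, n_tasks):
    """Split a (n_samples, n_concepts + n_tasks) array into its two parts.

    Assumes tasks occupy the last n_tasks columns (an assumption of this sketch).
    """
    n_observed = concepts.shape[1] - n_tasks
    return concepts[:, :n_observed], concepts[:, n_observed:]


# Stand-in for dataset.concepts: 3 samples, 3 observable concepts, 1 task.
stand_in = np.arange(12, dtype=np.float32).reshape(3, 4)
c, y = split_concepts_and_tasks(stand_in, n_tasks=1)
print(c.shape, y.shape)  # (3, 3) (3, 1)
```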

annotations

Metadata about concept names, cardinalities, and types.

Type:

Annotations

graph

Directed acyclic graph representing concept-to-task relationships. All concepts influence all tasks in this dataset.

Type:

pandas.DataFrame
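A fully connected concept-to-task graph of this kind can be written as an adjacency matrix in pandas. The sketch below is illustrative only: `concept_to_task_graph` is a hypothetical helper, and the orientation (rows are sources, columns are targets) is an assumption rather than something this page specifies.

```python
import pandas as pd


def concept_to_task_graph(concept_names, task_names):
    """Build an adjacency matrix with an edge from every concept to every task."""
    concept_names = list(concept_names)
    task_names = list(task_names)
    names = concept_names + task_names
    adjacency = pd.DataFrame(0, index=names, columns=names)
    # Rows are edge sources, columns are edge targets (assumed orientation).
    adjacency.loc[concept_names, task_names] = 1
    return adjacency


graph = concept_to_task_graph(["C0", "C1"], ["y"])
print(graph)
```

Edges only run from concepts to tasks, with no concept-to-concept or task-to-concept entries, so the resulting graph is acyclic by construction.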

concept_names

Names of all concepts and tasks. Format: ['C0', 'C1', ..., 'y']

Type:

list of str

n_concepts

Total number of observable concepts and tasks (includes both, excludes hidden).

Type:

int

n_features

Dimensionality of input features (p * n_views).

Type:

tuple or int

Examples

Basic usage with complete bottleneck:

>>> from torch_concepts.data.datasets import CompletenessDataset
>>>
>>> # Create dataset with complete bottleneck (no hidden concepts)
>>> dataset = CompletenessDataset(
...     name='complete_exp',
...     n_gen=5000,
...     n_concepts=5,
...     n_hidden_concepts=0,
...     seed=42
... )
>>> print(f"Dataset size: {len(dataset)}")
>>> print(f"Input features: {dataset.n_features}")
>>> print(f"Concepts: {dataset.concept_names}")

Creating incomplete bottleneck with hidden concepts:

>>> from torch_concepts.data.datasets import CompletenessDataset
>>>
>>> # Create dataset with incomplete bottleneck
>>> dataset = CompletenessDataset(
...     name='incomplete_exp',
...     n_gen=5000,
...     n_concepts=3,          # 3 observable concepts
...     n_hidden_concepts=2,   # 2 hidden concepts (not in bottleneck)
...     seed=42
... )
>>> # The hidden concepts affect tasks but are not observable.
>>> # Note: n_concepts counts observable concepts AND tasks (hidden ones excluded).
>>> print(f"Observable concepts and tasks: {dataset.n_concepts}")

__init__(name: str, root: str | None = None, seed: int = 42, n_gen: int = 10000, p: int = 2, n_views: int = 10, n_concepts: int = 2, n_hidden_concepts: int = 0, n_tasks: int = 1, concept_subset: list | None = None)[source]

Methods

__init__(name[, root, seed, n_gen, p, ...])

add_exogenous(name, value[, convert_precision])

add_scaler(key, scaler)

Add a scaler for preprocessing a specific tensor.

build()

Generate synthetic completeness data and save to disk.

download()

No download needed for synthetic datasets.

load()

Load the dataset (wraps load_raw).

load_raw()

Load the generated dataset from disk.

maybe_build()

maybe_download()

maybe_reduce_annotations(annotations[, ...])

Set concepts and labels for the dataset.

Parameters:
  • annotations – Annotations object for all concepts.

  • concept_names_subset – List of strings naming the subset of concepts to use. If None, all concepts are used.

remove_exogenous(name)

set_concepts(concepts)

Set concept annotations for the dataset.

set_graph(graph)

Set the adjacency matrix of the causal graph between concepts as a pandas DataFrame.

Attributes

annotations

Annotations for the concepts in the dataset.

concept_names

List of concept names in the dataset.

exogenous

Mapping of dataset's exogenous variables.

graph

Adjacency matrix of the causal graph between concepts.

has_concepts

Whether the dataset has concept annotations.

has_exogenous

Whether the dataset has exogenous information.

n_concepts

Number of concepts in the dataset.

n_exogenous

Number of exogenous variables in the dataset.

n_features

Shape of features in dataset's input (excluding number of samples).

n_samples

Number of samples in the dataset.

processed_filenames

List of processed filenames that will be created during build step.

processed_paths

The absolute paths of the processed files that must be present in order to skip building.

raw_filenames

No raw files needed - data is generated.

raw_paths

The absolute paths of the raw files that must be present in order to skip downloading.

root_dir

shape

Shape of the input tensor.