torch_concepts.data.CompletenessDataset
- class CompletenessDataset(name: str, root: str | None = None, seed: int = 42, n_gen: int = 10000, p: int = 2, n_views: int = 10, n_concepts: int = 2, n_hidden_concepts: int = 0, n_tasks: int = 1, concept_subset: list | None = None)
Synthetic dataset for concept bottleneck completeness experiments.
This dataset generates synthetic data to study complete vs. incomplete concept bottlenecks. Data is generated using randomly initialized multi-layer perceptrons with ReLU activations. Input features are sampled from a multivariate normal distribution, and concepts are derived through nonlinear transformations. Hidden concepts can be included to simulate incomplete bottlenecks.
The dataset uses a two-stage generation process:
1. Map inputs X to concepts C (both observed and hidden) via a nonlinear function g.
2. Map concepts C to tasks Y via a nonlinear function f.
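The two-stage process can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the helper names (`random_mlp`, `generate`), the hidden width of 16, the identity covariance, and the binarization by thresholding at zero are all assumptions made for illustration.

```python
import numpy as np


def random_mlp(rng, in_dim, hidden_dim, out_dim):
    """Return a fixed, randomly initialized one-hidden-layer ReLU network."""
    w1 = rng.normal(size=(in_dim, hidden_dim))
    w2 = rng.normal(size=(hidden_dim, out_dim))

    def forward(x):
        return np.maximum(x @ w1, 0.0) @ w2  # ReLU hidden layer, linear output

    return forward


def generate(seed=42, n_gen=1000, p=2, n_views=10,
             n_concepts=2, n_hidden_concepts=0, n_tasks=1):
    """Hypothetical two-stage generator: X -> C via g, then C -> Y via f."""
    rng = np.random.default_rng(seed)
    n_features = p * n_views
    n_all = n_concepts + n_hidden_concepts

    # Inputs from a multivariate normal (identity covariance is an assumption).
    x = rng.multivariate_normal(np.zeros(n_features), np.eye(n_features), size=n_gen)

    g = random_mlp(rng, n_features, 16, n_all)  # stage 1: inputs -> all concepts
    f = random_mlp(rng, n_all, 16, n_tasks)     # stage 2: all concepts -> tasks

    # Assumption: binarize by thresholding the network output at zero.
    c_all = (g(x) > 0).astype(np.float32)
    y = (f(c_all) > 0).astype(np.float32)

    # Only the observable concepts are exposed; hidden ones still influenced y.
    c_observed = c_all[:, :n_concepts]
    return x, np.concatenate([c_observed, y], axis=1)
```

Calling `generate(n_concepts=3, n_hidden_concepts=2)` yields labels with 4 columns (3 observable concepts plus 1 task), while the task values were computed from all 5 concepts. That gap is what makes the bottleneck incomplete.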
- Parameters:
name (str) – Name identifier for the dataset (used for file storage).
root (str, optional) – Root directory to store/load the dataset files. If None, defaults to ‘./data/completeness_datasets/{name}’. Default: None
seed (int, optional) – Random seed for reproducible data generation (also determines the on-disk cache filename). Default: 42
n_gen (int, optional) – Number of samples to generate. Default: 10000
p (int, optional) – Dimensionality of each view (feature group). Default: 2
n_views (int, optional) – Number of views/feature groups. Total input features = p * n_views. Default: 10
n_concepts (int, optional) – Number of observable concepts (not including hidden concepts). Default: 2
n_hidden_concepts (int, optional) – Number of hidden concepts not observable in the bottleneck. Use this to simulate incomplete concept bottlenecks. Default: 0
n_tasks (int, optional) – Number of downstream tasks to predict. Default: 1
concept_subset (list of str, optional) – Subset of concept names to use. If provided, only the specified concepts will be included. Concept names follow format ‘C0’, ‘C1’, etc. Default: None
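As a quick check on how these parameters interact, the sketch below derives the tensor widths they imply. The function `expected_layout` and its dictionary keys are hypothetical helpers, not part of `torch_concepts`; the assumption that observable concepts are named `C0` through `C{n_concepts - 1}` follows the naming format described for `concept_subset`.

```python
def expected_layout(p=2, n_views=10, n_concepts=2, n_hidden_concepts=0, n_tasks=1):
    """Derive the widths implied by the documented parameters (illustrative only)."""
    return {
        # Total input features = p * n_views.
        "n_features": p * n_views,
        # Concepts produced by the first generation stage, observed and hidden.
        "n_generated_concepts": n_concepts + n_hidden_concepts,
        # Columns of the concepts tensor: observable concepts plus tasks.
        "n_label_columns": n_concepts + n_tasks,
        # Assumed names of the observable concepts.
        "concept_names": [f"C{i}" for i in range(n_concepts)],
    }


layout = expected_layout()                       # documented defaults
incomplete = expected_layout(n_concepts=3, n_hidden_concepts=2)
```

With the defaults, the input has 20 features and the label tensor has 3 columns. In the second call, 5 concepts are generated but only 3 of them appear among the 4 label columns.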
- input_data
Input features tensor of shape (n_samples, p * n_views).
- Type:
- concepts
Concepts and tasks tensor of shape (n_samples, n_concepts + n_tasks). Note: Hidden concepts are NOT included in this tensor.
- Type:
- annotations
Metadata about concept names, cardinalities, and types.
- Type:
- graph
Directed acyclic graph representing concept-to-task relationships. All concepts influence all tasks in this dataset.
- Type:
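- n_concepts
Total number of observable concepts and tasks combined (hidden concepts are excluded). Note that this differs from the n_concepts constructor argument, which counts observable concepts only.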
- Type:
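Since every concept influences every task, the graph attribute describes a complete bipartite structure from concepts to tasks. The sketch below builds such an adjacency matrix with plain Python lists. The function name, the task name `Y0`, and the row-is-source, column-is-target convention are assumptions for illustration; the library's actual node names and storage format may differ.

```python
def bipartite_adjacency(concept_names, task_names):
    """Build a concept -> task adjacency matrix (rows: source, columns: target)."""
    nodes = list(concept_names) + list(task_names)
    concepts = set(concept_names)
    tasks = set(task_names)
    # An edge exists only when the source is a concept and the target is a task.
    matrix = [
        [1 if (src in concepts and dst in tasks) else 0 for dst in nodes]
        for src in nodes
    ]
    return nodes, matrix


# 'Y0' is a hypothetical task name used only for this example.
nodes, adjacency = bipartite_adjacency(["C0", "C1"], ["Y0"])
```

The resulting matrix has no concept-to-concept or task-to-concept edges, so it is acyclic by construction.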
Examples
Basic usage with a complete bottleneck:
>>> from torch_concepts.data import CompletenessDataset
>>>
>>> # Create dataset with complete bottleneck (no hidden concepts)
>>> dataset = CompletenessDataset(
...     name='complete_exp',
...     n_gen=5000,
...     n_concepts=5,
...     n_hidden_concepts=0,
...     seed=42
... )
>>> print(f"Dataset size: {len(dataset)}")
>>> print(f"Input features: {dataset.n_features}")
>>> print(f"Concepts: {dataset.concept_names}")
Creating an incomplete bottleneck with hidden concepts:
>>> from torch_concepts.data import CompletenessDataset
>>>
>>> # Create dataset with incomplete bottleneck
>>> dataset = CompletenessDataset(
...     name='incomplete_exp',
...     n_gen=5000,
...     n_concepts=3,  # 3 observable concepts
...     n_hidden_concepts=2,  # 2 hidden concepts (not in bottleneck)
...     seed=42
... )
>>> # The hidden concepts affect tasks but are not observable
>>> print(f"Observable concepts: {dataset.n_concepts}")
- __init__(name: str, root: str | None = None, seed: int = 42, n_gen: int = 10000, p: int = 2, n_views: int = 10, n_concepts: int = 2, n_hidden_concepts: int = 0, n_tasks: int = 1, concept_subset: list | None = None)
Methods
__init__(name[, root, seed, n_gen, p, ...])
add_exogenous(name, value[, convert_precision])
add_scaler(key, scaler) – Add a scaler for preprocessing a specific tensor.
build() – Generate synthetic completeness data and save to disk.
collate(samples) – Collate samples into a batch, re-annotating the ground-truth concepts.
download() – No download needed for synthetic datasets.
load() – Load the dataset (wraps load_raw).
load_raw() – Load the generated dataset from disk.
maybe_build()
maybe_download()
remove_exogenous(name)
set_concepts(concepts) – Set concept annotations for the dataset.
set_graph(graph) – Set the adjacency matrix of the causal graph between concepts as a pandas DataFrame.
Attributes
Annotations for the concepts in the dataset.
List of concept names in the dataset.
exogenous – Mapping of the dataset's exogenous variables.
Adjacency matrix of the causal graph between concepts.
has_concepts – Whether the dataset has concept annotations.
has_exogenous – Whether the dataset has exogenous information.
Number of concepts in the dataset.
n_exogenous – Number of exogenous variables in the dataset.
Shape of features in dataset's input (excluding number of samples).
n_samples – Number of samples in the dataset.
processed_filenames – List of processed filenames that will be created during the build step.
processed_paths – The absolute paths of the processed files that must be present in order to skip building.
raw_filenames – No raw files needed; the data is generated.
raw_paths – The absolute paths of the raw files that must be present in order to skip downloading.
root_dir
shape – Shape of the input tensor.