torch_concepts.data.datasets.toy.CompletenessDataset
- class CompletenessDataset(name: str, root: str | None = None, seed: int = 42, n_gen: int = 10000, p: int = 2, n_views: int = 10, n_concepts: int = 2, n_hidden_concepts: int = 0, n_tasks: int = 1, concept_subset: list | None = None)
Synthetic dataset for concept bottleneck completeness experiments.
This dataset generates synthetic data to study complete vs. incomplete concept bottlenecks. Data is generated using randomly initialized multi-layer perceptrons with ReLU activations. Input features are sampled from a multivariate normal distribution, and concepts are derived through nonlinear transformations. Hidden concepts can be included to simulate incomplete bottlenecks.
The dataset uses a two-stage generation process:
1. Map inputs X to concepts C (both observed and hidden) via a nonlinear function g.
2. Map concepts C to tasks Y via a nonlinear function f.
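The two stages can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names `random_mlp` and `generate`, the hidden width of 16, the identity covariance of the inputs, and the binarization by thresholding at zero are all hypothetical choices made for the example.

```python
import numpy as np


def random_mlp(rng, sizes):
    """Return a function that applies a randomly initialized ReLU MLP."""
    weights = [
        rng.normal(size=(n_in, n_out)) / np.sqrt(n_in)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

    def forward(x):
        for w in weights[:-1]:
            x = np.maximum(x @ w, 0.0)  # ReLU on hidden layers
        return x @ weights[-1]  # linear output layer

    return forward


def generate(n_gen=1000, p=2, n_views=10, n_concepts=2,
             n_hidden_concepts=0, n_tasks=1, seed=42):
    rng = np.random.default_rng(seed)
    n_features = p * n_views
    n_total = n_concepts + n_hidden_concepts

    # Inputs from a multivariate normal (identity covariance is an assumption).
    x = rng.standard_normal((n_gen, n_features))

    g = random_mlp(rng, [n_features, 16, n_total])  # stage 1: X -> C
    f = random_mlp(rng, [n_total, 16, n_tasks])     # stage 2: C -> Y

    # Thresholding at zero to get binary values is an assumption.
    c_all = (g(x) > 0).astype(np.float64)
    y = (f(c_all) > 0).astype(np.float64)  # tasks see observed AND hidden concepts

    # Only the first n_concepts columns are exposed; the rest stay hidden.
    c_observed = c_all[:, :n_concepts]
    return x, c_observed, y


x, c, y = generate(n_gen=500, n_concepts=3, n_hidden_concepts=2)
print(x.shape, c.shape, y.shape)  # (500, 20) (500, 3) (500, 1)
```

The point of the sketch is that the tasks are computed from all concepts, while only the observable ones are returned, which is what makes a bottleneck over the returned concepts incomplete when hidden concepts are present.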
- Parameters:
name (str) – Name identifier for the dataset (used for file storage).
root (str, optional) – Root directory to store/load the dataset files. If None, defaults to ‘./data/completeness_datasets/{name}’. Default: None
seed (int, optional) – Random seed for reproducible data generation. Default: 42
n_gen (int, optional) – Number of samples to generate. Default: 10000
p (int, optional) – Dimensionality of each view (feature group). Default: 2
n_views (int, optional) – Number of views/feature groups. Total input features = p * n_views. Default: 10
n_concepts (int, optional) – Number of observable concepts (not including hidden concepts). Default: 2
n_hidden_concepts (int, optional) – Number of hidden concepts not observable in the bottleneck. Use this to simulate incomplete concept bottlenecks. Default: 0
n_tasks (int, optional) – Number of downstream tasks to predict. Default: 1
concept_subset (list of str, optional) – Subset of concept names to use. If provided, only the specified concepts will be included. Concept names follow format ‘C0’, ‘C1’, etc. Default: None
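As a quick check of how these parameters combine, the sketch below computes the tensor shapes implied by the descriptions on this page. The helper `expected_shapes` is hypothetical (it is not part of the library), and it assumes that `concept_subset` is None and that the number of stored samples equals `n_gen`.

```python
def expected_shapes(n_gen=10000, p=2, n_views=10, n_concepts=2,
                    n_hidden_concepts=0, n_tasks=1):
    """Shapes implied by the parameter descriptions (illustrative helper only)."""
    return {
        # Total input features = p * n_views.
        "input_data": (n_gen, p * n_views),
        # Observable concepts plus tasks; hidden concepts are not stored,
        # so n_hidden_concepts does not affect this shape.
        "concepts": (n_gen, n_concepts + n_tasks),
    }


shapes = expected_shapes()
print(shapes["input_data"])  # (10000, 20)
print(shapes["concepts"])    # (10000, 3)
```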
- input_data
Input features tensor of shape (n_samples, p * n_views).
- concepts
Concepts and tasks tensor of shape (n_samples, n_concepts + n_tasks). Note: Hidden concepts are NOT included in this tensor.
- annotations
Metadata about concept names, cardinalities, and types.
- graph
Directed acyclic graph representing concept-to-task relationships. All concepts influence all tasks in this dataset.
- n_concepts
Total number of observable concepts and tasks (includes both, excludes hidden).
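The graph structure described above, in which every concept points to every task, can be illustrated with a small adjacency matrix. This is a sketch only: the library stores the graph as a pandas DataFrame, whereas the example uses nested lists to stay dependency-free, and the helper `bipartite_adjacency`, the task name 'Y0', and the row-is-source, column-is-target convention are assumptions.

```python
def bipartite_adjacency(concept_names, task_names):
    """Build a concept-to-task adjacency matrix (rows: source, columns: target)."""
    nodes = list(concept_names) + list(task_names)
    concepts, tasks = set(concept_names), set(task_names)
    matrix = [
        [1 if (src in concepts and dst in tasks) else 0 for dst in nodes]
        for src in nodes
    ]
    return nodes, matrix


# 'C0', 'C1' follow the concept naming on this page; 'Y0' is a hypothetical task name.
nodes, adj = bipartite_adjacency(["C0", "C1"], ["Y0"])
for name, row in zip(nodes, adj):
    print(name, row)
# C0 [0, 0, 1]
# C1 [0, 0, 1]
# Y0 [0, 0, 0]
```

Because edges only run from concepts to tasks, the matrix has no cycles, which matches the directed acyclic structure stated for `graph`.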
Examples
Basic usage with complete bottleneck:
>>> from torch_concepts.data.datasets import CompletenessDataset
>>>
>>> # Create dataset with complete bottleneck (no hidden concepts)
>>> dataset = CompletenessDataset(
...     name='complete_exp',
...     n_gen=5000,
...     n_concepts=5,
...     n_hidden_concepts=0,
...     seed=42
... )
>>> print(f"Dataset size: {len(dataset)}")
>>> print(f"Input features: {dataset.n_features}")
>>> print(f"Concepts: {dataset.concept_names}")
Creating incomplete bottleneck with hidden concepts:
>>> from torch_concepts.data.datasets import CompletenessDataset
>>>
>>> # Create dataset with incomplete bottleneck
>>> dataset = CompletenessDataset(
...     name='incomplete_exp',
...     n_gen=5000,
...     n_concepts=3,           # 3 observable concepts
...     n_hidden_concepts=2,    # 2 hidden concepts (not in bottleneck)
...     seed=42
... )
>>> # The hidden concepts affect tasks but are not observable
>>> print(f"Observable concepts: {dataset.n_concepts}")
- __init__(name: str, root: str | None = None, seed: int = 42, n_gen: int = 10000, p: int = 2, n_views: int = 10, n_concepts: int = 2, n_hidden_concepts: int = 0, n_tasks: int = 1, concept_subset: list | None = None)
Methods
__init__(name[, root, seed, n_gen, p, ...])
add_exogenous(name, value[, convert_precision])
add_scaler(key, scaler) – Add a scaler for preprocessing a specific tensor.
build() – Generate synthetic completeness data and save to disk.
download() – No download needed for synthetic datasets.
load() – Load the dataset (wraps load_raw).
load_raw() – Load the generated dataset from disk.
maybe_build()
maybe_download()
maybe_reduce_annotations(annotations[, ...]) – Set concepts and labels for the dataset. annotations: Annotations object for all concepts. concept_names_subset: list of strings naming the subset of concepts to use; if None, all concepts are used.
remove_exogenous(name)
set_concepts(concepts) – Set concept annotations for the dataset.
set_graph(graph) – Set the adjacency matrix of the causal graph between concepts as a pandas DataFrame.
Attributes
annotations – Annotations for the concepts in the dataset.
concept_names – List of concept names in the dataset.
exogenous – Mapping of dataset's exogenous variables.
graph – Adjacency matrix of the causal graph between concepts.
has_concepts – Whether the dataset has concept annotations.
has_exogenous – Whether the dataset has exogenous information.
n_concepts – Number of concepts in the dataset.
n_exogenous – Number of exogenous variables in the dataset.
n_features – Shape of features in dataset's input (excluding number of samples).
n_samples – Number of samples in the dataset.
List of processed filenames that will be created during build step.
processed_paths – The absolute paths of the processed files that must be present in order to skip building.
No raw files needed - data is generated.
raw_paths – The absolute paths of the raw files that must be present in order to skip downloading.
root_dir
shape – Shape of the input tensor.