torch_concepts.data.ToyDAGDataset
- class ToyDAGDataset(variables: List[str], cardinalities: Dict[str, int], dag: List[Tuple[str, str]], conditional_probs: Dict[Tuple[str, str] | Tuple[str], ndarray | list], root: str | None = None, seed: int = 42, n_gen: int = 10000, target_variable: str | None = None, latent_variables: List[str] | None = None, concept_subset: list | None = None, label_descriptions: dict | None = None, autoencoder_kwargs: dict | None = None, **kwargs)
Dataset class for toy DAG-based synthetic datasets.
This dataset generates synthetic data based on a user-defined Directed Acyclic Graph (DAG) and conditional probability tables. It supports:
- Custom DAG structures
- Custom conditional probability tables
- Optional latent variables (used for embedding generation but not exposed as concepts)
- Autoencoder-based embedding generation
- Parameters:
variables – List of all variable names in the DAG.
cardinalities – Dictionary mapping variable names to their cardinality.
dag – List of edges representing the DAG structure as (parent, child) tuples.
conditional_probs – Dictionary mapping variables to their conditional probability tables. Keys take the form {(parent, child): array}, or {(child,): array} for a multi-parent variable.
root – Root directory in which to store and load the dataset. If None, a local folder is created.
seed – Random seed for data generation and reproducibility.
n_gen – Total number of samples to generate.
target_variable – Name of the target variable (optional, for metadata).
latent_variables – List of latent variable names (used for embeddings but hidden from concepts).
concept_subset – Optional subset of concept labels to use.
label_descriptions – Optional dict mapping concept names to descriptions.
autoencoder_kwargs – Configuration for autoencoder-based feature extraction.
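The parameters above describe a discrete DAG with one probability table per variable. As a rough illustration of what generating samples from such a description involves, the following is a minimal sketch of ancestral sampling in plain Python. It is not the library's implementation: the helper names `topological_order` and `sample_dag`, and the `cpts` layout (a dict keyed by variable, then by a tuple of parent values), are assumptions made for this example and differ from the `conditional_probs` format accepted by `ToyDAGDataset`.

```python
import random
from typing import Dict, List, Tuple


def topological_order(variables: List[str], dag: List[Tuple[str, str]]):
    """Order variables so parents precede children (Kahn's algorithm).

    Raises ValueError if the (parent, child) edge list contains a cycle.
    """
    parents: Dict[str, List[str]] = {v: [] for v in variables}
    children: Dict[str, List[str]] = {v: [] for v in variables}
    for parent, child in dag:
        parents[child].append(parent)
        children[parent].append(child)

    indegree = {v: len(parents[v]) for v in variables}
    ready = [v for v in variables if indegree[v] == 0]
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(variables):
        raise ValueError("edge list contains a cycle")
    return order, parents


def sample_dag(variables, cardinalities, dag, cpts, n_samples, seed=42):
    """Draw n_samples joint assignments by sampling each variable given its parents.

    cpts[var][parent_values] is a list of probabilities over var's states
    (hypothetical layout, chosen for readability in this sketch).
    """
    rng = random.Random(seed)
    order, parents = topological_order(variables, dag)
    samples = []
    for _ in range(n_samples):
        row = {}
        for var in order:
            parent_values = tuple(row[p] for p in parents[var])
            probs = cpts[var][parent_values]
            row[var] = rng.choices(range(cardinalities[var]), weights=probs)[0]
        samples.append(row)
    return samples


# Illustrative three-variable graph: rain -> wet <- sprinkler.
variables = ["rain", "sprinkler", "wet"]
cardinalities = {"rain": 2, "sprinkler": 2, "wet": 2}
dag = [("rain", "wet"), ("sprinkler", "wet")]
cpts = {
    "rain": {(): [0.8, 0.2]},
    "sprinkler": {(): [0.6, 0.4]},
    # Keys are (rain, sprinkler) values, in the order the edges were listed.
    "wet": {
        (0, 0): [1.0, 0.0],
        (0, 1): [0.1, 0.9],
        (1, 0): [0.2, 0.8],
        (1, 1): [0.0, 1.0],
    },
}

samples = sample_dag(variables, cardinalities, dag, cpts, n_samples=1000, seed=42)
```

Fixing the seed makes the draw repeatable, which mirrors the role of the `seed` parameter described above.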
- __init__(variables: List[str], cardinalities: Dict[str, int], dag: List[Tuple[str, str]], conditional_probs: Dict[Tuple[str, str] | Tuple[str], ndarray | list], root: str | None = None, seed: int = 42, n_gen: int = 10000, target_variable: str | None = None, latent_variables: List[str] | None = None, concept_subset: list | None = None, label_descriptions: dict | None = None, autoencoder_kwargs: dict | None = None, **kwargs)
Methods
__init__(variables, cardinalities, dag, ...)
add_exogenous(name, value[, convert_precision])
add_scaler(key, scaler) – Add a scaler for preprocessing a specific tensor.
build() – Build processed dataset from raw files.
collate(samples) – Collate samples into a batch, re-annotating the ground-truth concepts.
download() – Download raw data files to root directory.
load() – Load and optionally preprocess dataset.
load_raw() – Load raw processed files.
maybe_build()
maybe_download()
remove_exogenous(name)
set_concepts(concepts) – Set concept annotations for the dataset.
set_graph(graph) – Set the adjacency matrix of the causal graph between concepts as a pandas DataFrame.
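Since `set_graph` takes an adjacency matrix as a pandas DataFrame while the constructor takes an edge list, it may help to see how one form maps to the other. The sketch below builds such a DataFrame from (parent, child) tuples. The helper `edges_to_adjacency` is hypothetical, and the orientation (rows as parents, columns as children) is an assumption; check the convention the library expects before passing the result to `set_graph`.

```python
import pandas as pd


def edges_to_adjacency(variables, dag):
    """Build a square 0/1 adjacency DataFrame from (parent, child) edges.

    Assumed orientation: adjacency.loc[parent, child] == 1 marks an edge.
    """
    adjacency = pd.DataFrame(0, index=variables, columns=variables, dtype=int)
    for parent, child in dag:
        adjacency.loc[parent, child] = 1
    return adjacency


# Illustrative graph: rain -> wet <- sprinkler.
variables = ["rain", "sprinkler", "wet"]
dag = [("rain", "wet"), ("sprinkler", "wet")]
graph = edges_to_adjacency(variables, dag)
```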
Attributes
annotations – Annotations for the concepts in the dataset.
concept_names – List of concept names in the dataset.
exogenous – Mapping of dataset's exogenous variables.
graph – Adjacency matrix of the causal graph between concepts.
has_concepts – Whether the dataset has concept annotations.
has_exogenous – Whether the dataset has exogenous information.
n_concepts – Number of concepts in the dataset.
n_exogenous – Number of exogenous variables in the dataset.
n_features – Shape of features in dataset's input (excluding number of samples).
n_samples – Number of samples in the dataset.
processed_filenames – List of processed filenames that will be created during build step.
processed_paths – The absolute paths of the processed files that must be present in order to skip building.
raw_filenames – List of raw filenames that must be present to skip downloading.
raw_paths – The absolute paths of the raw files that must be present in order to skip downloading.
root_dir
shape – Shape of the input tensor.