torch_concepts.data.ToyDAGDataset

class ToyDAGDataset(variables: List[str], cardinalities: Dict[str, int], dag: List[Tuple[str, str]], conditional_probs: Dict[Tuple[str, str] | Tuple[str], ndarray | list], root: str | None = None, seed: int = 42, n_gen: int = 10000, target_variable: str | None = None, latent_variables: List[str] | None = None, concept_subset: list | None = None, label_descriptions: dict | None = None, autoencoder_kwargs: dict | None = None, **kwargs)

Dataset class for toy DAG-based synthetic datasets.

This dataset generates synthetic data from a user-defined Directed Acyclic Graph (DAG) and conditional probability tables. It supports:

- Custom DAG structures
- Custom conditional probability tables
- Optional latent variables (used for embedding generation but not exposed as concepts)
- Autoencoder-based embedding generation
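To make the idea of generating data from a DAG and conditional probability tables concrete, here is a minimal sketch of ancestral sampling using only the Python standard library. It is not the library's implementation: the variable names, the probabilities, the `cpts` layout (one row per parent configuration, one column per child state), and the helper names `parents_of` and `sample` are all assumptions made for illustration.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical toy model (rain -> wet_grass); names and numbers are illustrative.
variables = ["rain", "wet_grass"]
cardinalities = {"rain": 2, "wet_grass": 2}
dag = [("rain", "wet_grass")]  # (parent, child) edges

# Assumed layout for this sketch only: one row per parent configuration,
# one column per state of the variable. Root variables have a single row.
cpts = {
    "rain": [[0.8, 0.2]],
    "wet_grass": [
        [0.9, 0.1],  # rain = 0
        [0.2, 0.8],  # rain = 1
    ],
}


def parents_of(dag, variables):
    """Map every variable to the ordered list of its parents."""
    parents = {v: [] for v in variables}
    for parent, child in dag:
        parents[child].append(parent)
    return parents


def sample(n, seed=42):
    """Draw n joint samples by visiting variables in topological order."""
    rng = random.Random(seed)
    parents = parents_of(dag, variables)
    # static_order() yields each node after all of its predecessors (parents)
    # and raises graphlib.CycleError if the edges do not form a DAG.
    order = list(TopologicalSorter(parents).static_order())
    rows = []
    for _ in range(n):
        values = {}
        for var in order:
            # Mixed-radix index of the current parent configuration.
            row = 0
            for p in parents[var]:
                row = row * cardinalities[p] + values[p]
            probs = cpts[var][row]
            values[var] = rng.choices(range(cardinalities[var]), weights=probs)[0]
        rows.append(values)
    return rows


samples = sample(1000)
print(samples[:3])
```

Sampling parents before children is what lets each conditional table be indexed by values that have already been drawn; fixing the seed makes the generated rows reproducible.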

Parameters:

variables – List of all variable names in the DAG.
cardinalities – Dictionary mapping variable names to their cardinality.
dag – List of edges representing the DAG structure as (parent, child) tuples.
conditional_probs – Dictionary mapping variables to their conditional probability tables. Format: {(parent, child): array}, or {(child,): array} for multi-parent variables.
root – Root directory to store/load the dataset. If None, a local folder is created.
seed – Random seed for data generation and reproducibility.
n_gen – Total number of samples to generate.
target_variable – Name of the target variable (optional, for metadata).
latent_variables – List of latent variable names (used for embeddings but hidden from concepts).
concept_subset – Optional subset of concept labels to use.
label_descriptions – Optional dict mapping concept names to descriptions.
autoencoder_kwargs – Configuration for autoencoder-based feature extraction.
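Before passing these arguments to the constructor, it can help to sanity-check them. The sketch below is a hypothetical helper (`check_dag_spec` is not part of torch_concepts) that verifies that edges use known variables, that the graph is acyclic, and that single-parent tables are row-normalized. It assumes a `(parent, child)` table has shape `(cardinality of parent, cardinality of child)`; the source does not state the expected array layout, so treat that as an assumption. Multi-parent `(child,)` tables are skipped for the same reason.

```python
from graphlib import CycleError, TopologicalSorter

import numpy as np


def check_dag_spec(variables, cardinalities, dag, conditional_probs):
    """Return a list of problems found in the specification (empty if none)."""
    problems = []
    known = set(variables)

    for name in variables:
        if cardinalities.get(name, 0) < 1:
            problems.append(f"no positive cardinality for {name!r}")

    # Build child -> parents map while checking that edges use known names.
    preds = {name: set() for name in variables}
    for parent, child in dag:
        if parent not in known or child not in known:
            problems.append(f"edge ({parent!r}, {child!r}) uses an unknown variable")
            continue
        preds[child].add(parent)

    try:
        tuple(TopologicalSorter(preds).static_order())
    except CycleError:
        problems.append("graph contains a cycle")

    for key, table in conditional_probs.items():
        if len(key) != 2:
            continue  # multi-parent tables: layout not assumed in this sketch
        parent, child = key
        if parent not in cardinalities or child not in cardinalities:
            problems.append(f"table {key} refers to a variable without cardinality")
            continue
        probs = np.asarray(table, dtype=float)
        expected = (cardinalities[parent], cardinalities[child])  # assumed layout
        if probs.shape != expected:
            problems.append(f"table {key} has shape {probs.shape}, expected {expected}")
        elif not np.allclose(probs.sum(axis=-1), 1.0):
            problems.append(f"rows of table {key} do not sum to 1")
    return problems


# Illustrative specification; names and probabilities are made up.
spec = dict(
    variables=["rain", "wet_grass"],
    cardinalities={"rain": 2, "wet_grass": 2},
    dag=[("rain", "wet_grass")],
    conditional_probs={("rain", "wet_grass"): [[0.9, 0.1], [0.2, 0.8]]},
)
print(check_dag_spec(**spec))
```

Returning a list of messages rather than raising on the first issue makes it easier to fix a hand-written specification in one pass.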

__init__(variables: List[str], cardinalities: Dict[str, int], dag: List[Tuple[str, str]], conditional_probs: Dict[Tuple[str, str] | Tuple[str], ndarray | list], root: str | None = None, seed: int = 42, n_gen: int = 10000, target_variable: str | None = None, latent_variables: List[str] | None = None, concept_subset: list | None = None, label_descriptions: dict | None = None, autoencoder_kwargs: dict | None = None, **kwargs)

Methods

`__init__`(variables, cardinalities, dag, ...)
`add_exogenous`(name, value[, convert_precision])
`add_scaler`(key, scaler)	Add a scaler for preprocessing a specific tensor.
`build`()	Build processed dataset from raw files.
`collate`(samples)	Collate samples into a batch, re-annotating the ground-truth concepts.
`download`()	Download raw data files to root directory.
`load`()	Load and optionally preprocess dataset.
`load_raw`()	Load raw processed files.
`maybe_build`()
`maybe_download`()
`remove_exogenous`(name)
`set_concepts`(concepts)	Set concept annotations for the dataset.
`set_graph`(graph)	Set the adjacency matrix of the causal graph between concepts as a pandas DataFrame.
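Since `set_graph` takes the adjacency matrix as a pandas DataFrame, the following sketch shows one way to build such a matrix from a `(parent, child)` edge list. The variable names are illustrative, and the orientation (rows as parents, columns as children) is an assumption; the source does not say which convention the library expects.

```python
import pandas as pd

# Illustrative variables and (parent, child) edges.
variables = ["rain", "sprinkler", "wet_grass"]
dag = [("rain", "wet_grass"), ("sprinkler", "wet_grass")]

# Assumed convention: adjacency.loc[parent, child] == 1 marks an edge.
adjacency = pd.DataFrame(0, index=variables, columns=variables, dtype=int)
for parent, child in dag:
    adjacency.loc[parent, child] = 1

print(adjacency)

# With an existing dataset instance (hypothetical name `dataset`):
# dataset.set_graph(adjacency)
```

Labeling both axes with the variable names keeps the matrix readable and avoids relying on positional order.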

Attributes

`annotations`	Annotations for the concepts in the dataset.
`concept_names`	List of concept names in the dataset.
`exogenous`	Mapping of dataset's exogenous variables.
`graph`	Adjacency matrix of the causal graph between concepts.
`has_concepts`	Whether the dataset has concept annotations.
`has_exogenous`	Whether the dataset has exogenous information.
`n_concepts`	Number of concepts in the dataset.
`n_exogenous`	Number of exogenous variables in the dataset.
`n_features`	Shape of features in dataset's input (excluding number of samples).
`n_samples`	Number of samples in the dataset.
`processed_filenames`	List of processed filenames that will be created during build step.
`processed_paths`	The absolute paths of the processed files that must be present in order to skip building.
`raw_filenames`	List of raw filenames that must be present to skip downloading.
`raw_paths`	The absolute paths of the raw files that must be present in order to skip downloading.
`root_dir`
`shape`	Shape of the input tensor.