torch_concepts.data.ToyDAGDataset

class ToyDAGDataset(variables: List[str], cardinalities: Dict[str, int], dag: List[Tuple[str, str]], conditional_probs: Dict[Tuple[str, str] | Tuple[str], ndarray | list], root: str | None = None, seed: int = 42, n_gen: int = 10000, target_variable: str | None = None, latent_variables: List[str] | None = None, concept_subset: list | None = None, label_descriptions: dict | None = None, autoencoder_kwargs: dict | None = None, **kwargs)

Dataset class for toy DAG-based synthetic datasets.

This dataset generates synthetic data from a user-defined Directed Acyclic Graph (DAG) and conditional probability tables. It supports:

  • Custom DAG structures

  • Custom conditional probability tables

  • Optional latent variables (used for embedding generation but not exposed as concepts)

  • Autoencoder-based embedding generation

Parameters:
  • variables – List of all variable names in the DAG.

  • cardinalities – Dictionary mapping variable names to their cardinality.

  • dag – List of edges representing the DAG structure as (parent, child) tuples.

  • conditional_probs – Dictionary mapping variables to their conditional probability tables. Format: {(parent, child): array} for a single parent-child edge, or {(child,): array} for a variable with multiple parents.

  • root – Root directory in which to store and load the dataset. If None, a local folder is created.

  • seed – Random seed for data generation and reproducibility.

  • n_gen – Total number of samples to generate.

  • target_variable – Name of the target variable (optional, for metadata).

  • latent_variables – List of latent variable names (used for embeddings but hidden from concepts).

  • concept_subset – Optional subset of concept labels to use.

  • label_descriptions – Optional dict mapping concept names to descriptions.

  • autoencoder_kwargs – Configuration for autoencoder-based feature extraction.
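The way the variables, cardinalities, dag and conditional_probs arguments fit together can be illustrated with a small ancestral-sampling sketch. It does not call ToyDAGDataset and is not the library's implementation; it only shows one plausible reading of the parameters. The variable names, the probability values, and the helper functions topological_order and sample_rows are hypothetical. The table layout is also an assumption: each (parent, child) table is taken to have one row per parent state, and a root variable is given a marginal under a (child,) key. Multi-parent tables are left out because their array layout is not described here.

```python
import random

# Hypothetical three-variable chain: rain -> wet_grass -> slippery.
variables = ["rain", "wet_grass", "slippery"]
cardinalities = {"rain": 2, "wet_grass": 2, "slippery": 2}
dag = [("rain", "wet_grass"), ("wet_grass", "slippery")]

# ASSUMED layout (not confirmed by the documentation):
#   (child,)        -> marginal distribution of a root variable
#   (parent, child) -> one row per parent state; each row is a
#                      distribution over the child's states
conditional_probs = {
    ("rain",): [0.8, 0.2],
    ("rain", "wet_grass"): [[0.9, 0.1], [0.2, 0.8]],
    ("wet_grass", "slippery"): [[0.95, 0.05], [0.3, 0.7]],
}


def topological_order(variables, dag):
    """Return the variables ordered so that parents precede children."""
    parents = {v: [] for v in variables}
    for parent, child in dag:
        parents[child].append(parent)
    order, placed = [], set()
    while len(order) < len(variables):
        ready = [
            v for v in variables
            if v not in placed and all(p in placed for p in parents[v])
        ]
        if not ready:
            raise ValueError("edge list contains a cycle, so it is not a DAG")
        for v in ready:
            order.append(v)
            placed.add(v)
    return order


def sample_rows(variables, cardinalities, dag, conditional_probs, n_gen, seed=42):
    """Draw n_gen joint samples; handles at most one parent per variable."""
    rng = random.Random(seed)  # fixed seed makes the draws reproducible
    order = topological_order(variables, dag)
    parent_of = {child: parent for parent, child in dag}
    rows = []
    for _ in range(n_gen):
        row = {}
        for var in order:
            if var in parent_of:
                parent = parent_of[var]
                # Pick the row of the table that matches the parent's state.
                probs = conditional_probs[(parent, var)][row[parent]]
            else:
                probs = conditional_probs[(var,)]
            row[var] = rng.choices(range(cardinalities[var]), weights=probs)[0]
        rows.append(row)
    return rows


rows = sample_rows(variables, cardinalities, dag, conditional_probs, n_gen=5)
print(rows[0])
```

Sampling in topological order guarantees that a parent's state is known before its child is drawn, which is why the edge list must be acyclic.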

__init__(variables: List[str], cardinalities: Dict[str, int], dag: List[Tuple[str, str]], conditional_probs: Dict[Tuple[str, str] | Tuple[str], ndarray | list], root: str | None = None, seed: int = 42, n_gen: int = 10000, target_variable: str | None = None, latent_variables: List[str] | None = None, concept_subset: list | None = None, label_descriptions: dict | None = None, autoencoder_kwargs: dict | None = None, **kwargs)

Methods

__init__(variables, cardinalities, dag, ...)

add_exogenous(name, value[, convert_precision])

add_scaler(key, scaler)

Add a scaler for preprocessing a specific tensor.

build()

Build processed dataset from raw files.

collate(samples)

Collate samples into a batch, re-annotating the ground-truth concepts.

download()

Download raw data files to root directory.

load()

Load and optionally preprocess dataset.

load_raw()

Load raw processed files.

maybe_build()

maybe_download()

remove_exogenous(name)

set_concepts(concepts)

Set concept annotations for the dataset.

set_graph(graph)

Set the adjacency matrix of the causal graph between concepts as a pandas DataFrame.
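As an illustration of the adjacency-matrix form mentioned for set_graph, the sketch below converts a (parent, child) edge list into a pandas DataFrame. The helper edges_to_adjacency and the variable names are hypothetical, and the orientation (rows as parents, columns as children) is an assumption; check it against the library before passing such a frame to set_graph.

```python
import pandas as pd

# Hypothetical variables and edges, in the same shape as the
# `variables` and `dag` constructor arguments.
variables = ["rain", "wet_grass", "slippery"]
dag = [("rain", "wet_grass"), ("wet_grass", "slippery")]


def edges_to_adjacency(variables, dag):
    """Build a 0/1 adjacency matrix with rows as parents, columns as children."""
    adjacency = pd.DataFrame(0, index=variables, columns=variables, dtype=int)
    for parent, child in dag:
        # ASSUMED orientation: entry [parent, child] marks the edge parent -> child.
        adjacency.loc[parent, child] = 1
    return adjacency


graph = edges_to_adjacency(variables, dag)
print(graph)
```

Labelling both axes with the variable names keeps the matrix readable and lets edges be looked up by name instead of by position.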

Attributes

annotations

Annotations for the concepts in the dataset.

concept_names

List of concept names in the dataset.

exogenous

Mapping of dataset's exogenous variables.

graph

Adjacency matrix of the causal graph between concepts.

has_concepts

Whether the dataset has concept annotations.

has_exogenous

Whether the dataset has exogenous information.

n_concepts

Number of concepts in the dataset.

n_exogenous

Number of exogenous variables in the dataset.

n_features

Shape of features in dataset's input (excluding number of samples).

n_samples

Number of samples in the dataset.

processed_filenames

List of processed filenames that will be created during build step.

processed_paths

The absolute paths of the processed files that must be present in order to skip building.

raw_filenames

List of raw filenames that must be present to skip downloading.

raw_paths

The absolute paths of the raw files that must be present in order to skip downloading.

root_dir

shape

Shape of the input tensor.