Structural Equation Models

PyC can be used to build interpretable concept-based causal models and perform causal inference.

Warning

This API is still under development and interfaces might change in future releases.

Design principles

Structural Equation Models

PyC can be used to design Structural Equation Models (SEMs), where:

ExogenousVariable and EndogenousVariable objects represent random variables in the SEM. Variables are defined by their name, parents, and distribution type. For example, in this guide we define variables as:

exogenous_var = ExogenousVariable(
    "exogenous",
    parents=[],
    distribution=RelaxedBernoulli
)
genotype_var = EndogenousVariable(
    "genotype",
    parents=["exogenous"],
    distribution=RelaxedBernoulli
)

ParametricCPD objects represent the structural equations (causal mechanisms) between variables in the SEM and are parameterized by PyC or PyTorch modules. For example:

genotype_cpd = ParametricCPD(
    "genotype",
    parametrization=torch.nn.Sequential(
        torch.nn.Linear(1, 1),
        torch.nn.Sigmoid()
    )
)

ProbabilisticModel objects collect all variables and CPDs to define the full SEM. For example (exogenous_cpd is defined in the detailed guide below):

sem_model = ProbabilisticModel(
    variables=[exogenous_var, genotype_var],
    parametric_cpds=[exogenous_cpd, genotype_cpd]
)
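
To make the roles of these three objects concrete, here is a library-independent sketch in plain Python. It does not use the PyC API: the `mechanisms` dictionary, the `evaluate` helper, and the fixed weight and bias are hypothetical stand-ins. Each entry pairs a variable (its name and parents) with its structural equation, and the dictionary as a whole plays the role of the model.

```python
import math


def sigmoid(z):
    """Logistic function, standing in for torch.nn.Sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


# name -> (parents, structural equation).
# Insertion order is assumed to be a valid causal order (parents first).
mechanisms = {
    "exogenous": ([], lambda u: sigmoid(u)),
    # Hypothetical fixed weight (2.0) and bias (-1.0) instead of a learned Linear(1, 1)
    "genotype": (["exogenous"], lambda e: sigmoid(2.0 * e - 1.0)),
}


def evaluate(mechanisms, root_input):
    """Compute every variable from its parents, roots from the external input."""
    values = {}
    for name, (parents, equation) in mechanisms.items():
        if parents:
            args = [values[p] for p in parents]
        else:
            args = [root_input[name]]
        values[name] = equation(*args)
    return values


values = evaluate(mechanisms, {"exogenous": 0.0})
print(values)
```

The point of the sketch is the separation of concerns: the graph is given by the parent lists, the mechanisms by the callables, and evaluation simply walks the variables in causal order.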

Interventions

Interventions allow us to estimate causal effects. For instance, a do-intervention sets specific variables to fixed values so that we can observe the effect on downstream variables, simulating a randomized controlled trial.

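To perform a do-intervention, use the DoIntervention strategy and the intervention context manager. For example, to set smoking to 0 (prevent smoking) and query the effect on downstream variables. The snippet assumes that an inference_engine has already been built for sem_model (for instance with the AncestralSamplingInference class imported in the detailed guide) and that initial_input is the exogenous input defined there: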

# Intervention: Force smoking to 0 (prevent smoking)
smoking_strategy_0 = DoIntervention(
    model=sem_model.parametric_cpds,
    constants=0.0
)

with intervention(
    policies=UniformPolicy(out_features=1),
    strategies=smoking_strategy_0,
    target_concepts=["smoking"]
):
    intervened_results_0 = inference_engine.query(
        query_concepts=["genotype", "smoking", "tar", "cancer"],
        evidence=initial_input
    )
    # Results reflect the effect of setting smoking=0

You can use these interventional results to estimate causal effects, such as the Average Causal Effect (ACE), as shown in later steps of this guide.
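
As a rough illustration of what such an estimate involves, the sketch below computes the ACE of smoking on cancer, taken here as the standard difference E[cancer | do(smoking=1)] - E[cancer | do(smoking=0)], in plain Python. It does not call PyC (in particular, it says nothing about how `cace_score` is invoked). The `simulate` helper and the fixed genotype weights are hypothetical; the smoking, tar, and cancer equations mirror the threshold rules used in the detailed guide.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(noise, do_smoking=None):
    """Evaluate the causal chain for one noise draw.

    If do_smoking is given, it replaces the smoking equation (a do-intervention).
    """
    exogenous = sigmoid(noise)
    genotype = sigmoid(4.0 * exogenous - 2.0)  # hypothetical fixed weights
    if do_smoking is None:
        smoking = float(genotype > 0.5)  # observational mechanism
    else:
        smoking = float(do_smoking)  # mechanism cut, value forced
    tar = float(genotype > 0.5 or smoking > 0.5)
    cancer = tar
    return cancer


rng = random.Random(0)  # arbitrary seed, only for reproducibility
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Apply both interventions to the same noise draws, as in a paired experiment
cancer_do_1 = [simulate(u, do_smoking=1.0) for u in noise]
cancer_do_0 = [simulate(u, do_smoking=0.0) for u in noise]

ace = sum(cancer_do_1) / len(noise) - sum(cancer_do_0) / len(noise)
print(f"ACE of smoking on cancer: {ace:.3f}")
```

Because tar is a logical OR of genotype and smoking, forcing smoking to 1 makes cancer equal to 1 for every sample, so in this toy setup the ACE reduces to the fraction of samples whose genotype alone does not trigger tar.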

Detailed Guides

Structural Equation Models

Import Libraries

Start by importing the PyC and PyTorch libraries:

import torch
from torch.distributions import RelaxedBernoulli
import torch_concepts as pyc
from torch_concepts import EndogenousVariable, ExogenousVariable
from torch_concepts.nn import ParametricCPD, ProbabilisticModel
from torch_concepts.nn import AncestralSamplingInference
from torch_concepts.nn import CallableCC, UniformPolicy, DoIntervention, intervention
from torch_concepts.nn.functional import cace_score

Create Sample Data

n_samples = 1000

# Create exogenous input (noise/unobserved confounders)
initial_input = {'exogenous': torch.randn((n_samples, 1))}

Define Variables and Causal Structure

In Structural Equation Models, we distinguish between exogenous (external) and endogenous (internal) variables. Each variable is defined by its name, parents, and distribution type. By specifying parents, we define the causal graph structure.

# Define exogenous variable (external noise/confounders)
exogenous_var = ExogenousVariable(
    "exogenous",
    parents=[],
    distribution=RelaxedBernoulli
)

# Define endogenous variables (causal chain)
genotype_var = EndogenousVariable(
    "genotype",
    parents=["exogenous"],
    distribution=RelaxedBernoulli
)

smoking_var = EndogenousVariable(
    "smoking",
    parents=["genotype"],
    distribution=RelaxedBernoulli
)

tar_var = EndogenousVariable(
    "tar",
    parents=["genotype", "smoking"],
    distribution=RelaxedBernoulli
)

cancer_var = EndogenousVariable(
    "cancer",
    parents=["tar"],
    distribution=RelaxedBernoulli
)
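
Since the parents fully determine the causal graph, one quick sanity check is to confirm that the declared structure is acyclic and to see the order in which variables have to be evaluated. The sketch below does this with the standard library only; the `parents` dictionary is a hand-written copy of the declarations above, not something read back from the PyC objects.

```python
from graphlib import TopologicalSorter

# Each variable mapped to the parents declared above
parents = {
    "exogenous": [],
    "genotype": ["exogenous"],
    "smoking": ["genotype"],
    "tar": ["genotype", "smoking"],
    "cancer": ["tar"],
}

# TopologicalSorter expects node -> predecessors, which is exactly a parents map.
# static_order() raises graphlib.CycleError if the graph contains a cycle.
order = list(TopologicalSorter(parents).static_order())
print(order)
# ['exogenous', 'genotype', 'smoking', 'tar', 'cancer']
```

For this graph the order is unique because the variables form a single chain, with the extra genotype-to-tar edge consistent with it.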

Define ParametricCPDs

ParametricCPDs define the structural equations (causal mechanisms) between variables. We can use PyC or PyTorch modules to parameterize these CPDs. In particular, PyC provides CallableCC to define structural equations from arbitrary callables.

# CPD for exogenous variable (no parents)
exogenous_cpd = ParametricCPD(
    "exogenous",
    parametrization=torch.nn.Sigmoid()
)

# CPD for genotype (depends on exogenous noise)
genotype_cpd = ParametricCPD(
    "genotype",
    parametrization=torch.nn.Sequential(
        torch.nn.Linear(1, 1),
        torch.nn.Sigmoid()
    )
)

# CPD for smoking (depends on genotype)
smoking_cpd = ParametricCPD(
    ["smoking"],
    parametrization=CallableCC(
        lambda x: (x > 0.5).float(),
        use_bias=False
    )
)

# CPD for tar (depends on genotype and smoking)
tar_cpd = ParametricCPD(
    "tar",
    parametrization=CallableCC(
        lambda x: torch.logical_or(x[:, 0] > 0.5, x[:, 1] > 0.5).float().unsqueeze(-1),
        use_bias=False
    )
)

# CPD for cancer (depends on tar)
cancer_cpd = ParametricCPD(
    "cancer",
    parametrization=CallableCC(
        lambda x: x,
        use_bias=False
    )
)
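
To see what the callables above compute, they can be applied directly to small tensors in plain PyTorch, outside of PyC. The sketch assumes that the values of a variable's parents arrive concatenated along the last dimension in the declared parent order, which is what the column indexing in the tar callable suggests; how CallableCC actually assembles its input is not verified here.

```python
import torch

# Same callables as in the smoking and tar CPDs above
smoking_fn = lambda x: (x > 0.5).float()
tar_fn = lambda x: torch.logical_or(x[:, 0] > 0.5, x[:, 1] > 0.5).float().unsqueeze(-1)

# Two samples with a single parent (genotype), shape (2, 1)
genotype = torch.tensor([[0.2], [0.9]])
smoking = smoking_fn(genotype)

# Three samples with two parents, columns assumed to be [genotype, smoking]
tar_input = torch.tensor([
    [0.2, 0.0],  # neither parent active
    [0.2, 1.0],  # smoking only
    [0.9, 0.0],  # genotype only
])
tar = tar_fn(tar_input)

print(smoking.tolist())  # [[0.0], [1.0]]
print(tar.tolist())      # [[0.0], [1.0], [1.0]]
```

The `unsqueeze(-1)` in the tar callable restores the trailing feature dimension that the column indexing removes, so every variable keeps the shape (batch, 1).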

Build Structural Equation Model

Combine all variables and CPDs into a probabilistic model:

# Create the structural equation model
sem_model = ProbabilisticModel(
    variables=[exogenous_var, genotype_var, smoking_var, tar_var, cancer_var],
    parametric_cpds=[exogenous_cpd, genotype_cpd, smoking_cpd, tar_cpd, cancer_cpd]
)
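
For intuition about what this model encodes, the following sketch runs one deterministic pass through the same chain using plain PyTorch. It is not the PyC inference path: it skips sampling from the declared RelaxedBernoulli distributions, concatenates the parents of tar by hand, and uses a freshly initialised linear layer, so the actual numbers are arbitrary.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
noise = torch.randn(1000, 1)  # same shape as initial_input["exogenous"]

# Same module structure as the genotype CPD; weights are randomly initialised
genotype_net = torch.nn.Sequential(torch.nn.Linear(1, 1), torch.nn.Sigmoid())

with torch.no_grad():
    exogenous = torch.sigmoid(noise)
    genotype = genotype_net(exogenous)
    smoking = (genotype > 0.5).float()
    # Assumption: parents of tar are stacked as columns [genotype, smoking]
    tar_input = torch.cat([genotype, smoking], dim=-1)
    tar = torch.logical_or(tar_input[:, 0] > 0.5, tar_input[:, 1] > 0.5).float().unsqueeze(-1)
    cancer = tar  # identity mechanism

for name, value in [("genotype", genotype), ("smoking", smoking), ("tar", tar), ("cancer", cancer)]:
    print(name, tuple(value.shape), round(value.mean().item(), 3))
```

Without an intervention, smoking is itself a threshold on genotype, so tar (and therefore cancer) coincides with smoking for every sample. The two only come apart once smoking is set independently of genotype.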

Next Steps

Explore the full Mid-Level API documentation
Compare with Probabilistic Models for standard probabilistic inference
Try the High-Level API for out-of-the-box models