Preprocessing

This module provides preprocessing utilities including autoencoder-based feature extraction.

Summary

Autoencoder Classes

`SimpleAutoencoder`	Simple feedforward autoencoder for dimensionality reduction.
`AutoencoderTrainer`	Trainer class for autoencoder models with early stopping.

Preprocessing Functions

`extract_embs_from_autoencoder`	Extract embeddings from a pandas DataFrame using an autoencoder.

Class Documentation

class SimpleAutoencoder(input_shape, latent_dim)

Bases: Module

Simple feedforward autoencoder for dimensionality reduction.

A standard autoencoder with encoder and decoder networks using ReLU activations. Useful for preprocessing high-dimensional concept spaces.

encoder

Encoder network.

Type: nn.Sequential

decoder

Decoder network.

Type: nn.Sequential

Parameters:

input_shape – Number of input features.
latent_dim – Dimension of the latent space.

Example

>>> import torch
>>> from torch_concepts.data.preprocessing.autoencoder import SimpleAutoencoder
>>>
>>> # Create autoencoder
>>> autoencoder = SimpleAutoencoder(input_shape=784, latent_dim=32)
>>>
>>> # Forward pass
>>> x = torch.randn(4, 784)
>>> encoded, decoded = autoencoder(x)
>>> print(f"Encoded shape: {encoded.shape}")
Encoded shape: torch.Size([4, 32])
>>> print(f"Decoded shape: {decoded.shape}")
Decoded shape: torch.Size([4, 784])

forward(x)

Forward pass through the autoencoder.

Parameters:

x – Input tensor of shape (batch_size, input_shape).

Returns:

(encoded, decoded) where

encoded has shape (batch_size, latent_dim)
decoded has shape (batch_size, input_shape)

Return type:

Tuple[torch.Tensor, torch.Tensor]
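The encoder/decoder layout and the (encoded, decoded) return contract described above can be sketched as follows. This is a minimal illustration, not the library's source: the class name `TinyAutoencoder`, the single hidden layer, and its width of 64 are assumptions made for the example.

```python
import torch
from torch import nn


class TinyAutoencoder(nn.Module):
    """Hypothetical stand-in that mirrors the documented interface."""

    def __init__(self, input_shape: int, latent_dim: int, hidden: int = 64):
        super().__init__()
        # Encoder: compress input_shape features down to latent_dim.
        self.encoder = nn.Sequential(
            nn.Linear(input_shape, hidden),
            nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )
        # Decoder: map the latent code back to input_shape features.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, input_shape),
        )

    def forward(self, x: torch.Tensor):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded


model = TinyAutoencoder(input_shape=784, latent_dim=32)
encoded, decoded = model(torch.randn(4, 784))
```

Returning both tensors from one call lets a training loop compute the reconstruction loss from `decoded` while reusing `encoded` as the embedding.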

class AutoencoderTrainer(input_shape: int, noise: float = 0.0, latent_dim: int = 32, lr: float = 0.0005, epochs: int = 2000, batch_size: int = 512, patience: int = 50, device=None)

Bases: object

Trainer class for autoencoder models with early stopping.

Provides a training loop, early stopping, and latent representation extraction for autoencoder models.

model

The autoencoder model.

Type: SimpleAutoencoder

criterion

Reconstruction loss function.

Type: nn.MSELoss

optimizer

Optimizer for training.

Type: optim.Adam

device

Device to train on (‘cpu’ or ‘cuda’).

Type: str

Parameters:

input_shape – Number of input features.
noise – Noise level to add to latent representations (default: 0.0).
latent_dim – Dimension of latent space (default: 32).
lr – Learning rate (default: 0.0005).
epochs – Maximum training epochs (default: 2000).
batch_size – Batch size for training (default: 512).
patience – Early stopping patience in epochs (default: 50).
device – Device to use for training (default: None, documented as falling back to ‘cpu’).

Example

>>> import torch
>>> from torch_concepts.data.preprocessing.autoencoder import AutoencoderTrainer
>>>
>>> # Create synthetic data
>>> data = torch.randn(1000, 100)
>>>
>>> # Create and train autoencoder
>>> trainer = AutoencoderTrainer(
...     input_shape=100,
...     latent_dim=16,
...     epochs=100,
...     batch_size=64,
...     device='cpu'
... )
>>>
>>> # Train
>>> trainer.train(data)
Autoencoder training started...
>>>
>>> # Extract latent representations
>>> latent = trainer.extract_latent()
>>> print(latent.shape)
torch.Size([1000, 16])

train(dataset)

Train the autoencoder on the given dataset.

Implements a training loop with MSE reconstruction loss and early stopping based on validation loss.

Parameters:

dataset – PyTorch dataset or tensor to train on.
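The patience-based early stopping mentioned above can be illustrated in isolation from any model. The sketch below is an assumption about the general technique, not the library's exact rule: the helper `early_stopping_run` is hypothetical, it treats only a strict decrease in loss as an improvement, and it stops once `patience` consecutive epochs pass without one.

```python
def early_stopping_run(losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Patience exhausted: stop training at this epoch.
                return best_epoch, epoch
    # The loss sequence ended before patience ran out.
    return best_epoch, len(losses) - 1


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.69]
best_epoch, stop_epoch = early_stopping_run(losses, patience=3)
print(best_epoch, stop_epoch)  # prints "2 5"
```

With `patience=3` the run stops at epoch 5 and never sees the lower loss at epoch 6, which is why the documented default of 50 is fairly forgiving for noisy loss curves.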

extract_latent()

Extract latent representations from the trained autoencoder.

Uses the best model weights (lowest reconstruction loss) to encode the entire dataset. Noise is added to the latent representations when the trainer's noise level is nonzero.

Returns:

Latent representations of shape (n_samples, latent_dim).

Return type:

torch.Tensor

Example

>>> # After training
>>> latent = trainer.extract_latent()
>>> print(latent.shape)
torch.Size([1000, 16])
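The documentation does not say how noise is applied to the latent representations. One common choice, shown here purely as an assumption, is zero-mean Gaussian noise scaled by the noise level; the helper `add_latent_noise` is hypothetical and is not part of the library.

```python
import torch


def add_latent_noise(latent: torch.Tensor, noise: float) -> torch.Tensor:
    """Hypothetical helper: perturb latent codes with scaled Gaussian noise."""
    if noise <= 0:
        # A noise level of 0.0 leaves the embeddings untouched.
        return latent
    # Assumption: zero-mean Gaussian noise with standard deviation `noise`.
    return latent + noise * torch.randn_like(latent)


latent = torch.zeros(1000, 16)
clean = add_latent_noise(latent, noise=0.0)
noisy = add_latent_noise(latent, noise=0.1)
```

Under this assumption the output keeps the shape (n_samples, latent_dim), so downstream code does not need to know whether noise was applied.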

Function Documentation

extract_embs_from_autoencoder(df, autoencoder_kwargs={})

Extract embeddings from a pandas DataFrame using an autoencoder.

Convenience function that trains an autoencoder on tabular data and returns the learned latent representations.

Parameters:

df – Input pandas DataFrame.
autoencoder_kwargs – Dictionary of keyword arguments for AutoencoderTrainer. Can include ‘device’ to specify training device (default: ‘cpu’).

Returns:

Latent representations of shape (n_samples, latent_dim).

Return type:

torch.Tensor

Example

>>> import pandas as pd
>>> import torch
>>> from torch_concepts.data.preprocessing.autoencoder import extract_embs_from_autoencoder
>>>
>>> # Create sample DataFrame
>>> df = pd.DataFrame(torch.randn(100, 50).numpy())
>>>
>>> # Extract embeddings
>>> embeddings = extract_embs_from_autoencoder(
...     df,
...     autoencoder_kwargs={
...         'latent_dim': 10,
...         'epochs': 50,
...         'batch_size': 32,
...         'noise': 0.1,
...         'device': 'cpu'  # or 'cuda' if desired
...     }
... )
>>> print(embeddings.shape)
torch.Size([100, 10])
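The example above does not pass `input_shape`, which suggests the function derives it from the DataFrame before building the trainer. A minimal sketch of that preparation step, assuming a purely numeric DataFrame and a float32 conversion (both assumptions, not confirmed by the documentation), might look like this:

```python
import pandas as pd
import torch

# Sample numeric DataFrame, as in the example above.
df = pd.DataFrame(torch.randn(100, 50).numpy())

# Assumption: the table is converted to a float32 tensor before training.
data = torch.tensor(df.to_numpy(), dtype=torch.float32)

# Assumption: the number of columns is used as the trainer's input_shape.
input_shape = data.shape[1]
```

Non-numeric columns would need to be encoded or dropped before a conversion like this can succeed.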