Preprocessing

This module provides preprocessing utilities including autoencoder-based feature extraction.

Summary

Autoencoder Classes

SimpleAutoencoder

Simple feedforward autoencoder for dimensionality reduction.

AutoencoderTrainer

Trainer class for autoencoder models with early stopping.

Preprocessing Functions

extract_embs_from_autoencoder

Extract embeddings from a pandas DataFrame using an autoencoder.

Class Documentation

class SimpleAutoencoder(input_shape, latent_dim)

Bases: Module

Simple feedforward autoencoder for dimensionality reduction.

A standard autoencoder with encoder and decoder networks using ReLU activations. Useful for preprocessing high-dimensional concept spaces.

encoder

Encoder network.

Type:

nn.Sequential

decoder

Decoder network.

Type:

nn.Sequential

Parameters:
  • input_shape – Number of input features.

  • latent_dim – Dimension of the latent space.

Example

>>> import torch
>>> from torch_concepts.data.preprocessing.autoencoder import SimpleAutoencoder
>>>
>>> # Create autoencoder
>>> autoencoder = SimpleAutoencoder(input_shape=784, latent_dim=32)
>>>
>>> # Forward pass
>>> x = torch.randn(4, 784)
>>> encoded, decoded = autoencoder(x)
>>> print(f"Encoded shape: {encoded.shape}")
Encoded shape: torch.Size([4, 32])
>>> print(f"Decoded shape: {decoded.shape}")
Decoded shape: torch.Size([4, 784])

forward(x)

Forward pass through the autoencoder.

Parameters:

x – Input tensor of shape (batch_size, input_shape).

Returns:

(encoded, decoded) where
  • encoded has shape (batch_size, latent_dim)

  • decoded has shape (batch_size, input_shape)

Return type:

Tuple[torch.Tensor, torch.Tensor]
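
To make the encoder/decoder structure and the two-tensor return value concrete, here is a minimal sketch of a feedforward autoencoder with ReLU activations. The class name TinyAutoencoder, the hidden_dim parameter, and the layer sizes are assumptions made for illustration; the actual layers inside SimpleAutoencoder may differ.

```python
import torch
from torch import nn


class TinyAutoencoder(nn.Module):
    """Hypothetical stand-in; layer sizes are illustrative, not the library's."""

    def __init__(self, input_shape: int, latent_dim: int, hidden_dim: int = 64):
        super().__init__()
        # Encoder: input features -> hidden layer -> latent code
        self.encoder = nn.Sequential(
            nn.Linear(input_shape, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        # Decoder: latent code -> hidden layer -> reconstructed features
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_shape),
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded


model = TinyAutoencoder(input_shape=784, latent_dim=32)
x = torch.randn(4, 784)
encoded, decoded = model(x)

# Reconstruction error between the input and its decoded version
loss = nn.functional.mse_loss(decoded, x)
```

Returning both tensors lets a caller compute the reconstruction loss from decoded while keeping encoded as the reduced representation.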

class AutoencoderTrainer(input_shape: int, noise: float = 0.0, latent_dim: int = 32, lr: float = 0.0005, epochs: int = 2000, batch_size: int = 512, patience: int = 50, device=None)

Bases: object

Trainer class for autoencoder models with early stopping.

Provides training loop, early stopping, and latent representation extraction for autoencoder models.

model

The autoencoder model.

Type:

SimpleAutoencoder

criterion

Reconstruction loss function.

Type:

nn.MSELoss

optimizer

Optimizer for training.

Type:

optim.Adam

device

Device to train on (‘cpu’ or ‘cuda’).

Type:

str

Parameters:
  • input_shape – Number of input features.

  • noise – Noise level to add to latent representations (default: 0.0).

  • latent_dim – Dimension of latent space (default: 32).

  • lr – Learning rate (default: 0.0005).

  • epochs – Maximum training epochs (default: 2000).

  • batch_size – Batch size for training (default: 512).

  • patience – Early stopping patience in epochs (default: 50).

  • device – Device to use for training (signature default: None; documented default device: ‘cpu’).

Example

>>> import torch
>>> from torch_concepts.data.preprocessing.autoencoder import AutoencoderTrainer
>>>
>>> # Create synthetic data
>>> data = torch.randn(1000, 100)
>>>
>>> # Create and train autoencoder
>>> trainer = AutoencoderTrainer(
...     input_shape=100,
...     latent_dim=16,
...     epochs=100,
...     batch_size=64,
...     device='cpu'
... )
>>>
>>> # Train
>>> trainer.train(data)
Autoencoder training started...
>>>
>>> # Extract latent representations
>>> latent = trainer.extract_latent()
>>> print(latent.shape)
torch.Size([1000, 16])

train(dataset)

Train the autoencoder on the given dataset.

Implements training loop with MSE reconstruction loss and early stopping based on validation loss.

Parameters:

dataset – PyTorch dataset or tensor to train on.
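
The patience-based early stopping described above can be sketched in plain Python. This is an illustrative outline, not the trainer's actual loop: the helper name early_stopping_epoch is hypothetical, and the rule assumed here (stop once the monitored loss has failed to improve for patience consecutive epochs) may differ in detail from the library's.

```python
def early_stopping_epoch(losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch losses.

    Hypothetical helper: stops once the loss has not improved for
    `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran through every epoch without triggering early stopping
    return len(losses) - 1, best_epoch


# Made-up loss values, for illustration only
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.5]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

With these made-up losses and patience=3, the loop stops before reaching the final, lower loss. A larger patience trades longer training for less risk of stopping on a temporary plateau.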

extract_latent()

Extract latent representations from the trained autoencoder.

Uses the best model weights (lowest reconstruction loss) to encode the entire dataset. Optionally adds noise to the latent representations, as controlled by the noise parameter.

Returns:

Latent representations of shape (n_samples, latent_dim).

Return type:

torch.Tensor

Example

>>> # After training
>>> latent = trainer.extract_latent()
>>> print(latent.shape)
torch.Size([1000, 16])
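
One plausible way to apply a noise level to latent representations is additive zero-mean Gaussian noise scaled by that level. The sketch below assumes this form; the noise distribution the trainer actually uses is not stated here, and the zero tensor is only a stand-in for real encoder output.

```python
import torch

torch.manual_seed(0)

latent = torch.zeros(1000, 16)  # stand-in for encoder output
noise = 0.1  # hypothetical noise level

# Assumption: zero-mean Gaussian noise scaled by `noise`
noisy_latent = latent + noise * torch.randn_like(latent)

# With a noise level of 0.0 the representations are left unchanged
unchanged = latent + 0.0 * torch.randn_like(latent)
```

Under this assumption, the default noise level of 0.0 leaves the latent representations exactly as encoded.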

Function Documentation

extract_embs_from_autoencoder(df, autoencoder_kwargs={})

Extract embeddings from a pandas DataFrame using an autoencoder.

Convenience function that trains an autoencoder on tabular data and returns the learned latent representations.

Parameters:
  • df – Input pandas DataFrame.

  • autoencoder_kwargs – Dictionary of keyword arguments for AutoencoderTrainer. Can include ‘device’ to specify training device (default: ‘cpu’).

Returns:

Latent representations of shape (n_samples, latent_dim).

Return type:

torch.Tensor

Example

>>> import pandas as pd
>>> import torch
>>> from torch_concepts.data.preprocessing.autoencoder import extract_embs_from_autoencoder
>>>
>>> # Create sample DataFrame
>>> df = pd.DataFrame(torch.randn(100, 50).numpy())
>>>
>>> # Extract embeddings
>>> embeddings = extract_embs_from_autoencoder(
...     df,
...     autoencoder_kwargs={
...         'latent_dim': 10,
...         'epochs': 50,
...         'batch_size': 32,
...         'noise': 0.1,
...         'device': 'cpu'  # or 'cuda' if desired
...     }
... )
>>> print(embeddings.shape)
torch.Size([100, 10])
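
Before tabular data can be fed to an autoencoder, it has to become a float tensor whose second dimension matches input_shape. The sketch below shows one way to do that conversion with pandas and PyTorch. It illustrates the idea only; whether extract_embs_from_autoencoder performs the conversion this way is an assumption, and the column names are made up.

```python
import pandas as pd
import torch

# Small DataFrame with mixed integer and float columns (illustrative data)
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# Cast all columns to float32 and wrap the array in a tensor
data = torch.tensor(df.to_numpy(dtype="float32"))

# The number of columns is the feature count an autoencoder would expect
input_shape = data.shape[1]
```

Non-numeric columns would need to be encoded or dropped before a conversion like this, since the cast to float32 fails on strings.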