Preprocessing
This module provides preprocessing utilities including autoencoder-based feature extraction.
Summary
Autoencoder Classes
SimpleAutoencoder | Simple feedforward autoencoder for dimensionality reduction.
AutoencoderTrainer | Trainer class for autoencoder models with early stopping.
Preprocessing Functions
extract_embs_from_autoencoder | Extract embeddings from a pandas DataFrame using an autoencoder.
Class Documentation
- class SimpleAutoencoder(input_shape, latent_dim)
Bases:
Module
Simple feedforward autoencoder for dimensionality reduction.
A standard autoencoder with encoder and decoder networks using ReLU activations. Useful for preprocessing high-dimensional concept spaces.
- encoder
Encoder network.
- Type:
nn.Sequential
- decoder
Decoder network.
- Type:
nn.Sequential
- Parameters:
input_shape – Number of input features.
latent_dim – Dimension of the latent space.
Example
>>> import torch
>>> from torch_concepts.data.preprocessing.autoencoder import SimpleAutoencoder
>>>
>>> # Create autoencoder
>>> autoencoder = SimpleAutoencoder(input_shape=784, latent_dim=32)
>>>
>>> # Forward pass
>>> x = torch.randn(4, 784)
>>> encoded, decoded = autoencoder(x)
>>> print(f"Encoded shape: {encoded.shape}")
Encoded shape: torch.Size([4, 32])
>>> print(f"Decoded shape: {decoded.shape}")
Decoded shape: torch.Size([4, 784])
- forward(x)
Forward pass through the autoencoder.
- Parameters:
x – Input tensor of shape (batch_size, input_shape).
- Returns:
- (encoded, decoded) where
encoded has shape (batch_size, latent_dim)
decoded has shape (batch_size, input_shape)
- Return type:
Tuple[torch.Tensor, torch.Tensor]
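The encoder/decoder layout described above can be sketched as follows. This is only an illustration, not the library's source: the class name TinyAutoencoder, the single hidden layer, and the hidden_dim parameter are assumptions made for the example. Only the nn.Sequential encoder/decoder attributes, the ReLU activations, and the (encoded, decoded) return value come from the documentation above.

```python
import torch
from torch import nn


class TinyAutoencoder(nn.Module):
    """Hypothetical stand-in for illustration; layer sizes are assumed."""

    def __init__(self, input_shape: int, latent_dim: int, hidden_dim: int = 64):
        super().__init__()
        # Encoder: input features -> hidden layer -> latent code.
        self.encoder = nn.Sequential(
            nn.Linear(input_shape, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        # Decoder: latent code -> hidden layer -> reconstructed features.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_shape),
        )

    def forward(self, x: torch.Tensor):
        # Return both the latent code and the reconstruction.
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded


model = TinyAutoencoder(input_shape=784, latent_dim=32)
encoded, decoded = model(torch.randn(4, 784))
```

Returning both tensors from forward lets a training loop compute the reconstruction loss on decoded while downstream code reuses encoded as the reduced representation.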
- class AutoencoderTrainer(input_shape: int, noise: float = 0.0, latent_dim: int = 32, lr: float = 0.0005, epochs: int = 2000, batch_size: int = 512, patience: int = 50, device=None)
Bases:
object
Trainer class for autoencoder models with early stopping.
Provides training loop, early stopping, and latent representation extraction for autoencoder models.
- model
The autoencoder model.
- Type:
SimpleAutoencoder
- criterion
Reconstruction loss function.
- Type:
nn.MSELoss
- optimizer
Optimizer for training.
- Type:
optim.Adam
- Parameters:
input_shape – Number of input features.
noise – Noise level to add to latent representations (default: 0.0).
latent_dim – Dimension of latent space (default: 32).
lr – Learning rate (default: 0.0005).
epochs – Maximum training epochs (default: 2000).
batch_size – Batch size for training (default: 512).
patience – Early stopping patience in epochs (default: 50).
device – Device to use for training (default: None, which the docstring describes as ‘cpu’).
Example
>>> import torch
>>> from torch_concepts.data.preprocessing.autoencoder import AutoencoderTrainer
>>>
>>> # Create synthetic data
>>> data = torch.randn(1000, 100)
>>>
>>> # Create and train autoencoder
>>> trainer = AutoencoderTrainer(
...     input_shape=100,
...     latent_dim=16,
...     epochs=100,
...     batch_size=64,
...     device='cpu'
... )
>>>
>>> # Train
>>> trainer.train(data)
Autoencoder training started...
>>>
>>> # Extract latent representations
>>> latent = trainer.extract_latent()
>>> print(latent.shape)
torch.Size([1000, 16])
- train(dataset)
Train the autoencoder on the given dataset.
Implements training loop with MSE reconstruction loss and early stopping based on validation loss.
- Parameters:
dataset – PyTorch dataset or tensor to train on.
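The patience-based early stopping mentioned above can be illustrated with a small pure-Python sketch. It is not the trainer's actual loop: the helper name early_stopping_epoch and the list of per-epoch losses are hypothetical, and which loss the trainer monitors is not shown in this documentation. The sketch only shows the general rule of stopping once the monitored loss has failed to improve for patience consecutive epochs.

```python
def early_stopping_epoch(losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch losses.

    Hypothetical helper for illustration; not part of the library.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # No improvement for `patience` epochs in a row: stop here.
                return epoch, best_epoch
    # Ran through every epoch without exhausting the patience budget.
    return len(losses) - 1, best_epoch


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.69]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # prints "5 2"
```

With patience=3 the loop stops at epoch 5 and never sees the later improvement to 0.69, which is the usual trade-off between training time and a small patience value.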
- extract_latent()
Extract latent representations from the trained autoencoder.
Uses the best model weights (lowest reconstruction loss) to encode the entire dataset. Optionally adds noise to latent representations.
- Returns:
Latent representations of shape (n_samples, latent_dim).
- Return type:
torch.Tensor
Example
>>> # After training
>>> latent = trainer.extract_latent()
>>> print(latent.shape)
torch.Size([1000, 16])
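How the optional noise is applied is not specified above. One common choice, shown here purely as an assumption, is additive Gaussian noise scaled by the noise level. The helper add_latent_noise is hypothetical and is not part of the library.

```python
import torch


def add_latent_noise(latent: torch.Tensor, noise: float) -> torch.Tensor:
    """Hypothetical helper: add zero-mean Gaussian noise with std `noise`."""
    if noise <= 0:
        # A noise level of 0.0 leaves the representations unchanged.
        return latent
    return latent + noise * torch.randn_like(latent)


torch.manual_seed(0)
latent = torch.zeros(1000, 16)  # stand-in for extracted latent codes
clean = add_latent_noise(latent, noise=0.0)
noisy = add_latent_noise(latent, noise=0.1)
print(noisy.shape)  # same shape as the input: torch.Size([1000, 16])
```

Under this assumption the shape of the latent representations is unchanged, and only their values are perturbed.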
Function Documentation
- extract_embs_from_autoencoder(df, autoencoder_kwargs={})
Extract embeddings from a pandas DataFrame using an autoencoder.
Convenience function that trains an autoencoder on tabular data and returns the learned latent representations.
- Parameters:
df – Input pandas DataFrame.
autoencoder_kwargs – Dictionary of keyword arguments for AutoencoderTrainer. Can include ‘device’ to specify training device (default: ‘cpu’).
- Returns:
Latent representations of shape (n_samples, latent_dim).
- Return type:
torch.Tensor
Example
>>> import pandas as pd
>>> import torch
>>> from torch_concepts.data.preprocessing.autoencoder import extract_embs_from_autoencoder
>>>
>>> # Create sample DataFrame
>>> df = pd.DataFrame(torch.randn(100, 50).numpy())
>>>
>>> # Extract embeddings
>>> embeddings = extract_embs_from_autoencoder(
...     df,
...     autoencoder_kwargs={
...         'latent_dim': 10,
...         'epochs': 50,
...         'batch_size': 32,
...         'noise': 0.1,
...         'device': 'cpu'  # or 'cuda' if desired
...     }
... )
>>> print(embeddings.shape)
torch.Size([100, 10])