Learning Meaningful Features in Machine Learning Models
The challenge of ensuring models learn underlying features of the data distribution, rather than simply interpolating or memorizing training points, is fundamental in machine learning. This issue relates to generalization, overfitting, and the bias-variance tradeoff.
Core Problem
When optimizing for metrics like least squared error:
$$ \text{LSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

models may interpolate between training points without capturing the true underlying structure, leading to:
- Overfitting
- Spurious correlations
- Brittleness to input perturbations
- Lack of interpretability
Why It Matters
- Improved generalization and robustness
- Enhanced interpretability and trustworthiness
- Greater efficiency in data and compute usage
Key Approaches
Regularization
- L1/L2 regularization: Adds a penalty $\lambda \sum_{i} |w_i|$ (L1) or $\lambda \sum_{i} w_i^2$ (L2) to the loss function (a sketch follows this list)
- Dropout: Randomly deactivates neurons during training
- Early stopping: Halts training when validation performance plateaus
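A minimal sketch of adding the L1/L2 penalties above to a training loss in PyTorch; the model, data, and $\lambda$ values are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
lam_l1, lam_l2 = 1e-4, 1e-3  # illustrative regularization strengths

def regularized_loss(x, y):
    mse = criterion(model(x), y)
    l1 = sum(w.abs().sum() for w in model.parameters())   # lambda * sum |w_i|
    l2 = sum((w ** 2).sum() for w in model.parameters())  # lambda * sum w_i^2
    return mse + lam_l1 * l1 + lam_l2 * l2

x, y = torch.randn(32, 10), torch.randn(32, 1)
regularized_loss(x, y).backward()
```

Both penalties shrink weights toward zero; L1 additionally drives some weights exactly to zero, yielding sparser and often more interpretable models.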
Data Augmentation
- Expands dataset with meaningful variations
- Examples: image rotations, text synonym replacement
- Helps models learn features invariant to these transformations, rather than features tied to incidental details of individual training points (see the sketch below)
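A minimal sketch using torchvision; the specific transforms and their parameters are illustrative choices, not prescriptions:

```python
from torchvision import transforms

# Label-preserving variations applied to each training image
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),    # small rotations
    transforms.RandomHorizontalFlip(p=0.5),   # mirror images
    transforms.ColorJitter(brightness=0.2),   # lighting variation
    transforms.ToTensor(),
])
```

Because the label is unchanged under these transforms, the model is pushed toward features that survive them.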
Architectural Choices
- Domain-informed architectures that build known structure into the model, e.g., energy-conserving (Hamiltonian) neural networks in physics
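A minimal sketch of one such architecture, a Hamiltonian neural network; the layer sizes and names are illustrative. The network learns a scalar energy $H(q, p)$ and predicts dynamics via Hamilton's equations, so the learned vector field conserves $H$ by construction:

```python
import torch
import torch.nn as nn

class HamiltonianNN(nn.Module):
    def __init__(self, dim=1, hidden=64):
        super().__init__()
        # Scalar energy function H(q, p), parameterized by an MLP
        self.H = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, q, p):
        qp = torch.cat([q, p], dim=-1).detach().requires_grad_(True)
        H = self.H(qp).sum()
        dH = torch.autograd.grad(H, qp, create_graph=True)[0]
        dHdq, dHdp = dH.chunk(2, dim=-1)
        return dHdp, -dHdq  # dq/dt = dH/dp, dp/dt = -dH/dq
```

Training fits the predicted time derivatives to observed ones; because the field is derived from a single scalar via Hamilton's equations, the learned energy is conserved along trajectories.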
Adversarial Training
- GANs: A generator and a discriminator are trained in opposition, each improving against the other
- Adversarial examples: Expose models to slightly perturbed inputs during training, improving robustness to small changes (see the FGSM sketch below)
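A minimal sketch of generating such inputs with the fast gradient sign method (FGSM); the `model`, `epsilon`, and input shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, epsilon=0.03):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    # Step each input in the direction that most increases the loss
    return (x_adv + epsilon * grad.sign()).detach()
```

Adversarial training then fits the model on a mix of clean inputs and `fgsm_example(model, x, y)` outputs, penalizing reliance on features that flip under tiny perturbations.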
Multi-task Learning
- Encourages learning of shared, fundamental features by training on multiple tasks simultaneously
- Transfer learning: Pre-train on large datasets, fine-tune for specific tasks (skills get “transferred”)
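A minimal sketch of hard parameter sharing, one common multi-task setup; the layer sizes and the two task heads are illustrative:

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=32, hidden=64):
        super().__init__()
        # Shared trunk: must learn features useful for both tasks
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, 10)  # e.g., classification head
        self.head_b = nn.Linear(hidden, 1)   # e.g., regression head

    def forward(self, x):
        z = self.trunk(x)
        return self.head_a(z), self.head_b(z)

logits, value = MultiTaskNet()(torch.randn(8, 32))
```

The total loss is a (possibly weighted) sum of the per-task losses, so gradients from both tasks shape the shared representation.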
Contrastive Learning
- Learn representations where $\text{Similarity}(f(x_i), f(x_j)) > \text{Similarity}(f(x_i), f(x_k))$ for related $x_i, x_j$ and unrelated $x_k$
- Similarity measures: Cosine similarity, Euclidean distance, etc.
- Examples: SimCLR, MoCo, CLIP
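A minimal sketch of an InfoNCE-style loss in the spirit of SimCLR, simplified in that it omits intra-view negatives; the `temperature` and batch construction are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(z_i, z_j, temperature=0.1):
    """z_i, z_j: (N, d) embeddings of two augmented views of the same N examples."""
    z_i, z_j = F.normalize(z_i, dim=1), F.normalize(z_j, dim=1)
    logits = z_i @ z_j.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(z_i.size(0))   # matching pairs lie on the diagonal
    # Pulls each positive pair (i, i) together, pushes (i, k != i) apart
    return F.cross_entropy(logits, targets)
```

This directly enforces the inequality above: similarity between related views must exceed similarity to every other example in the batch.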
Causal Learning
- Incorporate causal structure into models
- Invariant Risk Minimization: Learn features invariant across environments
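A minimal sketch of the IRMv1 penalty from Arjovsky et al. (2019); the environments, model, and loss choice are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, y):
    """logits, y: float tensors of the same shape for one environment."""
    # Gradient of the risk w.r.t. a dummy classifier scale; it is near zero
    # when the same classifier is optimal in this environment
    scale = torch.ones(1, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, y)
    return torch.autograd.grad(loss, scale, create_graph=True)[0].pow(2).sum()
```

The full objective sums, over environments $e$, the risk in $e$ plus $\lambda$ times this penalty, steering the model toward features whose optimal classifier is invariant across environments.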
Evaluation Strategies
- Out-of-distribution testing: Evaluate on data drawn from a different distribution than the training set (sketched after this list).
- Adversarial testing: Evaluate performance on adversarially perturbed inputs to expose brittle features.
- Interpretability methods: Use techniques like SHAP values or integrated gradients to understand which features the model relies on.
- Probing tasks: Design specific tasks to test if the model has learned particular concepts or features
- Few-shot learning evaluation: Test how well the model adapts to new tasks with limited data.
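A minimal sketch of out-of-distribution testing: fit on one input range, evaluate on a shifted one. The model choice and ranges are illustrative; noise is omitted for brevity:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
target = lambda x: np.sin(x).ravel()

x_train = rng.uniform(0, 5, size=(200, 1))   # training distribution
x_ood = rng.uniform(5, 10, size=(200, 1))    # shifted test distribution

model = LinearRegression().fit(x_train, target(x_train))
print("in-distribution R^2:", model.score(x_train, target(x_train)))
print("OOD R^2:            ", model.score(x_ood, target(x_ood)))  # typically far worse
```

A model that merely fit the training range collapses under the shift; one that captured the underlying structure degrades far more gracefully.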
Theoretical Perspectives
- Vapnik-Chervonenkis (VC) theory: Provides bounds on generalization error based on model complexity (a standard form of the bound appears after this list).
- Information Bottleneck Theory: Suggests that optimal representations balance compression of input with preservation of task-relevant information.
- Minimum Description Length (MDL) principle: Favors models that provide compact descriptions of the data. (“Occam’s razor” principle)
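For concreteness, one standard form of the VC bound: with probability at least $1 - \delta$ over a sample of size $n$, a hypothesis class of VC dimension $h$ satisfies

$$ R(f) \leq \hat{R}(f) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\delta}}{n}} $$

where $R(f)$ is the true risk and $\hat{R}(f)$ the empirical risk; richer model classes (larger $h$) pay a larger complexity penalty.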
Simple Example: Interpolation vs. Generalization
Consider a simple 1D regression problem:
$$ f(x) = \sin(x) + \epsilon $$

where $\epsilon$ is noise. A high-degree polynomial might perfectly interpolate training points but fail to capture the underlying sinusoidal pattern:
$$ \hat{f}(x) = a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0 $$

True generalization occurs when the model captures the underlying $\sin(x)$ structure, not just fitting the exact training points.
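A minimal sketch of this contrast in NumPy; the sample size, degrees, and noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 15))
y = np.sin(x) + 0.1 * rng.normal(size=x.size)  # f(x) = sin(x) + noise

x_test = np.linspace(0, 2 * np.pi, 200)
for degree in (5, 14):  # degree 14 interpolates all 15 points exactly
    fit = np.polyfit(x, y, deg=degree)
    test_mse = np.mean((np.polyval(fit, x_test) - np.sin(x_test)) ** 2)
    print(f"degree {degree:2d}: test MSE vs. sin(x) = {test_mse:.4f}")
```

The interpolating fit achieves near-zero training error yet typically oscillates wildly between points, while the lower-degree fit tracks the sinusoid far better.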
Technical Insight: The Bias-Variance Tradeoff
The challenge of learning meaningful features is closely related to the bias-variance tradeoff. For a model $f$ and target function $f^*$:
$$ E[(y - f(x))^2] = \text{Bias}[f(x)]^2 + \text{Var}[f(x)] + \sigma^2 $$
Where:
- $\text{Bias}[f(x)] = E[f(x)] - f^*(x)$
- $\text{Var}[f(x)] = E[(f(x) - E[f(x)])^2]$
- $\sigma^2$ is irreducible error
Models that interpolate perfectly between training points often have low bias but high variance, leading to poor generalization. The goal is to find the sweet spot that captures true underlying features.
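A minimal sketch estimating these terms empirically for the $\sin(x)$ example above by refitting on many resampled training sets; the sample counts and degrees are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 2 * np.pi, 50)
f_star = np.sin(x_test)  # the target function f*

def fit_predict(degree, n=15, sigma=0.1):
    x = np.sort(rng.uniform(0, 2 * np.pi, n))
    y = np.sin(x) + sigma * rng.normal(size=n)
    return np.polyval(np.polyfit(x, y, degree), x_test)

for degree in (1, 5, 14):
    preds = np.stack([fit_predict(degree) for _ in range(200)])
    bias2 = np.mean((preds.mean(axis=0) - f_star) ** 2)  # Bias[f(x)]^2, averaged over x
    var = np.mean(preds.var(axis=0))                     # Var[f(x)], averaged over x
    print(f"degree {degree:2d}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
```

Underfitting (degree 1) shows high bias and low variance; the interpolating fit (degree 14) shows the reverse; intermediate complexity tends to minimize their sum.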