Supervised vs Unsupervised Learning
- How supervised learning uses labelled data to train predictive models
- How unsupervised learning discovers hidden structure in data
- Where self-supervised learning fits and why it powers LLMs
- Practical decision framework for choosing the right paradigm
Every machine learning model learns from data. The difference between supervised and unsupervised learning comes down to one question: does the training data include the right answers?
This distinction isn't academic. It determines what problems you can solve, how much data preparation you need, and what kind of results you'll get. This guide covers both paradigms, plus the self-supervised approach that powers modern LLMs.
Supervised Learning: Learning From Examples
In supervised learning, the training data includes both inputs and their correct outputs (labels). The model learns to map inputs to outputs by finding patterns that generalise to new, unseen data.
Training Data (labelled):
┌──────────────────────────────┬───────────────┐
│ Input (features)             │ Label (truth) │
├──────────────────────────────┼───────────────┤
│ Deal size: $50k, Stage: Demo │ Won           │
│ Deal size: $10k, Stage: Qual │ Lost          │
│ Deal size: $80k, Stage: Neg  │ Won           │
│ Deal size: $5k, Stage: Cold  │ Lost          │
└──────────────────────────────┴───────────────┘
↓
Training Algorithm
↓
Model: f(features) → Won/Lost
↓
New input: Deal size: $60k, Stage: Neg → ?
Model predicts: Won (85% confidence)
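To make the flow above concrete, here is a minimal sketch of a supervised classifier trained on the four labelled deals. It uses logistic regression fitted by plain gradient descent, which is one possible choice of training algorithm, not necessarily the one behind the diagram. The ordinal stage encoding, the feature scaling, and the hyperparameters are all assumptions made for illustration, and the probability it prints will not match the illustrative 85% shown above.

```python
import math

# Hypothetical ordinal encoding of pipeline stages (an assumption for this sketch).
STAGE_ORDER = {"Cold": 0, "Qual": 1, "Demo": 2, "Neg": 3}

# Labelled training data from the table above: (deal size in $k, stage, label).
TRAINING_DATA = [
    (50, "Demo", "Won"),
    (10, "Qual", "Lost"),
    (80, "Neg", "Won"),
    (5, "Cold", "Lost"),
]


def featurise(deal_size_k, stage):
    """Scale both features to roughly the 0-1 range."""
    return [deal_size_k / 100.0, STAGE_ORDER[stage] / 3.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, epochs=5000, learning_rate=0.5):
    """Fit logistic regression weights with batch gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    examples = [
        (featurise(size, stage), 1.0 if label == "Won" else 0.0)
        for size, stage, label in data
    ]
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in examples:
            # Prediction error on this example drives the weight update.
            error = sigmoid(weights[0] * x[0] + weights[1] * x[1] + bias) - y
            grad_w[0] += error * x[0]
            grad_w[1] += error * x[1]
            grad_b += error
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


def predict_proba(model, deal_size_k, stage):
    """Return the model's estimated probability that the deal is won."""
    weights, bias = model
    x = featurise(deal_size_k, stage)
    return sigmoid(weights[0] * x[0] + weights[1] * x[1] + bias)


model = train(TRAINING_DATA)
probability = predict_proba(model, 60, "Neg")  # the unseen deal from the diagram
print("Won" if probability >= 0.5 else "Lost", round(probability, 2))
```

Four examples are far too few for a real model; the point is only that the labels drive the weight updates, and the fitted function then generalises to an input it has never seen.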
Two Tasks: Classification and Regression
Classification -- Predict a category. Spam vs not-spam. Lead priority (high/medium/low). Email intent (sales/support/billing).
Regression -- Predict a continuous value. Expected deal value. Time to close. Customer lifetime value.
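As a contrast with classification, the sketch below fits a one-feature regression line with ordinary least squares and returns a number rather than a category. The data (stakeholders on a deal versus days to close) is invented for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Hypothetical data: number of stakeholders on a deal vs. days to close.
stakeholders = [1, 2, 3, 4]
days_to_close = [20, 30, 40, 50]

slope, intercept = fit_line(stakeholders, days_to_close)
predicted_days = slope * 5 + intercept  # continuous prediction for a 5-stakeholder deal
print(predicted_days)
```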
Common Supervised Algorithms
| Algorithm | Task Type | Strengths |
|---|---|---|
| Logistic Regression | Classification | Fast, interpretable, good baseline |
| Random Forest | Both | Handles mixed feature types, robust |
| XGBoost / LightGBM | Both | State-of-the-art for tabular data |
| Support Vector Machines | Classification | Effective in high-dimensional spaces |
| Neural Networks | Both | Handles unstructured data (text, images) |
The Label Problem
Supervised learning's biggest constraint is that it requires labelled data. Someone (or something) has to provide the correct answer for every training example.
Labelling strategies:
- Manual labelling -- Humans annotate data. High quality but expensive. Typical cost: $0.05-$5.00 per label depending on task complexity.
- Programmatic labelling -- Write heuristic rules to generate noisy labels. Lower quality but massively scalable. Frameworks like Snorkel formalise this approach.
- Active learning -- The model identifies which unlabelled examples would be most informative. A human labels only those examples, maximising learning per label.
- Transfer learning -- Start with a pre-trained model and fine-tune on a small labelled dataset. This is how most production NLP works today.
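Programmatic labelling is the easiest of these strategies to show in a few lines. The sketch below combines several keyword heuristics with a plain majority vote. It does not use Snorkel's API; frameworks like Snorkel go further by estimating how reliable each heuristic is. The labelling functions, keywords, and intent names are all hypothetical.

```python
from collections import Counter

ABSTAIN = None


# Hypothetical labelling functions: each encodes one noisy heuristic
# and abstains when the heuristic does not apply.
def lf_mentions_invoice(text):
    return "billing" if "invoice" in text.lower() else ABSTAIN


def lf_mentions_refund(text):
    return "billing" if "refund" in text.lower() else ABSTAIN


def lf_mentions_error(text):
    return "support" if "error" in text.lower() else ABSTAIN


def lf_mentions_pricing(text):
    return "sales" if "pricing" in text.lower() else ABSTAIN


LABELLING_FUNCTIONS = [
    lf_mentions_invoice,
    lf_mentions_refund,
    lf_mentions_error,
    lf_mentions_pricing,
]


def weak_label(text):
    """Majority vote over non-abstaining functions; None if none fire or the vote ties."""
    votes = [lf(text) for lf in LABELLING_FUNCTIONS]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics: leave the example unlabelled
    return ranked[0][0]


print(weak_label("Please refund the duplicate invoice"))
```

Examples left unlabelled by a tie or by universal abstention are natural candidates for manual labelling or active learning.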
The quality of your labels is the ceiling on your model's performance. A sophisticated algorithm trained on noisy labels will usually underperform a simple algorithm trained on clean labels. Invest in labelling quality before investing in model complexity.
Unsupervised Learning: Finding Hidden Structure
Unsupervised learning works with unlabelled data. There are no "right answers" -- the model discovers patterns, groupings, and structure on its own.
Unlabelled Data:
┌────────────────────────────────┐
│ Customer A: 50 logins, 3 deals │
│ Customer B: 2 logins, 0 deals  │
│ Customer C: 48 logins, 5 deals │
│ Customer D: 1 login, 0 deals   │
│ Customer E: 55 logins, 4 deals │
└────────────────────────────────┘
↓
Clustering Algorithm
↓
Cluster 1: {A, C, E} ← "Power users"
Cluster 2: {B, D} ← "At-risk / inactive"
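The grouping above can be reproduced with a small K-Means sketch. This is one possible clustering algorithm, chosen here because it is short; the diagram does not say which one is used. Seeding the centroids with customers A and B is an assumption made to keep the run deterministic, and real pipelines would normally scale the features first so that logins do not dominate the distance.

```python
# Customers from the example above: (logins, deals).
CUSTOMERS = {"A": (50, 3), "B": (2, 0), "C": (48, 5), "D": (1, 0), "E": (55, 4)}


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    centroids = list(initial_centroids)
    assignment = {}
    for _ in range(max_iterations):
        new_assignment = {
            name: min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
            for name, p in points.items()
        }
        if new_assignment == assignment:
            break  # assignments stopped changing: converged
        assignment = new_assignment
        for i in range(len(centroids)):
            members = [points[n] for n, c in assignment.items() if c == i]
            if members:
                centroids[i] = mean_point(members)
    clusters = [
        sorted(n for n, c in assignment.items() if c == i)
        for i in range(len(centroids))
    ]
    return clusters, centroids


clusters, centroids = k_means(CUSTOMERS, [CUSTOMERS["A"], CUSTOMERS["B"]])
print(clusters)
```

Note that the algorithm only returns the groups. Names such as "Power users" are an interpretation a human adds afterwards.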
Key Unsupervised Tasks
Clustering -- Group similar data points together. Customer segmentation, document grouping, anomaly detection (points that don't fit any cluster).
Dimensionality reduction -- Compress high-dimensional data into fewer dimensions while preserving structure. PCA (Principal Component Analysis), t-SNE, and UMAP are common techniques. Useful for visualisation and as a preprocessing step.
Anomaly detection -- Identify data points that deviate significantly from the norm. Fraud detection, system monitoring, quality control.
Association rules -- Discover relationships between variables. "Customers who buy X also tend to buy Y." Market basket analysis.
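For association rules, the "customers who buy X also tend to buy Y" statement is usually quantified with support, confidence, and lift. The sketch below computes all three over a handful of invented baskets; the product names and helper functions are hypothetical.

```python
# Hypothetical purchase baskets.
BASKETS = [
    {"crm", "email_addon"},
    {"crm", "email_addon", "analytics"},
    {"crm"},
    {"analytics"},
    {"crm", "email_addon"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """How much more often the pair co-occurs than if the items were independent."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


rule_confidence = confidence({"crm"}, {"email_addon"}, BASKETS)
rule_lift = lift({"crm"}, {"email_addon"}, BASKETS)
print(rule_confidence, rule_lift)
```

A lift above 1 suggests the two items appear together more often than chance would predict; with only five baskets, the numbers here are purely illustrative.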
Common Unsupervised Algorithms
| Algorithm | Task | Key Property |
|---|---|---|
| K-Means | Clustering | Simple, fast, requires specifying K |
| DBSCAN | Clustering | Finds arbitrarily shaped clusters, handles noise |
| Hierarchical Clustering | Clustering | Produces a dendrogram of nested clusters |
| PCA | Dimensionality reduction | Linear, preserves maximum variance |
| t-SNE / UMAP | Dimensionality reduction | Non-linear, preserves local structure |
| Isolation Forest | Anomaly detection | Efficient, handles high-dimensional data |
| Autoencoders | Dimensionality reduction / anomaly detection | Neural network-based, learns compressed representations |
Embedding models (like those used in RAG pipelines) are closely related to unsupervised dimensionality reduction. They compress text into dense vectors where semantic similarity corresponds to geometric proximity. Most are trained with self-supervised or contrastive objectives rather than classic unsupervised algorithms, but the retrieval step works the same way: when you search for similar documents using cosine similarity on embeddings, structure in the vector space, not labels, determines what counts as similar.
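Cosine similarity itself is a short formula: the dot product of two vectors divided by the product of their lengths. The sketch below ranks documents against a query using hand-written three-dimensional vectors. Those vectors are made up for illustration; a real embedding model would produce them and would use hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, hand-written for this sketch.
DOCUMENTS = {
    "refund policy": [0.9, 0.1, 0.0],
    "invoice dispute": [0.8, 0.3, 0.1],
    "api rate limits": [0.0, 0.2, 0.9],
}


def most_similar(query_vector, documents):
    """Return the name of the document whose vector is closest in angle to the query."""
    return max(documents,
               key=lambda name: cosine_similarity(query_vector, documents[name]))


query = [1.0, 0.0, 0.0]  # hypothetical embedding of a billing-related question
print(most_similar(query, DOCUMENTS))
```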
Self-Supervised Learning: The Third Paradigm
Self-supervised learning bridges the gap. It trains on unlabelled data but creates its own supervision signal from the data's structure. This is how LLMs are trained.
Masked language modelling (BERT-style): Hide a word in a sentence, train the model to predict it.
Input: "The customer [MASK] the support team about a billing issue."
Target: "contacted"
Next-token prediction (GPT-style): Given a sequence, predict the next token.
Input: "The quarterly revenue exceeded expectations by"
Target: "fifteen"
The label is derived from the data itself -- no human annotation needed. This is why LLMs can be trained on trillions of tokens: every text document on the internet is simultaneously input and training signal.
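The sketch below shows how both objectives manufacture training pairs from raw text alone. Splitting on whitespace is a stand-in for a real subword tokenizer, and the function names are hypothetical.

```python
def next_token_pairs(text):
    """Turn raw text into (context, target) pairs for next-token prediction."""
    tokens = text.split()  # whitespace split stands in for a real tokenizer
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_pair(text, position, mask_token="[MASK]"):
    """Hide the token at `position` and return (masked input, hidden target)."""
    tokens = text.split()
    target = tokens[position]
    masked = tokens[:position] + [mask_token] + tokens[position + 1:]
    return " ".join(masked), target


for context, target in next_token_pairs("The quarterly revenue exceeded expectations"):
    print(context, "->", target)

print(masked_pair("The customer contacted the support team", 2))
```

A single five-word sentence already yields four next-token examples, which is why unlabelled text scales so well as training data.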
Why Self-Supervised Learning Changed Everything
Before large-scale self-supervised pre-training, most NLP systems relied on supervised approaches that required expensive human labelling. You might get 10,000 labelled examples for sentiment analysis -- enough for a narrow model, but not a general one.
Self-supervised pre-training on trillions of tokens produces a foundation model that understands language broadly. You then fine-tune (supervised) on a small labelled dataset for your specific task. The pre-training does the heavy lifting.
The Decision Framework
| Dimension | Supervised | Unsupervised | Self-Supervised |
|---|---|---|---|
| Data requirement | Labelled data | Unlabelled data | Unlabelled data |
| Output type | Predictions (classification, regression) | Structure (clusters, patterns) | Representations (embeddings, pre-trained weights) |
| Evaluation | Clear metrics (accuracy, F1, RMSE) | Indirect (internal scores such as silhouette, plus human judgement: do clusters make sense?) | Downstream task performance |
| Best for | Known tasks with labelled examples | Exploration, segmentation, anomaly detection | Pre-training foundation models |
| Scale | Limited by labelling budget | Scales with data availability | Scales massively |
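The supervised metrics named in the Evaluation row are simple to compute once you have true labels to compare against, which is exactly what the other paradigms lack. A minimal sketch with made-up predictions:

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rmse(y_true, y_pred):
    """Root mean squared error for regression outputs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up labels and predictions for illustration.
labels = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 1]
print(accuracy(labels, predictions), f1_score(labels, predictions))
print(rmse([2.0, 4.0], [1.0, 5.0]))
```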
In practice, modern ML systems combine all three:
- Self-supervised pre-training produces a foundation model
- Supervised fine-tuning adapts it to your specific task
- Unsupervised clustering groups your data for analysis and monitoring
Outrun's AI Email Intelligence combines these paradigms in practice. The underlying LLM was pre-trained (self-supervised), then aligned for instruction-following (supervised). At runtime, it classifies your emails (supervised task), while Outrun's analytics surface patterns across your pipeline (unsupervised insights). See how it works on the Email Intelligence feature page.
Reinforcement Learning: The Honourable Mention
Reinforcement learning (RL) doesn't fit neatly into supervised or unsupervised. An agent takes actions in an environment and learns from rewards and penalties. No labelled data, no discovering structure -- just trial and error optimised over time.
RL is used in LLM alignment (RLHF), game-playing agents, robotics, and recommendation systems. It's less common in typical business ML applications, but increasingly important as AI systems become more agentic.
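The trial-and-error loop can be sketched with a two-armed bandit, one of the simplest RL settings. The agent below uses an epsilon-greedy strategy: it mostly picks the action with the best estimated reward and occasionally explores at random. The fixed reward values, the exploration rate, and the seed are assumptions for illustration; real environments return noisy rewards and have state.

```python
import random


def run_bandit(arm_rewards, steps=500, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: learn per-arm reward estimates from experience alone."""
    rng = random.Random(seed)
    estimates = [0.0] * len(arm_rewards)
    pulls = [0] * len(arm_rewards)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_rewards))  # explore: random action
        else:
            arm = max(range(len(arm_rewards)), key=lambda a: estimates[a])  # exploit
        reward = arm_rewards[arm]  # the environment's response (deterministic here)
        pulls[arm] += 1
        # Incremental average of the rewards observed for this arm.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return estimates, pulls


# Hypothetical environment: arm 1 pays more, but the agent is never told that.
estimates, pulls = run_bandit([0.2, 1.0])
print(estimates, pulls)
```

No example is ever labelled with the "correct" arm; the agent only finds the better action by trying it and observing the reward.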
What's Next
Understanding training paradigms tells you how models learn. But once a model is trained, how do you know if it's actually good?
In Evaluating LLMs, you'll learn the metrics, benchmarks, and practical evaluation strategies for assessing LLM quality -- the critical skill for choosing between models and verifying that your AI features actually work.