Introduction

This page is a non-exhaustive list of some of the most popular methods used in Data Science. It is intended to help novice users understand when to use a particular method and what its strengths and weaknesses are. The methods are grouped into Supervised Learning, Ensemble Methods, Deep Learning, Unsupervised Learning, Dimensionality Reduction, and Time Series Models. While the popularity of methods ebbs and flows, linear regression is by far the most used method on the list, as its practitioners extend well beyond Data Science specialists.

Summary Table

Method	Description
Supervised Learning
Linear Regression	Predicts continuous numerical values by fitting a linear relationship between features and target.
Logistic Regression	Classifies data into discrete categories using a logistic function.
Decision Trees	Creates a tree-like model of decisions based on feature values.
K-Nearest Neighbors	Classifies or predicts based on the K closest training examples in feature space.
Support Vector Machines	Finds the optimal hyperplane that maximally separates classes.
Naive Bayes	Applies Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
Ensemble Methods
Random Forest	Builds multiple decision trees and aggregates their predictions.
Gradient Boosting	Builds trees sequentially, each correcting errors of previous trees.
AdaBoost	Combines weak learners by focusing on misclassified samples.
Deep Learning Methods
Artificial Neural Networks	Learns complex patterns through layers of interconnected neurons.
Convolutional Neural Networks	A feed-forward neural network used for processing grid-like data (images).
Recurrent Neural Networks	Processes sequential data by maintaining hidden state information.
Transformers	Uses self-attention mechanisms to process sequential data in parallel.
Unsupervised Learning
K-Means Clustering	Partitions data into K distinct clusters based on feature similarity.
Hierarchical Clustering	Builds a tree of clusters (dendrogram) showing hierarchical relationships.
DBSCAN	Groups together points that are closely packed, marking outliers as noise.
PCA	Reduces dimensionality by finding principal components that explain variance.
Autoencoders	Neural networks that learn compressed representations of data.
Dimensionality Reduction
t-SNE	Non-linear dimensionality reduction primarily for visualization.
UMAP	Fast non-linear dimensionality reduction for visualization and preprocessing.
Time Series Models
ARIMA	Models a time series as a combination of autoregressive terms (AR), differencing (I, "integrated", used to make the series stationary), and moving-average terms (MA).

Table of Contents

Supervised Learning Methods
Ensemble Methods
Deep Learning Methods
Unsupervised Learning Methods
Dimensionality Reduction
- t-SNE
- UMAP
Time Series Models
- ARIMA

Supervised Learning Methods

Supervised Learning: Family of algorithms that use labeled data in the training process.

Linear Regression

What it does: Predicts continuous numerical values by fitting a linear relationship between features and target.

When to use:

Predicting continuous outcomes (price, temperature, sales)
Need interpretable results
Relationship between features and target appears linear
Fast training is required

Strengths:	Weaknesses:
Simple and interpretable	Assumes linear relationships
Fast to train and predict	Sensitive to outliers
Works well when the relationship is approximately linear	Poor performance with non-linear patterns
Low computational requirements	Prone to overfitting with many features

Key considerations:

Check for multicollinearity between features
Consider polynomial features for non-linear relationships
Use regularization (Ridge/Lasso) to prevent overfitting
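
The multicollinearity and regularization points above can be illustrated with a small sketch. It is a minimal NumPy example on synthetic data, assuming the closed-form ridge solution; the helper name `ridge_fit` and the penalty value `alpha=10.0` are illustrative choices, not recommendations.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: solve (Xc^T Xc + alpha*I) w = Xc^T yc.

    Features and target are centered first so the intercept is not penalized.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Append a column that is almost a copy of column 0 (multicollinearity).
X = np.column_stack([X, X[:, 0] + 1e-3 * rng.normal(size=200)])
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + 0.1 * rng.normal(size=200)

w_ols, b_ols = ridge_fit(X, y, alpha=0.0)      # alpha=0 reduces to ordinary least squares
w_ridge, b_ridge = ridge_fit(X, y, alpha=10.0)  # the penalty shrinks the coefficients

print("OLS coefficients:  ", np.round(w_ols, 2))
print("Ridge coefficients:", np.round(w_ridge, 2))
```

With nearly duplicated columns, the unpenalized coefficients on those columns tend to be unstable, while the ridge penalty pulls the coefficient vector toward a smaller norm.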

Logistic Regression

What it does: Classifies data into discrete categories using a logistic function.

When to use:

Binary classification problems (yes/no, spam/not spam)
Need probability estimates for predictions
Want interpretable feature importance
Baseline model for classification tasks

Strengths:	Weaknesses:
Provides probability scores	Assumes linear decision boundaries
Interpretable coefficients	Struggles with complex non-linear relationships
Fast training and prediction	Requires feature engineering for interactions
Works well with linearly separable classes	Can underfit complex datasets

Key considerations:

Use for binary or multinomial classification
Apply feature scaling for better convergence
Consider L1/L2 regularization
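
As a rough sketch of the considerations above, the following example assumes scikit-learn is available and uses synthetic data. Scaling is placed inside a pipeline so that it is fit on the training split only; `C=1.0` is simply the library default and would normally be tuned.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# C is the inverse regularization strength: smaller C means a stronger
# penalty (L2 by default in scikit-learn).
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)       # one probability column per class
accuracy = model.score(X_test, y_test)
coefs = model.named_steps["logisticregression"].coef_[0]

print("Test accuracy:", round(accuracy, 3))
print("Coefficients on scaled features:", coefs.round(2))
```

Because the features are standardized, the coefficient magnitudes are roughly comparable across features, which is what makes them usable as a simple importance measure.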

Decision Trees

What it does: Creates a tree-like model of decisions based on feature values.

When to use:

Need highly interpretable models
Data has non-linear relationships
Mix of categorical and numerical features
No need for feature scaling
Want to visualize decision-making process

Strengths:	Weaknesses:
Easy to understand and visualize	Prone to overfitting
Handles non-linear relationships	Unstable (small data changes affect tree structure)
No feature scaling required	Biased toward features with many levels
Captures feature interactions automatically	Poor at extrapolation
Works with mixed data types

Key considerations:

Set max_depth to prevent overfitting
Use min_samples_split and min_samples_leaf
Better in ensemble methods (Random Forest, XGBoost)
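
The effect of the parameters above can be seen by comparing an unconstrained tree with a constrained one. This is a sketch assuming scikit-learn and noisy synthetic data; the specific values (`max_depth=4`, `min_samples_split=20`, `min_samples_leaf=10`) are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.1 randomly reassigns about 10% of labels to simulate label noise.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree keeps splitting until every leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A constrained tree stops early, trading training accuracy for simplicity.
shallow = DecisionTreeClassifier(
    max_depth=4, min_samples_split=20, min_samples_leaf=10, random_state=0
).fit(X_train, y_train)

for name, tree in [("unconstrained", deep), ("constrained", shallow)]:
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"train={tree.score(X_train, y_train):.2f}, "
        f"test={tree.score(X_test, y_test):.2f}"
    )
```

A large gap between training and test accuracy for the unconstrained tree is the usual sign of overfitting.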

K-Nearest Neighbors (KNN)

What it does: Classifies or predicts based on the K closest training examples in feature space.

When to use:

Small to medium-sized datasets
Non-linear decision boundaries
Need simple baseline model
Pattern recognition tasks

Strengths:	Weaknesses:
Simple and intuitive	Computationally expensive at prediction time
No training phase (lazy learning)	Sensitive to feature scaling
Naturally handles multi-class problems	Curse of dimensionality (poor performance in high dimensions)
Non-parametric (makes no assumptions about data distribution)	Sensitive to irrelevant features
	Memory intensive

Key considerations:

Always scale features
Choose K carefully (odd numbers for binary classification)
Consider distance metrics (Euclidean, Manhattan, Minkowski)
Use dimensionality reduction for high-dimensional data
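
To see why scaling matters for KNN, the sketch below implements a small majority-vote classifier in NumPy and runs it on synthetic data where an irrelevant feature has a much larger scale than the informative one. The function name `knn_predict` and the data-generating choices are assumptions made for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    # Pairwise squared distances, shape (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest]  # labels of the k neighbors for each query point
    return np.array([np.bincount(row).argmax() for row in votes])

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
# Feature 0 separates the classes; feature 1 is pure noise on a huge scale.
X = np.column_stack([y + 0.2 * rng.normal(size=n), 1000.0 * rng.normal(size=n)])

train, test = np.arange(150), np.arange(150, n)
# Scaling statistics come from the training rows only.
mean, std = X[train].mean(axis=0), X[train].std(axis=0)
X_scaled = (X - mean) / std

acc_raw = (knn_predict(X[train], y[train], X[test]) == y[test]).mean()
acc_scaled = (knn_predict(X_scaled[train], y[train], X_scaled[test]) == y[test]).mean()
print("Accuracy without scaling:", acc_raw)
print("Accuracy with scaling:   ", acc_scaled)
```

Without scaling, the distance is dominated by the noisy large-scale feature, so the neighbors carry little information about the class.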

Support Vector Machines (SVM)

What it does: Finds the optimal hyperplane that maximally separates classes.

When to use:

Binary classification with clear margin of separation
High-dimensional spaces (text classification, bioinformatics)
More features than samples
Need robust model for outliers

Strengths:	Weaknesses:
Effective in high-dimensional spaces	Slow training on large datasets
Memory efficient (uses subset of training points)	Sensitive to feature scaling
Versatile through different kernel functions	Requires careful kernel selection
Works well with clear margin of separation	Doesn't provide probability estimates directly
	Poor performance on very large datasets

Key considerations:

Always scale features
Use linear kernel for linearly separable data
RBF kernel for non-linear problems
Tune C (regularization) and gamma parameters
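
The kernel choice above can be sketched by fitting a linear and an RBF SVM on data that is not linearly separable. The example assumes scikit-learn; the two-moons dataset and the values `C=1.0` and `gamma="scale"` are illustrative defaults rather than tuned settings.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half circles: a classic non-linear toy problem.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling is part of each pipeline because SVMs are sensitive to feature scale.
linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

linear_acc = linear_svm.fit(X_train, y_train).score(X_test, y_test)
rbf_acc = rbf_svm.fit(X_train, y_train).score(X_test, y_test)

# Only the support vectors are needed at prediction time.
n_support = len(rbf_svm.named_steps["svc"].support_vectors_)

print("Linear kernel accuracy:", round(linear_acc, 3))
print("RBF kernel accuracy:   ", round(rbf_acc, 3))
print("Support vectors (RBF): ", n_support, "of", len(X_train))
```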

Naive Bayes

What it does: Applies Bayes' theorem with strong independence assumptions between features.

When to use:

Text classification (spam detection, sentiment analysis)
Need very fast training and prediction
High-dimensional data
Baseline model for classification
Real-time prediction requirements

Strengths:	Weaknesses:
Extremely fast training and prediction	Assumes feature independence (rarely true)
Works well with high-dimensional data	Poor probability estimates
Needs only a small training dataset	Sensitive to how features are presented
Handles missing data well	Cannot learn feature interactions
Provides probability estimates

Key considerations:

Gaussian Naive Bayes for continuous data
Multinomial for count data (text)
Bernoulli for binary features

Ensemble Methods

Ensemble Methods: Family of algorithms that combine multiple weak models into a single, more accurate model.

Random Forest

What it does: Builds multiple decision trees and aggregates their predictions.

When to use:

Need robust, accurate predictions
Want feature importance rankings
Data has complex non-linear relationships
Reduce overfitting compared to single decision tree
Default choice for tabular data

Strengths:	Weaknesses:
High accuracy and robustness	Less interpretable than single trees
Reduces overfitting	Computationally expensive
Provides feature importance	Large memory footprint
Handles missing values	Can overfit on noisy datasets
Works with mixed data types	Slower prediction than simpler models
Minimal hyperparameter tuning needed

Key considerations:

Increase n_estimators (number of trees) for better performance
Tune max_depth and min_samples_split
Use for both classification and regression
Great for feature selection
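
A minimal sketch of feature importance ranking follows, assuming scikit-learn and synthetic data in which only the first three columns carry signal. The value `n_estimators=200` is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 3 informative features in columns 0-2;
# the remaining 5 columns are random noise.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# oob_score=True estimates accuracy from the samples each tree did not see.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_   # impurity-based, sums to 1
ranking = np.argsort(importances)[::-1]      # most important feature first

print("Out-of-bag accuracy:", round(forest.oob_score_, 3))
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

Impurity-based importances are convenient but can favor high-cardinality features, so they are best treated as a first look rather than a final ranking.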

Gradient Boosting

Variants: XGBoost, LightGBM, CatBoost

What it does: Builds trees sequentially, each correcting errors of previous trees.

When to use:

Need state-of-the-art performance on tabular data
Kaggle competitions or production systems
Have time for hyperparameter tuning
Structured/tabular data with complex patterns

Strengths:	Weaknesses:
Often best performance on structured data	Prone to overfitting if not tuned properly
Handles missing values	Requires careful hyperparameter tuning
Provides feature importance	Longer training time
Built-in regularization	Less interpretable
Supports custom loss functions	Can be overkill for simple problems

Key considerations:

XGBoost: General purpose, widely used
LightGBM: Faster training, handles large datasets
CatBoost: Best for categorical features
Tune learning_rate, max_depth, n_estimators
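
The idea of each tree correcting the errors of the previous ones can be sketched by hand for squared-error regression. This is a simplified illustration that uses scikit-learn's `DecisionTreeRegressor` as the base learner; it is not how XGBoost, LightGBM, or CatBoost are implemented internally, and the parameter values are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

learning_rate, n_estimators, max_depth = 0.1, 100, 2

prediction = np.full_like(y, y.mean())  # start from a constant model
trees, train_mse = [], []
for _ in range(n_estimators):
    # For squared error, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
    # Each new tree takes a small step toward the remaining error.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    train_mse.append(np.mean((y - prediction) ** 2))

print("Training MSE after 1 tree:   ", round(train_mse[0], 4))
print("Training MSE after 100 trees:", round(train_mse[-1], 4))
```

A smaller `learning_rate` generally needs more trees to reach the same training error, which is why these parameters are usually tuned together.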

AdaBoost

What it does: Combines weak learners by focusing on misclassified samples.

When to use:

Binary classification problems
Want simpler ensemble than gradient boosting
Have weak base learners
Want a boosting method that tends to overfit less on clean, low-noise data

Strengths:	Weaknesses:
Simple to implement	Sensitive to noisy data and outliers
Less prone to overfitting than other boosting	Slower than Random Forest
Works with various base learners	Can overfit on noisy datasets
No need for extensive hyperparameter tuning	Generally outperformed by gradient boosting

Key considerations:

Works best with stumps (trees with depth 1)
Tune number of estimators
Consider for simpler problems

Deep Learning Methods

Deep Learning: Family of algorithms that use multi-layer neural networks to identify relationships between data.

Artificial Neural Networks (ANN)

What it does: Learns complex patterns through layers of interconnected neurons.

When to use:

Large datasets available
Complex non-linear relationships
Need to learn hierarchical features
Have computational resources (GPU)

Strengths:	Weaknesses:
Learns complex patterns automatically	Requires large amounts of data
Scales well with data	Computationally expensive
Flexible architecture	Black box (hard to interpret)
Can approximate any function	Requires careful tuning
	Prone to overfitting

Key considerations:

Use dropout for regularization
Batch normalization for faster training
Try different activation functions
Monitor for overfitting

Convolutional Neural Networks (CNN)

What it does: Specialized neural networks for processing grid-like data (images).

When to use:

Image classification and recognition
Object detection
Image segmentation
Any spatial data with local patterns
Video analysis

Strengths:	Weaknesses:
Excellent for image data	Requires large labeled datasets
Automatic feature learning	Computationally intensive
Translation invariant	Long training times
Parameter sharing reduces model size	Requires GPU for practical use
State-of-the-art for computer vision	Black box interpretability

Key considerations:

Use transfer learning (ResNet, VGG, EfficientNet)
Data augmentation to increase dataset size
Pre-trained models for faster development

Recurrent Neural Networks (RNN)

What it does: Processes sequential data by maintaining hidden state information.

When to use:

Time series prediction
Natural language processing
Sequential decision-making
Speech recognition
Any data with temporal dependencies

Strengths:	Weaknesses:
Handles variable-length sequences	Vanishing/exploding gradient problems (RNN)
Captures temporal dependencies	Slow training (sequential processing)
Shares parameters across time steps	Difficult to parallelize
Can use context from past inputs	May struggle with very long sequences

Key considerations:

Use LSTM or GRU instead of vanilla RNN
Consider Transformers for longer sequences
Bidirectional RNNs for full context
Gradient clipping to prevent exploding gradients
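
Gradient clipping, mentioned above, can be sketched in a few lines of NumPy. This version clips by the global L2 norm across all gradient arrays; the function name `clip_by_global_norm` and the threshold `max_norm=1.0` are illustrative assumptions, and deep learning frameworks provide their own equivalents.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Gradients already within the limit are left unchanged (scale = 1).
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Two hypothetical parameter gradients with a joint norm of sqrt(9 + 16 + 144) = 13.
grads = [np.array([[3.0, 4.0]]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))

print("Norm before clipping:", norm_before)
print("Norm after clipping: ", round(norm_after, 6))
```

All arrays are scaled by the same factor, so the direction of the update is preserved and only its length is capped.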

Transformers

What it does: Uses self-attention mechanisms to process sequential data in parallel.

When to use:

Natural language processing (BERT, GPT)
Long-range dependencies in sequences
Need parallel processing of sequences
Machine translation
Text generation

Strengths:	Weaknesses:
Captures long-range dependencies	Extremely computationally expensive
Parallelizable training	Requires massive datasets
State-of-the-art NLP performance	Large memory requirements
Transfer learning with pre-trained models	Quadratic complexity with sequence length
Attention provides some interpretability	Overkill for simple tasks

Key considerations:

Use pre-trained models (BERT, GPT, T5)
Fine-tune for specific tasks
Consider smaller variants (DistilBERT)
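
The self-attention mechanism at the core of Transformers can be sketched as single-head scaled dot-product attention in NumPy. The weights here are random, and the names `W_q`, `W_k`, `W_v` and the dimensions are illustrative assumptions; real models add multiple heads, masking, positional information, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Every position is scored against every other position at once.
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # shape (seq_len, seq_len)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4
X = rng.normal(size=(seq_len, d_model))  # one embedding vector per token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print("Output shape:           ", output.shape)
print("Attention weights shape:", weights.shape)
```

The attention weight matrix has one entry per pair of positions, which is where the quadratic cost in sequence length comes from.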

Unsupervised Learning Methods

K-Means Clustering

What it does: Partitions data into K distinct clusters based on feature similarity.

When to use:

Customer segmentation
Document clustering
Image compression
Anomaly detection preprocessing
Data exploration

Strengths:	Weaknesses:
Simple and fast	Must specify K in advance
Scales well to large datasets	Sensitive to initial centroid placement
Easy to implement and interpret	Assumes spherical clusters
Works well with spherical clusters	Sensitive to outliers
	Poor with clusters of different sizes/densities

Key considerations:

Use elbow method or silhouette score to choose K
Run multiple times with different initializations
Scale features before clustering
Consider K-Means++ for better initialization
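
Choosing K with the silhouette score can be sketched as follows, assuming scikit-learn and synthetic data with three well-separated blobs. The cluster centers and the candidate range of K are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three well-separated synthetic clusters (centers chosen for illustration).
X, _ = make_blobs(
    n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0, random_state=0,
)
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    # n_init=10 reruns K-Means from 10 different k-means++ initializations
    # and keeps the solution with the lowest inertia.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"K={k}: silhouette={s:.3f}")
print("Best K by silhouette:", best_k)
```

On real data the silhouette curve is rarely this clear-cut, so it is usually read alongside the elbow plot and domain knowledge.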

Hierarchical Clustering

What it does: Builds a tree of clusters (dendrogram) showing hierarchical relationships.

When to use:

Don't know number of clusters beforehand
Need to visualize cluster hierarchy
Small to medium datasets
Want to see relationships at different granularities

Strengths:	Weaknesses:
No need to specify number of clusters	Computationally expensive (O(n³) time for the naive algorithm, O(n²) for some optimized variants)
Provides dendrogram visualization	Not suitable for large datasets
Deterministic (same result each run)	Sensitive to noise and outliers
Can use various distance metrics and linkage methods	Cannot undo previous steps

Key considerations:

Choose linkage method (single, complete, average, Ward)
Use dendrograms to determine cluster count
Scale features appropriately

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

What it does: Groups together points that are closely packed, marking outliers as noise.

When to use:

Clusters of arbitrary shape
Need outlier detection
Don't know number of clusters
Spatial data analysis
Presence of noise in data

Strengths:	Weaknesses:
No need to specify number of clusters	Struggles with varying density clusters
Finds arbitrarily shaped clusters	Sensitive to epsilon and min_samples parameters
Robust to outliers	Not fully deterministic (border point assignment can depend on processing order)
Identifies noise points	Doesn't work well in high dimensions
Only two parameters to tune

Key considerations:

Tune epsilon (neighborhood radius) carefully
Set min_samples based on dataset size
Scale features before clustering
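
The sketch below shows DBSCAN finding non-spherical clusters and flagging outliers as noise. It assumes scikit-learn; the two-moons data, the two injected outliers, and the values `eps=0.3` and `min_samples=5` are illustrative and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two crescent-shaped clusters plus two far-away points acting as outliers.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
outliers = np.array([[3.0, 3.0], [-3.0, -3.0]])
X = StandardScaler().fit_transform(np.vstack([X, outliers]))

# eps is the neighborhood radius; min_samples is the number of points
# (including the point itself) needed to form a dense region.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_  # the label -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("Clusters found:", n_clusters)
print("Noise points:  ", n_noise)
```

K-Means would split these crescents incorrectly because it assumes roughly spherical clusters, whereas DBSCAN follows the dense regions regardless of shape.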

Principal Component Analysis (PCA)

What it does: Reduces dimensionality by finding principal components that explain variance.

When to use:

Reduce number of features
Visualize high-dimensional data (reduce to 2D/3D)
Remove multicollinearity
Speed up training
Noise reduction

Strengths:	Weaknesses:
Reduces dimensionality while preserving variance	Linear transformation only
Removes correlated features	Results are harder to interpret
Speeds up training	Sensitive to feature scaling
Helps visualization	May lose important information
Unsupervised (no labels needed)	Assumes linear relationships

Key considerations:

Always scale features first
Choose number of components based on explained variance
Use for preprocessing before other algorithms
Consider alternatives (t-SNE, UMAP) for visualization
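
Choosing the number of components from explained variance can be sketched with a small NumPy implementation based on the SVD. The helper name `pca`, the 95% threshold, and the synthetic data (five features generated from two latent factors) are assumptions made for illustration.

```python
import numpy as np

def pca(X, variance_to_keep=0.95):
    """PCA via SVD on standardized data.

    Keeps the smallest number of components whose cumulative
    explained variance reaches variance_to_keep.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # scale features first
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained_ratio = S**2 / np.sum(S**2)
    cumulative = np.cumsum(explained_ratio)
    n_components = min(int(np.searchsorted(cumulative, variance_to_keep)) + 1, len(S))
    return Z @ Vt[:n_components].T, explained_ratio, n_components

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))  # two hidden factors
mixing = np.array([[1.0, 0.5, -1.0, 2.0, 0.0],
                   [0.0, 1.0, 1.0, -0.5, 1.5]])
X = latent @ mixing + 0.05 * rng.normal(size=(500, 5))  # five correlated features

X_reduced, explained_ratio, n_components = pca(X)
print("Explained variance ratio:", explained_ratio.round(3))
print("Components kept:", n_components)
print("Reduced shape:  ", X_reduced.shape)
```

Because the five features are driven by only two underlying factors, most of the variance concentrates in the first couple of components.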

Autoencoders

What it does: Neural networks that learn compressed representations of data.

When to use:

Non-linear dimensionality reduction
Anomaly detection
Image denoising
Feature learning
Data compression

Strengths:	Weaknesses:
Learns non-linear representations	Requires large datasets
Flexible architecture	Computationally expensive
Can learn complex patterns	Needs careful architecture design
Works with various data types	Can be overkill for simple problems
	Harder to interpret than PCA

Key considerations:

Use for complex, high-dimensional data
Variational autoencoders for generation
Denoising autoencoders for robustness

Dimensionality Reduction

t-Distributed Stochastic Neighbor Embedding (t-SNE)

What it does: Non-linear dimensionality reduction primarily for visualization.

When to use:

Visualizing high-dimensional data in 2D/3D
Exploring cluster structure
Presentation and communication of patterns
Image/text embeddings visualization

Strengths:	Weaknesses:
Excellent for visualization	Slow on large datasets
Preserves local structure well	Non-deterministic (different runs give different results)
Reveals clusters clearly	Mainly for visualization, not for downstream tasks
Works with various distance metrics	Doesn't preserve global structure
	Perplexity parameter is sensitive

Key considerations:

Use PCA first to reduce to ~50 dimensions
Experiment with perplexity (5-50)
Not suitable for new data projection
Use for exploration, not modeling

Uniform Manifold Approximation and Projection (UMAP)

What it does: Fast non-linear dimensionality reduction for visualization and preprocessing.

When to use:

Fast visualization of high-dimensional data
Preprocessing for downstream tasks
Large datasets (faster than t-SNE)
Preserve both local and global structure

Strengths:	Weaknesses:
Faster than t-SNE	More hyperparameters than t-SNE
Preserves global structure better	Less well-known/tested than PCA or t-SNE
Can be used for general dimensionality reduction	Still somewhat non-deterministic
Works well with large datasets	Requires careful parameter tuning
Can project new data

Key considerations:

Adjust n_neighbors for local vs global structure
Use for both visualization and preprocessing
Generally preferred over t-SNE for speed

Quick Reference Decision Tree

What type of problem do you have?

Predicting continuous values (regression)

Simple linear relationship → Linear Regression
Non-linear patterns, medium data → Decision Tree, Random Forest
Complex patterns, large data → Gradient Boosting (XGBoost), Neural Networks
Time series → LSTM, GRU, or ARIMA

Predicting Categories (classification)

Binary Classification:

Need interpretability, simple → Logistic Regression
High accuracy, tabular data → Random Forest, XGBoost
Image data → CNN
Text data → Naive Bayes (simple), Transformers (advanced)
Small dataset, simple → SVM, KNN

Multi-class Classification:

Text classification → Naive Bayes, Transformers
Image classification → CNN
Tabular data → Random Forest, XGBoost
Need probability estimates → Logistic Regression (multinomial)

Group Similar Items (clustering)

Know number of clusters, fast → K-Means
Don't know cluster count, hierarchical view → Hierarchical Clustering
Arbitrary shapes, outliers present → DBSCAN
Need feature learning → Autoencoders + clustering

Dimension Reduction

Linear reduction, preserve variance → PCA
Visualization only → t-SNE, UMAP
Non-linear, for downstream tasks → Autoencoders, UMAP
Feature selection → Random Forest feature importance, Lasso

Time Series Data

Simple patterns → ARIMA, Linear Regression
Complex patterns, dependencies → LSTM, GRU
Very long sequences, NLP → Transformers

Data Size Matters

Data Size	Methods
Small data (<1,000 samples)	Simpler models (Logistic Regression, Naive Bayes, KNN)
Medium data (1,000-100,000)	Random Forest, SVM, Gradient Boosting
Large data (>100,000)	Gradient Boosting, Neural Networks

Feature Types

Feature Types	Methods
Mostly categorical	Naive Bayes, CatBoost, Decision Trees
Mostly numerical	Linear models, SVM, Neural Networks
Mixed types	Tree-based methods (Random Forest, XGBoost)

Available Computational Resources

Resources	Methods
Limited resources	Linear models, Naive Bayes, Decision Trees
Moderate resources	Random Forest, SVM, standard Neural Networks
High resources (GPU)	Deep Learning (CNN, RNN, Transformers)

Interpretability Requirements

Interpretability	Methods
High interpretability needed	Linear Regression, Logistic Regression, Decision Trees
Moderate interpretability	Random Forest (feature importance), Naive Bayes
Low interpretability acceptable	Neural Networks, SVM with RBF kernel, XGBoost

Speed Requirements

Speed	Methods
Fast Training	Naive Bayes, Linear models, Decision Trees
Fast Prediction	Linear models, Naive Bayes, KNN (if indexed)
Can Be Slow	Deep Learning, SVM on large data, Ensemble methods

Common Pitfalls and Tips

Always

Split data into train/validation/test sets
Scale/normalize features (especially for distance-based methods)
Handle missing values appropriately
Check for data leakage
Use cross-validation for model evaluation
Monitor for overfitting

Feature Engineering Matters

Often more important than algorithm choice
Domain knowledge is crucial
Create interaction features if relationships exist
Remove highly correlated features

Hyperparameter Tuning

Use Grid Search or Random Search
Use validation set, not test set
Don't overfit to validation set
Consider Bayesian Optimization for expensive models
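
A minimal sketch of grid search with cross-validation follows, assuming scikit-learn. The test set is held out before tuning and used only once at the end; the model, the parameter grid, and the synthetic data are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

param_grid = {"max_depth": [3, 6, None], "min_samples_split": [2, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=5,  # 5-fold cross-validation on the training data plays the validation role
)
search.fit(X_train, y_train)

# By default the best configuration is refit on all training data.
test_accuracy = search.score(X_test, y_test)
print("Best parameters:  ", search.best_params_)
print("Best CV accuracy: ", round(search.best_score_, 3))
print("Held-out accuracy:", round(test_accuracy, 3))
```

For larger search spaces, `RandomizedSearchCV` from the same module samples configurations instead of trying every combination.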

When to Use Deep Learning

Large datasets (typically >10,000 samples)
Complex patterns (images, text, sequences)
Have computational resources (GPUs)
Simpler methods have failed
Can afford longer training times

Ensemble Methods

Often provide best results
Combine different model types
Use bagging for variance reduction
Use boosting for bias reduction

Use this checklist to narrow down your algorithm choices, then experiment with 2-3 candidates to find the best fit.

Final Tip: There's no universally best algorithm. The right choice depends on your specific problem, data characteristics, and constraints. Start simple, iterate, and let your validation results guide you!
