General Statistics Blog
2025-01-22

Data Science Methods Overview

A brief summary of the most common Data Science methods.

Introduction

This page is a non-exhaustive list of some of the most popular methods used in Data Science. It is intended to help novice users understand when to use a particular method and what that method's strengths and weaknesses are. The methods are grouped into categories including Supervised Learning, Unsupervised Learning, Ensemble Methods, Deep Learning and Dimensionality Reduction. While the popularity of methods ebbs and flows, linear regression is by far the most used method on the list, as its practitioners extend well beyond Data Science specialists.

Summary Table

| Category | Method | Description |
|---|---|---|
| Supervised Learning | Linear Regression | Predicts continuous numerical values by fitting a linear relationship between features and target. |
| Supervised Learning | Logistic Regression | Classifies data into discrete categories using a logistic function. |
| Supervised Learning | Decision Trees | Creates a tree-like model of decisions based on feature values. |
| Supervised Learning | K-Nearest Neighbors | Classifies or predicts based on the K closest training examples in feature space. |
| Supervised Learning | Support Vector Machines | Finds the optimal hyperplane that maximally separates classes. |
| Supervised Learning | Naive Bayes | Applies Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. |
| Ensemble Methods | Random Forest | Builds multiple decision trees and aggregates their predictions. |
| Ensemble Methods | Gradient Boosting | Builds trees sequentially, each correcting errors of previous trees. |
| Ensemble Methods | AdaBoost | Combines weak learners by focusing on misclassified samples. |
| Deep Learning Methods | Artificial Neural Networks | Learns complex patterns through layers of interconnected neurons. |
| Deep Learning Methods | Convolutional Neural Networks | A feed-forward neural network used for processing grid-like data (images). |
| Deep Learning Methods | Recurrent Neural Networks | Processes sequential data by maintaining hidden state information. |
| Deep Learning Methods | Transformers | Uses self-attention mechanisms to process sequential data in parallel. |
| Unsupervised Learning | K-Means Clustering | Partitions data into K distinct clusters based on feature similarity. |
| Unsupervised Learning | Hierarchical Clustering | Builds a tree of clusters (dendrogram) showing hierarchical relationships. |
| Unsupervised Learning | DBSCAN | Groups together points that are closely packed, marking outliers as noise. |
| Unsupervised Learning | PCA | Reduces dimensionality by finding principal components that explain variance. |
| Unsupervised Learning | Autoencoders | Neural networks that learn compressed representations of data. |
| Dimensionality Reduction | t-SNE | Non-linear dimensionality reduction primarily for visualization. |
| Dimensionality Reduction | UMAP | Fast non-linear dimensionality reduction for visualization and preprocessing. |
| Time Series Models | ARIMA | Models a series as a combination of an autoregressive part (AR), differencing to make the series stationary (I) and a moving-average part (MA). |


Supervised Learning Methods

Supervised Learning: Family of algorithms that use labeled data in the training process.

Linear Regression

What it does: Predicts continuous numerical values by fitting a linear relationship between features and target.

When to use:

  • Predicting continuous outcomes (price, temperature, sales)
  • Need interpretable results
  • Relationship between features and target appears linear
  • Fast training is required

| Strengths | Weaknesses |
|---|---|
| Simple and interpretable | Assumes linear relationships |
| Fast to train and predict | Sensitive to outliers |
| Works well when the relationship is approximately linear | Poor performance with non-linear patterns |
| Low computational requirements | Prone to overfitting with many features |

Key considerations:

  • Check for multicollinearity between features
  • Consider polynomial features for non-linear relationships
  • Use regularization (Ridge/Lasso) to prevent overfitting
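
As a rough illustration of the fitting step and of the outlier sensitivity listed above, here is a minimal sketch of ordinary least squares for a single feature in plain Python. The function name `fit_line` and the data are made up for this example; a real project would more likely rely on a library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that follows y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept

# Same data with the last target replaced by an outlier.
outlier_slope, _ = fit_line(xs, [3.0, 5.0, 7.0, 30.0])
```

On the clean data the fit recovers a slope of 2 and an intercept of 1. With the single outlier the slope jumps to roughly 8.3, which is why it pays to check for outliers before fitting.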

Logistic Regression

What it does: Classifies data into discrete categories using a logistic function.

When to use:

  • Binary classification problems (yes/no, spam/not spam)
  • Need probability estimates for predictions
  • Want interpretable feature importance
  • Baseline model for classification tasks

| Strengths | Weaknesses |
|---|---|
| Provides probability scores | Assumes linear decision boundaries |
| Interpretable coefficients | Struggles with complex non-linear relationships |
| Fast training and prediction | Requires feature engineering for interactions |
| Works well with linearly separable classes | Can underfit complex datasets |

Key considerations:

  • Use for binary or multinomial classification
  • Apply feature scaling for better convergence
  • Consider L1/L2 regularization
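
To make the "logistic function" and the L2 penalty concrete, the following is a small sketch of logistic regression with one feature, trained by batch gradient descent. The function names, learning rate, epoch count and penalty strength are assumptions chosen for this toy data, not recommended defaults.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000, l2=0.01):
    """Batch gradient descent on the L2-penalised log loss (one feature)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n + l2 * w
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]  # feature already centred and scaled
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * -1.8 + b)   # probability of class 1 for a low input
p_high = sigmoid(w * 1.8 + b)   # probability of class 1 for a high input
```

The learned weight is positive, so `p_high` lands above 0.5 and `p_low` below it. The sign and size of `w` are what make the coefficients interpretable.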

Decision Trees

What it does: Creates a tree-like model of decisions based on feature values.

When to use:

  • Need highly interpretable models
  • Data has non-linear relationships
  • Mix of categorical and numerical features
  • No need for feature scaling
  • Want to visualize decision-making process

| Strengths | Weaknesses |
|---|---|
| Easy to understand and visualize | Prone to overfitting |
| Handles non-linear relationships | Unstable (small data changes affect tree structure) |
| No feature scaling required | Biased toward features with many levels |
| Captures feature interactions automatically | Poor at extrapolation |
| Works with mixed data types | |

Key considerations:

  • Set max_depth to prevent overfitting
  • Use min_samples_split and min_samples_leaf
  • Better in ensemble methods (Random Forest, XGBoost)
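
A tree is grown by repeatedly picking the split that makes the child nodes as pure as possible. The sketch below finds a single best threshold using Gini impurity, which is one common criterion and is assumed here. The helper names and the toy data are illustrative only.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

ages = [22, 25, 30, 41, 47, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(ages, bought)
```

Here the best threshold is 35.5 with a weighted impurity of 0.0, since that split separates the two classes perfectly. A full tree applies the same search recursively to each child, and limits such as max_depth stop that recursion early.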

K-Nearest Neighbors (KNN)

What it does: Classifies or predicts based on the K closest training examples in feature space.

When to use:

  • Small to medium-sized datasets
  • Non-linear decision boundaries
  • Need simple baseline model
  • Pattern recognition tasks

| Strengths | Weaknesses |
|---|---|
| Simple and intuitive | Computationally expensive at prediction time |
| No training phase (lazy learning) | Sensitive to feature scaling |
| Naturally handles multi-class problems | Curse of dimensionality (poor performance in high dimensions) |
| Non-parametric (makes no assumptions about data distribution) | Sensitive to irrelevant features |
| | Memory intensive |

Key considerations:

  • Always scale features
  • Choose K carefully (an odd K avoids tied votes in binary classification)
  • Consider distance metrics (Euclidean, Manhattan, Minkowski)
  • Use dimensionality reduction for high-dimensional data
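
The advice to always scale features is easiest to see on a tiny example. This sketch implements a majority-vote KNN with Euclidean distance and runs it with and without standardization. The helper names and the income/age data are invented for illustration.

```python
import math
from collections import Counter

def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

people = [[30000, 25], [32000, 60], [31000, 27], [30500, 62]]  # [income, age]
groups = ["young", "senior", "young", "senior"]
query = [31900, 26]

raw_guess = knn_predict(people, groups, query)  # income dominates the distance
scaled, means, stds = standardize(people)
scaled_query = [(v - m) / s for v, m, s in zip(query, means, stds)]
scaled_guess = knn_predict(scaled, groups, scaled_query)  # both features count
```

Without scaling, the 26-year-old query is labelled "senior" because income differences swamp age differences. After standardization the same query is labelled "young".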

Support Vector Machines (SVM)

What it does: Finds the optimal hyperplane that maximally separates classes.

When to use:

  • Binary classification with clear margin of separation
  • High-dimensional spaces (text classification, bioinformatics)
  • More features than samples
  • Need robust model for outliers

| Strengths | Weaknesses |
|---|---|
| Effective in high-dimensional spaces | Slow training on large datasets |
| Memory efficient (uses subset of training points) | Sensitive to feature scaling |
| Versatile through different kernel functions | Requires careful kernel selection |
| Works well with clear margin of separation | Doesn't provide probability estimates directly |
| | Poor performance on very large datasets |

Key considerations:

  • Always scale features
  • Use linear kernel for linearly separable data
  • RBF kernel for non-linear problems
  • Tune C (regularization) and gamma parameters
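
To give a feel for what gamma controls, here is a short sketch of the RBF kernel on its own (not a full SVM). The function name is illustrative. The formula is the standard exp(-gamma * squared distance).

```python
import math

def rbf_kernel(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between a and b)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

a, b = [0.0, 0.0], [1.0, 1.0]
similar_small_gamma = rbf_kernel(a, b, gamma=0.1)   # about 0.82: points look similar
similar_large_gamma = rbf_kernel(a, b, gamma=10.0)  # close to 0: points look unrelated
```

A small gamma makes distant points still count as similar, giving smoother decision boundaries. A large gamma makes each training point influence only its immediate neighbourhood, which can overfit. Because the kernel depends on raw distances, unscaled features distort it in the same way they distort KNN.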

Naive Bayes

What it does: Applies Bayes' theorem with strong independence assumptions between features.

When to use:

  • Text classification (spam detection, sentiment analysis)
  • Need very fast training and prediction
  • High-dimensional data
  • Baseline model for classification
  • Real-time prediction requirements

| Strengths | Weaknesses |
|---|---|
| Extremely fast training and prediction | Assumes feature independence (rarely true) |
| Works well with high-dimensional data | Probability estimates are often poorly calibrated |
| Needs relatively little training data | Sensitive to how features are presented |
| Handles missing data well | Cannot learn feature interactions |
| Provides probability estimates | |

Key considerations:

  • Gaussian Naive Bayes for continuous data
  • Multinomial for count data (text)
  • Bernoulli for binary features
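
For the multinomial case, the whole classifier comes down to counting words per class. The sketch below is a minimal version with add-one (Laplace) smoothing, which is an assumption added here to avoid zero probabilities. The function names and the four toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count words per class for a multinomial Naive Bayes text classifier."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(model, doc):
    """Return the class with the highest log posterior for the document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior plus a sum of per-word log likelihoods: the sum is where
        # the "naive" independence assumption enters.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in doc.split():
            count = word_counts[label][word] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

docs = ["win money now", "cheap money offer", "meeting at noon", "lunch at noon"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
spam_guess = predict_nb(model, "cheap money")
ham_guess = predict_nb(model, "lunch at noon")
```

Training is a single pass over the data, which is where the speed comes from. The per-word scores are simply added, so the model cannot express that two words matter only in combination.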

Ensemble Methods

Ensemble Methods: Family of algorithms that combine multiple weak models into a single, more accurate model.

Random Forest

What it does: Builds multiple decision trees and aggregates their predictions.

When to use:

  • Need robust, accurate predictions
  • Want feature importance rankings
  • Data has complex non-linear relationships
  • Reduce overfitting compared to single decision tree
  • Default choice for tabular data

| Strengths | Weaknesses |
|---|---|
| High accuracy and robustness | Less interpretable than single trees |
| Reduces overfitting | Computationally expensive |
| Provides feature importance | Large memory footprint |
| Handles missing values | Can overfit on noisy datasets |
| Works with mixed data types | Slower prediction than simpler models |
| Minimal hyperparameter tuning needed | |

Key considerations:

  • Increase n_estimators (number of trees) for better performance
  • Tune max_depth and min_samples_split
  • Use for both classification and regression
  • Great for feature selection

Gradient Boosting

Variants: XGBoost, LightGBM, CatBoost

What it does: Builds trees sequentially, each correcting errors of previous trees.

When to use:

  • Need state-of-the-art performance on tabular data
  • Kaggle competitions or production systems
  • Have time for hyperparameter tuning
  • Structured/tabular data with complex patterns

| Strengths | Weaknesses |
|---|---|
| Often best performance on structured data | Prone to overfitting if not tuned properly |
| Handles missing values | Requires careful hyperparameter tuning |
| Provides feature importance | Longer training time |
| Built-in regularization | Less interpretable |
| Supports custom loss functions | Can be overkill for simple problems |

Key considerations:

  • XGBoost: General purpose, widely used
  • LightGBM: Faster training, handles large datasets
  • CatBoost: Best for categorical features
  • Tune learning_rate, max_depth, n_estimators
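
The phrase "each correcting errors of previous trees" can be sketched with depth-1 regression trees (stumps) fitted to the current residuals under squared error. This is a simplified toy, not how XGBoost, LightGBM or CatBoost are implemented. The helper names are made up and the parameter names only mirror the ones listed above.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump (needs >= 2 distinct x values)."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best

    def stump(x):
        return left_mean if x <= t else right_mean
    return stump

def gradient_boost(xs, ys, n_estimators=20, learning_rate=0.3):
    """Squared-error boosting: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return predict, history

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]  # non-linear target
model, mse_history = gradient_boost(xs, ys)
```

`mse_history` records the training error after each round and shrinks as stumps are added. A smaller learning_rate slows that descent and usually needs more estimators, which is the trade-off behind tuning the two together.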

AdaBoost

What it does: Combines weak learners by focusing on misclassified samples.

When to use:

  • Binary classification problems
  • Want simpler ensemble than gradient boosting
  • Have weak base learners
  • Less prone to overfitting than other boosting methods

| Strengths | Weaknesses |
|---|---|
| Simple to implement | Sensitive to noisy data and outliers |
| Less prone to overfitting than other boosting | Slower than Random Forest |
| Works with various base learners | Can overfit on noisy datasets |
| No need for extensive hyperparameter tuning | Generally outperformed by gradient boosting |

Key considerations:

  • Works best with stumps (trees with depth 1)
  • Tune number of estimators
  • Consider for simpler problems
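
"Focusing on misclassified samples" means re-weighting the training set after every weak learner. Below is a sketch of one such round for labels in {-1, +1}, using the classic AdaBoost update. The function name and the stump predictions are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round for labels in {-1, +1}: returns (alpha, new_weights)."""
    # Weighted error of the weak learner under the current sample weights.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - err) / err)  # learner's vote; assumes 0 < err < 1
    # Misclassified samples (t * p == -1) are scaled up, correct ones down.
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

y_true = [1, 1, -1, -1, 1]
stump_pred = [1, 1, -1, -1, -1]  # hypothetical stump: wrong on the last sample only
weights = [0.2] * 5
alpha, new_weights = adaboost_round(weights, y_true, stump_pred)
```

After this round the one misclassified sample carries half of the total weight (0.5) and each correctly classified sample drops to 0.125, so the next stump is pushed to get that sample right. The same mechanism explains the sensitivity to noise: a mislabelled point keeps gaining weight.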

Deep Learning Methods

Deep Learning: Family of algorithms that use multi-layer neural networks to learn relationships in data.

Artificial Neural Networks (ANN)

What it does: Learns complex patterns through layers of interconnected neurons.

When to use:

  • Large datasets available
  • Complex non-linear relationships
  • Need to learn hierarchical features
  • Have computational resources (GPU)

| Strengths | Weaknesses |
|---|---|
| Learns complex patterns automatically | Requires large amounts of data |
| Scales well with data | Computationally expensive |
| Flexible architecture | Black box (hard to interpret) |
| Can approximate any function | Requires careful tuning |
| | Prone to overfitting |

Key considerations:

  • Use dropout for regularization
  • Batch normalization for faster training
  • Try different activation functions
  • Monitor for overfitting
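
A forward pass through "layers of interconnected neurons" is just repeated weighted sums followed by an activation. The sketch below uses hand-picked weights (not learned ones) so that a tiny 2-2-1 network computes XOR, a pattern no single linear model can fit. All names and weights are illustrative assumptions.

```python
def relu(z):
    """Rectified linear unit: a common activation function."""
    return max(0.0, z)

def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: activation(W @ inputs + b)."""
    outputs = []
    for row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs

# Hand-picked (not learned) weights that make a 2-2-1 network compute XOR.
W1, b1 = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
W2, b2 = [[1.0, -2.0]], [0.0]

def forward(x):
    hidden = dense(x, W1, b1, activation=relu)
    return dense(hidden, W2, b2)[0]

xor_outputs = [forward([a, b]) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

The outputs are 0, 1, 1, 0 for the four input pairs. Training replaces the hand-picked weights with values found by gradient descent. Removing the ReLU would collapse the two layers into a single linear map, which is why the choice of activation matters.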

Convolutional Neural Networks (CNN)

What it does: Specialized neural networks for processing grid-like data (images).

When to use:

  • Image classification and recognition
  • Object detection
  • Image segmentation
  • Any spatial data with local patterns
  • Video analysis

| Strengths | Weaknesses |
|---|---|
| Excellent for image data | Requires large labeled datasets |
| Automatic feature learning | Computationally intensive |
| Translation invariant | Long training times |
| Parameter sharing reduces model size | Requires GPU for practical use |
| State-of-the-art for computer vision | Black box (hard to interpret) |

Key considerations:

  • Use transfer learning (ResNet, VGG, EfficientNet)
  • Data augmentation to increase dataset size
  • Pre-trained models for faster development
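
The core operation of a convolutional layer can be sketched in a few lines: slide one small kernel over the image and reuse the same weights at every position. The kernel below is a hand-written vertical edge detector chosen for illustration. In a real CNN the kernel values are learned.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1], [-1, 1]]  # responds where brightness increases left to right
feature_map = conv2d(image, vertical_edge)
```

The feature map is [[0, 2, 0], [0, 2, 0]]: it responds only at the column where the dark-to-bright edge sits. Only four weights were needed regardless of image size, which is the parameter sharing mentioned above.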

Recurrent Neural Networks (RNN)

What it does: Processes sequential data by maintaining hidden state information.

When to use:

  • Time series prediction
  • Natural language processing
  • Sequential decision-making
  • Speech recognition
  • Any data with temporal dependencies

| Strengths | Weaknesses |
|---|---|
| Handles variable-length sequences | Vanishing/exploding gradient problems (especially vanilla RNNs) |
| Captures temporal dependencies | Slow training (sequential processing) |
| Shares parameters across time steps | Difficult to parallelize |
| Can use context from past inputs | May struggle with very long sequences |

Key considerations:

  • Use LSTM or GRU instead of vanilla RNN
  • Consider Transformers for longer sequences
  • Bidirectional RNNs for full context
  • Gradient clipping to prevent exploding gradients
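
Gradient clipping, the last item above, is simple enough to sketch directly. This version rescales a flat list of gradients by their global L2 norm. The function name and threshold are illustrative, and deep learning frameworks ship their own equivalents.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their overall L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

exploding = [30.0, -40.0]  # L2 norm is 50
clipped = clip_by_global_norm(exploding, max_norm=5.0)
```

The clipped gradient is approximately [3.0, -4.0]: its direction is preserved and only its length is capped at 5. Clipping addresses exploding gradients only. Vanishing gradients are what LSTM and GRU cells are designed to ease.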

Transformers

What it does: Uses self-attention mechanisms to process sequential data in parallel.

When to use:

  • Natural language processing (BERT, GPT)
  • Long-range dependencies in sequences
  • Need parallel processing of sequences
  • Machine translation
  • Text generation

| Strengths | Weaknesses |
|---|---|
| Captures long-range dependencies | Extremely computationally expensive |
| Parallelizable training | Requires massive datasets |
| State-of-the-art NLP performance | Large memory requirements |
| Transfer learning with pre-trained models | Quadratic complexity with sequence length |
| Attention provides some interpretability | Overkill for simple tasks |

Key considerations:

  • Use pre-trained models (BERT, GPT, T5)
  • Fine-tune for specific tasks
  • Consider smaller variants (DistilBERT)
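
Self-attention itself fits in a short sketch. The version below is single-head scaled dot-product attention with no masking and no learned projections. Queries, keys and values are all set to the same toy embeddings, which is a simplification made here for brevity.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, no masking."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings
outputs, weights = attention(tokens, tokens, tokens)  # Q = K = V for simplicity
```

Every token attends to every other token, so `weights` is an n-by-n matrix. That is the source of both the parallelism (rows are independent) and the quadratic cost in sequence length listed among the weaknesses.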

Unsupervised Learning Methods

K-Means Clustering

What it does: Partitions data into K distinct clusters based on feature similarity.

When to use:

  • Customer segmentation
  • Document clustering
  • Image compression
  • Anomaly detection preprocessing
  • Data exploration

| Strengths | Weaknesses |
|---|---|
| Simple and fast | Must specify K in advance |
| Scales well to large datasets | Sensitive to initial centroid placement |
| Easy to implement and interpret | Assumes spherical clusters |
| Works well with spherical clusters | Sensitive to outliers |
| | Poor with clusters of different sizes/densities |

Key considerations:

  • Use elbow method or silhouette score to choose K
  • Run multiple times with different initializations
  • Scale features before clustering
  • Consider K-Means++ for better initialization
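
The algorithm behind K-Means (often called Lloyd's algorithm) alternates two steps until nothing changes. The sketch below takes the initial centroids as an argument so the run is reproducible. The function name and the six toy points are made up for illustration.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else list(old)
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]]
centroids, clusters = kmeans(points, centroids=[[1.0, 1.0], [8.0, 8.0]])
```

With these starting points the algorithm converges in two passes to two clusters of three points each. A poor starting choice can converge to a worse solution, which is the reason for running several initializations or using K-Means++.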

Hierarchical Clustering

What it does: Builds a tree of clusters (dendrogram) showing hierarchical relationships.

When to use:

  • Don't know number of clusters beforehand
  • Need to visualize cluster hierarchy
  • Small to medium datasets
  • Want to see relationships at different granularities

| Strengths | Weaknesses |
|---|---|
| No need to specify number of clusters | Computationally expensive (O(n³) or O(n²) time) |
| Provides dendrogram visualization | Not suitable for large datasets |
| Deterministic (same result each run) | Sensitive to noise and outliers |
| Can use various distance metrics and linkage methods | Cannot undo previous steps |

Key considerations:

  • Choose linkage method (single, complete, average, Ward)
  • Use dendrograms to determine cluster count
  • Scale features appropriately

Density-Based Spatial Clustering (DBSCAN)

What it does: Groups together points that are closely packed, marking outliers as noise.

When to use:

  • Clusters of arbitrary shape
  • Need outlier detection
  • Don't know number of clusters
  • Spatial data analysis
  • Presence of noise in data

| Strengths | Weaknesses |
|---|---|
| No need to specify number of clusters | Struggles with varying density clusters |
| Finds arbitrarily shaped clusters | Sensitive to epsilon and min_samples parameters |
| Robust to outliers | Not deterministic with border points |
| Identifies noise points | Doesn't work well in high dimensions |
| Only two parameters to tune | |

Key considerations:

  • Tune epsilon (neighborhood radius) carefully
  • Set min_samples based on dataset size
  • Scale features before clustering

Principal Component Analysis (PCA)

What it does: Reduces dimensionality by finding principal components that explain variance.

When to use:

  • Reduce number of features
  • Visualize high-dimensional data (reduce to 2D/3D)
  • Remove multicollinearity
  • Speed up training
  • Noise reduction

| Strengths | Weaknesses |
|---|---|
| Reduces dimensionality while preserving variance | Linear transformation only |
| Removes correlated features | Results are harder to interpret |
| Speeds up training | Sensitive to feature scaling |
| Helps visualization | May lose important information |
| Unsupervised (no labels needed) | Assumes linear relationships |

Key considerations:

  • Always scale features first
  • Choose number of components based on explained variance
  • Use for preprocessing before other algorithms
  • Consider alternatives (t-SNE, UMAP) for visualization
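
"Choose number of components based on explained variance" can be illustrated with two features, where the eigenvalues of the covariance matrix have a closed form. This is a sketch for the 2D case only. The function name and data are invented, and general PCA would use a linear algebra library.

```python
import math

def explained_variance_2d(rows):
    """Explained variance ratios of the two principal components of 2D data."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    # Eigenvalues of the covariance matrix [[sxx, sxy], [sxy, syy]] are the
    # variances along the two principal components.
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [half_trace + spread, half_trace - spread]
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

# Two strongly correlated features on similar scales (made-up values).
data = [[-1.5, -1.4], [-0.5, -0.6], [0.0, 0.1], [0.5, 0.4], [1.5, 1.5]]
ratios = explained_variance_2d(data)
```

Here the first component explains well over 99% of the variance, so the second could be dropped with almost no loss. If one feature were measured in much larger units it would dominate the covariance matrix, which is why scaling comes first.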

Autoencoders

What it does: Neural networks that learn compressed representations of data.

When to use:

  • Non-linear dimensionality reduction
  • Anomaly detection
  • Image denoising
  • Feature learning
  • Data compression

| Strengths | Weaknesses |
|---|---|
| Learns non-linear representations | Requires large datasets |
| Flexible architecture | Computationally expensive |
| Can learn complex patterns | Needs careful architecture design |
| Works with various data types | Can be overkill for simple problems |
| | Harder to interpret than PCA |

Key considerations:

  • Use for complex, high-dimensional data
  • Variational autoencoders for generation
  • Denoising autoencoders for robustness

Dimensionality Reduction

t-Distributed Stochastic Neighbor Embedding (t-SNE)

What it does: Non-linear dimensionality reduction primarily for visualization.

When to use:

  • Visualizing high-dimensional data in 2D/3D
  • Exploring cluster structure
  • Presentation and communication of patterns
  • Image/text embeddings visualization

| Strengths | Weaknesses |
|---|---|
| Excellent for visualization | Slow on large datasets |
| Preserves local structure well | Non-deterministic (different runs give different results) |
| Reveals clusters clearly | Mainly for visualization, not for downstream tasks |
| Works with various distance metrics | Doesn't preserve global structure |
| | Results are sensitive to the perplexity parameter |

Key considerations:

  • Use PCA first to reduce to ~50 dimensions
  • Experiment with perplexity (5-50)
  • Not suitable for new data projection
  • Use for exploration, not modeling

Uniform Manifold Approximation and Projection (UMAP)

What it does: Fast non-linear dimensionality reduction for visualization and preprocessing.

When to use:

  • Fast visualization of high-dimensional data
  • Preprocessing for downstream tasks
  • Large datasets (faster than t-SNE)
  • Preserve both local and global structure

| Strengths | Weaknesses |
|---|---|
| Faster than t-SNE | More hyperparameters than t-SNE |
| Preserves global structure better | Less well-known/tested than PCA or t-SNE |
| Can be used for general dimensionality reduction | Still somewhat non-deterministic |
| Works well with large datasets | Requires careful parameter tuning |
| Can project new data | |

Key considerations:

  • Adjust n_neighbors for local vs global structure
  • Use for both visualization and preprocessing
  • Generally preferred over t-SNE for speed

Quick Reference Decision Tree

What type of problem do you have?

Predicting continuous values (regression)

  • Simple linear relationship → Linear Regression
  • Non-linear patterns, medium data → Decision Tree, Random Forest
  • Complex patterns, large data → Gradient Boosting (XGBoost), Neural Networks
  • Time series → LSTM, GRU, or ARIMA

Predicting Categories (classification)

Binary Classification:

  • Need interpretability, simple → Logistic Regression
  • High accuracy, tabular data → Random Forest, XGBoost
  • Image data → CNN
  • Text data → Naive Bayes (simple), Transformers (advanced)
  • Small dataset, simple → SVM, KNN

Multi-class Classification:

  • Text classification → Naive Bayes, Transformers
  • Image classification → CNN
  • Tabular data → Random Forest, XGBoost
  • Need probability estimates → Logistic Regression (multinomial)

Group Similar Items (clustering)

  • Know number of clusters, fast → K-Means
  • Don't know cluster count, hierarchical view → Hierarchical Clustering
  • Arbitrary shapes, outliers present → DBSCAN
  • Need feature learning → Autoencoders + clustering

Dimension Reduction

  • Linear reduction, preserve variance → PCA
  • Visualization only → t-SNE, UMAP
  • Non-linear, for downstream tasks → Autoencoders, UMAP
  • Feature selection → Random Forest feature importance, Lasso

Time Series Data

  • Simple patterns → ARIMA, Linear Regression
  • Complex patterns, dependencies → LSTM, GRU
  • Very long sequences, NLP → Transformers

Data Size Matters

| Data Size | Methods |
|---|---|
| Small data (<1,000 samples) | Simpler models (Logistic Regression, Naive Bayes, KNN) |
| Medium data (1,000-100,000) | Random Forest, SVM, Gradient Boosting |
| Large data (>100,000) | Gradient Boosting, Neural Networks |

Feature Types

| Feature Types | Methods |
|---|---|
| Mostly categorical | Naive Bayes, CatBoost, Decision Trees |
| Mostly numerical | Linear models, SVM, Neural Networks |
| Mixed types | Tree-based methods (Random Forest, XGBoost) |

Available Computational Resources

| Resources | Methods |
|---|---|
| Limited resources | Linear models, Naive Bayes, Decision Trees |
| Moderate resources | Random Forest, SVM, standard Neural Networks |
| High resources (GPU) | Deep Learning (CNN, RNN, Transformers) |

Interpretability Requirements

| Interpretability | Methods |
|---|---|
| High interpretability needed | Linear Regression, Logistic Regression, Decision Trees |
| Moderate interpretability | Random Forest (feature importance), Naive Bayes |
| Low interpretability acceptable | Neural Networks, SVM with RBF kernel, XGBoost |

Speed Requirements

| Speed | Methods |
|---|---|
| Fast Training | Naive Bayes, Linear models, Decision Trees |
| Fast Prediction | Linear models, Naive Bayes, KNN (if indexed) |
| Can Be Slow | Deep Learning, SVM on large data, Ensemble methods |

Common Pitfalls and Tips

Always

  • Split data into train/validation/test sets
  • Scale/normalize features (especially for distance-based methods)
  • Handle missing values appropriately
  • Check for data leakage
  • Use cross-validation for model evaluation
  • Monitor for overfitting
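
Cross-validation comes down to bookkeeping over indices. This sketch generates k train/validation splits in which every sample is validated exactly once. The function name is illustrative, and in practice the indices would usually be shuffled first (except for time series, where order matters).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
```

For 10 samples and 3 folds the validation sets have sizes 4, 3 and 3, and no index appears in both halves of a split. Any scaling or imputation should be fitted on `train_idx` only, otherwise information leaks from the validation fold.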

Feature Engineering Matters

  • Often more important than algorithm choice
  • Domain knowledge is crucial
  • Create interaction features if relationships exist
  • Remove highly correlated features

Hyperparameter Tuning

  • Use Grid Search or Random Search
  • Use validation set, not test set
  • Don't overfit to validation set
  • Consider Bayesian Optimization for expensive models
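
Grid Search is an exhaustive loop over parameter combinations. In the sketch below, `validation_score` is a hypothetical stand-in for "train on the training set, score on the validation set", and its formula is invented purely so the example runs.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every parameter combination and keep the best validation score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function: peaks at max_depth=4, learning_rate=0.1.
def validation_score(max_depth, learning_rate):
    return 1.0 - abs(max_depth - 4) * 0.05 - abs(learning_rate - 0.1)

grid = {"max_depth": [2, 4, 8], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(grid, validation_score)
```

The grid has 3 × 3 = 9 combinations and the search returns max_depth=4 with learning_rate=0.1. The number of combinations multiplies with every added parameter, which is why Random Search or Bayesian Optimization is preferred when each evaluation is expensive.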

When to Use Deep Learning

  • Large datasets (typically >10,000 samples)
  • Complex patterns (images, text, sequences)
  • Have computational resources (GPUs)
  • Simpler methods have failed
  • Can afford longer training times

Ensemble Methods

  • Often provide best results
  • Combine different model types
  • Use bagging for variance reduction
  • Use boosting for bias reduction

Use this checklist to narrow down your algorithm choices, then experiment with 2-3 candidates to find the best fit.

Final Tip: There's no universally best algorithm. The right choice depends on your specific problem, data characteristics, and constraints. Start simple, iterate, and let your validation results guide you!