Data Science Methods Overview
A brief summary of the most common Data Science methods.
Introduction
This page is a non-exhaustive list of some of the most popular methods used in Data Science. It is intended to help novice users understand when to use a particular method and what its strengths and weaknesses are. The methods are grouped into categories including Supervised Learning, Unsupervised Learning, Ensemble Methods, Deep Learning, and Dimensionality Reduction. While the popularity of individual methods ebbs and flows, linear regression is by far the most widely used method on the list, because its practitioners extend well beyond Data Science specialists.
Summary Table
| Method | Description |
|---|---|
| Supervised Learning | |
| Linear Regression | Predicts continuous numerical values by fitting a linear relationship between features and target. |
| Logistic Regression | Classifies data into discrete categories using a logistic function. |
| Decision Trees | Creates a tree-like model of decisions based on feature values. |
| K-Nearest Neighbors | Classifies or predicts based on the K closest training examples in feature space. |
| Support Vector Machines | Finds the optimal hyperplane that maximally separates classes. |
| Naive Bayes | Applies Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. |
| Ensemble Methods | |
| Random Forest | Builds multiple decision trees and aggregates their predictions. |
| Gradient Boosting | Builds trees sequentially, each correcting errors of previous trees. |
| AdaBoost | Combines weak learners by focusing on misclassified samples. |
| Deep Learning Methods | |
| Artificial Neural Networks | Learns complex patterns through layers of interconnected neurons. |
| Convolutional Neural Networks | A feed-forward neural network used for processing grid-like data (images). |
| Recurrent Neural Networks | Processes sequential data by maintaining hidden state information. |
| Transformers | Uses self-attention mechanisms to process sequential data in parallel. |
| Unsupervised Learning | |
| K-Means Clustering | Partitions data into K distinct clusters based on feature similarity. |
| Hierarchical Clustering | Builds a tree of clusters (dendrogram) showing hierarchical relationships. |
| DBSCAN | Groups together points that are closely packed, marking outliers as noise. |
| PCA | Reduces dimensionality by finding principal components that explain variance. |
| Autoencoders | Neural networks that learn compressed representations of data. |
| Dimensionality Reduction | |
| t-SNE | Non-linear dimensionality reduction primarily for visualization. |
| UMAP | Fast non-linear dimensionality reduction for visualization and preprocessing. |
| Time Series Models | |
| ARIMA | Models a series as a combination of an autoregressive part (AR), differencing to achieve stationarity (I, the integrated part), and a moving-average part (MA). |
Table of Contents
- Supervised Learning Methods
- Ensemble Methods
- Deep Learning Methods
- Unsupervised Learning Methods
- Dimensionality Reduction
- Time Series Models
Supervised Learning Methods
Supervised Learning: Family of algorithms that use labeled data in the training process.
Linear Regression
What it does: Predicts continuous numerical values by fitting a linear relationship between features and target.
When to use:
- Predicting continuous outcomes (price, temperature, sales)
- Need interpretable results
- Relationship between features and target appears linear
- Fast training is required
| Strengths: | Weaknesses: |
|---|---|
| Simple and interpretable | Assumes linear relationships |
| Fast to train and predict | Sensitive to outliers |
| Works well when the feature-target relationship is approximately linear | Poor performance with non-linear patterns |
| Low computational requirements | Prone to overfitting with many features |
Key considerations:
- Check for multicollinearity between features
- Consider polynomial features for non-linear relationships
- Use regularization (Ridge/Lasso) to prevent overfitting
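The effect of regularization can be seen in a minimal sketch, assuming scikit-learn and NumPy are installed. The synthetic data, the true coefficients, and the `alpha` value are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): y = 3*x1 - 2*x2 + 5 + noise
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

# Ordinary least squares: coefficients are directly interpretable
ols = LinearRegression().fit(X, y)

# Ridge adds an L2 penalty that shrinks coefficients toward zero
ridge = Ridge(alpha=10.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_, "intercept:", ols.intercept_)
print("Ridge coefficients:", ridge.coef_)
```

The OLS fit should recover coefficients close to the ones used to generate the data, while the Ridge coefficients are pulled slightly toward zero. In practice `alpha` would be chosen by cross-validation rather than fixed by hand.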
Logistic Regression
What it does: Classifies data into discrete categories using a logistic function.
When to use:
- Binary classification problems (yes/no, spam/not spam)
- Need probability estimates for predictions
- Want interpretable feature importance
- Baseline model for classification tasks
| Strengths: | Weaknesses: |
|---|---|
| Provides probability scores | Assumes linear decision boundaries |
| Interpretable coefficients | Struggles with complex non-linear relationships |
| Fast training and prediction | Requires feature engineering for interactions |
| Works well with linearly separable classes | Can underfit complex datasets |
Key considerations:
- Use for binary or multinomial classification
- Apply feature scaling for better convergence
- Consider L1/L2 regularization
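A minimal sketch of these considerations, assuming scikit-learn is installed. The synthetic dataset and the value of `C` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling helps the solver converge; C is the inverse regularization
# strength, and the default penalty is L2
clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)    # one column per class, rows sum to 1
accuracy = clf.score(X_test, y_test)

# Coefficients refer to the scaled features
coefs = clf.named_steps["logisticregression"].coef_
print("accuracy:", accuracy)
print("coefficients:", coefs)
```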
Decision Trees
What it does: Creates a tree-like model of decisions based on feature values.
When to use:
- Need highly interpretable models
- Data has non-linear relationships
- Mix of categorical and numerical features
- No need for feature scaling
- Want to visualize decision-making process
| Strengths: | Weaknesses: |
|---|---|
| Easy to understand and visualize | Prone to overfitting |
| Handles non-linear relationships | Unstable (small data changes affect tree structure) |
| No feature scaling required | Biased toward features with many levels |
| Captures feature interactions automatically | Poor at extrapolation |
| Works with mixed data types | |
Key considerations:
- Set max_depth to prevent overfitting
- Use min_samples_split and min_samples_leaf
- Better in ensemble methods (Random Forest, XGBoost)
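The parameters named above can be tried in a short sketch, assuming scikit-learn is installed. The Iris dataset and the specific limits are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limit depth and leaf size to curb overfitting
tree = DecisionTreeClassifier(max_depth=3, min_samples_split=10,
                              min_samples_leaf=5, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned rules as indented text
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```

The printed rules are what makes a single tree easy to inspect: each path from the root to a leaf reads as a chain of threshold tests on named features.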
K-Nearest Neighbors (KNN)
What it does: Classifies or predicts based on the K closest training examples in feature space.
When to use:
- Small to medium-sized datasets
- Non-linear decision boundaries
- Need simple baseline model
- Pattern recognition tasks
| Strengths: | Weaknesses: |
|---|---|
| Simple and intuitive | Computationally expensive at prediction time |
| No training phase (lazy learning) | Sensitive to feature scaling |
| Naturally handles multi-class problems | Curse of dimensionality (poor performance in high dimensions) |
| Non-parametric (makes no assumptions about data distribution) | Sensitive to irrelevant features |
| | Memory intensive |
Key considerations:
- Always scale features
- Choose K carefully (odd numbers for binary classification)
- Consider distance metrics (Euclidean, Manhattan, Minkowski)
- Use dimensionality reduction for high-dimensional data
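To make the mechanism concrete, here is a minimal from-scratch sketch using only the Python standard library (Python 3.8+ for `math.dist`). The helper `knn_predict` and the toy data are hypothetical; in practice a library implementation such as scikit-learn's `KNeighborsClassifier` would normally be used.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # "Lazy learning": all the work happens here, at prediction time
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data, already on a common scale (assumption for illustration)
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))  # a
print(knn_predict(train_X, train_y, (4.9, 5.0), k=3))  # b
```

Because every prediction scans the whole training set, the cost grows with the dataset size, which is the prediction-time expense listed among the weaknesses.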
Support Vector Machines (SVM)
What it does: Finds the optimal hyperplane that maximally separates classes.
When to use:
- Binary classification with clear margin of separation
- High-dimensional spaces (text classification, bioinformatics)
- More features than samples
- Need robust model for outliers
| Strengths: | Weaknesses: |
|---|---|
| Effective in high-dimensional spaces | Slow training on large datasets |
| Memory efficient (uses subset of training points) | Sensitive to feature scaling |
| Versatile through different kernel functions | Requires careful kernel selection |
| Works well with clear margin of separation | Doesn't provide probability estimates directly |
| | Poor performance on very large datasets |
Key considerations:
- Always scale features
- Use linear kernel for linearly separable data
- RBF kernel for non-linear problems
- Tune C (regularization) and gamma parameters
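A minimal sketch of an RBF-kernel SVM with feature scaling, assuming scikit-learn is installed. The two-moons dataset and the values of `C` and `gamma` are illustrative assumptions; real values would come from tuning.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Non-linearly separable toy data (assumption for illustration)
X, y = make_moons(n_samples=400, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale first, then fit an RBF-kernel SVM
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=2.0))
svm.fit(X_train, y_train)

svm_accuracy = svm.score(X_test, y_test)

# Only the support vectors are kept to define the decision boundary
n_support = svm.named_steps["svc"].n_support_
print("accuracy:", svm_accuracy, "support vectors per class:", n_support)
```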
Naive Bayes
What it does: Applies Bayes' theorem with strong independence assumptions between features.
When to use:
- Text classification (spam detection, sentiment analysis)
- Need very fast training and prediction
- High-dimensional data
- Baseline model for classification
- Real-time prediction requirements
| Strengths: | Weaknesses: |
|---|---|
| Extremely fast training and prediction | Assumes feature independence (rarely true) |
| Works well with high-dimensional data | Probability estimates are often poorly calibrated |
| Requires small training dataset | Sensitive to how features are presented |
| Handles missing data well | Cannot learn feature interactions |
| Provides probability estimates | |
Key considerations:
- Gaussian Naive Bayes for continuous data
- Multinomial for count data (text)
- Bernoulli for binary features
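As a minimal sketch of the multinomial variant on word counts, assuming scikit-learn is installed. The six example messages and their labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus (assumption for illustration)
texts = [
    "win a free prize now", "free money claim your prize",
    "limited offer win cash", "meeting moved to monday",
    "please review the attached report", "lunch on monday with the team",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# CountVectorizer turns text into word counts; alpha=1.0 is Laplace smoothing
nb = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
nb.fit(texts, labels)

message = ["claim your free cash prize"]
prediction = nb.predict(message)[0]
probabilities = nb.predict_proba(message)[0]   # ordered as nb.classes_
print(prediction, dict(zip(nb.classes_, probabilities)))
```

Every word in the test message appears only in the spam examples, so the model labels it as spam. The reported probabilities tend to be pushed toward 0 or 1, which is why they are better treated as scores than as calibrated estimates.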
Ensemble Methods
Ensemble Methods: Family of algorithms that combine multiple weak models into a single, more accurate model.
Random Forest
What it does: Builds multiple decision trees and aggregates their predictions.
When to use:
- Need robust, accurate predictions
- Want feature importance rankings
- Data has complex non-linear relationships
- Reduce overfitting compared to single decision tree
- Default choice for tabular data
| Strengths: | Weaknesses: |
|---|---|
| High accuracy and robustness | Less interpretable than single trees |
| Reduces overfitting | Computationally expensive |
| Provides feature importance | Large memory footprint |
| Handles missing values | Can overfit on noisy datasets |
| Works with mixed data types | Slower prediction than simpler models |
| Minimal hyperparameter tuning needed | |
Key considerations:
- Increase n_estimators (number of trees) for better performance
- Tune max_depth and min_samples_split
- Use for both classification and regression
- Great for feature selection
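A minimal sketch of feature importance ranking, assuming scikit-learn is installed. The synthetic dataset is built so that only the first two columns carry signal; that setup and the parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the 2 informative features in columns 0 and 1;
# the remaining 4 columns are pure noise
X, y = make_classification(n_samples=600, n_features=6, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=2.0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=200, max_depth=None,
                                min_samples_split=2, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]   # most important feature first
print("importances:", importances.round(3))
print("ranking:", ranking)
```

Impurity-based importances can be biased toward high-cardinality features, so they are best read as a rough ranking rather than an exact measure.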
Gradient Boosting
Variants: XGBoost, LightGBM, CatBoost
What it does: Builds trees sequentially, each correcting errors of previous trees.
When to use:
- Need state-of-the-art performance on tabular data
- Kaggle competitions or production systems
- Have time for hyperparameter tuning
- Structured/tabular data with complex patterns
| Strengths: | Weaknesses: |
|---|---|
| Often best performance on structured data | Prone to overfitting if not tuned properly |
| Handles missing values | Requires careful hyperparameter tuning |
| Provides feature importance | Longer training time |
| Built-in regularization | Less interpretable |
| Supports custom loss functions | Can be overkill for simple problems |
Key considerations:
- XGBoost: General purpose, widely used
- LightGBM: Faster training, handles large datasets
- CatBoost: Best for categorical features
- Tune learning_rate, max_depth, n_estimators
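The three parameters above can be illustrated with a minimal sketch. To avoid extra dependencies it uses scikit-learn's own `GradientBoostingClassifier` rather than XGBoost, LightGBM, or CatBoost, whose APIs differ; the dataset and parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic tabular problem (assumption for illustration)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(learning_rate=0.1, max_depth=3,
                                 n_estimators=200, random_state=0)
gbm.fit(X_train, y_train)

# staged_predict yields predictions after each boosting stage, showing how
# test accuracy changes as trees are added one after another
staged_accuracy = [(pred == y_test).mean() for pred in gbm.staged_predict(X_test)]
print("after 1 tree:   ", staged_accuracy[0])
print("after 200 trees:", staged_accuracy[-1])
```

Plotting `staged_accuracy` against the number of trees is a simple way to see where additional trees stop helping, which informs the trade-off between `learning_rate` and `n_estimators`.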
AdaBoost
What it does: Combines weak learners by focusing on misclassified samples.
When to use:
- Binary classification problems
- Want simpler ensemble than gradient boosting
- Have weak base learners
- Want a boosting method that tends to resist overfitting on clean, low-noise data
| Strengths: | Weaknesses: |
|---|---|
| Simple to implement | Sensitive to noisy data and outliers |
| Less prone to overfitting than other boosting | Slower than Random Forest |
| Works with various base learners | Can overfit on noisy datasets |
| No need for extensive hyperparameter tuning | Generally outperformed by gradient boosting |
Key considerations:
- Works best with stumps (trees with depth 1)
- Tune number of estimators
- Consider for simpler problems
Deep Learning Methods
Deep Learning: Family of algorithms that use multi-layer neural networks to identify relationships in data.
Artificial Neural Networks (ANN)
What it does: Learns complex patterns through layers of interconnected neurons.
When to use:
- Large datasets available
- Complex non-linear relationships
- Need to learn hierarchical features
- Have computational resources (GPU)
| Strengths: | Weaknesses: |
|---|---|
| Learns complex patterns automatically | Requires large amounts of data |
| Scales well with data | Computationally expensive |
| Flexible architecture | Black box (hard to interpret) |
| Can approximate any function | Requires careful tuning |
| | Prone to overfitting |
Key considerations:
- Use dropout for regularization
- Batch normalization for faster training
- Try different activation functions
- Monitor for overfitting
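Dropout, the first consideration above, can be sketched in a few lines of NumPy. This assumes the "inverted dropout" formulation, where activations are rescaled during training; `dropout` here is a hypothetical helper written for illustration, not a library function.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # no-op at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))   # stand-in for a hidden layer's activations

train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
eval_out = dropout(hidden, rate=0.5, training=False, rng=rng)

print("train mean:", train_out.mean())   # close to 1.0 on average
print("eval mean: ", eval_out.mean())
```

Randomly silencing units during training discourages the network from relying on any single unit, which is why dropout acts as a regularizer.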
Convolutional Neural Networks (CNN)
What it does: Specialized neural networks for processing grid-like data (images).
When to use:
- Image classification and recognition
- Object detection
- Image segmentation
- Any spatial data with local patterns
- Video analysis
| Strengths: | Weaknesses: |
|---|---|
| Excellent for image data | Requires large labeled datasets |
| Automatic feature learning | Computationally intensive |
| Approximately translation invariant (shared filters plus pooling) | Long training times |
| Parameter sharing reduces model size | Requires GPU for practical use |
| State-of-the-art for computer vision | Black box interpretability |
Key considerations:
- Use transfer learning (ResNet, VGG, EfficientNet)
- Data augmentation to increase dataset size
- Pre-trained models for faster development
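The idea of detecting local patterns with shared parameters can be illustrated with a minimal NumPy sketch of a single convolution filter. `conv2d_valid` is a hypothetical helper written for clarity, not performance; deep learning frameworks provide optimized equivalents.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (cross-correlation, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every position
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical edge: left half dark (0), right half bright (1)
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Horizontal-difference kernel that responds to vertical edges
kernel = np.array([[-1.0, 1.0]])

response = conv2d_valid(image, kernel)
print(response[0])   # [0. 0. 1. 0. 0.]
```

The filter has only two weights yet covers the whole image, and its response peaks exactly where the edge is. A CNN learns many such filters instead of having them specified by hand.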
Recurrent Neural Networks (RNN)
What it does: Processes sequential data by maintaining hidden state information.
When to use:
- Time series prediction
- Natural language processing
- Sequential decision-making
- Speech recognition
- Any data with temporal dependencies
| Strengths: | Weaknesses: |
|---|---|
| Handles variable-length sequences | Vanishing/exploding gradient problems (especially vanilla RNNs) |
| Captures temporal dependencies | Slow training (sequential processing) |
| Shares parameters across time steps | Difficult to parallelize |
| Can use context from past inputs | May struggle with very long sequences |
Key considerations:
- Use LSTM or GRU instead of vanilla RNN
- Consider Transformers for longer sequences
- Bidirectional RNNs for full context
- Gradient clipping to prevent exploding gradients
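Gradient clipping, the last consideration above, can be sketched with NumPy. This assumes clipping by global L2 norm, one common variant; `clip_by_global_norm` and the example gradients are hypothetical and written for illustration only.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm

# Made-up "exploding" gradients for two parameter arrays
grads = [np.full((2, 2), 30.0), np.full((3,), 40.0)]

clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print("norm before:", norm_before)
print("norm after: ", norm_after)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its size is capped.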
Transformers
What it does: Uses self-attention mechanisms to process sequential data in parallel.
When to use:
- Natural language processing (BERT, GPT)
- Long-range dependencies in sequences
- Need parallel processing of sequences
- Machine translation
- Text generation
| Strengths: | Weaknesses: |
|---|---|
| Captures long-range dependencies | Extremely computationally expensive |
| Parallelizable training | Requires massive datasets |
| State-of-the-art NLP performance | Large memory requirements |
| Transfer learning with pre-trained models | Quadratic complexity with sequence length |
| Attention provides some interpretability | Overkill for simple tasks |
Key considerations:
- Use pre-trained models (BERT, GPT, T5)
- Fine-tune for specific tasks
- Consider smaller variants (DistilBERT)
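The self-attention mechanism at the core of Transformers can be sketched in NumPy. This is a single-head, unmasked version with random weights, intended only to show the computation; the function names, shapes, and weights are illustrative assumptions, and real models add multiple heads, masking, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len): the quadratic cost
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)   # (6, 4) (6, 6)
```

Every position attends to every other position in one matrix product, which is what makes training parallelizable and also why cost grows quadratically with sequence length.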
Unsupervised Learning Methods
K-Means Clustering
What it does: Partitions data into K distinct clusters based on feature similarity.
When to use:
- Customer segmentation
- Document clustering
- Image compression
- Anomaly detection preprocessing
- Data exploration
| Strengths: | Weaknesses: |
|---|---|
| Simple and fast | Must specify K in advance |
| Scales well to large datasets | Sensitive to initial centroid placement |
| Easy to implement and interpret | Assumes spherical clusters |
| Works well with spherical clusters | Sensitive to outliers |
| | Poor with clusters of different sizes/densities |
Key considerations:
- Use elbow method or silhouette score to choose K
- Run multiple times with different initializations
- Scale features before clustering
- Consider K-Means++ for better initialization
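A minimal sketch of choosing K with the silhouette score, assuming scikit-learn is installed. The three synthetic blobs and the range of K values tried are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three well-separated blobs (assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=[(0, 0), (10, 0), (5, 9)],
                  cluster_std=0.6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    # k-means++ initialization, repeated 10 times with different seeds
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = km.fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print("silhouette by K:", {k: round(s, 3) for k, s in scores.items()})
print("best K:", best_k)
```

On real data the silhouette curve is rarely this clear-cut, so it is usually read alongside the elbow plot and domain knowledge.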
Hierarchical Clustering
What it does: Builds a tree of clusters (dendrogram) showing hierarchical relationships.
When to use:
- Don't know number of clusters beforehand
- Need to visualize cluster hierarchy
- Small to medium datasets
- Want to see relationships at different granularities
| Strengths: | Weaknesses: |
|---|---|
| No need to specify number of clusters | Computationally expensive (O(n³) time in the naive case, about O(n²) for optimized variants) |
| Provides dendrogram visualization | Not suitable for large datasets |
| Deterministic (same result each run) | Sensitive to noise and outliers |
| Can use various distance metrics and linkage methods | Cannot undo previous steps |
Key considerations:
- Choose linkage method (single, complete, average, Ward)
- Use dendrograms to determine cluster count
- Scale features appropriately
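A minimal sketch of building the merge tree once and cutting it at different granularities, assuming SciPy and NumPy are installed. The two synthetic groups and the choice of Ward linkage are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Two well-separated groups of 2-D points (assumption for illustration)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])

# Each row of Z records one merge: (cluster_a, cluster_b, distance, size)
Z = linkage(X, method="ward")

# Cut the same tree at two granularities
labels_2 = fcluster(Z, t=2, criterion="maxclust")
labels_4 = fcluster(Z, t=4, criterion="maxclust")

print("2-cluster cut:", labels_2)
print("4-cluster cut:", labels_4)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree; large jumps in merge distance suggest natural places to cut.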
Density-Based Spatial Clustering (DBSCAN)
What it does: Groups together points that are closely packed, marking outliers as noise.
When to use:
- Clusters of arbitrary shape
- Need outlier detection
- Don't know number of clusters
- Spatial data analysis
- Presence of noise in data
| Strengths: | Weaknesses: |
|---|---|
| No need to specify number of clusters | Struggles with varying density clusters |
| Finds arbitrarily shaped clusters | Sensitive to epsilon and min_samples parameters |
| Robust to outliers | Not deterministic with border points |
| Identifies noise points | Doesn't work well in high dimensions |
| Only two parameters to tune | |
Key considerations:
- Tune epsilon (neighborhood radius) carefully
- Set min_samples based on data dimensionality and expected noise (a common heuristic is at least the number of features plus one)
- Scale features before clustering
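A minimal sketch of the two parameters and the noise label, assuming scikit-learn and NumPy are installed. The toy data (two small grids and one isolated point) is constructed for illustration and already shares a common scale, so no scaler is applied here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense 3x3 grids of points plus one isolated point
offsets = np.array([(dx, dy) for dx in (0.0, 0.2, 0.4) for dy in (0.0, 0.2, 0.4)])
X = np.vstack([offsets, offsets + 5.0, [[20.0, -20.0]]])

# eps: neighborhood radius; min_samples: neighbors (including the point
# itself) required for a point to count as a core point
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

labels = db.labels_                     # -1 marks noise
n_clusters = len(set(labels) - {-1})
print("labels:", labels)
print("clusters found:", n_clusters)
```

The number of clusters is discovered rather than specified, and the isolated point is labeled as noise instead of being forced into a cluster.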
Principal Component Analysis (PCA)
What it does: Reduces dimensionality by finding principal components that explain variance.
When to use:
- Reduce number of features
- Visualize high-dimensional data (reduce to 2D/3D)
- Remove multicollinearity
- Speed up training
- Noise reduction
| Strengths: | Weaknesses: |
|---|---|
| Reduces dimensionality while preserving variance | Linear transformation only |
| Removes correlated features | Results are harder to interpret |
| Speeds up training | Sensitive to feature scaling |
| Helps visualization | May lose important information |
| Unsupervised (no labels needed) | Assumes linear relationships |
Key considerations:
- Always scale features first
- Choose number of components based on explained variance
- Use for preprocessing before other algorithms
- Consider alternatives (t-SNE, UMAP) for visualization
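A minimal sketch of scaling followed by PCA with a variance target, assuming scikit-learn and NumPy are installed. The synthetic data (five features driven by two hidden factors) and the 95% threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 5 observed features driven by 2 hidden factors plus small noise
factors = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + 0.01 * rng.normal(size=(300, 5))
X[:, 0] *= 1000.0   # one feature on a much larger scale

# A float n_components keeps the fewest components that explain
# at least that share of the variance
pca = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pca.fit_transform(X)

explained = pca.named_steps["pca"].explained_variance_ratio_
print("components kept:", X_reduced.shape[1])
print("explained variance ratio:", explained.round(3))
```

Without the scaler, the feature multiplied by 1000 would dominate the first component purely because of its units.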
Autoencoders
What it does: Neural networks that learn compressed representations of data.
When to use:
- Non-linear dimensionality reduction
- Anomaly detection
- Image denoising
- Feature learning
- Data compression
| Strengths: | Weaknesses: |
|---|---|
| Learns non-linear representations | Requires large datasets |
| Flexible architecture | Computationally expensive |
| Can learn complex patterns | Needs careful architecture design |
| Works with various data types | Can be overkill for simple problems |
| | Harder to interpret than PCA |
Key considerations:
- Use for complex, high-dimensional data
- Variational autoencoders for generation
- Denoising autoencoders for robustness
Dimensionality Reduction
t-Distributed Stochastic Neighbor Embedding (t-SNE)
What it does: Non-linear dimensionality reduction primarily for visualization.
When to use:
- Visualizing high-dimensional data in 2D/3D
- Exploring cluster structure
- Presentation and communication of patterns
- Image/text embeddings visualization
| Strengths: | Weaknesses: |
|---|---|
| Excellent for visualization | Slow on large datasets |
| Preserves local structure well | Non-deterministic (different runs give different results) |
| Reveals clusters clearly | Mainly for visualization, not for downstream tasks |
| Works with various distance metrics | Doesn't preserve global structure |
| | Results are sensitive to the perplexity parameter |
Key considerations:
- Use PCA first to reduce to ~50 dimensions
- Experiment with perplexity (5-50)
- Not suitable for new data projection
- Use for exploration, not modeling
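A minimal sketch of the PCA-then-t-SNE workflow, assuming scikit-learn is installed. The digits dataset, the 300-sample subset, and the perplexity value are illustrative assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = load_digits()
X = digits.data[:300]          # 300 images, 64 pixel features each

# Step 1: PCA to about 50 dimensions to denoise and speed up t-SNE
X_pca = PCA(n_components=50, random_state=0).fit_transform(X)

# Step 2: t-SNE down to 2-D for plotting; a fixed random_state makes
# the layout repeatable on the same machine and library version
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X_pca)

print(embedding.shape)
```

scikit-learn's `TSNE` offers `fit_transform` but no separate `transform`, which reflects the point above about new data: the embedding is computed for the given points only.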
Uniform Manifold Approximation and Projection (UMAP)
What it does: Fast non-linear dimensionality reduction for visualization and preprocessing.
When to use:
- Fast visualization of high-dimensional data
- Preprocessing for downstream tasks
- Large datasets (faster than t-SNE)
- Preserve both local and global structure
| Strengths: | Weaknesses: |
|---|---|
| Faster than t-SNE | More hyperparameters than t-SNE |
| Preserves global structure better | Less well-known/tested than PCA or t-SNE |
| Can be used for general dimensionality reduction | Still somewhat non-deterministic |
| Works well with large datasets | Requires careful parameter tuning |
| Can project new data | |
Key considerations:
- Adjust n_neighbors for local vs global structure
- Use for both visualization and preprocessing
- Generally preferred over t-SNE for speed
Quick Reference Decision Tree
What type of problem do you have?
Predicting continuous values (regression)
- Simple linear relationship → Linear Regression
- Non-linear patterns, medium data → Decision Tree, Random Forest
- Complex patterns, large data → Gradient Boosting (XGBoost), Neural Networks
- Time series → LSTM, GRU, or ARIMA
Predicting Categories (classification)
Binary Classification:
- Need interpretability, simple → Logistic Regression
- High accuracy, tabular data → Random Forest, XGBoost
- Image data → CNN
- Text data → Naive Bayes (simple), Transformers (advanced)
- Small dataset, simple → SVM, KNN
Multi-class Classification:
- Text classification → Naive Bayes, Transformers
- Image classification → CNN
- Tabular data → Random Forest, XGBoost
- Need probability estimates → Logistic Regression (multinomial)
Group Similar Items (clustering)
- Know number of clusters, fast → K-Means
- Don't know cluster count, hierarchical view → Hierarchical Clustering
- Arbitrary shapes, outliers present → DBSCAN
- Need feature learning → Autoencoders + clustering
Dimension Reduction
- Linear reduction, preserve variance → PCA
- Visualization only → t-SNE, UMAP
- Non-linear, for downstream tasks → Autoencoders, UMAP
- Feature selection → Random Forest feature importance, Lasso
Time Series Data
- Simple patterns → ARIMA, Linear Regression
- Complex patterns, dependencies → LSTM, GRU
- Very long sequences, NLP → Transformers
Data Size Matters
| Data Size | Methods |
|---|---|
| Small data (<1,000 samples) | Simpler models (Logistic Regression, Naive Bayes, KNN) |
| Medium data (1,000-100,000) | Random Forest, SVM, Gradient Boosting |
| Large data (>100,000) | Gradient Boosting, Neural Networks |
Feature Types
| Feature Types | Methods |
|---|---|
| Mostly categorical | Naive Bayes, CatBoost, Decision Trees |
| Mostly numerical | Linear models, SVM, Neural Networks |
| Mixed types | Tree-based methods (Random Forest, XGBoost) |
Available Computational Resources
| Computational Resources | Methods |
|---|---|
| Limited resources | Linear models, Naive Bayes, Decision Trees |
| Moderate resources | Random Forest, SVM, standard Neural Networks |
| High resources (GPU) | Deep Learning (CNN, RNN, Transformers) |
Interpretability Requirements
| Interpretability | Methods |
|---|---|
| High interpretability needed | Linear Regression, Logistic Regression, Decision Trees |
| Moderate interpretability | Random Forest (feature importance), Naive Bayes |
| Low interpretability acceptable | Neural Networks, SVM with RBF kernel, XGBoost |
Speed Requirements
| Speed | Methods |
|---|---|
| Fast Training | Naive Bayes, Linear models, Decision Trees |
| Fast Prediction | Linear models, Naive Bayes, KNN (if indexed) |
| Can Be Slow | Deep Learning, SVM on large data, Ensemble methods |
Common Pitfalls and Tips
Always
- Split data into train/validation/test sets
- Scale/normalize features (especially for distance-based methods)
- Handle missing values appropriately
- Check for data leakage
- Use cross-validation for model evaluation
- Monitor for overfitting
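Several items on this list can be combined in one short sketch, assuming scikit-learn is installed. The dataset, split sizes, and model are illustrative choices; the point is the structure, not the specific estimator.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set first; it is used only once, at the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# With the scaler inside the pipeline, it is re-fit on each training fold
# only, so no statistics leak from the validation fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)

print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
print("test accuracy: %.3f" % test_score)
```

A large gap between the cross-validation score and the test score, or between training and validation scores, is a signal to look for overfitting or leakage.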
Feature Engineering Matters
- Often more important than algorithm choice
- Domain knowledge is crucial
- Create interaction features if relationships exist
- Remove highly correlated features
Hyperparameter Tuning
- Use Grid Search or Random Search
- Use validation set, not test set
- Don't overfit to validation set
- Consider Bayesian Optimization for expensive models
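A minimal grid search sketch, assuming scikit-learn is installed. The KNN model, the parameter grid, and the Iris dataset are illustrative assumptions; the step names `scale` and `knn` are arbitrary labels chosen here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9],
              "knn__weights": ["uniform", "distance"]}

# Cross-validation on the training data plays the role of the validation set
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)   # test set touched once, at the end
print("best parameters:", best_params)
print("test accuracy:", test_accuracy)
```

For larger search spaces, `RandomizedSearchCV` from the same module samples a fixed number of configurations instead of trying every combination.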
When to Use Deep Learning
- Large datasets (typically >10,000 samples)
- Complex patterns (images, text, sequences)
- Have computational resources (GPUs)
- Simpler methods have failed
- Can afford longer training times
Ensemble Methods
- Often provide best results
- Combine different model types
- Use bagging for variance reduction
- Use boosting for bias reduction
Use this checklist to narrow down your algorithm choices, then experiment with 2-3 candidates to find the best fit.
Final Tip: There's no universally best algorithm. The right choice depends on your specific problem, data characteristics, and constraints. Start simple, iterate, and let your validation results guide you!