Advanced Supervised Learning, Neural Nets, & Unsupervised Model Overviews

Photo by Alina Grubnyak on Unsplash

Advanced Supervised Learning

  • CART
  • Bootstrapping & Bagging
  • Random Forest
  • Gradient Descent
  • Boosting
  • SVMs

Neural Nets

  • Intro to Neural Nets
  • Regularizing Neural Nets
  • RNNs
  • CNNs
  • GANs

Intro to Unsupervised Learning

  • K Means
  • DB Scan
  • PCA
  • Recommender Systems to come in a later post

Advanced Supervised Learning

These models get a little fancier than Linear Regression and KNN. We start with the most basic and add complexity as we go through the list.


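CART

CART stands for Classification and Regression Tree, also called a Decision Tree. So, what’s a decision tree and how does it help predict a target variable? A decision tree repeatedly splits the data on one feature at a time, choosing the split that leaves the resulting groups as pure as possible (for classification, typically measured by Gini impurity). It then predicts the majority class or mean target value of the group a new observation lands in.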

  • Don’t have to scale our data
  • Nonparametric, so don’t make assumptions about data distributions
  • Easy to interpret: Can map out decisions made by the tree & view feature importance
  • Speed: They fit very quickly
  • Pretty much guaranteed to overfit on training data
  • Locally optimal. Because Decision Trees make the best decision at the local level, we may end up with a worse overall solution.
  • Don’t work well with unbalanced data.
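To make the "best decision at the local level" idea concrete, here is a minimal sketch of how a single split could be chosen on one feature using Gini impurity. The function names and the toy hours/passed data are made up for illustration, and a real tree would repeat this search over every feature and then recurse into each side.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Try each midpoint between sorted unique values of one feature and
    return (threshold, weighted_gini) for the split with the lowest impurity."""
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    unique = sorted(set(values))
    for lo, hi in zip(unique, unique[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical toy feature: hours studied vs. pass/fail.
hours = [1, 2, 3, 6, 7, 8]
passed = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_split(hours, passed)
print(threshold, impurity)  # the cleanest split is at 4.5, with impurity 0.0
```

Because the search only looks one split ahead, it is greedy: the split that is best right now is not guaranteed to lead to the best overall tree.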

Bootstrapping & Bagging

Bootstrapping is random resampling with replacement. We draw observations from the original dataset to build a new dataset of the same size. When an observation is added to the new set it’s not removed from the original, so the same observation can be drawn multiple times (or not at all). Bagging (bootstrap aggregating) fits a model on each bootstrapped dataset and then averages the predictions, or takes a majority vote for classification.

  • Can get closer to a true function by averaging out multiple model results
  • Wisdom of the crowd
  • Can lower variance by exposing the model to differing samples of the training data
  • It is harder to interpret because it’s running multiple models at once
  • Computationally more expensive
  • The trees are highly correlated, so the ensemble can still have high variance
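As a rough sketch of resampling with replacement and then averaging, the snippet below uses the sample mean as a stand-in "model". The function names and data are hypothetical; in practice each bootstrapped sample would be used to fit a Decision Tree or another estimator.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement, so items can repeat or be left out."""
    return [rng.choice(data) for _ in data]

def bagged_mean(data, n_models, rng):
    """'Fit' a trivial model (the sample mean) on each bootstrapped sample,
    then average the individual estimates."""
    estimates = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        estimates.append(sum(sample) / len(sample))
    return sum(estimates) / len(estimates)

rng = random.Random(0)  # seeded so the run is repeatable
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
sample = bootstrap_sample(data, rng)
bagged = bagged_mean(data, n_models=200, rng=rng)
print(sample)
print(bagged)
```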

Random Forest & Extra Trees

Random Forest is another ensemble model made up of, you probably guessed, many single Decision Trees. It also trains on bootstrapped samples, like the Bagging method. The difference between Bagging and a Random Forest is that at each split in each tree, the Random Forest only considers a random subset of features and picks the split that most reduces Gini impurity. Each new split draws a fresh random subset of features and repeats the process.

  • Reduces overfitting by using a random subset of features at each split
  • Another step beyond Bagging because trees are less correlated.
  • More complex and a little harder to interpret than a single decision tree
  • More computationally expensive
  • Individual tree models within Extra Trees are even less correlated to each other
  • Faster than Random Forest
  • Trained on the entire dataset instead of just bootstrapped samples
  • Can increase bias because of addition of random elements
  • Random node splitting instead of optimal split
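The two sources of randomness mentioned above can be sketched in a few lines. This is only an illustration: the helper names are hypothetical, and using roughly the square root of the number of features as the subset size is an assumption (a common default for classification), not something this post specifies.

```python
import math
import random

def random_feature_subset(n_features, rng):
    """Random Forest style: pick about sqrt(n_features) candidate features
    (by index) to consider for one split."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

def random_threshold(values, rng):
    """Extra Trees style: draw a split threshold uniformly between the feature's
    min and max instead of searching for the optimal one."""
    return rng.uniform(min(values), max(values))

rng = random.Random(42)
subset = random_feature_subset(16, rng)
cut = random_threshold([0.5, 1.2, 3.3, 4.8], rng)
print(subset)
print(cut)
```

Skipping the search for the optimal threshold is what makes the random-split approach faster, at the cost of some extra bias.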

Gradient Descent

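Gradient descent is used to optimize loss functions. It repeatedly nudges the model parameters in the direction opposite the gradient of the loss, scaled by a learning rate, until the loss stops improving. Typical loss functions include Mean Squared Error, Mean Absolute Error, and Log-loss (also called Binary Cross Entropy Loss).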
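Here is a minimal sketch of gradient descent minimizing Mean Squared Error for a one-feature linear model. The learning rate, step count, and toy data are assumptions chosen so the example converges quickly.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to w and b.
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))  # approaches w = 2, b = 1
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small just takes many more steps to get to the same place.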


Boosting

AdaBoost is an iterative ensemble model whose base estimator is often the stump of a Decision Tree (a decision tree with a depth of 1 or 2). Each round, the model predicts on the training data and gives higher weights to the observations it got wrong. It then fits the next estimator using the new weights, and continues until it either fits with minimal error or reaches the maximum number of estimators. Hyperparameters to tune are max depth, learning rate, & number of estimators.

  • Computationally inexpensive because it uses an ensemble of weaker predictors
  • Sensitive to outliers because it up-weights every incorrect prediction
  • Not interpretable
  • Lots of flexibility
  • Can optimize on different loss functions
  • Computationally expensive
  • Can be prone to overfitting because it’s designed to minimize all errors
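The reweighting step in AdaBoost can be sketched as below for labels coded as -1 and +1. This follows the classic AdaBoost update and assumes the weighted error is strictly between 0 and 1; the function name and the stump predictions are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round for labels in {-1, +1}: returns (alpha, new_weights)."""
    # Weighted error: total weight of the misclassified observations.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # alpha is how much say this estimator gets in the final vote.
    alpha = 0.5 * math.log((1 - error) / error)
    # t * p is +1 when correct (weight shrinks) and -1 when wrong (weight grows).
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]  # hypothetical stump output; the second point is wrong
weights = [0.2] * 5
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha)
print(new_weights)  # the misclassified point now carries half of the total weight
```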


SVMs

SVM stands for Support Vector Machine. SVMs are a type of classification model called discriminant models. The model attempts to find a separating hyperplane (also called a linear discriminant) that creates the maximal margin: the widest gap between the classes that has no points inside.

  • Exceptional performance
  • Effective in high dimensional data
  • Works with non-linear data
  • Computationally expensive
  • SVMs always find a hyperplane
  • Results are not interpretable
  • Have to scale the data
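To illustrate what "maximal margin" means, the sketch below measures the margin of two candidate hyperplanes on the same toy data. It does not train an SVM; the hyperplanes, points, and function name are made up for illustration.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.
    It is positive only if every point is on the correct side."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        label * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, label in zip(points, labels)
    )

points = [(1.0, 1.0), (2.0, 0.0), (4.0, 4.0), (5.0, 3.0)]
labels = [-1, -1, 1, 1]
margin_a = geometric_margin((1.0, 1.0), -5.0, points, labels)  # line x + y = 5
margin_b = geometric_margin((1.0, 0.0), -3.0, points, labels)  # line x = 3
print(margin_a, margin_b)
```

Both lines separate the two classes, but the first leaves a wider empty gap, so it is the one a maximal-margin classifier would prefer.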

Neural Nets

Intro to Neural Nets

The first type of Neural Net we will cover is called a feed-forward neural net. This type of neural net takes in our input features at the input nodes and passes a weighted sum of them through an activation function to a hidden layer. Each hidden layer takes the outputs of the previous layer’s nodes and passes them through an activation function to the next hidden layer, and so on. The last hidden layer passes its outputs to the output layer, which transforms them into a prediction.

  • Number of hidden layers
  • Number of nodes in each layer
  • Choice of activation functions
  • Loss function
  • Good results
  • Flexible set up
  • Can overfit to training data
  • Computationally expensive
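A forward pass through a tiny feed-forward net can be sketched as follows. The layer sizes, weights, and the choice of ReLU and sigmoid activations are all hypothetical; a real network would learn its weights by minimizing the loss function with gradient descent.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: weights[j] holds the weights feeding node j."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]

def forward(features):
    # Hypothetical, hand-picked weights: 2 inputs -> 3 hidden nodes -> 1 output.
    hidden = dense(features, [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1], relu)
    output = dense(hidden, [[1.0, -1.0, 0.5]], [0.0], sigmoid)
    return output[0]

prediction = forward([1.0, 2.0])
print(prediction)  # a value between 0 and 1, readable as a class probability
```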

Regularizing Neural Nets

To combat overfitting we can use regularization when building neural nets. Common methods are L1 & L2 regularization (as in Linear Regression models), Dropout, and Early Stopping.
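Of these, Early Stopping is the easiest to sketch without a deep learning library. The snippet below assumes we already have a list of validation losses, one per epoch; the function name, the `patience` parameter, and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch): training stops once validation loss has
    failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Never triggered: training ran through every epoch.
    return best_epoch, len(val_losses) - 1

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, stop_epoch = early_stopping_epoch(val_losses)
print(best_epoch, stop_epoch)  # best at epoch 3, stops at epoch 5
```

The weights from the best epoch would then be kept, since later epochs were only fitting the training data more tightly.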


RNNs

Recurrent Neural Networks are sequential and take time into account. Instead of a layer only passing its information forward, it also takes into account past observations that influence the current observation.
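The way past observations carry forward can be sketched with a single scalar recurrent cell. The weights here are hypothetical, and real RNN layers use weight matrices and vector hidden states rather than single numbers.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Scalar recurrent cell: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    h = 0.0  # hidden state starts empty
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

states_a = rnn_forward([1.0, 0.0, 0.0])
states_b = rnn_forward([0.0, 0.0, 0.0])
print(states_a)
print(states_b)
```

The last two inputs are identical in both sequences, yet the final states differ because the first sequence still "remembers" its earlier input.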


CNNs

Convolutional Neural Networks use a convolution layer in which a filter passes over the matrix grid of data, multiplies element-wise with each patch it covers, and sums the products to get the resulting matrix.
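A minimal sketch of that sliding operation, assuming no padding and a stride of 1, is shown below. The image and the vertical-edge filter are toy values, and (as in most deep learning libraries) the filter is applied without flipping.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1); each output cell is the
    sum of the element-wise products between the kernel and the patch under it."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1], [-1, 1]]
feature_map = conv2d_valid(image, vertical_edge)
print(feature_map)  # [[0, 2, 0], [0, 2, 0]]
```

The filter only responds where the image changes from dark to light, which is how a convolution layer picks out local patterns such as edges.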


GANs

Generative Adversarial Networks are used to create new data. Two neural networks are trained together: the generator creates new data, and the discriminator classifies data as real or fake.
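The tug-of-war between the two networks shows up in their loss functions. The sketch below uses the standard binary cross-entropy discriminator loss and the commonly used non-saturating generator loss; the function names and the discriminator outputs fed in are hypothetical, and no actual networks are trained here.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score near 1, generated near 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator is fooled."""
    return -math.log(d_fake)

# d_real / d_fake are hypothetical discriminator outputs (probability of "real").
confident_d = discriminator_loss(d_real=0.9, d_fake=0.1)
fooled_d = discriminator_loss(d_real=0.6, d_fake=0.7)
print(confident_d, fooled_d)
print(generator_loss(0.1), generator_loss(0.7))
```

When the discriminator is fooled its loss goes up while the generator's goes down, so each network improves by making the other's job harder.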

Intro to Unsupervised Learning

All of the above deals with supervised learning, which means there is a target to predict. Unsupervised learning has no target and instead tries to represent the data in new ways by finding relationships. This can be used in transfer learning by taking the results of unsupervised learning to create new features for supervised learning models.

K Means

K-Means is a clustering method where k is the number of clusters and means refers to the mean point (centroid) of each cluster. Because we’re working with distances, it is important to first scale the data. The user determines the number of clusters from the outset, so it helps to have some familiarity with the subject. The model assigns each point to the cluster with the nearest centroid, which splits the data along linear boundaries. It then updates by moving the centroids so that the total sum of the squared distances from each point to the centroid of its cluster is minimized, and repeats until the assignments stop changing.

  • Guaranteed to converge
  • Fast
  • Simple and easy to implement
  • Very sensitive to outliers because each centroid is the mean of all the points in its cluster
  • Convergence on the global minimum versus a local minimum depends on initialization of centroids. It is recommended to start the centroids distant from each other and run the model several times to verify.
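The assign-then-update loop can be sketched in plain Python for 2-D points. This is a bare-bones version of the standard algorithm (often called Lloyd's algorithm) with hypothetical starting centroids and toy data, and it skips the scaling step for brevity.

```python
def kmeans(points, centroids, n_iters=10):
    """Basic K-Means on 2-D points; returns (centroids, labels)."""
    labels = []
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append((sum(p[0] for p in members) / len(members),
                                      sum(p[1] for p in members) / len(members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return centroids, labels

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)     # [0, 0, 0, 1, 1, 1]
print(centroids)
```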

DB Scan

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is another clustering method that uses density to create clusters. As with K-Means, scaling the data is required. Its hyperparameters that need specifying are epsilon, the search distance radius, and the minimum samples, the number of points within that radius needed to be considered a cluster.

  • Can detect patterns that may not be found by K-Means
  • Don’t need to specify the number of clusters
  • Great for identifying outliers
  • Fixed epsilon
  • If features overlap they will be grouped into the same cluster
  • Doesn’t work well when the clusters are of varying density
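Here is a compact sketch of the DBSCAN idea for 2-D points, using `eps` and `min_samples` as hypothetical parameter names for epsilon and minimum samples. It assumes a point counts as its own neighbor and uses a brute-force neighbor search, so it is only suitable for small toy datasets.

```python
def dbscan(points, eps, min_samples):
    """Minimal DBSCAN on 2-D points. Returns one label per point; -1 means noise."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, so keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
cluster_labels = dbscan(points, eps=1.5, min_samples=3)
print(cluster_labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Notice that the number of clusters was never specified, and the isolated point at (50, 50) is flagged as noise rather than forced into a cluster.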


PCA

Principal Component Analysis is a method of feature extraction, which is combining features in a particular way. The new features (principal components) are ranked by how much of the data’s variance they explain, and feature selection is performed by dropping the least important ones. PCA is sensitive to the scale of each feature, so once again data will need to be scaled.
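For two features, the ranking of components can be sketched with the closed-form eigenvalues of the 2x2 covariance matrix. The function name and toy data are hypothetical, and the sketch only centers the data; in practice the features would be scaled first and a library routine would handle more than two dimensions.

```python
import math

def pca_2d(points):
    """PCA for 2-D data via the closed-form eigenvalues of the 2x2 covariance matrix.
    Returns (explained_variance_ratio, first_component)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(p[0] - mean_x, p[1] - mean_y) for p in points]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    half_gap = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    largest, smallest = (a + c) / 2 + half_gap, (a + c) / 2 - half_gap
    # Eigenvector for the largest eigenvalue: (b, largest - a), or an axis if b == 0.
    if b != 0:
        vec = (b, largest - a)
    else:
        vec = (1.0, 0.0) if a >= c else (0.0, 1.0)
    length = math.hypot(*vec)
    component = (vec[0] / length, vec[1] / length)
    return (largest / (a + c), smallest / (a + c)), component

points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]  # roughly y = 2x
ratio, component = pca_2d(points)
print(ratio)      # the first component explains almost all of the variance
print(component)  # points roughly along the direction (1, 2)
```

Since the second component explains almost none of the variance here, dropping it would reduce two features to one with very little information lost.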

Recommender Systems to come in a later post



David Holcomb


I am a data scientist, leveraging my experience in residential architecture to apply creativity to data informed solutions and graphically communicate results.