Advanced Supervised Learning, Neural Nets, & Unsupervised Model Overviews
In this blog post I will review some terms and overall concepts for 3 areas in Data Science: Advanced Supervised Learning, Neural Nets, & an Intro into Unsupervised Learning topics. It’s mostly for my own review but hopefully someone out there will find it helpful as well.
Advanced Supervised Learning
- Bootstrapping & Bagging
- Random Forest
- Gradient Descent
- Intro to Neural Nets
- Regularizing Neural Nets
Intro to Unsupervised Learning
- K Means
- DB Scan
- Recommender Systems to come in a later post
Advanced Supervised Learning
Getting a little fancier than Linear Regression and KNN with these models. Starting from the more basic and adding complexity as we go through the list.
CART stands for Classification and Regression Tree, also called a Decision Tree. So, what’s a decision tree and how does it help predict a target variable?
We actually start at the top of the tree, which is called the root node. All of the training data, with all of its features, is passed in at that root node. The model calculates the Gini impurity for each possible split, and splits the node on whichever one reduces the impurity the most (that is, the split with the lowest resulting impurity, not the highest).
The Gini impurity score is the probability that a randomly chosen observation in a node would be mislabeled if it were labeled at random according to the class proportions in that node. In math terms it’s 1 minus the sum of the squared probability of each class. After making that first split on whichever feature lowers the impurity the most, the model moves on to the next level and continues the process: calculating the Gini impurity for every candidate split and splitting on the one that reduces it the most, until the nodes are pure or a stopping rule such as a maximum depth is reached. Those final nodes are the leaf nodes.
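To make the formula concrete, here is a minimal sketch of the Gini calculation in plain Python. The function names are my own (hypothetical), and the split score used here is the size-weighted average impurity of the two child nodes, which is the usual way to compare candidate splits.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of the squared class proportions in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_gini_after_split(left, right):
    """Size-weighted average impurity of the two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


parent = ["a", "a", "a", "b", "b", "b"]
print(gini_impurity(parent))  # 0.5, the worst case for two balanced classes

# A perfect split leaves two pure children, so the impurity drops to 0.
print(weighted_gini_after_split(["a", "a", "a"], ["b", "b", "b"]))  # 0.0

# A poor split barely helps (about 0.444), so the tree would prefer the first one.
print(weighted_gini_after_split(["a", "a", "b"], ["a", "b", "b"]))
```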
Pros to single Decision Trees:
- Don’t have to scale our data
- Nonparametric, so don’t make assumptions about data distributions
- Easy to interpret: Can map out decisions made by the tree & view feature importance
- Speed: They fit very quickly
Cons to single Decision Trees:
- Pretty much guaranteed to overfit on training data
- Locally optimal. Because Decision Trees make the best decision at the local level, we may end up with a worse overall solution.
- Don’t work well with unbalanced data.
Bootstrapping & Bagging
Bootstrapping is random resampling with replacement. So, we draw observations at random to make a new dataset the same size as the original. But when an observation is added to the new set it’s not removed from the original, so the same observation can be added multiple times (or not at all). Repeat this several times and you have a set of bootstrapped samples to fit models on.
The model most often fit on bootstrapped samples is a decision tree, but you could use any model you choose. The result is called an ensemble model because the final prediction is made up of the results of multiple individual models.
A separate model is fit on each bootstrapped sample. In a classification problem the majority prediction is taken as the output prediction; in a regression problem the predictions are averaged.
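The two mechanical pieces here, resampling with replacement and majority voting, can be sketched in a few lines. This is only an illustration using the standard library; `bootstrap_sample` and `majority_vote` are hypothetical helper names, and the model fitting step in between is left out.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement, so duplicates are expected."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most common prediction across the individual models."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)  # seeded so the example is repeatable
data = list(range(10))
sample = bootstrap_sample(data, rng)

print(sample)                        # same length as data, some values repeated
print(sorted(set(data) - set(sample)))  # values that were never drawn

# Pretend three fitted models each predicted a class for one observation.
print(majority_vote(["spam", "ham", "spam"]))  # spam
```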
Bagging is Bootstrapping plus Aggregating. It is generally used with decision trees, and is the process of aggregating the predictions of the bootstrapped models for each observation.
Put them together while using a CART in order to reduce the model’s variance, which narrows the gap between training and testing scores.
Pros to Bagging:
- Can get closer to a true function by averaging out multiple model results
- Wisdom of the crowd
- Can lower variance by exposing the model to differing samples of the training data
Cons to Bagging:
- It is harder to interpret because it’s running multiple models at once
- Computationally more expensive
- Trees are highly correlated so still can have high variance
Random Forest & Extra Trees
Random Forest is another ensemble model made up of, you probably guessed, many single Decision Trees. It also trains on bootstrapped samples, like the Bagging method. The difference between Bagging and a Random Forest is that at each split in each tree of the Random Forest, the model only considers a random subset of the features and picks the split among them that reduces the Gini impurity the most. At every new split it draws a fresh random subset of features and continues the process.
Pros to Random Forest:
- Reduces overfitting by using a random subset of features at each split
- Another step beyond Bagging because trees are less correlated.
Cons to Random Forest:
- More complex and a little harder to interpret than a single decision tree
- More computationally expensive
Extremely Randomized Trees or Extra Trees add another layer of randomness to the Random Forest. They still get a random subset of features, but instead of searching for the threshold with the best Gini impurity score, they draw a random threshold from each candidate feature’s empirical range and use the best of those random thresholds as the split.
Pros to Extra Trees:
- Individual tree models within Extra Trees are even less correlated to each other
- Faster than Random Forest
- Trained on the entire dataset instead of just bootstrapped samples
Cons to Extra Trees:
- Can increase bias because of addition of random elements
- Random node splitting instead of optimal split
Gradient descent is used to minimize loss functions. Typical loss functions include Mean Squared Error, Mean Absolute Error, and Log-loss (also called Binary Cross-Entropy Loss).
Starting at a random point on the loss curve, the model computes the derivative and takes a step downhill, in the direction opposite the derivative. The model repeats the process until it reaches or gets really close to a minimum point (where the derivative equals 0).
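Here is a minimal sketch of that loop on a one-variable loss, assuming the simple loss (x - 3)^2 whose minimum is at x = 3. The function name, the `learning_rate` and `tol` parameters, and the example loss are all my own choices for illustration. The second call shows what happens when the step size is too large.

```python
def gradient_descent(grad, start, learning_rate=0.1, n_steps=100, tol=1e-8):
    """Repeatedly step against the derivative until it is (nearly) zero."""
    x = start
    for _ in range(n_steps):
        g = grad(x)
        if abs(g) < tol:  # derivative is close to 0, so we are at a minimum
            break
        x = x - learning_rate * g
    return x


def loss_gradient(x):
    """Derivative of the example loss (x - 3)^2."""
    return 2 * (x - 3)


converged = gradient_descent(loss_gradient, start=10.0)
print(converged)  # very close to 3

# With a step size that is too large, every step overshoots further.
diverged = gradient_descent(loss_gradient, start=10.0, learning_rate=1.1, n_steps=50)
print(diverged)  # far away from 3
```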
One thing to keep in mind is step size: if the steps are too large the model can overshoot the minimum and never converge, and if they are too small it converges very slowly. Another is that the model might find a local minimum rather than the absolute minimum. Next we’ll take a look at actual model examples:
AdaBoost is an iterative ensemble model whose base estimator is often a decision stump (a decision tree with a depth of 1 or 2). Each round, the model is evaluated on the training data and the observations it got wrong are given higher weights, so the next estimator focuses on them. It then continues through the process with the new weights until it either fits with minimal error or reaches the maximum number of estimators. Hyperparameters to tune are max depth, learning rate, & number of estimators.
Pros to AdaBoost:
- Computationally inexpensive because it uses an ensemble of weaker predictors
Cons to AdaBoost:
- Sensitive to outliers because it increases the weight of every incorrect prediction
- Not interpretable
Gradient Boost is another ensemble learning method that builds a strong model out of a sequence of weaker ones. The difference between Gradient Boost and AdaBoost is that instead of reweighting incorrect predictions, each new model is fit on the residuals of the previous fit, and the process continues to minimize the error.
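The idea of fitting on residuals can be sketched for a regression problem with a single feature. This is a simplified illustration under my own assumptions (squared error loss, depth-1 stumps, at least two distinct feature values, hypothetical function names), not how any particular library implements it.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on a 1-D feature that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
baseline = lambda x: sum(ys) / len(ys)  # predicting the mean everywhere
boosted = gradient_boost(xs, ys)

print(mean_squared_error(baseline, xs, ys))
print(mean_squared_error(boosted, xs, ys))  # much smaller than the baseline
```

The learning rate shrinks each stump’s contribution, which is why many small corrections are needed instead of one big one.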
Pros to Gradient Boost:
- Lots of flexibility
- Can optimize on different loss functions
Cons to Gradient Boost:
- Computationally expensive
- Can be prone to overfitting because it’s designed to minimize all errors
SVM stands for Support Vector Machine. SVMs are a type of classification model called discriminant models. The model attempts to find a separating hyperplane (also called a linear discriminant) that creates the maximal margin, which is the widest band between the classes that has no points inside.
The points touching the margin are called the support vectors. Because we are working with distances between the points, scaling data before fitting is a must!
Soft margin SVMs still attempt to find a separating hyperplane and corresponding pair of margins bounded by the support vectors, but they allow some points to fall on the wrong side. The model measures the distance from each wrong point back to its margin. This distance is a slack variable, called eta here (many texts write it as ξ). The model then attempts to maximize the margins such that the sum of the etas is within an acceptable range.
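For a given hyperplane, the slack for each point can be computed directly. The sketch below assumes the usual formulation where labels are +1 or -1 and the margins sit at a score of +1 and -1, so the slack is max(0, 1 - y * score). Note that this measures the violation in units of the score rather than as a raw geometric distance, and the weights `w` and bias `b` are made-up values, not a fitted model.

```python
def slack(w, b, x, y):
    """Margin violation for one point; y must be +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


w, b = [1.0, 0.0], 0.0  # hypothetical hyperplane: the vertical line x1 = 0

print(slack(w, b, [2.0, 0.0], +1))   # 0.0, correct side and outside the margin
print(slack(w, b, [0.5, 0.0], +1))   # 0.5, correct side but inside the margin
print(slack(w, b, [-1.0, 0.0], +1))  # 2.0, wrong side of the hyperplane
```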
Kernel SVMs also attempt to find a separating hyperplane. However, we kernelize our model, which effectively curves the axes that the model is fit on. The idea is to increase the dimensionality of the data so that classes that are not linearly separable in the original space become separable. Three common choices are the Linear Kernel, which does nothing, the Polynomial Kernel, which raises to a power, and the Radial Basis Function (RBF) Kernel, which bends the axes to find a region of similarity. The RBF kernel is the go-to if the Linear do-nothing kernel doesn’t work.
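Each kernel is just a function that scores how similar two points are. A minimal sketch of the three mentioned above, using their common textbook forms; the parameter names `degree`, `coef0`, and `gamma` are the conventional ones, and the default values here are arbitrary.

```python
import math


def linear_kernel(a, b):
    """Plain dot product: the 'do nothing' kernel."""
    return sum(x * y for x, y in zip(a, b))


def polynomial_kernel(a, b, degree=2, coef0=1.0):
    """Dot product shifted by coef0 and raised to a power."""
    return (linear_kernel(a, b) + coef0) ** degree


def rbf_kernel(a, b, gamma=1.0):
    """Similarity that decays with squared distance: 1 for identical points."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


p, q = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(p, q))      # 11.0
print(polynomial_kernel(p, q))  # 144.0, which is (11 + 1) ** 2
print(rbf_kernel(p, p))         # 1.0
print(rbf_kernel(p, q))         # close to 0 because the points are far apart
```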
Pros to SVMs:
- Exceptional performance
- Effective in high dimensional data
- Works with non-linear data
Cons to SVMs:
- Computationally expensive
- SVMs always find a hyperplane
- Results are not interpretable
- Have to scale the data
Intro to Neural Nets
The first type of Neural Net we will cover is called a feed-forward neural net. This type of neural net takes in our input features at the input nodes, multiplies them by weights, and passes the result through an activation function to a hidden layer. The hidden layer then does the same with the information from the previous layer’s nodes and passes it to the next hidden layer, and so on. The last hidden layer passes its information to the output layer, which transforms it and makes a prediction.
A neural network is dense if all the nodes of each layer pass on information to every node in the following layer.
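A forward pass through a tiny dense network can be written out by hand. This is a minimal sketch with made-up weights and biases (nothing here is trained), one hidden layer with ReLU, and a sigmoid output; `dense` is a hypothetical helper name.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """Every node sees every input: weights[j] holds the incoming weights of node j."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


features = [1.0, 2.0]

# Hidden layer with two nodes.
hidden = dense(features, [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0], relu)
print(hidden)  # [0.0, 2.0]: the first node's -0.5 is clipped to 0 by ReLU

# Output layer with one node, squashed to a 0 to 1 probability.
output = dense(hidden, [[1.0, 0.5]], [0.0], sigmoid)
print(output)  # roughly [0.731], which is sigmoid(1.0)
```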
The term deep learning comes from stacking many hidden layers between the input and the output; you can have as many hidden layers as you want.
When creating a neural network things to consider are:
- Number of hidden layers
- Number of nodes in each layer
- Choice of activation functions
- Loss function
More hidden layers add complexity to the relationships learned. More hidden nodes mean more relationships learned that get aggregated into a final prediction. This might sound great at first, but can easily lead to overfitting.
A good place to start when selecting the number of nodes in each layer is: for the input layer pick a power of 2 that is equal to, or the next power of 2 higher than, the number of features that will be passed in. The number of nodes in a hidden layer is also generally a power of 2. You also want the number of nodes in the output layer to match the number of results you want in your prediction (so generally one for regression or binary classification, and equal to the number of target categories for multiclass classification).
Activation functions include ReLU (hidden layers), tanh (hidden layers, or an output layer on a -1 to 1 scale), softmax (output layer for multiclass classification, giving the probability for each class), sigmoid (output layer on a 0 to 1 scale for binary classification), and linear (output layer for regression).
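The output activations are easy to check by hand. A small sketch using only the standard library; subtracting the maximum inside `softmax` is a common numerical-stability trick and does not change the result.

```python
import math


def sigmoid(z):
    """Squashes any number into the 0 to 1 range."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turns a list of raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0))      # 0.5
print(math.tanh(0.0))    # 0.0, tanh output lives between -1 and 1
print(softmax([0.0, 0.0]))  # [0.5, 0.5]

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)     # the largest score gets the largest probability
```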
Neural Nets are trained with gradient descent. Typical loss functions are mean squared error for regression, binary cross-entropy for binary classification, and categorical cross-entropy for multiclass classification.
Neural nets are computationally expensive because every training step involves a forward pass, which computes the predictions and the loss, and a backward pass, which computes the gradients of the loss and updates the weights.
Ways to speed things up include using optimizers, minibatches, & getting access to a better GPU.
Optimizers used in Neural Nets include Adaptive Gradient (AdaGrad), Root Mean Square Propagation (RMSProp), and Adaptive Moment Estimation (Adam).
Minibatch uses a smaller subsampling of the full training data. Instead of running all your data at once the model processes a minibatch at a time. One run through of the full set is called an epoch. Minibatch sizes are usually a power of 2.
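A minibatch loop is mostly bookkeeping. In this sketch (the generator name `minibatches` is my own, and the batch size of 4 is chosen only to keep the output short), the rows are shuffled once per epoch and then handed out in chunks; the last chunk can be smaller when the data does not divide evenly.

```python
import random


def minibatches(rows, batch_size, rng):
    """Yield the rows in shuffled chunks; one full pass is one epoch."""
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [rows[i] for i in indices[start:start + batch_size]]


rows = list(range(10))
rng = random.Random(0)  # seeded so the example is repeatable

for epoch in range(2):
    batches = list(minibatches(rows, batch_size=4, rng=rng))
    print(epoch, [len(batch) for batch in batches])  # sizes are [4, 4, 2] each epoch
```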
Pros to Neural Nets:
- Good results
- Flexible set up
Cons to Neural Nets:
- Can overfit to training data
- Computationally expensive
Regularizing Neural Nets
To combat overfitting we can use regularization in creating neural nets. Common ways are L1 & L2 penalties (like in regularized Linear Regression models), Dropout, and Early Stopping.
When using dropout you specify the chance that each neuron within a layer will be turned off. A new random set of nodes is turned off at each training step, but no nodes are turned off when making predictions on the testing data.
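A minimal sketch of dropout on one layer’s activations. It uses the "inverted dropout" convention, where the surviving activations are scaled up by 1 / (1 - rate) during training so that nothing needs to change at prediction time; that scaling is an assumption of this sketch rather than something the text above spells out, and the function signature is hypothetical.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Randomly zero activations while training; pass them through otherwise."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Each node survives with probability `keep` and is scaled up to compensate.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
layer_output = [1.0, 2.0, 3.0, 4.0]

print(dropout(layer_output, rate=0.5, rng=rng))                  # some values zeroed, the rest doubled
print(dropout(layer_output, rate=0.5, rng=rng, training=False))  # unchanged at prediction time
```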
Early stopping is a technique that can also prevent overfitting to the training data. When setting it up you select a metric that you would like to track (usually the loss on validation data), a minimum delta that counts as an improvement in that metric, and the number of epochs you are willing to wait without an improvement before making an early stop (called patience). The idea is that you stop the model once the validation score stops improving, before the training and validation scores split apart and the model starts to overfit.
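The stopping rule itself is a small loop. This sketch takes a list of per-epoch validation losses (made-up numbers below) and reports the epoch at which training would have stopped; the function name is hypothetical, and `patience` and `min_delta` mirror the settings described above.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:  # improved by more than min_delta
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered, so training ran to the end


val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75]

print(early_stopping_epoch(val_losses, patience=2))  # 4: two epochs in a row without beating 0.7
print(early_stopping_epoch(val_losses, patience=3))  # 5
print(early_stopping_epoch([1.0, 0.9, 0.8]))         # 2: still improving, never stopped early
```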
Recurrent Neural Networks are sequential and take the order of observations into account. Instead of a layer only passing its information forward, it also takes into account past observations that influence the current observation.
Because a recurrent network backpropagates through every time step, the gradient can shrink exponentially as it travels back toward the earliest steps (the vanishing gradient problem), so the network effectively forgets them. But data at the earlier time steps might be very important, like when an algorithm is evaluating a sentence, for example.
Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) were created to handle this memory loss. LSTMs have various gates so that important information passed in at earlier time steps stays relevant. The forget gate passes information through a sigmoid function, which outputs a value between 0 and 1. A value near 1 means the data is kept and passed on; a value near 0 means the stored value is multiplied by roughly 0 and forgotten. The input gate helps decide what information is relevant to add from the current step (the hidden state plus the current input go into a tanh function, and the result is multiplied by a sigmoid function), and the output gate determines what the next hidden state should be.
GRUs on the other hand only have update gates and reset gates. The update gate decides what information to throw away and what is added. The reset gate decides how much past information is forgotten.
This is a very brief overview, heavily inspired by this post on Towards Data Science. I’d recommend heading over there to learn in more detail the workings of LSTMs and GRUs.
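Convolutional Neural Networks use a convolution layer that slides a small filter (also called a kernel) over the matrix grid of data. At each position it multiplies the filter and the patch underneath it element by element and sums the products, and those sums form the resulting matrix.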
An example of this would be step by step sliding over the greyscale values of pixels in an image in order to find edges in the image, which would be seen as large jumps in value from one pixel to the next.
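The sliding-window calculation can be sketched in plain Python. This is a minimal version with no padding and a stride of 1, and like most deep learning libraries it does not flip the kernel. The tiny image and the horizontal-difference kernel are made up to show an edge being detected.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kernel_height, kernel_width = len(kernel), len(kernel[0])
    out_height = len(image) - kernel_height + 1
    out_width = len(image[0]) - kernel_width + 1
    return [
        [
            sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(kernel_height)
                for j in range(kernel_width)
            )
            for col in range(out_width)
        ]
        for row in range(out_height)
    ]


# Greyscale image scaled to 0-1: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [[-1, 1]]  # difference between a pixel and its right-hand neighbor

print(convolve2d(image, kernel))
# [[0, 1, 0], [0, 1, 0], [0, 1, 0]]: the 1s mark the jump from dark to bright
```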
When working with image data, color values range from 0–255 and we scale them to be from 0–1 so they play nice with our sigmoid and softmax output activation functions that operate in that range. If you choose the hyperbolic tangent as the output activation, you would scale your images to the range of -1 to 1.
The number of filters in a CNN is a tunable hyperparameter. When setting up a convolution layer there are also options to adjust its filter size, stride (step), & padding. Padding can be important when there is important information at the edges. Padding adds a border around the image, so the edges get treated like the interior.
Spatial pooling is also called downsampling or subsampling. Pooling looks at the data matrix one window at a time and finds either the maximum, average, or sum within each window. Then it writes that number into a new, smaller matrix. This is done to reduce the feature dimension.
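Max pooling is the most common choice, and it is short enough to write out. A minimal sketch with non-overlapping square windows (the window size doubles as the stride here, which is an assumption of this example); swapping `max` for an average or a sum gives the other two variants.

```python
def max_pool2d(matrix, size=2):
    """Keep only the largest value in each non-overlapping size x size window."""
    return [
        [
            max(matrix[row + i][col + j] for i in range(size) for j in range(size))
            for col in range(0, len(matrix[0]) - size + 1, size)
        ]
        for row in range(0, len(matrix) - size + 1, size)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

print(max_pool2d(feature_map))  # [[4, 2], [2, 8]]: a 4x4 grid reduced to 2x2
```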
After passing an image through convolution and pooling layers the information goes through dense layers with an activation function and to an output layer similar to what was described above in the intro to neural nets section.
Generative Adversarial Networks (GANs) are used to create new data. You have two neural networks working in conjunction: one that generates new data, called the generator, and one that classifies data as real or fake, called the discriminator.
They are trained against each other. It’s the generator’s job to create fake data that the discriminator thinks is real. As both networks train, the created data looks more and more like the original real data. They train in alternating fashion. The discriminator takes a turn and trains on a batch of half real (class 1) data and half fake (class 0) data from the generator, and then it’s the generator’s turn to train on a batch and create more data. This allows both networks to improve step by step.
When the generator is training, the weights of the discriminator are frozen (not updated), so that only the generator learns from that step.
There is no overall objective function to evaluate a GAN but both the discriminator and the generator have their own respective loss functions.
Intro to Unsupervised Learning
All of the above deals with supervised learning, which means there is a target variable. Unsupervised learning has no target and instead tries to represent the data in new ways by finding relationships. This can be used in transfer learning by taking the results of unsupervised learning to create new features for supervised learning models.
K-Means is a clustering method where k is the number of clusters and means refers to the mean point of each cluster. Because we’re working with distances it is important to first scale the data. The user determines the number of clusters from the outset, so it helps if there is some familiarity with the subject. The model assigns each point to its nearest center point, or centroid, which splits the data into clusters along linear boundaries. The model then updates by moving each centroid to the mean of its cluster, and repeats both steps so that the total sum of the squared distances from each point to the centroid of its cluster is minimized.
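The assign-then-update loop can be sketched in plain Python. This is a bare-bones illustration with hand-picked starting centroids and made-up points, not a replacement for a library implementation (which would also handle smarter initialization); the function names are my own.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, n_iter=100):
    """Alternate between assigning points and moving centroids until nothing changes."""
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old centroid.
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

print(centroids)  # roughly (1.33, 1.33) and (8.33, 8.33)
print(clusters)   # the three low points together, the three high points together
```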
Each point gets a silhouette score. The score is high for a given point if that point is similar to other points in its cluster (cohesion) and dissimilar to points in neighboring clusters (separation).
For a single point it’s calculated as (b - a) / max(a, b), where a is the average distance from the point to the other points in its own cluster and b is the average distance from the point to the points in the nearest neighboring cluster. The score ranges from -1 to 1, and the score for the whole model is the average over all points.
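Here is a small sketch of the silhouette score for one point, using the standard definition s = (b - a) / max(a, b), where a is the mean distance to the other points in the point’s own cluster and b is the mean distance to the points in the nearest other cluster. The function signature is hypothetical, and `math.dist` needs Python 3.8 or newer.

```python
import math


def silhouette(point, own_cluster, other_clusters):
    """Silhouette score for one point; own_cluster must not include the point itself."""
    a = sum(math.dist(point, p) for p in own_cluster) / len(own_cluster)
    b = min(
        sum(math.dist(point, p) for p in cluster) / len(cluster)
        for cluster in other_clusters
    )
    return (b - a) / max(a, b)


# One close neighbor in its own cluster (a = 1), a far-away other cluster (b = 5).
score = silhouette((0.0, 0.0), [(0.0, 1.0)], [[(0.0, 4.0), (0.0, 6.0)]])
print(score)  # 0.8: well inside its own cluster

# A point that sits closer to the other cluster gets a negative score.
print(silhouette((0.0, 0.0), [(0.0, 5.0)], [[(0.0, 1.0)]]))  # -0.8
```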
Pros to K-Means:
- Guaranteed to converge
- Simple and easy to implement
Cons to K-Means:
- Very sensitive to outliers because each centroid is the mean of its cluster, and outliers pull the mean toward them
- Convergence on the absolute minimum versus a local minimum depends on initialization of centroids. It is recommended to start the centroids distant from each other and run the model several times to verify.
DBScan stands for Density-Based Spatial Clustering of Applications with Noise. It is another clustering method that uses density to create clusters. As with K-Means, scaling the data is required. Its hyperparameters that need specifying are epsilon, the search distance radius, and the minimum number of samples it takes to be considered a cluster.
Starting from a random point, if the model finds at least the minimum number of samples within the epsilon distance it creates a cluster. It keeps searching and expanding that cluster until it has found all the points that fit in that cluster before picking a new random point and beginning again. Points that never end up in a cluster are labeled as noise.
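The expand-until-done procedure can be sketched in plain Python. A few assumptions in this sketch: points are visited in index order rather than at random (the clusters found are the same for core points), a point counts as its own neighbor, noise is labeled -1, and the function name and toy data are my own. `math.dist` needs Python 3.8 or newer.

```python
import math


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already placed in a cluster or marked as noise
        found = neighbors(i)
        if len(found) < min_samples:
            labels[i] = -1  # not dense enough; may still be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in found if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:  # j is dense too, so keep expanding
                queue.extend(j_neighbors)
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]: two clusters and one outlier
```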
Pros to DBScan:
- Can detect patterns that may not be found by K-Means
- Don’t need to specify the number of clusters
- Great for identifying outliers
Cons to DBScan:
- Fixed epsilon
- If features overlap they will be grouped into the same cluster
- Doesn’t work well when the clusters are of varying density
Principal Component Analysis is a method of feature extraction, which means combining the existing features in a particular way to create new ones (the principal components). The components are then ranked by how much of the variation in the data they explain, and the dimensionality is reduced by dropping the least important ones. Once again we are dealing with distances so data will need to be scaled.
The data is plotted and the model finds the center of the data and places that at the origin. Then it draws a random line through the origin and projects the points onto the line. The line is rotated until it finds the maximum sum of squared distances from the projected points to the origin. That best fit line is called PC1. PC2 is orthogonal to PC1.
The PC1 line, scaled so it has a length of 1 (a unit vector), is called the Eigenvector for PC1. The sum of the squared distances for PC1 (as calculated above) is the Eigenvalue for PC1.
So the eigenvectors tell you the most important directions of the data and the eigenvalues tell you the magnitude of how important each principal component is.
We can then take the ratio of the eigenvalue of PC1 to the sum of all the eigenvalues to determine how much of the total variation in the data PC1 accounts for.
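For two features the whole calculation fits in a short function, because a 2x2 symmetric matrix has closed-form eigenvalues. This sketch works from the covariance matrix of the centered data, so its eigenvalues are the sums of squared distances described above divided by n - 1; the ratio between them, which is what we care about, is unchanged. The function name and the toy data are my own.

```python
import math


def pca_2d(points):
    """Return the PC1 unit vector and both eigenvalues for 2-D data."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]  # move the center to the origin

    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalue_1, eigenvalue_2 = half_trace + root, half_trace - root

    # Eigenvector for the larger eigenvalue, scaled to length 1.
    if sxy != 0:
        vx, vy = eigenvalue_1 - syy, sxy
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    length = math.hypot(vx, vy)
    return (vx / length, vy / length), (eigenvalue_1, eigenvalue_2)


# Points lying exactly on the line y = x, so one direction explains everything.
pc1, (eigenvalue_1, eigenvalue_2) = pca_2d([(1, 1), (2, 2), (3, 3), (4, 4)])
explained_ratio = eigenvalue_1 / (eigenvalue_1 + eigenvalue_2)

print(pc1)              # roughly (0.707, 0.707), the diagonal direction
print(explained_ratio)  # roughly 1.0, PC1 accounts for all of the variation
```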
PCA is useful when there are variables with high multicollinearity or when there are many features.
PCA assumes linear relationships between features because it’s finding the best linear vector.
For a 22 minute overview video with visuals and a lot better explanation with examples, check out the StatQuest PCA Step by Step video.