# Guide To Gradient Descent And Its Variants With Python Implementation


This article was published as a part of the Data Science Blogathon

Table of Contents:

Need for Optimization

Gradient Descent

Stochastic Gradient Descent (SGD)

Mini-batch Gradient Descent

Momentum-based Gradient Descent

Adagrad (short for adaptive gradient)


Adadelta

Adam (Adaptive Moment Estimation)


Need for Optimization

The main purpose of machine learning or deep learning is to create a model that performs well and gives accurate predictions in a particular set of cases. In order to achieve that, we use optimization.

Optimization starts with defining some kind of loss function/cost function (objective function) and ends with minimizing it using one or the other optimization routine. The choice of an optimization algorithm can make a difference between getting a good accuracy in hours or days.

To know more about the Optimization algorithm refer to this article

1. Gradient Descent

Gradient descent is one of the most popular and widely used optimization algorithms.

Gradient descent is not only applicable to neural networks but is also used in situations where we need to find the minimum of the objective function.

Python Implementation:

Note: We will be using MSE (Mean Squared Error) as the loss function.

We generate some random data points with 500 rows and 2 columns (x and y) and use them for training

import numpy as np

data = np.random.randn(500, 2)  # column one = X values; column two = Y values
theta = np.zeros(2)             # model parameters (weights)

Calculate the loss function using MSE

def loss_function(data, theta):
    # get m and b
    m = theta[0]
    b = theta[1]
    loss = 0
    # on each data point
    for i in range(0, len(data)):
        # get x and y
        x = data[i, 0]
        y = data[i, 1]
        # predict the value of y
        y_hat = (m * x) + b
        # compute loss as given in equation (2)
        loss = loss + ((y - y_hat) ** 2)
    # mean squared loss
    mean_squared_loss = loss / float(len(data))
    return mean_squared_loss

Calculate the Gradient of loss function for model parameters

def compute_gradients(data, theta):
    gradients = np.zeros(2)
    # total number of data points
    N = float(len(data))
    m = theta[0]
    b = theta[1]
    # for each data point
    for i in range(len(data)):
        x = data[i, 0]
        y = data[i, 1]
        # gradient of loss function with respect to m as given in (3)
        gradients[0] += -(2 / N) * x * (y - ((m * x) + b))
        # gradient of loss function with respect to b as given in (4)
        gradients[1] += -(2 / N) * (y - ((m * x) + b))
    # the 2/N factor already averages the gradients over the data points
    return gradients

After computing gradients, we need to update our model parameter.

theta = np.zeros(2)
gr_loss = []
for t in range(50000):
    # compute gradients
    gradients = compute_gradients(data, theta)
    # update parameters
    theta = theta - (1e-2 * gradients)
    # store the loss
    gr_loss.append(loss_function(data, theta))

2. Stochastic Gradient Descent (SGD)

In gradient descent, to perform a single parameter update, we go through all the data points in our training set. Updating the parameters only after iterating through the entire training set makes convergence very slow and increases the training time, especially when we have a large dataset. To combat this, we use Stochastic Gradient Descent (SGD).

In Stochastic Gradient Descent (SGD), we don’t wait until we have iterated through all the data points before updating the parameters; instead, we update the parameters of the model after every single data point in the training set.

Since SGD updates the parameters after every single data point, it learns the optimal parameters faster, which means faster convergence and reduced training time.
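To make this concrete, here is a minimal sketch (my own illustration, not the article's exact code) of SGD for the same y = m*x + b model, performing one parameter update per data point:

```python
import numpy as np

def sgd(data, theta, lr=1e-2, num_epochs=5):
    """SGD sketch for the y = m*x + b model: one update per data point."""
    for _ in range(num_epochs):
        np.random.shuffle(data)
        for x, y in data:
            m, b = theta
            y_hat = m * x + b
            # gradient of (y - y_hat)**2 with respect to m and b for a single point
            grad = np.array([-2 * x * (y - y_hat), -2 * (y - y_hat)])
            theta = theta - lr * grad
    return theta

rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 3 * x + 1])   # synthetic data with true m=3, b=1
theta = sgd(data, np.zeros(2))
print(theta)   # close to [3, 1]
```

With noise-free synthetic data, the learned parameters land very close to the true slope and intercept after a few epochs.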

3. Mini-batch Gradient Descent

In Mini-batch gradient descent, we update the parameters after iterating some batches of data points.

Let’s say the batch size is 10, which means that we update the parameter of the model after iterating through 10 data points instead of updating the parameter after iterating through each individual data point.

Now we will calculate the loss function and update parameters

import math

def minibatch(data, theta, lr=1e-2, minibatch_ratio=0.01, num_iterations=5000):
    loss = []
    # calculate the batch size
    minibatch_size = int(math.ceil(len(data) * minibatch_ratio))
    for t in range(num_iterations):
        # sample a batch of data by shuffling and taking the first rows
        np.random.shuffle(data)
        sample_data = data[0:minibatch_size, :]
        # compute gradients
        grad = compute_gradients(sample_data, theta)
        # update parameters
        theta = theta - (lr * grad)
        loss.append(loss_function(data, theta))
    return loss

4. Momentum-based Gradient Descent

The problem with Stochastic Gradient Descent (SGD) and Mini-batch Gradient Descent is that they oscillate during convergence.

From the above plot, we can see oscillations represented with dotted lines in the case of Mini-batch Gradient Descent.

Now you must be wondering what these oscillations are.

Momentum helps us in not taking the direction that does not lead us to convergence.

In other words, we take a fraction of the parameter update from the previous gradient step and add it to the current gradient step.

Python Implementation

def Momentum(data, theta, lr=1e-2, gamma=0.9, num_iterations=5000):
    loss = []
    # initialize vt with zeros
    vt = np.zeros(theta.shape[0])
    for t in range(num_iterations):
        # compute gradients with respect to theta
        gradients = compute_gradients(data, theta)
        # update vt as given in equation (8)
        vt = gamma * vt + lr * gradients
        # update model parameter theta as given in equation (9)
        theta = theta - vt
        # store loss of every iteration
        loss.append(loss_function(data, theta))
    return loss

From the above plot, we can see that Momentum reduces the oscillations produced in Mini-batch Gradient Descent.

5. Adagrad (short for adaptive gradient)

In the case of deep learning, we have many model parameters (Weights) and many layers to train. Our goal is to find the optimal values for all these weights.

In all of the previous methods, we observed that the learning rate was a constant value for all the parameters of the network.

Adagrad, however, adaptively sets the learning rate for each parameter, hence the name adaptive gradient.

In the Adagrad update equation, the denominator is the sum of the squares of all past gradients for the given parameter. Notice that this denominator effectively scales the learning rate:

– That is, when the sum of the squared past gradients has a high value, we are basically dividing the learning rate by a high value, so our learning rate will become less.

– Similarly, if the sum of the squared past gradients has a low value, we are dividing the learning rate by a lower value, so our learning rate value will become high.

Python Implementation

def AdaGrad(data, theta, lr=1e-2, epsilon=1e-8, num_iterations=100):
    loss = []
    # initialize gradients_sum for storing the sum of squared gradients
    gradients_sum = np.zeros(theta.shape[0])
    for t in range(num_iterations):
        # compute gradients with respect to theta
        gradients = compute_gradients(data, theta)
        # accumulate the sum of squared gradients
        gradients_sum += gradients ** 2
        # scale the gradient update by the accumulated sum
        gradient_update = gradients / (np.sqrt(gradients_sum + epsilon))
        # update model parameter according to equation (12)
        theta = theta - (lr * gradient_update)
        loss.append(loss_function(data, theta))
    return loss

As we can see that for every iteration, we are accumulating and summing all the past squared gradients. So, on every iteration, our sum of the squared past gradients value will increase. When the sum of the squared past gradient value is high, we will have a large number in the denominator. When we divide the learning rate by a very large number, then the learning rate will become very small.

That is, our learning rate keeps decreasing. When the learning rate reaches a very low value, it takes a long time to attain convergence.
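This decay is easy to see numerically. The following small illustration (my own, assuming a hypothetical constant gradient of 1.0) tracks Adagrad's effective step size, lr / sqrt(sum of squared past gradients):

```python
import numpy as np

# With a constant gradient of 1.0, Adagrad's effective step size
# lr / sqrt(sum of squared past gradients) shrinks at every iteration.
lr, eps = 1e-2, 1e-8
grad_sum = 0.0
steps = []
for t in range(1, 10001):
    grad = 1.0                  # hypothetical constant gradient
    grad_sum += grad ** 2
    steps.append(lr / (np.sqrt(grad_sum) + eps))

print(steps[0], steps[99], steps[9999])
# the step size decays roughly like lr / sqrt(t): 0.01, 0.001, 0.0001
```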

6. Adadelta

We can see that in the case of Adagrad we had a vanishing learning rate problem. To deal with this we generally use Adadelta.

In Adadelta, instead of taking the sum of all the squared past gradients, we take the exponentially decaying running average or weighted average of gradients.

Python Implementation

def AdaDelta(data, theta, gamma=0.9, epsilon=1e-5, num_iterations=500):
    loss = []
    # initialize running average of squared gradients
    E_grad2 = np.zeros(theta.shape[0])
    # initialize running average of squared parameter updates
    E_delta_theta2 = np.zeros(theta.shape[0])
    for t in range(num_iterations):
        # compute gradients of loss with respect to theta
        gradients = compute_gradients(data, theta)
        # compute running average of squared gradients as given in equation (13)
        E_grad2 = (gamma * E_grad2) + ((1. - gamma) * (gradients ** 2))
        # compute delta_theta as given in equation (14)
        delta_theta = -(np.sqrt(E_delta_theta2 + epsilon)) / (np.sqrt(E_grad2 + epsilon)) * gradients
        # compute running average of squared parameter updates as given in equation (15)
        E_delta_theta2 = (gamma * E_delta_theta2) + ((1. - gamma) * (delta_theta ** 2))
        # update the model parameter theta as given in equation (16)
        theta = theta + delta_theta
        # store the loss
        loss.append(loss_function(data, theta))
    return loss

Note: The main idea behind Adadelta and RMSprop is essentially the same: both deal with the vanishing learning rate by taking a weighted average of the squared gradients instead of their full sum.

To know more about RMSprop refer to this article

7. Adam (Adaptive Moment Estimation)

In Adam, we compute the running average of the gradients (the first moment) as well as the running average of the squared gradients (the second moment).

In the above equations, beta1 and beta2 are the decay rates of the two running averages.

From the above equation, we can see that we are combining the equations from both Momentum and RMSProp.

Python Implementation

def Adam(data, theta, lr=1e-2, beta1=0.9, beta2=0.9, epsilon=1e-6, num_iterations=1000):
    loss = []
    # initialize first moment mt
    mt = np.zeros(theta.shape[0])
    # initialize second moment vt
    vt = np.zeros(theta.shape[0])
    for t in range(num_iterations):
        # compute gradients with respect to theta
        gradients = compute_gradients(data, theta)
        # update first moment mt as given in equation (19)
        mt = beta1 * mt + (1. - beta1) * gradients
        # update second moment vt as given in equation (20)
        vt = beta2 * vt + (1. - beta2) * gradients ** 2
        # compute bias-corrected estimate of mt (21)
        mt_hat = mt / (1. - beta1 ** (t + 1))
        # compute bias-corrected estimate of vt (22)
        vt_hat = vt / (1. - beta2 ** (t + 1))
        # update the model parameter as given in (23)
        theta = theta - (lr / (np.sqrt(vt_hat) + epsilon)) * mt_hat
        loss.append(loss_function(data, theta))
    return loss


Variants Of Gradient Descent Algorithm


In a Neural Network, the Gradient Descent Algorithm is used during the backward propagation to update the parameters of the model. This article is completely focused on the variants of the Gradient Descent Algorithm in detail. Without any delay, let’s start!


Recap: What is the equation of the Gradient Descent Algorithm?

This is the update rule for the Gradient Descent algorithm:

θ = θ − α (∂J/∂θ)

Here θ is the parameter we wish to update, dJ/dθ is the partial derivative which tells us the rate of change of error on the cost function with respect to the parameter θ and α here is the Learning Rate. I hope you are familiar with these terms, if not then I would recommend you to first go through this article on Understanding Gradient Descent Algorithm.
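As a tiny numeric illustration of this update rule (my own example, not from the article), take J(θ) = θ², whose derivative is 2θ:

```python
# One step of theta = theta - alpha * dJ/dtheta on J(theta) = theta**2
theta, alpha = 4.0, 0.1
theta = theta - alpha * (2 * theta)   # 4.0 - 0.1 * 8.0 = 3.2
first = theta

# repeating the update drives theta toward the minimizer at theta = 0
for _ in range(50):
    theta = theta - alpha * (2 * theta)
print(first, theta)   # 3.2, then a value very close to 0
```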

So, this J here represents the cost function and there are multiple ways to calculate this cost. Based on the way we are calculating this cost function there are different variants of Gradient Descent.

Batch Gradient Descent

Let’s say there are a total of ‘m’ observations in a data set and we use all these observations to calculate the cost function J, then this is known as Batch Gradient Descent.

So we take the entire training set, perform forward propagation and calculate the cost function. And then we update the parameters using the rate of change of this cost function with respect to the parameters. An epoch is when the entire training set is passed through the model, forward propagation and backward propagation are performed and the parameters are updated. In batch Gradient Descent since we are using the entire training set, the parameters will be updated only once per epoch.

Stochastic Gradient Descent(SGD)

If you use a single observation to calculate the cost function it is known as Stochastic Gradient Descent, commonly abbreviated as SGD. We pass a single observation at a time, calculate the cost and update the parameters.

Let’s say we have 5 observations and each observation has three features and the values that I’ve taken are completely random.

Now if we use SGD, we will take the first observation, pass it through the neural network, calculate the error, and then update the parameters.

Then we will take the second observation and perform similar steps with it. This step will be repeated until all observations have been passed through the network and the parameters have been updated.

Each time the parameters are updated, it is known as an iteration. Here, since we have 5 observations, the parameters will be updated 5 times; that is, there will be 5 iterations. Had this been Batch Gradient Descent, we would have passed all the observations together and the parameters would have been updated only once. In the case of SGD, there will be ‘m’ iterations per epoch, where ‘m’ is the number of observations in the dataset.

So far we’ve seen that if we use the entire dataset to calculate the cost function, it is known as Batch Gradient Descent, and if we use a single observation, it is known as SGD.

Mini-batch Gradient Descent

Another type of Gradient Descent is the Mini-batch Gradient Descent. It takes a subset of the entire dataset to calculate the cost function. So if there are ‘m’ observations then the number of observations in each subset or mini-batches will be more than 1 and less than ‘m’.

Again let’s take the same example. Assume that the batch size is 2. So we’ll take the first two observations, pass them through the neural network, calculate the error and then update the parameters.

Then we will take the next two observations and perform similar steps, i.e., pass them through the network, calculate the error, and update the parameters.

Since we’re left with a single observation for the final iteration, that last batch will contain only one observation, and we will update the parameters using it.
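The batching described above can be sketched in a few lines (illustrative dummy data, not the article's dataset):

```python
import numpy as np

# Splitting 5 observations into mini-batches of size 2, as in the example:
# two full batches, then the single leftover observation.
X = np.arange(5 * 3).reshape(5, 3)   # 5 observations, 3 features (dummy values)
batch_size = 2

batches = [X[i:i + batch_size] for i in range(0, len(X), batch_size)]
print([len(b) for b in batches])     # [2, 2, 1]
```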

Comparison between Batch GD, SGD, and Mini-batch GD:

This is a brief overview of the different variants of Gradient Descent. Now let’s compare these different types with each other:

Comparison: Number of observations used for Updation

In the case of Stochastic Gradient Descent, we update the parameters after every single observation and we know that every time the weights are updated it is known as an iteration.

In the case of Mini-batch Gradient Descent, we take a subset of data and update the parameters based on every subset.

Comparison: Cost function

Now since we update the parameters using the entire data set in the case of the Batch GD, the cost function, in this case, reduces smoothly.

On the other hand, this updation in the case of SGD is not that smooth. Since we’re updating the parameters based on a single observation, there are a lot of iterations. It might also be possible that the model starts learning noise as well.

The updation of the cost function in the case of Mini-batch Gradient Descent is smoother than in SGD, since we’re not updating the parameters after every single observation but after every subset of the data.

Comparison: Computation Cost and Time

Now coming to the computation cost and time taken by these variants of Gradient Descent. Since we’ve to load the entire data set at a time, perform the forward propagation on that and calculate the error and then update the parameters, the computation cost in the case of Batch gradient descent is very high.

Computation cost per update in the case of SGD is lower than in Batch Gradient Descent, since we load only a single observation at a time, but the overall computation time increases because there are many more updates and hence many more iterations.

In the case of Mini-batch Gradient Descent, since we take a subset of the data, there are fewer iterations or updates, so the computation time is less than for SGD. Also, since we load a subset of the data rather than the entire dataset at a time, the computation cost is lower than for Batch Gradient Descent. This is why people usually prefer Mini-batch Gradient Descent; in practice, whenever we say Stochastic Gradient Descent we generally mean Mini-batch Gradient Descent.
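The comparison largely boils down to the number of updates per epoch; here is the arithmetic for the 5-observation, batch-size-2 example used earlier:

```python
import math

m, batch_size = 5, 2   # observations and mini-batch size from the example
updates_per_epoch = {
    "batch_gd":   1,                          # one update per epoch
    "sgd":        m,                          # one update per observation
    "mini_batch": math.ceil(m / batch_size),  # one update per mini-batch
}
print(updates_per_epoch)   # {'batch_gd': 1, 'sgd': 5, 'mini_batch': 3}
```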

Here is the complete Comparison Chart:

End Notes

In this article, we saw the variants of the Gradient Descent Algorithm in detail. We also compared them with each other and found that Mini-batch GD is the most commonly used variant of Gradient Descent.

If you are looking to kick start your Data Science Journey and want every topic under one roof, your search stops here. Check out Analytics Vidhya’s Certified AI & ML BlackBelt Plus Program


Beginners Guide To Convolutional Neural Network With Implementation In Python

This article was published as a part of the Data Science Blogathon

We have learned about the Artificial Neural Network and its applications in the last few articles. This blog will be all about another Deep Learning model, the Convolutional Neural Network. As always, this will be a beginner’s guide, written in such a manner that a starter in the Data Science field will be able to understand the concept, so keep on reading 🙂

1. Introduction to Convolutional Neural Network

2. Its Components

Input layer

Convolutional Layer

Pooling Layer

Fully Connected Layer

3. Practical Implementation of CNN on a dataset

Introduction to CNN

Convolutional Neural Network is a Deep Learning algorithm specially designed for working with Images and videos. It takes images as inputs, extracts and learns the features of the image, and classifies them based on the learned features.

This algorithm is inspired by the working of a part of the human brain which is the Visual Cortex. The visual Cortex is a part of the human brain which is responsible for processing visual information from the outside world. It has various layers and each layer has its own functioning i.e each layer extracts some information from the image or any visual and at last all the information received from each layer is combined and the image/visual is interpreted or classified.

Similarly, CNN has various filters, and each filter extracts some information from the image such as edges, different kinds of shapes (vertical, horizontal, round), and then all of these are combined to identify the image.

So why not simply use an ANN for images? One reason is that it is too much computation for an ANN model to train on large-size images and multiple image channels.

Another reason is that an ANN is sensitive to the location of the object in the image, i.e., if the location of the same object changes, it will not be able to classify it properly.

Components of CNN

The CNN model works in two steps: feature extraction and Classification

Feature Extraction is the phase where various filters and layers are applied to the image to extract information and features; once done, the result is passed on to the next phase, Classification, where the image is classified based on the target variable of the problem.

A typical CNN model looks like this:

Input layer

Convolution layer + Activation function

Pooling layer

Fully Connected Layer

Let’s learn about each layer in detail.

Input layer

As the name says, this is our input image, which can be grayscale or RGB. Every image is made up of pixels with values ranging from 0 to 255. We need to normalize them, i.e., scale the values to the range 0 to 1, before passing the image to the model.
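A minimal sketch of this normalization step (with made-up pixel values):

```python
import numpy as np

# Pixels range from 0 to 255; dividing by 255 rescales them to the 0-1 range.
image = np.array([[0, 64], [128, 255]], dtype=np.uint8)
normalized = image.astype(np.float32) / 255.0
print(normalized.min(), normalized.max())   # 0.0 1.0
```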

Below is the example of an input image of size 4*4 and has 3 channels i.e RGB and pixel values.

Convolution Layer

The convolution layer is the layer where the filter is applied to our input image to extract or detect its features. A filter is applied to the image multiple times and creates a feature map which helps in classifying the input image. Let’s understand this with the help of an example. For simplicity, we will take a 2D input image with normalized pixels.

In the above figure, we have an input image of size 6*6 and applied a filter of 3*3 on it to detect some features. In this example, we have applied only one filter but in practice, many such filters are applied to extract information from the image.

The result of applying the filter to the image is that we get a Feature Map of 4*4 which has some information about the input image. Many such feature maps are generated in practical applications.

Let’s get into some maths behind getting the feature map in the above image.

As presented in the above figure, in the first step the filter is applied to the green highlighted part of the image, and the pixel values of the image are multiplied with the values of the filter (as shown in the figure using lines) and then summed up to get the final value.

In the next step, the filter is shifted by one column as shown in the below figure. This jump to the next column or row is known as stride and in this example, we are taking a stride of 1 which means we are shifting by one column.

Similarly, the filter passes over the entire image and we get our final Feature Map. Once we get the feature map, an activation function is applied to it for introducing nonlinearity.

A point to note here is that the Feature map we get is smaller than the size of our image. As we increase the value of stride the size of the feature map decreases.

This is how a filter passes through the entire image with the stride of 1
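The sliding-filter process can be sketched in plain NumPy (an illustrative implementation, not the article's code). The output side length follows (n − f) / stride + 1, which is why a 3*3 filter over a 6*6 image gives a 4*4 feature map:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the filter over the image (cross-correlation, as CNN layers do)
    and return the feature map; output side = (n - f) // stride + 1."""
    n, f = image.shape[0], kernel.shape[0]
    out = (n - f) // stride + 1
    fmap = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # multiply the filter with the current patch and sum up
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            fmap[i, j] = np.sum(patch * kernel)
    return fmap

image = np.random.rand(6, 6)       # the 6*6 input from the example
kernel = np.random.rand(3, 3)      # a 3*3 filter
print(convolve2d(image, kernel).shape)   # (4, 4), matching the 4*4 feature map
# increasing the stride shrinks the map: stride 2 gives a (2, 2) map
```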

Pooling Layer

The pooling layer is applied after the Convolutional layer and is used to reduce the dimensions of the feature map which helps in preserving the important information or features of the input image and reduces the computation time.

Using pooling, a lower resolution version of input is created that still contains the large or important elements of the input image.

The most common types of pooling are Max Pooling and Average Pooling. The below figure shows how Max Pooling works. We use the feature map from the above example to apply pooling, with a pooling layer of size 2*2 and a stride of 2.

The maximum value from each highlighted area is taken, and a new version of the input of size 2*2 is obtained, so after applying pooling the dimension of the feature map has been reduced.
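Here is a small NumPy sketch of 2*2 max pooling with stride 2 (my own illustration, using made-up feature-map values):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """2*2 max pooling with stride 2: keep the maximum of each window."""
    out = (fmap.shape[0] - size) // stride + 1
    pooled = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            pooled[i, j] = fmap[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return pooled

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [0, 8, 3, 4]], dtype=float)
print(max_pool(fmap))
# [[6. 4.]
#  [8. 9.]]
```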

Fully Connected Layer

Till now we have performed the Feature Extraction steps, now comes the Classification part. The Fully connected layer (as we have in ANN) is used for classifying the input image into a label. This layer connects the information extracted from the previous steps (i.e Convolution layer and Pooling layers) to the output layer and eventually classifies the input into the desired label.

The complete process of a CNN model can be seen in the below image.

How to Implement CNN in Python?

# evaluating the model
model.evaluate(X_test, y_test)

Frequently Asked Questions

Q1. What is CNN in Python?

A. A Convolutional Neural Network (CNN) is a type of deep neural network used for image recognition and classification tasks in machine learning. Python libraries like TensorFlow, Keras, PyTorch, and Caffe provide pre-built CNN architectures and tools for building and training them on specific datasets.

Q2. What are the 4 types of CNN?

A. The four common types of Convolutional Neural Networks (CNNs) are LeNet, AlexNet, VGGNet, and ResNet. LeNet is the first CNN architecture used for handwritten digit recognition, while AlexNet, VGGNet, and ResNet are deep CNNs that achieved top performance on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

Q3. What is tensorflow in python?

A. TensorFlow is an open-source machine learning and artificial intelligence library developed by Google Brain Team. It is written in Python and provides high-level APIs like Keras, as well as low-level APIs, for building and training machine learning models. TensorFlow also offers tools for data preprocessing, visualization, and distributed computing.

End notes:

We have covered some important elements of CNN in this blog, while many are still left, such as Padding, Data Augmentation, and more details on Stride. Deep learning is a deep and never-ending topic, so I will try to discuss these in future blogs. I hope you found this article helpful and worth your time.

In the next few blogs, you can expect a detailed implementation of CNN with explanations and concepts like Data augmentation and Hyperparameter tuning.

About the Author

I am Deepanshi Dhingra, currently working as a Data Science Researcher, with knowledge of Analytics, Exploratory Data Analysis, Machine Learning, and Deep Learning. Feel free to connect with me on LinkedIn for any feedback and suggestions.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Pneumonia Detection Using Cnn With Implementation In Python

Hey there! I just finished another deep learning project a few hours ago, and now I want to share what I did. The objective of this challenge is to determine whether a person suffers from pneumonia, and if so, whether it’s caused by bacteria or a virus. Strictly speaking, this project should probably be called classification rather than detection.

Several x-ray images in the dataset used in this project.

In other words, this task is going to be a multiclass classification problem where the label names are: normal, virus, and bacteria. In order to solve this problem, I will use a CNN (Convolutional Neural Network), thanks to its excellent ability to perform image classification. Not only that, but here I also implement the image augmentation technique as an approach to improve model performance. By the way, I obtained 80% accuracy on the test data, which is pretty impressive to me.

The dataset used in this project can be downloaded from this Kaggle link. The size of the entire dataset itself is around 1 GB, so it might take a while to download. Or, we can also directly create a Kaggle Notebook and code the entire project there, so we don’t even need to download anything. Next, if you explore the dataset folder, you will see that there are 3 subfolders, namely train, test and val.

Well, I think those folder names are self-explanatory. In addition, the data in the train folder consists of 1341, 1345, and 2530 samples for normal, virus and bacteria class respectively. I think that’s all for the intro, let’s now jump into the code!

Note: I put the entire code used in this project at the end of this article.

Loading modules and train images

The very first thing to do when working with a computer vision project is to load all required modules and the image data itself. I use tqdm module to display the progress bar which you’ll see why it is useful later on. The last import I do here is ImageDataGenerator coming from the Keras module. This module is going to help us with implementing the image augmentation technique during the training process.

import os
import cv2
import pickle
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix
from keras.models import Model, load_model
from keras.layers import Dense, Input, Conv2D, MaxPool2D, Flatten
from keras.preprocessing.image import ImageDataGenerator


Next, I define two functions to load the image data from each folder. The two functions below might look identical at a glance, but there’s actually a small difference in the line that builds the labels. This is done because the filename structures in the NORMAL and PNEUMONIA folders are slightly different. Despite the difference, the other processing done by both functions is essentially the same. First, all images are resized to 200 by 200 pixels.

This is important since the images in the folders have different dimensions, while neural networks can only accept data with a fixed array size. Next, all images are stored with 3 color channels, which I think is just redundant for x-ray images. So the idea here is to convert all those color images to grayscale.

# Do not forget to include the last slash
def load_normal(norm_path):
    norm_files = np.array(os.listdir(norm_path))
    norm_labels = np.array(['normal'] * len(norm_files))
    norm_images = []
    for image in tqdm(norm_files):
        image = cv2.imread(norm_path + image)
        image = cv2.resize(image, dsize=(200, 200))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        norm_images.append(image)
    norm_images = np.array(norm_images)
    return norm_images, norm_labels

def load_pneumonia(pneu_path):
    pneu_files = np.array(os.listdir(pneu_path))
    # the class label is embedded in the filename
    pneu_labels = np.array([pneu_file.split('_')[1] for pneu_file in pneu_files])
    pneu_images = []
    for image in tqdm(pneu_files):
        image = cv2.imread(pneu_path + image)
        image = cv2.resize(image, dsize=(200, 200))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        pneu_images.append(image)
    pneu_images = np.array(pneu_images)
    return pneu_images, pneu_labels

As the two functions above have been declared, now we can just use it to load train data. If you run the code below you’ll also see why I choose to implement tqdm module in this project.

norm_images, norm_labels = load_normal('/kaggle/input/chest-xray-pneumonia/chest_xray/train/NORMAL/')

pneu_images, pneu_labels = load_pneumonia('/kaggle/input/chest-xray-pneumonia/chest_xray/train/PNEUMONIA/')

The progress bar displayed using tqdm module.

Up to this point, we already got several arrays: norm_images, norm_labels, pneu_images, and pneu_labels. The one with _images suffix indicates that it contains the preprocessed images while the array with _labels suffix shows that it stores all ground truths (a.k.a. labels). In other words, both norm_images and pneu_images are going to be our X data while the rest is going to be y data. To make things look more straightforward, I concatenate the values of those arrays and store in X_train and y_train array.

X_train = np.append(norm_images, pneu_images, axis=0)
y_train = np.append(norm_labels, pneu_labels)

The shape of the features (X) and labels (y).

By the way, I obtain the number of images of each class using the following code:

Finding out the number of unique values in our training set
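The article doesn't show that snippet, but a common way to count samples per class is np.unique (the labels below are made up for illustration):

```python
import numpy as np

# Count the samples per class in the label array; illustrative labels only.
y_train = np.array(['normal', 'bacteria', 'virus', 'bacteria', 'normal', 'bacteria'])
classes, counts = np.unique(y_train, return_counts=True)
print(dict(zip(classes, counts)))   # {'bacteria': 3, 'normal': 2, 'virus': 1}
```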

Beginner’s Guide To Image Gradient

So, the gradient helps us measure how the image changes and based on sharp changes in the intensity levels; it detects the presence of an edge. We will dive deep into it by manually computing the gradient in a moment.

Why do we need an image gradient?

Image gradient is used to extract information from an image. It is one of the fundamental building blocks in image processing and edge detection. The main application of image gradient is in edge detection. Many algorithms, such as Canny Edge Detection, use image gradients for detecting edges.

Mathematical Calculation of Image gradients

Enough talking about gradients, Let’s now look at how we compute gradients manually. Let’s take a 3*3 image and try to find an edge using an image gradient. We will start by taking a center pixel around which we want to detect the edge. We have 4 main neighbors of the center pixel, which are:

(i) P(x-1, y) left pixel

(ii) P(x+1, y) right pixel

(iii) P(x, y-1) top pixel

(iv) P(x, y+1) bottom pixel

We will subtract the pixels opposite to each other, i.e., Pbottom - Ptop and Pright - Pleft, which gives us the change in intensity between opposite pixels.

Change of intensity in the X direction is given by:

Gradient in X direction = PR - PL

Change of intensity in the Y direction is given by:

Gradient in Y direction = PB - PT

Gradient for the image function is given by:

𝛥I = [𝛿I/𝛿x, 𝛿I/𝛿y]

Let us find out the gradient for the given image function:

We can see from the image above that there is a change in intensity levels only in the horizontal direction and no change in the y direction. Let’s try to replicate the above image in a 3*3 image, creating it manually-

Let us now find out the change in intensity level for the image above

Gx = PR - PL = 0 - 255 = -255
Gy = PB - PT = 255 - 255 = 0

𝛥I = [ -255, 0]

Let us take another image to understand the process clearly.

Let us now try to replicate this image using a grid system and create a similar 3 * 3 image.

Now we can see that there is no change in the horizontal direction of the image

Gx = PR - PL = 255 - 255 = 0
Gy = PB - PT = 0 - 255 = -255

𝛥I = [0, -255]

But what if the intensity level changes in both directions of the image? Let us take an example in which the image intensity changes in both directions.

Let us now try replicating this image using a grid system and create a similar 3 * 3 image.

Gx = PR - PL = 0 - 255 = -255
Gy = PB - PT = 0 - 255 = -255

𝛥I = [ -255, -255]
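The three manual computations above can be reproduced with a short helper. The 3*3 pixel values below are illustrative reconstructions consistent with the figures described in the text:

```python
import numpy as np

def center_gradient(img):
    """Gradient at the center pixel of a 3x3 patch:
    Gx = P_right - P_left, Gy = P_bottom - P_top."""
    gx = float(img[1, 2] - img[1, 0])
    gy = float(img[2, 1] - img[0, 1])
    return gx, gy

# Vertical edge: intensity changes only in the horizontal direction
vertical_edge = np.array([[255, 255, 0],
                          [255, 255, 0],
                          [255, 255, 0]], dtype=float)

# Horizontal edge: intensity changes only in the vertical direction
horizontal_edge = np.array([[255, 255, 255],
                            [255, 255, 255],
                            [0, 0, 0]], dtype=float)

# Corner: intensity changes in both directions
corner = np.array([[255, 255, 0],
                   [255, 255, 0],
                   [0, 0, 0]], dtype=float)

print(center_gradient(vertical_edge))    # (-255.0, 0.0)
print(center_gradient(horizontal_edge))  # (0.0, -255.0)
print(center_gradient(corner))           # (-255.0, -255.0)
```

The three outputs match the gradients [-255, 0], [0, -255], and [-255, -255] computed by hand above.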

Now that we have found the gradient values, let me introduce you to two new terms:

Gradient magnitude

Gradient orientation

Gradient magnitude represents the strength of the change in the intensity level of the image. It is calculated by the given formula:

Gradient Magnitude: √((change in x)² +(change in Y)²)

The higher the Gradient magnitude, the stronger the change in the image intensity

Gradient Orientation represents the direction of the change of intensity levels in the image. We can find out gradient orientation by the formula given below:

Gradient Orientation: tan-¹( (𝛿I/𝛿y) / (𝛿I/𝛿x)) * (180/𝝅)

Overview of Filters

We have learned to calculate gradients manually, but we can't do that by hand every time, especially with large images. Instead, we can find the gradient of any image by convolving a filter over the image. To find the orientation of an edge, we find the gradient in both the X and Y directions and then take the resultant of the two.

Different filters or kernels can be used for finding gradients, i.e., detecting edges.

3 filters that we will be working on in this article are

Roberts filter

Prewitt filter

Sobel filter

All these filters can be used for different purposes. They are similar to each other but differ in some properties, and each has separate horizontal and vertical edge-detecting kernels.

The filters differ in the values, orientation, and size of their kernels.

Roberts Filter

Suppose we have this 4*4 image

Let us look at the computation

The gradient in x-direction =

Gx = 100*1 + 200*0 + 150*0 - 35*1
Gx = 65

The gradient in y direction =

Gy = 100*0 + 200*1 - 150*1 + 35*0
Gy = 50

Now that we have found out both these values, let us calculate gradient strength and gradient orientation.

Gradient magnitude = √((Gx)² + (Gy)²) = √((65)² + (50)²) = √6725 ≅ 82

We can use the arctan2 function of NumPy to compute tan-1 and find the gradient orientation
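As a quick sanity check (illustrative, not from the original), the worked Roberts example can be verified in a few lines:

```python
import numpy as np

# Values from the Roberts worked example above
Gx, Gy = 65, 50

magnitude = np.hypot(Gx, Gy)                  # sqrt(65**2 + 50**2) = sqrt(6725)
orientation = np.degrees(np.arctan2(Gy, Gx))  # note: arctan2 takes (y, x) as two arguments

print(round(magnitude, 2), round(orientation, 4))
```

This reproduces the magnitude of about 82 and the orientation of about 37.57 degrees.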

Gradient Orientation = np.arctan2(Gy, Gx) * (180/𝝅) = 37.5685

Prewitt Filter

The Prewitt filter is a 3*3 filter, and it is more sensitive to vertical and horizontal edges than the Sobel filter. It detects two types of edges: vertical and horizontal. Edges are calculated using the difference between corresponding pixel intensities of an image.

A working example of Prewitt filter

Suppose we have the same 4*4 image as earlier

Let us look at the computation

The gradient in x direction =

Gx = 100*(-1) + 200*0 + 100*1 + 150*(-1) + 35*0 + 100*1 + 50*(-1) + 100*0 + 200*1
Gx = 100

The gradient in y direction =

Gy = 100*1 + 200*1 + 200*1 + 150*0 + 35*0 + 100*0 + 50*(-1) + 100*(-1) + 200*(-1)
Gy = 150

Now that we have found both these values let us calculate gradient strength and gradient orientation.

Gradient magnitude = √((Gx)² + (Gy)²) = √((100)² + (150)²) = √32500 ≅ 180

We will use the arctan2 function of NumPy to find the gradient orientation

Gradient Orientation = np.arctan2(Gy, Gx) * (180/𝝅) = 56.3099

Sobel Filter

The Sobel filter is the same as the Prewitt filter, except that the center values are changed from 1 to 2 and from -1 to -2 in both the horizontal and vertical edge-detection kernels.

A working example of Sobel filter

Suppose we have the same 4*4 image as earlier

Let us look at the computation

The gradient in x direction =

Gx = 100*(-1) + 200*0 + 100*1 + 150*(-2) + 35*0 + 100*2 + 50*(-1) + 100*0 + 200*1
Gx = 50

The gradient in y direction =

Gy = 100*1 + 200*2 + 100*1 + 150*0 + 35*0 + 100*0 + 50*(-1) + 100*(-2) + 200*(-1)
Gy = 150

Now that we have found out both these values, let us calculate gradient strength and gradient orientation.

Gradient magnitude = √((Gx)² + (Gy)²) = √((50)² + (150)²) = √25000 ≅ 158

Using the arctan2 function of NumPy to find the gradient orientation

Gradient Orientation = np.arctan2(Gy, Gx) * (180/𝝅) = 71.5650

Implementation using OpenCV

We will perform the program on a very famous image known as Lenna.

Let us start by installing the OpenCV package

# installing opencv
!pip install opencv-python

After we have installed the package, let us import the package and other libraries

Using Roberts filter

Python Code:
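The Roberts code itself is missing above, so here is a minimal, self-contained sketch of Roberts cross filtering in plain NumPy. The tiny test image is illustrative; in the article's setup you would instead apply cv2.filter2D with these kernels to the loaded Lenna image:

```python
import numpy as np

# Roberts cross kernels (2x2), one per diagonal direction
roberts_x = np.array([[1, 0],
                      [0, -1]])
roberts_y = np.array([[0, 1],
                      [-1, 0]])

def correlate2d_valid(img, kernel):
    """Plain 2D cross-correlation, 'valid' mode (no padding)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny synthetic image with a vertical edge between columns 1 and 2
img = np.array([[0, 0, 255, 255],
                [0, 0, 255, 255],
                [0, 0, 255, 255],
                [0, 0, 255, 255]], dtype=float)

gx = correlate2d_valid(img, roberts_x)
gy = correlate2d_valid(img, roberts_y)
magnitude = np.hypot(gx, gy)  # elementwise sqrt(gx**2 + gy**2)
print(magnitude)              # large values only along the edge
```

With OpenCV, the equivalent calls would be cv2.filter2D(gray_img, -1, roberts_x) and cv2.filter2D(gray_img, -1, roberts_y), followed by np.hypot as in the Prewitt and Sobel examples below.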

Using Prewitt Filter

# Converting image to grayscale
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Creating Prewitt filter kernels
kernelx = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]])
kernely = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]])
# Applying the filters to the grayscale image in both x and y directions
img_prewittx = cv2.filter2D(gray_img, -1, kernelx)
img_prewitty = cv2.filter2D(gray_img, -1, kernely)
# Taking the root of the squared sum (np.hypot) of both directions and displaying the result
prewitt = np.hypot(img_prewitty, img_prewittx)
plt.imshow(prewitt.astype('int'), cmap='gray')


Using Sobel filter

gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
kernelx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
kernely = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])
img_x = cv2.filter2D(gray_img, -1, kernelx)
img_y = cv2.filter2D(gray_img, -1, kernely)
# Taking the root of the squared sum and displaying the result
new = np.hypot(img_x, img_y)
plt.imshow(new.astype('int'), cmap='gray')



This article covered the basics of the image gradient and its application in edge detection. The image gradient is one of the fundamental building blocks of image processing: it is the directional change in the intensity of the image, and its main application is edge detection, since a sharp change in intensity often marks the boundary of an object. We computed the gradient, its magnitude, and its orientation manually, then looked at the filters commonly used for the same task: the Roberts, Prewitt, and Sobel filters. Finally, we implemented all three filters in OpenCV to compute gradients and find edges.

Some key points to be noted:

Image gradient is the building block of any edge detection algorithm.

We can manually find out the image gradient and the strength and orientation of the gradient.

We learned how to find the gradient and detect edges using different filters coded using OpenCV.

Gradient Boosting Algorithm: A Complete Guide For Beginners

This article was published as a part of the Data Science Blogathon


In this article, I am going to discuss the math intuition behind the Gradient boosting algorithm. It is more popularly known as Gradient boosting Machine or GBM. It is a boosting method and I have talked more about boosting in this article.

Gradient boosting is a method that stands out for its prediction speed and accuracy, particularly with large and complex datasets. From Kaggle competitions to machine learning solutions for business, this algorithm has produced some of the best results. We already know that errors play a major role in any machine learning algorithm. There are mainly two types of error: bias error and variance error. The gradient boosting algorithm helps us minimize the bias error of the model.

Before getting into the details of this algorithm we must have some knowledge about AdaBoost Algorithm which is again a boosting method. This algorithm starts by building a decision stump and then assigning equal weights to all the data points. Then it increases the weights for all the points which are misclassified and lowers the weight for those that are easy to classify or are correctly classified. A new decision stump is made for these weighted data points. The idea behind this is to improve the predictions made by the first stump. I have talked more about this algorithm here. Read this article before starting this algorithm to get a better understanding.

The main difference between these two algorithms is that Gradient boosting has a fixed base estimator i.e., Decision Trees whereas in AdaBoost we can change the base estimator according to our needs.

Table of Contents

What is Boosting technique?

Gradient Boosting Algorithm

Gradient Boosting Regressor

Example of gradient boosting

Gradient Boosting Classifier

Implementation using Scikit-learn

Parameter Tuning in Gradient Boosting (GBM) in Python

End Notes


About the Author

What is boosting?

While studying machine learning, you must have come across the term Boosting. It is one of the most misinterpreted terms in the field of data science. The principle behind boosting algorithms is that we first build a model on the training dataset, then build a second model to rectify the errors present in the first. Let me try to explain what exactly this means and how it works.

Suppose you have n data points and 2 output classes (0 and 1). You want to create a model to detect the class of the test data. We randomly select observations from the training dataset and feed them to model 1 (M1); we also assume that initially all the observations have an equal weight, meaning an equal probability of getting selected.

Remember in ensembling techniques the weak learners combine to make a strong model so here M1, M2, M3….Mn all are weak learners.

Since M1 is a weak learner, it will surely misclassify some of the observations. Before feeding the observations to M2, we update the weights of the observations that were wrongly classified. You can think of it as a bag that initially contains 10 balls of different colors; after some time, a kid takes out his favorite ball and puts 4 red balls inside the bag instead. Now, of course, the probability of selecting a red ball is higher. The same phenomenon happens in boosting: when an observation is wrongly classified, its weight gets increased, and the weights of correctly classified observations get decreased. Since the probability of selecting a wrongly classified observation rises, the next model mostly sees the observations that were misclassified by model 1.

Similarly with M2: the weights of the wrongly classified observations are again updated and fed to M3. This procedure continues until the errors are minimized and the dataset is predicted correctly. When a new data point (test data) comes in, it passes through all the models (weak learners), and the class that gets the highest vote is the output for our test data.

What is a Gradient boosting Algorithm?

The main idea behind this algorithm is to build models sequentially and these subsequent models try to reduce the errors of the previous model. But how do we do that? How do we reduce the error? This is done by building a new model on the errors or residuals of the previous model.

When the target column is continuous, we use the Gradient Boosting Regressor, whereas when it is a classification problem, we use the Gradient Boosting Classifier. The only difference between the two is the loss function. The objective is to minimize this loss function by adding weak learners using gradient descent. Since the method is built on a loss function, regression problems use losses like mean squared error (MSE), while classification uses, for example, log-likelihood.

Understand Gradient Boosting Algorithm with example

Let’s understand the intuition behind Gradient boosting with the help of an example. Here our target column is continuous hence we will use Gradient Boosting Regressor.

Following is a sample from a random dataset where we have to predict the car price based on various features. The target column is price and other features are independent features.

Image Source: Author

Step -1 The first step in gradient boosting is to build a base model to predict the observations in the training dataset. For simplicity we take an average of the target column and assume that to be the predicted value as shown below:

Image Source: Author

Why did I say we take the average of the target column? Well, there is math involved behind this. Mathematically the first step can be written as:

Looking at this may give you a headache, but don’t worry we will try to understand what is written here.

Here L is our loss function

Gamma is our predicted value

argmin means we have to find a predicted value/gamma for which the loss function is minimum.

Since the target column is continuous our loss function will be:

Here yi is the observed value

And gamma is the predicted value

Now we need to find a value of gamma for which this loss function is minimum. We all studied how to find minima and maxima in the 12th grade: we differentiate the function and set the derivative equal to 0. Yes, we will do the same here.
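The equations referenced above (shown as images in the original) can be reconstructed as follows; this is the standard squared-error derivation, consistent with the surrounding text:

```latex
F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} (y_i - \gamma)^2,
\qquad
\frac{\partial}{\partial \gamma}\sum_{i=1}^{n} (y_i - \gamma)^2
  = -2\sum_{i=1}^{n} (y_i - \gamma) = 0
  \;\Longrightarrow\;
  \gamma = \frac{1}{n}\sum_{i=1}^{n} y_i
```

For the sample car-price data this works out to the mean of the target column, 14500.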

Let’s see how to do this with the help of our example. Remember that y_i is our observed value and gamma_i is our predicted value, by plugging the values in the above formula we get:

We end up with the average of the observed car prices, and this is why I asked you to take the average of the target column and assume it to be your first prediction.

Hence for gamma=14500, the loss function will be minimum so this value will become our prediction for the base model.

Step-2 The next step is to calculate the pseudo residuals which are (observed value – predicted value)

Image Source: Author

Here F(xi) is the previous model and m is the number of DT made.

We are just taking the derivative of loss function w.r.t the predicted value and we have already calculated this derivative:

If you see the formula of residuals above, we see that the derivative of the loss function is multiplied by a negative sign, so now we get:

The predicted value here is the prediction made by the previous model. In our example the prediction made by the previous model (initial base model prediction) is 14500, to calculate the residuals our formula becomes:
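In symbols (reconstructing the equations shown as images), the pseudo-residual is the negative gradient of the loss with respect to the previous prediction:

```latex
r_{im} = -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}
       = y_i - F_{m-1}(x_i)
```

For squared error this equals observed minus predicted, up to a constant factor of 2 that is conventionally absorbed by writing the loss as ½(y − F)². With F0(xi) = 14500 in the running example, the first residuals are yi − 14500.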

In the next step, we will build a model on these pseudo residuals and make predictions. Why do we do this? Because we want to minimize these residuals and minimizing the residuals will eventually improve our model accuracy and prediction power. So, using the Residual as target and the original feature Cylinder number, cylinder height, and Engine location we will generate new predictions. Note that the predictions, in this case, will be the error values, not the predicted car price values since our target column is an error now.

Let’s say hm(x) is our DT made on these residuals.

Step-4 In this step we find the output values for each leaf of our decision tree. There might be a case where one leaf gets more than one residual, so we need to find the final output of all the leaves. To find the output, we can simply take the average of all the numbers in a leaf, no matter whether there is only one number or more than one.

Let’s see why do we take the average of all the numbers. Mathematically this step can be represented as:

Here hm(xi) is the DT made on residuals and m is the number of DT. When m=1 we are talking about the 1st DT and when it is “M” we are talking about the last DT.

The output value for the leaf is the value of gamma that minimizes the Loss function. The left-hand side “Gamma” is the output value of a particular leaf. On the right-hand side [Fm-1(xi)+ƴhm(xi))] is similar to step 1 but here the difference is that we are taking previous predictions whereas earlier there was no previous prediction.

Image Source

We see that the 1st residual goes in R1,1, the 2nd and 3rd residuals go in R2,1, and the 4th residual goes in R3,1.

Let’s calculate the output for the first leave that is R1,1

Now we need to find the value for gamma for which this function is minimum. So we find the derivative of this equation w.r.t gamma and put it equal to 0.

Hence the leaf R1,1 has an output value of -2500. Now let’s solve for the R2,1

Let’s take the derivative to get the minimum value of gamma for which this function is minimum:

We end up with the average of the residuals in the leaf R2,1 . Hence if we get any leaf with more than 1 residual, we can simply find the average of that leaf and that will be our final output.

Now after calculating the output of all the leaves, we get:

Image Source: Author

Step-5 This is finally the last step where we have to update the predictions of the previous model. It can be updated as:

where m is the number of decision trees made.
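Written out (the equation is shown as an image in the original), the update is:

```latex
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)
```

Here ν is the learning rate and hm(x) is the decision tree fitted to the residuals at step m.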

Since we have just started building our model so our m=1. Now to make a new DT our new predictions will be:

Image Source: Author

Here Fm-1(x) is the prediction of the base model (the previous prediction); since m=1, Fm-1 = F0, our base model, and hence the previous prediction is 14500.

nu (ν) is the learning rate, usually selected between 0 and 1. It reduces the effect each tree has on the final prediction, which improves accuracy in the long run. Let's take nu=0.1 in this example.

Hm(x) is the recent DT made on the residuals.

Let’s calculate the new prediction now:

Image Source: Author

Suppose we want to find a prediction of our first data point which has a car height of 48.8. This data point will go through this decision tree and the output it gets will be multiplied with the learning rate and then added to the previous prediction.

Now let’s say m=2 which means we have built 2 decision trees and now we want to have new predictions.

This time we will add the previous prediction that is F1(x) to the new DT made on residuals. We will iterate through these steps again and again till the loss is negligible.

Image Source: Author

If a new data point says height = 1.40 comes, it’ll go through all the trees and then will give the prediction. Here we have only 2 trees hence the datapoint will go through these 2 trees and the final output will be F2(x).

What is Gradient Boosting Classifier?

A gradient boosting classifier is used when the target column is binary. All the steps explained in the Gradient boosting regressor are used here, the only difference is we change the loss function. Earlier we used Mean squared error when the target column was continuous but this time, we will use log-likelihood as our loss function.

Let’s see how this loss function works, to read more about log-likelihood I recommend you to go through this article where I have given each detail you need to understand this.

The loss function for the classification problem is given below:

Our first step in the gradient boosting algorithm was to initialize the model with some constant value, there we used the average of the target column but here we’ll use log(odds) to get that constant value. The question comes why log(odds)?

When we differentiate this loss function, we will get a function of log(odds) and then we need to find a value of log(odds) for which the loss function is minimum.

Confused right? Okay let’s see how it works:

Let’s first transform this loss function so that it is a function of log(odds), I’ll tell you later why we did this transformation.

Now this is our loss function, and we need to minimize it, for this, we take the derivative of this w.r.t to log(odds) and then put it equal to 0,

Here y are the observed values

You must be wondering why we transformed the loss function into a function of log(odds). Sometimes it is easier to work with the function of log(odds), and sometimes with the function of the predicted probability p.

It is not compulsory to transform the loss function, we did this just to have easy calculations.

Hence the minimum value of this loss function will be our first prediction (base model prediction)
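For a binary target, this constant is the log of the odds of the positive class (a standard result, stated here for completeness since the equation is shown as an image in the original):

```latex
F_0(x) = \log(\text{odds})
       = \log\!\left( \frac{\sum_i y_i}{\sum_i (1 - y_i)} \right)
```

That is, the log of the ratio of positive to negative examples in the training set.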

Now in the Gradient boosting regressor our next step was to calculate the pseudo residuals where we multiplied the derivative of the loss function with -1. We will do the same but now the loss function is different, and we are dealing with the probability of an outcome now.

After finding the residuals we can build a decision tree with all independent variables and target variables as “Residuals”.

Now when we have our first decision tree, we find the final output of the leaves because there might be a case where a leaf gets more than 1 residuals, so we need to calculate the final output value. The math behind this step is out of the scope of this article so I will mention the direct formula to calculate the output of a leaf:
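The formula referred to here is, in the standard formulation (its derivation uses a second-order Taylor approximation of the loss):

```latex
\gamma_{jm} = \frac{\sum_{x_i \in R_{jm}} r_{im}}{\sum_{x_i \in R_{jm}} p_i \,(1 - p_i)}
```

where Rjm is leaf j of tree m, rim are the residuals in that leaf, and pi is the probability predicted by the previous model.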

Finally, we are ready to get new predictions by adding our base model with the new tree we made on residuals.

There are a few variations of gradient boosting, and a couple of them are briefly described in the coming article.

Implementation Using scikit-learn

The task here is to classify the income of an individual, when given the required inputs about his personal life.

First, let’s import all required libraries.

# Import all relevant libraries
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import preprocessing
import warnings
warnings.filterwarnings("ignore")

Now let's read the dataset and look at the columns to understand the information better.

df = pd.read_csv('income_evaluation.csv')
df.head()

I have already done the data preprocessing part, and you can look at the whole code separately; my main aim here is to show you how to implement this in Python. Now, for training and testing our model, the data has to be divided into train and test sets.

We will also scale the data to lie between 0 and 1.

# Split dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(df.drop('income', axis=1), df['income'], test_size=0.2)

Now let’s go ahead with defining the Gradient Boosting Classifier along with it’s hyperparameters. Next, we will fit this model on the training data.

# Define Gradient Boosting Classifier with hyperparameters
gbc = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05, random_state=100, max_features=5)
# Fit train data to GBC
gbc.fit(X_train, y_train)

The model has been trained and we can now observe the outputs as well.

Below, you can see the confusion matrix of the model, which gives a report of the number of classifications and misclassifications.

# Confusion matrix will give number of correct and incorrect classifications
print(confusion_matrix(y_test, gbc.predict(X_test)))
# Accuracy of model
print("GBC accuracy is %2.2f" % accuracy_score(y_test, gbc.predict(X_test)))

Let’s check the classification report also:

from sklearn.metrics import classification_report
pred = gbc.predict(X_test)
print(classification_report(y_test, pred))

Parameter Tuning in Gradient Boosting (GBM) in Python

Tuning n_estimators and Learning rate

n_estimators is the number of trees (weak learners) we want to add to the model. There is no single optimum value for the learning rate, as low values always work better, provided we train a sufficient number of trees. A high number of trees can be computationally expensive, which is why I have used a small number of trees here.

from sklearn.model_selection import GridSearchCV
grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': np.arange(100, 500, 100),
}
gb = GradientBoostingClassifier()
gb_cv = GridSearchCV(gb, grid, cv=4)
gb_cv.fit(X_train, y_train)
print("Best Parameters:", gb_cv.best_params_)
print("Train Score:", gb_cv.best_score_)
print("Test Score:", gb_cv.score(X_test, y_test))

We see that the accuracy increased from 86 to 89 after tuning n_estimators and the learning rate. The "true positive" and "true negative" rates also improved.

We can also tune max_depth parameter which you must have heard in decision trees and random forests.

grid = {'max_depth': [2, 3, 4, 5, 6, 7]}
gb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=400)
gb_cv = GridSearchCV(gb, grid, cv=4)
gb_cv.fit(X_train, y_train)
print("Best Parameters:", gb_cv.best_params_)
print("Train Score:", gb_cv.best_score_)
print("Test Score:", gb_cv.score(X_test, y_test))

The accuracy has increased even more when we tuned the parameter “max_depth”.

End Notes

I hope you got an understanding of how the gradient boosting algorithm works under the hood. I have tried to show you the math behind it in the easiest way possible.

In the next article, I will explain Xtreme Gradient Boosting (XGB), which is again a new technique to combine various models and to improve our accuracy score. It is just an extension of the gradient boost algorithm.

About the Author

I am an undergraduate student currently in my last year majoring in Statistics (Bachelors of Statistics) and have a strong interest in the field of data science, machine learning, and artificial intelligence. I enjoy diving into data to discover trends and other valuable insights about the data. I am constantly learning and motivated to try new things.

I am open to collaboration and work.

For any doubt and queries, feel free to contact me on Email

Connect with me on LinkedIn and Twitter

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

