In this post, I’m going to explain what Gradient Descent is and how to implement it from scratch in Python. To understand how it works you will need some basic math and logical thinking. Though a stronger math background would be preferable for understanding derivatives, I will try to explain them as simply as possible. Gradient Descent can be used in different machine learning algorithms, including neural networks. For this tutorial, we are going to build it for a linear regression problem, because it’s easy to understand and visualize. I also created a GitHub repo with all the explanations. Let’s get started.

Linear Regression

In order to fit the regression line, we tune two parameters: slope (m) and intercept (b). The line itself is

y = m x + b

and to measure how well it fits the data we use the Mean Squared Error:

\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right)^2

Minimization of this function is the exact task of the Gradient Descent algorithm. It takes the parameters and tunes them until the local minimum is reached. Let’s break down the process into steps and explain what is actually going on under the hood.
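To make the loss concrete before we start, here is MSE on a handful of toy points in plain NumPy (the data and the mse helper are purely illustrative, not part of the tutorial’s code):

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])   # toy inputs
y = np.array([3.1, 4.9, 7.2, 8.8])   # roughly y = 2x + 1

def mse(m, b):
    # (1/N) * sum((y_i - (m*x_i + b))^2)
    return np.mean((y - (m * X + b)) ** 2)

print(mse(2.0, 1.0))   # ~0.025, this line sits close to the points
print(mse(0.0, 0.0))   # ~40.7, a bad guess gives a much larger error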
Derivatives

We use partial derivatives to find how each individual parameter affects MSE, so that’s where the word partial comes from. We take these derivatives with respect to m and with respect to b. With respect to m means that we differentiate for m while treating the other parameter, b, as a constant, and vice versa. To take these derivatives we’ll need the chain rule. The chain rule says that we should take the derivative of the outside function, keep the inside function untouched, and then multiply everything by the derivative of the inside function. All we need to do now is to take the partial derivatives with respect to m and b of the function below:

\text{Squared Error} = \left( y_i - (m x_i + b) \right)^2

If you’re new to this you’d be surprised that this expression is really two functions nested together: the outside function u^2 and the inside function u = y_i - (m x_i + b).
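As a quick standalone refresher (this example is not from the derivation, it just shows the same pattern): to differentiate h(x) = (3x + 1)^2, the outside function is u^2 and the inside function is u = 3x + 1, so

\frac{d}{dx}\,(3x + 1)^2 = \underbrace{2(3x + 1)}_{\text{outside}'} \cdot \underbrace{3}_{\text{inside}'} = 6(3x + 1)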
Differentiating with respect to m, the derivative of the outside function is 2 \left( y_i - (m x_i + b) \right), and the derivative of the inside function with respect to m is -x_i. To put everything together we get the following:

\frac{\partial \mathrm{MSE}}{\partial m} = \frac{1}{N} \sum_{i=1}^{N} -2 x_i \left( y_i - (m x_i + b) \right)

To derive with respect to b, we follow exactly the same steps; the only difference is that the derivative of the inside function with respect to b is simply -1.
Put everything together:

\frac{\partial \mathrm{MSE}}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} -2 \left( y_i - (m x_i + b) \right)

Phew! The hardest part is behind us; now we can dive into the Python environment.

Gradient Descent

Because we have only one input feature, X must be a NumPy vector, which is a list of values. We can easily extend the algorithm to multiple features by taking partial derivatives with respect to each of the weights w_1, …, w_n instead of just m. Take a look at the code. We start by importing NumPy and defining initial values for m and b, then on every epoch we nudge both parameters against their gradients.
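Here is a minimal sketch of such a function. The name gradient_descent, the zero starting values, and the log list for tracking are illustrative choices; the lr and epoch defaults mirror the SGD signature further below:

import numpy as np

def gradient_descent(X, y, lr=0.05, epoch=10):
    m, b = 0.0, 0.0              # start from an arbitrary line
    N = len(X)                   # number of samples
    log = []                     # (m, b) after every epoch, handy for plotting
    for _ in range(epoch):
        residuals = y - (m * X + b)
        # gradient steps: dMSE/dm = (1/N) * sum(-2 * x_i * residual_i)
        #                 dMSE/db = (1/N) * sum(-2 * residual_i)
        m -= lr * (-2 / N) * np.sum(X * residuals)
        b -= lr * (-2 / N) * np.sum(residuals)
        log.append((m, b))
    return m, b, log

# Toy usage (illustrative data):
X = np.random.rand(50)                          # single feature in [0, 1]
y = 2.5 * X + 1.0 + 0.1 * np.random.randn(50)   # noisy line y = 2.5x + 1
m, b, log = gradient_descent(X, y)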
We can track visually how the algorithm approaches the local minimum by plotting the lines stored in log. To the naked eye, it seems like it didn’t converge completely. To solve that, we can increase the epoch and lr parameters.

BONUS: Stochastic Gradient Descent

Stochastic Gradient Descent uses only one sample to update the parameters instead of the whole dataset, which makes it faster. We can make small corrections to the previous version and see how it performs; the body below simply reuses the update rule from the gradient_descent sketch above, and the random-sampling details are one straightforward way to do it:

def SGD(X, y, lr=0.05, epoch=10, batch_size=1):
    # same update rule as gradient_descent, but computed on a random
    # mini-batch (a single sample when batch_size=1)
    m, b = 0.0, 0.0
    N = len(X)
    log = []
    for _ in range(epoch):
        idx = np.random.randint(0, N, batch_size)   # pick random sample(s)
        Xs, ys = X[idx], y[idx]
        residuals = ys - (m * Xs + b)
        m -= lr * (-2 / batch_size) * np.sum(Xs * residuals)
        b -= lr * (-2 / batch_size) * np.sum(residuals)
        log.append((m, b))
    return m, b, log

We can observe the stochastic nature of the process: the regression line was jumping all over the place, trying to find a minimum based on just one sample. Poor thing.

Conclusion

We’ve learned how to implement Gradient Descent and SGD from scratch. We take the same approach in backpropagation when training neural networks. If you’re interested in implementing Deep Neural Networks from scratch, let me know in the comments. Thank you for reading! Arseniy.