Graphical Notes on Machine Learning — Gradient Descent
Demystifying Gradient Descent: A Visual Guide to Understanding and Applying this Optimization Technique
Gradient descent — an optimization algorithm commonly used in machine learning to find the optimal parameters for a model during training. Imagine you’re lost in a hilly landscape and want to find the lowest valley. Gradient descent helps you get there by iteratively taking steps in the direction that goes downhill the fastest, as in the illustration above.
Recap on cost function
Cost function — also sometimes called a loss function, is a mathematical tool used in machine learning to evaluate how well a model is performing. It quantifies the error between the model’s predicted values and the actual values it’s trying to learn from. The cost function can take many forms depending on the problem setting. In this note, we use the simplest form — the squared-error cost of linear regression — as the main equation throughout.
Goal — minimize the cost function J(w,b) by finding the optimal values of w and b.
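The squared-error cost for linear regression can be sketched in a few lines of NumPy. The function name and the tiny dataset below are illustrative, not from the original note:

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared-error cost J(w, b) for the linear model f(x) = w*x + b."""
    m = len(x)
    predictions = w * x + b
    return np.sum((predictions - y) ** 2) / (2 * m)

# Tiny hypothetical dataset: y = 2x exactly
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(compute_cost(x, y, w=2.0, b=0.0))  # perfect fit → cost 0.0
print(compute_cost(x, y, w=0.0, b=0.0))  # worse fit → larger cost
```

Minimizing J(w, b) means searching over w and b for the pair that makes this number as small as possible.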
How does gradient descent work in a simple way
Gradient descent starts by initializing the parameters (w, b) — often with random values — which fixes a starting position on the graph of the cost function J. The algorithm then repeatedly updates w and b until the trajectory settles somewhere close to a local minimum.
Gradient descent algorithm
Let’s dive deeper into the Gradient Descent algorithm!
The update rule for each parameter is determined by the learning rate and the partial derivative of the cost function with respect to that parameter.
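In symbols, with learning rate α, both parameters are updated simultaneously and the process repeats until convergence:

```latex
w := w - \alpha \frac{\partial J(w,b)}{\partial w},
\qquad
b := b - \alpha \frac{\partial J(w,b)}{\partial b}
```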
Here’s a step-by-step breakdown:
- Start with a guess — You begin with an initial guess for the position of the minimum (like a random point on the graph). This guess represents the initial values of the model’s parameters.
- Calculate the gradient — The gradient tells you the direction of steepest descent at your current position. It’s like having a compass that points downhill. Mathematically, the gradient is calculated using calculus.
- Take a step — You move in the negative direction of the gradient by a small amount called the learning rate. This ensures you don’t jump over the valley but gradually descend towards it. The learning rate controls how big of a step you take; a larger rate means faster movement but also a higher risk of missing the valley altogether.
- Repeat — Go back to step 2 and keep calculating the gradient at your new position. Update your position based on the new gradient and learning rate.
- Convergence — Continue iterating (steps 2–4) until you reach a point where the cost function doesn’t change much anymore, or you meet a pre-defined stopping criterion. This indicates you’ve (hopefully) found a minimum.
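The steps above can be sketched as a minimal NumPy loop for the squared-error cost of linear regression. The learning rate, iteration budget, and stopping tolerance are illustrative choices, not values from the original note:

```python
import numpy as np

def gradient_descent(x, y, w=0.0, b=0.0, alpha=0.05, iters=10000, tol=1e-9):
    """Batch gradient descent for the linear model f(x) = w*x + b."""
    m = len(x)
    prev_cost = float("inf")
    for _ in range(iters):
        # Step 2: gradient of the squared-error cost at the current (w, b)
        err = (w * x + b) - y
        dj_dw = np.sum(err * x) / m
        dj_db = np.sum(err) / m
        # Step 3: move against the gradient, scaled by the learning rate
        w -= alpha * dj_dw
        b -= alpha * dj_db
        # Step 5: stop once the cost barely changes between iterations
        cost = np.sum(err ** 2) / (2 * m)
        if abs(prev_cost - cost) < tol:
            break
        prev_cost = cost
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])  # generated by f(x) = 2x + 1
w, b = gradient_descent(x, y)
print(w, b)  # should approach w ≈ 2, b ≈ 1
```

Note that both parameters are updated from the same gradient before either one changes — updating w first and then computing the gradient for b would be a subtly different algorithm.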
Intuition
Here is some general intuition for the updates as they move toward the goal of finding the lowest point (the minimum) in this graph. The derivative term in the update for w is the slope of the cost function J at the current point. If the slope is positive, the term α·(dJ/dw) is subtracted, so w decreases; if the slope is negative, subtracting a negative quantity increases w. Either way, the parameter moves toward the minimum, and because the slope shrinks as the minimum approaches, the updates naturally get smaller until the slope is close to zero.
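To make the sign argument concrete, here is a toy run on J(w) = w², whose slope is dJ/dw = 2w. The learning rate of 0.3 is an arbitrary illustrative choice:

```python
def step(w, alpha=0.3):
    slope = 2 * w          # dJ/dw for J(w) = w**2
    return w - alpha * slope

# Positive slope (w > 0): the update subtracts, moving w left toward 0.
w = 2.0
for _ in range(5):
    w = step(w)
print(round(w, 4))  # 0.0205

# Negative slope (w < 0): the update adds, moving w right toward 0.
w = -2.0
for _ in range(5):
    w = step(w)
print(round(w, 4))  # -0.0205
```

From either side of the minimum, each step is smaller than the last because the slope itself shrinks toward zero.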
Learning rate
Learning rate — is a crucial hyperparameter in the update term. It controls the step size taken by the algorithm during each iteration as it navigates towards the minimum of a cost function.
An appropriate learning rate helps the algorithm converge smoothly to the minimum point, meaning it reaches the optimal solution efficiently and avoids getting stuck in local minima (false bottoms).
When the learning rate is too small, convergence is slow: with such tiny steps, the algorithm takes significantly longer to reach the minimum point.
When the learning rate is too large, it can cause oscillation and instability. The algorithm takes large steps, potentially overshooting the minimum point and oscillating around it, so it may never reach the minimum at all.
Problem can also appear when the function is too complicated. The algorithm might get trapped in shallow valleys (local minima) instead of reaching the global minimum (the absolute lowest point). It might mistake a small improvement for the optimal solution.
A further refinement is to use an adaptive learning rate or learning rate decay, which adjusts the step size automatically during training based on the progress made.
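One common schedule is exponential decay, where the learning rate shrinks geometrically with the step count. The constants below (initial rate 0.1, decay rate 0.99) are hypothetical:

```python
def decayed_lr(initial_lr, decay_rate, step):
    """Exponential learning-rate decay: lr shrinks geometrically per step."""
    return initial_lr * (decay_rate ** step)

# Early training takes big steps; later steps become progressively finer.
for step in (0, 10, 100):
    print(step, decayed_lr(0.1, 0.99, step))
```

Schedules like this let the algorithm cover ground quickly at first, then settle gently near the minimum instead of overshooting it.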
While a fixed learning rate is the simplest approach, it can still bring the algorithm to a local minimum. This is because the derivative term shrinks as the updates approach the minimum, so the effective step size shrinks with it even though the learning rate itself never changes.
Grouping all parts together
From the linear regression model, we can link the predicted values to the cost function terms. The cost function here is the squared error, which can be visualized as a bowl shape containing a single global minimum. Next, gradient descent relies heavily on this cost function: by calculating the gradient (the rate of change) of the cost with respect to the model’s parameters, gradient descent can determine the direction in which the model needs to adjust to minimize the error and improve its performance.
Using calculus, we can derive the update term from the derivative of the cost function. The algorithm iteratively updates the parameters, resembling a walk toward the minimum point on a contour plot. This process helps us find a model that closely approximates the function f.
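Carrying out that differentiation for the squared-error cost with f(x) = wx + b gives the two derivative terms that appear in the update rule:

```latex
\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\bigl(f(x^{(i)}) - y^{(i)}\bigr)\,x^{(i)},
\qquad
\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\bigl(f(x^{(i)}) - y^{(i)}\bigr)
```

The 1/2 in the cost conveniently cancels against the 2 from the power rule, which is why it is included in the definition of J in the first place.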
When all training examples are used to calculate the gradient and update the model parameters in each iteration, this is called batch gradient descent. However, this can be computationally expensive for large datasets. In such cases, subsets of the data, called mini-batches, can be used to optimize the model efficiently. Stochastic gradient descent (SGD) is a further variation where the model parameters are updated after each individual data point. We’ll delve deeper into SGD in a future note!
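A mini-batch variant can be sketched by reusing the same update rule on a shuffled slice of the data at each step. The batch size, learning rate, and epoch count below are illustrative:

```python
import numpy as np

def minibatch_gd(x, y, w=0.0, b=0.0, alpha=0.05, batch_size=2, epochs=1000, seed=0):
    """Mini-batch gradient descent for the linear model f(x) = w*x + b."""
    rng = np.random.default_rng(seed)
    m = len(x)
    for _ in range(epochs):
        order = rng.permutation(m)               # reshuffle each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            # Gradient estimated from this mini-batch only
            err = (w * x[idx] + b) - y[idx]
            w -= alpha * np.mean(err * x[idx])
            b -= alpha * np.mean(err)
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                                # noiseless line f(x) = 2x + 1
w, b = minibatch_gd(x, y)
print(w, b)  # approaches w ≈ 2, b ≈ 1
```

Setting `batch_size=1` turns this into SGD, while `batch_size=len(x)` recovers batch gradient descent, so the three variants differ only in how much data feeds each gradient estimate.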
Thanks for reading! Stay tuned for more.
If you like my note, here is the way to buy me a coffee☕.
References
This note is illustrated from the Machine Learning Specialization by Stanford Online x Deeplearning.AI — Supervised Machine Learning: Regression and Classification.