What are the important optimization algorithms in machine learning?
Gradient descent is a very common optimization algorithm and a piece of basic machine learning knowledge that must be mastered. This article takes a closer look at it.

Preface

The code for this article can be obtained from my Github:

/paulQuei/gradient_descent

The algorithm examples in this article are implemented in Python, using numpy and matplotlib. If you are not familiar with these two tools, please look up tutorials on the Internet yourself.

About optimization

Most learning algorithms involve some form of optimization. Optimization refers to the task of changing x to minimize or maximize some function f(x).

We usually phrase optimization problems as minimization; maximization can be achieved by minimizing -f(x).

The function to be minimized or maximized is called the objective function or criterion.

We usually use a superscript * to denote the x value that minimizes or maximizes a function, written as:

[x^* = \arg\min f(x)]

Optimization itself is a very big topic. If you are interested, you can learn from books on numerical optimization and operational research.

Models and hypothesis functions

All models are wrong, but some are useful. — George Edward Pelham Box

A model is a hypothesis about the data to be analyzed; it is learned from the data in order to solve a specific problem, and it is therefore the core concept of machine learning.

There are usually a large number of models to choose from for a problem.

This article will not discuss this aspect in depth. Please refer to machine learning books for the various models; this article only discusses the gradient descent algorithm based on the simplest linear model.

Here we first introduce three commonly used symbols in supervised learning:

m, the number of training samples.

x, the input variable or feature.

y, the output variable or target value.

Please note that a sample may have many features, so x and y are usually vectors. But at the beginning of the study, for ease of understanding, they can temporarily be treated as specific numerical values. The training set contains many samples; we use (x^i, y^i) to denote the i-th sample.

x is the features of a data sample and y is its target value. For example, in a model that predicts house prices, x is the various attributes of the house, such as area, floor and location, and y is the price of the house. In an image recognition task, x is all the pixel data of the image and y is the target object contained in the image.

We want to find a function that maps x to y. This function should be good enough that it can be used to predict the corresponding y. For historical reasons, this function is called the hypothesis function.

The learning process is shown in the figure below: we first train our model on the existing data (called the training set), and then use the model's hypothesis function to make predictions on new data.

A linear model, as the name implies, describes the pattern with a straight line. The hypothesis function of the linear model is as follows:

[h_\theta(x) = \theta_0 + \theta_1 x]

This formula should be very simple for everyone. If you draw it, it is actually a straight line.

The following figure shows a concrete example, namely h_\theta(x) = 5 + 2x:

In an actual machine learning project, you will have a lot of data, which comes from some data source. It may be stored in csv files or packaged in other formats.

But this article is a demonstration, so we can generate the required data automatically with some simple code. To keep the calculations easy, the amount of data used in the demonstration is also very small.

import numpy as np

max_x = 10
data_size = 10
theta_0 = 5
theta_1 = 2

def get_data():
    x = np.linspace(1, max_x, data_size)
    noise = np.random.normal(0, 0.2, len(x))
    y = theta_0 + theta_1 * x + noise
    return x, y

This code is very simple. We generate 10 data points whose x values are the integers 1 through 10. The corresponding y is calculated from the linear model, whose function is y = 5 + 2x. Real data are often disturbed by various factors, so we deliberately add some Gaussian noise to y, so the final value of y differs slightly from the original.

Finally, our data are as follows:

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

y = [6.66, 9.11, 11.08, 12.67, 15.12, 16.76, 18.75, 21.35, 22.77, 24.56]

We can plot these 10 data points to get an intuitive feel for them, as shown in the following figure:

Although the data used in the demonstration are calculated from our formula, in practical engineering the parameters of the model must be learned from the data. So here we pretend we do not know the two parameters of the linear model, and instead obtain them with an algorithm.

Finally, we compare it with the known parameters to verify whether our algorithm is correct.

With the above data, we can try to draw a straight line to describe our model.

For example, draw a horizontal straight line as follows:

Obviously, this horizontal line is too far away from the data; it is a very poor fit.

Then we can draw another diagonal line.

The diagonal line we drew for the first time may not be appropriate either. It might look like this:

Finally, we tried and tried and found the most suitable one, as shown below:

The calculation process of gradient descent algorithm is similar to this instinctive attempt, that is, it is iterative and approaches the final result step by step.

Cost function

Above, we tried several straight lines to fit the data.

The straight line on the two-dimensional plane can be uniquely determined by two parameters, and the determination of the two parameters is also the determination of the model. So how to describe the fitting degree between the model and the data? The answer is the cost function.

The cost function describes the degree of deviation between the learning model and the actual results. Take the above three pictures as an example. Compared with the first horizontal green line, the deviation degree (cost) of the red line in the last picture should be smaller.

Obviously, we want our hypothesis function to be as close to the data as possible, that is, we want the result of the cost function to be as small as possible. This involves the optimization of the results, and the gradient descent method is one of the methods to find the minimum value.

The cost function is also called the loss function. For each sample, the hypothesis function computes an estimate, which we often denote \widehat{y}, i.e. \widehat{y} = h_\theta(x).

Naturally, we will think of the following formula to describe the degree of deviation between our model and the actual value:

[(h_\theta(x^i) - y^i)^2 = (\widehat{y}^i - y^i)^2 = (\theta_0 + \theta_1 x^i - y^i)^2]

Please note that y^i is the value of the actual data, not the estimate \widehat{y}^i from our model. The former corresponds to the y coordinates of the discrete points in the figure above, and the latter corresponds to the y coordinates of the projections of those points onto the straight line.

Each piece of data will have a deviation value, and the cost function is to average the deviation of all samples, and its calculation formula is as follows:

[L(\theta) = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^i) - y^i)^2 = \frac{1}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x^i - y^i)^2]

The smaller the result of the loss function, the closer the estimates of our hypothesis function are to the real values. This is why we want to minimize the loss function.

Different models may use different loss functions; for example, logistic regression has its own hypothesis function and its own cost function. With the help of the formula above, we can write a function that implements the cost function:

def cost_function(x, y, t0, t1):
    cost_sum = 0
    for i in range(len(x)):
        cost_item = np.power(t0 + t1 * x[i] - y[i], 2)
        cost_sum += cost_item
    return cost_sum / len(x)

The code of this function should be self-explanatory: it simply computes L(\theta) according to the formula above.

We can try different combinations of \theta_0 and \theta_1, calculate the value of the cost function for each, and then plot the result:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D

theta_0 = 5
theta_1 = 2

def draw_cost(x, y):
    fig = plt.figure(figsize=(10, 8))
    ax = fig.gca(projection='3d')
    scatter_count = 100
    radius = 1
    t0_range = np.linspace(theta_0 - radius, theta_0 + radius, scatter_count)
    t1_range = np.linspace(theta_1 - radius, theta_1 + radius, scatter_count)
    cost = np.zeros((len(t0_range), len(t1_range)))
    for a in range(len(t0_range)):
        for b in range(len(t1_range)):
            cost[a][b] = cost_function(x, y, t0_range[a], t1_range[b])
    t0, t1 = np.meshgrid(t0_range, t1_range)
    ax.set_xlabel('theta_0')
    ax.set_ylabel('theta_1')
    ax.plot_surface(t0, t1, cost, cmap=cm.hsv)

In this code, we sample \theta_0 and \theta_1 100 times each within a radius of 1 around the true values, and then calculate the cost function value for every combination of the two.

If we draw the cost function values of all points, the result is shown in the following figure:

As can be seen from this figure, the closer to [5, 2], the smaller the result (deviation); conversely, the farther away, the larger the deviation.
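As a usage note (a sketch of my own, assuming the data come from the get_data function in the reconstruction earlier; that name is not fixed by the original article), the surface can be rendered like this:

x, y = get_data()
draw_cost(x, y)
plt.show()  # display the 3D cost surface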

Intuitive explanation

From the above figure, we can see that the cost function has different results in different positions.

From a three-dimensional perspective, this is the same as the ups and downs of the ground. The highest place is like the top of the mountain.

Our goal is to find a path from an arbitrary point and reach the lowest point (the lowest cost) of the surface.

The algorithm process of gradient descent is the same as what we want to do quickly from the top of the mountain.

In life, we naturally think that taking the steepest road is the fastest way down the mountain. As shown in the figure below:

Careful readers may soon have many questions about this picture, such as:

How to determine the downward direction of a function?

How far should each step go?

Is it possible to stay on the platform halfway up the mountain?

These problems are discussed next in this paper.

Algorithm description

The first point of gradient descent algorithm is to determine the direction of descent, that is, gradient.

We often use \nabla to denote the gradient.

For a curve in two-dimensional space, the gradient is the direction of its tangent. As shown in the figure below:

For functions in high dimensional space, the gradient is determined by the partial derivatives of all variables.

Its expression is as follows:

[\nabla f(\theta) = (\frac{\partial f(\theta)}{\partial \theta_1}, \frac{\partial f(\theta)}{\partial \theta_2}, ..., \frac{\partial f(\theta)}{\partial \theta_n})]

In machine learning, we mainly use gradient descent algorithm to minimize the cost function, as follows:

[\theta^* = \arg\min L(\theta)]

where L is the cost function and \theta is the parameter.

The main logic of gradient descent algorithm is very simple, that is, descending along the gradient direction until the parameters converge.

Written as an iteration:

[\theta^{k+1}_i = \theta^{k}_i - \lambda \nabla f(\theta^{k})]

The subscript i here refers to the i-th parameter, and the superscript k refers to the result of the k-th step, not the k-th power. Once this is understood, the superscript k will be omitted in the formulas below. There are a few points to note:

Convergence means that the rate of change of the function is very small. How small is appropriate depends on the specific project. In a demonstration project we can choose a value such as 0.01 or 0.001. Different values will affect the number of iterations of the algorithm, because toward the end of gradient descent we get closer and closer to a flat region where the rate of change of the function becomes smaller and smaller. If you choose a smaller value, the number of iterations may increase sharply.

In the formula, \lambda is called the step size, also known as the learning rate. It determines how far each step goes; this value will be discussed in detail below. For now it can be assumed to be a fixed value such as 0.01 or 0.001.

In specific projects, we will not let the algorithm run endlessly, so we usually set a maximum upper limit of the number of iterations.
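To make these three points concrete, here is a minimal sketch of the generic iteration loop (my own illustration, not code from the original article; grad_f and theta0 are hypothetical placeholders for a gradient function and a starting point):

import numpy as np

def gd(grad_f, theta0, learning_rate=0.01, delta=0.001, max_iter=1000):
    # grad_f(theta) is assumed to return the gradient vector at theta
    theta = np.asarray(theta0, dtype=float)
    for k in range(max_iter):                 # upper limit on the number of iterations
        theta_new = theta - learning_rate * grad_f(theta)   # descend along the gradient
        if np.all(np.abs(theta_new - theta) < delta):       # convergence check
            return theta_new
        theta = theta_new
    return theta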

Gradient descent of linear regression

With the above knowledge, we can return to the implementation of the gradient descent algorithm for the cost function of the linear model.

First, according to the cost function, we can get the gradient vector as follows:

[\nabla f(\theta) = (\frac{\partial L(\theta)}{\partial \theta_0}, \frac{\partial L(\theta)}{\partial \theta_1}) = (\frac{2}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x^i - y^i), \frac{2}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x^i - y^i)x^i)]

Then, substituting each partial derivative into the iterative formula, we obtain:

[\theta_0 := \theta_0 - \lambda \frac{\partial L(\theta)}{\partial \theta_0} = \theta_0 - \frac{2\lambda}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x^i - y^i)]

[\theta_1 := \theta_1 - \lambda \frac{\partial L(\theta)}{\partial \theta_1} = \theta_1 - \frac{2\lambda}{m}\sum_{i=1}^{m}(\theta_0 + \theta_1 x^i - y^i)x^i]

From this, we can realize our gradient descent algorithm through code, and the logic of the algorithm is not complicated:

learning_rate = 0.01

def gradient_descent(x, y):
    t0 = 10
    t1 = 10
    delta = 0.001
    for times in range(1000):
        sum1 = 0
        sum2 = 0
        for i in range(len(x)):
            sum1 += (t0 + t1 * x[i] - y[i])
            sum2 += (t0 + t1 * x[i] - y[i]) * x[i]
        t0_ = t0 - 2 * learning_rate * sum1 / len(x)
        t1_ = t1 - 2 * learning_rate * sum2 / len(x)
        print('Times: {}, gradient: [{}, {}]'.format(times, t0_, t1_))
        if (abs(t0 - t0_) < delta and abs(t1 - t1_) < delta):
            print("Gradient descent finished")
            return t0_, t1_
        t0 = t0_
        t1 = t1_
    print("Gradient descent too many times")
    return t0, t1

The code is explained as follows:

We arbitrarily choose 10 as the starting point for both parameters.

The maximum number of iterations is set to 1000.

The convergence threshold delta is set to 0.001.

The learning step size is set to 0.01.
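Putting the pieces together, a minimal way to run the algorithm above might look like this (get_data is the name used in my reconstruction of the data-generation code earlier; it is not fixed by the original article):

x, y = get_data()
t0, t1 = gradient_descent(x, y)
print('Result: [{}, {}]'.format(t0, t1))  # should end up close to [5, 2]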

If we plot the fitted line obtained during each iteration of the algorithm, we get the following animation:

Finally, the results obtained by the algorithm are as follows:

Times: 657, gradient: [5.196562662718697, 1.952931052920264]
Times: 658, gradient: [5.195558390180733, 1.9530753071808193]
Times: 659, gradient: [5.194558335124868, 1.9532189556399233]
Times: 660, gradient: [5.193562479839619, 1.953620008416623]
Gradient descent finished

It can be seen from the output that the algorithm converges after 660 iterations. The result [5.193562479839619, 1.953620008416623] is close to the target value [5, 2]. If higher precision is needed, the value of delta can be made smaller, which of course requires more iterations.

High dimensional extension

Although our example is two-dimensional, the approach is similar for higher-dimensional cases. Likewise, the calculation follows the iterative formula:

[\theta_i = \theta_i - \lambda \sum_{k=1}^{m}(h_\theta(x^k) - y^k)x_i^k]

Here, the subscript I represents the i-th parameter and the superscript k represents the k-th data.

The gradient descent family: BGD

In the content above, we saw that every iteration of the algorithm needs to traverse all the samples. This practice is called batch gradient descent, or BGD for short. For a demonstration example with only 10 data points, this is no problem.

However, in actual projects, the number of data sets may be millions or tens of millions, and the calculation amount of each iteration will be very large.

So there are the following two variants.

SGD

Stochastic gradient descent (SGD): this algorithm selects only one sample at a time from the sample set for computation. Obviously, the computation of each step is then much smaller.

The algorithm formula is as follows:

[\theta_i = \theta_i - \lambda \frac{\partial L(\theta)}{\partial \theta_i} = \theta_i - \lambda(h_\theta(x^k) - y^k)x_i^k]

Of course, reducing the computational cost of the algorithm has a price: the result will depend strongly on the randomly selected data, which may make the final result of the algorithm unsatisfactory.
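As an illustration only (not code from the original article), the linear-regression loop above could be turned into SGD by using a single randomly chosen sample per update, roughly like this (the starting point and parameter values are assumptions):

import numpy as np

def sgd(x, y, learning_rate=0.01, max_iter=10000):
    t0, t1 = 10.0, 10.0
    for _ in range(max_iter):
        k = np.random.randint(len(x))     # pick one sample at random
        err = t0 + t1 * x[k] - y[k]
        t0 -= learning_rate * err         # update using this single sample only
        t1 -= learning_rate * err * x[k]
    return t0, t1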

MBGD

The above two practices are actually two extremes. One is to use all the data at once, and the other is to use only one data at a time.

We will naturally think of a way to take both: choose a small part of data at a time to iterate. This not only avoids the problem of too large data set, but also avoids the influence of single data on the algorithm.

This algorithm is called mini-batch gradient descent, or MBGD for short.

The algorithm formula is as follows:

[\theta_i = \theta_i - \lambda \sum_{k=a}^{a+b}(h_\theta(x^k) - y^k)x_i^k]

Of course, we can regard SGD as the special case of mini-batch with a batch size of 1.
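A rough sketch of the mini-batch variant on the same toy problem (again my own illustration under the same assumptions, not the article's code; batch_size and other parameters are arbitrary) could look like this:

import numpy as np

def mbgd(x, y, batch_size=4, learning_rate=0.01, max_iter=1000):
    x, y = np.asarray(x), np.asarray(y)
    t0, t1 = 10.0, 10.0
    for _ in range(max_iter):
        idx = np.random.choice(len(x), batch_size, replace=False)  # small random batch
        err = t0 + t1 * x[idx] - y[idx]
        t0 -= learning_rate * err.mean()            # average gradient over the batch
        t1 -= learning_rate * (err * x[idx]).mean()
    return t0, t1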

How to choose the algorithm variant mentioned above?

The following is Andrew Ng's suggestion:

If the number of samples is small (for example, less than or equal to 2000), BGD can be selected.

If the number of samples is large, choose MBGD with a batch size of, for example, 64, 128, 256 or 512.

The following table compares the three algorithms for optimization in deep learning:

Method | Accuracy | Update speed | Memory usage | Online learning
BGD | good | slow | high | no
SGD | good (with fluctuations) | fast | low | yes
MBGD | good | medium | medium | yes

Algorithm optimization

Equation 7 is the basic form of the algorithm, and many people have done more research on it. Next, we introduce several optimization methods of gradient descent algorithm.

Momentum effect

Momentum means exactly that: momentum. The idea of this algorithm is to use a dynamic model: each iteration of the algorithm builds on the velocity of the previous iteration.

The formula of this algorithm is as follows:

[v^t = \gamma v^{t-1} + \lambda \nabla f(\theta)]

[\theta = \theta - v^t]

Comparing with Equation 7, the main difference of this algorithm lies in the introduction of v: each step is influenced by the previous one.

Formally, the momentum algorithm introduces the variable v to play the role of velocity: it represents the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient. The name momentum comes from a physical analogy: according to Newton's laws of motion, the negative gradient is the force that moves a particle through parameter space. Momentum is defined in physics as mass times velocity. In the momentum learning algorithm we assume unit mass, so the velocity vector v can also be regarded as the momentum of the particle.

Setting the constant \gamma to a value from 0 to 0.9 is a good choice.

The following figure shows the effect comparison of momentum algorithm:

The momentum effect can be added by slightly modifying the original algorithm:

def gradient_descent_with_momentum(x, y):
    t0 = 10
    t1 = 10
    delta = 0.001
    v0 = 0
    v1 = 0
    gamma = 0.9
    for times in range(1000):
        sum1 = 0
        sum2 = 0
        for i in range(len(x)):
            sum1 += (t0 + t1 * x[i] - y[i])
            sum2 += (t0 + t1 * x[i] - y[i]) * x[i]
        v0 = gamma * v0 + 2 * learning_rate * sum1 / len(x)
        v1 = gamma * v1 + 2 * learning_rate * sum2 / len(x)
        t0_ = t0 - v0
        t1_ = t1 - v1
        print('Times: {}, gradient: [{}, {}]'.format(times, t0_, t1_))
        if (abs(t0 - t0_) < delta and abs(t1 - t1_) < delta):
            print("Gradient descent finished")
            return t0_, t1_
        t0 = t0_
        t1 = t1_
    print("Gradient descent too many times")
    return t0, t1

The following is the output of the algorithm:

Times: 125, gradient: [4.9555758569991, 2.0000501789775]
Times: 126, gradient: [4.955309381126545, 1.996928964532015]
Times: 127, gradient: [4.9542964317327005, 1.98674828684156]
Times: 128, gradient: [4.9536358220657, 1.9781180992510465]
Times: 129, gradient: [4.9541249625441, 1.978858350530971]
Gradient descent finished

It can be seen from the results that the improved algorithm needs only 129 iterations to converge, much faster than the 660 iterations before.

Similarly, we can make the calculation process of the algorithm into a dynamic diagram:

Compared with the original algorithm, the biggest difference is that while searching for the target value, the improved algorithm bounces up and down around the result; but the later the iteration, the smaller the bounce. This is the effect of momentum.

Learning rate optimization

At this point, you may be curious about how the learning rate is set.

In fact, the choice of this value needs some experience or trial and error to determine.

The book Deep Learning describes it this way: "It is more like an art than a science, and we should carefully refer to most of the guidance on this issue." The key point is that this value can be neither too large nor too small.

If this value is too small, the step size of each iteration will be very small, and the result is that the algorithm needs to be iterated many times.

So, what if the value is too large? The result is that the algorithm may oscillate back and forth around the target but never reach it. The following figure depicts this phenomenon:

In fact, the value of learning rate is not necessarily a constant, and there are many studies on the setting of this value.

The following are some common improved algorithms.

Adagrad

AdaGrad is the abbreviation of Adaptive Gradient, and the algorithm will set a different learning rate for each parameter. It uses the sum of squares of historical gradients as the calculation basis.

The algorithm formula is as follows:

[\theta_i = \theta_i - \frac{\lambda}{\sqrt{G_t + \epsilon}} \nabla f(\theta_i)]

Compared with Equation 7, the change here is the denominator with the square root.

Under the root sign there are two terms. The second term, \epsilon, is easy to understand: it is a small constant introduced artificially to avoid division by zero.

The first term, G_t, expands as follows:

[G_t = \sum_{i=1}^{t} \nabla f(\theta^i) \cdot \nabla f(\theta^i)]

This value is actually the accumulation of the sum of squares of each gradient in history.

The AdaGrad algorithm can automatically adjust the learning rate during training: it uses a higher learning rate for parameters that occur less frequently and, conversely, a smaller learning rate for parameters that occur more frequently. AdaGrad is therefore well suited to sparse data.

However, the disadvantage of this algorithm is that the learning rate may become very small, which makes the convergence of the algorithm very slow.

An intuitive explanation of this algorithm can be found in Professor Li Hongyi's video course: ML Lecture 3-1: Gradient Descent.
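The article gives no code for AdaGrad, but as a hedged sketch of the idea on the same two-parameter linear example (the learning rate, epsilon and starting point below are my own arbitrary choices, not from the article), the per-parameter accumulation could look like this:

import numpy as np

def adagrad(x, y, learning_rate=0.1, epsilon=1e-8, max_iter=1000):
    x, y = np.asarray(x), np.asarray(y)
    theta = np.array([10.0, 10.0])   # [theta_0, theta_1]
    G = np.zeros(2)                  # accumulated squared gradients, one entry per parameter
    for _ in range(max_iter):
        err = theta[0] + theta[1] * x - y
        grad = np.array([2 * err.mean(), 2 * (err * x).mean()])
        G += grad ** 2               # accumulate the square of each gradient component
        theta -= learning_rate * grad / np.sqrt(G + epsilon)
    return theta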

RMSProp

RMS is the abbreviation of Root Mean Square. RMSProp is an adaptive learning rate method proposed by Geoff Hinton, the godfather of artificial intelligence. AdaGrad accumulates all previous squared gradients, while RMSProp only keeps a running average, which alleviates the problem of AdaGrad's learning rate dropping rapidly.

The formula of this algorithm is as follows:

[E[\nabla f(\theta_i)^2]^{t} = \gamma E[\nabla f(\theta_i)^2]^{t-1} + (1-\gamma)(\nabla f(\theta_i))^2]

[\theta_i = \theta_i - \frac{\lambda}{\sqrt{E[g^2]^{t+1} + \epsilon}} \nabla f(\theta_i)]

Again, \epsilon is introduced to avoid division by zero. \gamma is the decay parameter, usually set to 0.9.

E[\nabla f(\theta_i)^2]^{t} is the average of the squared gradients at time t.
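Similarly, here is a minimal illustrative sketch of RMSProp on the same two-parameter problem (my own sketch; the parameter values and starting point are assumptions, not from the article):

import numpy as np

def rmsprop(x, y, learning_rate=0.01, gamma=0.9, epsilon=1e-8, max_iter=1000):
    x, y = np.asarray(x), np.asarray(y)
    theta = np.array([10.0, 10.0])
    Eg2 = np.zeros(2)                                 # running average of squared gradients
    for _ in range(max_iter):
        err = theta[0] + theta[1] * x - y
        grad = np.array([2 * err.mean(), 2 * (err * x).mean()])
        Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2   # exponential moving average
        theta -= learning_rate * grad / np.sqrt(Eg2 + epsilon)
    return theta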

Adam

Adam is the abbreviation of adaptive moment estimation. It uses the first moment estimation and the second moment estimation of the gradient to dynamically adjust the learning rate of each parameter.

The advantage of Adam is that the learning rate of each iteration has a certain range after deviation correction, which makes the parameters more stable.

The formula of this algorithm is as follows:

[m^{t} = \beta_1 m^{t-1} + (1-\beta_1)\nabla f(\theta)]

[v^{t} = \beta_2 v^{t-1} + (1-\beta_2)\nabla f(\theta)^2]

[\widehat{m}^{t} = \frac{m^{t}}{1-\beta_1^{t}}]

[\widehat{v}^{t} = \frac{v^{t}}{1-\beta_2^{t}}]

[\theta_i = \theta_i - \frac{\lambda}{\sqrt{\widehat{v}^{t}} + \epsilon}\widehat{m}^{t}]

Here m^t and v^t are the first-order and second-order moment estimates of the gradient, respectively. \widehat{m}^t and \widehat{v}^t are their bias-corrected versions, which can be regarded as approximately unbiased estimates of the expectations.

The proposers of the Adam algorithm suggest a default value of 0.9 for \beta_1, 0.999 for \beta_2, and 10^{-8} for \epsilon.

In practical application, Adam is commonly used, and it can get a prediction result quickly.
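For completeness, here is an illustrative sketch of Adam on the same toy problem using the suggested defaults (this is my own sketch under those assumptions, not the article's code; the learning rate and starting point are arbitrary):

import numpy as np

def adam(x, y, learning_rate=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8, max_iter=1000):
    x, y = np.asarray(x), np.asarray(y)
    theta = np.array([10.0, 10.0])
    m = np.zeros(2)                          # first-moment estimate
    v = np.zeros(2)                          # second-moment estimate
    for t in range(1, max_iter + 1):
        err = theta[0] + theta[1] * x - y
        grad = np.array([2 * err.mean(), 2 * (err * x).mean()])
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return theta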

Optimization summary

Here we list several optimization algorithms. It's hard to say which is the best, and different algorithms are suitable for different scenarios. In practical engineering, it may be necessary to try one by one to determine which one to choose. This process is also one of the processes that the current AI project has to go through.

In fact, research in this field goes far beyond this. If you are interested, you can continue with Sebastian Ruder's paper, An overview of gradient descent optimization algorithms, or the slides on optimization for deep learning, for more research.

Limited by space, I won't go into details here.

Algorithm limitation

The gradient descent algorithm has some limitations. First of all, it requires that the function must be differentiable, and this method cannot be used for nondifferentiable functions.

In addition, in some cases, the gradient descent algorithm may converge slowly or produce Z-shaped oscillation when it approaches the extreme point. This needs to be avoided by adjusting the learning rate.

In addition, gradient descent will encounter the following two types of problems.

Local minima

A local minimum means that the minimum we find is only the minimum of a region, not the global minimum. Because the starting point of the algorithm is arbitrary, it is easy to fall into a local minimum, taking the figure below as an example.

It's like walking down from the top of the mountain. The platform you first walked to may not be the foot of the mountain, but it may just be the platform halfway up the mountain.

The starting point of the algorithm determines the convergence speed of the algorithm and whether it will fall into local minima.

The bad news is that there seems to be no particularly good way to determine which point is a good starting point, which is a bit of luck. It may be a good method to try different random points many times, which is why the optimization algorithm is particularly time-consuming.

But the good news is:

For convex or concave functions, there is no problem of local extremum. Its local extremum must be the global extremum.

Some recent studies show that some local minima are not as bad as imagined, and they are very close to the results brought by global minima.

Saddle point

In addition to the local minimum, there may be another situation in the process of gradient descent: the saddle point. A saddle point is a point where the gradient is 0, but which is not an extremum of the function: there are both smaller and larger values around it. The shape is like a saddle.

As shown in the figure below:

Many kinds of random functions show the following properties: in low-dimensional space, local extremum is very common. However, in high-dimensional space, local minima are rare, while saddle points are common.

Saddle points, however, can be identified mathematically via the Hessian matrix. This will not be expanded on here; interested readers can continue to explore through the links provided below.
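As a tiny illustration of that Hessian criterion (my own example, not from the article): for f(x, y) = x^2 - y^2 the point (0, 0) has zero gradient, and the eigenvalues of its Hessian have mixed signs, which identifies it as a saddle point rather than an extremum.

import numpy as np

# Hessian of f(x, y) = x^2 - y^2 (constant for this quadratic function)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
eigvals = np.linalg.eigvalsh(H)
print(eigvals)  # [-2.  2.] -> mixed signs, so (0, 0) is a saddle point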

References and recommended reading materials

Wikipedia: gradient descent

Sebastian Ruder: Overview of gradient descent optimization algorithms

Andrew Ng: Machine learning.

Andrew Ng: Deep learning

Peter Flack: Machine learning

Li Hongyi - ML Lecture 3-1: Gradient Descent

PDF: Li Hongyi gradient descent

Introduction to Optimization in Deep Learning: Gradient Descent

Introduction to Optimization in Deep Learning: Momentum, RMSProp and Adam

Stochastic gradient descent - mini-batch and more

Liu Jianping Pinard: Summary of the gradient descent method

Reflections on the relationship among partial derivative, directional derivative, gradient and differential of multivariate function

[Machine Learning] Three forms of gradient descent: BGD, SGD and MBGD

Author: Paul https://paul.pub/gradient-descent/