In numerical optimization, the nonlinear conjugate gradient method generalizes the conjugate gradient method to nonlinear optimization. For a quadratic function f ( x ) {\displaystyle \displaystyle f(x)}
f ( x ) = ‖ A x − b ‖ 2 , {\displaystyle \displaystyle f(x)=\|Ax-b\|^{2},}the minimum of f {\displaystyle f} is obtained when the gradient is 0:
∇ x f = 2 A T ( A x − b ) = 0 {\displaystyle \nabla _{x}f=2A^{T}(Ax-b)=0} .Whereas linear conjugate gradient seeks a solution to the linear equation A T A x = A T b {\displaystyle \displaystyle A^{T}Ax=A^{T}b} , the nonlinear conjugate gradient method is generally used to find the local minimum of a nonlinear function using its gradient ∇ x f {\displaystyle \nabla _{x}f} alone. It works when the function is approximately quadratic near the minimum, which is the case when the function is twice differentiable at the minimum and the second derivative is non-singular there.
Given a function f ( x ) {\displaystyle \displaystyle f(x)} of N {\displaystyle N} variables to minimize, its gradient ∇ x f {\displaystyle \nabla _{x}f} indicates the direction of maximum increase. One simply starts in the opposite (steepest descent) direction:
Δ x 0 = − ∇ x f ( x 0 ) {\displaystyle \Delta x_{0}=-\nabla _{x}f(x_{0})}with an adjustable step length α {\displaystyle \displaystyle \alpha } and performs a line search in this direction until it reaches the minimum of f {\displaystyle \displaystyle f} :
α 0 := arg min α f ( x 0 + α Δ x 0 ) {\displaystyle \displaystyle \alpha _{0}:=\arg \min _{\alpha }f(x_{0}+\alpha \Delta x_{0})} , x 1 = x 0 + α 0 Δ x 0 {\displaystyle \displaystyle x_{1}=x_{0}+\alpha _{0}\Delta x_{0}}After this first iteration in the steepest direction Δ x 0 {\displaystyle \displaystyle \Delta x_{0}} , the following steps constitute one iteration of moving along a subsequent conjugate direction s n {\displaystyle \displaystyle s_{n}} , where s 0 = Δ x 0 {\displaystyle \displaystyle s_{0}=\Delta x_{0}} :
- Calculate the steepest direction: Δ x n = − ∇ x f ( x n ) {\displaystyle \Delta x_{n}=-\nabla _{x}f(x_{n})} ,
- Compute β n {\displaystyle \displaystyle \beta _{n}} according to one of the formulas below,
- Update the conjugate direction: s n = Δ x n + β n s n − 1 {\displaystyle \displaystyle s_{n}=\Delta x_{n}+\beta _{n}s_{n-1}}
- Perform a line search: optimize α n = arg min α f ( x n + α s n ) {\displaystyle \displaystyle \alpha _{n}=\arg \min _{\alpha }f(x_{n}+\alpha s_{n})} ,
- Update the position: x n + 1 = x n + α n s n {\displaystyle \displaystyle x_{n+1}=x_{n}+\alpha _{n}s_{n}} ,
With a pure quadratic function the minimum is reached within N iterations (excepting roundoff error), but a non-quadratic function will make slower progress. Subsequent search directions lose conjugacy requiring the search direction to be reset to the steepest descent direction at least every N iterations, or sooner if progress stops. However, resetting every iteration turns the method into steepest descent. The algorithm stops when it finds the minimum, determined when no progress is made after a direction reset (i.e. in the steepest descent direction), or when some tolerance criterion is reached.
Within a linear approximation, the parameters α {\displaystyle \displaystyle \alpha } and β {\displaystyle \displaystyle \beta } are the same as in the linear conjugate gradient method but have been obtained with line searches. The conjugate gradient method can follow narrow (ill-conditioned) valleys, where the steepest descent method slows down and follows a criss-cross pattern.
Four of the best known formulas for β n {\displaystyle \displaystyle \beta _{n}} are named after their developers:
- Fletcher–Reeves:
- Polak–Ribière:
- Hestenes–Stiefel:
- Dai–Yuan:
These formulas are equivalent for a quadratic function, but for nonlinear optimization the preferred formula is a matter of heuristics or taste. A popular choice is β = max { 0 , β P R } {\displaystyle \displaystyle \beta =\max\{0,\beta ^{PR}\}} , which provides a direction reset automatically.
Algorithms based on Newton's method potentially converge much faster. There, both step direction and length are computed from the gradient as the solution of a linear system of equations, with the coefficient matrix being the exact Hessian matrix (for Newton's method proper) or an estimate thereof (in the quasi-Newton methods, where the observed change in the gradient during the iterations is used to update the Hessian estimate). For high-dimensional problems, the exact computation of the Hessian is usually prohibitively expensive, and even its storage can be problematic, requiring O ( N 2 ) {\displaystyle O(N^{2})} memory (but see the limited-memory L-BFGS quasi-Newton method).
The conjugate gradient method can also be derived using optimal control theory. In this accelerated optimization theory, the conjugate gradient method falls out as a nonlinear optimal feedback controller,
u = k ( x , x ˙ ) := − γ a ∇ x f ( x ) − γ b x ˙ {\displaystyle u=k(x,{\dot {x}}):=-\gamma _{a}\nabla _{x}f(x)-\gamma _{b}{\dot {x}}}
for the double integrator system,
x ¨ = u {\displaystyle {\ddot {x}}=u}
The quantities γ a > 0 {\displaystyle \gamma _{a}>0} and γ b > 0 {\displaystyle \gamma _{b}>0} are variable feedback gains.
See also
- Gradient descent
- Broyden–Fletcher–Goldfarb–Shanno algorithm
- Conjugate gradient method
- L-BFGS (limited memory BFGS)
- Nelder–Mead method
- Wolfe conditions
References
Fletcher, R.; Reeves, C. M. (1964). "Function minimization by conjugate gradients". The Computer Journal. 7 (2): 149–154. doi:10.1093/comjnl/7.2.149. https://doi.org/10.1093%2Fcomjnl%2F7.2.149 ↩
Polak, E.; Ribière, G. (1969). "Note sur la convergence de méthodes de directions conjuguées". Revue Française d'Automatique, Informatique, Recherche Opérationnelle. 3 (1): 35–43. ↩
Hestenes, M. R.; Stiefel, E. (1952). "Methods of Conjugate Gradients for Solving Linear Systems". Journal of Research of the National Bureau of Standards. 49 (6): 409–436. doi:10.6028/jres.049.044. https://doi.org/10.6028%2Fjres.049.044 ↩
Dai, Y.-H.; Yuan, Y. (1999). "A nonlinear conjugate gradient method with a strong global convergence property". SIAM Journal on Optimization. 10 (1): 177–182. doi:10.1137/S1052623497318992. /wiki/Doi_(identifier) ↩
Shewchuk, J. R. (August 1994). "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" (PDF). https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf ↩
Ross, I. M. (2019). "An Optimal Control Theory for Accelerated Optimization". arXiv:1902.09004 [math.OC]. /wiki/I._Michael_Ross ↩
Ross, I. M. (2019). "An Optimal Control Theory for Accelerated Optimization". arXiv:1902.09004 [math.OC]. /wiki/I._Michael_Ross ↩