Broyden–Fletcher–Goldfarb–Shanno algorithm

<h2 id="rationale">Rationale</h2>
The optimization problem is to minimize $f(\mathbf{x})$, where $\mathbf{x}$ is a vector in $\mathbb{R}^{n}$ and $f$ is a differentiable scalar function. There are no constraints on the values that $\mathbf{x}$ can take.
The algorithm begins at an initial estimate $\mathbf{x}_{0}$ for the optimal value and proceeds iteratively to get a better estimate at each stage.
The <a href="/facts/Descent_direction/pj86tpeJ">search direction</a> $\mathbf{p}_{k}$ at stage $k$ is given by the solution of the analogue of the Newton equation:

$$B_{k}\mathbf{p}_{k}=-\nabla f(\mathbf{x}_{k}),$$

where $B_{k}$ is an approximation to the <a href="/facts/Hessian_matrix/RguMNr3m">Hessian matrix</a> at $\mathbf{x}_{k}$, which is updated iteratively at each stage, and $\nabla f(\mathbf{x}_{k})$ is the gradient of the function evaluated at $\mathbf{x}_{k}$. A <a href="/facts/Line_search/GQWhNEsR">line search</a> in the direction $\mathbf{p}_{k}$ is then used to find the next point $\mathbf{x}_{k+1}$ by minimizing $f(\mathbf{x}_{k}+\gamma \mathbf{p}_{k})$ over the scalar $\gamma >0$.

The quasi-Newton condition imposed on the update of $B_{k}$ is

$$B_{k+1}(\mathbf{x}_{k+1}-\mathbf{x}_{k})=\nabla f(\mathbf{x}_{k+1})-\nabla f(\mathbf{x}_{k}).$$

Let $\mathbf{y}_{k}=\nabla f(\mathbf{x}_{k+1})-\nabla f(\mathbf{x}_{k})$ and $\mathbf{s}_{k}=\mathbf{x}_{k+1}-\mathbf{x}_{k}$; then $B_{k+1}$ satisfies

$$B_{k+1}\mathbf{s}_{k}=\mathbf{y}_{k},$$

which is the secant equation.
The curvature condition $\mathbf{s}_{k}^{\top}\mathbf{y}_{k}>0$ must hold for $B_{k+1}$ to be positive definite, which can be verified by pre-multiplying the secant equation with $\mathbf{s}_{k}^{\top}$. If the function is not <a href="/facts/Strongly_convex_function/IbzG5SLF">strongly convex</a>, then the condition has to be enforced explicitly, e.g. by using a line search to find a point $\mathbf{x}_{k+1}$ satisfying the <a href="/facts/Wolfe_conditions/5C8cDvC8">Wolfe conditions</a>, which entail the curvature condition.
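Both claims in this paragraph can be spelled out in one line each. This is a sketch under stated assumptions: the next point is written as $\mathbf{x}_{k+1}=\mathbf{x}_{k}+\gamma \mathbf{p}_{k}$ with $\gamma >0$, $\mathbf{p}_{k}$ is a descent direction (so $\nabla f(\mathbf{x}_{k})^{\top}\mathbf{p}_{k}<0$), and $c_{2}\in (0,1)$ denotes the constant of the Wolfe curvature condition, a symbol introduced here for illustration.

$$\begin{aligned}
\mathbf{s}_{k}^{\top}B_{k+1}\mathbf{s}_{k} &= \mathbf{s}_{k}^{\top}\mathbf{y}_{k}, \\
\nabla f(\mathbf{x}_{k+1})^{\top}\mathbf{p}_{k}\geq c_{2}\,\nabla f(\mathbf{x}_{k})^{\top}\mathbf{p}_{k}
\;&\Longrightarrow\;
\mathbf{s}_{k}^{\top}\mathbf{y}_{k}=\gamma \left(\nabla f(\mathbf{x}_{k+1})-\nabla f(\mathbf{x}_{k})\right)^{\top}\mathbf{p}_{k}\geq (c_{2}-1)\,\gamma \,\nabla f(\mathbf{x}_{k})^{\top}\mathbf{p}_{k}>0.
\end{aligned}$$

The first line shows that a positive definite $B_{k+1}$ forces $\mathbf{s}_{k}^{\top}\mathbf{y}_{k}>0$; the second shows why a step satisfying the Wolfe curvature condition delivers it.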
Instead of requiring the full Hessian matrix at the point $\mathbf{x}_{k+1}$ to be computed as $B_{k+1}$, the approximate Hessian at stage $k$ is updated by the addition of two matrices:

$$B_{k+1}=B_{k}+U_{k}+V_{k}.$$

Both $U_{k}$ and $V_{k}$ are symmetric rank-one matrices, but their sum is a rank-two update matrix. The BFGS and <a href="/facts/Davidon%25E2%2580%2593Fletcher%25E2%2580%2593Powell_formula/5oYlloCx">DFP</a> updating matrices both differ from their predecessors by a rank-two matrix. A simpler rank-one method, the <a href="/facts/Symmetric_rank-one/ygdw0YXf">symmetric rank-one</a> method, does not guarantee <a href="/facts/Positive_definiteness/5Jvw9A5o">positive definiteness</a>. In order to maintain the symmetry and positive definiteness of $B_{k+1}$, the update form can be chosen as $B_{k+1}=B_{k}+\alpha \mathbf{u}\mathbf{u}^{\top}+\beta \mathbf{v}\mathbf{v}^{\top}$. Imposing the secant condition $B_{k+1}\mathbf{s}_{k}=\mathbf{y}_{k}$ and choosing $\mathbf{u}=\mathbf{y}_{k}$ and $\mathbf{v}=B_{k}\mathbf{s}_{k}$, we obtain:<a class="footnote-ref" id="fnref:8" href="#fn:8">8</a>

$$\alpha =\frac{1}{\mathbf{y}_{k}^{\top}\mathbf{s}_{k}},$$

$$\beta =-\frac{1}{\mathbf{s}_{k}^{\top}B_{k}\mathbf{s}_{k}}.$$
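One way to recover these coefficients, sketched here under the assumption that $B_{k}$ is symmetric (so that $\mathbf{v}^{\top}\mathbf{s}_{k}=\mathbf{s}_{k}^{\top}B_{k}\mathbf{s}_{k}$), is to insert $\mathbf{u}=\mathbf{y}_{k}$ and $\mathbf{v}=B_{k}\mathbf{s}_{k}$ into the secant condition:

$$\begin{aligned}
B_{k+1}\mathbf{s}_{k} &= B_{k}\mathbf{s}_{k}+\alpha \,\mathbf{y}_{k}\,(\mathbf{y}_{k}^{\top}\mathbf{s}_{k})+\beta \,B_{k}\mathbf{s}_{k}\,(\mathbf{s}_{k}^{\top}B_{k}\mathbf{s}_{k})=\mathbf{y}_{k}, \\
\alpha \,\mathbf{y}_{k}^{\top}\mathbf{s}_{k} &= 1,\qquad 1+\beta \,\mathbf{s}_{k}^{\top}B_{k}\mathbf{s}_{k}=0.
\end{aligned}$$

The second line matches the coefficients of $\mathbf{y}_{k}$ and $B_{k}\mathbf{s}_{k}$ on both sides, which is sufficient for the secant condition to hold (and necessary when the two vectors are linearly independent).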

Finally, we substitute $\alpha$ and $\beta$ into $B_{k+1}=B_{k}+\alpha \mathbf{u}\mathbf{u}^{\top}+\beta \mathbf{v}\mathbf{v}^{\top}$ and get the update equation of $B_{k+1}$:

$$B_{k+1}=B_{k}+\frac{\mathbf{y}_{k}\mathbf{y}_{k}^{\top}}{\mathbf{y}_{k}^{\top}\mathbf{s}_{k}}-\frac{B_{k}\mathbf{s}_{k}\mathbf{s}_{k}^{\top}B_{k}^{\top}}{\mathbf{s}_{k}^{\top}B_{k}\mathbf{s}_{k}}.$$
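As a quick numerical check of the update equation, the following minimal sketch applies it to a small example and confirms that the result satisfies the secant equation and stays symmetric. The helper names (`dot`, `matvec`, `bfgs_update`) and the 2×2 data are made up for illustration; the code is dependency-free Python and assumes $B_{k}$ is symmetric.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def bfgs_update(B, s, y):
    """Return B + y y^T / (y^T s) - (B s)(B s)^T / (s^T B s), for symmetric B."""
    Bs = matvec(B, s)
    ys = dot(y, s)    # curvature s^T y; must be positive
    sBs = dot(s, Bs)  # positive when B is positive definite and s != 0
    n = len(s)
    return [[B[i][j] + y[i] * y[j] / ys - Bs[i] * Bs[j] / sBs
             for j in range(n)] for i in range(n)]

# Made-up illustrative data (not taken from the article).
B = [[2.0, 0.0], [0.0, 1.0]]
s = [1.0, 0.5]
y = [1.5, 1.0]

B_new = bfgs_update(B, s, y)
secant_residual = [a - b for a, b in zip(matvec(B_new, s), y)]
print(B_new)
print(secant_residual)  # entries should vanish up to rounding
```

Because $B_{k}$ is symmetric, the numerator $B_{k}\mathbf{s}_{k}\mathbf{s}_{k}^{\top}B_{k}^{\top}$ is the outer product of $B_{k}\mathbf{s}_{k}$ with itself, which is how the sketch computes it.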

<h2 id="algorithm">Algorithm</h2>
Consider the following unconstrained optimization problem:

$$\underset{\mathbf{x}\in \mathbb{R}^{n}}{\text{minimize}}\quad f(\mathbf{x}),$$

where $f:\mathbb{R}^{n}\to \mathbb{R}$ is a nonlinear objective function.
From an initial guess $\mathbf{x}_{0}\in \mathbb{R}^{n}$ and an initial guess of the Hessian matrix $B_{0}\in \mathbb{R}^{n\times n}$, the following steps are repeated as $\mathbf{x}_{k}$ converges to the solution:

<ol><li>Obtain a direction $\mathbf{p}_{k}$ by solving $B_{k}\mathbf{p}_{k}=-\nabla f(\mathbf{x}_{k})$.</li>
<li>Perform a one-dimensional optimization (<a href="/facts/Line_search/GQWhNEsR">line search</a>) to find an acceptable stepsize $\alpha_{k}$ in the direction found in the first step. If an exact line search is performed, then $\alpha_{k}=\arg \min f(\mathbf{x}_{k}+\alpha \mathbf{p}_{k})$. In practice, an inexact line search usually suffices, with an acceptable $\alpha_{k}$ satisfying the <a href="/facts/Wolfe_conditions/5C8cDvC8">Wolfe conditions</a>.</li>
<li>Set $\mathbf{s}_{k}=\alpha_{k}\mathbf{p}_{k}$ and update $\mathbf{x}_{k+1}=\mathbf{x}_{k}+\mathbf{s}_{k}$.</li>
<li>Compute $\mathbf{y}_{k}=\nabla f(\mathbf{x}_{k+1})-\nabla f(\mathbf{x}_{k})$.</li>
<li>Compute $B_{k+1}=B_{k}+\frac{\mathbf{y}_{k}\mathbf{y}_{k}^{\top}}{\mathbf{y}_{k}^{\top}\mathbf{s}_{k}}-\frac{B_{k}\mathbf{s}_{k}\mathbf{s}_{k}^{\top}B_{k}^{\top}}{\mathbf{s}_{k}^{\top}B_{k}\mathbf{s}_{k}}$.</li></ol>
Convergence can be determined by observing the norm of the gradient; given some $\epsilon >0$, one may stop the algorithm when $\|\nabla f(\mathbf{x}_{k})\|\leq \epsilon$. If $B_{0}$ is initialized with $B_{0}=I$, the first step will be equivalent to a <a href="/facts/Gradient_descent/pFFrek0F">gradient descent</a> step, but further steps are more and more refined by $B_{k}$, the approximation to the Hessian.
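The five steps and the gradient-norm stopping test can be sketched in dependency-free Python as follows. Several choices here are assumptions made for illustration rather than part of the algorithm as stated: the names `bfgs_direct` and `solve`, a simple backtracking (Armijo) line search in place of a full Wolfe search, skipping the update whenever the curvature $\mathbf{s}_{k}^{\top}\mathbf{y}_{k}$ is not safely positive, and the quadratic test function.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]  # augmented copy
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def bfgs_direct(f, grad, x0, tol=1e-8, max_iter=200):
    n = len(x0)
    x = list(x0)
    B = [[float(i == j) for j in range(n)] for i in range(n)]  # B_0 = I
    g = grad(x)
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 <= tol:               # stop when ||grad f|| <= eps
            break
        p = solve(B, [-gi for gi in g])           # step 1: B_k p_k = -grad f(x_k)
        alpha, fx, slope = 1.0, f(x), dot(g, p)   # step 2: backtracking (assumption)
        while alpha > 1e-12 and f([xi + alpha * pi for xi, pi in zip(x, p)]) > fx + 1e-4 * alpha * slope:
            alpha *= 0.5
        s = [alpha * pi for pi in p]              # step 3
        x = [xi + si for xi, si in zip(x, s)]
        g_new = grad(x)
        y = [a - b for a, b in zip(g_new, g)]     # step 4
        g = g_new
        ys = dot(y, s)
        if ys > 1e-12:                            # step 5, skipped if curvature fails
            Bs = matvec(B, s)
            sBs = dot(s, Bs)
            B = [[B[i][j] + y[i] * y[j] / ys - Bs[i] * Bs[j] / sBs
                  for j in range(n)] for i in range(n)]
    return x

# Made-up convex quadratic with minimizer (1, -0.5).
def f(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2

def grad_f(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)]

x_min = bfgs_direct(f, grad_f, [0.0, 0.0])
print(x_min)  # should be close to [1.0, -0.5]
```

With $B_{0}=I$ the first direction is simply the negative gradient, which matches the remark above about the first step being a gradient descent step.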
The first step of the algorithm is carried out using the inverse of the matrix $B_{k}$, which can be obtained efficiently by applying the <a href="/facts/Sherman%25E2%2580%2593Morrison_formula/KYb0x04E">Sherman–Morrison formula</a> to step 5 of the algorithm, giving

$$B_{k+1}^{-1}=\left(I-\frac{\mathbf{s}_{k}\mathbf{y}_{k}^{\top}}{\mathbf{y}_{k}^{\top}\mathbf{s}_{k}}\right)B_{k}^{-1}\left(I-\frac{\mathbf{y}_{k}\mathbf{s}_{k}^{\top}}{\mathbf{y}_{k}^{\top}\mathbf{s}_{k}}\right)+\frac{\mathbf{s}_{k}\mathbf{s}_{k}^{\top}}{\mathbf{y}_{k}^{\top}\mathbf{s}_{k}}.$$

This can be computed efficiently without temporary matrices, recognizing that $B_{k}^{-1}$ is symmetric, and that $\mathbf{y}_{k}^{\top}B_{k}^{-1}\mathbf{y}_{k}$ and $\mathbf{s}_{k}^{\top}\mathbf{y}_{k}$ are scalars, using an expansion such as

$$B_{k+1}^{-1}=B_{k}^{-1}+\frac{(\mathbf{s}_{k}^{\top}\mathbf{y}_{k}+\mathbf{y}_{k}^{\top}B_{k}^{-1}\mathbf{y}_{k})(\mathbf{s}_{k}\mathbf{s}_{k}^{\top})}{(\mathbf{s}_{k}^{\top}\mathbf{y}_{k})^{2}}-\frac{B_{k}^{-1}\mathbf{y}_{k}\mathbf{s}_{k}^{\top}+\mathbf{s}_{k}\mathbf{y}_{k}^{\top}B_{k}^{-1}}{\mathbf{s}_{k}^{\top}\mathbf{y}_{k}}.$$
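The expanded inverse formula can be checked numerically against the direct update: if the starting matrices are inverses of each other, the two updated matrices should be as well. The sketch below does this for a small made-up example; the function names and the data are illustrative assumptions, and both matrices are assumed symmetric.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def matmul(A, B):
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]

def direct_update(B, s, y):
    """BFGS update of the Hessian approximation B (symmetric)."""
    Bs = matvec(B, s)
    ys, sBs = dot(y, s), dot(s, Bs)
    n = len(s)
    return [[B[i][j] + y[i] * y[j] / ys - Bs[i] * Bs[j] / sBs
             for j in range(n)] for i in range(n)]

def inverse_update(H, s, y):
    """Expanded BFGS update of H = B^{-1} (symmetric), with no matrix inversion."""
    Hy = matvec(H, y)                   # also equals (y^T H)^T because H is symmetric
    ys = dot(s, y)                      # scalar s^T y
    coef = (ys + dot(y, Hy)) / ys ** 2  # scalar (s^T y + y^T H y) / (s^T y)^2
    n = len(s)
    return [[H[i][j] + coef * s[i] * s[j] - (Hy[i] * s[j] + s[i] * Hy[j]) / ys
             for j in range(n)] for i in range(n)]

# Made-up illustrative data; H is the exact inverse of B.
B = [[2.0, 0.0], [0.0, 1.0]]
H = [[0.5, 0.0], [0.0, 1.0]]
s, y = [1.0, 0.5], [1.5, 1.0]

product = matmul(inverse_update(H, s, y), direct_update(B, s, y))
print(product)  # should be the identity matrix up to rounding
```

Only matrix-vector products, outer products, and two scalars are needed, which is the point of the expansion.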

Therefore, in order to avoid any matrix inversion, the inverse of the Hessian can be approximated instead of the Hessian itself: $H_{k}\overset{\operatorname{def}}{=}B_{k}^{-1}$.<a class="footnote-ref" id="fnref:9" href="#fn:9">9</a>
From an initial guess $\mathbf{x}_{0}$ and an approximate inverse Hessian matrix $H_{0}$, the following steps are repeated as $\mathbf{x}_{k}$ converges to the solution:

<ol><li>Obtain a direction $\mathbf{p}_{k}$ by computing $\mathbf{p}_{k}=-H_{k}\nabla f(\mathbf{x}_{k})$.</li>
<li>Perform a one-dimensional optimization (<a href="/facts/Line_search/GQWhNEsR">line search</a>) to find an acceptable stepsize $\alpha_{k}$ in the direction found in the first step. If an exact line search is performed, then $\alpha_{k}=\arg \min f(\mathbf{x}_{k}+\alpha \mathbf{p}_{k})$. In practice, an inexact line search usually suffices, with an acceptable $\alpha_{k}$ satisfying the <a href="/facts/Wolfe_conditions/5C8cDvC8">Wolfe conditions</a>.</li>
<li>Set $\mathbf{s}_{k}=\alpha_{k}\mathbf{p}_{k}$ and update $\mathbf{x}_{k+1}=\mathbf{x}_{k}+\mathbf{s}_{k}$.</li>
<li>Compute $\mathbf{y}_{k}=\nabla f(\mathbf{x}_{k+1})-\nabla f(\mathbf{x}_{k})$.</li>
<li>Compute $H_{k+1}=H_{k}+\frac{(\mathbf{s}_{k}^{\top}\mathbf{y}_{k}+\mathbf{y}_{k}^{\top}H_{k}\mathbf{y}_{k})(\mathbf{s}_{k}\mathbf{s}_{k}^{\top})}{(\mathbf{s}_{k}^{\top}\mathbf{y}_{k})^{2}}-\frac{H_{k}\mathbf{y}_{k}\mathbf{s}_{k}^{\top}+\mathbf{s}_{k}\mathbf{y}_{k}^{\top}H_{k}}{\mathbf{s}_{k}^{\top}\mathbf{y}_{k}}$.</li></ol>
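A minimal, dependency-free Python sketch of this inverse-Hessian variant follows. As assumptions for illustration only: the function name `bfgs_inverse` is hypothetical, a backtracking (Armijo) line search stands in for a Wolfe line search, the update is skipped when the curvature $\mathbf{s}_{k}^{\top}\mathbf{y}_{k}$ is not safely positive, $H_{0}=I$, and the smooth convex test function is made up.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def bfgs_inverse(f, grad, x0, tol=1e-6, max_iter=200):
    n = len(x0)
    x = list(x0)
    H = [[float(i == j) for j in range(n)] for i in range(n)]  # H_0 = I (assumption)
    g = grad(x)
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 <= tol:               # gradient-norm stopping test
            break
        p = [-v for v in matvec(H, g)]            # step 1: no linear solve needed
        alpha, fx, slope = 1.0, f(x), dot(g, p)   # step 2: backtracking (assumption)
        while alpha > 1e-12 and f([xi + alpha * pi for xi, pi in zip(x, p)]) > fx + 1e-4 * alpha * slope:
            alpha *= 0.5
        s = [alpha * pi for pi in p]              # step 3
        x = [xi + si for xi, si in zip(x, s)]
        g_new = grad(x)
        y = [a - b for a, b in zip(g_new, g)]     # step 4
        g = g_new
        ys = dot(s, y)
        if ys > 1e-12:                            # step 5, skipped if curvature fails
            Hy = matvec(H, y)
            coef = (ys + dot(y, Hy)) / ys ** 2
            H = [[H[i][j] + coef * s[i] * s[j] - (Hy[i] * s[j] + s[i] * Hy[j]) / ys
                  for j in range(n)] for i in range(n)]
    return x

# Made-up smooth convex test function with minimizer (1, -2).
def f(x):
    return math.cosh(x[0] - 1.0) + (x[1] + 2.0) ** 2

def grad_f(x):
    return [math.sinh(x[0] - 1.0), 2.0 * (x[1] + 2.0)]

x_min = bfgs_inverse(f, grad_f, [0.0, 0.0])
print(x_min)  # should be close to [1.0, -2.0]
```

Compared with the version that stores $B_{k}$, each iteration here costs only matrix-vector and outer products, since the search direction is a product with $H_{k}$ rather than the solution of a linear system.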
In statistical estimation problems (such as <a href="/facts/Maximum_likelihood_estimation/0Yq2dpQD">maximum likelihood</a> or Bayesian inference), <a href="/facts/Credible_interval/BeVTjGrm">credible intervals</a> or <a href="/facts/Confidence_interval/NS8mG5UE">confidence intervals</a> for the solution can be estimated from the <a href="/facts/Matrix_inverse/fPqXk3V8">inverse</a> of the final Hessian matrix. However, these quantities are technically defined by the true Hessian matrix, and the BFGS approximation may not converge to the true Hessian matrix.<a class="footnote-ref" id="fnref:10" href="#fn:10">10</a>

<h2 id="further-developments">Further developments</h2>
The BFGS update formula heavily relies on the curvature $\mathbf{s}_{k}^{\top}\mathbf{y}_{k}$ being strictly positive and bounded away from zero.
This condition is satisfied when we perform a line search with Wolfe conditions on a convex target.
However, some real-life applications (such as sequential quadratic programming methods) routinely produce negative or near-zero curvature.
This can occur when optimizing a nonconvex target or when employing a trust-region approach instead of a line search.
It is also possible to produce spurious values due to noise in the target.
In such cases, one of the so-called damped BFGS updates can be used,<a class="footnote-ref" id="fnref:11" href="#fn:11">11</a> which modify $\mathbf{s}_{k}$ and/or $\mathbf{y}_{k}$ in order to obtain a more robust update.
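The text does not spell out a particular damped update. As one illustration, the sketch below implements Powell's damping, a commonly cited choice that keeps $\mathbf{s}_{k}$ and replaces $\mathbf{y}_{k}$ by a combination $\mathbf{r}_{k}=\theta \mathbf{y}_{k}+(1-\theta )B_{k}\mathbf{s}_{k}$, with $\theta$ chosen so that $\mathbf{s}_{k}^{\top}\mathbf{r}_{k}\geq 0.2\,\mathbf{s}_{k}^{\top}B_{k}\mathbf{s}_{k}$. The function name `powell_damped_y`, the parameter name `delta`, and the example data are assumptions made here; the threshold 0.2 is the conventional default, and $B_{k}$ is assumed positive definite.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def powell_damped_y(B, s, y, delta=0.2):
    """Return r = theta*y + (1 - theta)*B s with s^T r >= delta * s^T B s."""
    Bs = matvec(B, s)
    sBs = dot(s, Bs)   # positive for positive definite B and s != 0
    sy = dot(s, y)
    if sy >= delta * sBs:
        return list(y)                        # curvature is fine: theta = 1
    theta = (1.0 - delta) * sBs / (sBs - sy)  # damp just enough to reach the threshold
    return [theta * yi + (1.0 - theta) * bi for yi, bi in zip(y, Bs)]

# Made-up example with negative curvature s^T y = -0.5.
B = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 0.0]
y = [-0.5, 0.3]

r = powell_damped_y(B, s, y)
print(r, dot(s, r))  # s^T r is lifted to delta * s^T B s
```

Using $\mathbf{r}_{k}$ in place of $\mathbf{y}_{k}$ in the BFGS update keeps the updated matrix positive definite even when the raw curvature is negative, at the cost of no longer satisfying the original secant equation exactly.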

<h2 id="notable-implementations">Notable implementations</h2>
Notable open source implementations are:

<ul><li><a href="/facts/ALGLIB/fSduLPd2">ALGLIB</a> implements BFGS and its limited-memory version in C++ and C#</li>
<li><a href="/facts/GNU_Octave/vVVq5evM">GNU Octave</a> uses a form of BFGS in its fsolve function, with <a href="/facts/Trust_region/9xmnBqxY">trust region</a> extensions.</li>
<li>The <a href="/facts/GNU_Scientific_Library/WKX9L554">GSL</a> implements BFGS as gsl_multimin_fdfminimizer_vector_bfgs2.<a class="footnote-ref" id="fnref:12" href="#fn:12">12</a></li>
<li>In <a href="/facts/R_(programming_language)/LSrkr8K8">R</a>, the BFGS algorithm (and the L-BFGS-B version that allows box constraints) is implemented as an option of the base function optim().<a class="footnote-ref" id="fnref:13" href="#fn:13">13</a></li>
<li>In <a href="/facts/SciPy/bNcMQGc0">SciPy</a>, the scipy.optimize.fmin_bfgs function implements BFGS.<a class="footnote-ref" id="fnref:14" href="#fn:14">14</a> It is also possible to run BFGS using any of the <a href="/facts/L-BFGS/spe3MAb3">L-BFGS</a> algorithms by setting the parameter L to a very large number. It is also one of the default methods used when running scipy.optimize.minimize with no constraints.<a class="footnote-ref" id="fnref:15" href="#fn:15">15</a></li>
<li>In <a href="/facts/Julia_(programming_language)/AoB0PJ9C">Julia</a>, the <a href="https://julianlsolvers.github.io/Optim.jl/stable/">Optim.jl</a> package implements BFGS and L-BFGS as a solver option to the optimize() function (among other options).<a class="footnote-ref" id="fnref:16" href="#fn:16">16</a></li></ul>
Notable proprietary implementations include:

<ul><li>The large scale nonlinear optimization software <a href="/facts/Artelys_Knitro/e5D2EHkH">Artelys Knitro</a> implements, among others, both BFGS and L-BFGS algorithms.</li>
<li>In the MATLAB <a href="/facts/Optimization_Toolbox/WQDfjsKm">Optimization Toolbox</a>, the fminunc function uses BFGS with cubic line search when the problem size is set to "medium scale."</li>
<li><a href="/facts/Mathematica/dRwoGFG2">Mathematica</a> includes BFGS.</li>
<li>LS-DYNA also uses BFGS to solve implicit problems.</li></ul>
<h2 id="see-also">See also</h2>

<ul><li><a href="/facts/BHHH_algorithm/RXprq4w2">BHHH algorithm</a></li>
<li><a href="/facts/Davidon%25E2%2580%2593Fletcher%25E2%2580%2593Powell_formula/5oYlloCx">Davidon–Fletcher–Powell formula</a></li>
<li><a href="/facts/Gradient_descent/pFFrek0F">Gradient descent</a></li>
<li><a href="/facts/L-BFGS/spe3MAb3">L-BFGS</a></li>
<li><a href="/facts/Levenberg%25E2%2580%2593Marquardt_algorithm/yqhqgb1g">Levenberg–Marquardt algorithm</a></li>
<li><a href="/facts/Nelder%25E2%2580%2593Mead_method/6aXswEax">Nelder–Mead method</a></li>
<li><a href="/facts/Pattern_search_(optimization)/J7HGHb50">Pattern search (optimization)</a></li>
<li><a href="/facts/Quasi-Newton_methods/1YX1vMRa">Quasi-Newton methods</a></li>
<li><a href="/facts/Symmetric_rank-one/ygdw0YXf">Symmetric rank-one</a></li>
<li><a href="/facts/Compact_quasi-Newton_representation/N8ChSQmf">Compact quasi-Newton representation</a></li></ul>

<h2 id="further-reading">Further reading</h2>
<ul><li>Avriel, Mordecai (2003), Nonlinear Programming: Analysis and Methods, Dover Publishing, <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 978-0-486-43227-4</li>
<li>Bonnans, J. Frédéric; Gilbert, J. Charles; <a href="/facts/Claude_Lemar%25C3%25A9chal/aMcoIXVJ">Lemaréchal, Claude</a>; <a href="/facts/Claudia_Sagastiz%25C3%25A1bal/12mra0FN">Sagastizábal, Claudia A.</a> (2006), "Newtonian Methods", Numerical Optimization: Theoretical and Practical Aspects (Second ed.), Berlin: Springer, pp. 51–66, <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 3-540-35445-X</li>
<li>Fletcher, Roger (1987), <a href="https://archive.org/details/practicalmethods0000flet">Practical Methods of Optimization</a> (2nd ed.), New York: <a href="/facts/John_Wiley_%2526_Sons/53g3LqJc">John Wiley & Sons</a>, <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 978-0-471-91547-8</li>
<li><a href="/facts/David_G._Luenberger/cRQlQCYn">Luenberger, David G.</a>; <a href="/facts/Yinyu_Ye/N8tK0Uqd">Ye, Yinyu</a> (2008), Linear and nonlinear programming, International Series in Operations Research & Management Science, vol. 116 (Third ed.), New York: Springer, pp. xiv+546, <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 978-0-387-74502-2, <a href="/facts/MR_(identifier)/uP137L11">MR</a> <a href="https://mathscinet.ams.org/mathscinet-getitem?mr=2423726">2423726</a></li>
<li>Kelley, C. T. (1999), Iterative Methods for Optimization, Philadelphia: Society for Industrial and Applied Mathematics, pp. 71–86, <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 0-89871-433-8</li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1">Fletcher, Roger (1987), Practical Methods of Optimization (2nd ed.), New York: John Wiley & Sons, ISBN 978-0-471-91547-8 <a href="#fnref:1" class="footnote-back-ref">↩</a></li>
<li id="fn:2">Dennis, J. E. Jr.; Schnabel, Robert B. (1983), "Secant Methods for Unconstrained Minimization", Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Englewood Cliffs, NJ: Prentice-Hall, pp. 194–215, ISBN 0-13-627216-9 <a href="#fnref:2" class="footnote-back-ref">↩</a></li>
<li id="fn:3">Byrd, Richard H.; Lu, Peihuang; Nocedal, Jorge; Zhu, Ciyou (1995), "A Limited Memory Algorithm for Bound Constrained Optimization", SIAM Journal on Scientific Computing, 16 (5): 1190–1208, CiteSeerX 10.1.1.645.5814, doi:10.1137/0916069 <a href="http://www.ece.northwestern.edu/~nocedal/PSfiles/limited.ps.gz" target="_blank">http://www.ece.northwestern.edu/~nocedal/PSfiles/limited.ps.gz</a> <a href="#fnref:3" class="footnote-back-ref">↩</a></li>
<li id="fn:4">Broyden, C. G. (1970), "The convergence of a class of double-rank minimization algorithms", Journal of the Institute of Mathematics and Its Applications, 6: 76–90, doi:10.1093/imamat/6.1.76 <a href="#fnref:4" class="footnote-back-ref">↩</a></li>
<li id="fn:5">Fletcher, R. (1970), "A New Approach to Variable Metric Algorithms", Computer Journal, 13 (3): 317–322, doi:10.1093/comjnl/13.3.317 <a href="#fnref:5" class="footnote-back-ref">↩</a></li>
<li id="fn:6">Goldfarb, D. (1970), "A Family of Variable Metric Updates Derived by Variational Means", Mathematics of Computation, 24 (109): 23–26, doi:10.1090/S0025-5718-1970-0258249-6 <a href="#fnref:6" class="footnote-back-ref">↩</a></li>
<li id="fn:7">Shanno, David F. (July 1970), "Conditioning of quasi-Newton methods for function minimization", Mathematics of Computation, 24 (111): 647–656, doi:10.1090/S0025-5718-1970-0274029-X, MR 0274029 <a href="#fnref:7" class="footnote-back-ref">↩</a></li>
<li id="fn:8">Fletcher, Roger (1987), Practical Methods of Optimization (2nd ed.), New York: John Wiley & Sons, ISBN 978-0-471-91547-8 <a href="#fnref:8" class="footnote-back-ref">↩</a></li>
<li id="fn:9">Nocedal, Jorge; Wright, Stephen J. (2006), Numerical Optimization (2nd ed.), Berlin, New York: Springer-Verlag, ISBN 978-0-387-30303-1 <a href="#fnref:9" class="footnote-back-ref">↩</a></li>
<li id="fn:10">Ge, Ren-pu; Powell, M. J. D. (1983), "The Convergence of Variable Metric Matrices in Unconstrained Optimization", Mathematical Programming, 27 (2): 123, doi:10.1007/BF02591941, S2CID 8113073 <a href="#fnref:10" class="footnote-back-ref">↩</a></li>
<li id="fn:11">Nocedal, Jorge; Wright, Stephen J. (2006), Numerical Optimization <a href="#fnref:11" class="footnote-back-ref">↩</a></li>
<li id="fn:12">"GNU Scientific Library — GSL 2.6 documentation". www.gnu.org. Retrieved 2020-11-22. <a href="https://www.gnu.org/software/gsl/doc/html/index.html" target="_blank">https://www.gnu.org/software/gsl/doc/html/index.html</a> <a href="#fnref:12" class="footnote-back-ref">↩</a></li>
<li id="fn:13">"R: General-purpose Optimization". stat.ethz.ch. Retrieved 2020-11-22. <a href="https://stat.ethz.ch/R-manual/R-devel/library/stats/html/optim.html" target="_blank">https://stat.ethz.ch/R-manual/R-devel/library/stats/html/optim.html</a> <a href="#fnref:13" class="footnote-back-ref">↩</a></li>
<li id="fn:14">"scipy.optimize.fmin_bfgs — SciPy v1.5.4 Reference Guide". docs.scipy.org. Retrieved 2020-11-22. <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_bfgs.html" target="_blank">https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_bfgs.html</a> <a href="#fnref:14" class="footnote-back-ref">↩</a></li>
<li id="fn:15">"scipy.optimize.minimize — SciPy v1.5.4 Reference Guide". docs.scipy.org. Retrieved 2025-01-22. <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html" target="_blank">https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html</a> <a href="#fnref:15" class="footnote-back-ref">↩</a></li>
<li id="fn:16">"Optim.jl Configurable options". julianlsolvers. <a href="https://julianlsolvers.github.io/Optim.jl/stable/#user/config/#solver-options" target="_blank">https://julianlsolvers.github.io/Optim.jl/stable/#user/config/#solver-options</a> <a href="#fnref:16" class="footnote-back-ref">↩</a></li>
</ol>
