Actor-critic methods can be understood as an improvement over pure policy gradient methods such as REINFORCE, obtained by introducing a learned baseline (the critic) that reduces the variance of the gradient estimate.
The actor uses a policy function $\pi(a|s)$, while the critic estimates either the value function $V(s)$, the action-value function $Q(s,a)$, the advantage function $A(s,a)$, or any combination thereof.
The actor is a parameterized function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution over actions $\pi_\theta(\cdot|s)$.
If the action space is discrete, then $\sum_a \pi_\theta(a|s) = 1$. If the action space is continuous, then $\int \pi_\theta(a|s)\,da = 1$.
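As a concrete illustration, the following is a minimal sketch of an actor for a discrete action space, assuming a linear-softmax parameterization (the feature-vector representation and dimensions are illustrative, not part of the definition above):

```python
import numpy as np

def policy(theta, state_features):
    """Softmax policy pi_theta(.|s) over a discrete action space.

    theta: (num_actions, num_features) parameter matrix (illustrative choice).
    state_features: (num_features,) feature vector for state s.
    Returns a probability vector over actions that sums to 1.
    """
    logits = theta @ state_features
    logits = logits - logits.max()      # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: 3 actions, 4 state features.
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))
s = rng.normal(size=4)
print(policy(theta, s), policy(theta, s).sum())  # probabilities, sums to 1.0
```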
The goal of policy optimization is to improve the actor, that is, to find parameters $\theta$ that maximize the expected episodic reward $J(\theta)$: $$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$ where $\gamma$ is the discount factor, $r_t$ is the reward at step $t$, and $T$ is the time horizon (which can be infinite).
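For a single sampled episode, the inner discounted sum can be computed directly; $J(\theta)$ would then be estimated by averaging this quantity over many episodes sampled while acting with $\pi_\theta$. A minimal sketch (the rewards and discount factor below are illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one episode."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Averaging discounted_return over sampled episodes gives a Monte Carlo
# estimate of J(theta).
print(discounted_return([1.0, 0.0, 2.0], gamma=0.99))  # 1.0 + 0 + 0.99**2 * 2
```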
The goal of a policy gradient method is to maximize $J(\theta)$ by gradient ascent on the policy gradient $\nabla_\theta J(\theta)$.
As detailed on the policy gradient method page, there are many unbiased estimators of the policy gradient: $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq j \leq T} \nabla_\theta \ln \pi_\theta(A_j|S_j)\cdot \Psi_j \,\Big|\, S_0 = s_0\right]$$ where $\Psi_j$ may be, for example, the discounted return of the episode, the discounted reward-to-go from step $j$, the action-value $Q^{\pi_\theta}(S_j, A_j)$, the advantage $A^{\pi_\theta}(S_j, A_j)$, or the TD residual $R_j + \gamma V^{\pi_\theta}(S_{j+1}) - V^{\pi_\theta}(S_j)$, possibly with a state-dependent baseline subtracted.
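A minimal sketch of how one sample of this estimator is formed from a recorded trajectory, using the discounted reward-to-go as $\Psi_j$ and the linear-softmax policy from the earlier sketch (one common choice among those listed, not the only one):

```python
import numpy as np

def grad_log_policy(theta, state_features, action):
    """Gradient of ln pi_theta(action|s) for a linear-softmax policy."""
    logits = theta @ state_features
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    indicator = np.zeros(len(probs))
    indicator[action] = 1.0
    # For logits = theta @ x, the score is (e_a - pi(.|s)) outer x.
    return np.outer(indicator - probs, state_features)

def policy_gradient_estimate(theta, trajectory, gamma):
    """Single-trajectory estimate with Psi_j = discounted reward-to-go.

    trajectory: list of (state_features, action, reward) tuples.
    """
    grad = np.zeros_like(theta)
    rewards = [r for (_, _, r) in trajectory]
    for j, (s, a, _) in enumerate(trajectory):
        psi = sum(gamma**k * r for k, r in enumerate(rewards[j:]))
        grad += grad_log_policy(theta, s, a) * psi
    return grad
```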
In the unbiased estimators given above, certain functions such as $V^{\pi_\theta}, Q^{\pi_\theta}, A^{\pi_\theta}$ appear. These are approximated by the critic. Since these functions all depend on the actor, the critic must learn alongside the actor. The critic is learned by a value-based RL algorithm.
For example, if the critic is estimating the state-value function $V^{\pi_\theta}(s)$, then it can be learned by any value function approximation method. Let the critic be a function approximator $V_\phi(s)$ with parameters $\phi$.
The simplest example is one-step temporal difference learning (TD(0)), which trains the critic to minimize the TD error $$\delta_i = R_i + \gamma V_\phi(S_{i+1}) - V_\phi(S_i).$$ The critic parameters are updated by gradient descent on the squared TD error: $$\phi \leftarrow \phi - \alpha \nabla_\phi (\delta_i)^2 = \phi + \alpha \delta_i \nabla_\phi V_\phi(S_i)$$ (up to a factor of 2 absorbed into $\alpha$), where $\alpha$ is the learning rate. Note that the gradient is taken with respect to the $\phi$ in $V_\phi(S_i)$ only, since the $\phi$ in $\gamma V_\phi(S_{i+1})$ constitutes a moving target and the gradient is not taken with respect to it. This is a common source of error in implementations that use automatic differentiation, which must "stop the gradient" at that point.
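A minimal sketch of this update for a linear critic $V_\phi(s) = \phi^\top x(s)$ (the linear form is an assumption made here for illustration). The bootstrap target is computed and then treated as a constant, which is exactly the "stop the gradient" point noted above:

```python
import numpy as np

def td0_critic_update(phi, s_feat, reward, next_s_feat, gamma, alpha):
    """Semi-gradient TD(0) update for a linear critic V_phi(s) = phi . x(s)."""
    v_s = phi @ s_feat
    v_next = phi @ next_s_feat          # bootstrap target: no gradient through this
    delta = reward + gamma * v_next - v_s
    # For a linear critic, grad_phi V_phi(S_i) is simply the feature vector x(S_i).
    return phi + alpha * delta * s_feat
```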
Similarly, if the critic is estimating the action-value function $Q^{\pi_\theta}$, then it can be learned by Q-learning or SARSA. In SARSA, the critic maintains an estimate of the Q-function, parameterized by $\phi$ and denoted $Q_\phi(s,a)$. The temporal difference error is then calculated as $\delta_i = R_i + \gamma Q_\phi(S_{i+1}, A_{i+1}) - Q_\phi(S_i, A_i)$, and the critic is updated by $$\phi \leftarrow \phi + \alpha \delta_i \nabla_\phi Q_\phi(S_i, A_i).$$ An advantage critic can be trained by learning both a Q-function $Q_\phi(s,a)$ and a state-value function $V_\phi(s)$, then setting $A_\phi(s,a) = Q_\phi(s,a) - V_\phi(s)$. It is more common, however, to train just a state-value function $V_\phi(s)$ and estimate the advantage by an $n$-step return:[5] $$A_\phi(S_i, A_i) \approx \sum_{j=0}^{n-1} \gamma^j R_{i+j} + \gamma^n V_\phi(S_{i+n}) - V_\phi(S_i)$$ Here, $n$ is a positive integer. The higher $n$ is, the lower the bias in the advantage estimate, but at the price of higher variance.
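A sketch of this $n$-step advantage estimate, given a recorded trajectory and a critic passed in as an arbitrary callable (the argument names are illustrative):

```python
def n_step_advantage(rewards, states, i, n, gamma, value_fn):
    """Estimate A(S_i, A_i) from n rewards and a bootstrapped value.

    rewards[i+j] is R_{i+j}; states[i+n] is S_{i+n}; value_fn approximates V_phi.
    Assumes i + n does not run past the end of the episode.
    """
    ret = sum(gamma**j * rewards[i + j] for j in range(n))
    return ret + gamma**n * value_fn(states[i + n]) - value_fn(states[i])
```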
Generalized Advantage Estimation (GAE) introduces a hyperparameter $\lambda$ that smoothly interpolates between Monte Carlo returns ($\lambda = 1$, high variance, no bias) and one-step TD learning ($\lambda = 0$, low variance, high bias). This hyperparameter can be adjusted to pick the desired bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of $n$-step returns, with $\lambda$ being the decay strength.[6]
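A sketch of the GAE recursion over one finite episode, computed backward from the last step (this sketch assumes the episode terminates, so the value after the final step is taken to be zero):

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation over one terminated episode.

    rewards[t] is r_t; values[t] is V_phi(s_t); both lists have the same length.
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_adv = 0.0
    next_value = 0.0                     # V(s_T) = 0 at termination (assumption)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD residual
        next_adv = delta + gamma * lam * next_adv             # exponentially decayed sum
        advantages[t] = next_adv
        next_value = values[t]
    return advantages
```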
Arulkumaran, Kai; Deisenroth, Marc Peter; Brundage, Miles; Bharath, Anil Anthony (November 2017). "Deep Reinforcement Learning: A Brief Survey". IEEE Signal Processing Magazine. 34 (6): 26–38. arXiv:1708.05866. doi:10.1109/MSP.2017.2743240.
Konda, Vijay; Tsitsiklis, John (1999). "Actor-Critic Algorithms". Advances in Neural Information Processing Systems. 12. MIT Press. https://proceedings.neurips.cc/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html
Mnih, Volodymyr; Badia, Adrià Puigdomènech; Mirza, Mehdi; Graves, Alex; Lillicrap, Timothy P.; Harley, Tim; Silver, David; Kavukcuoglu, Koray (2016). "Asynchronous Methods for Deep Reinforcement Learning". arXiv:1602.01783.
Schulman, John; Moritz, Philipp; Levine, Sergey; Jordan, Michael; Abbeel, Pieter (2018). "High-Dimensional Continuous Control Using Generalized Advantage Estimation". arXiv:1506.02438.
Haarnoja, Tuomas; Zhou, Aurick; Hartikainen, Kristian; Tucker, George; Ha, Sehoon; Tan, Jie; Kumar, Vikash; Zhu, Henry; Gupta, Abhishek (2019). "Soft Actor-Critic Algorithms and Applications". arXiv:1812.05905.
Lillicrap, Timothy P.; Hunt, Jonathan J.; Pritzel, Alexander; Heess, Nicolas; Erez, Tom; Tassa, Yuval; Silver, David; Wierstra, Daan (2019). "Continuous control with deep reinforcement learning". arXiv:1509.02971.