Given a discrete-time stationary ergodic stochastic process $X$ on the probability space $(\Omega, B, p)$, the asymptotic equipartition property is the assertion that, almost surely,
$$-\frac{1}{n} \log p(X_1, X_2, \dots, X_n) \to H(X) \quad \text{as } n \to \infty,$$
where $H(X)$, or simply $H$, denotes the entropy rate of $X$, which must exist for all discrete-time stationary processes, including the ergodic ones. The asymptotic equipartition property is proved for finite-valued (i.e. $|\Omega| < \infty$) stationary ergodic stochastic processes in the Shannon–McMillan–Breiman theorem using ergodic theory, and for any i.i.d. source directly using the law of large numbers, in both the discrete-valued case (where $H$ is simply the entropy of a symbol) and the continuous-valued case (where $H$ is the differential entropy instead). The definition of the asymptotic equipartition property can also be extended to certain classes of continuous-time stochastic processes for which a typical set exists for long enough observation time. The convergence is almost sure in all cases.
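As a hedged numerical illustration of this statement (not part of any proof), the sketch below simulates a two-state stationary Markov chain, a simple stationary ergodic process that is not i.i.d., and compares $-\frac{1}{n}\log p(X_1,\dots,X_n)$ along one sample path with the entropy rate $H(X) = -\sum_i \pi_i \sum_j P_{ij}\log P_{ij}$; the transition matrix, sample size, and random seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-state stationary Markov chain (stationary and ergodic, but not i.i.d.).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = np.array([0.75, 0.25])  # stationary distribution: pi = pi @ P

# Entropy rate in nats: H = -sum_i pi_i sum_j P_ij ln P_ij
H = -np.sum(pi[:, None] * P * np.log(P))

# One long sample path started from the stationary distribution.
n = 100_000
x = np.empty(n, dtype=int)
x[0] = rng.choice(2, p=pi)
for t in range(1, n):
    x[t] = rng.choice(2, p=P[x[t - 1]])

# log p(X_1, ..., X_n) = ln pi(x_1) + sum_t ln P(x_t, x_{t+1})
log_p = np.log(pi[x[0]]) + np.sum(np.log(P[x[:-1], x[1:]]))
print(f"-(1/n) log p = {-log_p / n:.4f}   entropy rate H = {H:.4f}")
```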
If $X$ is an i.i.d. source that may take values in the alphabet $\mathcal{X}$, then its time series $X_1, \ldots, X_n$ is i.i.d. with entropy $H(X)$. The weak law of large numbers gives the asymptotic equipartition property with convergence in probability,
$$\lim_{n\to\infty} \Pr\left[\left|-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) - H(X)\right| > \varepsilon\right] = 0 \qquad \forall \varepsilon > 0,$$
since the entropy is equal to the expectation of
$$-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n).$$
The strong law of large numbers asserts the stronger almost sure convergence,
$$\Pr\left[\lim_{n\to\infty} -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) = H(X)\right] = 1.$$
Convergence in the sense of $L^1$ asserts the even stronger
$$\lim_{n\to\infty} \mathbb{E}\left[\left|-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) - H(X)\right|\right] = 0.$$
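For an i.i.d. source, $-\frac{1}{n}\log p(X_1,\dots,X_n) = \frac{1}{n}\sum_{i=1}^n\bigl(-\log p(X_i)\bigr)$ is simply the sample mean of the i.i.d. random variables $-\log p(X_i)$, each with mean $H(X)$, which is why the laws of large numbers apply directly. A minimal Python sketch of this reduction, with an arbitrarily chosen 4-letter alphabet:

```python
import numpy as np

rng = np.random.default_rng(1)

p = np.array([0.5, 0.25, 0.125, 0.125])  # distribution of one symbol
H = -np.sum(p * np.log(p))               # entropy of one symbol, in nats

for n in (100, 10_000, 1_000_000):
    x = rng.choice(len(p), size=n, p=p)  # i.i.d. draws X_1, ..., X_n
    # -(1/n) log p(X_1, ..., X_n) is the sample mean of -log p(X_i)
    est = -np.mean(np.log(p[x]))
    print(f"n = {n:>9}:  -(1/n) log p = {est:.4f}   (H = {H:.4f})")
```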
Consider a finite-valued sample space $\Omega$, i.e. $|\Omega| < \infty$, for the discrete-time stationary ergodic process $X := \{X_n\}$ defined on the probability space $(\Omega, B, p)$. The Shannon–McMillan–Breiman theorem, due to Claude Shannon, Brockway McMillan, and Leo Breiman, states that we have convergence in the sense of $L^1$.[2] Chung Kai-lai generalized this to the case where $X$ may take values in a countably infinite set, provided that the entropy rate is still finite.[3]
The assumptions of stationarity, ergodicity, and identical distribution of the random variables are not essential for the asymptotic equipartition property to hold. Indeed, as is intuitively clear, the asymptotic equipartition property requires only some form of the law of large numbers to hold, which is fairly general. However, the expression needs to be suitably generalized, and the conditions need to be formulated precisely.
Consider a source that produces independent symbols, possibly with different output statistics at each instant, for which the statistics of the process are known completely, that is, the marginal distribution of the process seen at each time instant is known. The joint distribution is just the product of the marginals. Then, under the condition (which can be relaxed) that $\mathrm{Var}[\log p(X_i)] < M$ for all $i$, for some $M > 0$, the following holds (AEP):
$$\lim_{n\to\infty} \Pr\left[\left|-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) - \overline{H}_n(X)\right| < \varepsilon\right] = 1 \qquad \forall \varepsilon > 0,$$
where
$$\overline{H}_n(X) = \frac{1}{n} H(X_1, X_2, \ldots, X_n).$$
The proof follows from a simple application of Markov's inequality (applied to the second moment of $\log p(X_i)$):
$$\begin{aligned}
\Pr\left[\left|-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) - \overline{H}_n(X)\right| > \varepsilon\right]
&\leq \frac{1}{n^2 \varepsilon^2} \mathrm{Var}\left[\sum_{i=1}^n \log p(X_i)\right] \\
&\leq \frac{M}{n \varepsilon^2} \to 0 \quad \text{as } n \to \infty.
\end{aligned}$$
The proof also holds if any moment $\mathrm{E}\left[|\log p(X_i)|^r\right]$ is uniformly bounded for some $r > 1$ (again by Markov's inequality applied to the $r$-th moment). Q.E.D.
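A hedged numerical check of this argument: the sketch below uses independent Bernoulli symbols whose success probabilities vary periodically with time (an arbitrary illustrative choice, bounded away from 0 and 1 so that $\mathrm{Var}[\log p(X_i)]$ is uniformly bounded), and compares $-\frac{1}{n}\log p(X_1,\dots,X_n)$ with $\overline{H}_n(X) = \frac{1}{n}\sum_{i=1}^n H(X_i)$.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 500_000
# Time-varying Bernoulli success probabilities in [0.2, 0.8].
q = 0.2 + 0.6 * (np.arange(n) % 7) / 6.0

x = (rng.random(n) < q).astype(int)                    # independent, non-identical symbols
log_p_xi = np.where(x == 1, np.log(q), np.log(1 - q))  # log p_i(X_i) for each symbol

# By independence, H(X_1, ..., X_n) = sum_i H(X_i), so H_bar_n is the mean of H(X_i).
H_i = -(q * np.log(q) + (1 - q) * np.log(1 - q))

print(f"-(1/n) log p = {-np.mean(log_p_xi):.4f}")
print(f"H_bar_n      = {np.mean(H_i):.4f}")
```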
Even this condition is not necessary, but given a non-stationary random process, it should not be difficult to test whether the asymptotic equipartition property holds using the above method.
The asymptotic equipartition property for non-stationary discrete-time independent processes leads us to (among other results) the source coding theorem for non-stationary sources (with independent output symbols) and the noisy-channel coding theorem for non-stationary memoryless channels.
In the ergodic-theoretic formulation, $T$ is a measure-preserving map on a probability space $\Omega$ with probability measure $\mu$.
If $P$ is a finite or countable partition of $\Omega$, then its entropy is
$$H(P) := -\sum_{p \in P} \mu(p) \ln \mu(p),$$
with the convention that $0 \ln 0 = 0$.
We only consider partitions with finite entropy: $H(P) < \infty$.
If $P$ is a finite or countable partition of $\Omega$, then we construct a sequence of partitions by iterating the map:
$$P^{(n)} := P \vee T^{-1}P \vee \dots \vee T^{-(n-1)}P,$$
where $P \vee Q$ is the least upper bound partition, that is, the least refined partition that refines both $P$ and $Q$:
$$P \vee Q := \{p \cap q : p \in P, q \in Q\}.$$
Write $P(x)$ for the element of $P$ containing $x$. So, for example, $P^{(n)}(x)$ is the $n$-letter initial segment of the $(P, T)$-name of $x$.
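A minimal sketch of the join operation for partitions of a small finite set (the set and the two partitions are arbitrary illustrative choices):

```python
# Partitions of a finite set, represented as lists of frozensets.
omega = set(range(12))
P = [frozenset(range(0, 6)), frozenset(range(6, 12))]              # split into two halves
Q = [frozenset(x for x in omega if x % 3 == k) for k in range(3)]  # split by residue mod 3

def join(P, Q):
    """P v Q = {p & q : p in P, q in Q}, dropping empty intersections."""
    return [p & q for p in P for q in Q if p & q]

print(sorted(map(sorted, join(P, Q))))
# [[0, 3], [1, 4], [2, 5], [6, 9], [7, 10], [8, 11]] -- the least common refinement
```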
Write $I_P(x)$ for the information (in nats) about $x$ that we recover if we know which element of the partition $P$ the point $x$ falls in:
$$I_P(x) := -\ln \mu(P(x)).$$
Similarly, the conditional information of partition $P$, conditional on partition $Q$, about $x$, is
$$I_{P|Q}(x) := -\ln \frac{\mu((P \vee Q)(x))}{\mu(Q(x))}.$$
$h_T(P)$ is the Kolmogorov–Sinai entropy
$$h_T(P) := \lim_n \frac{1}{n} H(P^{(n)}) = \lim_n E_{x\sim\mu}\left[\frac{1}{n} I_{P^{(n)}}(x)\right].$$
In other words, by definition, there is convergence in expectation. The SMB theorem states that when $T$ is ergodic, there is convergence in $L^1$.[5]
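For the shift map on sequences carrying a stationary Markov measure, with $P$ the time-zero partition (a generating partition), the atoms of $P^{(n)}$ are the cylinder sets determined by the first $n$ symbols, so $H(P^{(n)})$ is the block entropy $H(X_1,\dots,X_n)$ and $\frac{1}{n}H(P^{(n)})$ converges to $h_T(P)$, the Markov entropy rate. A hedged Python sketch of this convergence in expectation (the transition matrix is an arbitrary choice):

```python
import itertools
import numpy as np

# Stationary two-state Markov measure on the shift space.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = np.array([0.75, 0.25])  # stationary distribution: pi = pi @ P

# Kolmogorov-Sinai entropy of the shift with the time-zero partition
# equals the Markov entropy rate, in nats.
h = -np.sum(pi[:, None] * P * np.log(P))

def block_entropy(n):
    """H(P^(n)) = -sum over n-blocks of mu(block) * ln mu(block)."""
    H = 0.0
    for block in itertools.product((0, 1), repeat=n):
        mu = pi[block[0]]
        for a, b in zip(block, block[1:]):
            mu *= P[a, b]
        H -= mu * np.log(mu)
    return H

for n in (1, 2, 5, 10, 15):
    print(f"n = {n:>2}:  H(P^(n))/n = {block_entropy(n) / n:.4f}   (h_T(P) = {h:.4f})")
```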
Theorem (ergodic case). If $T$ is ergodic, then $x \mapsto \frac{1}{n} I_{P^{(n)}}(x)$ converges in $L^1$ to the constant function $x \mapsto h_T(P)$.
In other words,
$$\lim_n E_{x\sim\mu}\left[\left|\frac{1}{n} I_{P^{(n)}}(x) - h_T(P)\right|\right] = 0.$$
Moreover, the convergence also holds almost surely, so that
$$h_T(P) = \lim_n \frac{1}{n} I_{P^{(n)}}(x)$$
with probability 1.
Corollary (entropy equipartition property). For every $\epsilon > 0$ there exists $N$ such that for all $n \geq N$, the partition $\vee_{k=0}^{n-1} T^{-k}P$ can be split into two parts, the "good" part $G$ and the "bad" part $B$.
The bad part is small:
$$\sum_{b \in B} \mu(b) < \epsilon.$$
The good part is almost equipartitioned according to entropy:
$$\forall g \in G, \quad -\frac{1}{n} \ln \mu(g) \in h_T(P) \pm \epsilon.$$
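Spelling out the counting consequence (a short derivation using only the two displayed properties): each good atom satisfies $e^{-n(h_T(P)+\epsilon)} \le \mu(g) \le e^{-n(h_T(P)-\epsilon)}$, and since the good part carries total measure between $1-\epsilon$ and $1$,
$$(1-\epsilon)\, e^{\,n(h_T(P)-\epsilon)} \;\le\; |G| \;\le\; e^{\,n(h_T(P)+\epsilon)}.$$
So there are roughly $e^{\,n h_T(P)}$ good atoms, each of measure roughly $e^{-n h_T(P)}$; this is the usual typical-set form of the asymptotic equipartition property.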
If $T$ is not necessarily ergodic, then the underlying probability space splits into multiple subsets, each invariant under $T$. In this case, we still have $L^1$ convergence to some function, but that function is no longer a constant function.[6]
Theorem (general case). Let $\mathcal{I}$ be the sigma-algebra generated by all $T$-invariant measurable subsets of $\Omega$. Then $x \mapsto \frac{1}{n} I_{P^{(n)}}(x)$ converges in $L^1$ to
$$x \mapsto E\left[\lim_n I_{P \mid \vee_{k=1}^n T^{-k}P} \,\Big|\, \mathcal{I}\right](x).$$
When $T$ is ergodic, $\mathcal{I}$ is trivial, and so the function
$$x \mapsto E\left[\lim_n I_{P \mid \vee_{k=1}^n T^{-k}P} \,\Big|\, \mathcal{I}\right]$$
simplifies into the constant function $x \mapsto E\left[\lim_n I_{P \mid \vee_{k=1}^n T^{-k}P}\right]$, which by definition equals $\lim_n H(P \mid \vee_{k=1}^n T^{-k}P)$, which equals $h_T(P)$ by a proposition.
Discrete-time functions can be interpolated to continuous-time functions. If such an interpolation $f$ is measurable, we may define the continuous-time stationary process accordingly as $\tilde{X} := f \circ X$. If the asymptotic equipartition property holds for the discrete-time process, as in the i.i.d. or finite-valued stationary ergodic cases shown above, it automatically holds for the continuous-time stationary process derived from it by some measurable interpolation, that is,
$$-\frac{1}{n} \log p(\tilde{X}_0^\tau) \to H(X),$$
where $n$ corresponds to the degrees of freedom in time $\tau$. Here $nH(X)/\tau$ and $H(X)$ are the entropy per unit time and per degree of freedom, respectively, as defined by Shannon.
An important class of such continuous-time stationary processes is the bandlimited stationary ergodic process whose sample space is a subset of the continuous $\mathcal{L}_2$ functions. The asymptotic equipartition property holds if the process is white, in which case the time samples are i.i.d., or if there exists $T > 1/(2W)$, where $W$ is the nominal bandwidth, such that the $T$-spaced time samples take values in a finite set, in which case we have a discrete-time finite-valued stationary ergodic process.
Any time-invariant operation also preserves the asymptotic equipartition property, stationarity, and ergodicity, and we may easily turn a stationary process into a non-stationary one without losing the asymptotic equipartition property by nulling out a finite number of time samples in the process.
A category-theoretic definition for the equipartition property is given by Gromov.[7] Given a sequence of Cartesian powers $P^N = P \times \cdots \times P$ of a measure space $P$, this sequence admits an asymptotically equivalent sequence $H_N$ of homogeneous measure spaces (i.e. all sets have the same measure; all morphisms are invariant under the group of automorphisms, and thus factor as a morphism to the terminal object).
The above requires a definition of asymptotic equivalence. This is given in terms of a distance function, giving how much an injective correspondence differs from an isomorphism. An injective correspondence $\pi : P \to Q$ is a partially defined map that is a bijection; that is, it is a bijection between a subset $P' \subset P$ and $Q' \subset Q$. Then define
$$|P - Q|_\pi = |P \setminus P'| + |Q \setminus Q'|,$$
where $|S|$ denotes the measure of a set $S$. In what follows, the measures of $P$ and $Q$ are taken to be 1, so that the measure spaces are probability spaces. This distance $|P - Q|_\pi$ is commonly known as the earth mover's distance or Wasserstein metric.
Similarly, define
$$|\log P : Q|_\pi = \frac{\sup_{p \in P'} |\log p - \log \pi(p)|}{\log \min\left(|\operatorname{set}(P')|, |\operatorname{set}(Q')|\right)},$$
with $|\operatorname{set}(P)|$ taken to be the counting measure on $P$. Thus, this definition requires that $P$ be a finite measure space. Finally, let
$$\operatorname{dist}_\pi(P, Q) = |P - Q|_\pi + |\log P : Q|_\pi.$$
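A hedged Python sketch of these two quantities for small finite probability spaces, with atoms represented by their measures and the injective correspondence given explicitly as a partial bijection between atom names (the spaces and the correspondence are arbitrary illustrative choices):

```python
import math

# Finite probability spaces: atom name -> measure (each sums to 1).
P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"x": 0.45, "y": 0.35, "z": 0.2}

# Injective correspondence pi: a bijection between P' = {a, b} and Q' = {x, y}.
pi = {"a": "x", "b": "y"}

def dist(P, Q, pi):
    P_prime, Q_prime = set(pi), set(pi.values())
    # |P - Q|_pi: measure of P \ P' plus measure of Q \ Q'.
    mass = (sum(m for a, m in P.items() if a not in P_prime)
            + sum(m for b, m in Q.items() if b not in Q_prime))
    # |log P : Q|_pi: worst log-measure discrepancy over P', normalized by
    # log min(|set(P')|, |set(Q')|) (counting measure, i.e. the atom counts).
    sup_log = max(abs(math.log(P[a]) - math.log(Q[pi[a]])) for a in P_prime)
    log_term = sup_log / math.log(min(len(P_prime), len(Q_prime)))
    return mass + log_term

print(dist(P, Q, pi))  # |P - Q|_pi + |log P : Q|_pi for this example
```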
A sequence of injective correspondences $\pi_N : P_N \to Q_N$ is then asymptotically equivalent when
$$\operatorname{dist}_{\pi_N}(P_N, Q_N) \to 0 \quad \text{as } N \to \infty.$$
Given a homogeneous space sequence $H_N$ that is asymptotically equivalent to $P_N$, the entropy $H(P)$ of $P$ may be taken as
$$H(P) = \lim_{N\to\infty} \frac{1}{N} \log \left|\operatorname{set}(H_N)\right|.$$
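As a hedged sanity check that this recovers Shannon entropy in the simplest case (an illustrative example, not taken from the references): if $P$ is a two-atom space with measures $p$ and $1-p$, the i.i.d. asymptotic equipartition property says $P^N$ is dominated by roughly $e^{N H}$ strings of nearly equal measure, so a homogeneous approximating sequence $H_N$ can be taken with about that many equal-measure atoms, giving
$$H(P) = \lim_{N\to\infty} \frac{1}{N} \log \left|\operatorname{set}(H_N)\right| = -p\log p - (1-p)\log(1-p),$$
the entropy of a single symbol.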
Cover, Thomas M.; Thomas, Joy A. (1991). Elements of Information Theory (first ed.). Hoboken, New Jersey: Wiley. p. 51. ISBN 978-0-471-24195-9.
Hawkins, Jane M. (2021). Ergodic Dynamics: From Basic Theory to Applications. Graduate Texts in Mathematics. Cham, Switzerland: Springer. p. 204. ISBN 978-3-030-59241-7.
Algoet, Paul H.; Cover, Thomas M. (1988). "A Sandwich Proof of the Shannon–McMillan–Breiman Theorem". The Annals of Probability. 16 (2): 899–909. doi:10.1214/aop/1176991794. JSTOR 2243846.
Petersen, Karl E. (1983). "6.2. The Shannon–McMillan–Breiman Theorem". Ergodic Theory. Cambridge Studies in Advanced Mathematics. Cambridge: Cambridge University Press. ISBN 978-0-521-38997-6.
Pollicott, Mark; Yuri, Michiko (1998). "12.4. The Shannon–McMillan–Breiman Theorem". Dynamical Systems and Ergodic Theory. London Mathematical Society Student Texts. Cambridge: Cambridge University Press. ISBN 978-0-521-57294-1.
Gromov, Misha (2013). "In a Search for a Structure, Part 1: On Entropy". (See page 5, where the equipartition property is called the 'Bernoulli approximation theorem'.)