Structured prediction

<h2 id="applications">Applications</h2>
<p>An example application is the problem of translating a <a href="/facts/Natural_language/B4gLS7Kd">natural language</a> sentence into a syntactic representation such as a <a href="/facts/Parse_tree/FeBEtXcz">parse tree</a>. This can be seen as a structured prediction problem<a class="footnote-ref" id="fnref:2" href="#fn:2"><sup>2</sup></a> in which the structured output domain is the set of all possible parse trees. Structured prediction is used in a wide variety of domains including <a href="/facts/Bioinformatics/D5x2L8ee">bioinformatics</a>, <a href="/facts/Natural_language_processing/1hjMKsSN">natural language processing</a> (NLP), <a href="/facts/Speech_recognition/z7S7Pgk6">speech recognition</a>, and <a href="/facts/Computer_vision/Tl2Yyk66">computer vision</a>. 
</p>
<h3>Example: sequence tagging</h3>
<p>Sequence tagging is a class of problems prevalent in NLP in which input data are often sequential, for instance sentences of text. The sequence tagging problem appears in several guises, such as <a href="/facts/Part-of-speech_tagging/BYAPmH3r">part-of-speech tagging</a> (POS tagging) and <a href="/facts/Named_entity_recognition/ko6GRQQd">named entity recognition</a>. In POS tagging, for example, each word in a sequence must be 'tagged' with a <i>class label</i> representing the type of word:
</p>
<table><tbody><tr><td>This</td><td><a href="/facts/Determiner/PENQvAFs">DT</a></td></tr><tr><td>is</td><td><a href="/facts/Verb/zp42buvn">VBZ</a></td></tr><tr><td>a</td><td><a href="/facts/Determiner/PENQvAFs">DT</a></td></tr><tr><td>tagged</td><td><a href="/facts/Adjective/0ohBjtmT">JJ</a></td></tr><tr><td>sentence.</td><td><a href="/facts/Noun/SsnDNfkX">NN</a></td></tr></tbody></table>
<p>The main challenge of this problem is to resolve <a href="/facts/Ambiguity/mgn4oEoQ">ambiguity</a>: in the above example, the words "sentence" and "tagged" in English can also be verbs.
</p><p>While this problem can be solved by simply performing <a href="/facts/Statistical_classification/jXXHRkXR">classification</a> of individual <a href="/facts/Lexical_analysis/T1JYWpIf">tokens</a>, this approach does not take into account the empirical fact that tags do not occur independently; instead, each tag displays a strong <a href="/facts/Conditional_dependence/NrFCMzeg">conditional dependence</a> on the tag of the previous word. This fact can be exploited in a sequence model such as a <a href="/facts/Hidden_Markov_model/ur1zTAhP">hidden Markov model</a> or <a href="/facts/Conditional_random_field/TNuLaGzM">conditional random field</a><a class="footnote-ref" id="fnref:3" href="#fn:3"><sup>3</sup></a> that predicts the entire tag sequence for a sentence (rather than just individual tags) via the <a href="/facts/Viterbi_algorithm/FYoyJ6YY">Viterbi algorithm</a>.
</p>
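<p>A minimal sketch of Viterbi decoding for the tagging example may help show how the dependence between neighbouring tags changes the result. The tag set, the function name <code>viterbi</code>, and all scores below are illustrative assumptions rather than values from a trained model; the scores could stand for log-probabilities in a hidden Markov model or for learned weights in a linear model.</p>

```python
# Illustrative sketch only: tag set and scores are made-up assumptions.
TAGS = ["DT", "JJ", "NN", "VB"]


def viterbi(emission, transition):
    """Return the indices of the highest-scoring tag sequence.

    emission[i][k]   : score of tag k at position i
    transition[j][k] : score of tag k immediately following tag j
    """
    n_tags = len(transition)
    score = list(emission[0])  # best score of any path ending in each tag
    backptrs = []              # backptrs[i][k] = best previous tag for tag k
    for emit in emission[1:]:
        new_score, ptr = [], []
        for k in range(n_tags):
            j_best = max(range(n_tags), key=lambda j: score[j] + transition[j][k])
            ptr.append(j_best)
            new_score.append(score[j_best] + transition[j_best][k] + emit[k])
        score = new_score
        backptrs.append(ptr)
    # Follow the back-pointers from the best final tag.
    path = [max(range(n_tags), key=lambda k: score[k])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical scores for the words "a", "tagged", "sentence".
emission = [
    [2.0, 0.0, 0.0, 0.0],  # "a": clearly a determiner
    [0.0, 1.0, 0.0, 1.2],  # "tagged": a verb reading scores slightly higher
    [0.0, 0.0, 1.0, 1.1],  # "sentence": a verb reading scores slightly higher
]
transition = [
    [-1.0, 1.0, 1.0, -1.0],   # after DT
    [-1.0, 0.0, 1.0, -1.0],   # after JJ
    [-1.0, -1.0, 0.0, 0.5],   # after NN
    [0.5, 0.0, 0.0, -1.0],    # after VB
]

# Tagging each token on its own picks the highest emission score per word.
independent = [max(range(len(TAGS)), key=lambda k: row[k]) for row in emission]
print([TAGS[k] for k in independent])                      # ['DT', 'VB', 'VB']
print([TAGS[k] for k in viterbi(emission, transition)])    # ['DT', 'JJ', 'NN']
```

<p>With these assumed scores, per-token classification labels both ambiguous words as verbs, while decoding the whole sequence recovers the determiner, adjective, noun reading because the transition scores favour it.</p>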
<h2 id="techniques">Techniques</h2>
<p>Probabilistic <a href="/facts/Graphical_model/XxfmKhmM">graphical models</a> form a large class of structured prediction models. In particular, <a href="/facts/Bayesian_network/Ry31vq6T">Bayesian networks</a> and <a href="/facts/Random_field/8N7Id48z">random fields</a> are popular. Other algorithms and models for structured prediction include <a href="/facts/Inductive_logic_programming/gKc9Nxq4">inductive logic programming</a>, <a href="/facts/Case-based_reasoning/JkxLXCqF">case-based reasoning</a>, <a href="/facts/Structured_SVM/kKIH061u">structured SVMs</a>, <a href="/facts/Markov_logic_network/YvjQWd0j">Markov logic networks</a>, <a href="/facts/Probabilistic_Soft_Logic/5FkpdxDP">Probabilistic Soft Logic</a>, and <a href="/facts/Constrained_conditional_model/W6jHv094">constrained conditional models</a>. The main techniques are:
</p>
<ul><li><a href="/facts/Conditional_random_field/TNuLaGzM">Conditional random fields</a></li>
<li><a href="/facts/Structured_support_vector_machine/kKIH061u">Structured support vector machines</a></li>
<li><a href="/facts/Structured_kNN/V9eK1upf">Structured <i>k</i>-nearest neighbours</a></li>
<li><a href="/facts/Recurrent_neural_network/bx7hBVB1">Recurrent neural networks</a>, in particular <a href="/facts/Recurrent_neural_network/bx7hBVB1">Elman networks</a></li>
<li><a href="/facts/Transformer_(deep_learning_architecture)/cDbjx6a8">Transformers</a></li></ul>
<h3>Structured perceptron</h3>
<p>One of the easiest ways to understand algorithms for general structured prediction is the structured perceptron by <a href="/facts/Michael_Collins_(computational_linguist)/Bu454wSr">Collins</a>.<a class="footnote-ref" id="fnref:4" href="#fn:4"><sup>4</sup></a> This algorithm combines the <a href="/facts/Perceptron/ArxdkAC1">perceptron</a> algorithm for learning <a href="/facts/Linear_classifier/vsE823wQ">linear classifiers</a> with an inference algorithm (classically the <a href="/facts/Viterbi_algorithm/FYoyJ6YY">Viterbi algorithm</a> when used on sequence data) and can be described abstractly as follows:
</p>
<ol><li>First, define a function \(\phi(x,y)\) that maps a training sample \(x\) and a candidate prediction \(y\) to a vector of length \(n\) (\(x\) and \(y\) may have any structure; \(n\) is problem-dependent, but must be fixed for each model). Let \(GEN\) be a function that generates candidate predictions.</li>
<li>Then:</li></ol>
<ul><li>Let \(w\) be a weight vector of length \(n\).</li>
<li>For a predetermined number of iterations:
<ul><li>For each sample \(x\) in the training set with true output \(t\):
<ul><li>Make a prediction \(\hat{y}\): \(\hat{y} = \operatorname{arg\,max}_{y \in GEN(x)} \, w^{T} \phi(x,y)\)</li>
<li>Update \(w\) (from \(\hat{y}\) towards \(t\)): \(w = w + c\,(-\phi(x,\hat{y}) + \phi(x,t))\), where \(c\) is the <a href="/facts/Learning_rate/EClSgMCR">learning rate</a>.</li></ul></li></ul></li></ul>
<p>In practice, finding the argmax over \(GEN(x)\) is done using an algorithm such as Viterbi or <a href="/facts/Max-sum_algorithm/TRY5LVbP">max-sum</a>, rather than an <a href="/facts/Exhaustive_search/YU2fzBIt">exhaustive search</a> through an exponentially large set of candidates.
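</p><p>The learning rule is analogous to that of the <a href="/facts/Perceptron/ArxdkAC1">multiclass perceptron</a>: the weights are moved towards the features of the true output and away from the features of the prediction.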
</p>
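<p>The training loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Collins's implementation: the names <code>phi</code>, <code>gen</code>, <code>predict</code> and <code>train</code> are hypothetical, the feature map counts (word, tag) and (previous tag, tag) pairs, the weight vector is stored sparsely as a dictionary keyed by feature, and the argmax is an exhaustive search over all tag sequences, which is only feasible for very short inputs.</p>

```python
from itertools import product

# Illustrative sketch only: tag set and feature templates are assumptions.
TAGS = ("DT", "JJ", "NN", "VBZ")


def phi(x, y):
    """Feature map: counts of (word, tag) and (previous tag, tag) pairs."""
    feats = {}
    prev = "<s>"  # assumed start-of-sentence marker
    for word, tag in zip(x, y):
        for key in (("emit", word, tag), ("trans", prev, tag)):
            feats[key] = feats.get(key, 0) + 1
        prev = tag
    return feats


def gen(x):
    """Candidate generator: every tag sequence of the same length as x."""
    return product(TAGS, repeat=len(x))


def score(w, feats):
    """Inner product of the sparse weight vector with a feature vector."""
    return sum(w.get(key, 0.0) * value for key, value in feats.items())


def predict(w, x):
    """Exhaustive argmax over gen(x); Viterbi would replace this in practice."""
    return max(gen(x), key=lambda y: score(w, phi(x, y)))


def train(data, iterations=5, c=1.0):
    w = {}
    for _ in range(iterations):
        for x, t in data:
            y_hat = predict(w, x)
            if y_hat == t:
                continue  # the update would be zero for a correct prediction
            # w = w + c * (phi(x, t) - phi(x, y_hat))
            for key, value in phi(x, t).items():
                w[key] = w.get(key, 0.0) + c * value
            for key, value in phi(x, y_hat).items():
                w[key] = w.get(key, 0.0) - c * value
    return w


sentence = ("This", "is", "a", "tagged", "sentence")
gold = ("DT", "VBZ", "DT", "JJ", "NN")
weights = train([(sentence, gold)])
print(predict(weights, sentence))  # ('DT', 'VBZ', 'DT', 'JJ', 'NN')
```

<p>Because the transition features only link adjacent tags, the exhaustive <code>predict</code> could be swapped for a Viterbi search without changing the learning rule.</p>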

<ul><li>Noah Smith, <a href="https://www.cs.cmu.edu/~nasmith/LSP/">Linguistic Structure Prediction</a>, 2011.</li>
<li>Michael Collins, <a href="https://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf">Discriminative Training Methods for Hidden Markov Models</a>, 2002.</li></ul>
<h2 id="external-links">External links</h2>
<ul><li><a href="https://github.com/ashish01/CollinsTagger">Implementation of Collins structured perceptron</a></li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1"><p>Gökhan Bakır, Ben Taskar, Thomas Hofmann, Bernhard Schölkopf, Alex Smola and SVN Vishwanathan (2007), Predicting Structured Data, MIT Press. <a href="https://mitpress.mit.edu/books/predicting-structured-data" target="_blank">https://mitpress.mit.edu/books/predicting-structured-data</a> <a href="#fnref:1" class="footnote-back-ref">↩</a></p></li>
<li id="fn:2"><p>Lafferty, J.; McCallum, A.; Pereira, F. (2001). "Conditional random fields: Probabilistic models for segmenting and labeling sequence data" (PDF). Proc. 18th International Conf. on Machine Learning. pp. 282–289. <a href="http://www.cis.upenn.edu/~pereira/papers/crf.pdf" target="_blank">http://www.cis.upenn.edu/~pereira/papers/crf.pdf</a> <a href="#fnref:2" class="footnote-back-ref">↩</a></p></li>
<li id="fn:3"><p>Lafferty, J.; McCallum, A.; Pereira, F. (2001). "Conditional random fields: Probabilistic models for segmenting and labeling sequence data" (PDF). Proc. 18th International Conf. on Machine Learning. pp. 282–289. <a href="http://www.cis.upenn.edu/~pereira/papers/crf.pdf" target="_blank">http://www.cis.upenn.edu/~pereira/papers/crf.pdf</a> <a href="#fnref:3" class="footnote-back-ref">↩</a></p></li>
<li id="fn:4"><p>Collins, Michael (2002). "Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms" (PDF). Proc. EMNLP. Vol. 10. <a href="http://acl.ldc.upenn.edu/W/W02/W02-1001.pdf" target="_blank">http://acl.ldc.upenn.edu/W/W02/W02-1001.pdf</a> <a href="#fnref:4" class="footnote-back-ref">↩</a></p></li>
</ol>
