In the following, let $X$ be our $n$-dimensional input space. Let $\mathcal{H}$ be a class of functions that we wish to use in order to learn a $\{0,1\}$-valued target function $f$ defined over $X$. Let $\mathcal{D}$ be the distribution of the inputs over $X$. The goal of a learning algorithm $\mathcal{A}$ is to choose a function $h \in \mathcal{H}$ that minimizes $\text{error}(h) = P_{x \sim \mathcal{D}}(h(x) \neq f(x))$. Let us suppose we have a function $\text{size}(f)$ that measures the complexity of $f$. Let $\text{Oracle}(x)$ be an oracle that, whenever called, returns an example $x$ and its correct label $f(x)$.
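To make the setup concrete, here is a minimal sketch in Python of the noiseless oracle and the error measure. It assumes, purely for illustration, that $X = \{0,1\}^n$, that $\mathcal{D}$ is uniform, and that the target is a simple conjunction; all names below are hypothetical and not part of the formal model.

```python
import random

n = 10  # illustrative dimension of the input space X = {0,1}^n

def f(x):
    # Hypothetical target function: the AND of the first two attributes.
    return x[0] & x[1]

def draw_example():
    # Sample x ~ D; for simplicity D is taken uniform over {0,1}^n.
    return tuple(random.randint(0, 1) for _ in range(n))

def oracle():
    # Noiseless oracle: returns an example x and its correct label f(x).
    x = draw_example()
    return x, f(x)

def error(h, trials=10_000):
    # Monte Carlo estimate of error(h) = P_{x~D}(h(x) != f(x)).
    return sum(h(x) != f(x) for x in (draw_example() for _ in range(trials))) / trials
```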
When no noise corrupts the data, we can define learning in the Valiant setting:[1][2]
Definition: We say that $f$ is efficiently learnable using $\mathcal{H}$ in the Valiant setting if there exists a learning algorithm $\mathcal{A}$ that has access to $\text{Oracle}(x)$ and a polynomial $p(\cdot,\cdot,\cdot,\cdot)$ such that for any $0 < \varepsilon \leq 1$ and $0 < \delta \leq 1$ it outputs, in a number of calls to the oracle bounded by $p\left(\frac{1}{\varepsilon}, \frac{1}{\delta}, n, \text{size}(f)\right)$, a function $h \in \mathcal{H}$ that satisfies with probability at least $1-\delta$ the condition $\text{error}(h) \leq \varepsilon$.
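As an illustration of the definition, the classical elimination algorithm for monotone conjunctions meets such a bound. The sketch below is a hedged example, not part of the text above: it assumes the standard consistent-learner sample size $m = \left\lceil \frac{1}{\varepsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right) \right\rceil$ with $|\mathcal{H}| = 2^n$ monotone conjunctions, and reuses the illustrative oracle() from the sketch above.

```python
import math

def pac_learn_monotone_conjunction(oracle, n, eps, delta):
    # Consistent-learner bound: with |H| = 2^n monotone conjunctions,
    # m = ceil((1/eps) * (ln|H| + ln(1/delta))) samples suffice for
    # error <= eps with probability >= 1 - delta.
    m = math.ceil((1 / eps) * (n * math.log(2) + math.log(1 / delta)))
    relevant = set(range(n))  # start from the conjunction of all n attributes
    for _ in range(m):
        x, label = oracle()
        if label == 1:
            # An attribute that is 0 in a positive example cannot be in the target.
            relevant = {i for i in relevant if x[i] == 1}
    return lambda x: int(all(x[i] == 1 for i in relevant))

# Usage with the sketch above: h = pac_learn_monotone_conjunction(oracle, n, 0.1, 0.05)
```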
In the following we define learnability of $f$ when the data have been corrupted in some way.[3][4][5]
In the classification noise model,[6] a noise rate $0 \leq \eta < \frac{1}{2}$ is introduced. Then, instead of $\text{Oracle}(x)$, which always returns the correct label of example $x$, algorithm $\mathcal{A}$ can only call a faulty oracle $\text{Oracle}(x, \eta)$ that flips the label of $x$ with probability $\eta$. As in the Valiant case, the goal of a learning algorithm $\mathcal{A}$ is to choose a function $h \in \mathcal{H}$ that minimizes $\text{error}(h) = P_{x \sim \mathcal{D}}(h(x) \neq f(x))$. In applications it is difficult to have access to the real value of $\eta$, but we assume we have access to an upper bound $\eta_B$ on it.[7] Note that if we allow the noise rate to be $1/2$, then learning becomes impossible in any amount of computation time, because every label conveys no information about the target function.
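A minimal sketch of such a faulty oracle, reusing the illustrative draw_example() and f from the sketch above:

```python
import random

def noisy_oracle(eta):
    # Classification-noise oracle: draws (x, f(x)) as usual, then flips
    # the label independently with probability eta.
    x = draw_example()
    label = f(x)
    if random.random() < eta:
        label = 1 - label
    return x, label
```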
Definition: We say that $f$ is efficiently learnable using $\mathcal{H}$ in the classification noise model if there exists a learning algorithm $\mathcal{A}$ that has access to $\text{Oracle}(x, \eta)$ and a polynomial $p(\cdot,\cdot,\cdot,\cdot,\cdot)$ such that for any $0 \leq \eta < \frac{1}{2}$, $0 < \varepsilon \leq 1$ and $0 < \delta \leq 1$ it outputs, in a number of calls to the oracle bounded by $p\left(\frac{1}{1-2\eta_B}, \frac{1}{\varepsilon}, \frac{1}{\delta}, n, \text{size}(f)\right)$, a function $h \in \mathcal{H}$ that satisfies with probability at least $1-\delta$ the condition $\text{error}(h) \leq \varepsilon$.
Statistical query learning[8] is a kind of active learning problem in which the learning algorithm $\mathcal{A}$ can decide whether to request information about the likelihood $P_{f(x)}$ that a function $f$ correctly labels example $x$, and receives an answer accurate within a tolerance $\alpha$. Formally, whenever the learning algorithm $\mathcal{A}$ calls the oracle $\text{Oracle}(x, \alpha)$, it receives as feedback a probability $Q_{f(x)}$ such that $Q_{f(x)} - \alpha \leq P_{f(x)} \leq Q_{f(x)} + \alpha$.
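A minimal simulation of such an oracle, assuming queries are $\{0,1\}$-valued predicates $\chi$ over labeled examples as in Kearns' STAT formulation, and approximating the tolerance guarantee by Monte Carlo sampling with a Hoeffding-style sample size; unlike the ideal oracle, this sketch meets the tolerance only with high probability. It reuses the illustrative oracle() from above.

```python
import math

def sq_oracle(chi, alpha):
    # STAT-style oracle: estimates P_{x~D}[chi(x, f(x)) = 1] to within
    # additive tolerance alpha. Hoeffding: 2*exp(-2*m*alpha^2) <= 1% when
    # m >= ln(200) / (2 * alpha^2), so the tolerance holds w.h.p. here.
    m = math.ceil(math.log(200) / (2 * alpha ** 2))
    return sum(chi(*oracle()) for _ in range(m)) / m

# Example query: how often does attribute 0 agree with the label?
# q = sq_oracle(lambda x, y: int(x[0] == y), alpha=0.05)
```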
Definition: We say that $f$ is efficiently learnable using $\mathcal{H}$ in the statistical query learning model if there exists a learning algorithm $\mathcal{A}$ that has access to $\text{Oracle}(x, \alpha)$ and polynomials $p(\cdot,\cdot,\cdot)$, $q(\cdot,\cdot,\cdot)$, and $r(\cdot,\cdot,\cdot)$ such that for any $0 < \varepsilon \leq 1$ the following hold:

- $\mathcal{A}$ can evaluate each query to $\text{Oracle}(x, \alpha)$ in time bounded by $q\left(\frac{1}{\varepsilon}, n, \text{size}(f)\right)$;
- the smallest tolerance $\alpha$ that $\mathcal{A}$ uses is lower-bounded by $\frac{1}{r\left(\frac{1}{\varepsilon},\, n,\, \text{size}(f)\right)}$;
- $\mathcal{A}$ outputs, in a number of calls to the oracle bounded by $p\left(\frac{1}{\varepsilon}, n, \text{size}(f)\right)$, a function $h \in \mathcal{H}$ that satisfies $\text{error}(h) \leq \varepsilon$.
Note that the confidence parameter $\delta$ does not appear in the definition of learning. This is because the main purpose of $\delta$ is to allow the learning algorithm a small probability of failure due to an unrepresentative sample. Since $\text{Oracle}(x, \alpha)$ now always guarantees to meet the approximation criterion $Q_{f(x)} - \alpha \leq P_{f(x)} \leq Q_{f(x)} + \alpha$, the failure probability is no longer needed.
The statistical query model is strictly weaker than the PAC model: any efficiently SQ-learnable class is efficiently PAC learnable in the presence of classification noise, but there exist efficiently PAC-learnable problems, such as parity, that are not efficiently SQ-learnable.[9]
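For intuition, a parity function $f(x) = \sum_{i \in S} x_i \bmod 2$ is PAC learnable from noiseless examples by the standard linear-algebra approach: each labeled example is a linear equation over GF(2), and Gaussian elimination recovers a consistent coefficient vector. The sketch below is an illustration of that approach under the assumption of a noiseless list of (x, label) pairs.

```python
def learn_parity(examples, n):
    # Solve <a, x> = y over GF(2) by Gaussian elimination; any consistent
    # coefficient vector a defines a hypothesis with zero training error.
    rows = [list(x) + [y] for x, y in examples]  # augmented matrix over GF(2)
    pivot = 0
    for col in range(n):
        r = next((i for i in range(pivot, len(rows)) if rows[i][col]), None)
        if r is None:
            continue
        rows[pivot], rows[r] = rows[r], rows[pivot]
        for i in range(len(rows)):
            if i != pivot and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[pivot])]
        pivot += 1
    a = [0] * n  # free variables set to 0
    for row in rows[:pivot]:
        lead = next(c for c in range(n) if row[c])
        a[lead] = row[-1]
    return lambda x: sum(ai & xi for ai, xi in zip(a, x)) % 2
```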
In the malicious classification model,[10] an adversary generates errors to foil the learning algorithm. This setting describes error bursts, which may occur when, for a limited time, transmission equipment malfunctions repeatedly. Formally, algorithm $\mathcal{A}$ calls an oracle $\text{Oracle}(x, \beta)$ that, with probability $1-\beta$, returns a correctly labeled example $x$ drawn, as usual, from distribution $\mathcal{D}$ over the input space, but with probability $\beta$ returns an example drawn from a distribution that is not related to $\mathcal{D}$. Moreover, this maliciously chosen example may be strategically selected by an adversary who has knowledge of $f$, $\beta$, $\mathcal{D}$, or the current progress of the learning algorithm.
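A minimal sketch of the malicious oracle, where adversary() stands in for an arbitrary adversarial strategy (a hypothetical callback, not part of the formal model), reusing the illustrative oracle() from above:

```python
import random

def malicious_oracle(beta, adversary):
    # With probability 1 - beta: an honest labeled example (x, f(x)), x ~ D.
    # With probability beta: a pair chosen by the adversary, which may depend
    # on f, D, beta, and the learner's progress.
    if random.random() < beta:
        return adversary()  # returns an arbitrary (x, label) pair
    return oracle()
```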
Definition: Given a bound $\beta_B < \frac{1}{2}$ for $0 \leq \beta < \frac{1}{2}$, we say that $f$ is efficiently learnable using $\mathcal{H}$ in the malicious classification model if there exists a learning algorithm $\mathcal{A}$ that has access to $\text{Oracle}(x, \beta)$ and a polynomial $p(\cdot,\cdot,\cdot,\cdot,\cdot)$ such that for any $0 < \varepsilon \leq 1$, $0 < \delta \leq 1$ it outputs, in a number of calls to the oracle bounded by $p\left(\frac{1}{1/2-\beta_B}, \frac{1}{\varepsilon}, \frac{1}{\delta}, n, \text{size}(f)\right)$, a function $h \in \mathcal{H}$ that satisfies with probability at least $1-\delta$ the condition $\text{error}(h) \leq \varepsilon$.
In the nonuniform random attribute noise model,[11][12] the algorithm learns a Boolean function; a malicious oracle $\text{Oracle}(x, \nu)$ may flip each $i$-th bit of example $x = (x_1, x_2, \ldots, x_n)$ independently with probability $\nu_i \leq \nu$.
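A minimal sketch of this oracle, reusing the illustrative draw_example() and f from above; note that the label is computed on the clean example before the attributes are corrupted:

```python
import random

def attribute_noise_oracle(nu):
    # nu[i] is the flip rate of the i-th attribute, with each nu[i] <= nu_max.
    # The label is computed on the clean x; only the attributes are corrupted.
    x = draw_example()
    label = f(x)
    noisy_x = tuple(xi ^ int(random.random() < nu[i]) for i, xi in enumerate(x))
    return noisy_x, label
```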
This type of error can irreparably foil the algorithm; in fact, the following theorem holds:
In the nonuniform random attribute noise setting, an algorithm $\mathcal{A}$ can output a function $h \in \mathcal{H}$ such that $\text{error}(h) < \varepsilon$ only if $\nu < 2\varepsilon$.
Valiant, L. G. (1985). "Learning disjunctions of conjunctions". In Proceedings of IJCAI-85, pp. 560–566. http://www.ijcai.org/Past%20Proceedings/IJCAI-85-VOL1/PDF/107.pdf
Valiant, L. G. (1984). "A theory of the learnable". Communications of the ACM, 27(11), 1134–1142.
Laird, P. D. (1988). Learning from Good and Bad Data. Kluwer Academic Publishers. https://link.springer.com/content/pdf/bfm%3A978-1-4613-1685-5%2F1.pdf
Kearns, M. (1998). "Efficient noise-tolerant learning from statistical queries". Journal of the ACM, 45(6), 983–1006. https://www.cs.iastate.edu/~honavar/noise-pac.pdf
Brunk, C. A., & Pazzani, M. J. (1991). "An investigation of noise-tolerant relational concept learning algorithms". In Proceedings of the 8th International Workshop on Machine Learning. http://www.ics.uci.edu/~pazzani/Publications/An-Investigation-MLW-91.pdf
Kearns, M. J., & Vazirani, U. V. (1994). An Introduction to Computational Learning Theory, chapter 5. MIT Press. https://books.google.com/books?id=vCA01wY6iywC
Angluin, D., & Laird, P. (1988). "Learning from noisy examples". Machine Learning, 2(4), 343–370. http://homepages.math.uic.edu/~lreyzin/papers/angluin88b.pdf
Kearns, M. (1998). "Efficient noise-tolerant learning from statistical queries". Journal of the ACM, 45(6), 983–1006. www.cis.upenn.edu/~mkearns/papers/sq-journal.pdf
Kearns, M., & Li, M. (1993). "Learning in the presence of malicious errors". SIAM Journal on Computing, 22(4), 807–837. www.cis.upenn.edu/~mkearns/papers/malicious.pdf
Goldman, S. A., & Sloan, R. H. (1991). "The difficulty of random attribute noise". Technical Report WUCS-91-29, Washington University, Department of Computer Science.
Sloan, R. H. (1989). Computational Learning Theory: New Models and Algorithms (doctoral dissertation, Massachusetts Institute of Technology). http://dspace.mit.edu/bitstream/handle/1721.1/38339/20770411.pdf?sequence=1