Misra–Gries heavy hitters algorithm

On this page

Misra and Gries addressed the heavy-hitters problem—finding elements that appear more than n ÷ k times in a bag of n elements—by extending the Boyer-Moore majority finding algorithm. Their Misra-Gries algorithm solves this by making two passes over the data, maintaining at most k candidate values and their counts. Notably, it is one of the earliest streaming algorithms, efficiently processing large data streams with limited memory. This foundational work is detailed in their paper Finding repeated elements and laid the groundwork for subsequent advances in frequency estimation within data streams.

We don't have any images related to Misra–Gries heavy hitters algorithm yet.

You can add one yourself here.

We don't have any YouTube videos related to Misra–Gries heavy hitters algorithm yet.

You can add one yourself here.

We don't have any PDF documents related to Misra–Gries heavy hitters algorithm yet.

You can add one yourself here.

We don't have any Books related to Misra–Gries heavy hitters algorithm yet.

You can add one yourself here.

We don't have any archived web articles related to Misra–Gries heavy hitters algorithm yet.

You can submit a link to a page to archive here.

Misra–Gries algorithm

A bag is like a set in which the same value may occur multiple times. Assume that a bag is available as an array b[0:n – 1] of n elements. In the abstract description of the algorithm, we treat b and its segments also as bags. Henceforth, a heavy hitter of bag b is a value that occurs more than n ÷ k times in it, for some integer k, k≥2.

A k-reduced bag for bag b is derived from b by repeating the following operation until no longer possible: Delete k distinct elements from b. From its definition, a k-reduced bag contains fewer than k different values. The following theorem is easy to prove:

Theorem 1. Each heavy-hitter of b is an element of a k-reduced bag for b.

The first pass of the heavy-hitters computation constructs a k-reduced bag t. The second pass declares an element of t to be a heavy-hitter if it occurs more than n ÷ k times in b. According to Theorem 1, this procedure determines all and only the heavy-hitters. The second pass is easy to program, so we describe only the first pass.

In order to construct t, scan the values in b in arbitrary order, for specificity the following algorithm scans them in the order of increasing indices. Invariant P of the algorithm is that t is a k-reduced bag for the scanned values and d is the number of distinct values in t. Initially, no value has been scanned, t is the empty bag, and d is zero.

P: 0 ≤ i ≤ n ∧ {\displaystyle \wedge } t is a k-reduced bag for b[0:i – 1] ∧ {\displaystyle \wedge } d is the number of distinct values in t ∧ {\displaystyle \wedge } 0 ≤ d < k

Whenever element b[i] is scanned, in order to preserve the invariant: (1) if b[i] is not in t, add it to t and increase d by 1, (2) if b[i] is in t, add it to t but don't modify d, and (3) if d becomes equal to k, reduce t by deleting k distinct values from it and update d appropriately.

algorithm Misra–Gries is t, d := { }, 0 for i from 0 to n-1 do if b[i] ∉ {\displaystyle \notin } t then t, d:= t ∪ {b[i]}, d+1 else t, d:= t ∪ {b[i]}, d endif if d = k then Delete k distinct values from t; update d endif endfor

A possible implementation of t is as a set of pairs of the form (vi, ci) where each vi is a distinct value in t and ci is the number of occurrences of vi in t. Then d is the size of this set. The step "Delete k distinct values from t" amounts to reducing each ci by 1 and then removing any pair (vi, ci) from the set if ci becomes 0.

Using an AVL tree implementation of t, the algorithm has a running time of O(n log k). In order to assess the space requirement, assume that the elements of b can have m possible values, so the storage of a value vi needs O(log m) bits. Since each counter ci may have a value as high as n, its storage needs O(log n) bits. Therefore, for O(k) value-counter pairs, the space requirement is O(k (log n + log m)).

Summaries

In the field of streaming algorithms, the output of the Misra-Gries algorithm in the first pass may be called a summary, and such summaries are used to solve the frequent elements problem in the data stream model. A streaming algorithm makes a small, bounded number of passes over a list of data items called a stream. It processes the elements using at most logarithmic amount of extra space in the size of the list to produce an answer.

The term Misra–Gries summary appears to have been coined by Graham Cormode.⁴ ⁵

References

A number of articles cite Misra-Gries[1] as one of the first to provide a heavy hitters algorithm. For example, see David P. Woodruff's article heavy hitter algorithms in data streams[2] and the article by Pandey et. al. on heavy hitters using external memory.[3] ↩
Misra, J.; Gries, David (1982). "Finding repeated elements". Science of Computer Programming. 2 (2): 143–152. doi:10.1016/0167-6423(82)90012-0. hdl:1813/6345. /wiki/Jayadev_Misra ↩
Misra, J.; Gries, David (1982). "Finding repeated elements". Science of Computer Programming. 2 (2): 143–152. doi:10.1016/0167-6423(82)90012-0. hdl:1813/6345. /wiki/Jayadev_Misra ↩
Cormode, Graham (2014). "Misra-Gries Summaries". In Kao, Ming-Yang (ed.). Encyclopedia of Algorithms. Springer US. pp. 1–5. doi:10.1007/978-3-642-27848-8_572-1. ISBN 9783642278488. 9783642278488 ↩
"Misra-Gries Summaries" (PDF). UT News. Retrieved 2022-09-19. https://people.csail.mit.edu/rrw/6.045-2017/encalgs-mg.pdf ↩