In the word of internet, where everything is in a very large scale. For example you want to find out that among all the ads, which ones are clicked most frequently in a day?
This problem comes down to find the kth most frequent element in a data stream. (or find the element whose frequency is greater than a user defined value).
In order to find the kth most frequent element in a data stream, we need a space of the length of the data stream N. which is not desirable. So we have to use approximation
Two kind of method are proposed to solve the problem
1. counter-based: only monitor a subset of the incoming stream, each element being monitored was assigned a counter, if the incoming element is monitored, the counter is updated, otherwise, do something depending on algorithm.
2. Estimate frequency for all elements using bit-maps of counters
Each element is hashed into the counters’ space using a family of hash functions. Hashed-to counters are queried for the frequencies
We have a simple space saving algorithm that is easy to implement and understand. Here is a detailed description of the space saving algorithm to solve both of the problem
Also notice that we can use min heap to store the k counters so that we can easily replace (remove) the lowest frequency element.