Find the kth most frequent element in a data stream

In the word of internet, where everything is in a very large scale. For example you want to find out that among all the ads, which ones are clicked most frequently in a day?

This problem comes down to find the kth most frequent element in a data stream. (or find the element whose frequency is greater than a user defined value).

In order to find the kth most frequent element in a data stream, we need a space of the length of the data stream N. which is not desirable. So we have to use approximation

Two kind of method are proposed to solve the problem
1. counter-based: only monitor a subset of the incoming stream, each element being monitored was assigned a counter, if the incoming element is monitored, the counter is updated, otherwise, do something depending on algorithm.
2. Estimate frequency for all elements using bit-maps of counters
Each element is hashed into the counters’ space using a family of hash functions. Hashed-to counters are queried for the frequencies

We have a simple space saving algorithm that is easy to implement and understand. Here is a detailed description of the space saving algorithm to solve both of the problem
space-saving algorithm
Also notice that we can use min heap to store the k counters so that we can easily replace (remove) the lowest frequency element.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s