Find median (from a stream of data)

1. If the data was stored in an given array, then we could sort the data in nlogn time. Then the median can be found in O(1) time. But when it comes to a stream of data, the practical sorting algorithm is selection sort, which takes O(n^2) time.

2. Another method is to have two heaps, one min and one max, and we make sure that the number of element in min heap and max heap are equal, or differ at most 1. When a new data comes in, we check if we should put it to the max heap or min heap, if it is larger than the root of the max heap, then put it to the max heap, otherwise put it to the min heap. If the min heap or max heap gets more than 1 difference between another, then remove one element from one heap and insert it to the other, the running time is nlogn.

3. If the range of the data is within a range, we can use counting sort or radix sort which runs in O(n) time to sort the array and the total running time would be O(n).

4. Using the quick select -> reduce the problem to finding the kth element in the array. Randomized quick select: 1. Pick a random pivot p. 2. rearrange the array so that everything less than pivot goes in front of the pivot and everything greater than the pivot goes after the pivot. find the index of the pivot p L, if k = L output p, otherwise recurse on less or greater part of the subarray. Another deterministic quick select method: 1. group the input array of 5 elements, find the median of each group, and find the median of the median. use that as pivot and split the array into two parts, Greater and Less part just like randomized quick sort, and recurse on the appropriate part.

5. find median with limited memory, say if you only have 2G memory but you are looking for the median of 10G numbers. Using counting sort: 1. Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram. this is to create buckets that store ranges of your input data, and find the mid bucket. also record the number of elements in the buckets less than the median buckets. 2. Now you count up in that histogram until you reach the bucket that covers the midpoint of the values. this is to find the bucket contains the mid element 3. Pass through again, ignoring all numbers that don’t have that same set of top bits, and make a histogram of the bottom bits. this is to find median in the bucket, ignore those who is not in the bucket. 4. Count up through that histogram until you reach the element that covers the midpoint of the (entire list of) values (remember you have the record of the .
Now you know the median, in O(n) time and O(1) space (in practice, under 1 MB).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s