You have a billion urls, where each is a huge page. How do you detect the duplicate documents?

Here are two observations we can make
1. every page is huge, so bringing them to the memory will not be feasible. So we need to calculate a hashvalue for the content for each page.
2. There are billions of urls, so comparing page to each other is O(n^2), which is too slow. So we create a hashtable and check if the content of the value is in the already visited hashtable. if so, we can assume that the two pages have the same content and can through out one of it.

So, we will have a list of urls that contains the unique pages. But wait, each hashvalue is 4 bytes and there are a billion urls, so there needs to be 4G memory, plus another few G for the urls, so it is very likely that all of the data can not fit in one machine.

What do we do? Here are a few options.

»»We could split this up into files. We’ll have to deal with the file loading / unloading—ugh.

»»We could hash to disk. Size wouldn’t be a problem, but access time might. A hash table on disk would require a random access read for each check and write to store a viewed url. This could take msecs waiting for seek and rotational latencies. Elevator algorithms could elimate random bouncing from track to track.

»»Or, we could split this up across machines, and deal with network latency. Let’s go with this solution, and assume we have n machines.
First, we hash the document to get a hash value v.
v%n tells us which machine this document’s hash table can be found on.
v / n is the value in the hash table that is located on its machine.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s