Rabin Karp

For the strstr(S, P) problem (finding the first substring in S that is the same as pattern P, return index, or return -1 if not found).

The brute force problem was straight forward. You simply compare every substring in S with P char by char. that will result to a O(m*n) run time where m is the length of the S and n is the length of n.

The Rabin Karp is based on the same method of comparing every substring of S to P, but not char by char, instead, we calculate a hash value of the S and compare it with the hash value of P, and only compare them char by char when the two hash value are the same. If we choose a good hash function (minimize or eliminate the collision), we only need to compare the S and P char by char only once. Then the running time will be O(m+n), but in the worst case, when the hash collision happens for all the substring in S, the running time is still O(m*n).

How do we calculate the hash value of a string? We can consider each string a number except that instead of the usual base 10 or 2 numbers, it is a base B numbers where B a number equals to the number of alphabets allowed in the string (a~zA~Z0~9…). For an ascii number, we can use B = 256. and the hash value is only translating the base 256 number to base 10 number. So the hash value of pattern P will be:

int B = 256
int h = 0;
for (int i = 0; i < P.length(); i++) {
    h = (h*B + P.charAt(i));
}

Since we are limited to the 32 bit integer, it would be very easy to overflow, so we can use a module of a number M to collapse all the numbers that is less than M. By choosing a prime number for M, we minimize the number of collisions. Why? Because prime number will be the most unique number that formed by multiplying other numbers since its only factor is 1 and itself.

Here we need to use the property of the “modulo” operator
(A + B)%M = (A%M + B%M)%M;

int B = 256
int h = 0;
for (int i = 0; i < P.length(); i++) {
    h = (h*B%M + P.charAt(i)%M)%M;
}

In the same way, we calculate the hash value of the first window of length P.length() in S.

Then you will notice that to calculate the hash value, we need to iterate through a P.length(). Here after we calculate the first window, what we need to do is only to “slide” the window. Which means we remove the first char from the hash value and add the trailing char. To calculate h(i) from h(i-1), we do the following (assume int m = P.length). Notice that removing the heading char might lead to a negative value, so we will need the following property of modulo operator:

(A – B)%M = (A%M – B%M + k*M)%M; since (A – B)%M will be less than M, then we can use k = 1 to make sure new hash value is a positive number.

Here is the code for a working RobinKarp algorithm

public class RK {
	public static int rabinKarp(String P, String S) {
		int b = 256; // base that is large enough to cover all the alphabeticals
		int m = 103; // a Prime number that is smaller than 256 for modulo
		long h = 0; // init hash value for string;
		int p = 0; // init hash value for pattern
		//calculate the hash value for pattern and first window
		for (int i = 0; i < P.length(); i++) {
			h = (h*b + S.charAt(i))%m; //simply use the ascii for to translate alphabetical to number
			p = (p*b + P.charAt(i))%m;
		}
		
		//calculate b^(m-1) for sliding window
		long max = 1;
		for (int i = 0; i < P.length()-1; i++) {
			max *= b;
		}
		
		for (int i = 0; i <= S.length() - P.length(); i++) {
			if (h == p) {
				return i;
			}
			//renew h by sliding the window right
			h = (h - (S.charAt(i)*max)%m + m)%m;
			h = (h*b + S.charAt(i))%m;
		}
		return -1;
	}
	
	public static void main(String[] args) {
		System.out.println(rabinKarp("bca", "abcabc"));
	}
}

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s