Implement strStr()

This is my first time learning the KMP algorithm, it is a fast algorithm to match pattern string in the target string. The algorithm goes as follows:

First pre-process the pattern string to get an array “next”.
Then use the next array together with the pattern string to match the target string.
Because we have the information from the next string, we can avoid backtracking when we do the comparison with the target string. The run time of the KMP algorithm is O(m + n) where m is the length of the pattern and n is the length of the target string. The brute force method runs at O(mn).

The next string:
Lets take a look at an example to explain what is the “next” string. Say we have Pattern P = ADABADADA. The prefix of all the substring is as listed (not containing the substring itself):

0:empty string
1:A
2:AD
3:ADA
4:ADAB
5:ADABA
6:ADABAD
7:ADABADA
8:ADABADAD

Now consider each of the listed prefix (not including the substring itself), of all the prefix strings. we can find the longest substring that is also a suffix of current prefix string.

0:empty string
1:empty string
2:empty string
3:A
4:empty string
5:A
6:AD
7:ADA
8:AD

As we can see, at some point, if we have a partial match (say index i = 7, we have longest common prefix and suffix of (ADA), also we have another common prefix and suffix that is shorter (A). What will be the longest common prefix depends on the next coming character:
1. If next character is B, then we can extend the longest match futher to ADAB.
2. If the next character is D, then we can not extend the longest match to ADAB, so we go back and check the second longest suffix A, see if we could extend A further, D matches the AD, so we extend the longest match to AD.
3. if the next character is say, C, after we tried the second longest suffix, we reached the empty string. then we can not extend the match and reset the longest match to empty string.

The following code is the function that generate this “next” array.
Notice that index i means we are checking the substring of P[0~i] inclusive, k holds the pointer to the current largest common suffix so that T[i] = k means P[0~k-1] = P[i-k~i-1]. After we did comparison for P[k] and P[i], if they are equal, then we simply extend the common suffix (T[i] = .k+1 and i++, k++). If they do not match, we go to the second largest common suffix (k = T[k]), then compare again, until k = 0 or hit a match. Then we can fill T[i+1] = k+1(match) or T[i+1] = 0(k == 0).

	public int [] getNext(String needle) {
		int [] T = new int [needle.length()];
		T[0] = 0;
		if (needle.length() == 1)
			return T;
		T[1] = 0;
		int k = 0; // the longest match suffix pointer
		for (int i = 1; i < needle.length() - 1; i++) {
			while(true) {
				// if next candidate can be expended
				if (needle.charAt(i) == needle.charAt(k)) {
					T[i+1] = k + 1;
					k++;
					break;
				}
				else if (k > 0) {
					k = T[k];
				}
				else {
					T[i+1] = 0;
					break;
				}
			}
		}
		return T;
	}

Then we could proceed to find the pattern string in the target string. This process is simple. The brute force method requires backing up to the previous point to scan the whole pattern again.
With the next array, the searching does not require back up.

To use the next array, the process is almost identical to building the next array. If we could not extend the match, we go to the second largest match and try. If we hit the pattern length, we find a match. If we can not match even the first character, we increase the target index by one and reset the pattern index to 0.

    public String strStr(String haystack, String needle) {
        if (haystack.length() == 0 && needle.length() == 0)
            return "";
        if (haystack.length() == 0)
            return null;
        if (needle.length() == 0)
            return haystack;
        int i = 0, j = 0;// i is the haystack index, j is the needle index
		int [] next = getNext(needle);
		while (i != haystack.length()) {
			if (haystack.charAt(i) == needle.charAt(j)) {
				i++;
				j++;
				if ( j == needle.length()) // found
					return haystack.substring(i - needle.length());
			} else if (j > 0) { //second suffix j, i stay the same
				j = next[j];
			} else { // if j = 0; then try next
				j = 0;
				i++;
			}
		}
		return null;
    }

in each comparison Each time through the loop, either we increase i or we slide the pattern right. Both of
these events can occur at most n times, and so the repeat loop is executed at most 2n
times. The cost of each iteration of the repeat loop is O(1). Therefore, the running time
is O(n), assuming that the values π(q) are already computed

Second Submission

public class Solution {
    public String strStr(String haystack, String needle) {
        if (needle.length() == 0)
            return haystack;
        if (haystack.length() == 0)
            return null;
        int i = 0; //index for the haystack
        int j = 0; //index for the needle
        int [] failure = getNext(needle);// get the failure function array
        while( i != haystack.length()) {
            if (haystack.charAt(i) == needle.charAt(j)) {
                i++;
                j++;
                if (j == needle.length()) {
                    return haystack.substring(i - needle.length()); // the pointer to the first substring
                }
            }
            else if (j > 0) { // slide the needle right, important, if next failure [j] = 0, need to compare 0 to i
                j = failure[j];
            }
            else {// if failure [j] = 0
                i++;
                j=0;
            }
        }
        return null;
    }
    
    private int [] getNext(String needle) {
	    int [] failure = new int [needle.length()];
	    failure[0] = 0; // failure 1 is also 0 because if you have two index on needle, say aa and ab, if you have a mismatch on the second index, you need to move to 0 and compare for both cases. 
	    if (needle.length() == 1)
	        return failure;
	    int k = 0; // k is pointing to the next possible extending position, T[i] = k means substring(0, k-1) = substring(i-k, i-1)
	    for (int i = 1; i < needle.length() - 1; i++) { // compare start from second char index 1, should not compare index 0 to 0 itself
	        while(true) { 
	            if (needle.charAt(i) == needle.charAt(k)) { // check the k-1 value and i to decide the i+1 value
	                failure[i + 1] = k + 1;
	                k++; // if you already have k match, next should test k+1 and only upon failure, reduce to k = failure[k];
	                break;
	            } 
	            else if ( k > 0) { // important, even if next failure[k] = 0, need to check the if index 0 is a commen prefix and suffix
	                k = failure[k]; // key point, check the next longest substring that is both prefix and suffix
	            }
	            else {
	                failure[i + 1] = 0;
	                break;
	            }
	        }
	    }
	    return failure;
	}
}

Third submission:

See the comment,
Notice that failure[0] stores the location for the next comparison. So
failure[1] == 0 because upon failure on comparison on P[1], next to compare is P[0]. So we have

if (needle.charAt(i) == needle.charAt(last)) {
	            // failure[i] records the location of the next comparison
	            // if char at i and last are equal, in the case future failure happen
	            // we back up to (last + 1) if (i+1) fail because i and last has the same prefix
	            failure[i+1] = last+1; 
	            last++;
	        }

if we have P[i] == P[last], then next time we have a failure, we should pick last = failure[last]. Then we go back to check if P[i] =? P[last].

After we have the failure array, we could use

for (int i = 0; i < haystack.length(); i++) {
            char h = haystack.charAt(i);
            if (h == needle.charAt(n)) {
                if (n == needle.length() - 1) // if last char is a match
                    return haystack.substring(i - n);
                n++;
            }
            else if (n != 0) {
                n = failure[n];
                i--; // try again see if match
            }
            else {
                n = 0;
            }
        }

to check if we have the match or not.

public class Solution {
    public String strStr(String haystack, String needle) {
        if (needle.length() == 0)
            return haystack;
        if (haystack.length() == 0)
            return null;
        int [] failure = getNext(needle);
        int n = 0; //start from nothing matches
        for (int i = 0; i < haystack.length(); i++) {
            char h = haystack.charAt(i);
            if (h == needle.charAt(n)) {
                if (n == needle.length() - 1) // if last char is a match
                    return haystack.substring(i - n);
                n++;
            }
            else if (n != 0) {
                n = failure[n];
                i--; // try again see if match
            }
            else {
                n = 0;
            }
        }
        return null;
    }
    
    public int [] getNext (String needle) {
	    int [] failure = new int [needle.length()];
	    failure[0] = 0;
	    if (needle.length() == 1)
            return failure;
	    failure[1] = 0; // if 1 fail, back to 0
	    int last = 0;
	    for(int i = 1; i < failure.length-1; i++) {
	        if (needle.charAt(i) == needle.charAt(last)) {
	            // failure[i] records the location of the next comparison
	            // if char at i and last are equal, in the case future failure happen
	            // we back up to (last + 1) if (i+1) fail because i and last has the same prefix
	            failure[i+1] = last+1; 
	            last++;
	        } 
	        else if (last != 0) {
	            last = failure[last]; // back off to the last possible comparison
	            i--;//try again see if can match
	        }
	        else{
	            failure[i+1] = last;
	        }
	    }
	    return failure;
	} 
}

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s