Leetcode-Java: Implement strStr()

Implement strStr()

Implement strStr().
Returns the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.

Naive Way:从一堆稻草中找一根针，这简直比大海捞针还难，因为稻草跟针长得很像。是否可以直接比较呢，对每一个字符都遍历一遍，这样的算法复杂度是O(n^2)。因为我之前做过，我貌似知道这样做虽然很傻，但是可以通过。

public class Solution {
    public int strStr(String haystack, String needle) {
        for(int i = 0;i <= haystack.length()-needle.length();i++){
            int j = 0;
            for(;j < needle.length();j++)
                if(haystack.charAt(i+j)!=needle.charAt(j))
                    break;
            if(j == needle.length())
                return i;
        }
        return -1;
    }
}

Improved Way:怎能满足于如此傻的方法呢，这个世界一定有比这个更好的方法才对。是的，那就是KMP算法了，专门用于从稻草中捞出一根针的算法。

我看了一下维基百科的例子，觉得很好理解，于是根据例子自己写了一个算法。下面是那个例子的详解

reference to http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

To illustrate the algorithm's details, we work through a (relatively artificial) run of the algorithm, where W = "ABCDABD" and S = "ABC ABCDAB ABCDABCDABDE". At any given time, the algorithm is in a state determined by two integers:

m, denoting the position within S where the prospective match for W begins,
i, denoting the index of the currently considered character in W.

In each step we compare S[m+i] with W[i] and advance if they are equal. This is depicted, at the start of the run, like

             1         2  
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456

We proceed by comparing successive characters of W to "parallel" characters of S, moving from one to the next if they match. However, in the fourth step, we get S[3] = ' ' and W[3] = 'D', a mismatch. Rather than beginning to search again at S[1], we note that no 'A' occurs between positions 0 and 3 in S, except at 0; hence, having checked all those characters previously, we know that there is no chance of finding the beginning of a match if we check them again. Therefore, we move on to the next character, setting m = 4 and i = 0.

             1         2  
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:     ABCDABD
i:     0123456

We quickly obtain a nearly complete match "ABCDAB" when, at W[6] (S[10]), we again have a discrepancy. However, just prior to the end of the current partial match, we passed an "AB", which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply reset m = 8, i = 2 and continue matching the current character. Thus, not only do we omit previously matched characters of S, but also previously matched characters of W.

             1         2  
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:         ABCDABD
i:         0123456

This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we return to the beginning of W and begin searching at the next character of S: m = 11, reset i = 0.

             1         2  
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:            ABCDABD
i:            0123456

Once again, we immediately hit upon a match "ABCDAB", but the next character, 'C', does not match the final character 'D' of the word W. Reasoning as before, we set m = 15, to start at the two-character string "AB" leading up to the current position, set i = 2, and continue matching from the current position.

             1         2  
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W:                ABCDABD
i:                0123456

This time we are able to complete the match, whose first character is S[15].

我的理解就是在匹配当前Index的同时，留一个心眼找从中间开始的和needle匹配的match，然后给它一个index记录这个中间开始的匹配的位置（在needle上的位置），一旦主线匹配失败，就转到支线匹配，同时，要把主线和支线的地位交换，支线成为了主线，旧的主线去记录新的中间可能出现的匹配。

这个算法的复杂度是O(n)。

public class Solution {
    public int strStr(String haystack, String needle) {
        int p = 0; // for haystack
        int i = 0, j = 0; // for needle
        int state = 0; // state=0 means i being compared, state=1 means j is being compared
        while(p < haystack.length() && i < needle.length() && j < needle.length()){
            switch(state){
                case 0:
                    if(i!= 0){
                        // if a new match start in the middle
                        if(haystack.charAt(p)==needle.charAt(j))
                            j++;
                        else if(haystack.charAt(p)==needle.charAt(0))
                            j = 1;
                        else
                            j = 0;
                    }
                    if(haystack.charAt(p)==needle.charAt(i)){
                        i++;
                    }else{
                        i = 0;
                        state = 1;
                    }
                    break;
                case 1:
                    if(j!=0){
                        // if a new match start in the middle
                        if(haystack.charAt(p)==needle.charAt(i))
                            i++;
                        else if(haystack.charAt(p)==needle.charAt(0))
                            i = 1;
                        else
                            i = 0;
                    }
                    if(haystack.charAt(p)==needle.charAt(j)){
                        j++;
                    }else{
                        j = 0;
                        state = 0;
                    }
                    break;
                default:
                    break;
            }
            p++;
            // if a total match is caught
            if(i==needle.length() || j==needle.length())
                return p-needle.length();
        }
        return needle.length()==0?0:-1;
    }
}

这个方法可以在OJ上通过，但是是错误的。因为我只存了两个指针，当出现第三个和开始一致的情况，就会出错。按照这个思路需要存贮n个指针，轮流替换，这样的话每次一个指针更新，其他所有指针都要遍历一遍，算法复杂度就成了O(n^2)，真是个烂算法。

自己写KMP的流程。
首先，构建一个针对于needle的table，这个table要记录一旦haystack[m+i] 与 needle[i] 不匹配时，m要返回到哪里，i要返回到哪里。

示例：
index    0 1 2 3 4 5 6
needle A B D A B C D
table    -1 0 0   0 1 2   0

比如匹配到i = 5的C处，发现不匹配，那么haystack的指针就要减少2个，needle的指针就要从2开始匹配，从2开始匹配这点显而易见，因为都已经匹配到C了，说明前两个一定是AB，haystack的当前位置肯定不是C，那么看看是不是之前具有相同pattern AB的下一个位置处的D，同时，因为之前是用[m+i]来记录haystack的指针的，此时needle已经是从第2个开始了，需要保持[m+i]中的 i=2, 而之前是i=5,那么m就要相应的增加3 （因为i减少了3）。

根据例子可以依照规律构建table。

算法复杂度是O(n+m), space是O(m)

public class Solution {
    public int strStr(String haystack, String needle) {
        // KMP
        if(needle.length() == 0){return 0;}
        int t[] = new int[needle.length()];
        // construct table
        t[0] = -1;
        for(int i = 1;i < t.length;i++){
            if(i==1){t[i] = 0;}
            else{
                if(needle.charAt(i-1)==needle.charAt(t[i-1])){
                    t[i] = t[i-1]+1;
                }else{
                    t[i] = 0;
                }
            }
        }
        // kmp search
        int m = 0, i = 0;
        while(m <= haystack.length()-needle.length()){
            if(haystack.charAt(m+i)==needle.charAt(i)){
                i++;
            }else{
                if(i > 0){
                    m = m+i-t[i];
                    i = t[i];
                }else{
                    m++;
                }
            }
            if(i==needle.length())
                return m;
        }
        return -1;
    }
}

Leetcode-Java

Labels

Tuesday, February 3, 2015

Implement strStr()

Implement strStr()

No comments:

Post a Comment