KMP: Pattern search algorithm in JAVA
For text of length N and pattern of length M, naive algorithm will match text string with pattern by incrementing index 1 by 1. Whenever a mismatch occurs after K matches, we will discard previous results and start matching with pattern from next index.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
void naiveStringMatching(char[] textString, char[] patternString){ int N = textString.length; int M = patternString.length; for(int i=0; i<N-M+1; i++){ int j; for(j=0; j<M; j++){ if(textString[i+j] != patternString[j]) break; } if(j == M) System.out.println("Match found at index "+ i); } } |
Complexity of naive will be O(m*(n-m+1)) in worst case.
In KMP (Knuth–Morris–Pratt algorithm), it uses information from last mismatch to decide next match start index and bypasses previously matched characters. To retrieve information from previous partial match, KMP does preprocessing of pattern string. It will create a partial_match[] table of size same as pattern length. Value in partial_match[] table at index i represents longest length of proper prefix which is also a proper suffix.
Let’s understand these two terms proper prefix and proper suffix.
For pattern “androidsrc”, proper prefixes are “a”, “an”, “and”, “andr”, “andro”, “androi”, “android”, “androids” and “androidsr”. Proper suffixes are “ndroidsrc”, “ndroidsrc”, “droidsrc”, “roidsrc”, “oidsrc”, “idsrc”, “dsrc”, “src”, “rc” and “c”.
Lets calculate partial_match[] table for pattern “ABCDABD”. For calculating partial_match at index i, we are only bothered about pattern 0 to i-1.
For sub pattern of length 1 “A” as it has no proper suffix or prefix.
partial_match[0] will always be 0.
For sub pattern of length 2 i.e. “AB” proper prefix is A and proper suffix is B. No match.
partial_match[1] = 0
For sub pattern of length 3 i.e. til index 2, “ABC” will have proper prefixes “AB” and “B”. Proper suffixes will be “BC” and “C”. No match.
partial_match[2] = 0
For sub pattern of length 4 i.e. till index 3, “ABCD” Proper prefixes will be “ABC”, “AB” and “A”. Proper suffixes will be “BCD”, “BC” and “D”. No match.
partial_match[3] = 0
For sub pattern of length 5 i.e. till index 4, “ABCDA” Proper prefixes will be “ABCD”, “ABC”, “AB” and “A”. Proper suffixes will be “BCDA”, “CDA”, “DA” and “A”.
We have a match “A” of length=1 at index 4
partial_match[4] = 1
For sub pattern of length 6 i.e. till index 5, “ABCDAB” Proper prefixes will be “ABCDA”, “ABCD”, “ABC”, “AB” and “A”. Proper suffixes will be
“BCDAB”, “CDAB”, “DAB”, “AB” and “B.” We have a match “AB” of length 2.
partial_match[5] = 2
For sub pattern of length 7 i.e. till index 6, “ABCDABD” We can see all proper suffixes will end with “D” while only one proper prefix “ABCD” will end
with D and we don’t have such proper suffix. So partial_match at index 6 will be 0.
partial_match[6] = 0
For pattern “ABCDABD”, partial_match will be {0, 0, 0, 0, 1, 2, 0}. We will use this preprocessed table to skip already matched characters.
Java code for performing computation of partial_match[] table.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
int[] computePartialMatchTable(char[] patternString) { // TODO Auto-generated method stub int patternLength = patternString.length; int partial_match[] = new int[patternLength]; // value for partial_match at index 0 will always be 0 as no proper // suffix or prefix exist partial_match[0] = 0; int length = 0; int currentIndex = 1; while (currentIndex < patternLength) { if (patternString[currentIndex] == patternString[length]) { // match is found length = length + 1; partial_match[currentIndex] = length; currentIndex = currentIndex + 1; } else { // for mismatch case if (length != 0) { length = partial_match[length - 1]; } else { partial_match[currentIndex] = 0; currentIndex = currentIndex + 1; } } } return partial_match; } |
Lets consider our text string as “ABC ABCDAB ABCDABCDABDE”, we will search out pattern now using KMP. textIndex is current index in text to be searched while patternIndex is current index in pattern to be matched.
First mismatch is found at textIndex = 4, we will refer partial_match[] table rather that starting pattern matching from start we will fallback our patternIndex = 4 to partial_match[patternIndex-1] i.e. partial_match[3] which is 0. In this case we can not skip as there is no proper suffix that matches
proper prefix.
Second mismatch will occur at textIndex = 10, characters matched so far in pattern are from 0 to 5 i.e. currently patternIndex = 6. For next step, we will refer partial_match[patternIndex-1] i.e. partial_match[5] which is 2 i.e. we know there is no need to match till index 2. As there exist a proper prefix “AB” of length 2 which is also a proper suffix.
In our next step, we will start with patternIndex = 2 and textIndex = 10, we will have mismatch as ” ” in text string does not match “C” in pattern. This time we can not skip as partial_table[patternIndex-1] is 0.
Next step we will start from textIndex = 11 and patternIndex = 0, mismatch will occur at index 17 after matching “ABCDAB” and patternIndex 6. partial_match[5] is 2. In next step on starting with patternIndex = 2 and textIndex = 17 we will have a match.
Java code for KMP algorithm
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 |
private static void printPatternIndexKMP(char[] textString, char[] patternString) { // TODO Auto-generated method stub int textLength = textString.length; int patternLength = patternString.length; int partial_match[] = computePartialMatchTable(patternString); int currentIndexText = 0; int currentIndexPattern = 0; while (currentIndexText < textLength) { if (textString[currentIndexText] == patternString[currentIndexPattern]) { // so far matched // currentIndexPattern-patternLength+currentIndexText currentIndexPattern = currentIndexPattern + 1; currentIndexText = currentIndexText + 1; } if (currentIndexPattern == patternLength) { System.out.println("Match found at index "+ (currentIndexText - patternLength)); currentIndexPattern = partial_match[currentIndexPattern - 1]; } else if (currentIndexText < textLength && textString[currentIndexText] != patternString[currentIndexPattern]) { if (currentIndexPattern != 0) { // if no match and currentIndexPattern is not zero we will // fallback to values in partial match table // for match of largest common proper suffix and prefix // till currentIndexPattern-1 currentIndexPattern = partial_match[currentIndexPattern - 1]; } else { // if currentIndexPattern is zero // we increment currentIndexText for fresh match currentIndexText = currentIndexText + 1; } } } } |
If we observe, KMP wisely used pattern partial_match[] table to skip index. KMP algorithm performs pattern search in linear time i.e. O(n).