KMP algorithm
KMP algorithm is a classic algorithm to search a pattern inside a text. I perosnally think this algorithm is diffiuclt to understand, it's simple but quite abstract.
Naive Approach
At each character of text, compare with pattern letter by letter from its begining. When a mismatch is found, move the character position by one and start all over again.
Running time: O(n * k)
where n is the length of text, k is the length of pattern
#include <stdio.h>
#include <string.h>
int contain(char *t, char *p){
int i, j;
int result = 0;
// start comparsion at each character
for (i = 0; i < strlen(t) - strlen(p) + 1; i++){
result = 1;
// compare each character at text and each character at pattern
// all letters have to be the same
for (j = 0; j < strlen(p); j++){
result = result && (t[i + j] == p[j]);
}
if (result)
return result;
}
return result;
}
KMP Algorithm
Searching principle
It looks a little bit magic. When a mismatch is found, the words itself tell you where you should restart you comparsion and keep the text position. Instead of movining the character position in text one by one, it can move much more quickly and without repeating to compare those character in texts which are proved to be matched. The letters before the mismtached character in the pattern could be possiblily matched with previous characters in the text.
When a mismatch is found at [i + 1]
character, the letters from [1...i]
are already matched. Suffix of [1...i]
could be also a prefix of [1...i]
. Suffix itself is matched, but its next character is not matched. When Suffix is equal to prefix, it can act likes a prefix, how about trying the next character after the prefix? You only need to roll back the comparsion at the character just behind the prefix.
[Prefix][t]...[Suffix][m]
m is wrong, let's try t
Considering the following example, ababc is the pattern, ababababc is the text. And i is the text pointer, j is the pattern pointer.
Step 1: Start at the begining, when a match is found, keep moving both pointers.
i ababababc j ababc
i ababababc j ababc
Step 2:
Look at the above figure, a mistached is found at the 5-th character, a and c. i
stays at that position, the letters befroe c is abab
, its longest suffix and prefix is ab
, so we need to rollback j
to character just behind ab
, which is a
. You can now see that the eariler suffix: ab
acts like prefix now
i ababababc j ababc
Step 3: Keep moving until and mismatched is found.
i ababababc j ababc
Step 4:
Look at the above figure, a and c is mismatched again. i
stays at that position, the letters befroe c is abab
, its longest suffix and prefix is ab
, so we need to rollback j
to character just behind ab
, which is a. You can now see that the eariler suffix: ab
acts like prefix now
i ababababc j ababc
Step 5:
Keep moving until j
reach the end.
i ababababc j ababc
Failure function
The remaining quesition is: how can we find the the position of longest prefix which is also a suffix (LPS) ? It is done by dynamic programming. By examing the previous LPS, the letter at i-th position can get its LPS position easily.
Consider a pattern [1...i-1][i]
, can be seen as [suffix...prefix][a]
where a is the mismatched character. It could probabily look like:
Case 1: No LPS of previous letters
[....no any suffix at all][a]
=> LPS of [a] = -1
Case 2: the next character of the LPS of previous letters is equal to the mismatched letter
[suffix[a]..prefix][a]
: LPS of [1...i-1] + 1 is equal to a ?
=> LPS of [a] = LPS of [1..i-1] + 1
Example : aabaab
To calculate the position of LPS of the last b, F(5), it can be viewd as [aabaa]b
LPS of [aabaa]
is aa
with index 2, aa
+ 1 = b
?
YES !! therefore LPS of the last b
= 2 + 1 = 3
Case 3: the next character of the LPS of the LPS... of the LPS of previous letters is equal to the mismatched letter
[su[a]fix[b]..prefix][a]
: ( LPS of LPS...of LPS of [1...i-1] ) + 1 is equal to a ?
=> LPS of [a] = ( LPS of LPS...of LPS of [1..i-1] ) + 1
Example: adcaadcad
To calculate the position of LPS of the last c, F(8), it can be viewd as [adcaadca]d
LPS of [adcaadca]
is adca
with index 3, adca
+ 1 = d
?
NO !! find the LPS of the LPS recursively.
LPS of adca
is a
with index 0, is a
+ 1 = d
?
YES !!! therefore F(8) = 0 + 1
Running time analysis:
Best Case: the text pointer always keep moving forward, and the pattern pointer always keep staying at the original position. n times will be executed
Worst Case: (in theory, althoguh it is not a possible case) Condition 1: the text pointer can move forward for n time Condition 2: the pattern pointer rolls back and keeps follow the text pointer, the pattern pointer can only walk at most at text pointer walks If two conditions happen evenly, at most 2n times will be executed.
Overall Running time: O(n + k) where n is the length of text where k is the length of pattern
From wikipedia
Here is another way to think about the runtime: Let us say we begin to match W and S at position i and p. If W exists as a substring of S at p, then W[0..m] = S[p..p+m]. Upon success, that is, the word and the text matched at the positions (W[i] = S[p+i]), we increase i by 1. Upon failure, that is, the word and the text does not match at the positions (W[i] ≠ S[p+i]), the text pointer is kept still, while the word pointer is rolled back a certain amount (i = T[i], where T is the jump table), and we attempt to match W[T[i]] with S[p+i]. The maximum number of roll-back of i is bounded by i, that is to say, for any failure, we can only roll back as much as we have progressed up to the failure. Then it is clear the runtime is 2n.
Code Sample in C
The failure function starts with 0 index -1 means no LPS
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
// f is expected enough to have capacity as big as "p"
// like: int f[strlen(p)]
void failure(char *p, int *f){
f[0] = -1;
for (int i = 1; i < strlen(p); i++){
int r = f[i - 1];
// [1....i-1][a]
// [suffix...prefix][a]
// is it looks like [suffix[a]..prefix][a] ?
if (p[i] == p[r + 1] ){
f[i] = r + 1;
} else if (r < 0){
f[i] = -1;
} else {
// is it looks like [su[a]fix[b]..prefix][a] ?
// recursivly find the suffix of suffix
int k = f[r]; // the is the last index of "suffix of the suffix"
while ( k > -1 && (p[i] != p[k + 1]) ){
k = f[k];
}
if (p[i] == p[k + 1]) k++;
f[i] = k;
}
}
}
// LPS = longest prefix and suffix
// it uses two integer variable to "simulate a for loop"
int contain(char *t, char *p){
int i, j;
i = 0;
j = 0;
// compute f(n), the latest index of the suffix
int *f = (int*)malloc(strlen(p) * sizeof(int));
failure(p, f);
// scan though the text with i as index variable
while (i < strlen(t)){
// letter is matched, continue to the next one please
if (t[i] == p[j]){
i++;
j++;
// yes, you has done it!
if (j == strlen(p)){
return 1;
}
}
// not mtached, how many I should move for i and j?
else if (t[i] != p[j]){
// no avaiable LPS, just move ahead
if (j == 0){
i++;
} else {
// start over at the last index of suffix before j-th letter + 1
j = f[j - 1] + 1;
}
}
}
return 0;
}