Merge pull request #62 from 4ndrelim/branch-RefactorKMP
Branch refactor kmp
4ndrelim authored Feb 10, 2024
2 parents d3f6b76 + 5aeb10b commit ba79246
Showing 3 changed files with 104 additions and 70 deletions.
123 changes: 62 additions & 61 deletions src/main/java/algorithms/patternFinding/KMP.java
@@ -6,111 +6,112 @@
/**
* Implementation of KMP.
* <p>
* Illustration of getPrefixIndices: with pattern ABCABCNOABCABCA
* Here we make a distinction between position and index. The position is basically 1-indexed.
* Note the return indices are still 0-indexed of the pattern string.
* Illustration of getPrefixTable: with pattern ABCABCNOABCABCA
* We consider 1-indexed positions. Position 0 will be useful later as a trick to indicate that there are no prefix matches.
* Position: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
* Pattern: A B C A B C N O A B C A B C A ...
* Return: -1 0 0 0 1 2 3 0 0 1 2 3 4 5 6 4 ...
* Read: ^ an indexing trick; consider 1-indexed characters for clarity and simplicity in the main algor
* Read: ^ 'A' is the first character of the pattern string,
* there is no prefix ending before its index, 0, that can be matched with.
* Read: ^ ^ 'B' and 'C' cannot be matched with any prefix which are just 'A' and 'AB' respectively.
* Read: ^ Can be matched with an earlier 'A'. So we store 1.
* Prefix is the substring from idx 0 to 1 (exclusive). Note consider prefix from 0-indexed.
* Realise 1 can also be interpreted as the index of the next character to match against!
* Read: ^ ^ Similarly, continue matching
* Read: ^ ^ No matches, so 0
* Read: ^ ^ ^ ^ ^ ^ Match with prefix until position 6!
* Read: ^ where the magic happens, we can't match 'N'
* at position 7 with 'A' at position 15, but
* we know ABC of position 1-3 (or index 0-2)
* exists and can 'restart' from there.
* <p>
* <p>
* Pattern: A B C A B C N O A B C A B C A ...
* Return: -1 0 0 0 1 2 3 0 0 1 2 3 4 5 6 4 ... CAN BE READ AS NUM OF CHARS MATCHED
* Read: ^ -1 can be interpreted as invalid number of chars matched but exploited for simplicity in the main algor.
* Read: ^ 'A' is the first character of the pattern, there is no prefix ending before itself, to match.
* Read: ^ ^ 'B' and 'C' cannot be matched with any prefix which are just 'A' and 'AB' respectively.
* Read: ^ can be matched with an earlier prefix, 'A'. So we store 1, the number of chars matched.
* Realise 1 can also be interpreted as the index of the next character to match against!
* Read: ^ ^ Similarly, continue matching
* Read: ^ ^ No matches, so 0
* Read: ^ ^ ^ ^ ^ ^ Match with prefix, "ABCABC", until 6th char
* of pattern string.
* Read: ^ where the magic happens, we can't match 'N'
* at position 7 with 'A' at position 15, but
* we know "ABC" exists as an earlier sub-pattern
* from 1st to 3rd and start matching the 4th
* char onwards.
* <p>
* Illustration of main logic:
* Pattern: ABABAB
* String : ABABCABABABAB
* <p>
* A B A B C A B A B A B A B
* Read: ^ to ^ Continue matching where possible, leading to Pattern[0:4] matched.
* unable to match Pattern[4]. But notice that last two characters of String[0:4]
* form a sub-pattern with Pattern[0:2] Maybe Pattern[2] == 'C' and we can 're-use' Pattern[0:2]
* Read: ^ try ^ by checking if Pattern[2] == 'C'
* A B A B C A B A B A B A B
* Read: ^ to ^ Continue matching where possible, leading to 1st 4 characters matched.
* unable to match Pattern[4]. But notice that last two characters
* form a sub-pattern with the 1st 2, Maybe Pattern[2] == 'C' and we can 're-use' "AB"
* Read: ^ ^ check if Pattern[2] == 'C'
* Read: Turns out no. No previously identified sub-pattern with 'C'. Restart matching Pattern.
* Read: ^ to ^ Found complete match! But rather than restart, notice that last 4 characters
* Read: form a prefix sub-pattern of Pattern, which is Pattern[0:4] = "ABAB", so,
* Read: ^ ^ Start matching from Pattern[4] and finally Pattern[5]
* Read: ^ ^ Found complete match! But rather than restart, notice that last 4 characters
* Read: of "ABABAB" form a prefix sub-pattern of Pattern, which is "ABAB", so,
* Read: ^ reuse ^ ^ then match 5th and 6th char of pattern which happens to be "AB"
*/
public class KMP {
/**
* Find and indicate all suffixes that match with a prefix.
* Captures the longest prefix which is also a suffix for some substring ending at each index, starting from 0.
* Does this by tracking the number of characters (of the prefix and suffix) matched.
*
* @param pattern to search
* @return an array of indices where the suffix ending at each position of the array can be matched with
* a corresponding prefix of the pattern ending before the specified index
* @return an array of indices
*/
private static int[] getPrefixIndices(String pattern) {
private static int[] getPrefixTable(String pattern) {
// 1-indexed implementation
int len = pattern.length();
int[] prefixIndices = new int[len + 1];
prefixIndices[0] = -1;
prefixIndices[1] = 0; // 1st character has no prefix to match with
int[] numCharsMatched = new int[len + 1];
numCharsMatched[0] = -1;
numCharsMatched[1] = 0; // 1st character has no prefix to match with

int currPrefixMatched = 0; // num of chars of prefix pattern currently matched
int pos = 2; // Starting from the 2nd character, recall 1-indexed
int pos = 2; // Starting from the 2nd character
while (pos <= len) {
if (pattern.charAt(pos - 1) == pattern.charAt(currPrefixMatched)) {
currPrefixMatched += 1;
// note, the line below can also be interpreted as the index of the next char to match
prefixIndices[pos] = currPrefixMatched; // an indexing trick, store at the pos, num of chars matched
numCharsMatched[pos] = currPrefixMatched;
pos += 1;
} else if (currPrefixMatched > 0) {
// go back to a previous known match and try to match again
currPrefixMatched = prefixIndices[currPrefixMatched];
currPrefixMatched = numCharsMatched[currPrefixMatched];
} else {
// unable to match, time to move on
prefixIndices[pos] = 0;
numCharsMatched[pos] = 0;
pos += 1;
}
}
return prefixIndices;
return numCharsMatched;
}

/**
* Main logic of KMP. Iterate the sequence, looking for patterns. If a difference is found, resume matching from
* a previously identified sub-pattern, if possible. Length of pattern should be at least one.
*
* Main logic of KMP. Iterate the sequence, looking for patterns. If a mismatch is found, resume matching from
* a previously identified sub-pattern, if possible. Here we assume length of pattern is at least one.
* @param sequence to search against
* @param pattern to search for
* @return start indices of all occurrences of pattern found
*/
public static List<Integer> findOccurrences(String sequence, String pattern) {
assert pattern.length() >= 1 : "Pattern length cannot be 0!";

int sLen = sequence.length();
int pLen = pattern.length();
int[] prefixIndices = getPrefixIndices(pattern);
int[] prefixTable = getPrefixTable(pattern);
List<Integer> indicesFound = new ArrayList<>();

int s = 0;
int p = 0;
int sTrav = 0;
int pTrav = 0;

while (s < sLen) {
if (pattern.charAt(p) == sequence.charAt(s)) {
p += 1;
s += 1;
if (p == pLen) {
// occurrence found
indicesFound.add(s - pLen); // start index of this occurrence
p = prefixIndices[p]; // reset
while (sTrav < sLen) {
if (pattern.charAt(pTrav) == sequence.charAt(sTrav)) {
pTrav += 1;
sTrav += 1;
if (pTrav == pLen) { // matched a complete pattern string
indicesFound.add(sTrav - pLen); // start index of this occurrence
// recall the number of chars matched in p can be read as the index of the next char in p to match
pTrav = prefixTable[pTrav]; // start matching from a repeated sub-pattern, if possible
}
} else {
p = prefixIndices[p];
if (p < 0) { // move on
p += 1;
s += 1;
pTrav = prefixTable[pTrav];
if (pTrav < 0) { // move on; using -1 trick
pTrav += 1;
sTrav += 1;
}
// ALTERNATIVELY
// if pTrav == 0 i.e. nothing matched, move on
// sTrav += 1
// continue
//
// pTrav = prefixTable[pTrav]
}
}
return indicesFound;
38 changes: 33 additions & 5 deletions src/main/java/algorithms/patternFinding/README.md
@@ -1,6 +1,9 @@
# Knuth-Morris-Pratt Algorithm

KMP match is a type of pattern-searching algorithm.
## Background
KMP is a pattern-searching algorithm that improves on naive search by avoiding unnecessary
comparisons. The gain is most noticeable when the pattern has repeating sub-patterns.
<br>
Pattern-searching problems are prevalent across many fields of CS, for instance,
in text editors when searching for a pattern, in computational biology sequence-matching problems,
in NLP problems, and even when looking for file patterns for effective file management.
@@ -11,9 +14,31 @@ Typically, the algorithm returns a list of indices that denote the start of each
![KMP](../../../../../docs/assets/images/kmp.png)
Image Source: GeeksforGeeks

## Analysis
### Intuition
It's efficient because it utilizes the information gained from previous character comparisons. When a mismatch occurs,
the algorithm uses this information to skip over as many characters as possible.

**Time complexity**:
Consider the pattern: <br>
<div style="text-align: center;">
"XYXYCXYXYF"
</div>
and the string:
<div style="text-align: center;">
"XYXYCXYXYCXYXYFGABC"
</div>

During its initial processing of the pattern, KMP has identified that "XYXY" is a repeating sub-pattern.
This means that when the mismatch between F (10th character of the pattern) and C (10th character of the string) occurs,
KMP doesn't need to start matching again from the very beginning of the pattern. <br>
Instead, it leverages the fact that the "XYXY" just before the mismatch has already been matched and is also a prefix of the pattern.

Therefore, the algorithm continues matching from the 5th character of the pattern string (C in "XYXYCXYXYF"). <br>
It checks this against the 10th character of the string (C in "XYXYCXYXYCXYXYFGABC"). <br>
Since they match, the algorithm continues from there without re-checking the initial "XYXY".
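
The walkthrough above can be sketched in a few lines of Java. This is a minimal illustration, not the code in KMP.java: the class `KmpIntuitionSketch` and the helpers `prefixLengths` and `resumeIndex` are hypothetical names, and the table here is indexed by the number of characters matched, without the -1 sentinel used in the actual implementation.

```java
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical and not part of KMP.java.
class KmpIntuitionSketch {
    // table[m] = length of the longest proper prefix of the first m pattern
    // characters that is also a suffix of those m characters.
    static int[] prefixLengths(String pattern) {
        int[] table = new int[pattern.length() + 1];
        int matched = 0; // chars of the prefix currently matched
        int i = 1;       // 0-indexed position in the pattern being examined
        while (i < pattern.length()) {
            if (pattern.charAt(i) == pattern.charAt(matched)) {
                matched++;
                i++;
                table[i] = matched;
            } else if (matched > 0) {
                matched = table[matched]; // fall back to a shorter known match
            } else {
                i++;
                table[i] = 0;
            }
        }
        return table;
    }

    // After charsMatched characters matched and the next one mismatched,
    // this is the 0-indexed pattern position to resume comparing from.
    static int resumeIndex(String pattern, int charsMatched) {
        return prefixLengths(pattern)[charsMatched];
    }

    public static void main(String[] args) {
        String pattern = "XYXYCXYXYF";
        System.out.println(Arrays.toString(prefixLengths(pattern)));
        // prints [0, 0, 0, 1, 2, 0, 1, 2, 3, 4, 0]
        System.out.println(resumeIndex(pattern, 9));
        // prints 4, i.e. resume at the 5th character, 'C'
    }
}
```

After 9 matched characters the table entry is 4, so matching resumes at index 4 of the pattern ('C'), which is exactly the step described for the mismatch at 'F'.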

## Complexity Analysis
Let k be the length of the pattern and n be the length of the string to match against.

**Time complexity**: O(n+k)

Naively, we can look for patterns in a given sequence in O(nk) where n is the length of the sequence and k
is the length of the pattern. We do this by iterating every character of the sequence, and look at the
@@ -27,7 +52,10 @@ O(n) traversal of the sequence. More details found in the src code.
**Space complexity**: O(k) auxiliary space for the table that records, for each position of the pattern, the length of the longest suffix that matches a prefix of the pattern string
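
For contrast with the O(n+k) bound, the naive O(nk) search mentioned above could look roughly like the sketch below. The class name `NaiveSearchSketch` is hypothetical and the code assumes a non-empty pattern; it is a baseline for comparison, not something found in this repository's KMP.java.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative baseline only; the class name is hypothetical.
class NaiveSearchSketch {
    // Tries every start index and compares up to k characters each time: O(nk).
    // Assumes the pattern is non-empty.
    static List<Integer> findOccurrences(String sequence, String pattern) {
        List<Integer> found = new ArrayList<>();
        int n = sequence.length();
        int k = pattern.length();
        for (int start = 0; start + k <= n; start++) {
            int matched = 0;
            while (matched < k && sequence.charAt(start + matched) == pattern.charAt(matched)) {
                matched++;
            }
            if (matched == k) {
                found.add(start); // full match beginning at this start index
            }
            // On a mismatch, everything learned about the sequence is discarded
            // and the comparison restarts from start + 1.
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findOccurrences("ABABCABABABAB", "ABABAB")); // prints [5, 7]
    }
}
```

The inner loop restarts from scratch at every start index, which is the repeated work that KMP's prefix table avoids.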

## Notes

A detailed illustration of how the algorithm works is shown in the code.
1. A detailed illustration of how the algorithm works is shown in the code.
But if you have trouble understanding the implementation,
here is a good [video](https://www.youtube.com/watch?v=EL4ZbRF587g) as well.
2. A subroutine that computes the Longest Prefix Suffix (LPS) table is commonly used in the preprocessing step of KMP.
It may be useful to interpret these numbers as the number of characters matched between the suffix and prefix. <br>
Knowing how many characters of the prefix have matched tells us the position of the next character of the pattern to
match.
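
To make note 2 concrete, here is a minimal sketch of the common 0-indexed LPS formulation, as opposed to the 1-indexed table with a -1 sentinel used in KMP.java. The class `LpsSearchSketch` and its method names are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the textbook 0-indexed LPS variant; names are hypothetical.
class LpsSearchSketch {
    // lps[i] = number of chars of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    static int[] lps(String pattern) {
        int[] lps = new int[pattern.length()];
        int matched = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (matched > 0 && pattern.charAt(i) != pattern.charAt(matched)) {
                matched = lps[matched - 1]; // fall back to a shorter prefix
            }
            if (pattern.charAt(i) == pattern.charAt(matched)) {
                matched++;
            }
            lps[i] = matched;
        }
        return lps;
    }

    static List<Integer> findOccurrences(String sequence, String pattern) {
        List<Integer> found = new ArrayList<>();
        if (pattern.isEmpty()) {
            return found; // assumption: an empty pattern yields no occurrences
        }
        int[] lps = lps(pattern);
        int matched = 0; // chars matched so far == index of next pattern char to compare
        for (int i = 0; i < sequence.length(); i++) {
            while (matched > 0 && sequence.charAt(i) != pattern.charAt(matched)) {
                matched = lps[matched - 1];
            }
            if (sequence.charAt(i) == pattern.charAt(matched)) {
                matched++;
            }
            if (matched == pattern.length()) {
                found.add(i - matched + 1);  // start index of this occurrence
                matched = lps[matched - 1];  // reuse the overlapping prefix
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findOccurrences("ABABCABABABAB", "ABABAB")); // prints [5, 7]
    }
}
```

In this variant `matched` plays both roles described in the note: it counts the characters matched so far and is also the index of the next pattern character to compare.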
13 changes: 9 additions & 4 deletions src/test/java/algorithms/patternFinding/KmpTest.java
@@ -34,11 +34,16 @@ public void testEmptySequence_findOccurrences_shouldReturnStartIndices() {
@Test
public void testNoOccurence_findOccurrences_shouldReturnStartIndices() {
String seq = "abcabcabc";
String pattern = "noway";
String patternOne = "noway";
String patternTwo = "cbc";

List<Integer> indices = KMP.findOccurrences(seq, pattern);
List<Integer> expected = new ArrayList<>();
Assert.assertEquals(expected, indices);
List<Integer> indicesOne = KMP.findOccurrences(seq, patternOne);
List<Integer> expectedOne = new ArrayList<>();
Assert.assertEquals(expectedOne, indicesOne);

List<Integer> indicesTwo = KMP.findOccurrences(seq, patternTwo);
List<Integer> expectedTwo = new ArrayList<>();
Assert.assertEquals(expectedTwo, indicesTwo);
}

@Test
