From 6e47efe2e4aeb6ffc9593949ff4df222224f7b57 Mon Sep 17 00:00:00 2001
From: 4ndrelim
Date: Tue, 16 Jan 2024 00:41:38 +0800
Subject: [PATCH 1/3] docs: Refactor KMP docs
---
 .../java/algorithms/patternFinding/README.md | 33 ++++++++++++++++---
 1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/src/main/java/algorithms/patternFinding/README.md b/src/main/java/algorithms/patternFinding/README.md
index f81f1e76..ed232b1d 100644
--- a/src/main/java/algorithms/patternFinding/README.md
+++ b/src/main/java/algorithms/patternFinding/README.md
@@ -1,6 +1,9 @@
 # Knuth-Moris-Pratt Algorithm
-KMP match is a type of pattern-searching algorithm.
+## Background
+KMP match is a type of pattern-searching algorithm that improves the efficiency of naive search by avoiding unnecessary
+comparisons. It is most notable when the pattern has repeating sub-patterns.
+
 Pattern-searching problems is prevalent across many fields of CS, for instance, in text editors when searching for a pattern, in computational biology sequence matching problems, in NLP problems, and even for looking for file patterns for effective file management.
@@ -9,9 +12,31 @@ It is hence crucial that we develop an efficient algorithm.
 ![KMP](../../../../../docs/assets/images/kmp.png)
 Image Source: GeeksforGeeks
-## Analysis
-
-**Time complexity**:
+### Intuition
+It's efficient because it utilizes the information gained from previous character comparisons. When a mismatch occurs,
+the algorithm uses this information to skip over as many characters as possible.
+
+Considering the string pattern:
+
+ "XYXYCXYXYF"
+
+and string:
+
+ XYXYCXYXYCXYXYFGABC
+
+
+KMP has, during its initial processing of the pattern, identified that "XYXY" is a repeating sub-pattern.
+This means when the mismatch at F (10th character of the pattern) and C (10th character of the string) occurs,
+KMP doesn't need to start matching again from the very beginning of the pattern.
+Instead, it leverages the information that "XYXY" has already been matched.
+
+Therefore, the algorithm continues matching from the 5th character of the pattern string (C in "XYXYCXYXYF").
+It checks this against the 10th character of the string (C in "XYXYCXYXYCXYXYFGABC").
+Since they match, the algorithm continues from there without re-checking the initial "XYXY".
+
+## Complexity Analysis
+Let k be the length of the pattern and n be the length of the string to match against.
+**Time complexity**: O(n+k)
 Naively, we can look for patterns in a given sequence in O(nk) where n is the length of the sequence and k is the length of the pattern. We do this by iterating every character of the sequence, and look at the

From 7d67812c16fde3da595ad13e2970666b2c75a6c8 Mon Sep 17 00:00:00 2001
From: 4ndrelim
Date: Tue, 16 Jan 2024 01:42:03 +0800
Subject: [PATCH 2/3] refactor: KMP class for better clarity
---
 .../java/algorithms/patternFinding/KMP.java | 123 +++++++++---------
 .../java/algorithms/patternFinding/README.md | 7 +-
 .../algorithms/patternFinding/KmpTest.java | 13 +-
 3 files changed, 76 insertions(+), 67 deletions(-)
diff --git a/src/main/java/algorithms/patternFinding/KMP.java b/src/main/java/algorithms/patternFinding/KMP.java
index 0bb5d30b..cfcf65c8 100644
--- a/src/main/java/algorithms/patternFinding/KMP.java
+++ b/src/main/java/algorithms/patternFinding/KMP.java
@@ -6,111 +6,112 @@
 /**
  * Implementation of KMP.
  *

- * Illustration of getPrefixIndices: with pattern ABCABCNOABCABCA
- * Here we make a distinction between position and index. The position is basically 1-indexed.
- * Note the return indices are still 0-indexed of the pattern string.
+ * Illustration of getPrefixTable: with pattern ABCABCNOABCABCA
+ * We consider 1-indexed positions. Position 0 will be useful later as a trick to indicate that there are no prefix matches
  * Position: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
- * Pattern: A B C A B C N O A B C A B C A ...
- * Return: -1 0 0 0 1 2 3 0 0 1 2 3 4 5 6 4 ...
- * Read: ^ an indexing trick; consider 1-indexed characters for clarity and simplicity in the main algor
- * Read: ^ 'A' is the first character of the pattern string,
- * there is no prefix ending before its index, 0, that can be matched with.
- * Read: ^ ^ 'B' and 'C' cannot be matched with any prefix which are just 'A' and 'AB' respectively.
- * Read: ^ Can be matched with an earlier 'A'. So we store 1.
- * Prefix is the substring from idx 0 to 1 (exclusive). Note consider prefix from 0-indexed.
- * Realise 1 can also be interpreted as the index of the next character to match against!
- * Read: ^ ^ Similarly, continue matching
- * Read: ^ ^ No matches, so 0
- * Read: ^ ^ ^ ^ ^ ^ Match with prefix until position 6!
- * Read: ^ where the magic happens, we can't match 'N'
- * at position 7 with 'A' at position 15, but
- * we know ABC of position 1-3 (or index 0-2)
- * exists and can 'restart' from there.
- *

- *

+ * Pattern: A B C A B C N O A B C A B C A ...
+ * Return: -1 0 0 0 1 2 3 0 0 1 2 3 4 5 6 4 ... CAN BE READ AS NUM OF CHARS MATCHED
+ * Read: ^ -1 can be interpreted as invalid number of chars matched but exploited for simplicity in the main algor.
+ * Read: ^ 'A' is the first character of the pattern, there is no prefix ending before itself, to match.
+ * Read: ^ ^ 'B' and 'C' cannot be matched with any prefix which are just 'A' and 'AB' respectively.
+ * Read: ^ can be matched with an earlier prefix, 'A'. So we store 1, the number of chars matched.
+ * Realise 1 can also be interpreted as the index of the next character to match against!
+ * Read: ^ ^ Similarly, continue matching
+ * Read: ^ ^ No matches, so 0
+ * Read: ^ ^ ^ ^ ^ ^ Match with prefix, "ABCABC", until 6th char
+ * of pattern string.
+ * Read: ^ where the magic happens, we can't match 'N'
+ * at position 7 with 'A' at position 15, but
+ * we know "ABC" exists as an earlier sub-pattern
+ * from 1st to 3rd and start matching the 4th
+ * char onwards.
  *

  * Illustration of main logic:
  * Pattern: ABABAB
  * String : ABABCABABABAB
  *

- * A B A B C A B A B A B A B
- * Read: ^ to ^ Continue matching where possible, leading to Pattern[0:4] matched.
- * unable to match Pattern[4]. But notice that last two characters of String[0:4]
- * form a sub-pattern with Pattern[0:2] Maybe Pattern[2] == 'C' and we can 're-use' Pattern[0:2]
- * Read: ^ try ^ by checking if Pattern[2] == 'C'
+ * A B A B C A B A B A B A B
+ * Read: ^ to ^ Continue matching where possible, leading to 1st 4 characters matched.
+ * unable to match Pattern[4]. But notice that last two characters
+ * form a sub-pattern with the 1st 2, Maybe Pattern[2] == 'C' and we can 're-use' "AB"
+ * Read: ^ ^ check if Pattern[2] == 'C'
  * Read: Turns out no. No previously identified sub-pattern with 'C'. Restart matching Pattern.
- * Read: ^ to ^ Found complete match! But rather than restart, notice that last 4 characters
- * Read: form a prefix sub-pattern of Pattern, which is Pattern[0:4] = "ABAB", so,
- * Read: ^ ^ Start matching from Pattern[4] and finally Pattern[5]
+ * Read: ^ ^ Found complete match! But rather than restart, notice that last 4 characters
+ * Read: of "ABABAB" form a prefix sub-pattern of Pattern, which is "ABAB", so,
+ * Read: ^ reuse ^ ^ then match 5th and 6th char of pattern which happens to be "AB"
  */
 public class KMP {
     /**
-     * Find and indicate all suffix that match with a prefix.
+     * Captures the longest prefix which is also a suffix for every substring starting at index 0. Does this
+     * by tracking the number of characters (of the prefix and suffix) matched.
      *
      * @param pattern to search
-     * @return an array of indices where the suffix ending at each position of they array can be matched with
-     * corresponding a prefix of the pattern ending before the specified index
+     * @return an array of indices
      */
-    private static int[] getPrefixIndices(String pattern) {
+    private static int[] getPrefixTable(String pattern) {
+        // 1-indexed implementation
         int len = pattern.length();
-        int[] prefixIndices = new int[len + 1];
-        prefixIndices[0] = -1;
-        prefixIndices[1] = 0; // 1st character has no prefix to match with
+        int[] numCharsMatched = new int[len + 1];
+        numCharsMatched[0] = -1;
+        numCharsMatched[1] = 0; // 1st character has no prefix to match with
         int currPrefixMatched = 0; // num of chars of prefix pattern currently matched
-        int pos = 2; // Starting from the 2nd character, recall 1-indexed
+        int pos = 2; // Starting from the 2nd character
         while (pos <= len) {
             if (pattern.charAt(pos - 1) == pattern.charAt(currPrefixMatched)) {
                 currPrefixMatched += 1;
                 // note, the line below can also be interpreted as the index of the next char to match
-                prefixIndices[pos] = currPrefixMatched; // an indexing trick, store at the pos, num of chars matched
+                numCharsMatched[pos] = currPrefixMatched;
                 pos += 1;
             } else if (currPrefixMatched > 0) {
                 // go back to a previous known match and try to match again
-                currPrefixMatched = prefixIndices[currPrefixMatched];
+                currPrefixMatched = numCharsMatched[currPrefixMatched];
             } else {
                 // unable to match, time to move on
-                prefixIndices[pos] = 0;
+                numCharsMatched[pos] = 0;
                 pos += 1;
             }
         }
-        return prefixIndices;
+        return numCharsMatched;
     }
     /**
-     * Main logic of KMP. Iterate the sequence, looking for patterns. If a difference is found, resume matching from
-     * a previously identified sub-pattern, if possible. Length of pattern should be at least one.
-     *
+     * Main logic of KMP. Iterate the sequence, looking for patterns. If a mismatch is found, resume matching from
+     * a previously identified sub-pattern, if possible. Here we assume length of pattern is at least one.
      * @param sequence to search against
      * @param pattern to search for
      * @return start indices of all occurrences of pattern found
      */
     public static List<Integer> findOccurrences(String sequence, String pattern) {
-        assert pattern.length() >= 1 : "Pattern length cannot be 0!";
-
         int sLen = sequence.length();
         int pLen = pattern.length();
-        int[] prefixIndices = getPrefixIndices(pattern);
+        int[] prefixTable = getPrefixTable(pattern);
         List<Integer> indicesFound = new ArrayList<>();
-        int s = 0;
-        int p = 0;
+        int sTrav = 0;
+        int pTrav = 0;
-        while (s < sLen) {
-            if (pattern.charAt(p) == sequence.charAt(s)) {
-                p += 1;
-                s += 1;
-                if (p == pLen) {
-                    // occurrence found
-                    indicesFound.add(s - pLen); // start index of this occurrence
-                    p = prefixIndices[p]; // reset
+        while (sTrav < sLen) {
+            if (pattern.charAt(pTrav) == sequence.charAt(sTrav)) {
+                pTrav += 1;
+                sTrav += 1;
+                if (pTrav == pLen) { // matched a complete pattern string
+                    indicesFound.add(sTrav - pLen); // start index of this occurrence
+                    // recall the number of chars matched in p can be read as the index of the next char in p to match
+                    pTrav = prefixTable[pTrav]; // start matching from a repeated sub-pattern, if possible
                 }
             } else {
-                p = prefixIndices[p];
-                if (p < 0) { // move on
-                    p += 1;
-                    s += 1;
+                pTrav = prefixTable[pTrav];
+                if (pTrav < 0) { // move on; using -1 trick
+                    pTrav += 1;
+                    sTrav += 1;
                 }
+                // ALTERNATIVELY
+                // if pTrav == 0 i.e. nothing matched, move on
+                // sTrav += 1
+                // continue
+                //
+                // pTrav = prefixTable[pTrav]
             }
         }
         return indicesFound;
diff --git a/src/main/java/algorithms/patternFinding/README.md b/src/main/java/algorithms/patternFinding/README.md
index ed232b1d..bf5cd3e8 100644
--- a/src/main/java/algorithms/patternFinding/README.md
+++ b/src/main/java/algorithms/patternFinding/README.md
@@ -50,7 +50,10 @@ O(n) traversal of the sequence.
 More details found in the src code.
 **Space complexity**: O(k) auxiliary space to store suffix that matches with prefix of the pattern string
 ## Notes
-
-A detailed illustration of how the algorithm works is shown in the code.
+1. A detailed illustration of how the algorithm works is shown in the code. But if you have trouble understanding the implementation, here is a good [video](https://www.youtube.com/watch?v=EL4ZbRF587g) as well.
+2. A subroutine to find Longest Prefix Suffix (LPS) is commonly involved in the preprocessing step of KMP.
+It may be useful to interpret these numbers as the number of characters matched between the suffix and prefix.
+Knowing the number of characters of prefix would help in informing the position of the next character of the pattern to
+match.
diff --git a/src/test/java/algorithms/patternFinding/KmpTest.java b/src/test/java/algorithms/patternFinding/KmpTest.java
index 1d795d64..0647817b 100644
--- a/src/test/java/algorithms/patternFinding/KmpTest.java
+++ b/src/test/java/algorithms/patternFinding/KmpTest.java
@@ -34,11 +34,16 @@ public void testEmptySequence_findOccurrences_shouldReturnStartIndices() {
     @Test
     public void testNoOccurence_findOccurrences_shouldReturnStartIndices() {
         String seq = "abcabcabc";
-        String pattern = "noway";
+        String patternOne = "noway";
+        String patternTwo = "cbc";
-        List<Integer> indices = KMP.findOccurrences(seq, pattern);
-        List<Integer> expected = new ArrayList<>();
-        Assert.assertEquals(expected, indices);
+        List<Integer> indicesOne = KMP.findOccurrences(seq, patternOne);
+        List<Integer> expectedOne = new ArrayList<>();
+        Assert.assertEquals(expectedOne, indicesOne);
+
+        List<Integer> indicesTwo = KMP.findOccurrences(seq, patternTwo);
+        List<Integer> expectedTwo = new ArrayList<>();
+        Assert.assertEquals(expectedTwo, indicesTwo);
     }
     @Test

From 5aeb10bfd47192c4080b6751e274869c342a6357 Mon Sep 17 00:00:00 2001
From: 4ndrelim
Date: Tue, 6 Feb 2024 01:23:09 +0800
Subject: [PATCH 3/3] docs: Improve wording for KMP helper method
---
 src/main/java/algorithms/patternFinding/KMP.java | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/main/java/algorithms/patternFinding/KMP.java b/src/main/java/algorithms/patternFinding/KMP.java
index cfcf65c8..f55ca1a3 100644
--- a/src/main/java/algorithms/patternFinding/KMP.java
+++ b/src/main/java/algorithms/patternFinding/KMP.java
@@ -42,8 +42,8 @@
  */
 public class KMP {
     /**
-     * Captures the longest prefix which is also a suffix for every substring starting at index 0. Does this
-     * by tracking the number of characters (of the prefix and suffix) matched.
+     * Captures the longest prefix which is also a suffix for some substring ending at each index, starting from 0.
+     * Does this by tracking the number of characters (of the prefix and suffix) matched.
      *
      * @param pattern to search
      * @return an array of indices
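
To make the "longest prefix which is also a suffix" wording concrete, the following is a minimal standalone sketch that mirrors the logic of `getPrefixTable` and `findOccurrences` from this series and runs it on the README example. The class name `KmpSketch`, the method name `prefixTable`, and the shortened variable names are assumptions made for illustration and are not part of the patch. Like the patched code, the sketch assumes a non-empty pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of the patch series.
public class KmpSketch {
    // table[i] = number of characters of the longest proper prefix of the first i
    // characters that is also a suffix of them; table[0] = -1 is the sentinel.
    public static int[] prefixTable(String pattern) {
        int len = pattern.length(); // assumed >= 1, as in the patched code
        int[] table = new int[len + 1];
        table[0] = -1;
        int matched = 0; // prefix characters currently matched
        int pos = 2;     // 1-indexed position, starting from the 2nd character
        while (pos <= len) {
            if (pattern.charAt(pos - 1) == pattern.charAt(matched)) {
                matched += 1;
                table[pos] = matched;
                pos += 1;
            } else if (matched > 0) {
                matched = table[matched]; // fall back to a shorter known match
            } else {
                table[pos] = 0;
                pos += 1;
            }
        }
        return table;
    }

    public static List<Integer> findOccurrences(String sequence, String pattern) {
        int[] table = prefixTable(pattern);
        List<Integer> found = new ArrayList<>();
        int s = 0;
        int p = 0;
        while (s < sequence.length()) {
            if (pattern.charAt(p) == sequence.charAt(s)) {
                p += 1;
                s += 1;
                if (p == pattern.length()) {
                    found.add(s - pattern.length()); // start index of this occurrence
                    p = table[p];                    // reuse the matched prefix, if any
                }
            } else {
                p = table[p];
                if (p < 0) { // sentinel reached: nothing to reuse, advance in the sequence
                    p += 1;
                    s += 1;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [-1, 0, 0, 1, 2, 0, 1, 2, 3, 4, 0]
        System.out.println(Arrays.toString(prefixTable("XYXYCXYXYF")));
        // prints [5]
        System.out.println(findOccurrences("XYXYCXYXYCXYXYFGABC", "XYXYCXYXYF"));
    }
}
```

The entry at position 9 is 4, which matches the README's description: after the mismatch at the 10th character, matching resumes from the 5th character of the pattern rather than from the start.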