Introduction
Pattern matching is an essential task in computer science, crucial for text processing, data mining, and search engines. Among the various algorithms developed for this purpose, the Boyer-Moore algorithm stands out for its efficiency, especially with large texts. Named after Robert Boyer and J Strother Moore, this algorithm uses preprocessing of the pattern to allow fast scanning of the text by skipping sections that do not match, based on character comparisons from the end of the pattern. This article will guide you through the implementation of the Boyer-Moore algorithm in C++ with practical examples to illustrate its effectiveness and output.
Prerequisites
Before diving into the implementation, it’s beneficial to have:
- Basic understanding of algorithms: Knowledge of string matching and basic algorithm principles.
- Familiarity with C++ programming: Proficiency in C++ syntax, arrays, and functions.
- Understanding of data structures: Specifically, arrays and their manipulation.
Boyer-Moore Algorithm Structure
Definition and Formula
The Boyer-Moore algorithm preprocesses the pattern to create two arrays: the bad character heuristic and the good suffix heuristic. These heuristics guide the algorithm on how far to shift the pattern when a mismatch occurs, leading to significant performance improvements over naive methods.
Bad Character Heuristic
For a given character in the pattern, the bad character heuristic determines how far the pattern can be shifted if a mismatch occurs. This heuristic is computed as follows:
\[
\text{Shift} = \max(1, j – \text{lastOccurrence}[t])
\]
Where \( j \) is the current position in the pattern, and \(\text{lastOccurrence}[t]\) is the position of the last occurrence of the mismatched character in the pattern.
Good Suffix Heuristic
The good suffix heuristic determines the shift based on the longest suffix of the pattern that matches a substring of the text. This ensures that the pattern is shifted to the right position to find potential matches.
1. Implementing the Boyer-Moore Algorithm
1.1 Boyer-Moore Class Definition
Below is the class definition for the Boyer-Moore algorithm, including methods for preprocessing the pattern and searching within the text.
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
class BoyerMoore {
public:
BoyerMoore(const std::string& pattern);
void search(const std::string& text);
private:
void preprocessBadCharacter();
void preprocessGoodSuffix();
std::string pattern;
int m;
std::vector<int> badCharShift;
std::vector<int> goodSuffixShift;
std::vector<int> borderPos;
};
BoyerMoore::BoyerMoore(const std::string& pat) : pattern(pat), m(pat.size()) {
badCharShift.resize(256, m);
goodSuffixShift.resize(m + 1, m);
borderPos.resize(m + 1, 0);
preprocessBadCharacter();
preprocessGoodSuffix();
}
void BoyerMoore::preprocessBadCharacter() {
for (int i = 0; i < m; ++i) {
badCharShift[pattern[i]] = m - i - 1;
}
}
void BoyerMoore::preprocessGoodSuffix() {
int i = m, j = m + 1;
borderPos[i] = j;
while (i > 0) {
while (j <= m && pattern[i - 1] != pattern[j - 1]) {
if (goodSuffixShift[j] == m) {
goodSuffixShift[j] = j - i;
}
j = borderPos[j];
}
i--;
j--;
borderPos[i] = j;
}
j = borderPos[0];
for (i = 0; i <= m; i++) {
if (goodSuffixShift[i] == m) {
goodSuffixShift[i] = j;
}
if (i == j) {
j = borderPos[j];
}
}
}
void BoyerMoore::search(const std::string& text) {
int n = text.size();
int s = 0;
while (s <= n - m) {
int j = m - 1;
while (j >= 0 && pattern[j] == text[s + j]) {
j--;
}
if (j < 0) {
std::cout << "Pattern found at index " << s << std::endl;
s += goodSuffixShift[0];
} else {
s += std::max(goodSuffixShift[j + 1], badCharShift[text[s + j]] - m + 1 + j);
}
}
}
1.2 Example Usage
Let’s demonstrate the usage of the Boyer-Moore algorithm with a simple example where we search for a pattern in a text.
int main() {
BoyerMoore bm("EXAMPLE");
std::string text = "HERE IS A SIMPLE EXAMPLE";
bm.search(text);
return 0;
}
1.3 Output for Example 1
Pattern found at index 17
2. Additional Examples
2.1 Example 2: Pattern in the Middle
In this example, we will search for a pattern that appears in the middle of the text.
int main() {
BoyerMoore bm("TEST");
std::string text = "THIS IS A TEST TEXT FOR TESTING";
bm.search(text);
return 0;
}
Output for Example 2
Pattern found at index 10
Pattern found at index 24
2.2 Example 3: Pattern with Multiple Occurrences
Here, we search for a pattern that has multiple occurrences in the text.
int main() {
BoyerMoore bm("ABCD");
std::string text = "ABCD ABCD ABCD ABCD";
bm.search(text);
return 0;
}
Output for Example 3
Pattern found at index 0
Pattern found at index 5
Pattern found at index 10
Pattern found at index 15
Conclusion
The Boyer-Moore algorithm is a powerful tool for pattern matching, leveraging two key heuristics—the bad character heuristic and the good suffix heuristic—to skip sections of the text that do not match the pattern. This results in significant performance improvements, especially with large texts.