Talk:Boyer–Moore string-search algorithm

(Redirected from Talk:Boyer–Moore string search algorithm)
Latest comment: 7 months ago by Bamgm14 in topic Python Code is incorrect

Second Table

edit

The second table is poorly explained and may be incorrect. Can anyone look into this see if its done correctly on this page? Or better describe it. I have reread the description several times and still can't figure it out. Is the table correct?


No its not correct, I'm about to correct it right now -- Hurda


I have to disagree here, the original version was correct, it should be reverted. Consider the following scenario - you are looking for "ANPANMAN" in "xxxxAMANPANMANyy":
xxxxAMANPANMANyy
ANPANMAN
If you now shift by 8 characters (as the current version of the second table says), you'll miss:
xxxxAMANPANMANyy
        ANPANMAN
The previous version of the second table correctly suggests shift by 6 characters leading to a find:
xxxxAMANPANMANyy
      ANPANMAN  —Preceding unsigned comment added by 82.100.5.250 (talk) 22:37, 13 July 2008 (UTC)Reply 
Oki My fault, I did not notice this, I have reverted the changes --Hurda

Concrete Examples of second table

edit

For the sake of newbies and clearity, I think that you should include a concrete example showing the search string and the text being searched for the string. The ANPANMAN example isn't clearly showing the formation of the second table. I am trying to clearly understand it, maybe I'll add the example.

First Table

edit

The first table ( ) gives the next offset at which a match could possibly occur given the last (nonmatching) character encountered while comparing the pattern with the text; the second ( ) gives a similar offset based on the number of characters that matched correctly. When a mismatch occurs, the largest of these two values is used to skip ahead to the next position where a match could potentially occur.

Does anyone know the reason that the somewhat obscure references of   and   are used here? Were those perhaps the way they were referenced in the original publications about the algorithm? I'm replacing this text with a (hopefully!) more straightforward explanation but I'm wondering why they were referenced that way. -- Antaeus Feldspar 05:04, 19 Dec 2004 (UTC)
Yes   and   are the names a lot of technical papers give to these functions. -- 14:03, 13 Feb 2007 (UTC)

who and when

edit
Does anyone know when the Boyer-Moore algorithm was first published, so we can state it? I added the full names of Boyer and Moore. RJFJR 03:16, Jan 12, 2005 (UTC)

Yes, 1977. See R. BOYER AND S. MOORE, A fast string matching algorithm, CACM, 20 (1977), pp. 762-772.

What about Gosper? Sedgewick, in "Algorithms" says Gosper independently invented it at about the same time. I've occasionally heard the algorithm called Boyer-Moore-Gosper. Anyone know more? 203.164.223.88 11:05, 17 December 2006 (UTC)Reply

No need for the second jump table.

edit

Worse-case performance is contradictory: Reference for 3n complexity proof

edit

I cannot find the paper that proves it online, however http://www.cs.nyu.edu/cs/faculty/cole/papers/CHPZ95.ps refers to it. Someone may wish to edit the page to provide that information.

The only online sources for the 1991 paper I could find required payment, but the CHPZ295.ps paper does have this information:
that in 1977 it was shown that the Boyer-Moore algorithm (BM) makes at most 6n comparisions, this was reduced to 4n in 1980 by Guibas and Odlyzko (ref GO80), and that in 1991 Cole et al had further proven the upper bound to be  . I don't understand the omega notation in this form. It goes on to cite paoers that show simple variants of BM have upper bounds of 2n. GO80 is A new proof of the linearity of the Boyer-Moore string searching algorithm, SIAM Journal of Computing 9 (1980) p672-682. And when I expanded the article's 3*N paragraph I noticed the preceeding paragraph talks about N*M which looks wrong, and contradicts the 3*N. -Wikibob 05:23, 18 December 2005 (UTC)Reply

See the following two lines in the article:
"The worst-case performance of the algorithm to find all matches is approximately N*M."
and
"The worst-case to find all occurrences in a text needs approximately 3*N comparisons, hence the complexity is O(n), regardless whether the text contains a match or not."
I believe that, depending on the period of the pattern x, each of the above statements is true. If period(x) > |x|/2, then the worst case is O(3N) -> O(N). Otherwise, the worst case is O(N*M). I haven't posted this to the article page yet because I'm still learning about the algorithm. --Tekhnofiend 21:46, 2 December 2006 (UTC)Reply

I beleive that the original paper has the good suffix rule correctly (strong form it is called in the wikipedia text), that is what is needed for linearity. Here is an excerpt from the paper:
"Roughly speaking, we can generalize the kind of reasoning used above and slide pat down by some amount so that the discovered occurence of subpat in string is aligned with the rightmost occurence of subpat in pat which is not preceded by the character preceding its terminal occurence in pat."
(See section 2 second column, third para on page 763 in Communications of the ACM Volume 20 Number 10, October 1977)
The above was in the introductory section for the algorithm. In section 4 (page 765, left column, first para) a definition using strict mathematical notation says:
"[pat(j+1) ... pat(patlen)] and [pat(k) ... pat(k + patlen - j - 1)] unify and either k <= 1 or pat(k - 1) not equal to pat(j)" (not equal with a sign in the original)
Appears completely matching the strong good suffix rule regarding the difference of the unmatched character. My suspicion is that someone made a mistake in an implementation and that started this rumor about the original paper having it wrong. I plan to make the correction, please comment here if you think I missed something.--Sustik (talk) 01:26, 3 February 2010 (UTC)Reply

It is my understanding that Boyer-Moore runs in   time in the worst case - consider a pattern and text which both consist of the same repeated character. The pattern will then shift by only one index between each phase, and all characters in the pattern will be explicitly checked for equality every phase. The   limit is explicitly stated in the textbook Algorithms on Strings, Trees, and Sequences by Dan Gusfield in chapter 2. In chapter 3 it is explained that the comparison operation is made more efficient using the Apostolico-Giancarlo algorithm, which provably makes   comparisons in the worst case. Pending an in-depth look at the Boyer-Moore implementation described here, I will edit the article to reflect this information; it is my understanding the linear-time limits were only proved when the pattern does not occur in the text.
Andrew Helwer (talk) 18:43, 2 April 2012 (UTC)Reply
In a 2010 book by Knuth "Selected Papers on Design of Algorithms" Chapter 9 it is not completely clear but I believe the text should be interpreted as: If there is no match, or if you only report the first occurrence of a match, then the algorithm runs in worst-case linear time. If you report all occurrences of a match then the worst case is quadratic. The distinction between first occurrence and all occurrences may be why there is some confusion. Aureooms (talk) 20:34, 12 May 2021 (UTC)Reply

Wikipedia self-reference

edit

I'm thinking that "WIKIPEDIA" should not be used as an example string, for the purpose of avoiding self-references. ~ Booyabazooka 20:06, 4 April 2006 (UTC)Reply

Would be nice to have both example tables use the same word. 12.150.184.6 16:58, 10 April 2006 (UTC)Reply

I agree that WIKIPEDIA should not be used as an example word, but that we should use the same word throughout. I'll fix it. Deco 06:03, 28 May 2006 (UTC)Reply

About the example code

edit

The example code I wrote on this page (the one with boyermoore_needlematch function) is quite slow.
I created it by literally translating the specs (the explanation how the algorithm works) into C code. I'm afraid it does not satisfy readers who are in search of a fast search algorithm example code. I would like it to be replaced with a code that is fast, but that is not unreadable.
The suffixes and preBmGs functions at [1] are very obfuscated in my opinion; they (and derivatives thereof) are not suitable for example code.
Is it possible to come up with a reasonably fast example that is not obfuscated? For comparison, the code at Boyer-Moore-Horspool algorithm is quite simple. It is this extra step, i.e. calculation of the "skip" array that is at question. --Bisqwit 08:59, 11 December 2006 (UTC)Reply


The python example code is wrong. For some inputs it doesn't have correct output and also its worst time complexity is O(nm) and not O(n+m). The galils rule is not implemented correctly. — Preceding unsigned comment added by 188.167.15.119 (talk) 21:20, 24 November 2016 (UTC)Reply

better way to explain

edit

I think the Boyer-Moore algorithm might be easier to understand if we:

  • explain how the first table works all by itself, Skip1 -- the Boyer-Moore-Horspool algorithm
  • explain how the second table works all by itself, Skip2 -- (or would it work all by itself? Is there a name for this algorithm? I suspect it might work well for searching bit-sequences or DNA base-pair sequences)

And finally summarize it by

  • the Boyer-Moore algorithm, whenever it hits a mismatch, skips ahead the *maximum* of Skip1 and Skip2 (since all shorter skips have been ruled out).

--68.0.120.35 14:49, 22 March 2007 (UTC)Reply

First Table

edit

Shouldn't the value of N be 3 in the first table in the article? If it's zero, then you get an infinite loop.

The second table

     ANPANMAN
     --------
           n  1
   aN         8
       mAN    3
   nMAN       6
  aNMAN       6
 pANMAN       6
nPANMAN       6

aNPANMAN 6 First 2 entries are slightly confusing. n 1 'n 1' means to match a character that is not 'n' then shift 1 to the left Second entry says 'aN 8' which means not the character a followed by 'N' needs 8 shifts to the left. But if you shift 8 to the left you don't match anything (the pattern is 8 characters long). This I understand - 'mAN 3' - not the character m followed by AN [pattern and partial pattern match]

ANPANMAN
     mAN<-fail    shift 0, partial pattern mAN matches and fails
    mAN <-fail    shift 1, 
   mAN  <-fail    shift 2
  mAN   <-success shift 3
nMAN 6
       <pattrn>

ANPANMAN----------------

           nMAN
          nMAN
         nMAN
        nMAN
       nMAN
      nMAN
     nMAN    <-success shift 6 from right, partial pattern nMAN only matches AN of pattern

==========Boyer-Moore algorithm in Robert Sedgewick 'Algorithms' is slightly different? ANPANMAN N=0 A=1 M=2 N=3 A=4 P=5 N=6 A=7 Ignore duplicates and use lowest gives N=0 A=1 M=2 P=5 (note: if you increment all these then you get the second table) All other letters =8, the length of the pattern It compares the last character of the pattern and the character of the text to search

 STING   <--pattern

CONSISTING<--text G compares with T T gives a shift value of 3 (T is 3 from right of pattern, assume last character position is 0) We move the pattern 3 to the right

    STING

CONSISTING Shift Table G=0 N=1 I=2 T=3 S=4 any other=5 Use the character in the text as an index to the skip table array. skip(0 to 255), 256 ascii characters? Initialise all to length of pattern Then go through pattern from right and set skip array index character to its position in the pattern. E.g. Asc("T")=84 'get ascii value of T skip(Asc("T"))=position in pattern string from right

    STING

CONSISTIGG----- G Compares with G N Compares with G Skip(Asc("G"))=0 Because we compare the second character in the pattern from the right we add 2, so we skip 2 to the right?

    STING

CONSIGTING---- S Compares with G Skip(Asc("G"))=0 + position of pattern compare character (in this case 5)? Sorry mangled text

Revised example programs

edit

I revised the example functions for the Boyer-Moore and the Boyer-Moore-Horspool search algorithms and added a Turbo-Boyer-Moore example. However, I believe this code requires some refitting before it is suitable for an educational example. These are written in C++, and are fairly optimal. I have tested them extensively to ensure they produce correct results. --Bisqwit 08:27, 21 August 2007 (UTC)Reply

Common functions

edit
#include <vector>

typedef std::vector<size_t> occtable_type;
typedef std::vector<size_t> skiptable_type;

/* This function compares the two strings starting at ptr1 and ptr2,
 * which are assumed to be strlen bytes long, and returns a size_t
 * indicating how many bytes were identical, counting from the _end_
 * of both strings.
 * It is an utility function used by the actual Boyer-Moore algorithms.
 */
const size_t backwards_match_len(
    const unsigned char* ptr1,
    const unsigned char* ptr2,
    size_t strlen,
    size_t maxlen,
    size_t minlen)
{
    size_t result = minlen;
    while(result < maxlen && ptr1[strlen-1-result] == ptr2[strlen-1-result])
        ++result;
    return result;
}

Building of the tables

edit
/* This function creates an occ table to be used by the search algorithms. */
/* It only needs to be created once per a needle to search. */
const occtable_type
    CreateOccTable(const unsigned char* needle, size_t needle_length)
{
    occtable_type occ(UCHAR_MAX+1, needle_length); // initialize a table of UCHAR_MAX+1 elements to value needle_length

    /* Populate it with the analysis of the needle */
    /* But ignoring the last letter */
    if(needle_length >= 1)
    {
        const size_t needle_length_minus_1 = needle_length-1;
        for(size_t a=0; a<needle_length_minus_1; ++a)
            occ[needle[a]] = needle_length_minus_1 - a;
    }
    return occ;
}

/* This function creates a skip table to be used by the search algorithms. */
/* It only needs to be created once per a needle to search. */
const skiptable_type
    CreateSkipTable(const unsigned char* needle, size_t needle_length)
{
    skiptable_type skip(needle_length, needle_length); // initialize a table of needle_length elements to value needle_length
    
    if(needle_length <= 1) return skip;
    
    /* I have absolutely no idea how this works. I just copypasted
     * it from http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
     * and have since edited it in trying to seek clarity and efficiency.
     * In particular, the skip[] table creation is now interleaved within
     * building of the suff[] table, and the assumption that skip[] is
     * preinitialized into needle_length is utilized.
     * -Bisqwit
     */
    
    const size_t needle_length_minus_1 = needle_length-1;
    
    std::vector<ssize_t> suff(needle_length);
    
    suff[needle_length_minus_1] = needle_length;
    
    ssize_t f = 0;
    ssize_t g = needle_length_minus_1;
    size_t j = 0; // index for writing into skip[]
    for(ssize_t i = needle_length-2; i >= 0; --i) // For each suffix length
    {
        if(i > g)
        {
            const ssize_t tmp = suff[i + needle_length_minus_1 - f];
            if (tmp < i - g)
            {
                suff[i] = tmp;
                goto i_done;
            }
        }
        else
            g = i;

        f = i;
        
        g -= backwards_match_len(
                needle,
                needle + needle_length_minus_1 - f,
                g+1, // length of string to test
                g+1, // maximum match length
                0 // assumed minimum match length
              );

        suff[i] = f - g;
    i_done:;
    
        if(suff[i] == i+1) // This "if" matches sometimes. Less so on random data.
        {
            // Probably, this only happens when the needle contains self-similarity.
            size_t jlimit = needle_length_minus_1 - i;
            while(j < jlimit)
                skip[j++] = jlimit;
        }
    }
    /* j may be smaller than needle_length here. It doesn't matter, because
     * if we ran the skip[] initialization loop until j < needle_length (when i=-1),
     * we would only set the value as needle_length, which it already is.
     */

    for (size_t i = 0; i < needle_length_minus_1; ++i)
        skip[needle_length_minus_1 - suff[i]] = needle_length_minus_1 - i;

    return skip;
}
edit
/* A Boyer-Moore-Horspool search algorithm. */
/* If it finds the needle, it returns an offset to haystack from which
 * the needle was found. Otherwise, it returns haystack_length.
 */
const size_t SearchInHorspool(const unsigned char* haystack, const size_t haystack_length,
    const occtable_type& occ,
    const unsigned char* needle,
    const size_t needle_length)
{
    if(needle_length > haystack_length) return haystack_length;
    if(needle_length == 1)
    {
        const unsigned char* result = (const unsigned char*)std::memchr(haystack, *needle, haystack_length);
        return result ? result-haystack : haystack_length;
    }
    
    const size_t needle_length_minus_1 = needle_length-1;
    
    const unsigned char last_needle_char = needle[needle_length_minus_1];
    
    size_t haystack_position=0;
    while(haystack_position <= haystack_length-needle_length)
    {
        const unsigned char occ_char = haystack[haystack_position + needle_length_minus_1];
        
        if(last_needle_char == occ_char
        && std::memcmp(needle, haystack+haystack_position, needle_length_minus_1) == 0)
        {
            return haystack_position;
        }

        haystack_position += occ[occ_char];
    }
    return haystack_length;
}
edit
/* A Boyer-Moore search algorithm. */
/* If it finds the needle, it returns an offset to haystack from which
 * the needle was found. Otherwise, it returns haystack_length.
 */
const size_t SearchIn(const unsigned char* haystack, const size_t haystack_length,
    const occtable_type& occ,
    const skiptable_type& skip,
    const unsigned char* needle,
    const size_t needle_length)
{
    if(needle_length > haystack_length) return haystack_length;

    if(needle_length == 1)
    {
        const unsigned char* result =
            (const unsigned char*)std::memchr(haystack, *needle, haystack_length);
        return result ? result-haystack : haystack_length;
    }
    
    const size_t needle_length_minus_1 = needle_length-1;
    
    size_t haystack_position=0;
    while(haystack_position <= haystack_length-needle_length)
    {
        const size_t match_len = backwards_match_len(
            needle,
            haystack+haystack_position,
            needle_length, // length of string to test
            needle_length, // maximum match length
            0 // assumed minimum match length
           );
        if(match_len == needle_length) return haystack_position;
        
        const size_t mismatch_position = needle_length_minus_1 - match_len;
        
        const unsigned char occ_char = haystack[haystack_position + mismatch_position];
        
        const ssize_t bcShift = occ[occ_char] - match_len;
        const ssize_t gcShift = skip[mismatch_position];
        
        size_t shift = std::max(gcShift, bcShift);
        
        haystack_position += shift;
    }
    return haystack_length;
}
edit
/* A Turbo Boyer-Moore search algorithm. */
/* If it finds the needle, it returns an offset to haystack from which
 * the needle was found. Otherwise, it returns haystack_length.
 */
const size_t SearchInTurbo(const unsigned char* haystack, const size_t haystack_length,
    const occtable_type& occ,
    const skiptable_type& skip,
    const unsigned char* needle,
    const size_t needle_length)
{
    if(needle_length > haystack_length) return haystack_length;

    if(needle_length == 1)
    {
        const unsigned char* result =
            (const unsigned char*)std::memchr(haystack, *needle, haystack_length);
        return result ? result-haystack : haystack_length;
    }
    
    const size_t needle_length_minus_1 = needle_length-1;
    
    size_t haystack_position = 0;
    size_t ignore_num = 0, shift = needle_length;
    while(haystack_position <= haystack_length-needle_length)
    {
        size_t match_len;
        if(ignore_num == 0)
        {
            match_len = backwards_match_len(
                needle,
                haystack+haystack_position,
                needle_length, // length of string to test
                needle_length, // maximum match length
                0 // assumed minimum match length
               );
            if(match_len == needle_length) return haystack_position;
        }
        else
        {
            match_len =
                backwards_match_len(needle, haystack+haystack_position,
                    needle_length, // length of string to test
                    shift, // maximum match length
                    0 // assumed minimum match length
                   );
            
            if(match_len == shift)
            {
                match_len =
                    backwards_match_len(needle, haystack+haystack_position,
                        needle_length, // length of string to test
                        needle_length, // maximum match length
                        shift + ignore_num // assumed minimum match length
                      );
            }
            if(match_len >= needle_length) return haystack_position;
        }
        
        const size_t mismatch_position = needle_length_minus_1 - match_len;
        
        const unsigned char occ_char = haystack[haystack_position + mismatch_position];
        
        const ssize_t bcShift = occ[occ_char] - match_len;
        const size_t gcShift  = skip[mismatch_position];
        const ssize_t turboShift = ignore_num - match_len;
        
        shift = std::max(std::max((ssize_t)gcShift, bcShift), turboShift);
        
        if(shift == gcShift)
            ignore_num = std::min( needle_length - shift, match_len);
        else
        {
            if(turboShift < bcShift && ignore_num >= shift)
                shift = ignore_num + 1;
            ignore_num = 0;
        }
        haystack_position += shift;
    }
    return haystack_length;
}

Multiple byte character sets?

edit

The example implementation in the article uses an auto-allocated array called occ to hold the first table. But how big will occ be for a string encoded in UTF-16 or UTF-32? Should the article clarify that the example is intended for use with 8-bit encodings? I'd imagine that occ would have to be an associative array in such cases because it will be filled so sparsely. --Damian Yerrick (talk | stalk) 21:55, 28 December 2007 (UTC)Reply

That is a valid point. In 16-bit or 32-bit implementations, I would probably use an associative array of 256-element arrays, i.e. something like this:
      std::map<unsigned, array<size_t,256> > occ;
      /* To index: */
      unsigned ch = 0x672C; /* U+672C: Unicode CJK symbol of book */
      occ[ch >> 8][ch & 0xFF] += 1;
For the reason that even in multilanguage content, characters tend to be clustered around some particular region of unicode. (For example, in Russian text; most of the symbols fall to the cyrillic range.) Then again, how effective is Boyer-Moore multibyte situations? --Bisqwit (talk) 09:18, 2 January 2008 (UTC)Reply

memset

edit

It appears that the behavior of memset from the C standard library isn't well defined across platforms except

  1. when filling memory with a value of 0,
  2. when filling arrays of signed char or unsigned char, or
  3. when your platform's ABI guarantees that a type has a specific representation.

Thegeneralguy (talk · contribs) edited the sample code to use memset to fill an array of ssize_t objects with -1 values. This is not portable, and Bisqwit (talk · contribs) reverted the edit. Two months later, Thegeneralguy made the same change again, and I reverted it again. I guess Thegeneralguy's rationale for the changes is option 3: memset with fill value of -1 works on most widespread 16-, 32-, and 64-bit platforms, all of which use a two's complement representation for ssize_t.

So here's my question: Do we require strictly conforming C, or do we relax C and assume two's complement? I propose we use ISO C, which works everywhere and is easier to understand, as opposed to what I view as premature optimization. --Damian Yerrick (talk | stalk) 23:23, 7 March 2008 (UTC)Reply

Now that you bring this up (I was unaware of this portability issue) I would be in favor of removing memset. However, memset is not premature optimization in the sense that something like this (filling up memory with a particular value) is the purpose of memset; memset(occ,-1,sizeof(occ)) should probably be used if it is decided to be used. If there is a wikipedia policy on this default to that.Thegeneralguy (talk) 23:46, 7 March 2008 (UTC)Reply

I think the point of example code is to document the algorithm, and in my opinion, a for loop documents the purpose (setting each element of the array to some integer value) better than memset (setting each byte on a memory area to some byte value) does. That said, has anyone had a look at the replacement example code posted on this talk page? --Bisqwit (talk) 13:54, 11 March 2008 (UTC)Reply

Correct code?

edit

Is this code actually correct? It looks pretty buggy (or, at best, relies on a lucky bug to be functional). It isn't referenced--isn't it WP:OR and/or WP:NOT#HOWTO, then? -- Mikeblas (talk) 15:38, 4 May 2008 (UTC)Reply

  • Can you please elaborate on what in it looks "pretty buggy" and what is the "lucky bug" it relies on? I wrote it based on the explanation in the article. A replacement suggestion is provided on this talk page, but nobody has commented on it. :-/ --Bisqwit (talk) 22:26, 4 May 2008 (UTC)Reply
    • I've removed the implementation since it's your own creation and therefore OR. The bug involves sign wrapping in the matching loop, and the exit condition for that loop. I'm sorry it took so long for me to come back to the article... -- Mikeblas (talk) 23:57, 11 January 2009 (UTC)Reply
      • I'm shocked at this definition of "original research". So for the purposes of teaching and demonstrating, one cannot use their own words to put into phrases (and in this case, code) to effectively explain it to Wikipedia readers, if that particular phrase (or code) does not exist anywhere in the web? Yet, simultaneously, one cannot verbatim-quote articles from other sites, because that would a copyright violation. There is a contradiction. I'm also not convinced of the existence of the "bug". Even after you wrote that, I checked the code and I cannot find evidence of what you claim. --Bisqwit (talk) 09:17, 13 January 2009 (UTC)Reply
        • I'm sorry that you're surprised. You can read about the policies at WP:OR and WP:V, and kind of at WP:REF. -- Mikeblas (talk) 21:53, 13 January 2009 (UTC)Reply
        • Here's how I see it: the code is there to illustrate the algorithm, and there are lots of user-made illustrations in Wikipedia, so including them is obviously generally accepted. But the claim that this code correctly implements the algorithm is original research - no reliable source supports this claim. While it's of course very useful to find a working implementation directly on the Wikipedia page, without having to follow the external links, I feel WP is not right the place to publish such things. I'd prefer having only an explanation and pseudocode (the latter can be cited from a textbook or a paper) in the article. A correct implementation would then be found by following external links. Offliner (talk) 02:14, 14 January 2009 (UTC)Reply
  • I just implemented the actual given code in java and it definitly returns the correct results. But I figured out that the given code is not implemented correctly. result[i] = size; should be result[i] = -1; (size_t is an unsigned int, so it should be at least result[i] = 0;)
  • Me Checked the code again and it must definitefly be result[i] -1, but now size_t must be changed to int. (Alternative: set to 0 and i+1, all other references must take care of that) —Preceding unsigned comment added by 155.56.68.215 (talk) 11:27, 28 May 2010 (UTC)Reply
    • Quite right: by using size instead of -1, a random pair of strings I tested locally required 50% more comparisons than the correct algorithm. I believe that the incorrect algorithm was O(m*n), not O(n). I've corrected the sample code, and converted size_t's to int's where necessary. Alex.mccarthy (talk) 22:49, 15 June 2010 (UTC)Reply
      • The article text speaks of setting the "other" characters to the length of the string, not -1. It would help understanding if the code reflected the article, or at least if there was an explanation why -1 works instead of the string length. kostmo (talk) 03:38, 9 September 2010 (UTC)Reply
static void prepare_badcharacter_heuristic(const char *str, size_t size,
		int result[ALPHABET_SIZE]) {

	size_t i;

	for (i = 0; i < ALPHABET_SIZE; i++)
		result[i] = -1; // used to be size (which was wrong). converted size_t to int to support signed values

	for (i = 0; i < size; i++)
		result[(size_t) str[i]] = i;
}
  • The problem with this code is that it doesn't produce the same result as listed in the article itself. Working through this manually reveals that the ANPANMAN example with this code as written would yield the following results:
Character Shift
A 6
M 5
N 7
P 2
all other characters -1

The question is, which gets corrected, the code or the table, or both? Shyran (talk) 15:19, 19 October 2010 (UTC)Reply

Agreed - the code produces a different table than the article describes, which is very confusing. I believe the code creates an equivalent table to the table (I've seen this done in other BM implementations) described in the article:

Character Article Code Sum
A 1 6 7
M 2 5 7
N 3 7 10????
P 5 2 7
all other characters 8 -1 7

You'll see the tables are equivalent in that the "value from one table" == ("search string length" - 1 - "value in the other table"). Apparently except for 'N', which may indicate a bug in the code.

Its very confusing for the code to not match the article. The code should be an exemplar of the algorithm as described in the article, not an optimized variation of the algorithm. The article either needs to explain the deviation in the code or the code should be modified to be match the algorithm as described in the article. Aff0 (talk) 00:35, 31 January 2011 (UTC)Reply

First table...?

edit

This looks wrong to me (am preparing for an exam, but it goes against what I've read elsewhere).

Could someone check it? 131.111.1.66 (talk) 16:55, 4 May 2009 (UTC)Reply

The paragraph just before the table sections introduce the tables: "one table calculates ...; the other makes ...." However, I believe the tables are described in detail in the subsequent sections in the opposite order then they were just introduced, i.e. the first table mentioned in the introducing paragraph above ("one table calculates ...") is actually the second table described in section entitled "The Second Table" and the second table mentioned in the introducing paragraph above ("the other makes ...") is the first table. If I'm correct about this, then this is especially confusing since the tables are identified as first and second rather than by more descriptive labels like "good suffix shift table" or "bad character shift table". — Preceding unsigned comment added by Aff0 (talkcontribs) 00:48, 31 January 2011 (UTC)Reply

Key or Needle

edit

The article should use the same term throughout, now it uses key in the introduction and later uses needle without introducing the term. (I like needle better) —Preceding unsigned comment added by 85.230.77.206 (talk) 09:30, 30 March 2010 (UTC)Reply

Alternate algorithms that preprocess the target.

edit

"unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly"

Can we link to an example of such an algorithm, for completeness? RJFJR (talk) 16:39, 17 August 2010 (UTC)Reply

Sure! A good example is the Suffix tree. I'll add it in.
Andrew Helwer (talk) 18:09, 2 April 2012 (UTC)Reply

O(n/m) != "sub-linear"

edit

O(n/m), where `n` is the length of the haystack and `m` is the length of the needle, is still linear in `n`. Rewording the initial paragraph so as not to be wrong. —Preceding unsigned comment added by Jemfinch (talkcontribs) 17:09, 23 August 2010 (UTC)Reply

Fast BM implementation (including closed source and commercial).

edit

I am doing benchmark of substring search algorithms (at volnitsky.com/project/str_search) and BM looks quite slow. Does anyone knows where to find faster BM implementation(s)? BTW, BM C source posted here is incorrect - returns wrong results if compiled with optimization ( gcc-4.5.1). — Preceding unsigned comment added by Leonid Volnitsky (talkcontribs) 09:36, 2 February 2011 (UTC)Reply

Python version

edit

Can be found at:

http://code.activestate.com/recipes/117223-boyer-moore-horspool-string-searching/

Python

edit
# bmh.py
#
# An implementation of Boyer-Moore-Horspool string searching.
#
# This code is Public Domain.
#
def BoyerMooreHorspool(pattern, text):
    m = len(pattern)
    n = len(text)
    if m > n: return -1
    skip = []
    for k in range(256): skip.append(m)
    for k in range(m - 1): skip[ord(pattern[k])] = m - k - 1
    skip = tuple(skip)
    k = m - 1
    while k < n:
        j = m - 1; i = k
        while j >= 0 and text[i] == pattern[j]:
            j -= 1; i -= 1
        if j == -1: return i + 1
        k += skip[ord(text[k])]
    return -1

if __name__ == '__main__':
    text = "this is the string to search in"
    pattern = "the"
    s = BoyerMooreHorspool(pattern, text)
    print 'Text:',text
    print 'Pattern:',pattern
    if s > -1:
        print 'Pattern \"' + pattern + '\" found at position',s
## end of http://code.activestate.com/recipes/117223/ }}}

(code is released under the PSF license (http://www.opensource.org/licenses/Python-2.0)

Ooops, sorry this is the Boyer Moore Horspool algorithm. Will move this to that page. — Preceding unsigned comment added by 64.251.74.12 (talk) 21:14, 21 July 2011 (UTC)Reply

Miscellaneous gripes

edit

The 'fast' implementation outlined in the original paper really should be mentioned on this page, as well as Hume and Sunday's improvements to delta1 in "Fast String Searching". The code from the October edit is mine, by the way. I'll be slowly merging the comments into the main body (I think that needs an overhaul, too). I just thought the page urgently needed something...erm, correct on it.

The previous algorithm was a) confusing and inefficient. Why was the search pattern being reversed? Was the original author trying to reuse code from a KMP implementation? There's a comment in the middle that indicates confusion over how (if?) the algorithm works. Does it? b) flat out wrong in at least one place. The delta1 offset is relative to the *offset* (in the old implementation, s+j or whatever), not the start of the potential match (s). The implementation in the original paper doesn't even bother keeping track of the start - it's not necessary. Consider the example search text .....LTT-THAT and search pattern TT-THAT

TT-THAT .....LTT-THAT

T will match, then A != L, and L does not occur in the pattern, so delta1[L] is 7 (length of TT-THAT). The algorithm should move the text 7 relative to the mismatched character (i.e. 6 relative to the start of the block), not the start of the block. You can see here that moving 7 relative to the start will move past a valid match.

Source 2 (movsd) seems...extremely seedy. I don't trust a website that can mix up the names for the bad character rule and good suffix rule. The site also doesn't adequately explain what you are supposed to do if your search pattern has repeated sequences where the repeated block is more than one character long (surely a very, very, very rare case). In fact, I'm tempted to write off the entire page as wholly unreliable, and carefully review any content on this page that seemed to be derived from it (so far, it seems like most of the wiki article). Good grief.

Wikipedia is an encyclopedia, not an algorithms cookbook. I request that people looking to edit the code make sure they don't break anything in the name of "efficiency". Also, if the edit is for efficiency, make sure it's actually, well, more efficient. At the very least, make sure your edits increase or at least maintain clarity.

Reverted 3 changes in the delta2 table construction from user Snowgene. is_prefix() does not always return true (obviously, not every suffix of the pattern is a prefix), so last_prefix_index may very well get "fixed" at a previous value p. That is, the p in last_prefix_index = p+1 is not necessarily the p where delta2[p] = last_prefix_index + (patlen-1 - p).


Jy2wong (talk) 12:18, 17 October 2011 (UTC)jy2wongReply

I rewrote the majority of the article

edit

What are your thoughts? I think it is a bit more clear now. I used ideas from Dan Gusfield's book Algorithms on Strings, Trees, and Sequences. I also added more citations, and put the Performance section on strong theoretical ground - previously there were some pretty big misconceptions about runtimes vis a vis pattern occurring in the text or not.

Any suggestions, criticism? This is my first large-scale edit! Andrew Helwer (talk) 03:56, 8 April 2012 (UTC)Reply

While the article is technically correct it doesn't explicitly state the key insight of the algorithm. The insight the algorithm offers is to look at the end of the pattern, and then skip quickly through the text using information obtained from the pattern in making jumps along the text. The first diagram in the article shows the brute force method, which serves to confuse. I've made some additions to the article. They are clumsy and additional editing is welcomed. Kd4ttc (talk) 17:14, 27 January 2014 (UTC)Reply

Algorithm tweak in Java section

edit

I added a minor tweak for correctness to the Java section. In theory, if you only consider the 'bad character table,' you have the Boyer-Moore-Horspool algorithm, however as it stands, the Java section is incomplete, and even enters infinite loops on some trivial examples. This stems from using the variable i both as the alignment index and as the cursor for checking characters. For example, if the haystack is "ab ab abc def" and the needle is "xy abc" the program gets caught infinitely looping, because it forgets that it has already checked the potential match ending at abc and it tries to match it again, succeeds for 4 characters, fails, then hops back to where it started. The k in the outer loop retains this information that is otherwise lost by decrementing the variable.

I am almost certain that the shift from the offset table will always be larger in such cases, but it seems to me that for correctness sake, that shouldn't be assumed unless it is explicitly stated in comments or in the article. For instance, I was trying to build up my understanding of the algorithm by programming it piecemeal and testing as I went along. I was quite puzzled when my incomplete BM algorithm (though complete BMH, as far as I knew) worked in some cases, but froze in others. For that reason, I decided to fix this code (At least in the Java section.)

Bill Lava (talk) 04:46, 20 August 2012 (UTC)Reply

C data types

edit

The choice of data types for the C implementation is rather odd, to say the least. There is no reason to be using exact-bit-width types here; the architecturally natural (fundamental) integer types should be used instead. 121a0012 (talk) 07:36, 18 January 2013 (UTC)Reply

The C and Java implementations were present before I came across this article, so I just left them there. They are untested (or don't work at all) for all I know. I personally wrote (and extensively tested, see here: https://github.com/ahelwer/IvoryTower) the Python implementation, so I'm available to answer any questions you might have about it. Andrew Helwer (talk) 06:20, 23 January 2013 (UTC)Reply
Actually from a quick look at both the C and Java implementations, the preprocessing is not done in steps linear in the length of the pattern, but rather quadratic. This is due to use of the suffix length function... it isn't canonical Boyer-Moore, which preprocesses the pattern in linear time. On other other hand if your pattern is sufficiently short you may not care too much, as making the preprocess run in linear time adds a ton of complexity (see the Python implementation). Preprocessing for the bad suffix rule in linear time is so complicated I left the description out of the article, as it could pretty much have an article to itself. Andrew Helwer (talk) 06:30, 23 January 2013 (UTC)Reply

Typo?

edit

A match or occurrence of P occurs at an alignment if P is equivalent to T[(k-n+1)..k].

Shouldn't that be equivalent to T[(k-m+1)..k]?

No, because the length of the pattern is n. Length of the text is m.

Bug in python code

edit

There is a bug inside the python code which I cannot figure out by myself. If I search for pattern "abcabba" in text "abaabcabbababbcababa" it won't find the only occurence at position 3 (starting with index 0). It just returns an empty list. Unfortunately I have no idea where to correct the code. SeriousPlaygrounder (talk) 19:31, 23 June 2014 (UTC)Reply

Thanks for the bug report, I'll look into this. Andrew Helwer (talk) 00:54, 2 August 2014 (UTC)Reply
Just confirmed & reproduced on my end. Will debug this weekend. Andrew Helwer (talk) 01:30, 21 February 2015 (UTC)Reply
The right code for good suffix table like that. Edited by tr0217@163.com.
def suffix(s):
    """
    suff[i] = k, the length of the longest suffix of s[:i](the sub string of s ending at i)
    that matches a suffix of s 
    """
    m=len(s)
    g = m-1
    suff = {}
    suff[g]=m
    
    for i in range(m-2,-1,-1):
        if(i>g and suff[m-1-f+i]<i-g):
            suff[i] = suff[m-1-f+i]
        else:
            if(i<g):
                g = i
            f = i
            while(g>=0 and s[g]==s[m-1-f+g]):
                g-=1
            suff[i] = f - g

    return suff


def PreBmGs(s):
    m = len(s)
    suff = suffix(s)
    bmGs = {}
    for i in range(0,m,1):
        bmGs[i] = m

    for i in range(m-1,-1,-1):
        if(i+1 == suff[i]):
            for j in range(0,m-1-i,1):
                if(m == bmGs[j]):
                    bmGs[j] = m-1-i;


    for i in range(0,m-1,1):
        bmGs[m - 1 - suff[i]] = m - 1 - i;

    return bmGs;
I believe that this bug has been fixed with the change I've made to fix an off-by-one error. I've submitted a pull request here: https://github.com/ahelwer/WikiCode, as all of the tests now pass, including the one demonstrating the bug described here. Joshthewall (talk) 13:43, 16 April 2020 (UTC)Reply

A bug has been identified and fixed in function "fundamental_preprocess" at 15:15, 11 May 2016 (UTC).--Libo.yin.1 (talk) 08:17, 12 May 2016 (UTC)Reply

The Good Suffix Rule

edit

In the last section (second to last sentence) of the 'Description' subheading it is stated that:

"If an occurrence of P is found, then shift P by the least amount so that a proper prefix of the shifted P matches a suffix of the occurrence of P in T."

The word 'proper' is italicised; so I assume a 'proper prefix' is different in some way to your standard, run of the mill prefix. The article does not specify what this difference might be or if indeed there is a difference at all. Could this be clarified?

JPAIvan (talk) 23:22, 19 July 2014 (UTC)Reply

The proper prefix of a string is a prefix which is not equal to the string itself, similar to the definition of a proper subset. This should perhaps be defined in-article. Andrew Helwer (talk) 00:56, 2 August 2014 (UTC)Reply

Does this algorithm work for alphabet size 2?

edit

See Alpha Skip Search, Combinatorial Pattern Matching 1998, for an algorithm that also works on base two, text length N, pattern length M, with average runtime of O( (N/M + M)logM ), with log base alphabet size.

Preprocess the pattern by creating a lookup table of M-logM+1 keys of positions in the pattern, where each key is a sequence of logM letters in the pattern. time O(MlogM).

Main: Skip through the text by length M-logM+1, trying to match key of size logM to the lookup table. If it is in the lookup table, try matching the pattern naively, at the indicated start position.

Often the key will not even be in the lookup table, in which case the algorithm just skips over M-logM+1 letters of the text and tries again. The expected running time is O( (N/M)logM ), even for alphabets of size 2. Daniel Pehoushek 24.33.70.144 (talk) 19:10, 24 August 2014 (UTC)Reply

Good Suffix Rule Description

edit

The description says "... It is the reason comparisons begin at the end of the pattern rather than the start ..."

IMHO this is misleading. The comparison also begins at the end rather than the start to enable the 'Bad Character Rule' to work. So, IMHO, it is misleading to say "It is the reason comparisons ..." in the context of "Good Suffix"; the entire algorithm would be much less efficient, and arguably not recognisable as Boyer-Moore without "comparisons begin at the end".

As further evidence, the Boyer–Moore–Horspool algorithm _only_ uses the 'Bad Character" rule and its table.

It might be reasonable to write something like: "The good suffix rule is markedly more complex in both concept and implementation than the bad character rule. Like the 'Bad Character Rule', it also exploits the algorithm's feature of comparisons beginning at the end of the pattern and proceeding towards the pattern's start ..."

Gbulmeruk (talk) 11:29, 17 November 2015 (UTC)Reply

Misleading description of bad character rule

edit

The article describes what is called "extended bad character rule", not the basic "bad character rule" (see the original paper, and the book https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm#cite_note-ASTS-3 cited in the article). Perhaps this should be clarified.

BartoszKP (talk) 11:29, 4 December 2016 (UTC)Reply

I agree, the Preprocessing section describes an algorithm that differs from the ones provided in the implementation examples, and one that will give results invalidated by the good suffix rule. This ties in to the need to explain Boyer-Moore-Horspool; first provide the extended bad character rule and make it stand on its own, then explain how switches on this version can lead to alignments that would have been ignored should the good suffix rule be applied and provide the basic bad character rule for the Boyer-Moore version. — Preceding unsigned comment added by 109.26.232.78 (talk) 09:39, 6 June 2019 (UTC)Reply

The Galil rule

edit

Thus if the comparisons get down to position k1 of T, an occurrence of P can be recorded without explicitly comparing past k1.

Why? And what does it mean "past k1"? Aren't we scanning from the end? This whole section is mighty confusing... Can anyone explain it clearly without leaving room for imagination? --84.245.121.159 (talk) 23:56, 18 November 2020 (UTC)Reply

Python Code is incorrect

edit

The Python Code for this algo is incorrect.

```>>> from Wiki import string_search

>>> from BMAlgo import BMA           

>>> with open("test2.txt", 'r', errors = 'ignore') as f:

...     strs = f.read()

...     A = BMA(strs, "caba")

...     B = string_search('caba', strs)

...     print(A)

...     print(B)

...     print(A == B)

...               

[12168657]

[12168657, 12781460]

False

>>> strs[12168657: 12168657 + 4]

'caba'

>>> strs[12781460: 12781460 + 4]

'aaba'

>>>``` Bamgm14 (talk) 03:47, 5 April 2024 (UTC)Reply