Coincidentally, I was investigating a slightly more general version of that ques...

Coincidentally, I was investigating a slightly more general version of that question last week, and I found this paper getting a 3× speedup, and more on parallel machines: http://research.microsoft.com/pubs/208237/asplos302-mytkowic...

The summary is that once you've computed the KMP state transition function, you compute its prefix sum under composition and then ask at which textual indices it assumes a final state. This abstract description of the KMP algorithm is sufficiently general to cover realizations with different time and space complexity, including not just the purely serial uniprocessor O(N)-time constant-space realization, but also potentially the 3× efficient SIMD realization described in the paper, efficient O(log N)-time parallel realizations and GPGPU realizations; and it is sufficiently specific to automatically derive an efficient realization.

An even more interesting question, to me, is what kind of formalism would make it easy to rigorously (and perhaps even automatically) derive KMP and/or BM from the abstract definition of the string search problem and an objective of reducing computation time.