I recently had a similar (less well fleshed-out) idea, but built the search corpus during the generation of my static site (this was made easy by the fact that I wrote the generator too). The results aren't nearly as impressive, but it was an interesting experience that led to all sorts of interesting ways to improve searching, like stemming, lemmatization, and stop-word removal (this was mostly to reduce the size of the text corpus).
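For anyone curious, a minimal sketch of that kind of build-time pruning (the stop-word list and the crude suffix stripping are just stand-ins for a real stemmer like Porter's; this is not my actual generator code):

    // Build-time corpus pruning sketch: drop stop words, crudely stem the rest.
    const STOP_WORDS = new Set(['the', 'a', 'an', 'is', 'are', 'of', 'and', 'to', 'in']);

    // A real generator would use a Porter/Snowball stemmer or a lemmatizer
    // instead of blind suffix stripping.
    function stem(word) {
      return word.replace(/(ing|ed|es|s)$/, '');
    }

    function buildTerms(text) {
      return text
        .toLowerCase()
        .split(/\W+/)
        .filter(w => w && !STOP_WORDS.has(w))
        .map(stem);
    }

    // buildTerms('Searching the indexed words') -> ['search', 'index', 'word']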
For best results, host the JS yourself.
I'll either serve each library separately, bundled, or bundled with my own client code, at least until ES6 modules are ubiquitous.
In enterprise situations, it's a little different. I work for a company that pays Akamai a significant sum to make sure all of that's sorted out for anything large. Custom client code is usually served locally, depending on the project.
Feedback: the JS to include should be minified, and the instructions should say to either place the script tag at the bottom of the 'body' tag (not in the head), or to add the 'async' attribute to the script tag (or do what Google Analytics does and include a bit of inline JS to async-load the script, roughly as sketched below).
(Otherwise rendering of the page is blocked by a 3rd party script, which is a pretty bad situation).
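For reference, the inline async-loader pattern looks roughly like this (the script URL is just a placeholder):

    // Small inline snippet in the <head>: it creates the script tag at runtime
    // with the async flag set, so fetching the widget never blocks rendering.
    (function () {
      var s = document.createElement('script');
      s.src = 'https://example.com/search-widget.min.js'; // placeholder URL
      s.async = true;
      document.head.appendChild(s);
    })();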
Why? I mean, sure, you save a little traffic, but it could just as well be gzipped.
Gzipped original: 7633 bytes
Gzipped minified: 4109 bytes
Minification is worth it. Make the source code available on GitHub.
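For what it's worth, the comparison above is easy to reproduce with a few lines of Node (the file names are just examples):

    // Compare gzipped sizes of the original and minified bundles.
    const fs = require('fs');
    const zlib = require('zlib');

    for (const file of ['search.js', 'search.min.js']) { // example file names
      const gzipped = zlib.gzipSync(fs.readFileSync(file));
      console.log(`${file}: ${gzipped.length} bytes gzipped`);
    }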
I was left looking for a replacement when Searchpath.io closed shop. It was the simplest static site search I was aware of. Just add a single JS script to your site ($75/yr for full version).
Google's search widget is the next best thing. The results are obviously good, but now you can't opt out of ads, so you may be serving up ads for your competitors.
There are some nice search tools out there, but they are overly complex and too expensive for small, static sites.
I played around with lunr.js, but ended up settling on Tipue Search (http://www.tipue.com/search/). Once the index is created, it's pretty good.
One solution I've been working with is to create an index at build time for static sites, push it to a SaaS Elasticsearch cluster, and use AWS Lambda as a query interface to it. It works well, albeit not free to run. This solution beats it in cost and probably in perceived performance too.
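As a rough sketch of the Lambda glue (the Elasticsearch endpoint, the 'site-pages' index name, and the lack of auth are placeholders; a real setup would sign requests or use an API key):

    // Sketch of the Lambda query handler (Node 18+, global fetch, API Gateway proxy event).
    const ES_ENDPOINT = process.env.ES_ENDPOINT; // placeholder for your SaaS cluster URL

    exports.handler = async (event) => {
      const q = (event.queryStringParameters && event.queryStringParameters.q) || '';
      const res = await fetch(`${ES_ENDPOINT}/site-pages/_search`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ query: { match: { content: q } } }),
      });
      const data = await res.json();
      return { statusCode: 200, body: JSON.stringify(data.hits.hits) };
    };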
 https://sheep.horse/tagcloud.html (Statically generated client side tag cloud)
This search script can be reused across different static site generators, unlike approaches that generate index data separately for each site.
Even for a small blog the index might become a decent size, though. I thought maybe a heavily pruned TF-IDF index could be workable?
That doesn't sound like too bad of an idea, given the state of js these days.
Yes, you can cut off irrelevant terms, i.e. those with a low TF-IDF score (the, a, is, are...), or you could filter out stop words. Or both. Personally, I would never remove low-scoring terms. Consider the word "is": it's mostly irrelevant, unless you are searching for information about a certain terrorist group.
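To make that concrete, a toy sketch of pruning by TF-IDF score (the cutoff is arbitrary, and it's exactly the kind of threshold that would silently drop an "is" you actually wanted):

    // Toy pruned TF-IDF index: docs is an array of term arrays, one per page.
    function buildPrunedIndex(docs, cutoff = 0.1) {
      const docFreq = new Map();
      for (const doc of docs) {
        for (const term of new Set(doc)) {
          docFreq.set(term, (docFreq.get(term) || 0) + 1);
        }
      }
      const index = [];
      docs.forEach((doc, docId) => {
        const counts = new Map();
        for (const term of doc) counts.set(term, (counts.get(term) || 0) + 1);
        for (const [term, count] of counts) {
          const tf = count / doc.length;
          const idf = Math.log(docs.length / docFreq.get(term));
          if (tf * idf >= cutoff) index.push({ term, docId, score: tf * idf }); // low scorers are dropped
        }
      });
      return index;
    }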
One small index type (as in number of bytes) is a trie. Another is a Huffman tree. Because terms share structure, each term effectively costs less than one full copy to store.
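For illustration, a bare-bones trie; since "search", "searches", and "searching" share the nodes of their common prefix, the characters are stored once even though three terms are indexed:

    // Minimal prefix trie: terms sharing a prefix share nodes.
    class Trie {
      constructor() { this.root = {}; }

      insert(term) {
        let node = this.root;
        for (const ch of term) node = node[ch] = node[ch] || {};
        node.end = true; // marks a complete term
      }

      has(term) {
        let node = this.root;
        for (const ch of term) {
          node = node[ch];
          if (!node) return false;
        }
        return node.end === true;
      }
    }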
However, I do wonder whether the RSS feed aspect might be limiting here. I mean, how often do people manually generate an RSS feed on a static website?
I'd say the answer is 'pretty much never'. They're annoying to create/edit by hand or with a static site builder, and hence they're usually quite rare on the types of websites this tool was made for.
So while I think the widget works well when used properly, I feel like the vast majority of static sites won't be able to use it simply because they don't have an RSS feed.
If you're using an SSG (I presume most are), there are things like jekyll-feed.
Of course, once a site reaches a certain size, client side solutions become impractical.
 https://sheep.horse/rss.xml for example
Could see this being extremely helpful - thanks (and a really big thank you) for sharing!
The way it works is by fetching an RSS list of links, downloading each linked page, and then matching a search string against the non-HTML content (roughly as sketched below). If you're searching for something after the pages have all been downloaded it'll be very quick, but if the links in the RSS feed take a long time to load, or you do a search immediately after the page has loaded, it'll feel really slow.
Also if someone used it on a mobile site it'd require the user to download all of the content for the website in order to search. They might not be too happy about that.
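Conceptually the approach is something like this (the feed URL, the CORS situation, and the text extraction are all simplified; I haven't read the actual widget source):

    // Sketch of the approach: pull the feed, fetch every linked page,
    // strip the markup, then match the query against the plain text.
    async function searchSite(feedUrl, query) {
      const feedXml = await (await fetch(feedUrl)).text();
      const feed = new DOMParser().parseFromString(feedXml, 'application/xml');
      const links = [...feed.querySelectorAll('item > link')].map(l => l.textContent);

      const results = [];
      for (const url of links) {
        const html = await (await fetch(url)).text(); // every linked page is downloaded
        const text = new DOMParser().parseFromString(html, 'text/html').body.textContent;
        if (text.toLowerCase().includes(query.toLowerCase())) results.push(url);
      }
      return results;
    }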
That's a limitation that is fairly common with a blog: stories are in the RSS feed, but standalone pages are not. Might be good to point out that caveat so that users can choose to add those pages to the RSS feed.
The bigger limiter, mentioned in other comments, is that it downloads every page in the RSS feed to do the search. So it's workable for a smallish niche site with a low page count, but not really workable beyond that.
That said, there's no way (yet) to DRM or otherwise prevent the user from obtaining the entire front end source, so it might be moot.
Side question: what am I allowed and not allowed to do with webpage sources under US copyright law? If I cache the site and open a local copy, have I illegally reproduced a copyrighted work? What if I burned it to a CD? What if I e-mail a copy to a friend? What if I host my own version?
> cached locally
That falls under US fair use law, I believe.
> burned a CD
Again, fair use, I believe.
> emailed a copy / hosted own version
Likely infringing, but not definitively. Distribution is generally not fair use, but there are some exceptions.
You can't say anything definitively falls under fair use. Fair use is a case-by-case determination made by a court, limited to each specific instance (so if you do the exact same thing a second time, it could be found to no longer be fair use).