
Show HN: A search widget for static web sites - z3t4
http://www.webtigerteam.com/websearch/
======
nprescott
Using the RSS feed for the content seems clever, and useful as a means to
build the static search widget outside the content generation phase of site
building (or for adding after the fact).

I recently had a similar (less well fleshed-out) idea[0], but built the search
corpus during the generation of my static site (this was made easy by the fact
that I wrote the generator too). The results aren't nearly so impressive, but
it was an interesting experience that led to all sorts of ways to improve
searching, like stemming, lemmatization[1] and removing stop words (mostly to
reduce the size of the text corpus).

[0]: [https://idle.nprescott.com/2017/text-search-on-a-static-
blog...](https://idle.nprescott.com/2017/text-search-on-a-static-blog.html)

[1]: [https://nlp.stanford.edu/IR-
book/html/htmledition/stemming-a...](https://nlp.stanford.edu/IR-
book/html/htmledition/stemming-and-lemmatization-1.html)
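
The corpus-shrinking steps mentioned above (stop words, stemming) can be
sketched in Python; the stemmer here is a crude suffix-stripper standing in
for a real one like Porter's, and all names are illustrative:

```python
# Shrink a text corpus for search: drop stop words, then strip a few
# common suffixes (a crude stand-in for real stemming/lemmatization).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def crude_stem(word):
    # Real stemmers (e.g. Porter) handle far more suffixes and exceptions.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    tokens = [t.lower() for t in text.split()]
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The cats are searching the indexes"))
```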

~~~
gjstein
Your post was a great read, and I might try doing something similar on my
blog. A suggestion though: I typed in a query with an uppercase letter and it
failed to find any matches. Perhaps .lower() your query if it's something
you'd like to maintain? Thanks!
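
For reference, the fix is just normalizing case on both sides before
comparing; a minimal sketch (the function name is hypothetical):

```python
# Case-insensitive matching: lowercase both the query and the text
# before comparing, so "Python" still finds "python".
def matches(query, text):
    return query.lower() in text.lower()

print(matches("Python", "a post about python packaging"))
```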

------
jstanley
Be careful loading JS from sites you don't control. If the host goes down,
your search stops working. If the host becomes malicious or compromised, your
site becomes malicious or compromised.

For best results, host the JS yourself.

~~~
tedmiston
The latter is a solved problem with SRI which most CDNs support.

[https://developer.mozilla.org/en-
US/docs/Web/Security/Subres...](https://developer.mozilla.org/en-
US/docs/Web/Security/Subresource_Integrity)
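
For reference, an SRI integrity value is just an algorithm prefix plus the
base64-encoded digest of the script bytes; computing one for a local copy can
be sketched in Python (the function name is illustrative):

```python
import base64
import hashlib

def sri_hash(data, algorithm="sha384"):
    """Return an SRI integrity value like "sha384-..." for script bytes."""
    digest = hashlib.new(algorithm, data).digest()
    return algorithm + "-" + base64.b64encode(digest).decode("ascii")

# The result goes in the script tag's integrity attribute, e.g.
#   <script src="..." integrity="sha384-..." crossorigin="anonymous">
print(sri_hash(b'console.log("hello");'))
```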

~~~
Franciscouzo
SRI isn't something servers/CDNs need to support; it runs entirely in the
browser. I'd argue it's not a bulletproof solution, since only ~60% of
browsers support it ([https://caniuse.com/#feat=subresource-
integrity](https://caniuse.com/#feat=subresource-integrity)).

~~~
tedmiston
Thanks for sharing. It looks like the major browsers whose latest versions
don't support it are Edge and Mobile Safari. I wonder why they haven't added
it yet.

------
allover
Nice, simple solution!

Feedback: the JS to include should be minified, and the instructions should
say to either place the script tag at the bottom of the 'body' tag (not in the
head), _or_ to add the 'async' attribute to the script tag ( _or_ do what
Google Analytics does and include a bit of inline JS to async-load the
script).

(Otherwise page rendering is blocked by a third-party script, which is a
pretty bad situation.)

~~~
onion2k
On that note, the script _really_ needs a subresource integrity hash[1]. As
it's included at the moment the author could modify the code at webtigerteam
to do _anything_. By putting an integrity hash in the script tag people could
use this without worrying about future changes doing bad things.

[1] [https://developer.mozilla.org/en-
US/docs/Web/Security/Subres...](https://developer.mozilla.org/en-
US/docs/Web/Security/Subresource_Integrity)

~~~
MichaelGG
... by exchanging it for worrying about the feature breaking every time a
minor fix is pushed. If you're going to go through the hassle of manually
pinning versions, then just download and host the code alongside your site.

------
shanecleveland
A clever solution that would work well in the right scenarios. Nice work.

I was left looking for a replacement when Searchpath.io closed shop. It was
the simplest static site search I was aware of. Just add a single JS script to
your site ($75/yr for full version).

Google's search widget is the next best thing. The results are obviously good,
but now you can't opt out of ads, so you may be serving up ads to competitors.

There are some nice search tools out there, but they are overly complex and
too expensive for small, static sites.

I played around with lunr.js, but ended up settling on Tipue Search
([http://www.tipue.com/search/](http://www.tipue.com/search/)). Once the
index is created, it's pretty good.

------
throwaway2016a
This is a really neat solution. I like it.

One solution I've been working with is to create an index at build time for
static sites, push it to a SaaS Elasticsearch, and use AWS Lambda as a query
interface to it. It works well, albeit not free to run. This solution beats
it in cost and probably perceived performance too.

------
AndrewStephens
This is a clever idea, but it really doesn't scale to larger sites since it
relies on having access to the full text of every page on the client.

Using the rss feed to find pages is neat, but you might be better off just
modifying your static site generator to produce a nicely formatted single file
that your Javascript can read directly.

I don't have local search (it is hard to beat google) but I did implement a
client-side tag cloud [0] for my statically generated blog. The embedded
javascript is generated from the site's content each time I build the site.

[0] [https://sheep.horse/tagcloud.html](https://sheep.horse/tagcloud.html)
(Statically generated client side tag cloud)

~~~
z3t4
I was _concerned about downloading several small text-files_ too, but if you
visit, for example, nytimes.com, there are ~350 requests on first load. And
Facebook has auto-playing videos ... Downloading a bunch of static web pages
will feel instant.

This search script can be reused between different static site generators,
compared to generating index data for each site.

------
LamaOfRuin
Seems like lunr ([https://lunrjs.com/](https://lunrjs.com/)) or elasticlunr
([http://elasticlunr.com/](http://elasticlunr.com/)) have the better approach,
allowing only the index to be distributed, and constructing the index to be
done once/updated when new pages are published.
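
The prebuilt-index approach can be sketched as a plain inverted index built
once at publish time; this is a toy stand-in for lunr's real index format,
not its API:

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercased term to the set of page URLs containing it.
    `pages` is {url: text}; build this once when the site is published."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    # AND semantics: every query term must appear on the page.
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

pages = {
    "/a.html": "static site search widget",
    "/b.html": "static site generator",
}
print(sorted(search(build_index(pages), "static search")))
```

The client then only downloads the serialized index, not every page.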

------
tyingq
That's well done. Curious why you fetch the RSS at page load versus when a
search is performed, though.

~~~
yodon
Pulling the list on page load makes sure the code responds fast when the user
starts a search.

~~~
swiley
The user should be able to consent (or not) to expensive operations like that.

~~~
marcuslager
If they have js enabled they already did.

~~~
tyingq
True, but there is some validity to the point. On a large site you might burn
a lot of your mobile data allotment with one search. That's not intuitive or
obvious. Might be nice to implement limits and a 'more' button in the code so
it doesn't scrape the whole site at once.

------
jimktrains2
I wrote my own site generator a while back and have been considering adding a
simple JS search where the client would download the index, which is
generated at compile time.

Even for a small blog the index might become a decent size, though. I thought
maybe a heavily pruned TF-IDF index could be workable?
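
A heavily pruned TF-IDF index might look like this sketch: score each term
per document, then keep only the top-scoring terms so the shipped index stays
small (the cutoff is arbitrary):

```python
import math
from collections import Counter

def tf_idf_index(docs, keep_top=3):
    """Build {doc_id: {term: score}}, keeping only the keep_top
    highest-scoring terms per document to shrink the shipped index."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    index = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        top = sorted(scores, key=scores.get, reverse=True)[:keep_top]
        index[doc_id] = {term: round(scores[term], 3) for term in top}
    return index

print(tf_idf_index({"a": "cat cat dog", "b": "dog bird"}, keep_top=1))
```

Note that terms appearing in every document get an IDF of zero, so the very
common words fall out of the index on their own.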

~~~
marcuslager
>the client would download the index

That doesn't sound like too bad an idea, given the state of JS these days.

Yes, you can cut off irrelevant terms, those with a low TF-IDF score (the, a,
is, are...), or you could filter out stop words. Or both. Personally, I would
never remove low-scoring terms. Consider the word "is": it's mostly
irrelevant, unless you are searching for information about a certain
terrorist group.

One small index type (as in number of bytes) is a trie. Another is a Huffman
tree. They store each term less than once.

~~~
jimktrains2
The disadvantage(?) of a trie is that you can't rank by relevance.

~~~
marcuslager
There is no such disadvantage. Scoring usually comes after the index (tree)
lookup. Once you have terms you can weight them. The trie is just a lookup
mechanism.
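
That split can be sketched like so: the trie only resolves a term to its
stored postings, and ranking is applied afterwards (all names are
illustrative):

```python
class Trie:
    """Minimal trie mapping terms to postings ({doc_id: weight}).
    The trie is lookup only; scoring happens on the postings it returns."""
    def __init__(self):
        self.children = {}
        self.postings = {}

    def insert(self, term, doc_id, weight):
        node = self
        for ch in term:
            node = node.children.setdefault(ch, Trie())
        node.postings[doc_id] = weight

    def lookup(self, term):
        node = self
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return {}
        return node.postings

def rank(trie, query):
    # Scoring happens here, after the lookup: sum the stored
    # per-document weights for each query term.
    scores = {}
    for term in query.lower().split():
        for doc_id, weight in trie.lookup(term).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return sorted(scores, key=scores.get, reverse=True)
```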

------
steffoz
DatoCMS ([https://www.datocms.com](https://www.datocms.com)) just released
search for static websites in alpha. When the static site gets published, it
automatically spiders it and you can integrate a JS widget on your site.

------
CM30
Well, this seems pretty useful. Certainly worked well when I tested it on the
Web Tiger Team site.

However, I do wonder whether the RSS feed aspect might be limiting here. I
mean, how often do people manually generate an RSS feed on a static website?

I'd say the answer is 'pretty much never'. They're annoying to create/edit by
hand or with a static site builder, and hence they're usually quite rare on
the types of websites this tool was made for.

So while I think the widget works well when used properly, I feel like the
vast majority of static sites won't be able to use it simply because they
don't have an RSS feed.

~~~
lucideer
It should be very easy to automate RSS feed generation with any consistently
laid out static site (even if it's hand written).

If you're using a SSG (I presume most are), there's things like jekyll-feed[0]

[0] [https://github.com/jekyll/jekyll-feed](https://github.com/jekyll/jekyll-
feed)
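
Outside Jekyll, any build script can emit a minimal feed; a sketch using only
the Python standard library (element names follow RSS 2.0; the post fields
are assumptions about the generator's data):

```python
import xml.etree.ElementTree as ET

def build_rss(site_title, site_url, posts):
    """Emit a minimal RSS 2.0 feed. `posts` is a list of
    {"title": ..., "url": ...} dicts from the site generator."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    for post in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, "link").text = post["url"]
    return ET.tostring(rss, encoding="unicode")

print(build_rss("My Blog", "https://example.com",
                [{"title": "Hello", "url": "https://example.com/hello.html"}]))
```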

------
luxpir
That's really clever and quite possibly the fastest search I've ever used.

Could see this being extremely helpful - thanks (och jättestora tackar,
Swedish for "a big thank you") for sharing!

~~~
onion2k
_That's really clever and quite possibly the fastest search I've ever used._

The way it works is by fetching an RSS list of links, downloading each linked
page, and then matching a search string against the non-HTML content. If
you're searching for something after the pages have all been downloaded it'll
be very quick, but if the links in the RSS feed take a long time to load, or
you do a search immediately after the page has loaded, it'll feel really slow.

Also if someone used it on a mobile site it'd require the user to download all
of the content for the website in order to search. They might not be too happy
about that.
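
In outline that approach is: fetch the RSS feed, download each linked page,
strip the HTML, substring-match. A Python sketch of the same idea, operating
on already-downloaded pages (the real widget does this client-side in JS):

```python
import re

def strip_html(html):
    # Crude tag stripper with whitespace normalization; the real widget
    # matches against the rendered text content instead.
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.split())

def search_pages(pages, query):
    """`pages` maps URL -> raw HTML (already fetched via the links in
    the RSS feed). Returns URLs whose text contains the query."""
    q = query.lower()
    return [url for url, html in pages.items()
            if q in strip_html(html).lower()]

pages = {
    "/a.html": "<p>Static site <b>search</b></p>",
    "/b.html": "<p>Nothing here</p>",
}
print(search_pages(pages, "site search"))
```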

------
ams6110
Doesn't adding "site:example.com" to a Google search do the same thing
(assuming Google has indexed your static site)?

------
Walkman
It didn't find whole words on the same page. Not there yet.

------
devendramistri
It's not working: on your page I searched "Ycombinator people" but got no
results, though it is present on the links page.

~~~
tyingq
Seems that the links page is not in the RSS feed. It only searches pages
referenced in the RSS feed.
[https://www.webtigerteam.com/johan/rss_en.xml](https://www.webtigerteam.com/johan/rss_en.xml)

Which is fairly common with a blog: stories are in the RSS feed, pages are
not. Might be good to point out that caveat so that users can choose to add
those pages to the feed.

------
foxhop
thanks for this!

------
yodon
GPL'd code so not usable by most commercial projects

~~~
tyingq
Given that it's relatively standalone, I don't see the issue. It wouldn't
attach the GPL to your other code...there's no linking, etc. You would have to
contribute any changes to this specific code back, but there's no secret sauce
there.

The bigger limiter, mentioned in other comments, is that it downloads every
page in the RSS feed to do the search. So it's workable for a smallish niche
site with a low page count, but not really workable beyond that.

~~~
nerdponx
I feel like this is the browser equivalent of dynamic linking.

That said, there's no way (yet) to DRM or otherwise prevent the user from
obtaining the entire front end source, so it might be moot.

Side question: what am I allowed and not allowed to do with webpage sources
under US copyright law? If I cache the site and open a local copy, have I
illegally reproduced a copyrighted work? What if I burned it to a CD? What if
I e-mail a copy to a friend? What if I host my own version?

~~~
jsjohnst
IANAL, but...

> cached locally

That falls under US fair use laws I believe

> burned a CD

Again, fair use I believe

> emailed a copy / hosted own version

Likely is infringing, but not definitively. Distribution is generally not fair
use, but there are some exceptions.

~~~
nerdponx
I was under the impression that ripping an MP3 from a CD constituted an
illegal reproduction, so it's a little surprising to me that the law doesn't
apply in the other direction.

~~~
jsjohnst
Depends on where you live. For example, in the UK I heard it's illegal again
(has gone back and forth over the years).

