Hacker News
Show HN: A search widget for static web sites (webtigerteam.com)
88 points by z3t4 on July 7, 2017 | 59 comments

Using the RSS feed for the content seems clever, and useful as a means to build the static search widget outside the content generation phase of site building (or for adding after the fact).

I recently had a similar (less well fleshed-out) idea[0], but built the search corpus during the generation of my static site (made easy by the fact that I wrote the generator too). The results aren't nearly so impressive, but it was an interesting experience that led to all sorts of ways to improve searching, like stemming, lemmatization[1] and removing stop words (that last one mostly to reduce the size of the text corpus).

[0]: https://idle.nprescott.com/2017/text-search-on-a-static-blog...

[1]: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-a...

Your post was a great read, and I might try doing something similar on my blog. A suggestion though: I typed in a query with an uppercase letter and it failed to find any matches. Perhaps .lower() your query if it's something you'd like to maintain? Thanks!
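For reference, a case-insensitive match is a one-liner: normalize both the query and the corpus text before comparing. A minimal sketch (the function name is just illustrative):

```javascript
// Sketch: lowercase both sides so "RSS" matches "rss" and vice versa.
function matches(query, text) {
  return text.toLowerCase().includes(query.toLowerCase());
}
```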

That is a really nice way of doing it. If you are going to the trouble of statically generating your site, you might as well statically generate a well-structured index at the same time.

I'm doing this as well; the Python library Whoosh is doing a reasonable job there.

Be careful loading JS from sites you don't control. If the host goes down, your search stops working. If the host becomes malicious or compromised, your site becomes malicious or compromised.

For best results, host the JS yourself.

it always baffles me how nonchalantly devs will inject third-party js without researching smaller/faster/more secure self-hosted solutions. even social sharing widgets are easily replaced [1]. meanwhile, everyone loses their shit over the obvious implications [2]. this applies to using CDNs for everything, too.

[1] https://github.com/heiseonline/shariff

[2] https://www.theguardian.com/technology/2017/jul/03/facebook-...

The latter is a solved problem with SRI which most CDNs support.


SRI is not a matter of servers/CDNs supporting it; it runs completely in the browser. I'd argue it's not a bulletproof solution, since only ~60% of browsers support it (https://caniuse.com/#feat=subresource-integrity).

Thanks for sharing. It looks like the active major browsers that don't support it in their latest version are Edge and Mobile Safari. I wonder why they haven't done so yet.

Agreed. It might not always be the most performant option, but it's reliable. I only use CDNs in dev environments, for quick tests, or for non-public facing minor temporary projects. If they go permanent I switch to serving the libraries locally. More often than not when I am dealing with a CDN in these situations, it's from CDNjs alone -- if I can't get it from them, I take the extra few minutes to serve it.

I'll serve each library either separately, bundled, or bundled with my own client code. At least until ES6 modules are ubiquitous.

In enterprise situations, it's a little different. I work for a company that pays Akamai a significant sum to make sure all of that's sorted out for anything large. Custom client code usually is served locally -- depending on the project.

Nice, simple solution!

Feedback: the JS to include should be minified, and the instructions should say to either place the script tag at the bottom of the 'body' tag (not in the head), or to add the 'async' attribute to the script tag (or do what Google Analytics does and include a bit of inline JS to async-load the script).

(Otherwise rendering of the page is blocked by a 3rd party script, which is a pretty bad situation).
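To illustrate the two placements (the script URL is hypothetical):

```html
<!-- Option 1: just before </body>, so it doesn't block rendering -->
<script src="https://example.com/searchwidget.js"></script>

<!-- Option 2: in <head>, but fetched and executed asynchronously -->
<script async src="https://example.com/searchwidget.js"></script>
```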

On that note, the script really needs a subresource integrity hash[1]. As it's included at the moment the author could modify the code at webtigerteam to do anything. By putting an integrity hash in the script tag people could use this without worrying about future changes doing bad things.

[1] https://developer.mozilla.org/en-US/docs/Web/Security/Subres...
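For reference, a sketch of what that looks like, following the MDN recipe (the URL and hash are placeholders, not the widget's real values):

```html
<!-- Hash generated with, e.g.:
     cat searchwidget.js | openssl dgst -sha384 -binary | openssl base64 -A -->
<script src="https://example.com/searchwidget.js"
        integrity="sha384-BASE64HASHGOESHERE"
        crossorigin="anonymous"></script>
```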

... by exchanging it for worrying about the feature breaking every time a minor fix is pushed. If you're going to go through the hassle of manually pinning versions, then just download and host the code alongside your site.

>the JS to include should be minified

Why? I mean sure you save a little traffic, but it could just as well be gzipped.

Minifying does more.

Original: 26,282 bytes
Minified: 10,562 bytes
Gzipped original: 7,633 bytes
Gzipped minified: 4,109 bytes

Not only does minifying save bandwidth, it also makes the script parse and run faster! We're talking about a 30% speed increase! It's huge! We'll save ONE MILLISECOND! Seriously though, minifying only makes sense at huge scale; I would need to serve that script 15,000 times per second to saturate a GbE link. And those poor users who pay outrageous costs for downstream by the MB would save around 0.003 cents if it were minified. Let's instead consider the costs of minifying the script, like no longer being able to easily debug it in production. And maybe someone wants to actually view the source code, figure out how it works, and tinker with it; now that's worth something!

You're forgetting about mobile phones. Seriously, bytes matter because mobile networks suck, especially in emerging markets.

Minification is worth it. Make the source code available on github.

Hi, I am currently paying $15 for 75MB of data. Thank you very much to everyone who obsessively optimises their web code, as it is worth it for weirdos like me.

A clever solution that would work well in the right scenarios. Nice work.

I was left looking for a replacement when Searchpath.io closed shop. It was the simplest static site search I was aware of. Just add a single JS script to your site ($75/yr for full version).

Google's search widget is the next best thing. The results are obviously good, but now you can't opt out of ads, so you may end up serving ads for your competitors.

There are some nice search tools out there, but they are overly complex and too expensive for small, static sites.

I played around with lunr.js, but ended up settling on Tipue Search (http://www.tipue.com/search/). Once the index is created, it's pretty good.

This is a really neat solution. I like it.

One solution I've been working with is to create an index at build time for static sites and push to SaaS ElasticSearch and use AWS Lambda to make a query interface to it. It works well, albeit not free to run. This solution beats it in cost and probably perceived performance too.

This is a clever idea, but it really doesn't scale to larger sites since it relies on having access to the full text of every page on the client.

Using the RSS feed to find pages is neat, but you might be better off just modifying your static site generator to produce a nicely formatted single file that your JavaScript can read directly.

I don't have local search (it is hard to beat google) but I did implement a client-side tag cloud [0] for my statically generated blog. The embedded javascript is generated from the site's content each time I build the site.

[0] https://sheep.horse/tagcloud.html (Statically generated client side tag cloud)

I was concerned about downloading several small text files too, but if you visit, for example, nytimes.com, there are ~350 requests on first load. And Facebook has auto-playing videos... Downloading a bunch of static web pages will feel instant.

This search script can be reused between different static site generators, compared to generating index data for each site.

Seems like lunr (https://lunrjs.com/) or elasticlunr (http://elasticlunr.com/) have the better approach, allowing only the index to be distributed, and constructing the index to be done once/updated when new pages are published.
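The core of what lunr serializes is an inverted index: term → list of document ids, built once at publish time and shipped as JSON. A dependency-free sketch of the idea (names and the `pages` shape are illustrative):

```javascript
// Build an inverted index at site-generation time; only this
// (JSON-serializable) object needs to reach the browser.
function buildIndex(pages) {
  const index = {};
  pages.forEach((page, i) => {
    for (const term of page.text.toLowerCase().split(/\W+/)) {
      if (!term) continue;
      (index[term] = index[term] || new Set()).add(i);
    }
  });
  // Sets -> arrays so the index survives JSON.stringify.
  return Object.fromEntries(
    Object.entries(index).map(([t, ids]) => [t, [...ids]])
  );
}

// Client side: a lookup is a single property access, no page fetches.
function search(index, pages, query) {
  const ids = index[query.toLowerCase()] || [];
  return ids.map((i) => pages[i].url);
}

const pages = [
  { url: '/a', text: 'Static site search' },
  { url: '/b', text: 'RSS feeds explained' },
];
const idx = buildIndex(pages);
```

Real lunr adds stemming, field boosts and ranking on top of this, but the distribution model is the same: only the index travels over the wire.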

That's well done. Curious why you fetch the rss at page load versus doing it when a search is performed though.

The search text input is disabled until the RSS is fetched, so it's a UI design decision. If there's an error, the search box stays disabled. Users are more OK with something being disabled than with something that appears to work but doesn't. It also makes the search slightly faster.

Pulling the list on page load makes sure the code responds fast when the user starts a search

You could defer it until they start typing in the input box. Still better than waiting until they click search, and doesn't waste bandwidth for the average case.
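A sketch of that deferral: memoize the fetch so it fires at most once, on first interaction (the element id and RSS URL are hypothetical):

```javascript
// Returns a function that runs `load` on the first call only and
// hands every later caller the same result.
function lazyOnce(load) {
  let result = null;
  return () => (result = result || load());
}

// Usage (hypothetical ids/URLs):
// const getIndex = lazyOnce(() => fetch('/rss.xml').then((r) => r.text()));
// searchBox.addEventListener('focus', getIndex, { once: true });
```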

I guess, but it then needs to load every page referenced in the RSS so you're preloading only one file of a much larger count. And you're adding the weight of the RSS file for every visitor, even if they never do a search.

The user should be able to consent (or not) to expensive operations like that.

If they have js enabled they already did.

True, but there is some validity to the point. On a large site you might burn through a lot of your mobile data allotment with one search. That's not intuitive or obvious. Might be nice to implement limits and a "more" button in the code so it doesn't scrape the whole site at once.

Yes, I want this answer.

I wrote my own site generator a while back and have been considering adding a simple JS search where the client would download the index, which is generated at compile time.

Even for a small blog the index might become a decent size, though. I thought maybe a heavily pruned TF-IDF index could be workable?

>the client would download the index

That doesn't sound like too bad of an idea, given the state of js these days.

Yes, you can cut irrelevant terms: those with a low tf-idf score (the, a, is, are...), or you could filter out stop words. Or both. Personally, I would never remove low-scoring terms. Consider the state of the word "is": it's mostly irrelevant, unless you are searching for information about a certain terrorist group.
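For pruning, the standard scoring is tf(t, d) * log(N / df(t)): terms that appear everywhere get a score near zero. A minimal sketch (tokenization and names are illustrative, not any particular library's API):

```javascript
// tf-idf of `term` in `doc`, relative to the corpus `docs`.
// Common terms (high document frequency) score near zero and are
// candidates for pruning from a client-side index.
function tfidf(term, doc, docs) {
  const tokens = (d) => d.toLowerCase().split(/\W+/).filter(Boolean);
  const tf = tokens(doc).filter((t) => t === term).length;
  const df = docs.filter((d) => tokens(d).includes(term)).length;
  return df === 0 ? 0 : tf * Math.log(docs.length / df);
}
```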

One small index type (as in number of bytes) is a trie. Another is a Huffman tree. Thanks to shared prefixes, they effectively store each term less than once.

The disadvantage(?) of a trie is that you can't rank by relevance.

There is no such disadvantage. Scoring usually comes after the index (tree) lookup. Once you have terms you can weight them. The trie is just a lookup mechanism.
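A sketch of that separation: the trie only maps terms to posting lists, and any scoring happens on the postings it returns (structure and names are illustrative):

```javascript
// Minimal trie used as a term -> posting-list lookup.
class Trie {
  constructor() {
    this.children = {};
    this.postings = []; // doc ids for the term ending at this node
  }
  insert(term, docId) {
    let node = this;
    for (const ch of term) {
      node = node.children[ch] = node.children[ch] || new Trie();
    }
    node.postings.push(docId);
  }
  lookup(term) {
    let node = this;
    for (const ch of term) {
      node = node.children[ch];
      if (!node) return [];
    }
    return node.postings; // rank/score these afterwards
  }
}
```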

DatoCMS (https://www.datocms.com) just released search for static websites in alpha. When the static site gets published, it automatically spiders it and you can integrate a JS widget on your site.

Well, this seems pretty useful. Certainly worked well when I tested it on the Web Tiger Team site.

However, I do wonder whether the RSS feed aspect might be limiting here. I mean, how often do people manually generate an RSS feed on a static website?

I'd say the answer is 'pretty much never'. They're annoying to create/edit by hand or with a static site builder, and hence they're usually quite rare on the types of websites this tool was made for.

So while I think the widget works well when used properly, I feel like the vast majority of static sites won't be able to use it simply because they don't have an RSS feed.

It should be very easy to automate RSS feed generation with any consistently laid out static site (even if it's hand written).

If you're using an SSG (I presume most are), there are things like jekyll-feed[0]

[0] https://github.com/jekyll/jekyll-feed

Generating the annoying-to-edit support files is the whole point of a static site generator, and many do generate RSS, Atom, and site maps [0].

Repurposing RSS in this way is clever, but a better solution would be to modify the static site generator to emit a single file with the entire site's content that the JavaScript could fetch and search in one hit.

Of course, once a site reaches a certain size, client side solutions become impractical.

[0] https://sheep.horse/rss.xml for example

Depends on whether you are using a static site builder or not. For example, Hugo will automatically create an RSS feed of your content with very little config.
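As a sketch, Hugo emits /index.xml for the home page by default; the relevant config.toml stanza just makes that explicit:

```toml
# Hugo's default output formats for the home page include RSS.
[outputs]
  home = ["HTML", "RSS"]
```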

That's really clever and quite possibly the fastest search I've ever used.

Could see this being extremely helpful - thanks ("jättestora tackar", Swedish for "huge thanks") for sharing!

> That's really clever and quite possibly the fastest search I've ever used.

The way it works is by fetching an RSS list of links, downloading each linked page, and then matching a search string against the non-HTML content. If you're searching for something after the pages have all been downloaded it'll be very quick, but if the links in the RSS feed take a long time to load, or you do a search immediately after the page has loaded, it'll feel really slow.

Also if someone used it on a mobile site it'd require the user to download all of the content for the website in order to search. They might not be too happy about that.

Doesn't adding "site:example.com" to google search do the same thing (assuming google has indexed your static site)?

Did not find whole words on the same page. Not there yet.

It's not working on your page: I searched "Ycombinator people" but got no results, though it's present on the links page.

Seems the links page is not in the RSS feed; it only searches pages referenced there. https://www.webtigerteam.com/johan/rss_en.xml

Which is fairly common with a blog: stories are in the RSS feed, pages are not. Might be good to point out that caveat so users can choose to add those pages to the feed.

thanks for this!

GPL'd code so not usable by most commercial projects

Given that it's relatively standalone, I don't see the issue. It wouldn't attach the GPL to your other code; there's no linking, etc. You would have to contribute any changes to this specific code back, but there's no secret sauce there.

The bigger limiter, mentioned in other comments, is that it downloads every page in the RSS feed to do the search. So it's workable for a smallish niche site with a low page count, but not really workable beyond that.

I feel like this is the browser equivalent of dynamic linking.

That said, there's no way (yet) to DRM or otherwise prevent the user from obtaining the entire front end source, so it might be moot.

Side question: what am I allowed and not allowed to do with webpage sources under US copyright law? If I cache the site and open a local copy, have I illegally reproduced a copyrighted work? What if I burned it to a CD? What if I e-mail a copy to a friend? What if I host my own version?

IANAL, but...

> cached locally

That falls under US fair use laws I believe

> burned a CD

Again, fair use I believe

> emailed a copy / hosted own version

Likely is infringing, but not definitively. Distribution is generally not fair use, but there are some exceptions.

IANAL, but...

You can't say anything definitively falls under fair use. Fair use is a case by case determination by a court limited to each specific instance (so if you do the same exact thing a second time it could be found to no longer be fair use).

I was under the impression that ripping an MP3 from a CD constituted an illegal reproduction, so it's a little surprising to me that the law doesn't apply in the other direction.

Depends on where you live. For example, in the UK I heard it's illegal again (has gone back and forth over the years).

Do you run webpack to minify your code for download? Does combining it all into one JS file trigger the GPL's virality (is it even legal to bundle GPL'd code with MIT-licensed code)? IANAL, but I sure wouldn't want to risk it. Maybe with the LGPL you could get away with it, but the GPL seems like a really bad idea for front-end JavaScript libraries.

If it's online and not Affero GPL you don't even need to give back your improvements, provided you don't distribute your improvements to others.

Affero doesn't apply here because this is stuff that downloads and runs on the user's machine. You absolutely are distributing the code to every visitor. That makes it straight GPL.

Good! If you are commercial and you care, then offer to pay for a commercial license and support the developer.
