Abusing AWS Lambda to make an Aussie search engine (boyter.org)
129 points by boyter on Sept 27, 2021 | 28 comments



This is an awesome writeup, thanks for sharing!

A couple thoughts occurred to me as I read the post:

- Lambda functions deployed using Docker images can be up to 10GB.[1] Would that change your math here? I'm curious what the tradeoff, on both cost and performance, would be vs parallelizing more function executions that each search a smaller dataset.

- Great notes on the anti-competitive nature of the current market. If there were an open standard for crawling, maybe we'd see more innovation here.

- Cool use of a bloom filter!

1. https://aws.amazon.com/blogs/compute/working-with-lambda-lay...


Maybe. Based on my experiments with Lambda, I doubt it has enough CPU to deal with the additional space. At the current size it's bumping against the limits of Lambda as is. It might be possible to switch to something like C++/Rust/C/Zig to help with this, however.


Neat!

> Incidentally searching around for prior art I found this blog post https://www.morling.dev/blog/how-i-built-a-serverless-search... about building something similar using lucene, but without storing the content and only on a single lambda.

See also:

Sqlite FTS (no server): https://news.ycombinator.com/item?id=27016630

EdgeSearch (Workers): https://github.com/wilsonzlin/edgesearch

EdgeSql (Workers): https://github.com/lspgn/edge-sql

InfiniCache (Lambda): https://news.ycombinator.com/item?id=25788893

quickwit (Lambda): https://news.ycombinator.com/item?id=27074481

> I had been working on a bloom filter based index based on the ideas of bitfunnel which was developed by Bob Goodwin, Michael Hopcroft, Dan Luu, Alex Clemmer, Mihaela Curmei, Sameh Elnikety and Yuxiong He and used in Microsoft Bing https://danluu.com/bitfunnel-sigir.pdf.

Reminds me of wavelet trees: https://alexbowe.com/succinct-debruijn-graphs/ https://alexbowe.com/wavelet-trees/
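
A minimal sketch in Go of the bitfunnel-style idea, for anyone unfamiliar with it: each document gets a fixed-width bit signature, each query term hashes to a few bit positions, and a document can only match if all of those bits are set, trading a small false-positive rate for a very cheap intersection. The signature width and hash count here are illustrative guesses, not what bonzamate actually uses.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // Illustrative signature width; not the value used by the post's index.
    const signatureBits = 256

    type signature [signatureBits / 64]uint64

    // positions hashes a term to three bit positions in the signature.
    func positions(term string) [3]uint64 {
        var out [3]uint64
        for i := range out {
            h := fnv.New64a()
            fmt.Fprintf(h, "%d:%s", i, term)
            out[i] = h.Sum64() % signatureBits
        }
        return out
    }

    // add sets the bits for a term in a document's signature at index time.
    func (s *signature) add(term string) {
        for _, p := range positions(term) {
            s[p/64] |= 1 << (p % 64)
        }
    }

    // matches reports whether every bit required by the query terms is set,
    // so it can return false positives but never false negatives.
    func (s *signature) matches(terms []string) bool {
        for _, t := range terms {
            for _, p := range positions(t) {
                if s[p/64]&(1<<(p%64)) == 0 {
                    return false
                }
            }
        }
        return true
    }

    func main() {
        var doc signature
        for _, term := range []string{"vegemite", "toast", "breakfast"} {
            doc.add(term)
        }
        fmt.Println(doc.matches([]string{"vegemite", "toast"})) // true
        fmt.Println(doc.matches([]string{"crocodile"}))         // almost certainly false
    }

Searching is then just ANDing query bits against many such signatures, which is part of why this style of index splits so naturally across many Lambda invocations.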


The writeup is very detailed and entertaining, and it's refreshing to see some competition to Google!

I tried 2 queries. Vegemite worked great:

https://bonzamate.com.au/?q=vegemite

The (unofficial) national anthem, less so.

https://bonzamate.com.au/?q=%22men+at+work%22+%22down+under%...

Have you seen the CommonCrawl dataset? It's a great source if you're looking for a full-web index to do some Big Data analysis on, and scale up your search engine to something closer to DuckDuckGo. The CommonCrawl data is 320 TB in size though, so I hope you've got plenty of free disk space!

https://commoncrawl.org/
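
To give a sense of the shape of the dataset, here is a rough Go sketch that lists one crawl's WARC file paths; the crawl ID is only an example from around the time of the post, so check commoncrawl.org for the current listings.

    package main

    import (
        "bufio"
        "compress/gzip"
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        // CC-MAIN-2021-39 is an example crawl ID; newer crawls are listed on commoncrawl.org.
        url := "https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-39/warc.paths.gz"

        resp, err := http.Get(url)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        gz, err := gzip.NewReader(resp.Body)
        if err != nil {
            log.Fatal(err)
        }
        defer gz.Close()

        // Each line is the path of one compressed WARC file (roughly a gigabyte each);
        // a single crawl lists tens of thousands of them.
        scanner := bufio.NewScanner(gz)
        count := 0
        for scanner.Scan() {
            if count < 3 {
                fmt.Println(scanner.Text())
            }
            count++
        }
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%d WARC files in this crawl\n", count)
    }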

You can also add "Show HN" to your titles in future, if it's a project you worked on. Then it'll show up in the "show" subset (linked from the orange top bar) and usually attract more encouragement because Hacker News readers usually really like original authors coming to hang out :)


Yes, I still need to fix the highlighting for it, which I think is what's causing the men at work issue. It's something I am aware of and will resolve when I get some time.

I have seen Common Crawl. It's one of those things I should look into more when I get the time. Honestly, that's the limiting factor for everything I do these days.


Having created a data crawler too, I find the discussion about gated websites behind CloudFlare painfully true. Unless you're a big player, you just can't enter this market. It's a real shame.

Supposedly stating you're a custom crawler [1] (like how Google has GoogleBot) and not a generic bot helps with this. I'm somehow very doubtful of this.

[1] JGC responded to a tweet a long time ago, can't find it now


CloudFlare is cancer; they've blocked off most of the internet and deliberately block non-Google crawlers. There have been discussions on it here before.


It depends on how their users configure the settings. You can use Cloudflare and effectively turn all blocking off very easily. There is a reason their customers tend to leave it on, or in my case, increase the default security level. Sorry Tor users, but 100% of the Tor traffic I've gotten has been malicious WP vulnerability scanning.


Okay, I understand, but I am so sick of identifying stoplights, trucks, and airplanes. It would be nice to have someone say, "Oh, you'd like to read my interesting information anonymously? Go right ahead!"


For webmasters, the damage from these rogue bots, Tor, and hack attempts can bring down the whole business. I think more and more smaller websites will be opting for Cloudflare in the coming days.


Webmasters? What is this, 2010?

But on a serious note - you can't exactly rely on Cloudflare to not bring down a business. For me, it is a very nice convenience to throw up in front of my WP sites for caching and to help limit the bad traffic I get, but that's about it.


Great writeup. Writing your own search index and crawler from scratch is a big undertaking, but sounds like the sort of thing you might have to do due to the constraints of Lambda. For the search index, the blog you link to (https://www.morling.dev/blog/how-i-built-a-serverless-search...) does use Apache Lucene within Lambda (compiled into a native binary via Quarkus and GraalVM to make startup time viable, although not a distributed index and doesn't need to be because of its relatively small size), and for the crawler, it sounds like you were nearly there with Colly (except for memory issues).
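
For readers who haven't used Colly, this is roughly the shape of crawler being discussed; the domain, depth, and parallelism limits below are illustrative guesses rather than the author's actual configuration.

    package main

    import (
        "fmt"
        "log"

        "github.com/gocolly/colly/v2"
    )

    func main() {
        // Illustrative limits only; the post doesn't spell out the real crawler's settings.
        c := colly.NewCollector(
            colly.AllowedDomains("example.com.au"),
            colly.MaxDepth(3),
            colly.Async(true),
        )

        // Cap concurrency so memory use stays bounded on large sites.
        if err := c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2}); err != nil {
            log.Fatal(err)
        }

        // Queue up every link found on each fetched page.
        c.OnHTML("a[href]", func(e *colly.HTMLElement) {
            e.Request.Visit(e.Attr("href"))
        })

        // This is where page content would be handed off to the indexer.
        c.OnResponse(func(r *colly.Response) {
            fmt.Println("fetched", r.Request.URL, len(r.Body), "bytes")
        })

        if err := c.Visit("https://example.com.au/"); err != nil {
            log.Fatal(err)
        }
        c.Wait()
    }

Keeping depth and parallelism bounded like this is one way to rein in the memory footprint, though it doesn't solve every memory issue on very large crawls.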

On the non-tech side:

>"In February 2021 the Australian Greens Party called for a publicly owned search engine to be created and be independent and accountable like the ABC."

Which is an interesting idea, given search has become something of a utility.


I like this idea and I do wonder how it might fare for specialized niche search engines too, which likely would have much smaller indexes. I'm always looking for "simpler" architectures for small but custom web search, and I like what the author put together here!


I'm actually looking into this myself. Adding facets to what's already there isn't that hard, and for smaller sites I imagine this could be a very inexpensive way to add search, albeit with a slightly higher cost to add or remove items.
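
As an illustration of how simple the facet side can be, here is a hypothetical Go sketch of counting facet values over a result set; the document fields are made up for the example.

    package main

    import "fmt"

    // Document is a hypothetical indexed item with a couple of facetable fields.
    type Document struct {
        Title    string
        Category string
        Domain   string
    }

    // facetCounts tallies how many matching documents fall into each value of a field.
    func facetCounts(docs []Document, field func(Document) string) map[string]int {
        counts := make(map[string]int)
        for _, d := range docs {
            counts[field(d)]++
        }
        return counts
    }

    func main() {
        results := []Document{
            {Title: "Vegemite toast", Category: "food", Domain: "example.com.au"},
            {Title: "Meat pie review", Category: "food", Domain: "another.com.au"},
            {Title: "Cricket scores", Category: "sport", Domain: "example.com.au"},
        }

        // Facets are computed over the documents that matched the query,
        // so they can be rendered as filters next to the results.
        fmt.Println(facetCounts(results, func(d Document) string { return d.Category }))
        fmt.Println(facetCounts(results, func(d Document) string { return d.Domain }))
    }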


oooo, neat! — reminds me of some similar work I did building a search engine which went a step further and ran on S3.

https://blog.oxplot.com/the-art-of-barebackness/


This works brilliantly. I tried a few searches and it came up with pretty good results, and was quick too! Do you mind providing instructions on your site as to how to make it your address bar search on various browsers? Very impressive.


Thanks. I find it useful for finding Australia-specific things I am interested in.

I think it should be picked up by your browser already as there is an OpenSearch definition in there. If you are using Chrome and visit chrome://settings/searchEngines you should see it already under other search engines, but the query URL needed is https://bonzamate.com.au/?q=%s
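
For anyone wanting to wire up the same thing on their own site, here is roughly what serving an OpenSearch description looks like in Go; the names and URLs are stand-ins, not copied from bonzamate.

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    // A rough OpenSearch description document; ShortName and URLs are illustrative.
    const openSearchXML = `<?xml version="1.0" encoding="UTF-8"?>
    <OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
      <ShortName>Example Search</ShortName>
      <Description>Example site search</Description>
      <Url type="text/html" template="https://example.com.au/?q={searchTerms}"/>
    </OpenSearchDescription>`

    func main() {
        // Browsers discover this via a <link rel="search"
        // type="application/opensearchdescription+xml" href="/opensearch.xml">
        // tag in the site's HTML head, then offer it as an address-bar search engine.
        http.HandleFunc("/opensearch.xml", func(w http.ResponseWriter, r *http.Request) {
            w.Header().Set("Content-Type", "application/opensearchdescription+xml")
            fmt.Fprint(w, openSearchXML)
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }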


You're right, it does automatically go there!


Very cool and very detailed write up!


Rather than compiling the index into the source, there are two options that jump out:

1. Put it in a layer instead. A bit more space and easier to maintain.

2. Put it in EFS and cache to /tmp. You get almost a half gig of temp storage which persists across executions.
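
A rough sketch of option 2 in Go, assuming an EFS access point mounted at /mnt/index; the paths and file name are made up for illustration.

    package main

    import (
        "io"
        "log"
        "os"
    )

    const (
        efsIndexPath = "/mnt/index/search.idx" // hypothetical EFS mount point and file name
        tmpIndexPath = "/tmp/search.idx"       // local copy inside the execution environment
    )

    // loadIndex copies the index from EFS to /tmp on a cold start and reuses
    // the local copy on warm invocations.
    func loadIndex() (string, error) {
        if _, err := os.Stat(tmpIndexPath); err == nil {
            return tmpIndexPath, nil // warm start: already cached locally
        }

        src, err := os.Open(efsIndexPath)
        if err != nil {
            return "", err
        }
        defer src.Close()

        dst, err := os.Create(tmpIndexPath)
        if err != nil {
            return "", err
        }
        defer dst.Close()

        if _, err := io.Copy(dst, src); err != nil {
            return "", err
        }
        return tmpIndexPath, nil
    }

    func main() {
        path, err := loadIndex()
        if err != nil {
            log.Fatal(err)
        }
        log.Println("index available at", path)
    }

The /tmp copy only survives while the execution environment stays warm, so cold starts still pay the copy cost once.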


Based on my tests, I don't think Lambda has enough CPU to deal with that. The current size I am using is pushing what it is able to process. That's just based on my tests, however.

I do know that the same index on a physical machine can process 200 million items in ~100ms, just to give an idea. On Lambda it was more like 100,000.

Keep in mind I stopped looking at that point. If someone is able to prove otherwise I'd be happy to switch over to this, but then again I don't have a real need currently.


What is the maximum memory size you tested the AWS Lambda functions with? From your article it seems you did try with 1024MB. Did you try with more memory as well? In case you don't know, you need at least 1769MB of memory to get a full vCPU [1].

When running locally on your physical machine, is your code only utilizing a single CPU core or does it make use of multiple cores? If it's the latter one, increasing the memory for the AWS Lambda function even further should improve performance as well.

Also if you hit a performance plateau with a memory size of less than 1769MB for your AWS Lambda functions, you could be bottlenecked by something else. I can imagine details like memory bandwidth playing a role there as well.

[1]: https://docs.aws.amazon.com/lambda/latest/dg/configuration-f...


I tried up to the maximum a few times. The code is single-threaded. It is indeed also memory-bandwidth bottlenecked, which I think might be the main issue with it in Lambda.

My guess, based on my tests, is that memory access and CPU are the limiting factors in Lambda for this, but because it encourages scaling out that's less of an issue.


How do you handle typos and spelling mistakes? I wrongly wrote "steve irwin cocrodile" and got zero results. Then I corrected to "steve irwin crocodile" and got many results.


Very impressive! I tried variations of “bom four day forecast” but got no hits - is bom.gov.au indexed?


It's because they don't have an HTTPS cert. I limited the crawl to just HTTPS in the interest of keeping everything easier to work with.

I might break that rule for bom though. Although I wish they would just let me add a cert for them. I'll even do it for free if they let me!


The newer weather.bom.gov.au is behind HTTPS. It's fairly JavaScript-heavy though, which might make it hard to index.


I’ll give that one a shot.



