
Modern servers have far more compute and memory than they did 15 years ago. Before doing any optimization I would start with a naive full-text search, which will most likely be fast enough. A full code search using regex is also more powerful for users. Imagine if you could use regex when searching on GitHub or Google! Source code takes up relatively little space, and if you strip the markup/JS from web sites, they take up relatively little space too. The only real problem is educating users on how to search effectively.
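
For reference, the naive approach described here can be sketched in a few lines of Python. This is only an illustration: the corpus is a made-up in-memory dict and `naive_regex_search` is a hypothetical helper name, not anything from a real search product.

```python
import re

def naive_regex_search(docs, pattern):
    """Scan every document with the regex; return names of the documents that match."""
    rx = re.compile(pattern)
    return [name for name, text in docs.items() if rx.search(text)]

# Hypothetical tiny corpus, for illustration only.
docs = {
    "a.js": "var foo = bar;\nconsole.log(foo);",
    "b.py": "foo = 1\nbar = foo + 1",
    "c.txt": "nothing to see here",
}

# Every query reads every document, so cost grows linearly with corpus size.
hits = naive_regex_search(docs, r"var\s+\w+\s*=")
```

The appeal is that there is no index to build or keep in sync; the cost is that each query touches the whole corpus.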

I think you might be underestimating the size of the document corpus that you'd be running over.

Let's say there are 5 billion URLs with an average of 10 KiB of data each (if you take out JS/CSS/images etc.). That is about 50 TB of text, so if one server has 50 GB of RAM you would need about 1000 servers, which is very small considering Google probably has one million servers deployed. I just tried a text search for your comment on Google and it found your post! So Google is already doing full-text search, and does it in less than one second (0.69 to be precise). There are probably many reasons why they don't allow regex, one likely being that it would make it very easy to "scrape" things such as e-mail addresses, credit card numbers, etc. It would however be cool if Google allowed you to search structured data, for example find 100 recipes that have eggs in them :P Silly example, but the possibilities are endless!
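
The sizing estimate can be checked with a quick calculation. The sketch below assumes 1 GB = 10^9 bytes and that all of a server's RAM is available for document text, which ignores the OS and any other overhead.

```python
KIB = 1024
GB = 10**9
TB = 10**12

num_urls = 5_000_000_000       # 5 billion URLs (figure from the comment)
avg_doc_bytes = 10 * KIB       # 10 KiB of text per URL
ram_per_server = 50 * GB       # assumption: all RAM holds document text

corpus_bytes = num_urls * avg_doc_bytes
corpus_tb = corpus_bytes / TB

# Integer ceiling division: servers needed to hold the whole corpus in RAM.
servers_needed = -(-corpus_bytes // ram_per_server)

print(f"corpus: {corpus_tb:.1f} TB, servers: {servers_needed}")
```

With KiB taken literally this comes out slightly above the round numbers, at 51.2 TB and 1024 servers.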

But that's exactly my point -- when you get to the stage where you have 1,000 servers with 50G of RAM each, you have gotten to the point where an optimization like an inverted index is completely sensible. The design you propose has to do a full regex scan over 50 TB of RAM for every. single. user. query. For Pete's sake! This is definitely the realm where the computational costs make it worthwhile to spend engineering resources to optimize, especially if you are going to serve lots of users.

Maybe you should clarify what you meant by "naive full text search" because that usually means scanning all documents character-by-character to match the characters in the query, which is definitely not what Google does.

Even with modern computers, doing a naive regex search on 1 GiB+ of full text would take a while. Memory isn't that fast. A trigram index helps you avoid needing to read every document on every search.
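
A minimal sketch of the trigram idea, assuming a toy in-memory corpus and literal substring queries only (turning a full regex into trigram queries is more involved). The function names are hypothetical.

```python
from collections import defaultdict

def trigrams(text):
    """All 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Map each trigram to the set of documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for tri in trigrams(text):
            index[tri].add(name)
    return index

def search_literal(docs, index, needle):
    """Narrow to documents containing every trigram of needle, then verify."""
    candidates = set(docs)  # needles shorter than 3 chars fall back to a full scan
    for tri in trigrams(needle):
        candidates &= index.get(tri, set())
    # Only the surviving candidates are actually read.
    return sorted(name for name in candidates if needle in docs[name])

# Hypothetical tiny corpus, for illustration only.
docs = {
    "a.js": "var foo = bar;\nconsole.log(foo);",
    "b.py": "foo = 1\nbar = foo + 1",
    "c.txt": "nothing to see here",
}
index = build_index(docs)
result = search_literal(docs, index, "foo = bar")
```

The index only filters; the final substring check is still needed because a document can contain all the trigrams without containing the full string.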

Even a traditional RDBMS will start to become slow with datasets in the 10-100 GiB range.

Note that products like BigQuery, Snowflake, Redshift, etc. might be able to support this, but that is neither cheap nor particularly fast.

I think you can expect 1 TB/s per machine. With one thousand servers you could search the entire web in 0.05 seconds. Not all data sets are that big though. You should at least try the most naive method before applying any optimizations. Dividing everything up into words is a smart machine optimization, but it gets silly... Imagine, for example, cutting all the words out of a bunch of paper articles and writing the name of the article on the back of each piece. Then sort all the words and put them in different jars. Now when you want to find an article, you pick a piece from a word jar and just look at the back side! But now to the hard problem: what if the jar has thousands of small pieces in it?

The fastest main memory bandwidth supported on modern servers is a dual-socket AMD Epyc [0], with a theoretical maximum of 320 GB/s. One TB/s is nonsense.

That said, I approve of naive brute-force and one should always benchmark their systems against the brute force solution.

[0] https://en.wikichip.org/wiki/amd/epyc/7551
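
Plugging both bandwidth figures into the thread's numbers (a 50 TB corpus spread over 1000 servers) gives a rough lower bound on scan time per query. This assumes memory bandwidth is the only bottleneck and that one query gets the whole machine; a real regex engine would likely be slower.

```python
GB = 10**9
TB = 10**12

corpus_bytes = 50 * TB
servers = 1000

def scan_seconds(bandwidth_bytes_per_s):
    """Time for each server to stream its share of the corpus once."""
    per_server_bytes = corpus_bytes / servers
    return per_server_bytes / bandwidth_bytes_per_s

optimistic = scan_seconds(1 * TB)      # the 1 TB/s figure claimed upthread
epyc_peak = scan_seconds(320 * GB)     # the 320 GB/s theoretical peak cited above

print(f"1 TB/s: {optimistic:.3f} s, 320 GB/s: {epyc_peak:.3f} s per query")
```

Either way that is the cost of a single query, so the number of concurrent users each machine can serve is roughly the inverse of these figures.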

You also have to take the trade-offs into account! In my paper-clippings-in-a-jar example, let's say we want to find the string "var foo = bar". You take one clipping from the "var" jar, but almost all files have "var" in them... and there are also many files that have "foo" and "bar" in them! How do you sort the answers from there!? With a simple text/regex search you wouldn't have that problem!
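
One common way to handle the "var foo = bar" case is to use the jars only as a filter: intersect the document sets for each word, then run the exact string match on whatever survives. The sketch below assumes a toy corpus and hypothetical function names, and is not a claim about how any particular search engine does it.

```python
import re
from collections import defaultdict

def build_word_index(docs):
    """One 'jar' per word: map each word to the documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"\w+", text):
            index[word].add(name)
    return index

def phrase_search(docs, index, phrase):
    """Return (candidates from the index, documents that really contain phrase)."""
    candidates = set(docs)
    for word in re.findall(r"\w+", phrase):
        candidates &= index.get(word, set())
    # Exact check runs only on documents that contain every word.
    matches = sorted(name for name in candidates if phrase in docs[name])
    return sorted(candidates), matches

# Hypothetical tiny corpus, for illustration only.
docs = {
    "a.js": "var foo = bar;",
    "b.js": "var bar = 1; var foo = 2;",
    "c.js": "var baz = 3;",
}
index = build_word_index(docs)
candidates, matches = phrase_search(docs, index, "var foo = bar")
```

Here "var" alone matches every file, but the intersection with "foo" and "bar" drops `c.js`, and the final string check drops `b.js`.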
