Hacker News new | past | comments | ask | show | jobs | submit login

> 15% of Google searches per day are unique, as in, Google has never seen them before.

That is impossible, and therefore wrong (I'm wrong, please see below). To know if a search is unique, as in Google has never seen them before, Google must be able to decide if a query it receives was seen before or not. Even if we assume Google needed only one bit for each message it has ever seen, and assuming it only saw 15% of new messages each day since its creation more than 20 years ago, it would need to store more than 2^1471 bits.

What could be true is that each day 15% of all searches are unique on that day.

Edit: I'm wrong. The 15% of completely unique messages per day are in regards to the messages per day, and not in regards to all messages it has ever seen, therefore exponential growth doesn't apply. To see that, assume Google just received one search query each day for 20 years but it was unique random gibberish, then Google could easily save that even though 100% of all messages per day are unique.

This is somewhat a faulty analysis. One could easily use a high accuracy bloom filter to store whether a search has definitely not been seen before, and that would be an estimate on the lower bound of the error margin.

Yup. This was actually an interview question I got from a former Google search engineer.

Where are you getting these numbers? Google says they get ~2 trillion searches per year. 40 trillion searches over 20 years (way too many) would be 2^44 searches. https://searchengineland.com/google-now-handles-2-999-trilli...

(And they don’t even need to store all searches for all time for this, thanks to Bloom filters.)

The whole point was that 2^1471 is wrong.

It is roughly 1.15^(365*20). That it is wrong was clear from its size. I wanted to use it's falseness to show that the assumptions are incorrect. Which they are, just not how I understood initially.

How are you computing that number? It's definitely wrong.

Assume Google receives 1 trillion queries per year, and has been around for 20 years. Using a bloom filter you can achieve a 1% error rate with ~10 bits per item. So a 200 terabyte bloom filter would be more than sufficient to estimate the number of unique queries.

A Bloom filter is just way overkill.

If you have a list of 20 trillion query strings, and each query string is on average < 100 bytes, you're looking at a three line MapReduce and < 1 PiB of disk to create a table which has the frequency of every query ever issued. Add a counter to your final reduce to count how often the # times seen is 1.

uh, is this sarcasm?

A bloom filter is the most appropriate data structure for this use-case. How is it overkill when it uses less space and is faster to query?

Actually the bloom filter was just an approachable example. There are much more clever and space efficient solutions to this problem, such as HyperLogLog [1] (speculating purely based on the numbers in that article, it looks like a few megabytes of space would be far more than sufficient). See the Wikipedia page on the "Count-distinct problem" [2].

1: https://en.wikipedia.org/wiki/HyperLogLog 2: https://en.wikipedia.org/wiki/Count-distinct_problem

My initial approach was also technically wrong; it tells you the fraction of queries which happen once.

To find the fraction of queries each day which are new, you would want to add a second field to your aggregation (or just change the count), the first date the query was seen. After you get the first date each query was seen, sum up the total number of queries first seen on each date, compare it to the traffic for each date.

You could still hand the problem to a new hire (with the appropriate logs access), expect them to code up the MapReduce before lunch (or after if they need to read all the documentation), farm out the job to a few thousand workers, and expect to have the answer when you come back from lunch.

I don't think it's necessarily impossible to calculate. Using probabilistic data structures arranged in a clever way, it's likely possible to calculate with some degree of accuracy.

I haven't thought this through, but take all the queries as they're made and create a bloom filter for every hour of searches. Depending when this process was started, an analytics group could then take a day of unique searches, and run them against this probabilistic history, and get a reasonable estimation with low error. Although the people who work on this sort of thing probably know it far better than I.

The real question though might be assuming the 15% is right, do we care about those 15%, are they typo's that don't merge, are they semantically different, are they bots search for dates or hashes, etc.

I believe that they're unique in a sense that nobody has typed in that exact query previously.

Of course, Google knows better but to treat every search query literally. Slight deviations and synonyms work for the majority of the people, even if us techies highly oppose them and look for alternative solutions (like DDG) that still treat our searches quite literally.

Applications are open for YC Winter 2020

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact