That is impossible, and therefore wrong (I'm wrong, please see below). To know if a search is unique, as in Google has never seen them before, Google must be able to decide if a query it receives was seen before or not. Even if we assume Google needed only one bit for each message it has ever seen, and assuming it only saw 15% of new messages each day since its creation more than 20 years ago, it would need to store more than 2^1471 bits.
What could be true is that each day 15% of all searches are unique on that day.
Edit: I'm wrong. The 15% of completely unique messages per day are in regards to the messages per day, and not in regards to all messages it has ever seen, therefore exponential growth doesn't apply. To see that, assume Google just received one search query each day for 20 years but it was unique random gibberish, then Google could easily save that even though 100% of all messages per day are unique.
(And they don’t even need to store all searches for all time for this, thanks to Bloom filters.)
Assume Google receives 1 trillion queries per year, and has been around for 20 years. Using a bloom filter you can achieve a 1% error rate with ~10 bits per item. So a 200 terabyte bloom filter would be more than sufficient to estimate the number of unique queries.
If you have a list of 20 trillion query strings, and each query string is on average < 100 bytes, you're looking at a three line MapReduce and < 1 PiB of disk to create a table which has the frequency of every query ever issued. Add a counter to your final reduce to count how often the # times seen is 1.
A bloom filter is the most appropriate data structure for this use-case. How is it overkill when it uses less space and is faster to query?
To find the fraction of queries each day which are new, you would want to add a second field to your aggregation (or just change the count), the first date the query was seen. After you get the first date each query was seen, sum up the total number of queries first seen on each date, compare it to the traffic for each date.
You could still hand the problem to a new hire (with the appropriate logs access), expect them to code up the MapReduce before lunch (or after if they need to read all the documentation), farm out the job to a few thousand workers, and expect to have the answer when you come back from lunch.
I haven't thought this through, but take all the queries as they're made and create a bloom filter for every hour of searches. Depending when this process was started, an analytics group could then take a day of unique searches, and run them against this probabilistic history, and get a reasonable estimation with low error. Although the people who work on this sort of thing probably know it far better than I.
The real question though might be assuming the 15% is right, do we care about those 15%, are they typo's that don't merge, are they semantically different, are they bots search for dates or hashes, etc.
Of course, Google knows better but to treat every search query literally. Slight deviations and synonyms work for the majority of the people, even if us techies highly oppose them and look for alternative solutions (like DDG) that still treat our searches quite literally.