To Break Google’s Monopoly on Search, Make Its Index Public (bloomberg.com)
859 points by JumpCrisscross on July 15, 2019 | hide | past | favorite | 597 comments

Ex-Google-Search engineer here, having also done some projects since leaving that involve data-mining publicly-available web documents.

This proposal won't do very much. Indexing is the (relatively) easy part of building a search engine. CommonCrawl already indexes the top 3B+ pages on the web and makes it freely available on AWS. It costs about $50 to grep over it, $800 or so to run a moderately complex Hadoop job.

(For comparison, when I was at Google nearly all research & new features were done on the top 4B pages, and the remaining 150B+ pages were only consulted if no results in the top 4B turned up. Running a MapReduce over that corpus was actually a little harder than running a Hadoop job over CommonCrawl, because there's less documentation available.)

The comments here that PageRank is Google's secret sauce also aren't really true - Google hasn't used PageRank since 2006. The ones about the search & clickthrough data being important are closer, but I suspect that if you made those public you still wouldn't have an effective Google competitor.

The real reason Google's still on top is that consumer habits are hard to change, and once people have 20 years of practice solving a problem one way, most of them are not going to switch unless the alternative isn't just better, it's way, way better. Same reason I still buy Quilted Northern toilet paper despite knowing that it supports the Koch brothers and their abhorrent political views, or drink Coca-Cola despite knowing how unhealthy it is.

If you really want to open the search-engine space to competition, you'd have to break Google up and then forbid any of the baby-Googles from using the Google brand or google.com domain name. (Needless to say, you'd also need to get rid of Chrome & Toolbar integration.) Same with all the other monopolies that plague the American business landscape. Once you get to a certain age, the majority of the business value is in the brand, and so the only way to keep the monopoly from dominating its industry again is to take away the brand and distribute the productive capacity to successor companies on relatively even footing.

I think it is possible to make a way, way better search engine, because Google Search is no longer as good as it used to be, at least for me.

I can no longer find anything of remotely good quality; I discover new, quality stuff from social media like Twitter and HN.

The search results seem to be too general and too mainstream. Nothing new to discover, just a shortcut to a few websites: Reddit and StackOverflow for more techie things, Wikipedia and a few mainstream news websites for the rest.

I usually end up searching HN, Reddit or StackOverflow directly, as the resulting quality is better and I can easily get specific. Getting specific is harder on Google because it just omits or misinterprets my search query keywords quite often.

The reason for that is that Google is building for a mainstream audience, because the mainstream (by definition) is much bigger than any niche. They increase aggregate happiness (though not your specific happiness) a lot more by doing so.

It's probably possible to build a search engine for a specific vertical that's better than Google. However, you face a few really big problems that make this not worthwhile:

1) Speaking from experience, it's very difficult to define what "better" means when you don't have exemplars of what queries are likely and what the results should be. The reason search engines are a product is that they let us find things we didn't know existed before; if we don't know they exist, how can we tweak the search engine to return them?

2) People go to a search engine because it has the answers for their question, no matter what their question is. If you had a specific search engine for games, and another for celebrities, and another for flights, and another for hotels, and another for books, and another for power tools, and another for current events, and another for technical documentation, and another for punditry, and another for history, and another to settle arguments on the Internet, then pretty soon you'd need a search engine to find the appropriate search engine. We call this "Google", and as a consumer, it's really convenient if they just give us the answer directly rather than directing us to another search engine where we need to refine our query again.

3) Google makes basically 80% of their revenue from searches for commercial products or services (insurance, lawyers, therapists, SaaS, flowers, etc.) The remainder is split between AdSense, Cloud, Android, Google Play, GFiber, YouTube, DoubleClick, etc. (may be a bit higher now). Many queries don't even run ads at all - when was the last time you saw an ad on a technical programming query, or a navigational query like [facebook login]? All of these are cross-subsidized by the commercial queries, because there's a benefit to Google from it being the one place you go to look for answers. If you build a niche site just to give good answers to programming queries or celebrity searches or current events, there's no business model there.

> It's probably possible to build a search engine for a specific vertical that's better than Google.

Funny, I don't disagree with this, but my perception has been that Google seems to detect when I've switched roles from one type of programmer to another. I don't know if that's organic from the topics I'm looking up or not, but if I'm looking up a generic string search, it seems to return whatever language I've been searching for recently. (very recently in fact)

My point is, it seems like the search engine intuitively understands my "vertical" already. Maybe it's just because developer searches are probably pretty optimized.

I think it's totally possible; two examples already:

Google Ads lets (or used to let?) you target by "behaviour" vs "in-market". They can tell the difference between someone who is passionate about beds, maybe involved in the bed business (behaviour), and the people who are making the once-in-a-decade purchase of a bed (in-market).

Google can tell devices apart on the same Google account and keep search threads together. I might be programming on my desktop, making engineering searches, while at the same time I'm googling memes on my phone; both logged into the same account.

> Speaking from experience, it's very difficult to define what "better" means when you don't have exemplars of what queries are likely and what the results should be.

Better is a search engine that takes your queries more literally. This is what everybody means when they say Google used to be better: the query keywords and no second-guessing.

When you insist on Google using verbatim mode or something, you often get no results. Which is bullshit, because I remember that 10 years ago queries like these had me plowing through the results, so many that you actually had to refine the query. You can't do that in Google any more; at least it's not refining, it's more like re-wording and re-rolling the dice. It all feels very random and you don't get a feel for what's out there.

I mean sure there is a place for a search engine like this, if it works well. And in its own way, Google works well.

I sometimes do want my query to be loosely interpreted like I'm an idiot, and I head straight for the Google. Ever since I saw the "that guy wot gone painted them melty clocks"-meme, for certain types of queries I have indeed found that if I formulate my question like I got brain damage, I get superior results. Because that is the kind of audience Google wants you to be.

But sometimes you don't feel like the lowest common denominator and you don't want to be treated as such. And there should be a place for that, too.

Very interesting perspective. I completely understand your point. It used to be a tool; now it is more like a system with a mind of its own. I might need both.

Why do you say there is no business model in a search niche? StackOverflow and plenty of listing sites (Tripadvisor, Yelp, Zillow, Capterra to name a few) have been successfully built on this exact premise, and the user experience of searching for restaurants, real estate or software on these sites is usually much better than searching directly on Google, due to the availability of custom filters and the amount of domain-specific metadata that the global search engines cannot read. While it's true that most of these sites heavily rely on SEO to drive inbound traffic from the big G, there is no doubt that they are perfectly viable businesses.

StackOverflow and those other sites aren't search engines. They may have search engines in them but not many people use them (the only time I reach StackOverflow, booking.com etc is via search engine referral). They're user content hosting and curation sites.

Technically you are correct, in the sense that they do not crawl the web like Google or Bing do. But from a user perspective, they do provide a very useful service of aggregation, discovery and comparison of structured data that is way more effective than using Google search queries, if you know the type of information you are looking for.

It's the corpus that matters, mostly. The StackExchange sites are Q&A-formatted, and with an SKG (semantic knowledge graph, such as in Solr) you can do topic extraction on questions OR answers, which then leads to being able to match other answers (with links) to other questions, among other things. With related topics, many other things come to life.

Sure, they have a business reason to do exactly what they do but I think as people grow up they specialize and the general stuff that fits everybody becomes useless. Google tries to personalize search results but that so far yielded echo chambers, not personalized discoveries.

I can't get better products by searching Google, I can get the best-spammed products or most promoted products only.

The fact that I am getting low-quality service while Google is printing money means that there is a place for a good service, and if that service cannot emerge due to Google's practices, it probably means that the regulators need to take action.

Or maybe the search is dead, long live social media.

The gist is: I am not happy with a service, but the company that makes that product makes a lot of money. I can't tell if I am an anomaly or if other people feel the same way, because Google is a monopoly; maybe the regulators should make it possible to compete with Google and see if there's space for a better service.

Yes yes, I am the product but I am the product only if I am happy with the stuff I'm getting in return.

... in return for you being the product? Haha. I don't think Google sees their end of that "transaction" being an actual transaction. You're an individual, and Google doesn't deal with those.

How would google's practices stop me from creating a search engine?

Keep in mind that when Google started, Yahoo! was the big player, and Google overtook them by simply being better.

Everything turns into an echo chamber eventually.

> navigational query like [facebook login]

Definitely have seen malicious ads for "facebook login", though that was probably 2016 or 2017.

I see comments like this all the time. Am I alone in that search results, for me, have gotten significantly _better_ since a couple years ago?

I can't help but think it's partially due to people using tools _specifically designed_ to make Google's job harder (FF SandBoxes, uBlock, etc) and not understanding the implications of using them... and then blaming Google for returning "bad" results.

I get a lot more seo spam than I used to, but the results are still quite good. I think we should give google some credit for that at least.

Like, a lot more seo spam though.

People have gotten really good (i.e. it's their full-time job) at "gaming" Google. That's not to say Google is especially fallible - every search engine is game-able depending on its algorithm - it's just that these people are _very_ clever.

They don't even need to be clever so much as persistent, because of the selection effects.

> specifically designed to make Google's job harder

"Better search" doesn't necessarily mean "more personalized search."

Unless you're a very average person, I'd argue it does.

Google has metrics on how much better personalisation makes search - at least when I was there, it made quality a lot better, but not, say, double the quality. I think in the early days of the company they thought personalisation would be a much bigger win than it was - it was big enough that it didn't make sense to turn it off or anything like that, and you can see it in action when people say their results are customised to the programming language they are most recently using. But most of the time it's not doing all that much - and the biggest components of it were basic stuff like location and language.

> Getting specific is harder on Google because it just omits or misinterprets my search query keywords quite often.

I have this problem too. Google often thinks that I made a typo and presents me with results for things I didn't search for or care about, and I have no way to force it to search for what I really want.

This is exactly when I switch to brain damage mode querying. You like fixing typos Google? Have some typos. You like figuring out what I really mean Google? Here, I'll formulate my query like a deranged toddler on PCP, best of luck!

Maybe it just feels more successful because it lowers my expectations. But at least you get to mash the keyboard like a maniac, do no corrections, press return and watch it just work.

It's kind of like watching Google do a "customer is always right" squirm.

If there were viable alternatives, people would shift over time.

If I type in “<name> Pentagon” on Google, the first link is LinkedIn. DuckDuckGo doesn’t even list it at all. There are countless examples where DuckDuckGo just can’t find basic information. DDG is just unreliable, beyond its silly name.

I'm always confused by this. I have ddg as the default on my home computer and Google is the default on my work. So I'm constantly using both. There aren't really any apparent differences to me in results. I'm not sure what everyone else is searching, but I search everything from how to spell a word that I should definitely know all the way to niche topics in physics.

Maybe it's because I don't have tracking enabled in Google (I'm not logged into my account when at work) and opt out of tracking where I can. Maybe this is the difference between the lack of difference I see and the huge difference so many others see. But I still don't see it as an issue because I generally find what I'm looking for with one search. Might be the third item, but that's not an issue to me.

I hear this so often that I assume something has to be different. I'm curious if others have ideas as to what it might be, or if I correctly identified them.

I use DDG as my default everywhere, and when I don't find something, I'll !g it as a bit of a last resort.

I'd estimate I'm doing that maybe 5% of the time. It seems to be even odds that I find a satisfying match, though obviously those are all the hard queries.

The hardest queries are trying to dig up details about stories in the news.

I try to use and like DDG, but the results just aren't as good. For example, it seems to be completely unaware of Docker Hub. Like, pages from that entire subdomain never show up. I can search "Docker hub" and it doesn't even show up.

For that specifically, use !dhub or !dockerhub to search the site directly. Really, the magic of DDG is bang queries.

(Search for bang queries with, not surprisingly, "!bang".)

usually I just do !g and that solves the problem ;-)

But also, thank you. I didn't realise there were so many bangs.

I agree, unfortunately the search is really really sub-par and like others said, frequently doesn’t find basic things no matter how specific the keywords I use are.

I feel it might have even been better at one stage?

Unless you're searching in Russian, DDG is mostly a skin for Bing search results anyways. The major players in the search engine space are Google, Bing/MSN, Yandex, and Baidu - with the latter two being mostly language-specific.

I find DDG has pretty acceptable or even good results most of the time.

The real power is in the "bangs", though; you can use the `!` to immediately jump to the first search result without seeing a search page, or use `!g` to switch to Google for this particular query, among others. It enables a sort of power-user usage that one wouldn't get with Google.

I don’t really get the logic; just use a good search engine in the first place?

I'm saying that DDG can be "good enough", and that not having to click around on a results page can save you time if you know what you're doing.

I understand that for some people that's not enough of a time savings to make a difference, but I know DDG well enough to be able to `!` things and almost always immediately get to a successful result. I treat it as an extension of my brain at this point.

The logic is when you've made DDG your default search for the address bar. Then it becomes the zero-stop jump-off point for all the other search engines they have !bang syntax for (of which there are thousands, I think).

I used to configure those as search keywords in Firefox (and before, Opera), which do roughly the same without the exclamation point. But on a new browser, even just configuring your favourite top 5 searches is a lot of hassle compared to just setting DDG as the default search and using their bangs.

It's for when the good search engine is the site's own page.

If I'm working on python and numpy and I want to look up `argsort`, I know I want to search the numpy page, so !numpy argsort takes me right there.

Any kind of web dev is !mdn whatever and I don't have to scroll through a dozen BS tutorials, I just get the specs.

The !bang feature I use the most is !w for wikipedia, however I don't use wikipedia enough to justify making it my default search engine on the nav bar.

Your browser can assign keywords to custom search engines so you could just type "wiki blah" to see Wikipedia or "jira 123" to load a specific ticket.

What does a viable alternative look like?

I've been using Bing for the past few months; it's not great or terrible but is it "viable" enough for people to shift to over time? Or is it not viable because it's backed by a major corporation?

I'm sure there are search quirks with each engine but I've seen issues with Google too and yet it's the "devil we know" ... so people unconsciously work around them.

I've used Bing for years now. The only time I go back to Google is if I'm searching for something super specific (normally programming related). Bing takes care of most of my search needs.

I wonder if this is due to Google possibly ignoring robots.txt and Bing (which powers DDG) honoring LinkedIn's request? https://www.linkedin.com/robots.txt

I've been using DDG almost exclusively and find its results to be better than Google's, with the exception of local businesses & maps. Google still has an advantage there.

Neither DDG nor Google returns any LinkedIn results for me unless I also add LinkedIn to the search, in which case I get the same results from both search engines.

Google knows what you want before you even ask. You might find that convenient, I find it unsettling.

I guess it’s not as bad as Facebook; at least Google doesn’t spoon feed you.

This ^ times a 1000.

Google simply has the best search product. They invest in it like crazy.

I’ve tried Bing multiple times. It’s slow, and it spams MSN ads in your face on the homepage. Microsoft just doesn’t get the value of a clean UX.

DuckDuckGo results were pretty irrelevant the last time I tried them. Nothing comes close to Google's usability. To make the switchover, it has to be much, much better than Google. Chances are that if something is, Google will buy them.

One thing to keep in mind when comparing DuckDuckGo to Google is that people do not use Google with an alternative backup in mind. When you DDG something and it fails, you can always switch to google.

But what about when Google fails? Unlike DDG, there is no culture of switching between search engines when googling. Typically, you'll just rewrite the query for google. And as rewriting the query is an entrenched part of googling, you are less likely to notice this as a failure. It is this training that's the core advantage nostrademons points out.

This right here is why I don't understand people who complain about DDG's search results. If you simply make the commitment to not use Google, for whatever reason that may be, then using DDG becomes exactly the same process of rewriting search queries until you get the thing you're looking for.

I've been using DDG exclusively since I was a contractor at Google years ago and have never had a problem finding things with it...

I don't necessarily agree. The hard part of search is building the index and differentiating _real_ promotion from the _fake_. There's a lot of SEO manipulation that Google does a good job avoiding.

Webspam is a really big problem, yes. It's very unlikely that you'd be able to catch up or keep up in that regard without Google's resources.

Building the index itself is relatively easy. There are some subtleties that most people don't think about (eg. dupe detection and redirects are surprisingly complicated, and CJK segmentation is a pre-req for tokenizing), but things like tokenizing, building posting lists, and finding backlinks are trivial - a competent programmer could get basic English-only implementations of all three running in a day.
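To make "a competent programmer could get basic English-only implementations running in a day" concrete, here's a toy sketch of all three pieces (tokenizing, posting lists with positions, backlinks). Everything here, including the tiny corpus and the function names, is illustrative, not how Google actually does it:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Naive English tokenizer: lowercase runs of word characters.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """docs: {url: (text, [outbound_urls])} -> (posting lists, backlinks)."""
    postings = defaultdict(list)   # term -> [(url, [positions])]
    backlinks = defaultdict(list)  # url  -> [urls linking to it]
    for url, (text, links) in docs.items():
        positions = defaultdict(list)
        for i, term in enumerate(tokenize(text)):
            positions[term].append(i)
        for term, pos in positions.items():
            postings[term].append((url, pos))
        for target in links:
            backlinks[target].append(url)
    return postings, backlinks

docs = {
    "a.com": ("Search engines index the web", ["b.com"]),
    "b.com": ("The web is big; search is hard", ["a.com", "c.com"]),
    "c.com": ("Cats", []),
}
postings, backlinks = build_index(docs)
print(sorted(u for u, _ in postings["search"]))  # ['a.com', 'b.com']
print(backlinks["c.com"])                        # ['b.com']
```

The hard parts the comment mentions (dedup, redirects, CJK segmentation, and doing this over billions of pages) are exactly what this sketch leaves out.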

I am not even that good of a programmer, and I also agree with you that building the index is relatively trivial. Other major issues, besides fighting spam:

- Hardware infrastructure and data center presence for extremely fast search from anywhere in the world.
- Near real-time search suggestions.
- Personalized search results based on past searches + geolocation.
- Instant search results without having to go to a website.

Just to name a few. Google Search is the gold standard of a search engine, not because it's Google or because they have been around for a long time and the brand name sticks (I am sure that helps too), but for the simple fact that no search engine is even remotely close to being as good as Google. I have tried them all, more or less, and given them a shot. They are just not good at all.

I also don't understand the hate towards Google being in charge of so many products so many people use, i.e. Mail, Maps, Chrome, Android, Docs (to name a few). It's simply because they are damn good at it. If it's a crime to make a product so good that people continue to use it, then I don't know what else people are supposed to do. It's as if we are asking Google to make shit products; I just don't understand the reasoning.

It has nothing to do with the number of products, it’s what they do with their influence over the market. See AMP and incompatibilities between Gmail & IMAP, for example.

You're concentrating on the literal interpretation of the phrase “give access to the index”. This is a non-technical article which didn’t go into details; just read it as “give access to the index & ranking”.

> Google simply has the best search product.

The best available doesn't necessarily mean the best possible. And Google is far from it, and it's getting worse, not better.

I've definitely noticed a decline in the quality of Google results over the past few years in particular. I don't know if that's because SEO has gotten control of the results or if Google's algo is shoving lower quality up higher for revenue, but it's become difficult.

Using a bit of Google-fu I'm usually able to find what I need quickly but it's still more of a hassle than it used to be.

There's exponentially more background noise than there used to be

It's easier to return the most relevant 10 results when there's only 10 thousand options than when there's 10 trillion options with 10 thousand new ones created every day.

I work at Google but not on Search.

My guess is that it's because Google Search now also has to cater to queries from Assistant. Being required to handle web, mobile, and assistant probably necessitated tradeoffs in quality of one over another.

More generally I feel like as the company gets bigger it just gets much harder to handle all the complexity and keep things focused.

I don't know why you're getting downvoted, because the quality has 100% tanked over the last few years. I agree that there may be some selection bias between us, but it's at least got some of my normie non-technical friends commenting about it, so it's not completely without merit. I have a couple of theories, one of them is also a warning.

First, I think search results at Google have gotten worse because people are not actually good at finding the best example of what they're looking for. People go with whatever query result exceeds some minimum threshold. This means when Google looks at what people "land on" (e.g. something like the last link of 5 they clicked from the search page, and then which they spend the most time on according to that page's Google Analytics or whatever), they aren't optimizing for what's best, they're optimizing for what is the minimum acceptable result. And so what's happening is years and years of cumulative "Well, I suppose that's good enough" culminating in a perceptible drop in search result quality overall.

Second, Google has clearly been giving greater weight to results that are more recent. You'd think this would improve the quality of the results which "survive the test of time" but again, Google isn't optimizing for "best" results, they're optimizing for "the result which sucks the least among the top 3-5 actual non-ad results people might manage to look at before they are satisfied". So this has the effect of crowding out older results which are actually better, but which don't get shown as much because newer results have temporal weight.

My warning is this, too, which you've surely noticed: Google search has created a "consciousness" of the internet, and in the 90s it used to be that digitizing something was kind of like "it'll be here forever" and for some reason people still today think putting something online gives it some kind of temporal longevity, which it absolutely does not have. I did a big research project at the end of the last decade, and I was looking for links specifically from the turn of the century. And even in 2009, they were incredibly hard to find, and suffered immensely from bitrot, with links not working, and leaning heavily on archive.org. Google has been and is amplifying this tremendously, by twiddling the knob to give more recent results a positive weight in search. Google makes a shitload of money from mass media content companies (e.g. Buzzfeed) and whatever other sources meet the minimum-acceptable-threshold for some query, versus linking to some old university or personal blog site which has no ads whatsoever. So the span of accessible knowledge has greatly shrunk over the last few years. Not only has the playing field of mass media and social media companies shrunk, but the older stuff isn't even accessible anymore. So we're being forced once more into a "television" kind of attention span, by Google, because of ads.

I find the single hardest thing to search for these days is anything more than a few months old on YouTube... They hate older videos, it feels like. Beyond that, I keep seeing suggestions on new content from years ago... it's just weird.

I know it's not google proper, but I'd guess a significant number of their searches are specific to youtube.

I believe they try to put newer content first in order to distribute views more fairly. If you order results by popularity on YT, you will see that it's just an "order by view count desc" (no relationship to like/dislike ratio), which is bad because it keeps some not-so-good-quality videos from YouTube's first years popular.

Worse still, imho is that it may not be a popular video I'm looking for. I really wish they'd factor in a "I have viewed" for results.

I disagree. It works great for me. Maybe once every few days I will use !g when I can't find something, but I rarely end up finding it on Google either.

I read somewhere that someone used a skin to make ddg look identical to Google. After doing that, they never even thought about using Google again.

Microsoft thinks what they have is Clean UX.....

Microsoft just needs to get their head out of their ass! With the amount of money they have spent, and what they have to show for it, they should just can the entire Bing team (or whatever they call their search engine team today). Not only have they sucked, but if they just folded, they'd let the monopoly argument against Google ride somewhat.

Sure, it costs $50 to grep it, but how much does it cost to host an in-memory index with all the data?

This is not a proposal to just share the crawl data, but the actual searchable index, presumably at arms length cost both internally & externally.

The same ideas could be extended to the Knowledge Graph, etc.

IMO the goal here should not be to kill Google, but to keep Google on their toes by removing barriers to competition.

The data was about 55TB of compressed HTML last I looked, so that's about 70 r5a.24xlarge instances, each going for $5.424/hour, so about $350/hour or $250K/month. That's not cheap, and definitely not something you'd put on your personal credit card, but it's well within the range of a seed-funded startup. Sizes may vary a bit depending upon the exact index format, but that should be a rough ballpark. With batch jobs being so cheap, you could experiment a bit with your own finances and then seek funding once you can demonstrate a few queries where your results are better than Google. If you actually have a credible threat to Google, you'll have investors breathing down your neck, because it's a $130B market.

API access to either the unranked or ranked index in memory wouldn't do anything useful, BTW. To have a viable startup you need something a lot better than Google, which means that you need algorithms that do something fundamentally different from Google, which means you need to be able to touch memory yourself and not go through an API for every document you might need to examine. Remember, search touches (nearly) every indexed document on every query - if you throw in 200ms request latency for 4B documents your request will take roughly 25 years to complete.
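The ballpark figures in the two paragraphs above check out with simple arithmetic (using the instance count, hourly rate, and per-document latency quoted in the comment; actual AWS prices will have drifted):

```python
# Hosting: ~70 r5a.24xlarge instances at $5.424/hour (the comment's figures).
hourly = 70 * 5.424           # ~ $380/hour
monthly = hourly * 24 * 30    # ~ $273K/month, same ballpark as "about $250K"

# Serial API access: 200 ms per document over 4B documents.
years = 4e9 * 0.2 / (3600 * 24 * 365)

print(round(hourly), round(monthly), round(years, 1))  # 380 273370 25.4
```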

Knowledge Graph is already public - it was an open dataset before it was bought by Google, and a snapshot of its state at the point Google closed it to further additions is still hosted by Google:

https://developers.google.com/freebase/
(It's only 22G gzipped, too - you can download that onto a personal laptop.)

"Remember, search touches (nearly) every indexed document on every query" - wait, why does that happen?

Doesn't it only touch ones with at least one of the search terms in, or stemmed/varied words relating to some of the terms? And does that via an index?

I struggled with how to word that in a way that's true, understandable, and doesn't give away any proprietary information. I added "indexed" to clarify, but I didn't fix up the numbers, so they're likely an overestimate.

Basically, yes, it uses an index and touches only documents that appear in one of the relevant posting lists. However, after stemming, spell-correcting, synonyms, and a number of other expansions I'm not at liberty to discuss, there can be a lot of query terms that it needs to look through, covering a significant portion of the index. Each one of these needs to be scored (well, sorta - there are various tricks you can use to avoid scoring some docs, which again I'm not at liberty to discuss), and it's usually beneficial to merge the scores only after they have been computed for all query terms, because you have more information about context available then.

There's a reason Google uses an in-memory index: it gives you a lot more flexibility about what information you can use to score documents at query time, which in turn lets you use more of the query as context. With an on-disk index you basically have to precompute scores for each term and can only merge them with simple arithmetic formulas.
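To make the term-at-a-time idea concrete, here is a toy sketch. Everything here is illustrative only - the posting-list format, the scoring function, and the plain-sum merge are stand-ins, not anything Google actually does:

```python
from collections import defaultdict

# Toy in-memory index: term -> posting list of (doc_id, term_frequency).
# "engines" stands in for a stemming/synonym expansion of "engine".
index = {
    "search":  [(1, 3), (2, 1), (4, 2)],
    "engine":  [(1, 1), (3, 2), (4, 1)],
    "engines": [(2, 2)],
}

def score(query_terms, index):
    """Score each expanded term's postings separately, then merge the
    per-term scores only at the end (here a plain sum; a real engine
    would use query context when merging)."""
    per_term = {}
    for term in query_terms:
        per_term[term] = {doc: tf for doc, tf in index.get(term, [])}
    merged = defaultdict(float)
    for term_scores in per_term.values():
        for doc, s in term_scores.items():
            merged[doc] += s
    return sorted(merged.items(), key=lambda kv: -kv[1])

print(score(["search", "engine", "engines"], index))
# [(1, 4.0), (2, 3.0), (4, 3.0), (3, 2.0)]
```

Note that only documents appearing in at least one posting list are ever touched, which is the point being made above: the cost comes from how many expanded terms (and therefore posting lists) the query fans out to.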

> Basically, yes, it uses an index and touches only documents that appear in one of the relevant posting lists. However, after stemming, spell-correcting, synonyms, and a number of other expansions I'm not at liberty to discuss, there can be a lot of query terms that it needs to look through, covering a significant portion of the index.

But, reading through the other comments, leaving out this part would make it better than Google.

Maybe stemming. I remember when Google added stemming (somewhere in the early 2000s). I was conflicted about it because I didn't want a search engine to second-guess my query (can you imagine??), but I also saw the use because I was already in the habit of trying multiple variations.

Auto spelling correct is a no-no. Just say "did you mean X?" and let people click it if they misspelled X. No sense in querying for both the "typo" and "corrected" keywords, because the "typo" would rank much lower, right?

Similar for synonyms. Either it should be an operator like ~, or maybe it should just offer a list (like the "did you mean" question) of synonyms to help the user think/select similar words to help their query.

> Each one of these needs to be scored (well, sorta - there are various tricks you can use to avoid scoring some docs, which again I'm not at liberty to discuss)

You mean like WAND or BMW (Block-Max WAND)?

> Knowledge Graph is already public
> https://developers.google.com/freebase/

That dump is outdated, not supported, and very incomplete compared to what Google has now.

Perhaps move the Google index and the Facebook graph to "utility" companies, with Google/Facebook being frontends/consumers of those companies. Tiered access costs based on query/access volume could fund the utility and allow smaller companies access at costs matched to their scale; if they can't monetise enough as they scale up to cover those costs, then they shouldn't be in business.

>The comments here that PageRank is Google's secret sauce also aren't really true - Google hasn't used PageRank since 2006.

That's quite a claim considering they were reporting PageRank in their toolbar until 2016, and toolbar PageRank was visible in Google Directory until 2011.

Are you talking about PageRank from the original patent?

It is a seemingly incorrect claim. Google has semi-recently and publicly said they still use PageRank as one of their signals.



They replaced it in 2006 with an algorithm that gives approximately-similar results but is significantly faster to compute. The replacement algorithm is the number that's been reported in the toolbar, and what Google claims as PageRank (it even has a similar name, and so Google's claim isn't technically incorrect). Both algorithms are O(N log N) but the replacement has a much smaller constant on the log N factor, because it does away with the need to iterate until the algorithm converges. That's fairly important as the web grew from ~1-10M pages to 150B+.
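For reference, the classic power-iteration PageRank the parent says was replaced looks roughly like this (a textbook sketch; dangling pages and other production details are ignored). The loop that repeats until convergence is exactly the cost the replacement reportedly avoids:

```python
def pagerank(links, damping=0.85, tol=1e-9, max_iter=100):
    """Classic PageRank by power iteration over a {page: set_of_outlinks}
    graph: redistribute rank along links until the vector stops changing."""
    n = len(links)
    rank = {u: 1.0 / n for u in links}
    for _ in range(max_iter):
        new = {
            u: (1 - damping) / n
               + damping * sum(rank[v] / len(links[v]) for v in links if u in links[v])
            for u in links
        }
        converged = max(abs(new[u] - rank[u]) for u in links) < tol
        rank = new
        if converged:
            break
    return rank

# Tiny 3-page web: A <-> B, and C links to A. A ends up highest, C lowest.
print(pagerank({"A": {"B"}, "B": {"A"}, "C": {"A"}}))
```

On a 150B+ page graph, every one of those iterations is a full pass over all edges, which is why doing away with iteration-to-convergence matters so much.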

> That's fairly important as the web grew from ~1-10M pages to 150B+.

This is the weird thing -- it feels smaller. Back in the early 2000s it really felt like I was navigating an ocean of knowledge. But these days it just feels like a couple of lakes.

(also, I'm pretty sure it was billions already quite early on?)

So what's the name of the new algorithm?

>The real reason Google's still on top is that consumer habits are hard to change, and once people have 20 years of practice solving a problem one way, most of them are not going to switch unless the alternative isn't just better, it's way, way better.

I agree about consumers' habits, but not about quality - I mean the Google of today is a worse search engine than the Google of 5 years ago.

Now Google tries to guess, badly, what you meant, instead of giving you what you asked for. The pleasure of dealing with IT systems is that they give you what you ask for, not what they think you meant - guessing introduces extra error, and worse, error that cannot be fixed by the user.

I can rephrase my query, and Google will still interpret it - leading to the same batch of useless results.

I can also comment here. I built and still run a petabyte-scale web crawler:


Common Crawl and other sources do in fact have a ton of data that can be used which is very affordable.

The DATA itself stopped being a real competitive advantage around 2008-2010.

Google's major advantage now is its algorithms and the fact that they've proven it works and is reliable.

Most importantly, it's the brand. Google MEANS search in the US and that won't change anytime soon.

PS,... if you need tons of web and social data Datastreamer can hook you up too :)

>"Indexing is the (relatively) easy part of building a search engine. CommonCrawl already indexes the top 3B+ pages on the web and makes it freely available on AWS."

Interesting. I would have thought that crawling at this scale and finishing in a reasonable amount of time would still be somewhat challenging. Might you have any suggested reading for how this is done in practice?

>"It costs about $50 to grep over it, $800 or so to run a moderately complex Hadoop job."

Curious what type of Hadoop job you might be referring to here. Would this be building smaller, more specific indexes or simply sharding a master index?

>"Google hasn't used PageRank since 2006."

Wow, that's a long time now. What did they replace it with? Might you have any links regarding this?

Crawling is tricky but it's been commoditized. CommonCrawl does it for free for you. If you need pages that aren't in the index then you need to deal with all the crawling issues, but its index is about as big as the one most Google research was done on when I was there.

$50 gets you basically a Hadoop job that can run a regular expression over the plain text in a reasonably-efficient programming language (I tested with both Kotlin and Rust and they were in that ballpark). $800 was for a custom MapReduce I wrote that did something moderately complex - it would look at an arbitrary website, determine if it was a forum page, and then develop a strategy for extracting parsed & dated posts from the page and crawling it in the future.
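As a rough illustration of what the $50 job amounts to (my sketch, not the parent's actual code): the map step scans one of CommonCrawl's gzipped plaintext (WET) shards for a regex. In a real run each mapper would stream its shard from the s3://commoncrawl/ bucket; here the demo uses an in-memory stand-in:

```python
import gzip
import io
import re

def grep_shard(fileobj, pattern):
    """Map step of a 'grep over CommonCrawl' job: scan one gzipped
    plaintext shard and yield matching lines."""
    rx = re.compile(pattern)
    with gzip.open(fileobj, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if rx.search(line):
                yield line.rstrip("\n")

# Demo on a tiny in-memory "shard"; a real job would open the S3 object instead.
shard = gzip.compress(b"hello world\nsearch engines are fun\nbye\n")
print(list(grep_shard(io.BytesIO(shard), r"search eng\w+")))
# ['search engines are fun']
```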

A straight inverted index (where you tokenize the plaintext and store a posting list of documents for each term) would likely be more towards the $50 end of the spectrum - this is a classic information retrieval exercise that's both pretty easy to program (you can do it in a half day or so) and not very computationally intensive. It's also pretty useless for a real consumer search engine - there's a reason Google replaced all the keyword-based search engines we used in the '90s. There's also no reason you would do it today, when you have open-source products like ElasticSearch that'd do it for you and have a lot more linguistic smarts built in. (Straight ElasticSearch with no ranking tweaks is also nowhere near as good as Google.)
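The half-day exercise described above, in miniature (an illustrative toy with zero linguistic smarts, which is exactly why it's useless as a consumer engine):

```python
import re
from collections import defaultdict

def build_index(docs):
    """The classic IR exercise: tokenize plaintext and store a
    posting list of document ids for each term."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Keyword AND-query: intersect the posting lists. This is roughly
    what pre-Google keyword engines did - no ranking at all."""
    postings = [index.get(t, set()) for t in re.findall(r"[a-z0-9]+", query.lower())]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "CommonCrawl indexes the top pages of the web",
    2: "Building a search engine index is the easy part",
    3: "The web has over 150B pages",
}
print(search(build_index(docs), "web pages"))  # {1, 3}
```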

Thanks for the detailed response. I appreciate it. I will look into CommonCrawl. Cheers.

IMHO a simpler and probably the only viable way to force competition is to legally force Google to not respond to any query on certain periodic time periods.

For instance, if you were to forbid Google from operating on every odd-numbered day, then 50% of the search engine market and revenues would immediately be distributed among competitors and furthermore users would be forced to test multiple engines and they could find a better one to use even when Google is allowed to operate.

Obviously this has a short-term economic cost if other search engines aren't good enough, as well as imposing an arbitrary restriction on business, so it's debatable whether this would be a reasonable course of action.

Banning anything is often not a good policy since it usually creates secondary markets.

Depends on how you count the date, this could create markets where people in different countries will sell Google search results to each other. New VPN providers pop up with the promise of 24h Google coverage. Software startups switch to a system where you bing work 16 hours straight, then get the next 32 hours break and repeat. "Breaking news" has a new Oxford definition, since newspapers change plan to publish news 5 minutes before Google opens for search. Electricity price increases for the first 2 hours of the odd-numbered day to combat the spike in demand. Comcast introduces a new fast lane at only $199 a month that has no slow down access to Google. University groups lobby for a new exemption in the law allowing unrestricted weekdays access. Political parties lobby for also blocking Google on the day of debates, regardless of whether it's an odd-numbered day. It's kinda fun to keep going.

Any search engine that was unavailable for 50% of the time would soon have 0% of the market, not 50%.

This can be solved, in the odd-days example, by making either the second most popular or all other search engines operate only on the even days (as well as making the restriction apply to the most popular engine instead of Google in particular).

This has other drawbacks of course.

Actually, the omnibox made it really easy to switch to DDG, with an occasional fallback to Google.

I have no problem with advertising etc. but the tracking and selling of data is such an idiotic thing. We as consumers should have a global internet-law, and be reimbursed for data leaks or usage outside the scope of the application.

By "no problem with ads" I mean the original ads of Google. It was very clear they were ads, and they were not intermingled with the results. Scrolling down past ads to reach the results is nuts. I will click ads if they're relevant, regardless of whether they're on the right or in the results. So please stop supporting this fraud against advertisers.

I think that the "fallback to Google" might actually tend to diminish consumer confidence in DDG. Every time you use it, you basically say to yourself "$newRiskyStrategy fails sometimes, we still need $oldReliableStrategy".

Instead, what might help DDG is a plugin that detects when you go past the first or second page of Google search results, and suggests that you might get better results on DDG. It's a little intrusive, but the mental nudge becomes "$oldReliableStrategy has flaws, try $newRiskyStrategy". You get a positive emotional interaction with DDG rather than "forcing" yourself to use it all of the time and "failing back" to Google.

> when I was at Google nearly all research & new features were done on the top 4B pages, and the remaining 150B+ pages were only consulted if no results in the top 4B turned up

This may help to explain the poor quality of some of the results on queries I run on Google lately that return content obviously written for SEO ranking but that has very little value.

I have 2 questions:

- What makes "the top 3B+" the top ones?

- How can I "force" a search on the other 150B+ pages?

I find it odd that you claim to be a former Google search engineer and in the end boil down the success of Google search to brand recognition / loyalty. You kinda glossed over the insane complexity of building and maintaining a high quality search engine, really weird comment to be honest.

> you'd have to break Google up and then forbid any of the baby-Googles from using the Google brand or google.com domain name.

Just let "google" become the generic term for search, as it's already well on its way.

PageRank is a synonym for link juice. So when you say Google hasn't used PageRank since 2006, can you confirm that you are talking about link juice as opposed to the old toolbar representation of PageRank? And assuming you do mean link juice, why do links still work so well for SEO?

Okay, this is a relatively serious proposal to require Google to allow API access to its search index, with the premise that it would democratize the search engine ecosystem. There are some issues with the regulations he proposes (you have to allow throttling to prevent DDoS attacks, and you can't let anyone with API access add content to prevent garbage results), but it's roughly feasible.

The main problem is, I think the author is wrong about what Google's "crown jewel" is. Yes, Google has a huge index, but most queries aren't in the long tail. Indexing the top billion pages or so won't take as long as people think.

The things that Google has that are truly unique are (1) a record of searches and user clicks for the past 20 years and (2) 20 years of experience fighting SEO spam. (1) is especially hard to beat, because that's presumably the data Google uses to optimize the parameters of its search algorithm. (2) seems doable, but would take a giant up-front investment for a new search engine to achieve. Bing had the money and persistence to make that investment, but how many others will?
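To see why the click record matters, here is one toy way a newcomer might exploit such logs (my illustration only, not a claim about Google's actual method): re-rank results by a smoothed historical click-through rate, where the Beta-style smoothing keeps rarely-shown results from dominating on one lucky click.

```python
from collections import defaultdict

def ctr_boost(click_log, base_ranking, alpha=1.0, beta=5.0):
    """Re-rank results by smoothed click-through rate estimated from
    (url, clicked) log entries. alpha/beta act as a prior: unseen urls
    get a neutral-ish score instead of 0/0."""
    shows, clicks = defaultdict(int), defaultdict(int)
    for url, clicked in click_log:
        shows[url] += 1
        clicks[url] += clicked

    def ctr(url):
        return (clicks[url] + alpha) / (shows[url] + alpha + beta)

    return sorted(base_ranking, key=ctr, reverse=True)

log = [("a.com", 1), ("a.com", 1), ("a.com", 0), ("b.com", 0), ("b.com", 0)]
print(ctr_boost(log, ["b.com", "a.com", "c.com"]))
# ['a.com', 'c.com', 'b.com']
```

Note the never-shown c.com outranks b.com, which was shown twice and never clicked - with 20 years of logs, almost nothing is "never shown," which is the moat.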

> 1) a record of searches and user clicks for the past 20 years

From what I can tell, Google cares a lot more about recency.

When I switch over to a new framework or language, search results are pretty bad for the first week, horrible actually as Google thinks I am still using /other language/. I have to keep appending the language / framework name to my queries.

After a week or so? The results are pure magic. I can search for something sort of describing what I want and Google returns the correct answer. If I search for 'array length' Google is going to tell me how to find the length of an array in whatever language I am currently immersed in!

As much as I try to use Duck Duck Go, Google is just too magic.

But I don't think it is because they have my complete search history.

Also people forget that the creepy stuff Google does is super useful.

For example, whatever framework I am using, Google will start pushing news updates to my Google Now (or whatever it is called on my phone) about new releases to that framework. I get a constant stream of learning resources, valuable blog posts, and best practices delivered to me every morning!

It really is impressive.

> Also people forget that the creepy stuff Google does is super useful.

For the same reasons you’re exalting them, I have non-technical friends who asked me how Google knows so much about them (and suggestions on how to avoid it) because they found it too creepy.

I don’t think people forget Google’s results are useful; some just think they’re more creepy than valuable. You seem to have picked your side in that (im)balance, and other people prefer the other side.

There’s also the relevant consideration that no matter how useful they may be, they should have no right to impose themselves on you. By this I mean that one should be free to refuse their creepiness, understanding the price is their usefulness. Yet, Google is the subject of privacy violations all the time, and they are caught time and again lying about what they collect on users.

I don’t think people forget Google’s results are useful; some just think they’re more creepy than valuable. You seem to have picked your side in that (im)balance, and other people prefer the other side.

Just as a general observation without taking either side:

People routinely fail to recognize both sides of a particular thing. It's why we have sayings like "You don't know what you've got til it's gone."

I wish interfaces were more straight up about their intentions and made it easier to implement account level partitions. For work I love Google's magic tracking effects, but at 1 am, hell no.

You can have multiple identities in chrome[0], even guest identities.

[0] https://support.google.com/chrome/answer/2364824

Right. I'm just saying it should be clearer. Ex: I want to have a list of accounts, Netflix-style, that I'm presented with on an empty Chrome window. If in fact multiple identities don't merge data implicitly in any way, then this is just a UI issue.

But I have a hard time believing google truly partitions everything in a multi account setup.

It would be immensely useful if Google understood that normal people have multiple facades that they use in different contexts. Probably several professional (which project / component was I working on again), private but family friendly (planning gifts for relatives, etc), and private but clearly out there (stuff you don't want to shock 60 year old parents / young kids / etc with) profiles.

Also, for incognito stuff, it'd be nice to have read-only basing on stock profiles related to various activities or people.

It is actually possible to operate without relying on Google or any other big tech firm. Who is forcing you into these privacy dilemmas? All of their services are a choice you are making. You don't need to accept any of it if you don't want to.

> You don't need to accept any of it if you don't want to.

Tell that to the people who had their privacy violated by Street View[1]. And the people who specifically disabled location services on their Android devices but were still tracked[2]. Or all the people who have no idea what Google Analytics is and never consented to it, but are profiled by it everyday.

> All of their services are a choice you are making.

I do my best to avoid privacy invading companies, and as a technical user I find it tiring and know I deal with consequences (e.g. broken websites). It perplexes me that comments like yours still pop up. We’re not the only segment of the population that exists; non-technical users are the majority, and they have the same right to privacy as we do, with a modicum of transparency. If even technical people are regularly tripped by privacy invasions we didn’t know about, what chances do non-technical users have?

[1]: https://www.nytimes.com/2013/03/13/technology/google-pays-fi...

[2]: https://qz.com/1131515/google-collects-android-users-locatio...

Street View is debatably invasive. I understand this might seem hand-wavy to someone really concerned about privacy issues, but:

1. Generally speaking, I would think VERY few people care about an image of their property being on Street View.

2. It's not really illegal to take pictures, so even from a legal standpoint it seems like a gray area.

3. I understand there can be individual reasons for not wanting this, but it seems to be a very large net positive.

And I would apply that statement to most other tracking and data policies they have.

If they are lying about how their services track people, that is definitely grounds for concern. The transparency can definitely be improved, but still, these are people with Android phones and people using Google Analytics. No one is forced to use these things; they are free to use any other service or create their own.

And my attitude is out of pragmatism and how I think privacy issues should be handled. I don't have any problem with the way Google uses my data, so I don't care to fix a non-problem. And I don't see it as their responsibility to change a way of doing business when anyone is free to use any other service or create their own, since I don't find it offensive.

The first sentence of the linked New York Times story:

> Google on Tuesday acknowledged to state officials that it had violated people’s privacy during its Street View mapping project when it casually scooped up passwords, e-mail and other personal information from unsuspecting computer users.

That answers your first three paragraphs. There’s no “if” to their lying and privacy invasions. They’ve been caught and admitted their actions time and again.

> No one is forced to use these things they are free to use any other service or create their own.

It is here I will respectfully give up on continuing the conversation with you. You’re either ignoring my main point or truly don’t care for the majority of users. Most people don’t understand the ramifications of these choices and for good reason; they are hard to understand. By suggesting non-technical users create their own services and devices, I’m now wondering it you’re trolling me.

> And my attitude is out of pragmatism (…) I don't have any problem with the way Google uses my data

Which is valid, but irrelevant. I’ve already mentioned in the top post different people make different choices. I presented another side and used facts to justify it. If you’re going to answer with mere opinion, you’re not adding to the points made by the original poster.

That snippet of the NYT story omits critical context: The data they captured were random wifi packets (probably for use in Skyhook-type location fixes by way of mapping out where APs are). Sounds like they were doing the equivalent of a wardrive and captured more than the AP advertisement message.

This is information that Google doesn't have any need for (noise) and didn't want in the first place.

They also self-reported the failure, where they could have just nuked it and we wouldn't be having this conversation.

What? You seem to be misunderstanding my statements.

My first points were about the streetview product. Scooping up passwords is obviously not the intent of that product, maybe that was an error or they changed the core product at some point? I can't read the paywalled article.

I'm not suggesting non-technical users create products... you're reading so far out of context. Just because user X can't create a new product does not mean that we should place sanctions on company Y. I'm glad you used facts somewhere else because in this post you just illogically connect a bunch of dots.

Yes, some of it is my opinion and a lot of this is yours. But a fact is still that no one is forcing you to use these products; then you went off about stolen passwords and trolling and resigned yourself from the argument. That sounds like the rationality of a completely one-sided, biased individual in itself, respectfully.

Yes everyone agrees transparency is good and lying is bad. Google is not Evil Or Benevolent. They're just people...

"And I don’t use them. I hoped that by continuing to mention non-technical users you’d get it, but this was never about me. You keep bringing up that argument, but read what you replied to in the first post — I recounted the experience of non-technical people I know, not my experience. Stop telling me I have a choice; the point is not us, it’s non-technical users who don’t have the knowledge to make informed choices!"

Haha you are so ridiculous. This was your first post:

> There’s also the relevant consideration that no matter how useful they may be, they should have no right to impose themselves on you.

Then you say you don't know why I bring up that you don't need to use Googles services... C'mon man get real. That's why the point about using alternatives or creating new ones is very relevant and this entire thread is about sanctions. Don't start a convo you can't participate in and then just claim you won and leave, that's childish behavior.

> Just because user X can't create a new product does not mean that we should place sanctions on company Y. (…) in this post you just illogically connect a bunch of dots.

That is an insane extrapolation, and the reason I don’t want to continue the conversation with you: you’re answering points I’m not making. I haven’t even hinted at sanctions; I have no idea where you’re getting that from.

> But a fact is still no one is forcing you to use these products

And I don’t use them. I hoped that by continuing to mention non-technical users you’d get it, but this was never about me. You keep bringing up that argument, but read what you replied to in the first post — I recounted the experience of non-technical people I know, not my experience. Stop telling me I have a choice; the point is not us, it’s non-technical users who don’t have the knowledge to make informed choices!

> That sounds like a rationality of a completely one-sided biased individual in itself, respectfully.

Believe what you want. I just don’t want to keep wasting my night arguing with someone that started a discussion but refuses to address the points originally made. Why reply, then?

Maybe I’m not explaining myself well enough, or in the correct way for you to understand, or maybe you’re the one not grasping what I mean. It doesn’t really matter where the problem lies, just that it’s clearly not working.

Maybe if we ever meet in person we can resume this conversation, but tonight it’s not being productive, so I genuinely wish you a good week and sign out here.

> It is possible to operate without relying on any big tech firm.

Some writer from Gizmodo tried that last February. Let's just say that you are technically correct that you don't need any of the big tech firms.


I was curious about this, as I work in multiple languages every day. I almost never use Google though except as last resort if other engines can't find anything. So the result I got for array length was for Javascript. Which is quite high on the hype cycle now, but I only very rarely use it and search anything about it even less frequently.

So I wonder how much of the magic you perceive might be just your interests matching the interests of most other people using Google, and thus it's not Google magically guessing you're into Javascript (for example) now, but Javascript being popular and this being the cause of both Google returning matches for it and you starting to use it? Did you ever do a clean experiment - e.g. try to learn APL or some other relatively obscure language and have Google return all results about APL and none about Javascript?

Going back to OP's point: Google is really good at associating a search query with a search result. Every time you search and click on something, Google learns that association.

So it could very well be that as more users adopt the new language/framework in the first couple of weeks they have taught google those associations.

Google isn’t a search company. They are a distributed machine learning company that make most of their money from learning what people want and showing relevant ads to them.

They have ads to show first; telling what people want comes after that, and knowing what people wanted is only interesting insofar as it makes the second easier and serves the first.

Really good or really bad only exists if there is something else to compare it to.

I always see posts like this here, and then I try it, and I get a page full of "array length" results for Javascript, while everything in the last year that I've searched for has been Java or Kotlin...

Same when I owned a Pixel after hearing about Google Now and their ML magic there. Nothing more magical than an iPhone in terms of suggestions. The camera was amazing, but not all this supposed contextual stuff.

Wild guess: in a surge of privacy consciousness you told Google to stay the heck away from your data. These checkboxes stick forever and couple years down the line some magic feature won't be able to learn from your data. E.g. despite working there, I still haven't figured out how to let Photos recognize people in my pictures, something that definitely is on by default.

Question: have you ever visited this webpage?


For many people it is enough to be totally creeped out about Google.

Also, that Google remembers context can be handy but it is not essential. Without context, I am sure you would be equally capable of finding what you are looking for, although it might take a little more typing since you'll have to supply the context yourself. Imho, convenience is not a good argument for giving away your personal information.

Yeah, I must echo your sentiments wrt their Google Now product; it is great. Not only does it provide relevant content, but some of it is very new and/or obscure, which I really appreciate. I have linked people to videos I pulled off my Google Now feed and they are amazed that I know about a video on our very specific shared interest that is less than a couple hours old and has only a few hundred views.

The flip side of this is that it makes it harder for you to stumble upon something related, but new, outside of the filter bubble Google is making for you.

There's no arguing what you're describing is useful, but it's nice to keep in mind that there are downsides even if you ignore the privacy argument (which, IMO, shouldn't be ignored).

Your results may be bad for the first week, but the better results you get later on have everything to do with Google’s long-term user-base.

> Yes, Google has a huge index, but most queries aren't in the long tail.

I'm not quite sure about that. 15% of Google searches per day are unique, as in, Google has never seen them before. [1]. That's quite an insane number.

[1] https://searchengineland.com/google-reaffirms-15-searches-ne...

Sharing for anyone who didn't know there is a very good dataset you can use now. If you don't have an NVMe SSD in your computer, I highly recommend getting one for fast I/O.

http://commoncrawl.org/
http://commoncrawl.org/the-data/
http://index.commoncrawl.org/

Related: Mark's blog is amazing and worth more than any data science degree, IMHO.

https://tech.marksblogg.com/petabytes-of-website-data-spark-...
https://tech.marksblogg.com

wow, thanks.

[edit] In my experience YaCy works really well. You have it crawl the sites you frequently visit and their external links, and it quickly accumulates into something more accurate than Google.

Wow, 15% unique searches is indeed quite an interesting figure. With that said, what OP said is definitely not disproved. Just because 15% of searches are unique, that doesn't mean the most relevant result is buried in the tail end. I mean, I can think of loads of my own searches that are probably unique or rare but lead to the same popular results because of typos, improper wording, etc.

Without some clear numbers on that from a major search engine, I think this might be very difficult to infer.

Especially with voice searches. People are searching entire sentences rather than specific keywords which are much more likely to be unique.

Do people do this?

Or do you mean queries forwarded by home assistants trying to parse inputs?

> Do people do this?

The calling card of the developer realising that real users never act like you expect :)

Real users will use your product in ways you never imagined.

Heh, yes, they do. Which is a reminder that devs are not "typical" users.

As a developer, I search using keywords; for example, if I was looking for property for sale in Inverness, I might search for "property Inverness", whereas I've seen and heard "typical" users use something like "find me a 2 bedroom house with a garden for sale in the North of Inverness" - much more verbose, and containing stop words and phrases unlikely to help (I think!).

I do the same as you, but was just thinking that if most users search using full sentences then Google will spend most effort optimizing for that, so maybe we're the ones getting the worse results?

No, the optimization they do for the low-quality query is more than balanced out by the higher clarity and relevance of a well-phrased query. There are often extraneous words that aren't simple stop words, and they're not 100% successful at removing these extraneous ones.

I almost always search keywords while my girlfriend uses sentences and we often get quite different results. If I'm having trouble finding a good result there's a pretty good chance she will find something quickly. Surprisingly this holds true even for programming questions on topics that I know well and she's never heard of before.

> As a developer, I search using keywords;

So did I.

Around the time I left Google behind I had started to search like my wife did, using full sentences. It sometimes worked better, I think.

With voice I use sentences: it's far more reliable because of the Markov model (or whatever predictive model they are using).

What does it matter whether it came from an assistant or not?

Natural language is likely the preferred search input method for kids under a certain age, who cannot yet type fluently. My kids formulate very long, complex queries verbally. The other day my son asked Alexa why the machine gun is such a deadly weapon. She replied with a snippet from Wikipedia that was surprisingly relevant.

I often do full sentences and then start deleting words from it if it doesn't work.

Can confirm. I search full sentences even from the keyboard.

I search full sentences (questions) from the keyboard. I figure I'm not the only to have had the question before, so I ask. Also, I find that blog posts, etc. tend to match well for full sentences.

Yes, sorry - that's me, copying and pasting SharePoint error messages.

Those searches are unlikely to be unique.

Hmmm - Error: System.InvalidOperationException: The workflow with id=15f08b34-33f5-4063-8dea-d4ca6212c0d6 is no longer available.

is not atypical.

Does that actually work? I must be old school, I always delete such IDs before searching, but then again I used Google back when it actually did what you told it instead of misinterpreting everything for you.

It doesn't seem to have any particular effect on the results that come up. I always used to delete them, and still do sometimes but Google seems to pretty much ignore them in practice.

Which is a wonderful behavior except for all the times that the error numbers are not actually GUIDs but rather identify general errors.

If only :(

Could this be explained by supposing that people are just searching for current events, sometimes national, sometimes international, sometimes very local? If so, you really wouldn't need much indexed to handle those queries. I imagine many queries are also just overly verbose and sentence-length, which artificially inflates the number of unique queries which are actually seeking roughly the same pages.

Good point, and 15% is indeed a lot, but the question is what "unique" means. If it means that the exact same character sequence appeared for the first time, it doesn't mean the user is searching for a term that has never been searched for.

I mean, with the newest advances like machine learning it's more and more possible to _semantically_ link queries. If that's the case, those 15% could become 5% truly unique searches or even less.

"how dumb is trump" and "how dumb is donald trump" are two different searches but they semantically belong together because they mean the same.

How many of those are confirmed to be of human origin?

Probably quite a few. New things happen. Politics, wars, famous folks, movies, music, diseases, scientific studies, products, brands, model numbers for products, fads and slang. I'm guessing there are other things as well.

Some of the new things are probably variation as well - as others have mentioned, sentences and voice commands can give lots of new stuff.

Now I feel bad for putting gibberish like jsjsjdkktkwoapaoalf in my address bar and searching Google to test if my internet is working..

I just type "test", hopefully they do that too and it is ignored.

I do that all the time, I wonder how common that is?

I would think it’s pretty common. For a lot of people Google is the internet. Or at least the reference. If Google isn't working, it’s almost certain the problem is on your end. I don’t think anyone else has that reputation for availability amongst the general public.

I think they mean that the results are still from the top pages of the internet. They mean long tail of visited pages, not long tail of searches.

A unique search query could still land you on Wikipedia.

> 15% of Google searches per day are unique, as in, Google has never seen them before.

That is impossible, and therefore wrong (I'm wrong, please see below). To know if a search is unique, as in Google has never seen them before, Google must be able to decide if a query it receives was seen before or not. Even if we assume Google needed only one bit for each message it has ever seen, and assuming it only saw 15% of new messages each day since its creation more than 20 years ago, it would need to store more than 2^1471 bits.

What could be true is that each day 15% of all searches are unique on that day.

Edit: I'm wrong. The 15% of completely unique messages per day are in regards to the messages per day, and not in regards to all messages it has ever seen, therefore exponential growth doesn't apply. To see that, assume Google just received one search query each day for 20 years but it was unique random gibberish, then Google could easily save that even though 100% of all messages per day are unique.

This is a somewhat faulty analysis. One could easily use a high-accuracy bloom filter to tell whether a search has definitely not been seen before, which gives a lower-bound estimate on the number of unique queries.

Yup. This was actually an interview question I got from a former Google search engineer.

Where are you getting these numbers? Google says they get ~2 trillion searches per year. 40 trillion searches over 20 years (way too many) would be 2^44 searches. https://searchengineland.com/google-now-handles-2-999-trilli...

(And they don’t even need to store all searches for all time for this, thanks to Bloom filters.)

The whole point was that 2^1471 is wrong.

It is roughly 1.15^(365*20). That it is wrong was clear from its size. I wanted to use its falseness to show that the assumptions are incorrect. Which they are, just not in the way I understood initially.

How are you computing that number? It's definitely wrong.

Assume Google receives 1 trillion queries per year, and has been around for 20 years. Using a bloom filter you can achieve a 1% error rate with ~10 bits per item. So a 200 terabyte bloom filter would be more than sufficient to estimate the number of unique queries.
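(For anyone wanting to check that sizing: here's a back-of-the-envelope in Python using the standard bloom filter sizing formulas, with the 20-trillion-query and 1% figures from above. It comes out to roughly 24 TB, so 200 TB is indeed more than sufficient.)

```python
import math

n = 20 * 10**12   # ~20 trillion queries over 20 years (assumed)
p = 0.01          # target false-positive rate

# Standard bloom filter sizing:
#   m = -n * ln(p) / (ln 2)^2   bits of filter
#   k = (m / n) * ln 2          hash functions
m_bits = -n * math.log(p) / math.log(2) ** 2
k = math.ceil(m_bits / n * math.log(2))

print(f"~{m_bits / 8 / 10**12:.0f} TB, {k} hash functions")  # ~24 TB, 7 hash functions
```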

A Bloom filter is just way overkill.

If you have a list of 20 trillion query strings, and each query string is on average < 100 bytes, you're looking at a three line MapReduce and < 1 PiB of disk to create a table which has the frequency of every query ever issued. Add a counter to your final reduce to count how often the # times seen is 1.

uh, is this sarcasm?

A bloom filter is the most appropriate data structure for this use-case. How is it overkill when it uses less space and is faster to query?

Actually the bloom filter was just an approachable example. There are much more clever and space efficient solutions to this problem, such as HyperLogLog [1] (speculating purely based on the numbers in that article, it looks like a few megabytes of space would be far more than sufficient). See the Wikipedia page on the "Count-distinct problem" [2].

1: https://en.wikipedia.org/wiki/HyperLogLog 2: https://en.wikipedia.org/wiki/Count-distinct_problem
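To make the idea concrete, here's a toy HyperLogLog in Python (a sketch for illustration, not how a production system would do it): it estimates distinct counts using only a few KB of registers, no matter how many items you stream through it.

```python
import hashlib
import math

def hll_estimate(items, b=12):
    """Estimate the number of distinct items using 2**b small registers."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = h >> (64 - b)                      # first b bits pick a register
        rest = h & ((1 << (64 - b)) - 1)         # remaining 64-b bits
        rank = (64 - b) - rest.bit_length() + 1  # leading zeros + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)             # bias correction for m >= 128
    est = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if est <= 2.5 * m and zeros:                 # small-range (linear counting) fix
        est = m * math.log(m / zeros)
    return est

# 100k distinct strings estimated with ~4 KB of state, typically within a few %
print(round(hll_estimate(str(i) for i in range(100_000))))
```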

My initial approach was also technically wrong; it tells you the fraction of queries which happen once.

To find the fraction of queries each day which are new, you would want to add a second field to your aggregation (or just change the count), the first date the query was seen. After you get the first date each query was seen, sum up the total number of queries first seen on each date, compare it to the traffic for each date.

You could still hand the problem to a new hire (with the appropriate logs access), expect them to code up the MapReduce before lunch (or after if they need to read all the documentation), farm out the job to a few thousand workers, and expect to have the answer when you come back from lunch.
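In plain Python, the shape of that two-step aggregation might look like this (hypothetical log format, just to show the computation; the real thing would of course be a MapReduce over sharded logs):

```python
from collections import Counter

def fraction_new_per_day(query_log):
    """query_log: iterable of (date, query) pairs from search logs."""
    first_seen = {}                        # query -> first date it appeared
    daily_total = Counter()                # date  -> total queries that day
    for date, query in sorted(query_log):  # sort so the earliest date wins
        daily_total[date] += 1
        first_seen.setdefault(query, date)
    new_per_day = Counter(first_seen.values())  # date -> queries first seen then
    return {d: new_per_day[d] / daily_total[d] for d in daily_total}

log = [("d1", "cats"), ("d1", "dogs"),
       ("d2", "cats"), ("d2", "ferrets")]
print(fraction_new_per_day(log))  # {'d1': 1.0, 'd2': 0.5}
```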

I don't think it's necessarily impossible to calculate. Using probabilistic data structures arranged in a clever way, it's likely possible to calculate with some degree of accuracy.

I haven't thought this through, but take all the queries as they're made and create a bloom filter for every hour of searches. Depending when this process was started, an analytics group could then take a day of unique searches, and run them against this probabilistic history, and get a reasonable estimation with low error. Although the people who work on this sort of thing probably know it far better than I.

The real question, assuming the 15% is right, might be whether we care about those 15%: are they typos that don't merge, are they semantically different, are they bots searching for dates or hashes, etc.?
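For anyone who hasn't used one: a bloom filter can only answer "definitely not seen" or "maybe seen", which is exactly what you want here. A toy implementation (illustrative sketch, sized arbitrarily):

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting the hash with the index
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, item):
        # False means definitely never added; True means "probably added"
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter(10_000, 7)
bf.add("hot sauce recipe")
print(bf.maybe_contains("hot sauce recipe"))   # True
```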

I believe that they're unique in a sense that nobody has typed in that exact query previously.

Of course, Google knows better than to treat every search query literally. Slight deviations and synonyms work for the majority of people, even if us techies strongly oppose them and look for alternative solutions (like DDG) that still treat our searches quite literally.

>2) 20 years of experience fighting SEO spam.

Tangential - but does anyone else feel that google results are useless a lot of the time? If you search for something, you will get 100% SEO optimized shitty ad-ridden blog/commercial pages giving surface level info about what you searched about. I find for programming/IT topics its pretty good, but for other topics it is horrible. Unless you are very specific with your searches, "good" resources don't really percolate to the top. There isn't nearly enough filtering of "trash".

Yes, I feel like Google search results have very gradually become more irrelevant and spammy over the past decade or so.

There are 2 issues, I think.

Firstly, the SE-optimised spam, which has become very good at masquerading as genuine content.

Secondly, Google has dumbed search syntax down a bit, and often seems to outright ignore double quoted phrases, presumably thinking it knows better than I what I want.

As a dev, I do accept I may be an outlier though - with the incredible wealth of search history and location data that Google holds, it seems likely things have actually improved for typical users.

Is there a way to turn this quote-ignoring off? It drives me nuts.

Seeing as google has my search history for the past 14 years, they should be able to KNOW that I'm a slightly more technical user and can take advantage of power user features instead of treating me like an idiot

Google signed an armistice in the Great Spamsite War some time around '08 or '09, to the effect that spam can have all the search results aside from those pointing at a few top, trusted sites, so long as they provide any content at all. Bad content is fine. Farmed content is fine. Content that was probably machine-generated is fine. Just content. Play the game, make sure your markov chain article generator or mechanical turks post every day, throw some Google ads on your page, and G will happily put your spamsite garbage at result #3.

There’s a reason for this; click through rate on ads is higher on pages that don’t achieve the user goal.

I suspect that the AI models powering the search results develop a sort of symbiotic relationship with the spam - if the user actually finds what they are looking for by clicking through an ad on an otherwise spammy page, everyone “wins”; the user found what they were looking for with minimum effort, google got their ad revenue, and the spammy page got a little cut for generating content that best approximating the local minimum that links the users keywords to actual intent...

“Farmed content is fine”. I thought that was one of the major (intentional) victims of the Panda update. https://moz.com/learn/seo/google-panda

There are a few widespread scaled publishing operations like IAC which seems to be doing well with the split up of About.com & relaunching it as vertically focused branded sites, but the content farm business model died with the Panda update.

Some of the sites that were hit like Suite101.com went offline. eHow is still off well over 90%. ArticlesBase sold on Flippa for like $10k or some such. One of the few wins hiding in all the rubble was HubPages, but even they had to rebrand and split out sites & merged into a company with a market cap of about $26 million ... and the CEO of Hubpages is brilliant.

Even with IAC on some sites they are suggesting ad revenues won't be enough http://www.tearsheet.co/culture-and-talent/investopedia-laun... "As Investopedia charts its course as a media brand, it’s coming up against the roadblock all publishers eventually hit — the reality that display revenue alone won’t be enough. ... Siegel said he expects course revenue to exceed what’s generated from the site’s free content. While he wouldn’t say what the company’s annual revenue was, Siegel said it grew an average of around 30 percent for each of the last three years."

There are also other factors that parallel the Panda update in diminishing the quick-n-thin rehash publishing business model:

- Google's featured snippets & knowledge graph pull content into the SERPs, so there is no outbound click on many searches

- programmatic advertising redirects advertiser ad spend away from content targeting to retargeting & other forms of behavioral targeting (an advertiser can use a URL as a custom audience for AdWords ad targeting even if that site does not carry any Google ads on it)

- mobile search results have a smaller screen space where, if there is any commercial intent whatsoever, the ads push the organic results below the fold

I agree with this. Most searches give me almost a whole page of ads and stuff up top before the things I’m interested in start showing up way down at the bottom of the page, and even then the results are often spam.

I’ve been using DuckDuckGo and have found I have this problem less. I don’t always find what I mean on DDG, as of now I’d say Google is still better if you’re not sure exactly what you’re looking for is called, but if you know the keywords you need DDG is often better.

Someone here on HN linked to an interesting site talking about how to make homemade hot sauce. I partly read it and thought it was a great clean site and something I wanted to try. Later, going back to find it again, I literally spent hours searching, even though I'm pretty sure I remembered some of the exact phrases. For some reason recipe-related search results are really, really terrible on both Google and Bing.

Could you not find it again via the HN site search? https://hn.algolia.com/?query=%22hot%20sauce%22%20recipe&sor...

This is awesome and helped me find it again! Thank you!

Sometimes sites get dropped from the results because they are malware hosts. It’s more likely to happen to small independent sites. They are also more likely to just pack it up and shut down their sites.

Yeah, this is why I still use and like myactivity.google.com, as creepy as it is. It's helped me re-find so many interesting half-remembered sites and videos and songs I'd previously come across.

Why would you rely on google spying instead of your own browser history?

cross platform support, maybe?

100% agree. For technical queries, as long as a StackExchange comes up, Google is still okay.

But for increasingly more basic searches about a product I'm interested in or a medication or anything else non-complicated that would have gotten me a clean list of decent, non-paid results even 5 years ago, I'm now getting half a page of sponsored BS and then another half a page of 'created content' written by a bot or shyster explicitly for gaming Google's SEO.

Not only has Google lost almost all their good will (i.e. Don't be evil), but their products aren't even that good anymore, at least not so much better than alternatives where the negatives of using Google outweigh the difference in quality.

Yes, at least half the time I search about a particular topic, it seems the first few pages are written by some contractor in the Philippines probably getting paid $2 / hr who just spent the prior 30 minutes researching the topic.

I am not sure that this take is accurate.

I would agree that programming search results tend to be quite good, but I think this is likely in large part because the average person attracted to programming both has a high IQ and has experience building some part of the web stack. Thus the sites that are quite manipulative in nature would have a hard time trying to fake it until they make it in such a vertical where people are hard to monetize and are very good at distinguishing real from fake. And even if a fake site started to rank for a bit it would quickly fall off as discerning users gave it negative engagement signals.

This is also perhaps part of the reason sites like Stack Overflow monetize indirectly with employment related ads targeted to high value candidates versus say a set of contextually targeted ads on a typical forum page or teeth whitening gizmo ads on the Facebook ad feed.

The lack of filtering of "trash" probably comes from a bunch of different areas

- I think there was a quote that people are most alike in their base instincts and most refined in areas where they are unique. Some of the most common queries are related to celebrity gossip & such. There are also flaws in human nature where inferior experiences win by exploiting those flaws. For example, try to buy flowers online and see how many layers of junk fees are pushed on top of the advertised upfront low price: shipping, handling, care, weekend delivery, holiday delivery, etc.

- Some efforts to filter trash based on folding in end-user data may promote low-quality stuff that people believe in. A neutral & objective political report is less appealing than one which confirms a person's political biases, and in many areas people are less likely to share or consider paying for something neutral versus something slanted toward their worldview.

- As the barrier to entry on the web has increased, some of the companies that grew confident they had a dominant position in a market may have decided to buy out smaller players in the vertical & then degrade the user experience as real competition faded. There was a Facebook exec email mentioning they were buying Instagram to eliminate a competitor. Facebook's ad load is now much higher than it was when they were smaller. But the same sort of behavior is true in other verticals too. Expedia & Booking own most of the top travel portals.

There has also been a ton of collateral damage in filtering all the trash. So many quirky niche blogs & tiny ecommerce businesses were essentially scrubbed from the web between Panda, Penguin & other related algo updates.

> does anyone else feel that google results are useless a lot of the time?

Google doesn't make money from you finding what you're looking for. Google makes money from you searching for what you're looking for.

It has gotten better over the years in some ways even if it feels like it also got worse. I recall pages of "ads and useful-looking search result keywords" being more common in the past.

w3schools still outranks mdn a lot.

You're not alone. From my perspective, the value of google search results has been dropping for years. And the quality of their search results seems to be dropping in a way I suspect is profitable for google. Most of the results I get back from google these days are trying to sell me something I have no interest in buying.

For example, suppose I do a google image search for "pear", because I want images of pears obviously. The first result is indeed a pear, good job google! Except the first search result just happens to come from Amazon, and also happens to be a pretty shitty thumbnail quality photograph (355x336). It's a pear alright, but why is this particular image of a pear first? Google didn't try to give me the best image of a pear, they tried to give me the pear image they thought most likely to induce a financial transaction. Or alternatively, google let itself get cheaply manipulated by Amazon's SEO. Neither is a good look.

A much better pear image, 3758x3336 from wikipedia, is further down the search results. So it's not like google was unable to find good pictures of pears. And a non-image search for "pear" returns the wikipedia page first, so it's not like google failed to noticed the relevancy of the wikipedia article about pears. Yet the shitty amazon thumbnail of a pear shows up higher in the image search results than a high resolution photograph of a pear from wikipedia.

I would assess Google (& FB's) "crown jewel" as, ultimately, their market share, which is related to your points... and causation runs both ways.

The user data helps/ed Google create the superior UX, as you say. The reach is what makes Google & FB valuable to advertisers. A search engine with 0.1% of Google's user volume cannot charge advertisers enough to earn 0.1% of Google's ad revenue. Returns to scale/reach/market-share are very substantial in online advertising.

I'm glad we're talking though. Those tech giants are too powerful.

Ultimately, the old antitrust toolkit is near useless today for dealing with tech monopolies. It's not obvious what "break up Google" even means. There are strong network effects and other returns-to-scale. It's a zero-marginal-cost business, which was rare enough in the past that economists largely ignored it.

We need fresh thinking, a new vocabulary, new tools, but we do need to deal with it.

'Break Google up' would mean you'd have:

* an Office suite / enterprise company (Google Cloud + Docs + Gmail + Business)

* a phone company (Android)

* a search company (Google Search + Advertisement)

* and a media company (Google Play Movies, Music, Books and YouTube)

The names would probably become different in time, but you get the gist.

Amazon and Microsoft could be broken up much the same way, in neat categorical 'silos'. Facebook should be trisected into Facebook, WhatsApp and Instagram again. I have no idea how you would break Apple up without utterly destroying their core principle, vertical integration. There is no way to do what Apple does with MacBooks or iPhones if they don't control the entire stack. I'm not saying they shouldn't be, I just see no way.

So... I think there are two issues with this.

(1) This doesn't actually reduce market share, since each of these are basically different market categories.

(2) Almost all the revenue is from search. That company is the revenue generating arm for the other ones.

(2) is one of the most important points. We have to stop Google from cross-financing new products from other revenue streams so they can no longer undercut or buy all competitors. Google Maps is a good example. They ran it super cheap for a long time to drive out competitors and now jack up the prices.

In contrast to most people here, I think breaking up Amazon is far more important than breaking up Facebook, Microsoft, Apple, and many other tech companies. Only Google is as bad.

But you have to acknowledge that without the cross-financing those "markets" wouldn't even exist.

Before Google Maps we had a few online map services and they were terrible. Google Maps redefined what it means to have free access to web-based interactive global maps, it changed how people find things, and it was all paid for by the ad business. Later on some monetizing efforts were made for it and competitors started to appear, mostly trying to catch up and copy what Google Maps did, but without the huge cash infusion of the ad business none of this would have happened.

A decade later, people take these things for granted and just want to split services up. I guess it makes sense from their point of view but to me it's not that clear what should happen while still allowing for the type of creativity and speed of development that allowed things like Google Maps to appear because I'm afraid "the next big" thing that could redefine our lives (and improve them) would be slowed down or simply made non-feasible.

> "Before Google Maps we had a few online map services and they were terrible. Google Maps redefined what it means to have free access to web based interactive global maps"

This is not true. MapQuest revolutionized things almost 10 years earlier than Google Maps. Google search is what allowed Google Maps to overtake MapQuest. Also, Android providing real-time traffic data of all their users gave them the winning formula.

Free scrolling was pretty revolutionary.

As was, like, an app that could reroute live instead of relying on pre-printed paper instructions.

Gmaps had both before MapQuest.

You are right that traffic was revolutionary and that's why Google Maps became the de facto standard. However, in the context of the original post, this is exactly why it's unfair. Google has Android, which gives them user location data that they then use as a competitive advantage in another space to eliminate all competition. If Android were one business and Google Maps another, then companies like MapQuest could also negotiate deals with Android to get user data, and then it's a matter of who has the best platform that wins. That's what is best for the consumer as well. In the current structure, there is no way that a small business like MapQuest could build a smartphone to ascertain user data, and nor should they have to. They should only have to build the best map application to succeed in the online mapping space. Having to also succeed in location data aggregation eliminates competition. It's designed so the giants can eat the small guys at will without them being able to fight back.

Worth mentioning a couple of factors related to this: you couldn't turn location data on for any service external to Google without also having it turned on for Google, and even when you had location services turned off for Google, sometimes they still had it turned on anyhow.

I'm not talking about using traffic data. Simply rerouting if you, for example, miss a turn, which instructions on paper can't do.

If you stopped cross-financing YouTube then it would stop existing. YT has never made a profit, and hosting user-generated content in the YT style is impossible to do profitably.

Google takes 45% revshare on YouTube. Some videos that were demonetized still show ads, so on those Google is taking 100%.

I've seen mid-roll ads on songs on YouTube.

Hosting costs & delivery costs (per byte) drop every year. Every year their compression gets better. Every year their ad revenues goes up. YouTube ad revenues have been growing at something like 30% a year for many years.

I think one reason Google doesn't break out YouTube profitability is because as soon as they show they are profitable they end up getting some of their biggest partners (like music labels) using those profits to readjust revshares.

Also, if Google claims YouTube is not profitable they can be painted as the victim for extremist content or hate content they host, whereas if they showed they were making a couple billion a year in profits these narratives would be significantly less effective.

Core search ad prices haven't really been falling, yet Google's blended click prices keep falling about 20% a year while ad click volume grows about 60% a year. This is driven primarily by increasing watch time on YouTube and increasing ad load on YouTube. Google blends video ad views in with their "clicks" count.
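(Taking those rough figures at face value, the implied blended ad revenue trend works out like this:)

```python
price_change = 0.80    # blended click prices fall ~20% a year
volume_change = 1.60   # ad "click" volume grows ~60% a year
revenue_change = price_change * volume_change
print(f"implied blended ad revenue growth: {revenue_change - 1:.0%}")  # 28%
```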

Yes, my thought was that by breaking everything off from everything else, these silo'd services would suddenly have to compete with the rest of their market on fair terms, instead of being propped up massively by other division(s), and thus would lose marketshare to a multitude of fresh and established competitors.

You are right though, it doesn't deal with the dominance of search directly. My hope is a complementary effect to the above also happens: Google no longer gets gobs of personal data from its other services, allowing other search engines to approach its efficacy.

As is clear I'm not really a fan of direct intervention in a single market, I see it as more of a problem when these giants muscle their way and control more and more markets, creating a vicious feedback loop.

> Yes, my thought was that by breaking everything off from everything else, these silo'd services would suddenly have to compete with the rest of their market at fair terms

I think it's instructive to look at the rest of the market. How is Mozilla funded? Basically a single gigantic contract with Google. Even Apple accepts payment from google to become the default, and it's not cheap: https://fortune.com/2018/09/29/google-apple-safari-search-en... The same logic applies to pretty much anything Alphabet spins off -- there's little difference between ownership and those contract.

About the only competition this setup produces is the ability for Mozilla to walk away to a competitor bid, which they did for like a year before bailing out at the first opportunity. There's a huge incumbency bias in these contracts. The first parallel that comes to mind is employer provided health insurance. Everyone gets to bid, but the incumbent knows the claims history far better than the competition and we'd only expect them to lose bids to companies overly optimistic about that history. Google knows how valuable various traffic sources are, but their competitors have to guess, and only when their guess is higher than Google's does it pay off. Does anyone think Yahoo winning Firefox was a good deal? I haven't seen any analysis to support that.

> My hope is a complementary effect to the above also happens: Google no longer gets gobs of personal data from its other services, allowing other search engines to approach its efficacy.

Wouldn't the most profitable thing for these broken up companies be to sell their slice of the personal data pie as many parties as possible? This seems like a net loss for privacy. How much extra would it be worth to set up an exclusive arrangement?

> You are right though, it doesn't deal with the dominance of search directly. My hope is a complementary effect to the above also happens: Google no longer gets gobs of personal data from its other services, allowing other search engines to approach its efficacy.

I'm still not sure how this would work on Apple though, since their main differentiator is their design sensibilities and integration rather than their platform monopolies.

I guess iMessage and the App Store do rely on monopoly rents, but I can't think of any way to sever those links without making the iOS platform less secure.

I'm not sure how much an impact breaking Google up would have, and I say this as someone who has built a product that competes with Google's G-Suite. I want there to be a more level playing field, sure. But each of these siloed businesses would still be a monopoly in its own right.

For Google, you missed the part that makes most money.

Most of those products don't make money by themselves, they exist to keep people in the ecosystem, providing more data for the real moneymaker.

The biggest blow to Google wouldn't be to break it up into lots of small companies, you just need to separate the advertising business from everything else and you've effectively neutered the monopoly. Google's genius isn't in hiring the best engineers to providing a ton of services, it's in convincing people that they're not an advertising company, and that is where Facebook has been falling out of favor recently (I'm guessing that's why they bought Instagram, and why Google bought YouTube).

“... provide more data for the real moneymaker.”

This is a supposition - while perhaps it seems to make sense, seems true, “must be true”, that doesn’t mean it is true!

Unless you worked on search quality at google you really aren’t in a position to know if, say, google cloud, or android provides useful signals to search (outside of the signals they’d collect anyways if they were different companies).

One thing people are obscuring is just how crazily effective AdWords are. They work for the advertisers, and they earn Google 70+% of its revenue - confirmed via SEC filings, which do break that out. Go play with creating an AdWords campaign and try to infer just how much data Google really needs to deliver those ads - it’s less than you’d think.

In short: this overall move is more wishful thinking than solidly reasoned. Surveying the field of streaming video, given the amount of studio-driven consolidation, are there really tons of competitors being held down that will spring up? I am skeptical.

That's an interesting thought. I agree with you that most of those products are loss leaders for data mining and thus advertisement.

But my thinking was that if you simply cut off advertising all the products still have massive marketshares and could lean on each other, as long as some succeed. Not to mention investors probably willing to prop up such a massive aggregate marketshare (one only has to look at Uber).

If you 'silo' them, success of one division of previously-Google won't lead to all of them dominating.

Thanks, I completely glossed over that since to me their advertisement division is inextricably linked to their search division. Added!

Almost all of those businesses tie back into their main business which is advertising.

Who is going to sell those ‘neat silos’ cheap advertising to survive with no other business model? Google?

Search advertising is different from web display advertising is different from streaming video advertising.

The cloud services (Enterprise apps, hosting, etc) don't need it.

> I think we'd probably see a worse Gmail (worse ads or aggressive upsells).

I'd rather cleave them all vertically anyway, rather than be left with a bunch of mini horizontal monopolies.

Granted most of your examples wouldn't be, except for search, but it still seems more interesting to me to just have a bunch of mini googles made from cleaving teams. Certainly that would make for some crazier competition.

Breaking up companies like Google, Amazon, and Microsoft is just not gonna happen in 2019, when huge, global mega-corps are the only way to compete outside of small local markets.

Even though a lot of these corporations build offices, hire non-Americans, and pay tons of foreign taxes in countries in which they do business, the main executives and talent still live in the US, the IP is developed here, and the majority of profits end up back in the home country.

It's better for everyone who actually matters - shareholders, intel agencies, government officials, associated businesses, etc - that these companies remain large and globally dominant, even if it screws over US citizens by having to pay the monopoly taxes and suffer the privacy invasions. We're an insignificant sacrifice in the decision-makers' minds.

Apple's already on its descent, and at most you'd break off their cloud services, which would immediately die without the support line from the hardware.

> it's roughly feasible

What do folks even mean by "Google's index"? Google results combine tons of signals, including personal histories for each user. Sharing metadata for the top billion URLs wouldn't cover half the functionality, or make a competitive engine. And on the other hand, there may not be a single other organization in the world prepared to manage a replica of the entire data plane that impacts search. The proposal is somewhere between underspecified and nonsense.

Thanks, this is mainly what I came here to say. And I just don't see even the vaguely defined "index" as the crown jewel. If anything, it's "relevant results", which is something quite different.

> Bing had the money and persistence to make that investment, but how many others will?

I once hypothesized to an ex-Microsoft higher-up that it probably took $10B to launch Bing. He said I was almost exactly on the nose.

Also this is a ridiculous thing to ask for. How much money do you think Google pays for the bandwidth to crawl the web? How much do you think it costs to run the machines that create indexes out of that? How do you value the IP involved in the process?

Google should give away the fruits of that labor for free, plus invest in a reasonable API to download that index? Plus the bandwidth of sharing that index with third parties? It’s probably not even feasible aside from putting disks or tapes on multiple semis to send to clients. The index is 100 petabytes according to [0]. With dual fiber lines, and no latency for mind bending numbers of API calls, that would take 12.6 YEARS to download a single snapshot.

[0] https://www.google.com/search/howsearchworks/crawling-indexi...
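As a sanity check on that 12.6-year figure (assuming "dual fiber lines" means two 1 Gbit/s links - my guess, since the comment doesn't state a link speed):

```python
# Back-of-envelope transfer time for a 100 PB index over two bonded
# 1 Gbit/s lines. The link speed is an assumption, not a known figure.

INDEX_BYTES = 100 * 10**15          # 100 petabytes
LINK_BITS_PER_SEC = 2 * 10**9       # two 1 Gbit/s lines

seconds = INDEX_BYTES * 8 / LINK_BITS_PER_SEC
years = seconds / (60 * 60 * 24 * 365)
print(f"{years:.1f} years")  # 12.7 years
```

At two 10 Gbit/s lines instead, it drops to about 1.3 years - still absurd for a single snapshot, before you even consider keeping it fresh.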


The hacker news guidelines specifically advise against this kind of comment.

'Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."'

'Be kind. Don't be snarky. Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.'


> Indexing the top billion pages or so won't take as long as people think.

This is what makes me wonder why we don't have a LOT of competing search engines. Perhaps I'm vastly under-estimating the technology and difficulty (I could well be - it's not my domain), but surely it can't be THAT hard to spawn Google-like weighted crawl-based search results?

It's a long-since solved problem - heck, PageRank's first iteration recently came out of patent protection - it could just be copy-pasted. Why aren't all the big companies Doing Search?
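For what it's worth, the now-unpatented first iteration really is small. Here's a minimal power-iteration sketch over a toy link graph (damping factor 0.85, as in the 1998 paper) - though as others note below, this is nowhere near the whole ranking problem:

```python
# Minimal PageRank via power iteration over a dict of page -> outlinks.
# Toy example only; real engines layer hundreds of signals on top.

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a baseline "teleport" share...
        new = {p: (1 - damping) / n for p in pages}
        # ...plus a damped share of rank from each page linking to it.
        for page, outlinks in links.items():
            if outlinks:
                share = rank[page] / len(outlinks)
                for target in outlinks:
                    new[target] += damping * share
        rank = new
    return rank

# Toy graph: b, c, and d all link to a, so a should rank highest.
graph = {"a": ["b"], "b": ["a"], "c": ["a"], "d": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # a
```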

SEO spam, and poor quality content I would guess. Google has bolted on a ton of ML over the last ten years to fight it.

And yet most Google results that don't point at one of a handful of major sites are SEO spam :-/

The spammers won. Google gave up and settled for "we like the right kind of spam—the kind that took a little effort, and makes us money".

I did a search earlier today on Google for "north face glacier" - it turns out that The North Face has a Glacier product, so as far as I can tell that's all the search results contain.

Searching for "north face glaciation" did help, as the first page of search results had one entry on the topic I was actually searching for!

Maybe they should have a "I'm not buying anything" flag!

This has been the problem with results for the past few years. E-commerce gets priority in all things and you have to wade through pages of useless links if you want actual content about what you are searching for.

Big brands have the ad budget to advertise. That drives awareness. If they have offline stores, those can be thought of as both destinations AND interactive billboards which drive further brand awareness and demand for branded searches.

Many of the top search queries are navigational searches for brands.

And so, if tons of people are searching for your brand, then for a potentially related query that contains the brand term and some other stuff, they'll likely return at least a result or two from the core brand, just in case it was what you were looking for.

It's not just ML, but the people that provide the labeling for the ML.

Google pays some large number of people to run searches and grade the results they get, to see if the answers are good, which then helps feed back into the ML.

Heck, according to this article[0], google has been paying people to evaluate their search results since 2004.

[0] https://searchengineland.com/interview-google-search-quality...

It doesn't feed back into the ML directly, according to Google. Instead, they use it to evaluate changes to search algorithms. If they get an increase in thumbs-up from the Quality Raters, then their changes were positive. If not, they figure out why.

The original 2012 FTC investigation of Google anti-trust activity showed how they might have abused this process. Interesting read, no matter which side you take: http://graphics.wsj.com/google-ftc-report/

I feel that for certain topics, especially anything to do with tutorials or coding, even Google falls foul of SEO content. Just Google ‘android custom ROM <phone model>’ for instance. There are stock pages for all of them, identical save for the phone model, and clearly not applicable.

PageRank was an innovation at the time, but modern search engines require training models on lots of query logs to get good performance. It's expensive to make a really good search engine.

It's because people usually just stick with whatever works best for them instead of using a variety of search engines. It becomes rather winner-takes-all.

Google for general search. DuckDuckGo for general search if you want something a bit more private but aren't extreme enough to run your own spiders. Bing mostly for porn search - not being snarky, some people do consider it to have better results.

And searx.me if you want to be even more private, and you can run that yourself if you so choose.

Querying an index isn't a solved problem, building it is.

It's easy to gather the necessary data, but it's hard to know which parts of that data are the most relevant for finding good content and avoiding bad content. Is it more relevant if key words show up in links or titles than in the body of the text? If so, SEO spam sites will include a bunch of keywords in links and titles. Is it more relevant if keywords show up in the first 200 visible words of the page? If so, spam pages will make tons of pages with relevant keywords at the top.
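To make that gameability concrete, here's a toy scorer in the spirit of the above - the 5x title weight is invented for the example, not anything any real engine uses. A page that stuffs the keyword into its title trivially outranks a genuine article:

```python
# Naive relevance scoring that trusts titles more than body text.
# The weights are made up to illustrate why fixed signals get gamed.

def naive_score(query, page):
    q = query.lower()
    score = 0.0
    score += 5.0 * page["title"].lower().count(q)  # title hits weighted 5x
    score += 1.0 * page["body"].lower().count(q)   # body hits weighted 1x
    return score

genuine = {"title": "Choosing a camping tent",
           "body": "A good tent keeps you dry. Pick a tent rated for wind."}
spam = {"title": "tent tent tent best tent deals tent",
        "body": "Buy now!!!"}

print(naive_score("tent", genuine))  # 7.0
print(naive_score("tent", spam))     # 25.0 -- the spam page wins
```

As soon as spammers learn the weights, the weights stop working, which is why the cat-and-mouse game never ends.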

The hard part about building a search engine isn't indexing the internet, it's adapting to spam. Spammers are continually adapting to changes in the algorithm, so the algorithm needs to adapt as well. And the more popular your search engine is, the more money you make and the more able you are to adapt to spam (and the more spammers focus on your engine).

So, the problem isn't that Google has a better index (though I'm sure it does), the problem is that nobody else has the will to spend the money necessary to tune the search algorithm to stay on top of spammers. When Google started, companies didn't care as much about improving their index and instead focused on building their other content (Yahoo, MSN, etc). Google saw the value of search and got a lead on everyone else in terms of curating results, and now they have the momentum to stay in front and have shifted to building content to improve monetization. Nobody else has the monetization network for search that Google has, so they'll continue having the problem that other companies had (Microsoft wants to point you to their other services, DuckDuckGo is limited by their commitment to privacy, etc).

In short, Google wins because:

- it was better when it mattered

- it makes money directly from search

- its other services improve its ability to understand what users want, which improves search quality and ad relevance

You can't make a better algorithm by being clever, you make a better algorithm by having better data, and that's hard to come by these days. The only way I can think of a competitor stepping in is if they target an underserved demographic and focus data collection and monetization there, and DuckDuckGo is close by targeting privacy conscious power users.

> The only way I can think of a competitor stepping in is if they target an underserved demographic and focus data collection and monetization there, and DuckDuckGo is close by targeting privacy conscious power users.

The irony there is that DuckDuckGo can't collect much of that data precisely because of their privacy focus.

> The hard part about building a search engine isn't indexing the internet, it's adapting to spam. Spammers are continually adapting to changes in the algorithm, so the algorithm needs to adapt as well.

Adaptive crawlers?

> Querying an index isn't a solved problem, building it is...

You didn't just hit the nail on the head; you drove it all the way in with a single blow. Bravo.

Most likely answer: lack of diversity in revenue models.

Outside of ad revenue, search has always been seen as something of a "charity" effort for the internet. It's "boring" infrastructure work that can be critically useful but doesn't really make money directly on its own. No one wants to pay a "search toll" and there's no government agency in the world that the internet would trust as a neutral index to run it as actual tax-basis infrastructure.

Which begs the question, if adblock makes advertising based models go the way of the dodo, what happens to search?

"indexing" is only part of the problem, it's a batch job. I find being able to respond to searches across a huge data set in the order of milliseconds (while having planet scale fail over) be a lot more challenging to implement.

It's not the 'raw' search itself. It's the billions (trillions) of queries they've captured: Person X searches for query Y and clicks on result Z.

This is far more valuable than the general page rank algorithms that were initially developed and have already been duplicated many times in academia and business.
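A toy illustration of why those triples are so valuable: with enough (query, result, clicked) data you can re-rank by observed click-through rate without analyzing page content at all. The smoothing constants here are invented for the example:

```python
# Re-ranking by Laplace-smoothed click-through rate from a click log.
# Sketch only; real click models also correct for position bias etc.

from collections import defaultdict

def ctr_ranker(log):
    shows = defaultdict(int)
    clicks = defaultdict(int)
    for query, url, clicked in log:
        shows[(query, url)] += 1
        clicks[(query, url)] += int(clicked)
    # Add-one smoothing so rarely-shown results aren't scored 0 or 1.
    return lambda q, url: (clicks[(q, url)] + 1) / (shows[(q, url)] + 2)

log = [("tent", "a.com", True), ("tent", "a.com", True),
       ("tent", "b.com", False), ("tent", "b.com", True)]
score = ctr_ranker(log)
print(score("tent", "a.com") > score("tent", "b.com"))  # True
```

The catch, of course, is that only the incumbent has twenty years of this log.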

It's so weird how about 1/3 of the time on DuckDuckGo, I add a !g in frustration. Half the time I still get nothing and end up posting on Stack Overflow, but half the time I get a little more useful information.

Google custom tailors results for each and every machine. Even if you're not signed in, Google uses your browser fingerprint, the OS it's reporting and location/IP data to custom fit results. There is no "stock" google result.

This is something DuckDuckGo et al. can't do if they want to focus on a privacy model. DDG does offer location-specific searches, which can be helpful.

Aside from the quality issues that others have already mentioned, I think that simply gaining traction for a new search engine is incredibly difficult - people typically use whatever is the default in their browser, or/and Google/Baidu/Yandex (which are surely the best known in their respective regions).

Consider DuckDuckGo, which sells itself on privacy, but after more than a decade has only 0.18% market share. Without the power to make it the default in an OS or browser, you'd have to have a really strong value proposition to convince people to switch.

I don't think this is correct. For years, the #3 search query on Bing in the US was "Google", and globally it used to be a double-digit percentage of all Bing queries. That suggests to me that people with a default Bing search engine had learned in droves to click their way to the preferred engine regardless of what the default was, and did so without being technically skilled enough to change the default once and for all. I don't know how large a group the latter is, but it seems hard to argue that the two together are small.

> Why aren't all the big companies Doing Search?

They are.

> 1) a record of searches and user clicks for the past 20 years

If a government was serious about getting more players in the search industry, they would force Google (and all other players) to make this data public.

Simply say "All user-behaviour data used to improve the service must be freely published".

Make the law apply to any web service with more than 20 million users globally so small businesses aren't burdened.

If the data cannot be published for privacy reasons, the private parts must be separated out and not used by Google or its competitors.

Imagine the amount of bureaucratic burden these proposals would impose (even for small businesses, because it is not obvious how to count users, etc.).

> the private parts must be separated

This means literally making a legal interpretation of every document on the net, to determine whether each of them is private or not.

> If the data cannot be published for privacy reasons, the private parts must be seperated and not used by google or it's competitors.

As a user that notices the impact of this data: please no, thanks though.

Have you ever visited youtube's home page in incognito mode? It's... bad. Really bad. Not allowing any company to use this (obviously very private) information in ranking would simply make their products suck, horribly, compared to today.

>Have you ever visited youtube's home page in incognito mode?

Do you like the personalized recommendations because of channel subscriptions?

I always get the "anonymous default" home page with YouTube and don't care. The home page is just a wasted load before I can start typing in the search bar. As a bonus, staying incognito means all the videos on the right-side panel are related to the current video. Not related to a music video I have playing in another tab.

Pretty much, and the potential for criminal activity is astronomical if you give them access to an open index. Things like every website on the web hit with the same zero-day on the same day for maximum profit. Build your own best kiddie pron site evah! with direct access to the index and your own ranking system. What, your admin pushed a config that left the admin pages open? Go time!

As someone who was operationally responsible for a search index (formerly VP Ops at Blekko) the kinds of things crooks tried to do was pretty instructive on how they use search in advancing their efforts.

>20 years of experience fighting SEO spam

I think we've reached an equilibrium state on this that has significantly degraded the educational quality of search engine results.

The total garbage SEO spam we used to get is gone, which is nice, but what it's been replaced with is technically relevant but mostly manipulative advertising. Product searches will basically give you a bunch of no-name blogs who are almost definitely paid off by one vendor or another.

Even actual inquiries are inundated with search results that do answer the question, but do so in an extremely cursory and incomplete way. Or, in the case of recipes, Google seems to prioritize results that give you long, meandering narratives before they actually get to the recipe. It has some very weird ideas about what people actually want when they search.

One of the most annoying things is how impossible it is to actually find the website of a local business, especially a restaurant, by Googling. Your hits are always Google's own cobbled-together dossier on the restaurant first, then some combination of Yelp, Grubhub, Postmates, AllMenus, etc. pages. If the restaurant has a website you can't tell, and it's probably way at the bottom or on a second page of results.

In the past it was a handful of very decent results amidst a sea of total garbage SEO spam. Now it's a sea of mediocre content-farm stuff, and it ranges from difficult to impossible to actually dig into detail on things anymore. The old spam we could at least dismiss as crap within a fraction of a second of seeing it. The new spam you have to actually read most of before you realize it doesn't have what you're looking for.

Via API access you'd be effectively getting access to the index _plus_ the derivative search quality improvements _based on_ user data, even if you're not getting user data itself. That would certainly open the door to competition, especially on a niche basis e.g. you want to build a platform dedicated to drones - you can combine drone reviews and news with videos plus e-commerce results. The result could be awesome in sparking all kinds of small business building on Google's API.

> 2) 20 years of experience fighting SEO spam.

That's probably a key issue here though. Providing an API potentially makes it easier for spammers to identify ways to boost their content in a well automated manner.

> That's probably a key issue here though. Providing an API potentially makes it easier for spammers to identify ways to boost their content in a well automated manner.

How so? Unless you give reasoning for the scores, or provide live updates etc, just putting an API on search wouldn't change much - you can APIfy search now, there are multiple services offering it as a service. Granted, at some point it's getting expensive, but for SEO research, you're probably not running a million queries.

Totally agree. Google's golden egg is not the index but the datasets containing searches done by users (together with location data from Android and Maps, and speech data from Assistant).

As far as I remember, Google is actually shrinking its index in terms of the number of indexed websites, because 90% of the internet is irrelevant for the majority of searches. Basically "quality over quantity", if you can say that.

> Basically "quality over quantity" if you can say that.

This is even more depressing. Google was such a wonderful tool for us nerds because we could finally find those usenet posts, personal blogs, tech mail lists, etc. of all the esoteric subjects that had been hard to find previously. Before Google, you'd use lists of curated links (e.g. Yahoo) for a given topic that had been traded back and forth between various sites and other interested netizens.

It's apparent that Google is becoming worse and worse for these types of searches, while it concentrates on more popular queries like "When is the next <my show> on" or "What is the current sports-ball score" or "How big are Kim Kardashian's boobs".

Just like Craig of Craigslist recently came out with an article saying the internet has actually made the news media worse, not better, at informing citizens - something he did not predict correctly - it's apparent that Google is pushing us in the same negative direction in the ability to find quality information on non-consumer knowledge.

* https://www.theguardian.com/technology/2019/jul/14/craigslis...

> most queries aren't in the long tail

But that's where differentiation occurs. Every search engine will get short tail results correct. We go back to Google because it also performs with the weird queries.

I agree that algorithmic superiority will probably perpetuate Google's dominance. But making its index public is (a) legally precedented, (b) conceptually simple and (c) a small step in the right direction.

Gotta say my experience varies a lot with long-tail-type queries; I usually try DuckDuckGo and if that fails I search Google. They find very different things: DDG tends to be less filtered in terms of spam sites and fake news, but it also finds results of a dubious copyright nature, for example.

I've had the same experience with DDG, which I use as my primary search engine. If I'm looking for something specific, e.g. a scientific paper or a recent news article, it doesn't have it. I run the search through Google. That's purely an indexing problem.

On the other hand, if I have a health-related search, I run it through Google. DDG has the proper content; it's just that it prioritizes the blog spam. That's an algorithm problem.

Relieving the former, as the author's proposal would do, makes DDG more competitive. As a second-order effect, it would also let DDG prioritize resources towards the second problem, making them more competitive still.

From my experience, for long-tail queries, DDG also returns a lot more NSFW results than Google.

Bing does have the reputation of being better for NSFW searches than Google, so I guess that it's normal to have more NSFW false positives as well.

Indexing them is not hard; ranking them to yield a useful first page is.

I'd wager any startup that tries to crawl a few sites like Amazon, Yelp, Linkedin, etc will be blocked. Google, however gets a pass because they're Google. So yes, I believe their huge index, and ability to crawl any site at will is a huge, huge advantage for them.

I built a search engine that was able to crawl Amazon and Yelp. The toughest sites were reddit and facebook.

At scale? Millions of pages a week? And now? I wrote a crawler that could crawl Amazon as recently as a year ago too, but now it doesn't work.

And google sucks at those too.

Amazon lets anyone crawl them, Yelp has a whitelist and no you can't get on it, Linkedin has a whitelist and no you can't get on it, Facebook has a whitelist and no you can't get on it.

The long tail is important, even if it's a small percentage of searches (which it isn't anyway).

Same reason people won't buy electric cars with 100 mile ranges, even if they very rarely travel more than 100 miles.

Storage and bandwidth are cheaper than ever before, people scrape a billion pages for much more mundane purposes these days, even for academic papers.

Having a full text index on that is more involved but hardly impossible. You're completely right that it's not at all Google's secret sauce. Bing has clearly indexed much more than that, plus invested a ton in actually returning good results from their index. And still nearly nobody cares. It's just not easy to make a better Google, and the people most likely to figure out how to do that already work there.

The Common Crawl corpus is already available and stored on S3 - so analyzing billions of web pages is literally already available with an AWS account and a simple map reduce job.
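And you don't even need a MapReduce job to get started: Common Crawl's CDX index returns one JSON record per captured URL, pointing at a byte range inside a WARC file on S3, so single pages can be fetched with an HTTP Range request. The record below is a hand-made example in that JSON shape (real ones come from index.commoncrawl.org; the filename is invented):

```python
# Parsing a CDX-style index record and building the Range header that
# would fetch just that page's bytes from the WARC file on S3.

import json

record_line = ('{"url": "https://example.com/", '
               '"filename": "crawl-data/CC-MAIN-2019-30/segments/x/warc/y.warc.gz", '
               '"offset": "1024", "length": "2048"}')

rec = json.loads(record_line)
byte_range = f"bytes={rec['offset']}-{int(rec['offset']) + int(rec['length']) - 1}"
print(byte_range)  # bytes=1024-3071
```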

I'd actually advocate for making public an anonymized list of actual search queries.

Domain-specific search engines could evolve based on the demand of what has already been searched for.

Anonymizing search queries is extremely hard, if not impossible. See https://en.wikipedia.org/wiki/AOL_search_data_leak for example.

> It's just not easy to make a better Google

It depends which sense of "better" you mean. It's nearly trivial to make an ethically superior search engine by just not building the spyware bits of Google.

It's difficult to make a search engine that's "better" along the dimensions of speed, profitability, etc.

That exists, it's called DuckDuckGo, and even fewer people care about it than Bing. For the most part, people don't actually care about Google collecting their entire search history and combining it with all its other data on them. We may live to regret that in a hypothetical future where the government turns more authoritarian and requisitions that data for evil.

I made three statements. They're all true as far as I can see. Would the downvoters care to speak up?

Devil's advocate:

Some argue (not necessarily me) that Google isn't necessarily purely optimizing for quality using that 20 year click-and-search log, that they're accepting some inefficiency by biasing for political (left-leaning) gain or "censorship by obscurity". If competitors could more easily build alternatives, which, say, didn't have those biases, then arguably that'd put more competitive pressure on Google to not use their monopoly for bad stuff.

More importantly, Google's core competency is PageRank. Sharing the index != sharing PageRank. As time goes on, others will use inferior algorithms, and become worse. This scheme will not accomplish what it intends to do. Also, you can't just force people to give away their property.

It is the crown jewel because people choose Google precisely because they are understood to have the largest index. It's comparable to Verizon marketing 'the largest network,' but with many more benefits accrued to the company who is believed to have the largest search index.

Since the author compares the proposed API to what startpage.com does, I'm guessing he's not talking about "index" as in "raw documents", but basically Search as an API with all the sorting and ranking done.

Well, considering the complaints about Google's search quality going down that I read on HN all the time, I have a theory that highly technical users are adversely affected by the search "improvements", so an improved search engine targeting that group would essentially be one that searches on exactly what you typed.

I also happen to think that is the search engine I would prefer. I think I could build that pretty quick if I had the api access.

What Google has is "I'll google that".

In particular, they have the google.com domain. That is literally their most valuable asset.

No, most queries are in the long tail

For #1, I'd prefer it if Google didn't share my search history with anyone. Wouldn't that also go against GDPR in Europe?
