Hacker News new | past | comments | ask | show | jobs | submit login

Ex-Google-Search engineer here, having also done some projects since leaving that involve data-mining publicly-available web documents.

This proposal won't do very much. Indexing is the (relatively) easy part of building a search engine. CommonCrawl already indexes the top 3B+ pages on the web and makes it freely available on AWS. It costs about $50 to grep over it, $800 or so to run a moderately complex Hadoop job.

(For comparison, when I was at Google nearly all research & new features were done on the top 4B pages, and the remaining 150B+ pages were only consulted if no results in the top 4B turned up. Difficulty of running a MapReduce over that corpus was actually a little harder than running a Hadoop job over CommonCrawl, because there's less documentation available.)

The comments here that PageRank is Google's secret sauce also aren't really true - Google hasn't used PageRank since 2006. The ones about the search & clickthrough data being important are closer, but I suspect that if you made those public you still wouldn't have an effective Google competitor.

The real reason Google's still on top is that consumer habits are hard to change, and once people have 20 years of practice solving a problem one way, most of them are not going to switch unless the alternative isn't just better, it's way, way better. Same reason I still buy Quilted Northern toilet paper despite knowing that it supports the Koch brothers and their abhorrent political views, or drink Coca-Cola despite knowing how unhealthy it is.

If you really want to open the search-engine space to competition, you'd have to break Google up and then forbid any of the baby-Googles from using the Google brand or google.com domain name. (Needless to say, you'd also need to get rid of Chrome & Toolbar integration.) Same with all the other monopolies that plague the American business landscape. Once you get to a certain age, the majority of the business value is in the brand, and so the only way to keep the monopoly from dominating its industry again is to take away the brand and distribute the productive capacity to successor companies on relatively even footing.

I think it is possible to make way, way better search engine because Google Search is no longer as good as it used to, at least for me.

I can no longer find anything remotely good quality, I discover new and quality stuff from social media like Twitter and HN.

The search results seem to be too general and too mainstream. Nothing new to discover, just a shortcut to the few websites like Reddit, StackOverflow for more techie thing and Wikipedia and the few mainstream news websites for the rest.

I usually end up to search HN, Reddit or StackOverflow directly as the resulting quality is better as I can get easily specific. Getting specific is harder on Google because it just omits or misinterprets my search query keywords quite often.

The reason for that is because Google's building for a mainstream audience, because the mainstream (by definition) is much bigger than any niche. They increase aggregate happiness (though not your specific happiness) a lot more by doing so.

It's probably possible to build a search engine for a specific vertical that's better than Google. However, you face a few really big problems that make this not worthwhile:

1) Speaking from experience, it's very difficult to define what "better" means when you don't have exemplars of what queries are likely and what the results should be. The reason search engines are a product is that they let us find things we didn't know existed before; if we don't know they exist, how can we tweak the search engine to return them?

2) People go to a search engine because it has the answers for their question, no matter what their question is. If you had a specific search engine for games, and another for celebrities, and another for flights, and another for hotels, and another for books, and another for power tools, and another for current events, and another for technical documentation, and another for punditry, and another for history, and another to settle arguments on the Internet, then pretty soon you'd need a search engine to find the appropriate search engine. We call this "Google", and as a consumer, it's really convenient if they just give us the answer directly rather than directing us to another search engine where we need to refine our query again.

3) Google makes basically 80% of their revenue from searches for commercial products or services (insurance, lawyers, therapists, SaaS, flowers, etc.) The remainder is split between AdSense, Cloud, Android, Google Play, GFiber, YouTube, DoubleClick, etc. (may be a bit higher now). Many queries don't even run ads at all - when was the last time you saw an ad on a technical programming query, or a navigational query like [facebook login]? All of these are cross-subsidized by the commercial queries, because there's a benefit to Google from it being the one place you go to look for answers. If you build a niche site just to give good answers to programming queries or celebrity searches or current events, there's no business model there.

> It's probably possible to build a search engine for a specific vertical that's better than Google.

Funny, I don't disagree with this, but my perception has been that Google seems to detect when I've switched roles from one type of programmer to another. I don't know if that's organic from the topics I'm looking up or not, but if I'm looking up a generic string search, it seems to return whatever language I've been searching for recently. (very recently in fact)

My point is, it seems like the search engine intuitively understands my "vertical" already. Maybe it's just because developer searches are probably pretty optimized.

I think its totally possible, two examples already:

Google Ads (used to?) lets you target by "bahaviour" vs "in-market". They can tell the difference between someone who is passionate about beds, maybe involved in the bed business (behavior) and the people who are making the once-in-a-decade purchase of a bed (in-market).

Google can tell devices apart on the same google account and keep together search threads. I might be programming on my desktop making engineering searches but at the same time I'm googling memes on my phone; both logged into the same account.

> Speaking from experience, it's very difficult to define what "better" means when you don't have exemplars of what queries are likely and what the results should be.

Better is a search engine that takes your queries more literal. This is what everybody means when they say Google used to be better. The query keywords and no second guessing.

When you insist on Google using verbatim mode or something, you often don't get results. Which is bullshit because I remember 10 years ago, queries like these had me plowing through the results, so much that you actually had to refine the query -- you can't do that in Google any more, at least it's not refining, it's more like re-wording and re-rolling the dice. But it all feels very random and you don't get a feel for what's out there.

I mean sure there is a place for a search engine like this, if it works well. And in its own way, Google works well.

I sometimes do want my query to be loosely interpreted like I'm an idiot, and I head straight for the Google. Ever since I saw the "that guy wot gone painted them melty clocks"-meme, for certain types of queries I have indeed found that if I formulate my question like I got brain damage, I get superior results. Because that is the kind of audience Google wants you to be.

But sometimes you don't feel like the lowest common denominator and you don't want to be treated as such. And there should be a place for that, too.

Very interesting perspective. I completely understand your point. It used to be a tool, not it is more like a system with a mind of its own. I might need both.

Why do you say there is no business model in a search niche? StackOverflow and pleny of listing sites (Tripadvisor, Yelp, Zillow, Capterra to name a few) have been successfully built in this exact premise and the user experience of searching for restaurants, real state or software on these sites is usually much better than searching directly on Google due to the availability of custom filters and the amount of domain-specific metadata that the global search engines cannot read. While it's true that most of these sites heavily rely on SEO to drive inbound traffic from the big G, there is no doubt that they are perfectly viable businesses.

StackOverflow and those other sites aren't search engines. They may have search engines in them but not many people use them (the only time I reach StackOverflow, booking.com etc is via search engine referral). They're user content hosting and curation sites.

Technically you are correct, in the sense that they do not crawl the web like Google or Bing do. But from a user perspective, they do provide a very useful service of aggregation, discovery and comparison of structured data that is way more effective than using Google search queries, if you know the type of information you are looking for.

It's the corpus that matters, mostly. The StackExchange sites are Q&A formatted and with an SKG graph (such as in solr), you can do topic extraction on questions OR answers which then leads to being able to match other answers (with links) to other questions, among other things. With related topics, many other things come to life.

Sure, they have a business reason to do exactly what they do but I think as people grow up they specialize and the general stuff that fits everybody becomes useless. Google tries to personalize search results but that so far yielded echo chambers, not personalized discoveries.

I can't get better products by searching Google, I can get the best-spammed products or most promoted products only.

The fact that I am getting low-quality service and Google is printing money means that there is a place for good a good service and if that service cannot emerge due to Google's practices, it probably means that the regulators need to take action.

Or maybe the search is dead, long live social media.

The gist is, I am not happy with a service but the company that makes that product makes a lot of maney. Can't tell if I am an anomaly or if other people feel the same way because Google is a monopoly and maybe the regulators should make it possible to compete with Google and see if there's a space for a better service.

Yes yes, I am the product but I am the product only if I am happy with the stuff I'm getting in return.

... in return for you being the product? Haha. I don't think Google sees their end of that "transaction" being an actual transaction. You're an individual, and Google doesn't deal with those.

How would google's practices stop me from creating a search engine?

Keep in mind when Google started, Yahoooooo! Was the big player and Google overtook them by simply being better

Everything turns into an echo chamber eventually.

> navigational query like [facebook login]

Definitely have seen malicious adds for "facebook login", though that was probably 2016 or 2017.

I see comments like this all the time. Am I alone in that search results, for me, have gotten significantly _better_ since a couple years ago?

I can't help but think it's partially due to people using tools _specifically designed_ to make Google's job harder (FF SandBoxes, uBlock, etc) and not understanding the implications of using them... and then blaming Google for returning "bad" results.

I get a lot more seo spam than I used to, but the results are still quite good. I think we should give google some credit for that at least.

Like, a lot more seo spam though.

People have gotten really good (i.e. it's their full-time job) at "gaming" Google. That's not to say Google is fallible - every search engine is game-able depending on its algorithm - it's just that these people are _very_ clever.

They don't even need to clever as much as persistent, because of the selection effects.

> specifically designed to make Google's job harder "Better search" doesn't necessarily mean "more personalized search."

Unless you're a very average person, I'd argue it does.

Google have metrics on how much better it makes search - at least when I was there it makes quality a lot better, but not, say, double the quality. I think in the early days of the company they thought personalisation would be a much bigger win than it was - it was big enough that it didn't make sense to turn it off or anything like that, and you can see it in action when people say their results are customised to the programming language they are most recently using. But most of the time it's not doing all that much - and the biggest component of it was basic stuff like location and language.

> Getting specific is harder on Google because it just omits or misinterprets my search query keywords quite often.

I have this problem too. Google often thinks that I made a typo and presents me results for things I didn't searched for or care about and I have no way to force it to search for things I really want.

This is exactly when I switch to brain damage mode querying. You like fixing typos Google? Have some typos. You like figuring out what I really mean Google? Here, I'll formulate my query like a deranged toddler on PCP, best of luck!

Maybe it just feels more successful because it lowers my expectations. But at least you get to mash the keyboard like a maniac, do no corrections, press return and watch it just work.

It's kind of like watching Google do a "customer is always right" squirm.

If there were viable alternatives, people would shift over time.

If I type in “<name> Pentagon” on Google, the first link is LinkedIn. DuckDuckGo doesn’t even list it at all. There’s countless examples where DuckDuckGo just can’t find basic information. DDG is just unreliable beyond it’s silly name.

I'm always confused by this. I have ddg as the default on my home computer and Google is the default on my work. So I'm constantly using both. There aren't really any apparent differences to me in results. I'm not sure what everyone else is searching, but I search everything from how to spell a word that I should definitely know all the way to niche topics in physics.

Maybe it's because I don't have tracking enabled in Google (I'm not logged into my account when at work) and opt out of tracking where I can. Maybe this is the difference between the lack of difference I see and the huge difference so many others see. But I still don't see it as an issue because I generally find what I'm looking for with one search. Might be the third item, but that's not an issue to me.

I hear this so often that I assume something has to be different. I'm curious if others have ideas as to what it might be, or if I correctly identified them.

I use DDG as my default everywhere, and when I don't find something, I'll !g it as a bit of a last resort.

I'd estimate I'm doing that maybe 5% of the time. It seems to be even odds that I find a satisfying match, though, obviously those are all the hard queries.

The hardest queries are trying to dig up details about stories in the news.

I try to use and like DDG, but the results just aren't as good. For example, it seems to be completely unaware of Docker Hub. Like, pages from that entire subdomain never show up. I can search "Docker hub" and it doesn't even show up.

For that specifically, use !dhub or !dockerhub to search the site directly. Really, the magic of DDG is bang queries.

(Search for bang queries with, not surprisingly, "!bang".)

usually I just do !g and that solves the problem ;-)

But also, thank you. I didn't realise there were so many bangs.

I agree, unfortunately the search is really really sub-par and like others said, frequently doesn’t find basic things no matter how specific the keywords I use are.

I feel it might have even been better at one stage ?

Unless you're searching in Russian, DDG is mostly a skin for Bing search results anyways. The major players in the search engine space are Google, Bing/MSN, Yandex, and Baidu - with the latter two being mostly language-specific.

I find DDG has pretty acceptable or even good results most of the time.

The real power is in the "bangs", though; you can use the `!` to immediately jump to the first search result without seeing a search page, or use `!g` to switch to Google for this particular query, among others. It enables a sort of power-user usage that one wouldn't get with Google.

I don’t really get the logic, just use a good search engine in the first place ?

I'm saying that DDG can be "good enough", and that not having to click around on a results page can save you time if you know what you're doing.

I understand that for some people that's not enough of a time savings to make a difference, but I know DDG well enough to be able to `!` things and almost always immediately get to a successful result. I treat it as an extension of my brain at this point.

The logic is when you made DDG your default search for the address bar. Then it becomes the zero-stop jump off point for all the other search engines they have !bang syntax for (which are thousands, I think).

I used to configure those as search keywords in Firefox (and before, Opera), which do roughly the same without the exclamation point. But on a new browser, even just configuring your favourite top 5 searches is a lot of hassle compared to just setting DDG as the default search and using their bangs.

It's for when the good search engine is the site's own page.

If I'm working on python and numpy and I want to look up `argsort`, I know I want to search the numpy page, so !numpy argsort takes me right there.

Any kind of web dev is !mdn whatever and I don't have to scroll through a dozen BS tutorials, I just get the specs.

The !bang feature I use the most is !w for wikipedia, however I don't use wikipedia enough to justify making it my default search engine on the nav bar.

Your browser can assign keywords to custom search engines so you could just type "wiki blah" to see Wikipedia or "jira 123" to load a specific ticket.

What does a viable alternative look like?

I've been using Bing for the past few months; it's not great or terrible but is it "viable" enough for people to shift to over time? Or is it not viable because it's backed by a major corporation?

I'm sure there are search quirks with each engine but I've seen issues with Google too and yet it's the "devil we know" ... so people unconsciously work around them.

I've used Bing for years now. The only time I go back to Google is if I'm searching for something super specific (normally programming related). Bing takes care of most of my search needs.

I wonder if this is due to google possibly ignoring the robot.txt and Bing (which powers DDG) accepting Linkedin's request? https://www.linkedin.com/robots.txt

I've been using DDG almost exclusively & find it's results to be better than Google's with the exception of local businesses & maps. Google still has an advantage there.

Neither DDG or Google return any LinkedIn results for me unless I also add LinkedIn to the search, in which case I get the same results for both search engines.

Google knows what you want before you even ask. You might find that convenient, I find it unsettling.

I guess it’s not as bad as Facebook; at least Google doesn’t spoon feed you.

This ^ times a 1000.

Google simply has the best search product. They invest in it like crazy.

I’ve tried bing multiple times. It’s slow, it spams msn ads in your face on the homepage. Microsoft just doesn’t get the value of a clean UX.

DuckDuckGo results are pretty irrelevant the last time I tried them. There is nothing that comes close to their usability. To make the switchover, it has to be much much better than Google. Chances are that if something is, Google will buy them.

One thing to keep in mind when comparing DuckDuckGo to Google is that people do not use Google with an alternative backup in mind. When you DDG something and it fails, you can always switch to google.

But what about when Google fails? Unlike DDG, there is no culture of switching between search engines when googling. Typically, you'll just rewrite the query for google. And as rewriting the query is an entrenched part of googling, you are less likely to notice this as a failure. It is this training that's the core advantage nostrademons points out.

This right here is why I don't understand people who complain about DDG's search results. If you simply make the commitment to not use Google, for whatever reason that may be, then using DDG becomes exactly the same process of rewriting search queries until you get the thing you're looking for.

I've been using DDG exclusively since I was a contractor at Google years ago and have never had a problem finding things with it...

I don't necessarily agree. The hard part of search is building the index and differentiating _real_ promotion from the _fake_. There's a lot of SEO manipulation that Google does a good job avoiding.

Webspam is a really big problem, yes. It's very unlikely that you'd be able to catch up or keep up in that regard without Google's resources.

Building the index itself is relatively easy. There are some subtleties that most people don't think about (eg. dupe detection and redirects are surprisingly complicated, and CJK segmentation is a pre-req for tokenizing), but things like tokenizing, building posting lists, and finding backlinks are trivial - a competent programmer could get basic English-only implementations of all three running in a day.

I am not even that good of a programmer and I also agree with you that index relatively trivial. Other major issues, besides fighting spam:

- Hardware infrastructure and data center presence for extremely fast search from anywhere in the world. - Near real-time search suggestion. - personalized search results based on past search + geolocation. - Search to get instant results without having to go to a website.

Just to name a few. Google Search is the gold standard of a search engine, not because its Google or because they have been around for a long time and the brand name sticks (I am sure it helps too), but for the simple fact is no search engine is even remotely close to being as good as google and I have tried them all more of the less and given them shot. They are just not good at all.

I also don't understand the hate towards google being in charge of so many products so many people use, ie, Mail, Maps, Chrome, Android, Docs (to name a few). It's simply because they are damn good at it. If its a crime to make a product so good that people continue to use it, then I don't know what else people are supposed to do. As if we are asking google to make shit products, I just don't understand the reasoning.

It has nothing to do with the number of products, it’s what they do with their influence over the market. See AMP and incompatibilities between Gmail & IMAP, for example.

You concentrating on the literal interpretation of the phrase “give access to the index”. This is non-technical article which didn’t go into details, just read it as “give access to index & ranking”.

> Google simply has the best search product.

The best available doesn't necessarily mean the best possible. And Google is far from it, and it's getting worse, not better.

I've definitely noticed a decline in quality from Google results over the past few years in particular. I don't know if that's because SEO has gotten control of the results of if Google's algo is shoving lower quality up higher for revenue but it's become difficult.

Using a bit of Google-fu I'm usually able to find what I need quickly but it's still more of a hassle than it used to be.

There's exponentially more background noise than there used to be

It's easier to return the most relevant 10 results when there's only 10 thousand options than when there's 10 trillion options with 10 thousand new ones created every day.

I work at Google but not on Search.

My guess is that it's because Google Search now also has to cater to queries from Assistant. Being required to handle web, mobile, and assistant probably necessitated tradeoffs in quality of one over another.

More generally I feel like as the company gets bigger it just gets much harder to handle all the complexity and keep things focused.

I don't know why you're getting downvoted, because the quality has 100% tanked over the last few years. I agree that there may be some selection bias between us, but it's at least got some of my normie non-technical friends commenting about it, so it's not completely without merit. I have a couple of theories, one of them is also a warning.

First, I think search results at Google have gotten worse because people are not actually good at finding the best example of what they're looking for. People go with whatever query result exceeds some minimum threshold. This means when Google looks at what people "land on" (e.g. something like the last link of 5 they clicked from the search page, and then which they spend the most time on according to that page's Google Analytics or whatever), they aren't optimizing for what's best, they're optimizing for what is the minimum acceptable result. And so what's happening is years and years of cumulative "Well, I suppose that's good enough" culminating in a perceptible drop in search result quality overall.

Second, Google has clearly been giving greater weight to results that are more recent. You'd think this would improve the quality of the results which "survive the test of time" but again, Google isn't optimizing for "best" results, they're optimizing for "the result which sucks the least among the top 3-5 actual non-ad results people might manage to look at before they are satisfied". So this has the effect of crowding out older results which are actually better, but which don't get shown as much because newer results have temporal weight.

My warning is this, too, which you've surely noticed: Google search has created a "consciousness" of the internet, and in the 90s it used to be that digitizing something was kind of like "it'll be here forever" and for some reason people still today think putting something online gives it some kind of temporal longevity, which it absolutely does not have. I did a big research project at the end of the last decade, and I was looking for links specifically from the turn of the century. And even in 2009, they were incredibly hard to find, and suffered immensely from bitrot, with links not working, and leaning heavily on archive.org. Google has been and is amplifying this tremendously, by twiddling the knob to give more recent results a positive weight in search. Google makes a shitload of money from mass media content companies (e.g. Buzzfeed) and whatever other sources meet the minimum-acceptable-threshold for some query, versus linking to some old university or personal blog site which has no ads whatsoever. So the span of accessible knowledge has greatly shrunk over the last few years. Not only has the playing field of mass media and social media companies shrunk, but the older stuff isn't even accessible anymore. So we're being forced once more into a "television" kind of attention span, by Google, because of ads.

I find the single hardest thing to search for these days is anything more than a few months old on YouTube... They hate older videos, it feels like. Beyond that, I keep seeing suggestions on new content from years ago... it's just weird.

I know it's not google proper, but I'd guess a significant number of their searches are specific to youtube.

I believe they try to put newer content first in order to make a more fair distribution of views. If you order results by popularity on yt, you will see that it uses just an "order by view count desc" (no relationship with like/dislike ratio), which is bad because it keeps popular some not so good quality videos published on first yt years.

Worse still, imho is that it may not be a popular video I'm looking for. I really wish they'd factor in a "I have viewed" for results.

I disagree. It works great for me. Maybe once every few days I will use !g when I can't find something, but I rarely end up finding it on Google either.

I read somewhere that someone used a skin to make ddg look identical to Google. After doing that, they never even thought about using Google again.

Microsoft thinks what they have is Clean UX.....

Microsoft just needs to get their head out of their ass! With the amount of money they have spent- and what they have to show for it- they should should just can the entire Bing team (or whatever they call their search engine team today). Not only have they not sucked- if they just folded - they'd let the monopoly argument against Google ride somewhat.

Sure, it costs $50 to grep it, but how much does it cost to host an in-memory index with all the data?

This is not a proposal to just share the crawl data, but the actual searchable index, presumably at arms length cost both internally & externally.

The same ideas could be extended to the Knowledge Graph, etc.

IMO the goal here should not be to kill Google, but to keep Google on their toes by removing barriers to competition.

The data was about 55TB of compressed HTML last I looked, so that's about 70 r5a.24xlarge instances, each going for $5.424/hour, so about $350/hour or $250K/month. That's not cheap, and definitely not something you'd put on your personal credit card, but it's well within the range of a seed-funded startup. Sizes may vary a bit depending upon the exact index format, but that should be a rough ballpark. With batch jobs being so cheap, you could experiment a bit with your own finances and then seek funding once you can demonstrate a few queries where your results are better than Google. If you actually have a credible threat to Google, you'll have investors breathing down your neck, because it's a $130B market.

API access to either the unranked or ranked index in memory wouldn't do anything useful, BTW. To have a viable startup you need something a lot better than Google, which means that you need algorithms that do something fundamentally different from Google, which means you need to be able to touch memory yourself and not go through an API for every document you might need to examine. Remember, search touches (nearly) every indexed document on every query - if you throw in 200ms request latency for 4B documents your request will take roughly 25 years to complete.

Knowledge Graph is already public - it was an open dataset before it was bought by Google, and a snapshot of its state at the point Google closed it to further additions is still hosted by Google:


(It's only 22G gzipped, too - you can download that onto a personal laptop.)

"Remember, search touches (nearly) every indexed document on every query" - wait, why does that happen?

Doesn't it only touch ones with at least one of the search terms in, or stemmed/varied words relating to some of the terms? And does that via an index?

I struggled with how to word that in a way that's both true, understandable, and doesn't give away any proprietary information. Added "indexed" to clarify but I didn't fix up the numbers, so they're likely an overestimate.

Basically, yes, it uses an index and touches only documents that appear in one of the relevant posting lists. However, after stemming, spell-correcting, synonyms, and a number of other expansions I'm not at liberty to discuss, there can be a lot of query terms that it needs to look through, covering a significant portion of the index. Each one of these needs to be scored (well, sorta - there are various tricks you can use to avoid scoring some docs, which again I'm not at liberty to discuss), and it's usually beneficial to merge the scores only after they have been computed for all query terms, because you have more information about context available then.

There's a reason Google uses an in-memory index: it gives you a lot more flexibility about what information you can use to score documents at query time, which in turn lets you use more of the query as context. With an on-disk index you basically have to precompute scores for each term and can only merge them with simple arithmetic formulas.

> Basically, yes, it uses an index and touches only documents that appear in one of the relevant posting lists. However, after stemming, spell-correcting, synonyms, and a number of other expansions I'm not at liberty to discuss, there can be a lot of query terms that it needs to look through, covering a significant portion of the index.

But, reading through the other comments, leaving out this part would make it better than Google.

Maybe stemming. I remember when Google added stemming (somewhere in the early 2000s). I was conflicted about it because I didnt want a search engine to second-guess my query (can you imagine??), but I also saw the use because I was already in the habit of trying multiple variations.

Auto spelling correct is a no-no. Just say "did you mean X?" and let people click it if they misspelled X. No sense in querying for both the "typo" and "corrected" keywords, because the "typo" would rank much lower, right?

Similar for synonyms. Either it should be an operator like ~, or maybe it should just offer a list (like the "did you mean" question) of synonyms to help the user think/select similar words to help their query.

> Each one of these needs to be scored (well, sorta - there are various tricks you can use to avoid scoring some docs, which again I'm not at liberty to discuss)

You mean like Wand or BMW?

> Knowledge Graph is already public > https://developers.google.com/freebase/

That dump is outdated, not supported, and very incomplete comparing to what google has now.

Perhaps move the google index and the facebook graph to "utility" companies, with google/facebook being frontends/consumers for those companies. Tiered access costs based on query/access volumes could fund the utility, and allow smaller companies to have access with costs based on their scale, if they can monetise as they scale up to cover the costs then they should not be in business.

>The comments here that PageRank is Google's secret sauce also aren't really true - Google hasn't used PageRank since 2006.

That's quite a claim considering they were reporting PageRank in their toolbar until 2016, and toolbar PageRank was visible in Google Directory until 2011.

Are you talking about PageRank from the original patent?

It is a seemingly incorrect claim. Google has semi-recently, publicly said they still use PageRank as one of their signals.



They replaced it in 2006 with an algorithm that gives approximately-similar results but is significantly faster to compute. The replacement algorithm is the number that's been reported in the toolbar, and what Google claims as PageRank (it even has a similar name, and so Google's claim isn't technically incorrect). Both algorithms are O(N log N) but the replacement has a much smaller constant on the log N factor, because it does away with the need to iterate until the algorithm converges. That's fairly important as the web grew from ~1-10M pages to 150B+.

> That's fairly important as the web grew from ~1-10M pages to 150B+.

This is the weird thing -- it feels smaller. Back in the early 2000s it really felt like I was navigating an ocean of knowledge. But these days it just feels like a couple of lakes.

(also, I'm pretty sure it was billions already quite early on?)

So what's the name of the new algorithm?

>The real reason Google's still on top is that consumer habits are hard to change, and once people have 20 years of practice solving a problem one way, most of them are not going to switch unless the alternative isn't just better, it's way, way better.

I agree about consumer's habits, but not about quality - i mean google of today is worse search engine than google of 5 years ago.

Now google tries to guess, badly, what you meant, instead of giving you what you asked for. The pleasure of dealing with IT systems is that they give you what you ask them for, not what you meant - it introduces extra error, and worse - one that cannot be fixed by user.

I can rephrase my querry, and google will still interpret it - leading to same batch of useless results.

I can also comment here. I built and still run a petabyte-scale web crawler:


Common Crawl and other sources do in fact have a ton of data that can be used which is very affordable.

The DATA itself stopped being a real competitive advantage probably 2008-2010.

Google's major advantage now is its algorithms and the fact that they've proven it works and is reliable.

Most importantly, its the brand. Google MEANS search in the US and that won't change anytime soon.

PS,... if you need tons of web and social data Datastreamer can hook you up too :)

>"Indexing is the (relatively) easy part of building a search engine. CommonCrawl already indexes the top 3B+ pages on the web and makes it freely available on AWS."

Interesting I would have thought that crawling at this scale and finishing in a reasonable amount of time would still be somewhat challenging. Might you have any suggested reading for how this is done in practice?

>"It costs about $50 to grep over it, $800 or so to run a moderately complex Hadoop job." Curious what type of Hadoop job you might referring to here. Would this be building smaller more specific indexes or simply sharding a master index?

>"Google hasn't used PageRank since 2006." Wow that's a long time now. What did they replace it with? Might you have any links regarding this?

Crawling is tricky but it's been commoditized. CommonCrawl does it for free for you. If you need pages that aren't in the index then you need to deal with all the crawling issues, but its index is about as big as the one most Google research was done on when I was there.

$50 gets you basically a Hadoop job that can run a regular expression over the plain text in a reasonably-efficient programing language (I tested with both Kotlin and Rust and they were in that ballpark). $800 was for a custom MapReduce I wrote that did something moderately complex - it would look at an arbitrary website, determine if it was a forum page, and then develop a strategy for extracting parsed & dated posts from the page and crawling it in the future.

A straight inverted index (where you tokenize the plaintext and store a posting list of documents for each term) would likely be more towards the $50 end of the spectrum - this is a classic information retrieval exercise that's both pretty easy to program (you can do it in a half day or so) and not very computationally intensive. It's also pretty useless for a real consumer search engine - there's a reason Google replaced all the keyword-based search engines we used in the 80s. There's also no reason you would do it today, when you have open-source products like ElasticSearch that'd do it for you and have a lot more linguistic smarts built in. (Straight ElasticSearch with no ranking tweaks is also nowhere near as good as Google.)

Thanks for the detailed response. I appreciate it. I will look into CommonCrawl. Cheers.

IMHO a simpler and probably the only viable way to force competition is to legally force Google to not respond to any query on certain periodic time periods.

For instance, if you were to forbid Google from operating on every odd-numbered day, then 50% of the search engine market and revenues would immediately be distributed among competitors and furthermore users would be forced to test multiple engines and they could find a better one to use even when Google is allowed to operate.

Obviously this has a short-term economic cost if other search engines aren't good enough as well as imposing an arbitrary restriction on business, so it's debatable whether this would be a reasonable course of action

Banning anything is often not a good policy since it usually creates secondary markets.

Depends on how you count the date, this could create markets where people in different countries will sell Google search results to each other. New VPN providers pop up with the promise of 24h Google coverage. Software startups switch to a system where you bing work 16 hours straight, then get the next 32 hours break and repeat. "Breaking news" has a new Oxford definition, since newspapers change plan to publish news 5 minutes before Google opens for search. Electricity price increases for the first 2 hours of the odd-numbered day to combat the spike in demand. Comcast introduces a new fast lane at only $199 a month that has no slow down access to Google. University groups lobby for a new exemption in the law allowing unrestricted weekdays access. Political parties lobby for also blocking Google on the day of debates, regardless of whether it's an odd-numbered day. It's kinda fun to keep going.

Any search engine that was unavailable for 50% of the time would soon have 0% of the market, not 50%.

This can be solved, in the odd-days example, by making either the second most popular or all other search engines operate only on the even days (as well as making the restriction apply to the most popular engine instead of Google in particular).

This has other drawbacks of course.

Actually the omnibox made it really easy to switch to ddg. With an occasional fallback to google.

I have no problem with advertising etc. but the tracking and selling of data is such an idiotic thing. We as consumers should have a global internet-law, and be reimbursed for data leaks or usage outside the scope of the application.

By no problem with ads I mean the original ads of google. Very clear they were ads and not intermingled with the results. Scrolling down for the results is nuts. I will click ads if they’re relevant, regardless of if they’re on the right or in the results. So please stop supporting this fraud against advertisers.

I think that the "fallback to Google" might actually tend to diminish consumer confidence in DDG. Every time you use it, you basically say to yourself "$newRiskyStrategy fails sometimes, we still need $oldReliableStrategy".

Instead, what might help DDG is a plugin that detects when you go past the first or second page of Google search results, and suggests that you might get better results on DDG. It's a little intrusive, but the mental nudge becomes "$oldReliableStrategy has flaws, try $newRiskyStrategy". You get a positive emotional interaction with DDG rather than "forcing" yourself to use it all of the time and "failing back" to Google.

> when I was at Google nearly all research & new features were done on the top 4B pages, and the remaining 150B+ pages were only consulted if no results in the top 4B turned up

This may help to explain the poor quality of some the results on queries I run on Google lately that return content obviously written for ranking in SEO but that have very little value.

I have 2 questions:

- What make"the top 3B+" the top ones?

- How can I "force"a search on the other 150B+ pages?

I find it odd that you claim to be a former Google search engineer and in the end boil down the success of Google search to brand recognition / loyalty. You kinda glossed over the insane complexity of building and maintaining a high quality search engine, really weird comment to be honest.

> you'd have to break Google up and then forbid any of the baby-Googles from using the Google brand or google.com domain name.

Just let "google" become the generic term for search, as it's already well on its way.

Page Rank is a synonym for link juice. So when you say Google hasn't used page rank since 2006, can you confirm that you are talking about link juice as opposed the the old toolbar representation of page rank? And assuming you do mean link juice, well why do links still work so well for seo?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact