> Indexing the top billion pages or so won't take as long as people think.

This is what makes me wonder why we don't have a LOT of competing search engines. Perhaps I'm vastly underestimating the technology and difficulty (I could well be - it's not my domain) but surely it can't be THAT hard to spawn Google-like weighted crawl-based search results?

It's a long-since solved problem - heck, PageRank's first iteration recently came out of patent protection - it could just be copy-pasted. Why aren't all the big companies Doing Search?
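
For a sense of how small the original idea is, here is a minimal sketch of PageRank by power iteration. It assumes a toy hand-made link graph and the commonly cited damping factor of 0.85; the `pagerank` function and `toy_graph` are made up for illustration and are nowhere near what a production engine runs.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outgoing links): spread evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a, b and d all link to c; c links back to a.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
```

On this toy graph `c` comes out on top, because three pages link to it, and `d` comes last, because nothing links to it.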

SEO spam, and poor quality content I would guess. Google has bolted on a ton of ML over the last ten years to fight it.

And yet most Google results that don't point at one of a handful of major sites are SEO spam :-/

The spammers won. Google gave up and settled for "we like the right kind of spam—the kind that took a little effort, and makes us money".

I did a search earlier today on Google for "north face glacier" - turns out that the company North Face has a Glacier product so as far as I can tell that's all the search results contain.

Searching for "north face glaciation" did help as the first page of search results did have one entry on the topic I was actually searching on!

Maybe they should have an "I'm not buying anything" flag!

This has been the problem with results for the past few years. E-commerce gets priority in all things and you have to wade through pages of useless links if you want actual content about what you are searching for.

Big brands have the ad budget to advertise. That drives awareness. If they have offline stores, those can be thought of as both destinations AND interactive billboards which drive further brand awareness and demand for branded searches.

Many of the top search queries are navigational searches for brands.

So if tons of people are searching for your brand, then for a potentially related query that contains the brand term plus some other stuff, the search engine will likely return at least a result or two from the core brand, just in case that was what you were looking for.

It's not just ML, but the people that provide the labeling for the ML.

Google pays some large number of people to do searches and grade the various results they get to see if the answers are good, which then helps feed back into the ML.

Heck, according to this article[0], Google has been paying people to evaluate their search results since 2004.

[0] https://searchengineland.com/interview-google-search-quality...

It doesn't feed back into the ML directly, according to Google. Instead they use it to evaluate changes to search algorithms. If they get an increase in thumbs up back from the Quality Raters then their changes were positive. If not, they figure out why.
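
Roughly, that evaluation loop could look like the following. This is only a sketch under stated assumptions: ratings are plain thumbs up/down booleans, and `thumbs_up_rate` and `change_is_improvement` are hypothetical names, not Google's actual process or tooling.

```python
def thumbs_up_rate(ratings):
    """Fraction of rater judgments that were thumbs up (True)."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)


def change_is_improvement(baseline_ratings, candidate_ratings):
    """True if raters liked the changed algorithm's results more often."""
    return thumbs_up_rate(candidate_ratings) > thumbs_up_rate(baseline_ratings)


# Made-up rater judgments for the same queries under the old and new algorithm.
baseline = [True, False, True, False, False]
candidate = [True, True, True, False, False]
improved = change_is_improvement(baseline, candidate)
```

A real evaluation would also need enough ratings to tell a genuine improvement from noise.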

The original 2012 FTC investigation of Google anti-trust activity showed how they might have abused this process. Interesting read, no matter which side you take: http://graphics.wsj.com/google-ftc-report/

I feel that for certain topics, especially anything to do with tutorials or coding, even Google falls foul of SEO content. Just Google ‘android custom ROM <phone model>’ for instance. There are stock pages for all of them, identical save for the phone model, and clearly not applicable.

PageRank was an innovation at the time, but modern search engines require training models on lots of query logs to get good performance. It's expensive to make a really good search engine.

It is because people usually just stick with their favorite instead of using a variety of search engines. It becomes rather winner-takes-all.

Google for general search. DuckDuckGo for general search if you want something a bit more private but not extreme enough to run your own spiders. Bing mostly for porn search - not being snarky, some people do consider it to have better results.

And searx.me if you want to be even more private, and you can run that yourself if you so choose.

Querying an index isn't a solved problem, building it is.

It's easy to gather the necessary data, but it's hard to know which parts of that data are the most relevant for finding good content and avoiding bad content. Is it more relevant if key words show up in links or titles than in the body of the text? If so, SEO spam sites will include a bunch of keywords in links and titles. Is it more relevant if keywords show up in the first 200 visible words of the page? If so, spam pages will make tons of pages with relevant keywords at the top.
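
To make the cat-and-mouse concrete, here is a naive field-weighted keyword scorer and a stuffed page that beats an honest one. A sketch only: the weights in `FIELD_WEIGHTS` and both example pages are invented for illustration, not taken from any real engine.

```python
# Hypothetical weights: a keyword hit in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "links": 2.0, "body": 1.0}


def score(page, query):
    """Sum of weighted keyword occurrences across the page's fields."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = page.get(field, "").lower().split()
        total += weight * sum(words.count(term) for term in terms)
    return total


honest_page = {
    "title": "Glaciation of the north face of the Eiger",
    "links": "",
    "body": "How ice shaped the north face over several glacial periods.",
}
spam_page = {
    "title": "north face north face north face",
    "links": "north face deals north face outlet",
    "body": "buy now",
}

honest_score = score(honest_page, "north face")
spam_score = score(spam_page, "north face")
```

With these made-up weights the stuffed page scores 26.0 against 8.0 for the honest one, which is exactly the incentive described above: whatever signal the ranker rewards, spam pages will manufacture.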

The hard part about building a search engine isn't indexing the internet, it's adapting to spam. Spammers are continually adapting to changes in the algorithm, so the algorithm needs to adapt as well. And the more popular your search engine is, the more money you make and the more able you are to adapt to spam (and the more spammers focus on your engine).

So, the problem isn't that Google has a better index (though I'm sure it does), the problem is that nobody else has the will to spend the money necessary to tune the search algorithm to stay on top of spammers. When Google started, companies didn't care as much about improving their index and instead focused on building their other content (Yahoo, MSN, etc). Google saw the value of search and got a lead on everyone else in terms of curating results, and now they have the momentum to stay in front and have shifted to building content to improve monetization. Nobody else has the monetization network for search that Google has, so they'll continue having the problem that other companies had (Microsoft wants to point you to their other services, DuckDuckGo is limited by their commitment to privacy, etc).

In short, Google wins because:

- it was better when it mattered
- it makes money directly from search
- its other services improve their ability to understand what users want, which improves search quality and ad relevance

You can't make a better algorithm by being clever, you make a better algorithm by having better data, and that's hard to come by these days. The only way I can think of a competitor stepping in is if they target an underserved demographic and focus data collection and monetization there, and DuckDuckGo is close by targeting privacy conscious power users.

> The only way I can think of a competitor stepping in is if they target an underserved demographic and focus data collection and monetization there, and DuckDuckGo is close by targeting privacy conscious power users.

The irony there is that DuckDuckGo can't collect much of that data precisely because of their privacy focus.

> The hard part about building a search engine isn't indexing the internet, it's adapting to spam. Spammers are continually adapting to changes in the algorithm, so the algorithm needs to adapt as well.

Adaptive crawlers?

> Querying an index isn't a solved problem, building it is...

You didn't just hit the nail on the head; you drove it all the way in with a single blow. Bravo.

Most likely answer: lack of diversity in revenue models.

Outside of ad revenue, search has always been seen as something of a "charity" effort for the internet. It's "boring" infrastructure work that can be critically useful but doesn't really make money directly on its own. No one wants to pay a "search toll" and there's no government agency in the world that the internet would trust as a neutral index to run it as actual tax-basis infrastructure.

Which begs the question, if adblock makes advertising based models go the way of the dodo, what happens to search?

"Indexing" is only part of the problem; it's a batch job. I find being able to respond to searches across a huge data set on the order of milliseconds (while having planet-scale failover) to be a lot more challenging to implement.
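
The split between the batch job and the serving path can be sketched with a toy inverted index: `build_index` is the offline part, `search` is what has to be fast at query time. This assumes whitespace tokenization and in-memory sets, and the function names are hypothetical; it says nothing about sharding, replication, or failover, which is where the real difficulty sits.

```python
from collections import defaultdict


def build_index(docs):
    """Batch step: map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Query step: documents that contain every term in the query."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)


# Made-up three-document corpus.
docs = {
    1: "north face glacier jacket",
    2: "glaciation of the north face",
    3: "glacier hiking guide",
}
index = build_index(docs)
hits = search(index, "north face")
```

Here `hits` is `{1, 2}`: only the documents whose posting lists contain both terms.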

It's not the 'raw' search itself. It's the billions (trillions) of queries they've captured: Person X searches for query Y and clicks on result Z.

This is far more valuable than the general page rank algorithms that were initially developed and have already been duplicated many times in academia and business.
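
As a rough illustration of why those logs matter, here is a sketch that counts (query, clicked result) pairs and reorders candidates by click count. The log format, the URLs, and the `count_clicks` and `rerank` helpers are all assumptions made up for the example; real systems use far richer signals.

```python
from collections import Counter, defaultdict


def count_clicks(click_log):
    """Aggregate (user, query, clicked_result) records into per-query click counts."""
    clicks = defaultdict(Counter)
    for _user, query, result in click_log:
        clicks[query][result] += 1
    return clicks


def rerank(clicks, query, candidates):
    """Order candidates by past clicks for this query; ties keep their original order."""
    counts = clicks.get(query, Counter())
    return sorted(candidates, key=lambda url: counts[url], reverse=True)


# Hypothetical log: person X searched for query Y and clicked on result Z.
click_log = [
    ("x1", "north face glaciation", "geology.example/eiger"),
    ("x2", "north face glaciation", "geology.example/eiger"),
    ("x3", "north face glaciation", "shop.example/jacket"),
]
candidates = ["shop.example/jacket", "geology.example/eiger", "blog.example/misc"]
clicks = count_clicks(click_log)
ordered = rerank(clicks, "north face glaciation", candidates)
```

The code itself is trivial; the billions of rows that would go into `click_log` are the part a newcomer does not have.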

It's so weird how about 1/3 of the time on DuckDuckGo, I add a !g in frustration... half the time I still get nothing and I end up posting on Stack Overflow, but half the time I get a little more useful information.

Google custom tailors results for each and every machine. Even if you're not signed in, Google uses your browser fingerprint, the OS it's reporting and location/IP data to custom fit results. There is no "stock" google result.

This is something DuckDuckGo et al. can't do if they want to focus on a privacy model. DDG does offer location-specific searches, which can be helpful.

Aside from the quality issues that others have already mentioned, I think that simply gaining traction for a new search engine is incredibly difficult - people typically use whatever is the default in their browser, and/or Google/Baidu/Yandex (which are surely the best known in their respective regions).

Consider DuckDuckGo, which sells itself on privacy, but after more than a decade has only 0.18% market share. Without the power to make it the default in an OS or browser, you'd have to have a really strong value proposition to convince people to switch.

I don't think this is correct. For years, the #3 search query on Bing in the US was "Google", and globally it used to be a double-digit percentage of all Bing queries. That suggests to me that people with a default Bing search engine had learned in droves to click their way to the preferred engine regardless of what the default was, and did so without being technically skilled enough to change the default once and for all. I don't know how large a group the latter is, but it seems hard to argue that the two together are small.

> Why aren't all the big companies Doing Search?

They are.
