The main problem is, I think the author is wrong about what Google's "crown jewel" is. Yes, Google has a huge index, but most queries aren't in the long tail. Indexing the top billion pages or so won't take as long as people think.
The things that Google has that are truly unique are (1) a record of searches and user clicks for the past 20 years and (2) 20 years of experience fighting SEO spam. (1) is especially hard to beat, because that's presumably the data Google uses to optimize the parameters of its search algorithm. (2) seems doable, but would take a giant up-front investment for a new search engine to achieve. Bing had the money and persistence to make that investment, but how many others will?
From what I can tell, Google cares a lot more about recency.
When I switch over to a new framework or language, search results are pretty bad for the first week, horrible actually as Google thinks I am still using /other language/. I have to keep appending the language / framework name to my queries.
After a week or so? The results are pure magic. I can search for something sort of describing what I want and Google returns the correct answer. If I search for 'array length' Google is going to tell me how to find the length of an array in whatever language I am currently immersed in!
As much as I try to use Duck Duck Go, Google is just too magic.
But I don't think it is because they have my complete search history.
Also people forget that the creepy stuff Google does is super useful.
For example, whatever framework I am using, Google will start pushing news updates to my Google Now (or whatever it is called on my phone) about new releases to that framework. I get a constant stream of learning resources, valuable blog posts, and best practices delivered to me every morning!
It really is impressive.
For the same reasons you’re exalting them, I have non-technical friends who asked me how Google knows so much about them (and for suggestions on how to avoid it) because they found it too creepy.
I don’t think people forget Google’s results are useful; some just think they’re more creepy than valuable. You seem to have picked your side in that (im)balance, and other people prefer the other side.
There’s also the relevant consideration that no matter how useful they may be, they should have no right to impose themselves on you. By this I mean that one should be free to refuse their creepiness, understanding the price is their usefulness. Yet, Google is the subject of privacy violations all the time, and they are caught time and again lying about what they collect on users.
Just as a general observation without taking either side:
People routinely fail to recognize both sides of a particular thing. It's why we have sayings like "You don't know what you've got til it's gone."
But I have a hard time believing Google truly partitions everything in a multi-account setup.
Also, for incognito stuff, it'd be nice to have read-only browsing based on stock profiles related to various activities or people.
Tell that to the people who had their privacy violated by Street View. And the people who specifically disabled location services on their Android devices but were still tracked. Or all the people who have no idea what Google Analytics is and never consented to it, but are profiled by it everyday.
> All of their services are a choice you are making.
I do my best to avoid privacy invading companies, and as a technical user I find it tiring and know I deal with consequences (e.g. broken websites). It perplexes me that comments like yours still pop up. We’re not the only segment of the population that exists; non-technical users are the majority, and they have the same right to privacy as we do, with a modicum of transparency. If even technical people are regularly tripped by privacy invasions we didn’t know about, what chances do non-technical users have?
1. Generally speaking, I would think VERY few people care about an image of their property being on Street View.
2. It's not really illegal to take pictures, so even from a legal standpoint it seems like a gray area.
3. I understand there can be individual reasons for not wanting this, but it seems to be a very large net positive. And I would apply that statement to most other tracking and data policies they have.
If they are lying about how their services track people, that is definitely grounds for concern. The transparency can definitely be improved, but still, these are people with Android phones and people using Google Analytics. No one is forced to use these things; they are free to use any other service or create their own.
And my attitude is out of pragmatism and how I think privacy issues should be handled. I don't have any problem with the way Google uses my data so I don't care to fix a non problem. And I don't see it as their responsibility to change a way of business when anyone is free to use any other service or create their own, since I don't find it offensive.
> Google on Tuesday acknowledged to state officials that it had violated people’s privacy during its Street View mapping project when it casually scooped up passwords, e-mail and other personal information from unsuspecting computer users.
That answers your first three paragraphs. There’s no “if” to their lying and privacy invasions. They’ve been caught and admitted their actions time and again.
> No one is forced to use these things they are free to use any other service or create their own.
It is here I will respectfully give up on continuing the conversation with you. You’re either ignoring my main point or truly don’t care about the majority of users. Most people don’t understand the ramifications of these choices, and for good reason; they are hard to understand. By suggesting non-technical users create their own services and devices, I’m now wondering if you’re trolling me.
> And my attitude is out of pragmatism (…) I don't have any problem with the way Google uses my data
Which is valid, but irrelevant. I’ve already mentioned in the top post different people make different choices. I presented another side and used facts to justify it. If you’re going to answer with mere opinion, you’re not adding to the points made by the original poster.
This is information that Google doesn't have any need for (noise) and didn't want in the first place.
They also self-reported the failure, where they could have just nuked it and we wouldn't be having this conversation.
My first points were about the streetview product. Scooping up passwords is obviously not the intent of that product, maybe that was an error or they changed the core product at some point? I can't read the paywalled article.
I'm not suggesting non-technical users create products... you're reading so far out of context. Just because user X can't create a new product does not mean that we should place sanctions on company Y. I'm glad you used facts somewhere else because in this post you just illogically connect a bunch of dots.
Yes, some of it is my opinion, and a lot of this is yours. But it is still a fact that no one is forcing you to use these products; then you went off about stolen passwords and trolling and resigned from the argument. That sounds like the rationale of a completely one-sided, biased individual in itself, respectfully.
Yes everyone agrees transparency is good and lying is bad. Google is not Evil Or Benevolent. They're just people...
> And I don’t use them. I hoped that by continuing to mention non-technical users you’d get it, but this was never about me. You keep bringing up that argument, but read what you replied to in the first post — I recounted the experience of non-technical people I know, not my experience. Stop telling me I have a choice; the point is not us, it’s non-technical users who don’t have the knowledge to make informed choices!
Haha you are so ridiculous. This was your first post:
> There’s also the relevant consideration that no matter how useful they may be, they should have no right to impose themselves on you.
Then you say you don't know why I bring up that you don't need to use Google's services... C'mon man, get real. That's why the point about using alternatives or creating new ones is very relevant, and this entire thread is about sanctions. Don't start a convo you can't participate in and then just claim you won and leave; that's childish behavior.
That is an insane extrapolation, and the reason I don’t want to continue the conversation with you: you’re answering points I’m not making. I haven’t even hinted at sanctions; I have no idea where you’re getting that from.
> But a fact is still no one is forcing you to use these products
And I don’t use them. I hoped that by continuing to mention non-technical users you’d get it, but this was never about me. You keep bringing up that argument, but read what you replied to in the first post — I recounted the experience of non-technical people I know, not my experience. Stop telling me I have a choice; the point is not us, it’s non-technical users who don’t have the knowledge to make informed choices!
> That sounds like a rationality of a completely one-sided biased individual in itself, respectfully.
Believe what you want. I just don’t want to keep wasting my night arguing with someone that started a discussion but refuses to address the points originally made. Why reply, then?
Maybe I’m not explaining myself well enough, or in the correct way for you to understand, or maybe you’re the one not grasping what I mean. It doesn’t really matter where the problem lies, just that it’s clearly not working.
Maybe if we ever meet in person we can resume this conversation, but tonight it’s not being productive, so I genuinely wish you a good week and sign out here.
Some writer from Gizmodo tried that last February. Let's just say that you are technically correct that you don't need any of the big tech firms.
So it could very well be that as more users adopt the new language/framework in the first couple of weeks, they have taught Google those associations.
Google isn’t a search company. They are a distributed machine learning company that makes most of its money from learning what people want and showing them relevant ads.
Really good or really bad only exists if there is something else to compare it to.
I felt the same when I owned a Pixel, after hearing about Google Now and their ML magic. It was nothing more magical than an iPhone in terms of suggestions. The camera was amazing, but not all this supposed contextual stuff.
For many people it is enough to be totally creeped out about Google.
Also, that Google remembers context can be handy but it is not essential. Without context, I am sure you would be equally capable of finding what you are looking for, although it might take a little more typing since you'll have to supply the context yourself. Imho, convenience is not a good argument for giving away your personal information.
There's no arguing what you're describing is useful, but it's nice to keep in mind that there are downsides even if you ignore the privacy argument (which, IMO, shouldn't be ignored).
I'm not quite sure about that. 15% of Google searches per day are unique, as in, Google has never seen them before. That's quite an insane number.
Mark's blog is amazing and worth more than any data science degree imho.
In my experience YaCy works really well. You have it crawl the sites you frequently visit and their external links, and it quickly accumulates into something more accurate than Google.
Without some clear numbers on that from a major search engine, I think this might be very difficult to infer.
Or do you mean queries forwarded by home assistants trying to parse inputs?
The calling card of the developer realising that real users never act like you expect :)
As a developer, I search using keywords; for example, if I was looking for property for sale in Inverness, I might search for "property Inverness", whereas I've seen and heard "typical" users use something like "find me a 2 bedroom house with a garden for sale in the North of Inverness" - much more verbose, and containing stop words and phrases unlikely to help (I think!).
So did I.
Around the time I left Google behind I had started to search like my wife did, using full sentences. It sometimes worked better, I think.
Natural language is likely the preferred search input method for kids under a certain age, who cannot yet type fluently. My kids formulate very long, complex queries verbally. The other day my son asked Alexa why the machine gun is such a deadly weapon. She replied with a snippet from Wikipedia that was surprisingly relevant.
I mean, with the newest advances like machine learning, it's more and more possible to _semantically_ link queries. If that's the case, those 15% could become 5% truly unique searches, or even less.
"how dumb is trump" and "how dumb is donald trump" are two different searches, but they semantically belong together because they mean the same thing.
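A toy sketch of that kind of collapsing, with token-set overlap standing in for real semantic matching (which would use learned embeddings); the 0.5 threshold is an arbitrary choice for illustration:

```python
# Toy stand-in for semantic query grouping. Real systems use learned
# embeddings; plain token-set Jaccard similarity is a crude proxy, and
# the 0.5 threshold is arbitrary.

def jaccard(a: str, b: str) -> float:
    """Overlap of the two queries' token sets, from 0.0 to 1.0."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def same_intent(a: str, b: str, threshold: float = 0.5) -> bool:
    return jaccard(a, b) >= threshold

# The two example queries share 4 of their 5 distinct tokens.
print(same_intent("how dumb is trump", "how dumb is donald trump"))  # True
```

Even this crude measure merges the example pair (Jaccard 0.8); an embedding model would also merge rephrasings that share no tokens at all.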
Some of the new things are probably variation as well - as others have mentioned, sentences and voice commands can give lots of new stuff.
A unique search query could still land you on Wikipedia.
That is impossible, and therefore wrong (I'm wrong, please see below). To know if a search is unique, as in Google has never seen it before, Google must be able to decide whether a query it receives was seen before or not. Even if we assume Google needed only one bit for each message it has ever seen, and assuming just 15% new messages each day since its creation more than 20 years ago, it would need to store more than 2^1471 bits.
What could be true is that each day 15% of all searches are unique on that day.
Edit: I'm wrong. The 15% of completely unique messages per day are in regards to the messages per day, and not in regards to all messages it has ever seen, therefore exponential growth doesn't apply. To see that, assume Google just received one search query each day for 20 years but it was unique random gibberish, then Google could easily save that even though 100% of all messages per day are unique.
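For what it's worth, the arithmetic behind the retracted 2^1471 figure does check out under its own (wrong) model: a corpus compounding at 15% per day over 20 years grows by a factor of 1.15^7300:

```python
import math

# Exponent of the (retracted) estimate above: 15%-per-day compounding
# over 20 years of ~365 days.
days = 20 * 365                         # 7300
exponent_bits = days * math.log2(1.15)  # log2(1.15 ** 7300)
print(round(exponent_bits))             # 1472, i.e. more than 2^1471 bits
```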
(And they don’t even need to store all searches for all time for this, thanks to Bloom filters.)
Assume Google receives 1 trillion queries per year and has been around for 20 years. Using a Bloom filter you can achieve a 1% error rate with ~10 bits per item, so 20 trillion items need about 200 terabits, i.e. a roughly 25 terabyte Bloom filter, which would be more than sufficient to estimate the number of unique queries.
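A minimal sketch of that Bloom filter approach (toy parameters; a real deployment would size the bit array from the expected item count, and the salted-hash scheme here is just one convenient choice):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; ~10 bits/item with 7 hashes gives ~1% false positives."""

    def __init__(self, num_bits: int, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions by salting one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely never added; True may rarely be wrong.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Stream queries through, counting the ones never seen before.
seen = BloomFilter(num_bits=10_000)
new_count = 0
for q in ["array length", "pear", "array length"]:
    if not seen.might_contain(q):
        seen.add(q)
        new_count += 1
print(new_count)  # 2: the repeated query is not counted twice
```

At Google scale the same structure just gets a much bigger bit array; the per-query work stays a handful of hashes and bit probes.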
If you have a list of 20 trillion query strings, and each query string is on average < 100 bytes, you're looking at a three line MapReduce and < 1 PiB of disk to create a table which has the frequency of every query ever issued. Add a counter to your final reduce to count how often the # times seen is 1.
A bloom filter is the most appropriate data structure for this use-case. How is it overkill when it uses less space and is faster to query?
To find the fraction of queries each day which are new, you would want to add a second field to your aggregation (or just change the count), the first date the query was seen. After you get the first date each query was seen, sum up the total number of queries first seen on each date, compare it to the traffic for each date.
You could still hand the problem to a new hire (with the appropriate logs access), expect them to code up the MapReduce before lunch (or after if they need to read all the documentation), farm out the job to a few thousand workers, and expect to have the answer when you come back from lunch.
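The shape of that aggregation, compressed into plain Python (the log records here are invented for illustration; the real job would be a MapReduce over actual query logs):

```python
from collections import defaultdict

# Invented toy log of (date, query) records standing in for query logs.
log = [
    ("2019-01-01", "pear"),
    ("2019-01-01", "array length"),
    ("2019-01-02", "pear"),            # repeat: not new on day 2
    ("2019-01-02", "array length js"),
]

# "Reduce" 1: the first date each query was ever seen.
first_seen = {}
for date, query in sorted(log):        # sorted so the earliest date wins
    first_seen.setdefault(query, date)

# "Reduce" 2: per-day counts of first-ever queries and of total traffic.
new_per_day = defaultdict(int)
for date in first_seen.values():
    new_per_day[date] += 1
traffic_per_day = defaultdict(int)
for date, _ in log:
    traffic_per_day[date] += 1

for date in sorted(traffic_per_day):
    print(date, new_per_day[date] / traffic_per_day[date])
# 2019-01-01 1.0  (both queries brand new)
# 2019-01-02 0.5  (one of the day's two queries never seen before)
```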
I haven't thought this through, but take all the queries as they're made and create a bloom filter for every hour of searches. Depending when this process was started, an analytics group could then take a day of unique searches, and run them against this probabilistic history, and get a reasonable estimation with low error. Although the people who work on this sort of thing probably know it far better than I.
The real question, though, assuming the 15% is right, might be: do we care about those 15%? Are they typos that don't merge, are they semantically different, are they bots searching for dates or hashes, etc.?
Of course, Google knows better than to treat every search query literally. Slight deviations and synonyms work for the majority of people, even if we techies oppose them and look for alternative solutions (like DDG) that still treat our searches quite literally.
Tangential - but does anyone else feel that Google results are useless a lot of the time? If you search for something, you will get 100% SEO-optimized, shitty, ad-ridden blog/commercial pages giving surface-level info about what you searched for. I find it's pretty good for programming/IT topics, but for other topics it is horrible. Unless you are very specific with your searches, "good" resources don't really percolate to the top. There isn't nearly enough filtering of "trash".
There are 2 issues, I think.
Firstly, the SE-optimised spam, which has become very good at masquerading as genuine content.
Secondly, Google has dumbed search syntax down a bit, and often seems to outright ignore double quoted phrases, presumably thinking it knows better than I what I want.
As a dev, I do accept I may be an outlier though - with the incredible wealth of search history and location data that Google holds, it seems likely things have actually improved for typical users.
I suspect that the AI models powering the search results develop a sort of symbiotic relationship with the spam: if the user actually finds what they are looking for by clicking through an ad on an otherwise spammy page, everyone “wins”; the user found what they were looking for with minimum effort, Google got their ad revenue, and the spammy page got a little cut for generating content that best approximates the local minimum linking the user's keywords to actual intent...
Some of the sites that were hit, like Suite101.com, went offline. eHow traffic is still down well over 90%. ArticlesBase sold on Flippa for something like $10k. One of the few wins hiding in all the rubble was HubPages, but even they had to rebrand, split out sites, and merge into a company with a market cap of about $26 million ... and the CEO of HubPages is brilliant.
Even at IAC, they are suggesting that for some sites ad revenues won't be enough:
http://www.tearsheet.co/culture-and-talent/investopedia-laun... "As Investopedia charts its course as a media brand, it’s coming up against the roadblock all publishers eventually hit — the reality that display revenue alone won’t be enough. ... Siegel said he expects course revenue to exceed what’s generated from the site’s free content. While he wouldn’t say what the company’s annual revenue was, Siegel said it grew an average of around 30 percent for each of the last three years."
There are also other factors, paralleling the Panda update, that further diminish the quick-n-thin rehash publishing business model:
- Google's featured snippets & knowledge graph pulling content into the SERPs so there is no outbound click on many searches
- programmatic advertising redirecting advertiser ad spend away from content targeting to retargeting & other forms of behavioral targeting (an advertiser can use a URL as a custom audience for AdWords ad targeting even if that site does not carry any Google ads on it)
- mobile search results have a smaller screen space where if there is any commercial intent whatsoever the ads push the organic results below the fold
I’ve been using DuckDuckGo and have found I have this problem less. I don’t always find what I’m after on DDG; as of now I’d say Google is still better if you’re not sure exactly what the thing you’re looking for is called, but if you know the keywords you need, DDG is often better.
But for increasingly basic searches about a product I'm interested in, or a medication, or anything else non-complicated that would have gotten me a clean list of decent, non-paid results even 5 years ago, I'm now getting half a page of sponsored BS and then another half a page of 'created content' written by a bot or shyster explicitly to game Google's SEO.
Not only has Google lost almost all their good will (i.e. "Don't be evil"), but their products aren't even that good anymore - at least, no longer so much better than the alternatives that the difference in quality outweighs the negatives of using Google.
I would agree that programming search results tend to be quite good, but I think this is likely in large part because the average person attracted to programming both has a high IQ and has experience building some part of the web stack. Thus the sites that are quite manipulative in nature would have a hard time trying to fake it until they make it in such a vertical where people are hard to monetize and are very good at distinguishing real from fake. And even if a fake site started to rank for a bit it would quickly fall off as discerning users gave it negative engagement signals.
This is also perhaps part of the reason sites like Stack Overflow monetize indirectly with employment related ads targeted to high value candidates versus say a set of contextually targeted ads on a typical forum page or teeth whitening gizmo ads on the Facebook ad feed.
The lack of filtering of "trash" probably comes from a bunch of different areas
- I think there was a quote that people are most alike in their base instincts and most refined in areas where they are unique. Some of the most common queries are related to celebrity gossip and such. There are also flaws in human nature where inferior experiences win by exploiting those flaws. For example, try to buy flowers online and see how many layers of junk fees are pushed on top of the advertised upfront low price: shipping, handling, care, weekend delivery, holiday delivery, etc.
- Some efforts to filter trash by folding in end-user data may promote low-quality stuff that people believe in. A neutral, objective political report is less appealing than one which confirms a person's political biases. And in many areas people are less likely to share, or consider paying for, something neutral versus something slanted toward their worldview.
- As the barrier to entry on the web has increased, some of the companies that grew confident they had a dominant position in a market may have decided to buy out smaller players in the vertical and then degrade the user experience as real competition faded. There was a Facebook exec email mentioning they were buying Instagram to eliminate a competitor. Facebook's ad load is now much higher than it was when they were smaller. But the same sort of behavior is true in other verticals too: Expedia and Booking own most of the top travel portals.
There has also been a ton of collateral damage in filtering all the trash. So many quirky niche blogs & tiny ecommerce businesses were essentially scrubbed from the web between Panda, Penguin & other related algo updates.
Google doesn't make money from you finding what you're looking for. Google makes money from you searching for what you're looking for.
For example, suppose I do a google image search for "pear", because I want images of pears obviously. The first result is indeed a pear, good job google! Except the first search result just happens to come from Amazon, and also happens to be a pretty shitty thumbnail quality photograph (355x336). It's a pear alright, but why is this particular image of a pear first? Google didn't try to give me the best image of a pear, they tried to give me the pear image they thought most likely to induce a financial transaction. Or alternatively, google let itself get cheaply manipulated by Amazon's SEO. Neither is a good look.
A much better pear image, 3758x3336 from wikipedia, is further down the search results. So it's not like google was unable to find good pictures of pears. And a non-image search for "pear" returns the wikipedia page first, so it's not like google failed to notice the relevancy of the wikipedia article about pears. Yet the shitty amazon thumbnail of a pear shows up higher in the image search results than a high-resolution photograph of a pear from wikipedia.
The user data helps/ed Google create the superior UX, as you say. The reach is what makes Google & FB valuable to advertisers. A search engine with 0.1% of Google's user volume cannot charge advertisers enough to earn 0.1% of Google's ad revenue. Returns to scale/reach/market-share are very substantial in online advertising.
I'm glad we're talking though. Those tech giants are too powerful.
Ultimately, the old antitrust toolkit is near useless today for dealing with tech monopolies. It's not obvious what "break up Google" even means. There are strong network effects and other returns to scale. It's a zero-marginal-cost business, which was rare enough in the past that economists largely ignored it.
We need fresh thinking, a new vocabulary, new tools, but we do need to deal with it.
* an Office suite / enterprise company (Google Cloud + Docs + Gmail + Business)
* a phone company (Android)
* a search company (Google Search + Advertisement)
* and a media company (Google Play Movies, Music, Books and YouTube)
The names would probably become different in time, but you get the gist.
Amazon and Microsoft could be broken up much the same way, in neat categorical 'silos'. Facebook should be trisected into Facebook, WhatsApp and Instagram again.
I have no idea how you would break Apple up without utterly destroying their core principle, vertical integration. There is no way to do what Apple does with MacBooks or iPhones if they don't control the entire stack.
I'm not saying they shouldn't be, I just see no way.
(1) This doesn't actually reduce market share, since each of these are basically different market categories.
(2) Almost all the revenue is from search. That company is the revenue generating arm for the other ones.
In contrast to most people here, I think breaking up Amazon is far more important than breaking up Facebook, Microsoft, Apple, and many other tech companies. Only Google is as bad.
Before Google Maps we had a few online map services, and they were terrible. Google Maps redefined what it means to have free access to web-based interactive global maps; it changed how people find things, and it was all paid for by the ad business. Later on some monetization efforts were made for it and competitors started to appear, mostly trying to catch up and copy what Google Maps did, but without the huge cash infusion of the ad business none of this would have happened.
A decade later, people take these things for granted and just want to split services up. I guess it makes sense from their point of view, but to me it's not clear how to do that while still allowing for the type of creativity and speed of development that let things like Google Maps appear, because I'm afraid "the next big thing" that could redefine our lives (and improve them) would be slowed down or simply made non-feasible.
This is not true. MapQuest revolutionized things almost 10 years earlier than Google Maps. Google search is what allowed Google Maps to overtake MapQuest. Also, Android providing real-time traffic data of all their users gave them the winning formula.
As was, like, an app that could reroute live instead of relying on pre-printed paper instructions.
Gmaps had both before MapQuest.
I've seen mid-roll ads on songs on YouTube.
Hosting costs & delivery costs (per byte) drop every year. Every year their compression gets better. Every year their ad revenues go up. YouTube ad revenues have been growing at something like 30% a year for many years.
I think one reason Google doesn't break out YouTube profitability is that as soon as they show it is profitable, some of their biggest partners (like music labels) will use those profits as leverage to readjust revshares.
Also, if Google claims YouTube is not profitable, they can be painted as the victim over the extremist or hate content they host, whereas if they showed they were making a couple billion a year in profits, these narratives would be significantly less effective.
Core search ad prices haven't really been falling, yet Google's blended click prices keep falling about 20% a year while ad click volume grows about 60% a year. This is driven primarily by increasing watch time and increasing ad load on YouTube. Google blends video ad views in with their "clicks" count.
You are right though, it doesn't deal with the dominance of search directly. My hope is a complementary effect to the above also happens: Google no longer gets gobs of personal data from its other services, allowing other search engines to approach its efficacy.
As is clear I'm not really a fan of direct intervention in a single market, I see it as more of a problem when these giants muscle their way and control more and more markets, creating a vicious feedback loop.
I think it's instructive to look at the rest of the market. How is Mozilla funded? Basically a single gigantic contract with Google. Even Apple accepts payment from Google to become the default, and it's not cheap: https://fortune.com/2018/09/29/google-apple-safari-search-en... The same logic applies to pretty much anything Alphabet spins off: there's little difference between ownership and those contracts.
About the only competition this setup produces is the ability for Mozilla to walk away to a competitor bid, which they did for like a year before bailing out at the first opportunity. There's a huge incumbency bias in these contracts. The first parallel that comes to mind is employer provided health insurance. Everyone gets to bid, but the incumbent knows the claims history far better than the competition and we'd only expect them to lose bids to companies overly optimistic about that history. Google knows how valuable various traffic sources are, but their competitors have to guess, and only when their guess is higher than Google's does it pay off. Does anyone think Yahoo winning Firefox was a good deal? I haven't seen any analysis to support that.
> My hope is a complementary effect to the above also happens: Google no longer gets gobs of personal data from its other services, allowing other search engines to approach its efficacy.
Wouldn't the most profitable thing for these broken-up companies be to sell their slice of the personal-data pie to as many parties as possible? This seems like a net loss for privacy. How much extra would it be worth to set up an exclusive arrangement?
I'm still not sure how this would work on Apple though, since their main differentiator is their design sensibilities and integration rather than their platform monopolies.
I guess iMessage and the App Store do rely on monopoly rents, but I can't think of any way to sever those links without making the iOS platform less secure.
The biggest blow to Google wouldn't be to break it up into lots of small companies; you just need to separate the advertising business from everything else and you've effectively neutered the monopoly. Google's genius isn't in hiring the best engineers to provide a ton of services, it's in convincing people that they're not an advertising company, and that is where Facebook has been falling out of favor recently (I'm guessing that's why they bought Instagram, and why Google bought YouTube).
This is a supposition. While it perhaps seems to make sense, seems true, “must be true”, that doesn’t mean it is true!
Unless you worked on search quality at google you really aren’t in a position to know if, say, google cloud, or android provides useful signals to search (outside of the signals they’d collect anyways if they were different companies).
One thing people are overlooking is just how crazily effective AdWords is. It works for the advertisers, and it earns Google 70+% of revenue, confirmed via SEC filings, which do break that out. Go play with creating an AdWords campaign and try to infer just how much data Google really needs to deliver those ads; it’s less than you’d think.
In short: this overall move is more wishful thinking than solidly reasoned. Surveying the field of streaming video, given the amount of studio-driven consolidation, are there really tons of competitors being held down who will spring up? I am skeptical.
But my thinking was that if you simply cut off advertising all the products still have massive marketshares and could lean on each other, as long as some succeed. Not to mention investors probably willing to prop up such a massive aggregate marketshare (one only has to look at Uber).
If you 'silo' them, success of one division of previously-Google won't lead to all of them dominating.
Who is going to sell those ‘neat silos’ cheap advertising to survive with no other business model? Google?
The cloud services (Enterprise apps, hosting, etc) don't need it.
I think we'd probably see a worse Gmail (worse ads or aggressive upsells).
Granted most of your examples wouldn't be, except for search, but it still seems more interesting to me to just have a bunch of mini googles made from cleaving teams. Certainly that would make for some crazier competition.
Even though a lot of these corporations build offices, hire non-Americans, and pay tons of foreign taxes in countries in which they do business, the main executives and talent still live in the US, the IP is developed here, and the majority of profits end up back in the home country.
It's better for everyone who actually matters - shareholders, intel agencies, government officials, associated businesses, etc - that these companies remain large and globally dominant, even if it screws over US citizens by having to pay the monopoly taxes and suffer the privacy invasions. We're an insignificant sacrifice in the decision-makers' minds.
What do folks even mean by "Google's index"?? Google results combine tons of signals, including personal histories for each user. Sharing metadata for the top billion URLs wouldn't cover half the functionality, or make a competitive engine. And on the other hand, there may not be a single other organization in the world prepared to manage a replica of the entire data plane that impacts search. The proposal is somewhere between underspecified and nonsense.
I hypothesized once with an ex-Microsoft higher-up that it probably took $10B to launch Bing. He said I was almost exactly on the nose.
Also this is a ridiculous thing to ask for. How much money do you think Google pays for the bandwidth to crawl the web? How much do you think it costs to run the machines that create indexes out of that? How do you value the IP involved in the process?
Google should give away the fruits of that labor for free, plus invest in a reasonable API to download that index? Plus the bandwidth of sharing that index with third parties? It's probably not even feasible except by putting disks or tapes on multiple semis to send to clients. The index is 100 petabytes according to . With dual fiber lines, and no latency from the mind-bending number of API calls, it would take 12.6 YEARS to download a single snapshot.
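The back-of-envelope math checks out if "dual fiber lines" means two 1 Gbps links running flat out - that rate is my assumption, since the comment doesn't specify it:

```python
# Rough transfer-time estimate for a 100 PB index snapshot.
# Assumption (not stated in the comment): dual 1 Gbps fiber links,
# fully saturated, with zero protocol overhead.
index_bytes = 100e15           # 100 petabytes
throughput_bps = 2 * 1e9       # 2 x 1 Gbps

seconds = index_bytes * 8 / throughput_bps
years = seconds / (365.25 * 24 * 3600)
print(f"{years:.1f} years")    # ~12.7 years for one snapshot
```

At that rate you'd still be downloading the first snapshot long after the index had been rewritten many times over.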
'Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."'
'Be kind. Don't be snarky. Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.'
This is what makes me wonder why we don't have a LOT of competing search engines. Perhaps I'm vastly underestimating the technology and difficulty (I could well be - it's not my domain), but surely it can't be THAT hard to spawn Google-like weighted crawl-based search results?
It's a long-since solved problem - heck, PageRank's first iteration recently came out of patent protection - it could just be copy-pasted. Why aren't all the big companies Doing Search?
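For what it's worth, the original PageRank iteration really is short enough to copy-paste. A minimal power-iteration sketch (the three-page link graph is a toy of my own invention, not real data):

```python
# Minimal PageRank by power iteration, in the spirit of the 1998 algorithm.
# links maps each page to the list of pages it links out to.
def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:         # split this page's rank across its out-links
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
# "c" is linked by both "a" and "b", so it ends up ranked highest
```

Of course, as other comments here point out, the algorithm was never the hard part - the crawl, the infrastructure, and the spam arms race are.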
The spammers won. Google gave up and settled for "we like the right kind of spam—the kind that took a little effort, and makes us money".
Searching for "north face glaciation" did help: the first page of search results had one entry on the topic I was actually searching for!
Maybe they should have a "I'm not buying anything" flag!
Many of the top search queries are navigational searches for brands.
So if tons of people are searching for your brand, and a potentially related query contains the brand term plus some other stuff, they'll likely return at least a result or two from the core brand just in case that's what you were looking for.
Google pays some large number of people to do searches and grade the various results they get, to see if the answers are good, which then feeds back into the ML.
Heck, according to this article, Google has been paying people to evaluate its search results since 2004.
Google for general search.
DuckDuckGo for general search if you want something a bit more private, but not extreme enough to run your own spiders.
Bing mostly for porn search - not being snarky, some people do consider it to have better results.
It's easy to gather the necessary data, but it's hard to know which parts of that data are the most relevant for finding good content and avoiding bad content. Is it more relevant if key words show up in links or titles than in the body of the text? If so, SEO spam sites will include a bunch of keywords in links and titles. Is it more relevant if keywords show up in the first 200 visible words of the page? If so, spam pages will make tons of pages with relevant keywords at the top.
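A toy illustration of why fixed, guessable signal weights get gamed - the fields and weights here are made up for the example, not any real engine's formula:

```python
# Toy static ranker with hand-tuned field weights (illustrative only).
# Once the weights are fixed and known, spammers just stuff keywords
# into whichever field is weighted highest.
FIELD_WEIGHTS = {"title": 3.0, "link_text": 2.0, "body": 1.0}

def score(query_terms, page):
    s = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = page.get(field, "").lower().split()
        s += weight * sum(words.count(t) for t in query_terms)
    return s

honest = {"title": "array length",
          "body": "use len(a) to get the length of an array"}
spam = {"title": "array length array length array length",
        "body": "buy pills"}
# On the query ["array", "length"], the spam page outscores the honest one
```

Any static formula like this is an open invitation; the real work is continually changing the formula faster than spammers can reverse-engineer it.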
The hard part about building a search engine isn't indexing the internet, it's adapting to spam. Spammers are continually adapting to changes in the algorithm, so the algorithm needs to adapt as well. And the more popular your search engine is, the more money you make and the more able you are to adapt to spam (and the more spammers focus on your engine).
So, the problem isn't that Google has a better index (though I'm sure it does), the problem is that nobody else has the will to spend the money necessary to tune the search algorithm to stay on top of spammers. When Google started, companies didn't care as much about improving their index and instead focused on building their other content (Yahoo, MSN, etc). Google saw the value of search and got a lead on everyone else in terms of curating results, and now they have the momentum to stay in front and have shifted to building content to improve monetization. Nobody else has the monetization network for search that Google has, so they'll continue having the problem that other companies had (Microsoft wants to point you to their other services, DuckDuckGo is limited by their commitment to privacy, etc).
In short, Google wins because:
- it was better when it mattered
- it makes money directly from search
- its other services improve their ability to understand what users want, which improves search quality and ad relevance
You can't make a better algorithm by being clever, you make a better algorithm by having better data, and that's hard to come by these days. The only way I can think of a competitor stepping in is if they target an underserved demographic and focus data collection and monetization there, and DuckDuckGo is close by targeting privacy conscious power users.
The irony there is that DuckDuckGo can't collect much of that data precisely because of their privacy focus.
You didn't just hit the nail on the head; you drove it all the way in with a single blow. Bravo.
Outside of ad revenue, search has always been seen as something of a "charity" effort for the internet. It's "boring" infrastructure work that can be critically useful but doesn't really make money directly on its own. No one wants to pay a "search toll" and there's no government agency in the world that the internet would trust as a neutral index to run it as actual tax-basis infrastructure.
This is far more valuable than the general page rank algorithms that were initially developed and have already been duplicated many times in academia and business.
Google custom tailors results for each and every machine. Even if you're not signed in, Google uses your browser fingerprint, the OS it's reporting and location/IP data to custom fit results. There is no "stock" google result.
This is something DuckDuckGo et al. can't do if they want to keep their privacy model. DDG does offer location-specific searches, which can be helpful.
Consider DuckDuckGo, which sells itself on privacy, but after more than a decade has only 0.18% market share. Without the power to make it the default in an OS or browser, you'd have to have a really strong value proposition to convince people to switch.
If a government was serious about getting more players in the search industry, they would force Google (and all other players) to make this data public.
Simply say "All user-behaviour data used to improve the service must be freely published".
Make the law apply to any web service with more than 20 million users globally so small businesses aren't burdened.
If the data cannot be published for privacy reasons, the private parts must be separated out and not used by Google or its competitors.
> the private parts must be separated
This would literally require a legal interpretation of every document on the net, to determine whether each of them is private or not.
As a user that notices the impact of this data: please no, thanks though.
Have you ever visited youtube's home page in incognito mode? It's... bad. Really bad. Not allowing any company to use this (obviously very private) information in ranking would simply make their products suck, horribly, compared to today.
Do you like the personalized recommendations because of channel subscriptions?
I always get the "anonymous default" home page with YouTube and don't care. The home page is just a wasted load before I can start typing in the search bar. As a bonus, staying incognito means all the videos on the right-side panel are related to the current video. Not related to a music video I have playing in another tab.
As someone who was operationally responsible for a search index (formerly VP Ops at Blekko), the kinds of things crooks tried to do were pretty instructive about how they use search to advance their efforts.
I think we've reached an equilibrium state on this that has significantly degraded the educational quality of search engine results.
The total garbage SEO spam we used to get is gone, which is nice, but what it's been replaced with is technically relevant but mostly manipulative advertising. Product searches will basically give you a bunch of no-name blogs who are almost definitely paid off by one vendor or another.
Even actual inquiries are inundated with search results that do answer the question, but in an extremely cursory and incomplete way. Or, in the case of recipes, Google seems to prioritize results that give you long, meandering narratives before they actually get to the recipe. It has some very weird ideas about what people actually want when they search.
One of the most annoying things is how impossible it is to actually find the website of a local business, especially a restaurant, by Googling. Your hits are always Google's own cobbled-together dossier on the restaurant first, then some combination of Yelp, Grubhub, Postmates, AllMenus, etc. pages. If the restaurant has a website, you can't tell; it's probably way at the bottom or on the second page of results.
In the past it was a handful of very decent results amidst a sea of total garbage SEO spam. Now it's a sea of mediocre content-farm stuff, and it ranges from difficult to impossible to actually dig into detail on things anymore. The old spam we could at least dismiss as crap within a fraction of a second of seeing it. The new spam you have to actually read most of before you realize it doesn't have what you're looking for.
> 2) 20 years of experience fighting SEO spam.
That's probably a key issue here though. Providing an API potentially makes it easier for spammers to identify ways to boost their content in a well automated manner.
How so? Unless you give reasoning for the scores, or provide live updates, etc., just putting an API on search wouldn't change much - you can APIfy search now; there are multiple services offering it. Granted, at some point it gets expensive, but for SEO research you're probably not running a million queries.
As far as I remember, Google is actually shrinking its index in terms of number of indexed websites, because 90% of the internet is irrelevant for the majority of searches. Basically "quality over quantity", if you can say that.
This is even more depressing. Google was such a wonderful tool for us nerds because we could finally find those usenet posts, personal blogs, tech mail lists, etc. of all the esoteric subjects that had been hard to find previously. Before Google, you'd use lists of curated links (e.g. Yahoo) for a given topic that had been traded back and forth between various sites and other interested netizens.
It's apparent that Google is becoming worse and worse for these types of searches, while it concentrates on more popular queries like "When is the next <my show> on" or "What is the current sports-ball score" or "How big are Kim Kardashian's boobs".
Just like Craig of Craigslist recently came out with an article saying the internet has actually made the news media worse, not better for informing citizens - something he did not predict correctly - it's apparent that Google is pushing us in the same negative direction in the ability to find quality information on non-consumer knowledge.
But that's where differentiation occurs. Every search engine will get short tail results correct. We go back to Google because it also performs with the weird queries.
I agree that algorithmic superiority will probably perpetuate Google's dominance. But making its index public is (a) legally precedented, (b) conceptually simple and (c) a small step in the right direction.
On the other hand, if I have a health-related search, I run it through Google. DDG has the proper content; it's just that it prioritizes the blog spam. That's an algorithm problem.
Relieving the former, as the author's proposal would do, makes DDG more competitive. As a second-order effect, it would also let DDG prioritize resources toward the second problem, making them more competitive still.
Bing does have the reputation of being better for NSFW searches than Google, so I guess that it's normal to have more NSFW false positives as well.
Same reason people won't buy electric cars with 100 mile ranges, even if they very rarely travel more than 100 miles.
Having a full text index on that is more involved but hardly impossible. You're completely right that it's not at all Google's secret sauce. Bing has clearly indexed much more than that, plus invested a ton in actually returning good results from their index. And still nearly nobody cares. It's just not easy to make a better Google, and the people most likely to figure out how to do that already work there.
I'd actually advocate for making public an anonymized list of actual search queries.
Domain-specific search engines could evolve based on the demand of what has already been searched for.
It depends which sense of "better" you mean. It's nearly trivial to make an ethically superior search engine by just not building the spyware bits of Google.
It's difficult to make a search engine that's "better" along the dimensions of speed, profitability, etc.
Some argue (not necessarily me) that Google isn't necessarily purely optimizing for quality using that 20 year click-and-search log, that they're accepting some inefficiency by biasing for political (left-leaning) gain or "censorship by obscurity". If competitors could more easily build alternatives, which, say, didn't have those biases, then arguably that'd put more competitive pressure on Google to not use their monopoly for bad stuff.
I also happen to think that is the search engine I would prefer. I think I could build that pretty quickly if I had the API access.