Zooming out, the language field breaks into several subfields:
- A large group of Chomsky followers in academia, who are all about logical rules but show very little algorithmic applicability, or even interest in it.
- A large and well-funded group of ML practitioners, with a lot of algorithmic applicability, but whose arguably very shallow model of language fails in cases like attribution. Neural networks might yet show improvement, but apparently didn't in this case.
- A small and poorly funded group of "comp ling" researchers, attempting to create formalisms (e.g. HPSG) that are still machine-verifiable, and even generative. My girlfriend is doing a PhD in this area, in particular dealing with modeling WH questions, so I get some glimpse into it; it's a pity the field is not seeing more interest (and funding).
If you argue this is bad behavior: Maybe we need a web query which really only takes the query literally. Putting the query in quotes will not quite have this effect for Google. Maybe some other syntax?
In this very specific case I don't buy it. Sure, it probably applies for other queries, but if you approach a salesperson and ask him for "shirts without stripes" it's pretty clear what you want, and he won't bring you any piece with stripes on it.
The only difference is that those are all physical properties of a shirt while stripes is a type of pattern.
shirt without buttons - pretty much a fail.
shirt without red button - as already expected, shirts with red buttons
* tie without paisley
* tie not paisley
* non-paisley ties
* ties that aren't paisley
* ties other than paisley
You guessed it, in each case, at least half of the results are paisley ties. The only way to actually get what you want -- the set described by X, minus the set described by Y -- is to use the exclusion operator in the search, "ties -paisley".
This is great, and makes intuitive sense to somebody with multiple computer science degrees. But not only is it hard to explain to an outsider, it's actually quite hard to get them to think in a way that accommodates this capability, that is, in terms of set theory.
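To make the set-theory reading concrete, here is a minimal sketch of the exclusion operator as a set difference over a toy inverted index. The catalogue, the `STOP_WORDS` list and the function names are all hypothetical, and dropping "without" as a stop word is only one plausible way the failure could arise, not a claim about how any real search engine works.

```python
from collections import defaultdict

# Hypothetical toy catalogue: product id -> description text.
CATALOGUE = {
    1: "blue paisley silk tie",
    2: "plain navy tie",
    3: "red striped tie",
    4: "paisley pocket square",
}

# Assumption: a keyword engine that silently drops these words.
STOP_WORDS = {"without", "not", "no"}


def build_index(docs):
    """Map each lowercase word to the set of ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(query, index):
    """Intersect the plain terms (X), then subtract every '-term' (Y)."""
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    include = [t for t in terms if not t.startswith("-")]
    exclude = [t[1:] for t in terms if t.startswith("-")]
    if not include:
        return set()
    # set.intersection returns a new set, so the index is never mutated.
    results = set.intersection(*(index.get(t, set()) for t in include))
    for term in exclude:
        results -= index.get(term, set())
    return results


INDEX = build_index(CATALOGUE)
print(sorted(search("tie -paisley", INDEX)))         # ties minus paisley items
print(sorted(search("tie without paisley", INDEX)))  # "without" is dropped
```

In this toy setup "tie -paisley" returns the two non-paisley ties, while "tie without paisley" degrades to "tie paisley" and returns exactly the item the user did not want.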
I have the same reservations about Google as anyone, but rewriting history is never the right move. Moving beyond text matching was what made search truly useful.
> Moving beyond text matching was
> what made search truly useful.
What I don't like is to search for the band "Chrisma" and get results for "fruitcake sale!" because Google corrected my spelling to "Christmas", decided to look for related concepts, and then boost whichever result is the most mercantile.
Yes, they have. Read what I said again, I don't dispute this. What Google does now is an improvement over text matchers. I never claimed that what they do now is better than Google circa 2000 (though I don't care to register an opinion either way on that).
Whatever they've done since, their product remains better than text matchers. Mercantile search is better than terrible search.
See how that works? That's not really what's going on. Sure, G. is incentivized to include pages quickly, but they are also incentivized to produce them accurately, and as the above poster indicates, this is quite a hard problem to solve generally.
A is also incentivized to sell items.
In many cases different algorithms will lead to quantifiably different results. The algorithm changes that work better for the measurement set will be kept and those changes which don't will be discarded. And both A and G do that within different constraints.
Pointing out the obvious: Google is an advertising company. If the cost of producing an accurate result outweighs the advertising income on a given term, there is no incentive for Google to produce better results.
Having a search engine that people go to whenever they want to search for things is incredibly valuable, because they will come to you when they want to buy things and you can sell ads. But unless you consistently give the best results for all queries, people will go to whoever does. It's worth investing strongly in all queries, not just highly monetizable ones.
(Disclosure: I work for Google, speaking only for myself)
I was about to say there are no such queries but then I remembered having to type a captcha for seemingly automated queries. The captcha page has no results on it obviously. This is because automated queries do not produce advertising revenue. You have to buy them.
I've typed an insane number of queries since the beginning. A decade ago I used to be able to find truly exotic articles, I could find every obscure blog posting on every blog with 3 readers and I was pretty sure google delivered all of it. The tiny communities that came with the super niche topics rarely produced a link I didn't already find. If they did, it was new and I didn't google for a while.
Today google feels like it is a pre-ordered list from which it removes the least matching articles. Only if the match is truly shit will it be moved slightly down the page. The most convincing example of this is typing first name + last name queries in images and getting celebrities who only have the first or the last name.
People won't go; it has to get much worse before they do.
With humans and pets a good slap over the head or a firm NO! will usually do the trick.
There are very clearly many queries with no advertising revenue, because there are many queries that show no ads. Trying some searches off the top of my head that I expected wouldn't have ads, I don't get any ads on [cabbage], [who is the president], [3+5], or [why is the sky blue]. On the other hand, if I search for a highly commercial query like [mesothelioma] the first four results are ads.
> A decade ago I used to be able to find truly exotic articles, I could find every obscure blog posting on every blog with 3 readers
My model of what happened is that SEO got a lot better. When Google first came out it was amazing because Page Rank was able to identify implicit ranking information in pages. Once it's valuable to have lots of backlinks, though, this gets heavily gamed. Staying ahead of efforts to game the algorithm is really hard, and I think a lot of times people's experience of a better search engine comes from a time when SEO was much less sophisticated.
> The most convincing example of this is typing first name + last name queries in images and getting celebrities who only have the first or the last name.
This hasn't been my experience, so I tried an image search for [tom cruise], curious if I would get other Toms. The first 45 responses were all of the celebrity, and image 46 was of Glen Powell in https://helenair.com/people/tom-cruise-helps-glen-powell-lea... which is a different kind of mistake. Do you remember what query you were seeing this on?
I believe what he means is that searching for first name + last name of someone who isn’t a celebrity gets you celebrities who match either the first name or last name.
Searching for Tim Neeson gets you a wall of photos of Liam Neeson:
Searching for Tim Cruise blankets you with pictures of Tom Cruise, but it at least says “Showing results for tom cruise“ so you know it did an autocorrect. When I tried other first names + Cruise, the effect is less pronounced than with the Neeson example. Maybe it’s because cruise is a more common name as well as an English word.
You don't have to bother creating anything new unless you have something to sell and are willing to invest (big).
Facebook is actually a pretty pathetic implementation where we can still find content created by normal people. If people made traditional websites instead of facebook groups and facebook pages NO ONE would be able to find it.
We've witnessed the great obliteration of what was once a nice place and now we have to hear google was not to blame?? The death by a thousand cuts is actually well documented.
We tell you what your site must look like or we'll gut it:
Google Penguin is a codename for a Google algorithm update that was first announced on April 24, 2012. The update was aimed at decreasing search engine rankings of websites that violate Google's Webmaster Guidelines.
There, this is what the entire internet must look like. We went from indexing to engineering here.
If you want to recognize and reward trustworthy contributors, you might remove this attribute from links posted by members or users who have consistently made high-quality contributions over time. Read more about avoiding comment spam.
Before this those elaborately contributing got actual credit for it. Do you think you got a choice in it? Google clearly demands you ban credit for comments. OR ELSE!
Woah association! How did we go from linking-to to association? It was important enough for readers but be careful to hide it from google. Such little unimportant websites simply shouldn't exist in our index. We command you to help keep our index clean of such filth!
Then the magical: We won't actually tell you what is wrong with your website! Ferengi Rule of Acquisition 39 "Don't tell customers more than they need to know." Get a budget and hire someone to do SEO. Deal with it, we don't care. No, you don't have any feedback.
Queries without ads do produce revenue. They are an essential part of the formula.
Think of people standing around in bars. We can't argue that just standing there doesn't produce revenue.
The flowers on the table in a restaurant produce revenue.
Free parking produces revenue.
If queries without ads didn't produce revenue they wouldn't exist. Often enough it doesn't even take an extra query; the ads will sit behind the links.
No, it would predict that a query that has no advertising income will have poor results. You can determine on your own whether that is the case.
But it isn't necessary to formalize any of it. At the current level of sophistication, our informal common ground of words like "understanding" suffices for a discussion. It's obvious Google Translate doesn't resemble human language processing.
Yes, the English grammatical rules make it unambiguous where it belongs. This is solvable.
Seems like a matter for logical inference. At which point it becomes fairly easy to find shirts made from material where that material's pattern is not stripes.
But yes, no AI I have seen works reliably on even basic queries like this.
Most likely, common sense reasoning will be required to get full natural language processing, since human communication relies extremely often on such reasoning. But building a knowledge base of common sense facts will be one of the hardest challenges ever attempted in machine learning/artificial intelligence.
Couldn't you just parse the sentence into a dependency tree and look at the relationships to figure that out? CoreNLP got both of your examples right (try it at http://nlp.stanford.edu:8080/corenlp/process, can't link the result directly).
To be useful, Google must solve natural language problems. You can't solve natural language problems by using formal language in some bits of the problem, at least not until we have a full Chomsky-style understanding of the whole of human language.
Well, one could argue, that it belongs exactly where anyone entering the query put it. Before "stripes".
The problem is often, that search engines try to be too clever, while not offering any kind of switch "exactly those words in this order" and that is just a bad user interface.
If it just disregards the word without, well, that's pretty bad.
I would not be surprised if millions of dollars are being lost per year because of this substandard query result.
“Shirt -stripes” is unambiguous to a system, yet the first result on Amazon(.ca) is a striped shirt, and the 3rd is sweatpants.
That's the sort of thing I'd expect Amazon to be doing?
“Yes, I would like an unstriped dress shirt please”
“How about this striped shirt?”
“No thank you, I would like an unstriped dress shirt please”
“I have some lovely jogging pants”
“Ok, I need to be clear here, I would like a dress shirt that has no stripes”
“Can I interest you in a white undershirt? People who buy dress shirts usually buy undershirts”
T: I think pink would look good on you, and it's very fashionable right now.
You: Just bring me some yellow shirts to try.
T: Oh, I got these, and brought this pink one anyway; try it!
But, of course Google isn't making fashion suggestions. But then, ... the tailor might also be just trying to shift excess stock or be on a bonus for selling that particular high-cost shirt.
They can certainly also bring some stock to shift, or offer suggestions while I’m trying something on, but if they aren’t listening when I make a direct request or when I clearly say no, then they aren’t really there for me, their customer.
I’m an odd one in that I already know specifically what I want to buy before I search for it, but I’m certainly not the only one (and I think everyone has done that at least once).
I mean, context is key, right? You're on Amazon and your first search term is "shirts". Unless there is a band called "shirts without stripes", the user wants shirts. The rest of the query is probably some filter of that product. You know shirts sometimes have stripes. It's not a one-size-fits-all algorithm, but it's simple enough that the user should end up with the results they wanted.
> "no evidence of cancer" and "evidence of no cancer" are very different things.
Why is it not as simple as "no belongs to the word it precedes"? Like the unary operator ! (not) in typical computer languages.
- no textbook evidence of cancer
Statements have structure, parsing them with simple rules like this is akin to parsing C++ with regular expressions.
You'd also have quite a bit of fun trying to parse the phrase "no means no" or other usages where "no" is being used as a noun... And for bonus points, folks talk to search engines in broken english all the time so "shirts no single striped" is a totally reasonable query to submit to a server and expect to be parse-able.
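As a rough illustration of why the "unary operator" rule falls short, here is a sketch that applies it literally: "no" negates whichever word comes next. The function name is made up for this example, and whitespace tokenisation is an assumption that real queries would not honour.

```python
def naive_negated_word(query):
    """Return the word right after the first 'no', or None if there is none.

    This is the literal 'no belongs to the word it precedes' rule.
    """
    tokens = query.lower().split()
    # Skip the last token: a trailing 'no' has nothing after it to negate.
    for i, tok in enumerate(tokens[:-1]):
        if tok == "no":
            return tokens[i + 1]
    return None


for q in [
    "no evidence of cancer",
    "evidence of no cancer",
    "no textbook evidence of cancer",
    "no means no",
]:
    print(f"{q!r} -> negates {naive_negated_word(q)!r}")
```

The first two queries come out as one would hope ("evidence" and "cancer"), but the rule then claims that "textbook" is the thing being negated in the third and "means" in the fourth, which is presumably not what either writer intends. Getting the scope right needs the structure of the phrase, not just word adjacency.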
I think the basic issue is that people just don't respect machines and want to minimize the amount of effort spent on communicating with them - I don't say "Alexa please bring up songs by Death Grips if you don't mind" I shout "Alexa! Play! Death Grips!" and then yell at it when it misunderstands.
Does she want ice cream? Answer: No, she doesn't. I added a not, so she's reversing the answer as Japanese people do.
The number of times I've been dumbstruck by this is larger than I'd like to admit, and I'm a coder.
Q: "Do you mind if I sit here?"
A1: "Not at all!"
Both are valid answers and mean the same thing, the person asking is welcome to sit there. This has always amused me.
"Not at all" == "I do Not [object to you sitting here] at all"
A3: “Sure I do, last time you sat next to me you wouldn’t shut up.”
There have been some lengthy discussions on HN about vertical search and how Google doesn't always buy up a small company; they litigate.
I'd be curious to see how many sentences with attribution problems actually have other structural issues. If I want to write clearly and without ambiguity, I rewrite sentences that have these problems. Why wouldn't I do the same for search queries?
The bad results are because they're not positively indexing the absence of the feature by deeply analyzing the images or products beyond the descriptions. "Shirt with stripes" yields almost exclusively striped shirts. Exclude those results from all "shirts" and there are still a lot of striped shirts that the search algorithm doesn't know enough to exclude.
There is no ambiguity in "not stripes", you can't invert it and write it in the positive form of what you want; the neatest way to describe the category of what you want to browse is "things which are not stripey".
Particular personal bugbear is car websites where you can filter in "petrol engine" or "diesel engine", but there is no support for negative filtering, so you can't choose "not LPG". In so many search-and-filter options you can't exclude your dealbreakers, and it's much more likely that I have a single dealbreaker which rejects a choice overriding all other considerations, than that I have a single dealmaker which makes a choice overriding all else.
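For what it's worth, negative filtering is not much code. Here is a minimal sketch of a filter that takes both an include list and a dealbreaker list; the `Car` type, the field names and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Car:
    model: str
    fuel: str


def filter_cars(cars, include_fuels=None, exclude_fuels=None):
    """Keep cars whose fuel is in include_fuels (if given) and not in exclude_fuels."""
    include = set(include_fuels) if include_fuels else None
    exclude = set(exclude_fuels or ())
    return [
        car
        for car in cars
        if (include is None or car.fuel in include) and car.fuel not in exclude
    ]


# Hypothetical listings.
CARS = [
    Car("A", "petrol"),
    Car("B", "diesel"),
    Car("C", "lpg"),
    Car("D", "electric"),
]

# "Anything but LPG": one dealbreaker, no positive constraint.
print([car.model for car in filter_cars(CARS, exclude_fuels={"lpg"})])
```

The exclusion is checked after the inclusion, so a dealbreaker always wins even if the same value was also ticked as wanted.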
What do you call a skyscraper like that if you want to refer to it? They exist, but you can't find them using that search term on Google.
Windowless is a superset of glassless.
1. No glass used in the exterior construction at all -> implying no windows
2. No glass used in the exterior construction at all -> implying the windows are made out of something other than glass
3. A skyscraper in which glass is not a prominent architectural feature, but the building does contain features like windows and doors that contain glass. (This comment)
That's the full glass buildings returned in your windowless query.
Ok, so imagine one online retailer follows your advice and expect the users to write clear and unambiguous queries, while another retailer puts extra effort into attribution.
Which one will make more money?
A sales gimmick furniture stores would use in the past was to offer customers a free gallon of ice cream for visiting the store. The value was to the store offering the promotion: shoppers would be drawn to the "free" gift, but on receiving the ice cream (too much to eat directly) would then have to go home to put the dessert in the freezer, and have less time to comparison shop at competing merchants' stores. Given limited shopping time (usually a weekend activity), this is an effective resource exhaustion attack.
Similar tricks to tie up time, patience, or cognitive reserve are common in sales. For a dominant vendor, tweaking the hassle factor of a site so long as defection rates are low could well be a net positive, if it makes the likelihood of a visitor going to other sites lower.
Still I insist that business serving up more relevant search results for loosely phrased queries will make more money than the one relying on the user to formulate perfect queries.
That's my story and I'm sticking to it.
See Scott Adams, "Confusopoly" (2011): https://www.scottadamssays.com/2011/12/07/online-confusopoly...
I've touched on this: https://old.reddit.com/r/dredmorbius/comments/243in1/privacy...
The antipattern is sufficiently widely adopted that I've been looking for possible dark-pattern justifications.
I'm not sure trying to confuse people about whether a shirt has stripes on it would make as much sense. The purchaser seems likely to give up on picking an ideal shirt and just go with the cheapest result.
Both though have the same essence: a manifestly confusing and annoying interface may be serving the merchant's interests.
See also Ling's Cars, possibly explaining awful Web design:
https://ello.co/dredmorbius/post/7tojtidef_l4r_sdbringw (HN discussion: https://news.ycombinator.com/item?id=16921212)
That’s the best I can do, sorry.
I think you'd struggle to find anywhere Google claims to "understand everything", making your assertion a strawman.
Literally in the article you're quoting from Google:
> But you’ll still stump Google from time to time. Even with BERT, we don’t always get it right. If you search for “what state is south of Nebraska,” BERT’s best guess is a community called “South Nebraska.” (If you've got a feeling it's not in Kansas, you're right.)
"So that’s a lot of technical details, but what does it all mean for you? Well, by applying BERT models to both ranking and featured snippets in Search, we’re able to do a much better job helping you find useful information. In fact, when it comes to ranking results, BERT will help Search better understand one in 10 searches in the U.S. in English, and we’ll bring this to more languages and locales over time.
"Particularly for longer, more conversational queries, or searches where prepositions like “for” and “to” matter a lot to the meaning, Search will be able to understand the context of the words in your query. You can search in a way that feels natural for you.
"No matter what you’re looking for, or what language you speak, we hope you’re able to let go of some of your keyword-ese and search in a way that feels natural for you. But you’ll still stump Google from time to time. Even with BERT, we don’t always get it right. If you search for “what state is south of Nebraska,” BERT’s best guess is a community called “South Nebraska.” (If you've got a feeling it's not in Kansas, you're right.)
"Language understanding remains an ongoing challenge, and it keeps us motivated to continue to improve Search. We’re always getting better and working to find the meaning in-- and most helpful information for-- every query you send our way."
"The AI works in mysterious ways. Trust it."
It's rather surprising how often complex systems theories, be it AI, cosmology or economics, have aspects where even the theorists resort to "because it is".
Sometimes those statements are based on measured data, but it's not always easy or possible to measure accurately in a highly interconnected system or, worse, a system where actors react to the theoretical model in a way that changes how the system behaves.
I don't have proof, but I strongly believe that a search algorithm that returns what a customer is actually searching for will drive more sales. I suppose it's possible that with time, consistently bad results will beat a customer into submission and drive more sales of stuff the customer doesn't want. But I don't believe that's true, and this would only be the case if the customer accepts that the thing they want doesn't exist. If the customer is pretty sure that solid color shirts exist, they'll just shop elsewhere until they find it.
edit: fixed typo born -> porn
Presumably this would be after the algo devalued people who clicked on "Next Page" until they came to a page that had stripeless shirts on it, or who, after the search, only ever clicked on stripeless shirts. "Deeds not words," dontchaknow.
Which is not the case: searching for "plain shirts" does not give results similar to searching for "shirts without stripes".
> sometimes still don’t quite get it right,
> Even with BERT, we don’t always get it right.
And nothing in the blog is about image search.
Nobody has solved the common sense knowledge problem yet. A solution for that would qualify as Artificial General Intelligence and pass the Turing Test.
But search engines have come a long way. I even suspect that if search engines placed too much logical or embedding relevance on stop words such as "without", the relevant metrics would, on average, go down. The word is not completely ignored, as "shirt with stripes" surfaces more striped shirts than "shirt without stripes". "shirt -stripes" does what you want it to do.
Searching for "white family USA" shows a lot of interracial families. Here "white" is likely not ignored as much, and thus it surfaces pages with images where that word is explicitly mentioned, which is likely happening when describing race.
You can use Google to find Tori Amos when searching for "redhead female singer sings about rape". Bing surfaces porn sites. DDG surfaces lists (top 100 female singers) type results. The Wikipedia page that Google surfaces does not even contain the word "redhead", yet it falls back to list style results when removing "redhead" from your query, suggesting "redhead" and "Tori Amos" are close in their semantic space. That's impressive progress over 10-20 years back.
Is it surprising that very few of the results surprise me?
"Kind person" - pictures of men, women, children, of all ages and colors.
"good person" - Mostly pictures of two hands holding. No clear bias towards women at all. If anything, more of the hands look "male".
"Bad person" - Nearly 100% cartoon characters
Absolutely ridiculous that you would take the time to write up such fake nonsense.
Following the stereotype content model theory I would likely get a pretty decent prediction of what kind of culture and group perspective produced the data. You could also rerun the experiment in different locations to see if it differs.
I did use images.google.se in order to tell google which country I wanted my bias from since that is the culture and demographics I am most familiar with. I also only looked at photos of a person and ignored emojis.
I have also seen here on HN links to websites that have captured screen shots of word association from google images and published them so you could click a word see the screen shot. They tend to follow the same line as above, but with some subtle differences, and I suspect that is the country culture being just a bit different to mine.
I just submitted all your searches to google.com from Australia, and the results were nothing like what you described; all the results were very diverse.
This is to be expected, as Google has been criticised for years for reinforcing stereotypes in image search results, and has gone to great effort to adjust the algorithms to reduce this effect.
But here, not that I think it will help: https://www.recompile.se/~belorn/happyvscriminal.png
First is happy person. Out of 20 we have 14 women, 4 guys, 2 children.
Second is criminal person. The contrast to the first image should be obvious enough that I don't need to type it.
If I type in "person" only I get the following persons in the first row in following order:
Pierre Person (male)
Greta Thunberg (female)
Greta Thunberg (female)
Unnamed man (male)
Unnamed woman (female)
Mark Zuckerberg (male)
Keanu Reeves (male)
Greta Thunberg (female)
Read Terry (male)
Unnamed man (male)
Greta Thunberg (female)
Greta Thunberg (female)
Unnamed woman (female)
Unnamed woman (female)
Resulting in 8 pictures of females, 8 males, which I must say is very balanced (I don't care to take a screenshot, format and upload, so if you don't trust the result then don't).
Typing in doctor, as someone suggested in another thread, I get in order (f=female, m=male): fffmffmmmmfmmfffmfmfmmmff
and Nurse: fffmffmfmmffmffmfffmffmffff
Interestingly, the first 5 images have the same order of gender and both are primarily female, though doctor tends to equalize a bit more later while nurse tends to remain a bit more female dominated.
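If anyone wants to check the tallies rather than eyeball them, a few lines will do it. This sketch only counts the two sequences exactly as typed above; the helper names are made up for the example.

```python
from collections import Counter
from os.path import commonprefix

# Sequences copied verbatim from the two searches above (f=female, m=male).
DOCTOR = "fffmffmmmmfmmfffmfmfmmmff"
NURSE = "fffmffmfmmffmffmfffmffmffff"


def tally(seq):
    """Return (female count, male count, total) for one result sequence."""
    counts = Counter(seq)
    return counts["f"], counts["m"], len(seq)


def female_share(seq, first_n):
    """Fraction of female results among the first first_n images."""
    head = seq[:first_n]
    return head.count("f") / len(head)


for name, seq in [("doctor", DOCTOR), ("nurse", NURSE)]:
    f, m, total = tally(seq)
    print(f"{name}: {f} f, {m} m of {total}; first 10 are {female_share(seq, 10):.0%} f")

# Character-wise common prefix: how long the two orderings stay identical.
print("shared opening:", commonprefix([DOCTOR, NURSE]))
```

On these strings doctor comes out at 13 f to 12 m of 25, and nurse at 19 f to 8 m of 27, which matches the impression that doctor evens out while nurse stays female dominated.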
Your initial comment said "Happy person", women of color.
But your screenshot showed several white people, several men, and a diversity of ages. Yes, more women, which is probably reflective of the frequency of photos with that search term/description in stock photo libraries and articles/blog posts featuring them. No big deal.
You also said "Criminal person", Hispanic men
But the screenshot contains more photos of India's prime minister than it does of Hispanic men. In fact I can't see any obviously-Hispanic men, and the biggest category in that set seems to be white men (though some are ambiguous).
The doctor and nurse searches suggest Google is making some effort to de-bias the results against the stereotype.
To me the biggest takeaway is that image search results still aren't very good at all, for generic searches like this.
Indeed it's likely that they can't be, as it's so hard to discern the user's true intent (for something as broad as "happy person"), compared to something more specific like "roger federer" or "eiffel tower".
The net result of that Google search, combined with the "Shirt Without Stripes" repo, leaves me even more unimpressed with the capabilities of our AI overlords.
- If I entered "person" I'd see a mix of images substantially similar to what I saw using google.co.uk up to and including Terry Crews, which was frankly a little weird, and otherwise mostly white
- If I entered "人", which Google Translate reliably informs me is Japanese for "person", I'd see a few white faces, but a substantial majority of Japanese people
So it seems possible that Google's trying to be smart in showing me images that reflect the ethnic makeup I might expect based on my language and location. I mean, it's doing a pretty imperfect job of it (men are overrepresented, for one) but viewed charitably it's possible that's what's going on.
Is the case for woke outrage against Google Image Search overstated? Possibly; possibly not. After these experiments I honestly don't feel like I have enough data to come to a conclusion either way, although it does seem like they may at least be trying to do a half decent job.
The TL;DR of it is that google crawls the internet for photos, associates those photos with text content pulled from the caption or from the surrounding page, and gives them a popularity score based on the popularity of the page/image. There are some cleverer bits trying to label objects in the images, but it's primarily a reflection of how frequently that image is accessed and how well the text content on the page matches your query. There's some additional localization, anti-spam, and freshness rating that influences the results too.
The majority of pages with "人" and a photo on it that has a machine labeled person image would be a photo of a japanese/chinese person, and if you're being localized to japan with a vpn, that would be even more true.
Google doesn't "know" what you're trying to search. It's a giant pattern matching game that slices and dices and rearranges text to find the closest match.
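A toy version of that description, purely as a sketch: score each image by how many query words appear in its surrounding text, multiplied by a popularity number. The multiplicative formula, the field names and the sample data are my own assumptions for illustration and not how Google actually ranks images.

```python
# Hypothetical crawled images: surrounding page text plus a popularity score.
IMAGES = [
    {"url": "zebra.jpg", "text": "zebra with stripes", "popularity": 0.9},
    {"url": "plain.jpg", "text": "plain white shirt", "popularity": 0.5},
    {"url": "striped.jpg", "text": "shirt with bold stripes", "popularity": 0.8},
]


def score(query, image):
    """Fraction of query words found in the image text, times popularity."""
    query_words = set(query.lower().split())
    text_words = set(image["text"].lower().split())
    if not query_words:
        return 0.0
    overlap = len(query_words & text_words) / len(query_words)
    return overlap * image["popularity"]


def rank(query, images):
    """Order images by descending score; nothing here models what the words mean."""
    return sorted(images, key=lambda image: score(query, image), reverse=True)


for image in rank("shirt without stripes", IMAGES):
    print(image["url"], round(score("shirt without stripes", image), 3))
```

With pure word overlap the striped shirt wins, the zebra comes second and the plain shirt comes last, because "without" matches nothing and "stripes" counts in favour of exactly the images the user wanted to avoid.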
I'm not disputing that, and it certainly explains why it's "good enough" for some search queries whilst being totally gimpy for others.
My understanding was that Google does prioritise what it's classified as local search results though, on the basis that they're likely to be more relevant.
"Person without stripes" shows several zebras, tigers, a horse painted like a zebra, and a bunch of people with stripes.
Interestingly, duckduckgo shows me, as second result, an albino tiger with, you guessed it, no stripes. The page title has "[...] with NO stripes [...]" in it, so I assume that helped the algo a bit.
EDIT: I also got the painted horse (it looks spray-painted, if you ask me) and I must admit it's quite funny to look at
Unless things have really changed, [doctor] will be mostly white men and [nurse] will be mostly white and Filipino women.
But don't blame the AI. The AI has no morality. It simply reflects and amplifies the morality of the data it was given.
And in this case the data is the entirety of human knowledge that Google knows about.
So really you can't blame anyone but society for having such deeply engrained biases.
The question to ask is does the programmer of the AI have a moral obligation to change the answer, and if so, guided by whose morality?
Any sort of image search is going to tend to be biased toward stock photos, because those images are well labeled, and often created to match things people search for.
Key point right there. Unless Google is deliberately injecting racial and/or gender bias into their code, which seems extremely far fetched (to put it kindly), the real fault lies with us humans and what we choose to publish on the web.
Nurses it's 34 women to 5 men. Proportions of skin tones are what I'd expect to see in a city in my country.
I would contend that society is biased. There is no evidence that says men are better doctors than women, and in fact what little this has been studied says that women make better doctors than men (and is reflected in the more recent med school graduation classes which are majority women).
So it's a question of what you are asking for when you search for [doctor]. Are you asking for a statistical sampling or are you asking for a set of exemplars?
> So statistically, it would be correct to return mostly male doctors in an image search.
And that's exactly it. The AI has no morality. It's doing exactly what it should, and is amplifying our existing biases.
You can blame statistics for that. Beyond that, you can blame genetics for slightly skewing the gender ratios of certain fields and human social behavior to amplify this gap to an extreme degree.
IMO, wrapping it in a concept like "morality" because the pictures have people in them just serves to excuse the problem and obscure its (otherwise obvious) solution.
(That's how I would do it if I wanted more accurate rather than more general results.)
The next few images contained Donald Trump, Terry Crews, Bill Gates and a French politician named Pierre Person.
After that it was actually quite a varied mix of men/women and color/white people.
I am still not very impressed with Google's search engine in this aspect, but it is not biased in the way you suggest.
At least it is not biased that way for me. As far as I am aware, and I might be completely wrong here, Google, in part, bases its search results on your prior search history and other stored profile information. It is entirely possible that your search results say more about your online profile than about Google's engine :)
Well, she was the 2019 Time Person of the Year.
Likewise, Trump was the 2016 choice, and Crews and Gates have been featured as part of a group Person of the Year (“The Silence Breakers” and “The Good Samaritans” respectively).
There's not much diversity: assuming Terry Crews is from the USA, everyone in the first viewport of images is Western; except for Ms Thunberg they're all from the USA AFAICT [I'm in the UK].
The first non-Western person would be a Polish dude called Andrzej Person (the second person called Person in my list, after a USA dancer/actress), then Xi Jinping a few lines down. The population in my UK city is such that about 5/30 of my kids' primary and secondary school classmates have recent Asian (Indian/Pakistani) heritage. So, relative to our population, there are more black people, far fewer people from the Indian subcontinent, and no obviously local people.
Interesting for me is there are no boys. I see girls, men and women of various ages but no boys. 7 viewports down there's an anonymous boy in an image for "national short person day". The only other boys in the top 10 pages are [sexual and violent] crime victims.
The adjectives with thumbnails across the top are interesting too - beautiful, fake, anime, attractive, kawaii are women; short, skinny, obese, big [a hugely obese person on a scooter], cute, business are men.
Most of the very top results seem to be of Trump and Greta Thunberg.
If you were unfamiliar with them and searched "widgets" to find out more and got widgets of a single colour and form, it would not be an unreasonable assumption that widgets are mostly (if not entirely) that shape and colour, especially if there was nothing to indicate that this was a subset of potential widgets.
It's not so much "demand for diversity" as it is "more accurate and correct representation".
I never figured out what kind of mistake could have led to that.
Relatedly, one time I picked up a prescription for a cat. The cat's name was listed as CatFirstName MyLastName. They had another (human) client with that same first and last name. It turned out that on my previous visit they had "corrected" that client's record to indicate that he was a cat.
If I search for 'person' it's a mixed-race woman, then a white woman (Greta Thunberg), then a white man.
Many interpreted this along tribal lines, but likely it is that there is constant tuning and lots of complex constraints.
Not to say that you implied the reason was racism, but often it is attributed to something along those lines.
Something of a corollary to Brooksian egg-manning: with an infinite number of possible searches, you can find at least one whose results do not exactly match the current demographics of the state from which you place the search.
The Google image search you did did not provide incorrect answers, unlike the OP's.
There’s a nuanced argument that practitioners know how ML is so dependent on training data and accuracy tails off sharply, but that nuance tends to be removed from anything selling to potential customers, which has not been a great way to keep them in my experience.
Edit: "stripes" not "stripped" ugh
EDIT: scrap that, I didn't mean Alexa, which is doing AI obviously, but the search engine of Amazon's retail website.
Anyway, NLP is hard and everyone sucks at it. Think about it: just building something that could work with any <N1> <preposition> <N2> or any other way to express the same requests would mean understanding the relationships of every possible combination of N1 and N2. It means building a generalized world model that is quite different from simply applying ML to a narrow use case. Cracking that would more or less mean solving general AI, which probably won't happen soon.
You're right that NLP is hard, but not everyone sucks at it.
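To make the <N1> <preposition> <N2> point concrete, here is a minimal sketch of the naive, purely syntactic approach. The function names and the list of negating words are my own assumptions, not anything a real search engine is known to use. It splits a query on a negating word and turns the right-hand side into exclusion operators:

```python
import re
from typing import Optional, Tuple

# Hypothetical list of words/phrases that signal "exclude what follows".
NEGATING = ("without", "with no", "not", "other than")

def split_exclusion(query: str) -> Tuple[str, Optional[str]]:
    """Split '<N1> <negating word> <N2>' into (N1, N2); N2 is None if no negation."""
    pattern = r"\s+(?:" + "|".join(re.escape(p) for p in NEGATING) + r")\s+"
    parts = re.split(pattern, query.strip().lower(), maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return parts[0], None

def to_operator_query(query: str) -> str:
    """Rewrite the query so every word of N2 becomes a '-word' exclusion."""
    wanted, excluded = split_exclusion(query)
    if excluded is None:
        return wanted
    return wanted + " " + " ".join("-" + word for word in excluded.split())

print(to_operator_query("shirt without stripes"))     # shirt -stripes
print(to_operator_query("shirt without red button"))  # shirt -red -button
```

The second call shows where this breaks down: it excludes everything red, not just shirts with red buttons. Getting that right needs to know how "red" relates to "button" and "button" to "shirt", which is the world-model problem described above.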
Not actually true. ML is one area of study within the field of AI. Thanks to marketing departments and slightly shoddy journalism these two things are now casually treated as equivalents, but they're really not: ML is still very much a subset of AI.
Additionally, "shirt without stripes" is not the same as "solid color shirt"; as an example, take a look at:
Whereas all these services seem to be processing the input in such a superficial way that they give the searcher results that aren't just inaccurate but are the opposite of what was asked for.
Lol what? These are words a toddler would understand.
If your "ML algorithm" doesn't understand straightforward language, how is it any better than a couple if-then statements?
Beyond that, I'm unsure how you think "<something> without <something>" is at all unusual or difficult to decipher.
If vendors would use the term "shirt without stripes" then it would match great, but they call it "plain shirt".
Google advertises using BERT natural language models
> ... but they call it "plain shirt".
Or polka dotted :)
How am I supposed to explicitly search for a shirt without stripes, then?
People still think we will have self-driving cars "in two years" yet here we are talking about dumb shirts. AI winter is coming.
Google has not yet discovered how to automate the "is this a quality link?" evaluation, since they can't tell the difference between "an amateur who's put in 20 years and just writes haphazardly" and "an SEO professional who uses Markov-generated text to juice links". They have started to select "human-curated" sources of knowledge to promote above search results, which has resulted in various instances of e.g. a political party's search results showing a parody image. They simply cannot evaluate trust without the data they initially harvested to make their billions, and without curation their algorithm will continue to fail.
Google has so much more data than just the keywords and searches people make, it seems like this should be a problem they could solve.
Through tracking cookies (e.g. Google Analytics) they should be able to follow a single user's session from start to finish, and they also should be able to 'rank' users in some vague way where they'd learn which users very rarely fall for ads or spend time on the sites that they know are BS. Those sites that are showing up on page 5 or 6 of the search results, but still get far more attention than others on the first few pages, could get ranked higher.
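A rough sketch of what that could look like, purely as an illustration: the trust weights, the dwell-time signal, and the `rerank` function are all assumptions of mine, not a description of anything Google actually does. Each visit counts toward a site's score in proportion to how much the visiting user is trusted:

```python
from collections import defaultdict

def rerank(results, visits, user_trust, default_trust=0.5):
    """Reorder results by trust-weighted attention.

    results:    list of URLs in their current order
    visits:     list of (user, url, dwell_seconds) tuples
    user_trust: dict mapping user -> weight in [0, 1]; higher means the
                user rarely spends time on junk (hypothetical signal)
    """
    score = defaultdict(float)
    for user, url, dwell_seconds in visits:
        score[url] += user_trust.get(user, default_trust) * dwell_seconds
    # sorted() is stable, so URLs with equal scores keep their original order.
    return sorted(results, key=lambda url: -score[url])

results = ["spam.example", "niche.example"]
visits = [
    ("careful_user", "niche.example", 120),
    ("gullible_user", "spam.example", 200),
]
user_trust = {"careful_user": 0.9, "gullible_user": 0.1}
print(rerank(results, visits, user_trust))  # ['niche.example', 'spam.example']
```

Of course, as soon as a signal like this affects ranking, it becomes something to game, so the hard part is the trust weights, not the sort.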
But I don't think many of Google's problems these days are technical in nature. They're caused by the MBAs now having more power at Google than the techies, and thus increasing revenue is more important than accuracy.
Also, don't underestimate the adversaries. Ranking well on Google means earning a lot of money. So much so, that I'd argue the SEO-people are making significantly more money than Google loses by having spammy SERPs. They will happily throw money at the problem and work around the filters. I don't think you can really select for quality by statistical measures. Google tried and massively threw "trust" at traditional media companies and "brands". The SEO-people responded by simply paying the media companies to host their content, and now they rank top 3, pay less than they did by buying links previously, and never get penalties.
They already do this today for any venue where they can link “traffic volume” to “ranking increase without human review”.
That might explain a lot but I don't think so.
Just look to how they are messing up simple searches because of basic lack of quality controls:
- Why don't double quotes work anymore? Not because of dark SEO but because nobody cares.
- Same goes for the verbatim option.
- The last Android phone I liked was the Samsung SII, and last year I finally gave up and got the cheapest new iPhone I could get, an XR. My iPhone XR reliably does something my S3, S4, S7 Edge and at least one Samsung Note couldn't do: it just works as expected without unreasonable delays.
- Ads. They seem to be optimized to fleece advertisers for pay-per-views because a good number of the ads I've seen are ridiculous, especially given that I had reported those ads a number of times. I wonder what certain customers that probably paid a lot for those impressions would say if they knew that I had specifically tried to opt out from those ads and wasn't in the target group anyway.
Google's aim was to replace other sources of information with Google:
> People make decisions based on information they find on the Web. So companies that are in-between people and their information are in a very powerful position
Profit was on their minds from the very beginning:
> There are a lot of benefits for us, aside from potential financial success.
Revenue, however, was not urgent back then, to them or to their VCs:
> Right now, we’re thinking about generating some revenue. We have a number of ways to doing that. One thing is we can put up some advertising.
So over the past two decades, they executed a two-pronged approach: Become indispensable and Become profitable. But now they're trying to pivot from "at web search" to "at assisting human beings", and that's a much more difficult problem when their approach to "Become profitable" was to use algorithms rather than human beings.
Here's a useful litmus test for whether Google has succeeded at that pivot:
If you were in a foreign city and you suddenly wanted to propose marriage to your partner, would you trust Google Assistant to help you find a ring, make a dinner reservation, and ensure that the staff support the mood you want (Quiet or Loud, Private or Public)?
If so, then Google's pivot has been successful.
Google is a dumbass nowadays, and regularly ignores half your search terms to present you with absolutely irrelevant results, that have gotten lots of visits in the past.
People want better results but don't want to be tracked, and those things are in opposition to each other.
But taking it as a given that Google's results are better, is that really because of lack of privacy, or just because of how Google has been pouring more money and talent into the problem longer than anyone else? Because I'm not convinced that personal data is particularly useful for generating search results. The example they always give is determining whether a search for "jaguar" means the cat or the car. But that always seemed silly to me, because most searches are going to give extra context to disambiguate ("jaguar habitat"), and even if they don't, the user is smart enough to type "jaguar car" if they're not getting the right results. Further, Google doesn't actually know whether I'm more interested in cars or cats; it just knows that I'm a woman in college, so it guesses that I'm less interested in cars. Is that really so useful?
Does searching Google through Tor give noticeably worse results than searching Google while logged in? I would be genuinely surprised if it did.
I mean, that's probably why they are equivalent for you. You've chosen privacy over better results (which is a totally legit choice to make!).
Have you tried viewing pages past the first page? Often times it's just filled with what looks like foreign hacker scam websites.
It's funny because it's frequently mentioned how Google's tracking is what enables it to give such personalized search results, but often I question how effective that really is.
For instance I question if Google has some profile on me and shows results they _think_ I will want to see (e.g. news related), and thus leave out other results. If it works that way then I'm frequently seeing the same websites in my results and effectively being siloed and shielded from other results that I may find interesting.
Their new strategy of adding snippets for everything has truly gone insane. I searched for "covid us deaths" today and had to scroll about 3 viewport lengths down to even see the first result.
What happened to just a plain list of blue links?
From a marketing perspective, I feel like DDG needs to change its name or use a shortened alias. "Google" is an incredible word as it's easy to spell, remember, and it's short. Interestingly they own "duck.com"...
Alternative hypothesis: people only have had Google as reference for years, which means that Google represents "reality" to them. Anything that looks even slightly different is therefore worse.
Still though: This is not evidence for Google's search quality. I, too, feel like the results got worse over the last years.
Also: AFAIK, DDG uses Bing under the hood, not what I would call a "search startup" in the sense of revolutionizing search quality.
Page 2: Page 2 of about 86 results (0.36 seconds)
It seems they're really just trimming the web.
Google's job is not to give you great search results, it's to keep you clicking on ads. Ideally it would be the ads on the search results page directly, but if that doesn't work then a blogspam website with Google ads is the next best thing.
If Google was a paid service this problem would be solved the next day. Oh, and Pinterest would completely disappear from Google too. :)
Nope. Cable television was introduced with the promise of no ads. That didn't last long.
Search engines are a relatively competitive market. A paid Google with no extra perks will not fly when the majority of people will just flee to Bing. For a paid Google to be successful it has to provide additional value such as filtering out ads, blogspam, Pinterest and other wastes of time.
Subscription based services also require you to be authenticated and that enables fine grained invasive tracking. Something traditional media couldn't do.
If delivery costs were a factor then I shouldn't be charged $15 for an ebook with near zero distribution costs when a paperback was $5 before ebooks came onto the scene and introduced a new incentive for price gouging.
Today, it will silently guess at what I want, and rewrite the query. If they have indexed pages that contain the words I put in, but don't meet their freshness/recency/goodness criteria, they will return OTHER pages with content that contains vaguely related words. "Oh, he couldn't have meant that, it's from 6 months ago, and it's niche!"
They'll even show this off by bolding the words I didn't want to search for.
So, if I'm looking for something that isn't popular -- duckduckgo it is. It doesn't do this kind of rewriting, so my queries still work.
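For contrast, the literal behavior I'm describing is trivial to state. This is only a toy sketch (the `literal_search` function and the in-memory `pages` dict are made up for illustration, and no real engine works on a dict of strings): a page matches only if it contains every word I typed, and if nothing matches, I get nothing back rather than "vaguely related" pages.

```python
def literal_search(query, pages):
    """Return titles of pages containing every query word, with no rewriting.

    pages: dict mapping page title -> page text (hypothetical toy index)
    """
    terms = query.lower().split()
    matches = []
    for title, text in pages.items():
        words = set(text.lower().split())
        if all(term in words for term in terms):
            matches.append(title)
    return matches

pages = {
    "old-forum-post": "rare niche widget manual from six months ago",
    "popular-review": "popular gadget review with lots of visits",
}
print(literal_search("niche widget", pages))  # ['old-forum-post']
print(literal_search("niche gadget", pages))  # []
```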
I still continue to use it, though, since, as some here have already mentioned, Google's results became worse a few years ago and DDG was lean and good enough to switch to. I do hope they'd consider more such feedback.