
This happens all the time on Amazon. Just search for a product, like "shirt", and then change the sorting from "relevance" to price "high to low". For this particular search, at this time, there is a nice women's $2,000 Nike tee shirt.



In general, sorting by anything but "relevance" is just a way to find bad data, like sorting TVs by size and finding a TV which reinterpreted its size from mm to inches or vice versa and now seems to be an absurdly large or small TV, on paper. It's not even spamminess in this case, just bad data.


Makes one appreciate the effort that's put into the "relevance" scoring. It's probably >100 engineers working in Amazon on just this core ranking problem.


Tuning relevance is extremely difficult, and it’s not just an information retrieval problem. At Amazon’s size (and even at smaller companies), a lot of what you see for “relevance” is hand-tuned by humans, produced by machine learning on longer-tail queries, or some combination of both in the middle (with sponsored, high-margin, or high-velocity items weighted more heavily). If you search for something like “men’s shoes” there’s 0% chance you’re getting a search result; it’s going to be a curated result page. Lots of people work on this hand tuning, but they’re not engineers, they’re product and marketing people.

If you search for “men’s shoes with green laces” you’ll see the real challenge, with a top 10 search result that’s not even shoes (it’s a pack of laces).


Some of my favorite test queries for shopping sites (from back when I used to work on a comparison shopping engine):

[apples to apples] -> frequently returns random "Apple" products, because when you dedupe, stem and remove stopwords from that query, you're left with "apple", and Apple products are big ticket items.

[shirt dress] -> frequently returns dress shirts, because many sites essentially ignore word order or the notion of compound words.

[fish swim vest] -> frequently returns fishing vests, partially because of overstemming (many POS stemmers will stem "fishing" -> "fish", but in a product search setting, it's usually the wrong thing to do).
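To make the three failure modes above concrete, here is a toy sketch of that kind of query normalization. The stopword list and the suffix rules are made up for illustration and are far cruder than what any real engine uses; the point is only that lowercasing, dropping stopwords, stemming, and deduping into an unordered set loses exactly the information these queries depend on.

```python
# Toy query normalizer: lowercase, drop stopwords, crudely stem, dedupe.
# STOPWORDS and the suffix rules are illustrative assumptions,
# not the configuration of any real search engine.
STOPWORDS = {"to", "the", "a", "of", "in", "with"}


def crude_stem(token):
    """Strip '-ing' or a plural '-s'; deliberately naive."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def normalize(query):
    """Return the query as an unordered set of stemmed, non-stopword terms."""
    tokens = [t for t in query.lower().split() if t not in STOPWORDS]
    # A set throws away both duplicates and word order.
    return frozenset(crude_stem(t) for t in tokens)


if __name__ == "__main__":
    print(normalize("apples to apples"))
    print(normalize("shirt dress") == normalize("dress shirt"))
    print(normalize("fishing vest") <= normalize("fish swim vest"))
```

With these rules, [apples to apples] collapses to the single term "apple", [shirt dress] becomes indistinguishable from [dress shirt], and every term of a "fishing vest" listing matches [fish swim vest].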


Here's one that's fun if your objects of interest include cities:

https://en.m.wikipedia.org/wiki/Best,_Netherlands

What's the [best city in europe]? Obviously "Best, Netherlands".


The other day, I found a household appliance (I don't remember what kind), that boasted 16 gigs of RAM.


Probably a refrigerator. A washing machine only takes 8 gig.


Refrigerator Repairing And Servicing, Memory Size: 2GB, 4GB, 8GB, 16GB, 32GB

https://www.indiamart.com/proddetail/refrigerator-repairing-...


Haha! That's crazy! Though, it does point out that many listings may have been created by copy&paste, leaving them with incorrect information.


If you discard outliers you get usable data.
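One common way to do that is Tukey's IQR rule. A minimal sketch, assuming the field is TV screen size in inches; the sample values are invented, with one 43" TV entered in millimetres:

```python
import statistics


def drop_outliers(values, k=1.5):
    """Keep values within k * IQR of the first and third quartiles (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


if __name__ == "__main__":
    # Hypothetical screen sizes in inches; 1092 is 43" mistakenly stored in mm.
    sizes = [32, 40, 43, 50, 55, 55, 65, 65, 75, 1092]
    print(drop_outliers(sizes))
```

The threshold k=1.5 is the conventional default, not something tuned for product data.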


That helps, but you also have a lot of bad data at the extremes that are still within the bounds of plausibility.

For instance, nine out of ten 122" TVs might actually be 12.2" TVs that got misinterpreted or typo'd to 122". Small TVs are so much more common than large TVs that there can be more malformed small TVs than well-formed large TVs. But you can't necessarily discard all of the 122" TVs as outliers, because there are actually valid TVs in that range as well.
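The arithmetic behind that is just base rates. A rough sketch with invented numbers (the listing counts and the typo rate are assumptions picked to reproduce the nine-out-of-ten figure, not measured data):

```python
def fraction_genuine(genuine_large, small_listings, typo_rate):
    """Share of listings at the large size that really are that size.

    genuine_large:  count of real large-TV listings (hypothetical)
    small_listings: count of small-TV listings (hypothetical)
    typo_rate:      fraction of small listings whose size loses its
                    decimal point, e.g. 12.2 -> 122 (hypothetical)
    """
    malformed = small_listings * typo_rate
    return genuine_large / (genuine_large + malformed)


if __name__ == "__main__":
    # 1 real 122" TV, 9000 listings of 12.2" TVs, 1 in 1000 mistyped.
    print(fraction_genuine(1, 9000, 0.001))  # roughly 0.1
```

Even a tiny error rate on a very common product swamps a rare genuine one, which is why a plausibility bound alone can't separate the two.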



