
I'd like to note, donning my #2 at amzn hat again, that when the idea for affiliate sales originally came up, we certainly did imagine people creating pages with lists like "best bikes for under $500". However, I feel fairly confident in saying that nobody involved in that in the 95-96 timeframe was imagining that such pages would be created by anyone other than actual enthusiasts (probably a reflection of the state of the internet at the time).

In retrospect, this was a profound failure of our imaginations back then. It's also somewhat damning that in the 25+ years since, nothing about the affiliate sales concept has been substantially modified to mitigate its weaponization by "best X for 202X" pages.

An actual "best X for 202X" page is extremely useful; the problem is that there are no search signals to distinguish it from a trivially created spam mockery.

I think that's a failure of imagination. In the early days of search engines and web crawling, companies paid subject-matter experts such as librarians to manually grade and categorize web pages.

Spam pages tend to be filled with ads and derivative content, are owned by spam-company domains, and provide no unique information relative to other sources.

If we as humans can recognize a spam page, why can't the machines? At the very least, Google could penalize repeat spam domains and their owners to help reduce the problem.

Can humans recognize spam pages though? Back in the day, such pages would be created on Blogspot or other spammy domains, so there was at least a visual signal of some sort. Now the spam is on Facebook groups, WhatsApp group chats, TikTok, Twitter, Medium, LinkedIn... the list goes on.

Online misinformation would not be as destructive as it is if people could tell the difference between say a credible news site and a Macedonian troll farm-run 'news' site written in broken English.

> Now the spam is on Facebook groups, Whatsapp group chats, TikTok, Twitter, Medium, Linkedin.....the list goes on

Of that list, Medium is the only one I would ever expect/want a search engine to even consider returning results for searches on "how to buy a bicycle for under $500". If I were king, you'd need to add allow:social to any Google search to get any results at all from anything remotely like a social media or messaging application. I'd include LinkedIn and Pinterest and similar sites in that exclusion. If you want that stuff, you need to ask for it.

> Can humans recognize spam pages though?

Probably not, but they can recognize valuable non-spam. It's called by many names, but "reputation" is as good a term as any.

Yes, I'm saying that "Facebook groups, Whatsapp group chats, TikTok, Twitter, Medium, Linkedin" are all low reputation.

If we as humans can recognize a spam page, why can't the machines?

I think that many of these automated aggregators have come a long way. On the better sites, it's not immediately obvious that the page is spam, until I've read some way into it and notice patterns in the language and a lack of "meat".

My dad can't seem to recognize spam pages 99% of the time...

I find the advertising to content ratio is a decent clue as well.

I don’t think there is a financial incentive to clean up the internet.

There's too much money to be made through shady practices, and no one in power wants to touch it.

Ah, the old "Step one: create a general purpose artificial intelligence."

No general AI required. Simply count the number of ads in the text.

For a top ten site I’d also expect a certain degree of interesting vocabulary which loops in industry jargon. I’d also expect spam sites to use similar syntax/sentence structure for all products.

Lastly if I can follow the links and get references to the final products, then I can estimate how similar the product description is to the product. I’d expect most customers looking at top tens want something more than what they get from searching a retailers site.

Heck, many legit top lists include dedicated YouTube videos visually inspecting and reviewing the products. Cross-referencing blogs to YouTube channels and validating that the video has some production quality (likes, views, YouTube's own spam detection) could strip out a ton of spam.

> Simply count the number of ads in the text.

What is an ad?

Something served from DoubleClick or another major ad network.
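A minimal sketch of that heuristic, combined with the advertising-to-content ratio mentioned earlier in the thread. The hostname list here is a tiny hypothetical placeholder; a real blocklist (e.g. EasyList) has tens of thousands of entries, and real ad detection is far messier than matching `src` attributes:

```python
import re

# Hypothetical stand-in for a real ad-network blocklist.
AD_HOSTS = ("doubleclick.net", "googlesyndication.com", "adnxs.com")

def count_ad_references(html: str) -> int:
    """Count script/iframe sources that point at known ad networks."""
    srcs = re.findall(r'src="([^"]+)"', html)
    return sum(1 for s in srcs if any(host in s for host in AD_HOSTS))

def ad_to_text_ratio(html: str) -> float:
    """Ad references per 1000 characters of visible text (very crude)."""
    visible = re.sub(r"<[^>]+>", " ", html)
    return count_ad_references(html) / max(len(visible), 1) * 1000
```

Of course, as the replies below point out, any such static list stops working the moment ads move to nondescript first-party domains.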

That does absolutely nothing to address the affiliate problem, nor does it catch advertorials or "native ads", as they are sometimes called. Ignoring all content served by ad brokers is no more sufficient than browser ad-blocking extensions: it helps, but you don't get an ad- and distraction-free result.

Plus, as soon as you make that a metric that is penalised by Google, how long do you think it'll take until ads are distributed from some rotating nondescript domains?

(See Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." https://en.wikipedia.org/wiki/Goodhart's_law)

Don't spammers already do a lot of domain cycling? Doesn't NextDNS already ignore domains less than 30 days old, for exactly this reason?
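The 30-day rule described above is simple enough to sketch. The registration dates here are hypothetical inputs; a real resolver would pull them from WHOIS/RDAP data:

```python
from datetime import date, timedelta

def is_newly_registered(registered_on: date, today: date,
                        min_age_days: int = 30) -> bool:
    """Flag domains registered within the last `min_age_days` days,
    the way a NextDNS-style filter reportedly does."""
    return (today - registered_on) < timedelta(days=min_age_days)
```

The weakness is the same as with any static signal: spammers can simply age domains in bulk before using them.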

Most affiliate spam sites that I see cover the page in intrusive advertisements to maximize revenue. Penalizing intrusive ads would certainly have some effect. This is also referring to penalizing rankings so that the entire page gets downranked rather than simply blocking ads in your browser.

It still wouldn't do much, because as soon as Google starts trying to identify ad-heavy sites by some criteria, shady sites will start serving ads in some other way. It's already hard enough to distinguish ads on sites that have gone with native ads or advertorials.

There are consumer advocate sites like https://www.which.co.uk/ that are pretty cool. I'd be interested in knowing other trustable sources, mind.

There is Consumer Reports in the US: http://cr.org They are a nonprofit that has been independent for decades, have strict policies against accepting gifts, sponsorships, ads, etc., and take other steps to remain free from influence. For example, when they review new cars they send an undercover individual to buy one at normal price from a random dealer, so the company and the dealer don't know the buyer is CR (and can't sweeten the deal, throw in upgrades, or take steps to improve that particular unit's reliability). They also review a lot of household and kitchen appliances, mattresses, etc.

The New York Times now has their "Wirecutter" reviews of tech and household stuff, but they do rely on affiliate links for income so take that as you will.

There is not a single independent consumer organization that can test every product out there, so if you're looking at road bikes you may end up with a different set of trusted reviewers than if you're looking at computer hard drives or roof shingles.

CR also tends to limit itself to middle-of-the-range items. If you're actually interested in the best, their surveys don't always cover it, since they prefer to review things that most people can/would purchase. This has changed a bit in the last few years, but it's still the case that if you want to find solid unbiased reviews of, for example, high end ranges or washing machines, CR is not always the most helpful.

Good info. As a non-rich person I do appreciate their emphasis on highlighting the products that provide the best value for the money.

My dad was a big CR subscriber. After a while I realized that they focus on what might be termed "best value", although that's still not quite right. Their picks always emphasized mundane things like TCO, reliability, durability. For cars they were very explicit about this, each year issuing data on vehicles with categories like reliable, held their value, and low-maintenance, etc.

The perfect example of a CR-friendly car, in my mind, is the Toyota Camry. Completely unremarkable.

Everything else I remember them reviewing – watches, TVs, stereos, calculators, cameras – always ended up recommending some middle-of-the road "it works but it's not fancy and doesn't have many bells-and-whistles" product. I have a vague memory of them putting a Minolta camera at the top of the list of best 35mm SLRs in the 70s.

I found The Wirecutter quite useful. It's been purchased by the NYT; I hope they keep it going in its original form.

In Germany, there is the Stiftung Warentest, which has in depth tests of many household appliances and goods. According to Wikipedia, they cooperate with Which? in the UK and Consumers Union in the US.

(For example, when they recently tested shaving blades, they had 23 testers shave half their face with one shaver and half with a different one, randomised, then had each half-shave assessed both by the tester himself and by an external expert, blind (i.e. the assessor did not know which shaver was used, or how the other had judged the shave). Assessed were: quality of the shave; comfort; burning, reddening and irritation of the skin; cuts; how many shaves until the blades were blunt; ease of use and of switching blades; cleaning of the blades; presence of polycyclic aromatic hydrocarbons and other chemicals in the handle; and more.)

I came here to say this. Embedding trusted sources like that [0] would a) help fund such useful organisations and b) help drown out link farms.

[0] in fact I think in Germany there is a state-sponsored impartial review body for consumer products.

Part of the point of my other comment in this thread [0] is that we shouldn't need to drown out link farms. Google chooses to not hide link farms. Does anyone seriously think that a company (Alphabet) that is capable of sucking in known protein structures and then spitting out predicted structures for thousands of other proteins is not also capable of identifying link farm pages? Yes, it's a non-trivial problem, but chess, go and protein folding are non-trivial problems too. Google appears to have chosen not to treat this as a problem worthy of its capabilities.

[0] https://news.ycombinator.com/item?id=27996639

Are those sites generated by bots now? They always seem to just be the top x results if you search on Amazon directly.

I would guess most are written by humans. I know a lot of affiliates (mostly gambling but also finance and shopping) and most of their stuff is produced manually. There is some more automation these days but the industry is surprisingly manual.

The reviews are usually somewhat accurate but only include items found on Amazon. So I'm guessing they hire people to write the reviews.

I have the impression that they use ghostwriters who summarise a couple of Amazon reviews and gain some superficial familiarity with the subject (talking points), but have no real experience with it. I've seen many such sites (mattresses, bike accessories, VPNs, certain apps, etc.) that seem superficially plausible but here and there betray profound ignorance of the subject matter.

I imagine that once these "businesses" get a hold of GPT-2 or GPT-3 they will crank them out.

Some for sure are (i.e. some really are just strung-together sentences comparing numbers); many are probably just really low-quality manual work.

They're written by humans. I know someone who did this, she was paid by the word and would write "best 5 [product]" lists several times a month.

Obviously, you only have time to basically summarize points made on the listing and in the reviews for the product for it to be worthwhile.

This was pre GPT-2, though.

Failure of imagination of the bad things Amazon would go on to do seems like the hallmark of the early days. Great job. Did you also game out the exploitation of warehouse workers?

I left Amazon after 14 months because, indeed, the exploitation of warehouse workers was obviously a part of the company's future. Didn't want to be a part of that.
