Hacker News
"Sufficiently advanced spam is indistinguishable from content" (lesswrong.com)
84 points by moultano on May 12, 2010 | 38 comments



It's interesting that PageRank's measure of quality is entirely dependent on there being a community that recognizes the quality of the content first, before the search engine. Without a community, you're not going to get incoming links.

In other words, work produced by lonely geniuses is quite likely to go unnoticed.
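To make the dependence on incoming links concrete, here is a minimal sketch of the textbook power-iteration form of PageRank on a toy link graph. The page names, the damping factor of 0.85, and the iteration count are assumptions chosen for illustration; this is not a description of how Google actually ranks pages.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "lonely_genius" links out, but nobody links in.
links = {
    "hub": ["popular", "blog"],
    "blog": ["popular"],
    "popular": ["hub"],
    "lonely_genius": ["popular"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:15s} {score:.4f}")
```

In this toy graph the page with no incoming links never rises above the baseline `(1 - damping) / n`, whatever its content is, which is the point the comment above is making.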

For all we know, the content that is being produced by companies like Demand Media has already been produced by thoughtful people, writing at length about subjects they love on obscure websites that no one ever links to. What a shame that would be!


I've actually seen that happen to one of the lead engineers in search quality at Google. He'd written a great guide to ultralight backpacking that until I linked to it, wasn't indexed by any major search engines.

http://eric-and-april.com/Ultralight/index.html


Is this a meta-post? Did you just link to it for the first time?


> In other words, work produced by lonely geniuses is quite likely to go unnoticed.

It’s not quite as depressing as that. I recently made a quaint little site for a band and it has exactly zero other sites linking to it. It’s the first result when you search for the name of the band (which is town-name+generic-term-used-in-bandnames).

This only works with stuff that’s rare on the web, though. If there were other bands with the same name and if someone linked to them my little website would probably get swamped. (The same would presumably happen if someone were to write a blog post about the band – say, a scathing review of their last gig – and if that one post gets only a handful of links. Hm, so getting a few links seems at least like a good defense in such cases. Luckily many of the band’s target demographic aren’t actually all that internet savvy :)


You imply that works of geniuses should be noticed, but geniuses are so esoteric, rare, and difficult to understand that most people wouldn't notice. Since the majority of people don't care about what geniuses care about, it's unlikely they'll appreciate it enough to link to it. If they do link to it, then "they" are probably a very small population of people, maybe a handful of other geniuses themselves.

Google's PageRank algorithm is designed in such a way that the work of geniuses should go unnoticed. PageRank is designed for the masses. For the masses of consumers specifically.

Google is not designed for the geniuses. It's designed for people who want what everyone else wants.

In the beginning, when Google was a tool used primarily by geniuses, geniuses were the community. They were the masses that used Google. Its algorithms now pick selections from a new community. Bloggers who can copy/paste. Bloggers with lots of friends who will link to their posts because the friends are asked to and because other friends reciprocate.

Google doesn't know if you are linking to a web page because you like the web page or because someone who built the web page asked you to link to it or because you are getting paid.

And google doesn't care.


The content produced by Demand Media is still spam, all the more effective as spam to the extent that it approximates thoughtful but obscure content.

The problem is that "indistinguishable" does not mean "identical". The Optimization-by-Proxy concept also applies to the way we recognize useful content and distinguish it from spam: if spam-creators exploit the gap between our perception of content and the actual quality of the content, they will ultimately create spam that fools even savvy users, and we will be influenced by it without even realizing it.

One of the characters in Neal Stephenson's "Anathem" described this phenomenon, occurring on his world's equivalent of the internet: sophisticated AI had led to spam (or "crap" as he called it) which was created by taking perfectly valid, reasonable ideas, combining them with falsehoods or biased information expressed clearly and reasonably, and releasing it in the form of real, substantive communications between users. A great deal of time and energy had to go into sorting "crap" from valid information.


> In other words, work produced by lonely geniuses is quite likely to go unnoticed.

I think this is something that has happened throughout history. The web probably makes it easier for their work to be uncovered than before, but they are still at a disadvantage.


I work on search-quality at Google. This is my life.


Hi, I am the author of that. Would you say the depiction in the article is more-or-less accurate? I am asking as I wrote this purely from an outside/theoretical perspective.


It's very true for some things, and not for others. There's information asymmetry in both directions: things site owners know that Google doesn't, and things that Google knows that site owners don't.

The summary of the history seems pretty accurate to my perception of it, but I don't think it's hopeless from here. :)


My life is going from using a Google that used to give me useful results to one where "tar up website" returns this as the top result:

"Deep-sea ice crystals stymie Gulf oil leak fix - Yahoo! News 8 May 2010 ... thick blobs of tar began washing up on Alabama's white sand beaches. ... platform at the Deep Sea Horizon oil spill site in the Gulf"

At least a result from 4 days ago is an improvement on when I'd get usenet or mailing list results from 1999-2004 whenever I searched for anything linuxy.

:/


Tell us more!


I wish I could. :)

All of the fascinating things about signals are confidential for all of the reasons listed in the article, and Google has been sued so many times by sites that think they should rank better than they do that I can't really give examples.

I think it's safe to say though that there are a lot of people worried about and thinking hard about what the web is turning into and how to rank it appropriately.

Most of the content is no longer written by devoted hobbyists, people no longer link as often to things they like, and much of the content on the front pages of reddit, digg (and sometimes even hackernews) was put there by people trying to make your search results worse.


I feel like you missed listing a big change: with the rise of UGC sites, the boosting of results of anything on an "important" domain doesn't seem valid anymore. Just because something is on twitter.com doesn't mean it's highly relevant, since anyone could've posted it. But it's fairly easy to google bomb someone's name by just registering a Twitter account and listing their name as owning it.

Similarly, stackoverflow.com doesn't have very good answers for a number of technical topics, but it's often in the first 10 hits, even when the answers are useless and there are much better answers ranked lower (like project mailing list archives).


I think that a mailing-list-only search engine, done well and broadly, would be a very useful technical resource. Much of the internet, at least for SysAdmins and Systems Programmers, still happens on mailing lists.


Yeah, I often find myself wanting an "exclude answers sites and wikis that aren't Wikipedia" checkbox.


The days of a central search engine 'indexing' the Web are nearing an end; indeed, they're already over, people just don't realize it yet. New material now appears much faster than it can possibly be indexed, and unless Google figures out a way to beat Einstein, it's only going to fall further and further behind. People will discover relevant material through their network, in a massively parallel fashion. Everyone will be their own Google.


Fascinating essay, but I'm not quite sure whether it's a problem that sufficiently advanced spam is indistinguishable from content.

After all, Demand Media does produce real, editorially vetted content from real human writers. The payment system encourages what I'll call extreme efficiency of research and writing, but that simply optimizes it for the handy-reference domain of search results (e.g. How to fillet a smallmouth bass), which may not be "high quality" as such but does provide direct, clearly written and reasonably valid responses to the search queries that elicit them.


I've seen a lot of pages where I couldn't tell whether they were written by a Markov model or a human. Many of the people who get paid for $1 content don't speak English natively.


On that topic, I present Emily Chimpinson, a little project I've been playing with:

http://twitter.com/chimpinson

She's a Markov-based script inspired by the public domain works of a certain poet. All of her incantations are checked to make sure they don't repeat her model verbatim.

Every once in a while she comes up with something inspired.
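For readers unfamiliar with the technique, a word-level Markov generator with a verbatim check might look roughly like the sketch below. It is not the actual script behind the account above: the function names, the chain order, and the placeholder corpus are all assumptions made up for illustration.

```python
import random


def build_model(text, order=1):
    """Map each run of `order` words to the words seen following it."""
    words = text.split()
    model = {}
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        model.setdefault(key, []).append(words[i + order])
    return model


def generate(model, length, rng):
    """Random-walk the chain from a random starting key."""
    key = rng.choice(sorted(model))
    order = len(key)
    out = list(key)
    while len(out) < length:
        followers = model.get(tuple(out[-order:]))
        if not followers:
            break  # dead end: this run of words was never followed by anything
        out.append(rng.choice(followers))
    return " ".join(out)


def novel_line(model, corpus, length=8, seed=0, attempts=100):
    """Return a generated line that does not appear verbatim in the corpus,
    or None if every attempt reproduced a passage of the source."""
    rng = random.Random(seed)
    normalized = " " + " ".join(corpus.split()) + " "
    for _ in range(attempts):
        line = generate(model, length, rng)
        # Compare on word boundaries so partial-word matches don't count.
        if f" {line} " not in normalized:
            return line
    return None


# Placeholder corpus (hypothetical, not a real poem).
corpus = (
    "the bird sang to the sea and the sea sang to the sky "
    "and the sky sang to the bird"
)
model = build_model(corpus, order=1)
line = novel_line(model, corpus)
print(line)
```

Every adjacent word pair in the output comes from the source text, but the line as a whole is rejected if it can be found there word for word, so only recombinations survive.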


Most English speakers in the world learnt it as a second language, and it only becomes a problem to me when they're not good at it and they're trying to sell me a service predicated on their language skills.


I'd put a finer point on it: paid writing encourages the creation of content which appears superficially relevant (especially through the eyes of a search engine), but doesn't actually convey any substantial information.


I'd suggest that it is a problem. It's something that Harry G. Frankfurt examined in his essay "On Bullshit" (http://en.wikipedia.org/wiki/On_bullshit and http://press.princeton.edu/titles/7929.html). I listened to an audio version of it and it was quite fascinating. As the Wikipedia article suggests, Frankfurt posits that bullshit is more corrosive than lies because bullshit bears no relation whatsoever to the truth.

This is exactly what makes Fox News, as an example, so dangerous. They don't care about the truth when they report; they only care about getting more eyeballs. I suspect that ANY spam that humans have to deal with to determine if it's useful is much the same.



Hah, I ranted a bit about this evolutionary arms race years ago, from a different angle.

http://hamstermotor.motime.com/post/683104/the-future-of-spa...


That was very interesting. I added a pointer to your article to the OP.


Moultano - I have a strange request, but one I hope you'll take seriously.

I think this issue is very important - to Google, to web searchers, to businesses seeking to be found by Google and even to less scrupulous web operators. I'd love the opportunity to engage in 20-30 minute written chat with you and publish it (anywhere on the web you'd like).

As background, I've worked for years as an SEO consultant, founded a community and company in the space (SEOmoz.org), and have been spending the last few years developing and launching search marketing software.

I certainly respect your background and beliefs, but I think there's some flawed logic in your assumptions and arguments that I'd love to dig into, talk about and maybe even have some of my own perceptions changed. I would not ask you to disclose anything that's confidential - I'm much more interested in the theory and logic behind web spam, SEO and search relevancy.

You can reach me via email - rand@seomoz.org. Would love to hear from you!


Sorry, I'm not equipped for that sort of public discussion. Talk to Matt. ;)


It could be anonymous on your end?...

I like Matt a lot, too, but his opinions (at least, those he publicly offers) are well known and well publicized. It would be great to hear other voices.

If not, how about if/when you ever leave the team. Happy to be patient :-)


Haven't read it all, but I am just wondering: by now data dumps of people's connections are probably making the rounds in the dark channels? I think sending spam that appears to be from your friends could be a big "improvement", and should be child's play with the data that is already freely available.

Maybe that could become one of the first privacy disasters, when people realize they made their email unusable by publishing their connections.


If we presume that any algorithmic, procedural, or structural system built by one party can be reverse-engineered and understood by another party, the concept of Optimization by Proxy, and the more general Goodhart's law, form a pretty compelling argument against designing optimized systems as solutions to problems in general.

Maybe in some cases keeping a system convoluted and inconsistent can actually help ensure stability and durability?


Just ask Calacanis!


Internal Server Error

And sufficiently advanced errors are indistinguishable from pages made for pure irony.



Thanks, but it's back again.


Absolutely... sometimes I mark as "spam" conversations that I'm personally not interested in, even if the author is "legitimately" spamming me (e.g. a misguided friend's mass email, or more likely the dozens of misguided reply-alls).


I think this is a valid use case of spam filters. I have trained more than one to detect my father's powerpoint emails and bad chain-mail jokes and separate them from his personal messages that I actually want to read.


Also all the newsletters companies feel entitled to send just because you bought a toothpick from them 10 years ago.



