
I find both its birth and death interesting. The birth, because the fact that you could mark sites as spam was one of the early talking points at Blekko; the back-end architecture of our engine includes a 'selector' mechanism (slashtags), and maintaining a personal 'spam' slashtag came along for free. It's one of the features I continue to use.

And then it showed up as a feature in Google's results, which I found interesting because, having worked at Google and taken the 'deeper than non-Googlers, not as deep as someone in the search group' classes on how the two services that made up Google search at the time worked, I had a feel for how much lubricant it would take to squeeze that feature into the existing pipeline. It made me wonder if Google was following us :-)

The death of it is also interesting, because having it in the browser as a plug-in rather than in the results means two things: you can't offer it as a service to your partners, and you can't know a priori whether you're sending junk. If Blekko's partners say "We'd like to use your index but we don't want any results that include x, y, or z," we can do that, but it happens at the API level, with results coming right out of the index filtered by a 'negative' slashtag. It's unclear whether anyone can (or does) use Google's index in that way (unlike BOSS, for example). On the browser side, since you don't know what the plug-in is going to kill, how do you select the 10 documents to send? It makes me wonder: if you're doing a search on a highly contested term (like 'weight loss') and Google can't know that the 10 blue links it is about to send you are all spam (and on really contested keywords this is not uncommon), are you left with just sponsored links and no organic results?
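To make that concrete, here is a rough sketch of what filtering by a 'negative' slashtag at the API level looks like; the function and field names are mine for illustration, not Blekko's actual API.

    # Illustrative sketch only; not Blekko's actual API. The point is that the
    # exclusion list is applied to results coming straight out of the index,
    # *before* the ten blue links are chosen, which a browser plug-in cannot do.
    from urllib.parse import urlparse

    def _host(domain_or_url):
        # Normalize a bare domain or a full URL to a host, dropping 'www.'.
        host = urlparse(domain_or_url).netloc or domain_or_url
        host = host.lower()
        return host[4:] if host.startswith("www.") else host

    def filter_with_negative_slashtag(results, negative_slashtag):
        # Drop any result whose host appears in the partner's exclusion list.
        banned = {_host(d) for d in negative_slashtag}
        return [r for r in results if _host(r["url"]) not in banned]

    # A partner says "no results from x, y, or z":
    partner_negative = {"x.example.com", "y.example.com", "z.example.com"}
    index_results = [
        {"url": "http://x.example.com/miracle-cure", "title": "Miracle cure"},
        {"url": "http://library.example.org/reference", "title": "Reference"},
    ]
    print(filter_with_negative_slashtag(index_results, partner_negative))
    # Only the library.example.org result survives.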

'Panda' update all you want; even Google now admits they are adding staff to curate results (Microsoft/Bing had already gone public with their 'editor's choice' announcement). This is a good thing, but it only covers half the situation. At Blekko we got flamed by a user for not having any 'alternative' medicine sites in our 'health' slashtag, which started a conversation with that user about creating their own slashtag with all of those web sites that were unfairly penalized by the medical establishment just because their methods and claims weren't the product of some 'big pharma' company. But they could (and I believe they did) create a slashtag of all those 'hidden' sites, and they got great search results from it. That reinforced for me and others that 'spam' can be relative, and that it really has two sides: user and index.

So pondering this move on Google's part makes for interesting reading.




I think it'd be illogical for Google not to compete, in some ways, with alternatives like Blekko in this case. Google's core business is search, and in the early days it was this very naive behavior that allowed Google to sneak right under places like Inktomi. Just because a search engine starts small doesn't mean it can't gain traction or market share, and if it has compelling enough features, it'll happen in a heartbeat.

I suggest reading "I'm Feeling Lucky" by Douglas Edwards - the early chapters describe a lot of Google's anxiety about being killed by other search engines for one reason or another. It helps to explain why they'd be quick to adopt new strategies to keep results fresh.


I'm not sure Google is even clear about what its core business is these days. It really seems that they've abandoned much of their approach and focus on being the information finder and are now a social-network wannabe.

It's as if something flipped 180 degrees - they went from being the company that wanted to help you find out about everything to being the company that wanted to find everything about you.


Google's core business is advertising.


Anyone who disagrees with this other Jacques is invited to take a squiz at Google's annual reports.

Any of them since the introduction of AdWords and AdSense. It doesn't matter which. They all basically read the same.


Indeed. Perhaps I was wrong to say that they don't currently know what business they're in; rather, they haven't known, and recently they've re-aligned around a singular strategy focused on that.

The problem is that for so long, they built their public facing brand on a completely different premise and image. Therefore, as they've re-aligned, they're facing a bit of cognitive dissonance in the minds of their products (err, the public).


Might the leadership change have anything to do with that?


If Google is still researching AI, it is likely because the war with the spammers can't be won any other way. Sooner or later you end up in a situation where the ratio of 'spam' to 'ham' is such that, no matter how good your algorithms and how good your computing infrastructure, you'll end up spitting out a lot of spam.

Human curation is a stop-gap solution: computers can generate spam faster than humans can filter it. You could argue that the humans will only have to look at the good stuff, but unfortunately, in any practical setup, they'll have to look at all of it to make a decision.

Google has stepped up the arms race against spam, and for a while their algorithms gave them an edge; now we have reached the point where the spammers have the edge again, and it will take another quantum leap before the good guys regain it.

It would be funny if we end up with AI mostly because of the spammers :)


It would indeed be an interesting situation if the spammers forced the creation of AI (either by building it themselves to make better spam, or by someone like Google building it to deny them).

One of the interesting things I have experienced in my time at Blekko is that "growth" on the Web isn't really growing all that much. Sure, there are trillions of pages being created, but there are only so many things the few billion people in and around the Internet care about. There are 'hard information' places, which are things like libraries where reference searches are common; there are 'entity' places, be they shops or service providers or SOMA startups; and there are 'transient' places where information is current and then stale, to be stored and later reconstructed, like coral, into a 'dead' (in terms of change) but 'useful' base. Seeing the web from the point of view of a web-scale crawler and indexer, it starts to be clear that the mantra "Organize all the world's information" is getting tantalizingly close to a dynamically stable froth.

I have to believe that Google has figured this out; some of the smartest engineers I've worked with are at Blekko, but Google has its share as well. When you trawl through the fishery and all you get are trash fish, you start to wonder, "hmm, did we actually catch all the fish there are?" So too it goes with "the Web".

I started doing some speculation [1] on how you could value information that is discoverable on the Internet. One of the schemes I came up with is to count how many people would find that information "useful", where useful really means they would have some reason to seek it out, and then to scale that by the value to them of having it. So, for example, if my genome were online, there are maybe a dozen people who would find it "useful", and of that dozen probably only the insurance actuaries and perhaps the occasional researcher would find it "valuable".

Now you take that unit of applicability/value and scale it again by the "cost" to acquire it (find it, index it, etc.). From that you can compute, or at least estimate, the total size of the Web. So far my upper bound (based primarily on the fact that world population is stabilizing) is about 72 trillion documents at any given time.
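As a rough sketch of that arithmetic (the numbers here are made up; only the shape of the model comes from the above):

    # Toy version of the valuation model above. All numbers are invented;
    # the point is the shape of the estimate, not the 72-trillion figure.

    def document_value(seekers, value_per_seeker):
        # Applicability x value: what the document is worth in aggregate.
        return seekers * value_per_seeker

    def worth_indexing(doc_value, acquisition_cost):
        # A document counts toward "the Web" only if its aggregate value
        # exceeds the cost of finding, crawling, and indexing it.
        return doc_value > acquisition_cost

    # Example: my genome online. A dozen people might ever seek it out,
    # and only a couple (actuaries, a researcher) would pay anything for it.
    genome = document_value(seekers=12, value_per_seeker=2.0)
    print(worth_indexing(genome, acquisition_cost=50.0))   # False

    # Summing the documents that clear this threshold, over a population
    # that is stabilizing, is what bounds the estimate at roughly 7.2e13
    # documents at any given time.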

When you look at it that way you can see that ultimately the spammers lose. They lose because, over time, the actual information that rises above the usefulness-vs-cost threshold is identified and classified, the legitimate channels that provide dynamic information and the legitimate archival sources that provide distilled information are, for the most part, known, and 99.99% of your users can find everything they want. As a spammer you are no longer given free rein to "appear and be indexed"; you have to ask for admittance through some form or another. And the rate at which new credible sources are created is inherently a function of the number of people that exist, and the number of people that exist is stabilizing.

When Yahoo started with its human curation it was vastly better than anything anyone else could do, and then it was overwhelmed by a combination of growth and algorithms that could much more rapidly infer curation from the social signals of bookmark pages and article references. Curation has come back into favor, and combined with machine-learning algorithms it will create what is essentially a stable corpus of documents[2] known as "the web".

Wikipedia is a great analog for what is happening worldwide. All the reference articles they want to put in are nearly done, the number of editors required has gone down, not up, and the future of web search is, in my opinion of course, similar. The only new ground on the web is social networking, and Google seems to understand that. It's fortunate for them that the only deep-pocketed entity that is possibly a near-term threat there is Microsoft, and since to date Microsoft is trying to do exactly what they did, they benefit from having already been through that part and knowing exactly what Microsoft will have to do next.

[1] My side hobby is attempting to discern the economics of information.

[2] Documents are just that, pieces of information. I hardly count every page rendered by Angry Birds in a browser as a separate document; in fact, applications are themselves a single "document" in the sense that some number of people will seek them out and connect to/consume them over what we think of as the "web" interface.



