Users would be happier landing on the root page of your site, or on the root pages of http://www.superfillers.com/ or http://www.filleritemfinder.com/, than on a deep page full of links and unrelated products.
Also: are any specific notices or warnings given in Google Webmaster Tools, to inform the webmaster and give some clues about the findings and decisions of your team?
Thanks Matt for taking time to explain things here.
Thanks for submitting the reconsideration request. I just added Disallow: /iframe.html?q= as suggested.
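For anyone following along, the resulting robots.txt would contain something like this (assuming the affiliate results are served from /iframe.html and the file sits at the site root):

```
User-agent: *
Disallow: /iframe.html?q=
```

Note that Disallow rules are prefix matches, so this blocks crawling of any URL beginning with /iframe.html?q=.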
A search for "hacker news" turns up a page full of affiliate links, just like the example you gave above. The only difference is that they didn't rewrite the URL.
filleritemfinder.com has no robots.txt that I was able to pull up.
So, filleritem.com, a google customer, was blocked, but filleritemfinder.com, doing the same thing, is the number one result.
Further, shouldn't this kind of advice be given to people who appeal being excluded from the index? Or should we all post to Hacker News when it happens to us so that you can come explain directly?
I think 50% of the problem is the arbitrary picking of sites to block (and it's not working, btw), and 50% of it is that Google seems uninterested in explaining or advising people when it happens to them.
Been buying gear for a project lately, and so doing a lot of Google searches in the form of product-model-number review or product-name review. Overwhelmed with spam sites, and mindless human-generated spam sites like dpreview.com, etc.
I mentioned filleritemfinder.com as a random example (there are many of these services), but filleritemfinder.com appears to use AJAX to keep results on the same page rather than making a new url.
"filleritem.com, a google customer, was blocked, but filleritemfinder.com, doing the same thing, is the number one result."
The filleritemfinder.com site is not doing the same thing, because it's not generating fresh urls for every possible search. But you're not really suggesting that we should treat advertising customers somehow differently in our search results, are you? The webspam team takes action regardless of whether someone is an advertising customer or not.
"shouldn't this kind of advice be given to people who appeal being excluded from the index?"
This advice is available to everyone in our quality guidelines. It sounds like the site owner reached out to the AdWords team, which gave him clear guidance that the site violated the ads policy on bridge pages. It sounds like the site owner also filed a reconsideration request, and we replied to let the site owner know that the reconsideration request was rejected because it was still in violation of our policies. It doesn't look like the site owner stopped by our webmaster support forum, at least that I could see in a quick look. At that point, the site owner did a blog post and submitted the post to Hacker News, where I was happy to reply.
Also, is there a preferred monetization model? E.g., does Google think advertisements are more or less harmful to the user experience than affiliate links, sponsored posts, etc.?
Obviously you can't just track the space taken up across different monetization models, so is there some kind of metric that tracks the rate of content dilution via monetization?
Our models currently suggest that the presence of contextual advertising is a significant predictive factor of webspam.
We use 10-fold bagging and classification trees, so it's not all that easy to generalize. But I pulled one model out at random for fun.
The top predictive factor in this particular model is the probability outcome of the bigrams (word pairs) extracted from the visible text on the page. Here are a few significant bigrams:
Next, this model looks for tokens extracted from the URL and particular meta tags from the page. Similar to above, but I believe unigrams only. A few examples follow. Please keep in mind that none of these phrases are used individually... they are each weighted and combined with all other known factors on the page:
The model then looks at the outdegree of the page (number of unique domains pointed to).
From there, it breaks down into TLD (.biz, .ru, .gov, etc.).
The file gets pretty hard to decipher at this point (it's a huge XML file) but contextual advertising is used as a predictive variable throughout.
Just from eyeballing it, it appears to be more or less as significant as the precision and recall rate of high value commercial terms, average word length (western languages only), and visible text length.
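To make the shape of that pipeline concrete: none of this is Google's actual code, just a toy sketch under invented data and feature names, showing bigram text features plus page-level signals (outdegree, a contextual-ads flag) fed into a bagged ensemble of simple trees (depth-1 stumps here, for brevity, rather than full classification trees):

```python
# Toy sketch of the pipeline described above. All pages, labels,
# and feature values are invented for illustration.
import random
from collections import Counter

def bigrams(text):
    words = text.split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

# (visible text, outdegree, has_contextual_ads, label) where 1 = spam
PAGES = [
    ("buy cheap pills online best prices", 54, 1, 1),
    ("our group studies consensus protocols", 7, 0, 0),
    ("work from home easy money fast", 31, 1, 1),
    ("release notes for the compiler toolchain", 12, 0, 0),
]

def features(text, outdegree, ads):
    # Bigram counts from visible text, plus two page-level features.
    f = Counter(bigrams(text))
    f["outdegree"] = outdegree
    f["contextual_ads"] = ads
    return f

def train_stump(sample):
    # A depth-1 "tree": pick the single feature whose presence best
    # separates spam from non-spam in this bootstrap sample.
    feats = set()
    for text, out, ads, _ in sample:
        feats |= set(features(text, out, ads))
    best, best_score = None, -1
    for feat in sorted(feats):
        score = sum(1 for t, o, a, y in sample
                    if (features(t, o, a)[feat] > 0) == (y == 1))
        if score > best_score:
            best, best_score = feat, score
    return best

def bagged_predict(stumps, text, out, ads):
    # Majority vote across the bagged stumps.
    votes = sum(1 for s in stumps if features(text, out, ads)[s] > 0)
    return 1 if votes * 2 > len(stumps) else 0

random.seed(0)
# Bagging: each stump is trained on a bootstrap resample of the data.
stumps = [train_stump(random.choices(PAGES, k=len(PAGES)))
          for _ in range(10)]
preds = [bagged_predict(stumps, t, o, a) for t, o, a, _ in PAGES]
print(preds)
```

On this toy data the contextual-ads flag perfectly separates the classes, so most stumps latch onto it, which loosely mirrors the observation above that contextual advertising shows up as a predictive variable throughout the real model.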
Based on what I'm looking at right now, my answer would be that sponsored posts are going to be far more harmful to the user experience than advertising.
Can't answer the rest of your question, which I assume relates to the number of ad blocks or the amount of space taken up by ads... we don't measure it.
Edit: Just realized that Google will probably delist this page within 24 hours. Should've used a gif for those bigrams. Oh well ;-)
"True" - because my current understanding (which Matt_Cutts can elucidate on if he chooses to) is that Google has looked into - but does not currently incorporate - the presence of advertising as a spam signal.
"Sadly" - because my independent research has shown that advertising - most notably the presence of Google AdSense - is a reliable predictive variable of a page being spam.
All things being equal, a page with AdSense blocks on it is far more likely to be spam. Yet as of a few months ago, that does not appear to weigh very heavily into the equation.
That way they're focusing less on removing spammers and more on user quality, and thus removing spam.
I agree, but don't you think that from an algorithmic point of view Google would be better off looking at what the user wants and what monetization models they prefer[...]
No. In fact, I am a rather loudmouthed opponent of Google's somewhat clumsy attempts to measure this a la "Quality Score".
In addition to webspam detection and machine learning, I have spent way too much time in marketing (I have a master's degree in marketing, in fact.)
A neat thing I learned along the way was the value of market research.
There are so many nuances in every line of business. Segments, preferences, pricing, even down to minutia (now well studied) such as fonts, gutter widths, copy styles, and so on.
You can learn a lot by combining large amounts of data and well-chosen machine learning algos. But even with a few thousand businesses in most categories in a particular country (far fewer outside of the US), that doesn't give an outsider enough data to truly distinguish what can be a winning formula from a spammy one. This knowledge is hard won through carefully executed experiments and research.
A few years ago I was researching the topic of landing page formulas by category. One example that stuck out most in my mind was mortgages. There were a few tried and true "formulas" that significantly outperformed the rest. Two stuck out:
1) Man, woman, and sometimes child standing on a green lawn in front of home. Arrow pointing down from top left of landing page to mid/lower right positioned form. Form limited to three fields.
2) Picture of home/s docked to bottom of lead gen page. No people. Light/white background. Arrow pointing down from top left of landing page to centrally located form.
These sites were incredibly successful. More than a few of them had to contend with quality score issues over the years. Can an algorithm capture nuances such as the ones I mentioned? In theory... they could. But today, they don't. All of Google's QS algorithms to date have been failed attempts and have caused an incredible amount of harm and distrust.
You finished that sentence with:
to see versus the averages in terms of monetization models on spam sites.
I'm not at all sure what this means. Could you explain? Is it even possible to directly model the monetization model of a site without having direct access to their metrics?
And assuming they could (despite your arguments against) determine the method of monetization on a site, then simply compare the models they see on spam sites versus another metric which would track the user's reaction to certain types of monetization models.
I can assure you that they cannot do this accurately. You would be amazed at the scummy business models that openly advertise on Google and are not caught. It would be incredibly difficult to do so, as some of them are downright ingenious (one example: free software that updates your drivers.)
It sounds to me that you are positing some type of magical technology that doesn't exist predicated on Google's seeming omniscience. Of course, I am eager to stand corrected...?
This. Assume you work on the webspam team and you have a 92% spam detection rate but a 99.9% accuracy rate on what you do detect.
There are around 40,000,000 active domains listed on Google in a given month. At a 0.1% error rate, that means roughly 40,000 sites on average are being penalized without reason.
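The back-of-the-envelope arithmetic works out like so (assuming, as the comment does, that the 0.1% error rate applies across every listed domain; strictly, 99.9% accuracy on detections would only bound errors among flagged sites):

```python
# Sketch of the false-positive arithmetic above. Assumption: the
# 0.1% error rate is applied across all listed domains, as the
# comment implies.
active_domains = 40_000_000
accuracy_on_detections = 0.999
error_rate = 1 - accuracy_on_detections    # 0.1%
wrongly_penalized = active_domains * error_rate
print(round(wrongly_penalized))            # -> 40000
```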