An obvious improvement to Google whose absence shocks the hell out of me would be this:
Personal domain blacklist.
There's a lot of spammy bullshit on the web and Google seems to have given up on keeping this away from me. Fine. But for my specific searches, there's usually a handful of offenders who, if I never, ever saw them again, it would improve my search experience by an order of magnitude.
So let me personalize search by blacklisting these clowns. Why can't I filter my search results so that when I search for a programming issue, I never see these assholes from "Efreedom" who scrape and republish Stack Overflow?
I don't, personally, need an algorithmic solution to spam. Just let me define spam for my personal searches and, for me, the problem is mostly solved.
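For what it's worth, the filtering step itself is tiny. Here is a minimal sketch in Python, assuming you already have the result URLs in hand (say, from an extension reading the results page). The blacklist contents and function names are made up for illustration, and "efreedom.com" is only my guess at the domain.

```python
from urllib.parse import urlparse

# Hypothetical personal blacklist; the domain names are examples only.
BLACKLIST = {"efreedom.com", "ehow.com"}

def is_blacklisted(url, blacklist=BLACKLIST):
    """Return True if the URL's host is a blacklisted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blacklist)

def filter_results(urls, blacklist=BLACKLIST):
    """Keep only the result URLs whose host is not blacklisted."""
    return [u for u in urls if not is_blacklisted(u, blacklist)]

results = [
    "http://stackoverflow.com/questions/1",
    "http://www.efreedom.com/Question/1",
    "http://www.ehow.com/how_1.html",
]
kept = filter_results(results)
```

Matching on the host suffix rather than a plain substring means "www." variants are caught without also hiding unrelated domains that merely contain the same letters.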
There is already an FF extension that does this. It's called OptimizeGoogle and they need help keeping it up to date. I've offered to help but the code is quite strange (came from an older extension called CustomizeGoogle) and I'm not sure it's worth saving.
A cross-browser effort that implements a few key features from OptimizeGoogle would be a very good idea. I'd be up for that.
Typically it's OK to err on the side of caution, but when someone offers to do a bunch of work if you just indicate that you'd like it done, the safe bet is to assume that they are in fact qualified. After all, their reputation is on the line in public.
Just a small note, SO puts all of its content under a Creative Commons license, which they are (in my non-lawyerly opinion) following.
Now, this doesn't mean that filtering them wouldn't be useful to you, since at first glance it appears they're solely a duplicate. Just pointing out that they're not actually doing anything wrong, and they're (probably) not scraping.
I mean, their use of SO's content may be legal but it doesn't mean they aren't dicks. It's a wholesale ripoff of another site's content that adds absolutely no value to anyone but the publishers themselves. It inconveniences users and it harms the good work of Stack Overflow by robbing them of rankings they deserve.
In the same way I can call someone's mother bad names because it isn't illegal, it doesn't mean I should do it, because I can follow the letter of the law 100% and still be an asshole. Overall, my policy is that it's best not to be an asshole and it annoys me when others can't share that basic ethos.
The same happens with Wikipedia as well. It's free-content, because that's sort of the point of the project. And reusing that content is great and encouraged. But just rehosting the exact contents of en.wikipedia.org with ads slapped on is a bit lame. Legal, but it's not any sort of interesting reuse, just adding more noise to the internet.
You know, I am starting to think more and more that copyright should ALLOW peer-to-peer sharing but not republishing. That's where the line should be drawn. There should be a clear definition of publishing, e.g. serving content upon request to anyone immediately on demand. That way, if the authorities can download something copyrighted from a public source which is not the original author, they can go prosecute that source. Just a thought. Ideas?
It's actually not easier to eliminate intellectual property altogether. IP laws haven't been enforceable for years, but so far that's resulted in revenues declining across certain industries rather than going away completely. Think of all the change in the past 15 years the entertainment industry has faced and imagine how much bigger in magnitude--and how utterly sudden and catastrophic--immediate IP repeal would end up being.
I'm not saying whether it's a good or bad idea, I'm saying whether it's easy or not. There's a difference between "creative destruction which will end up good in the long run" and "easy"--in fact, they're nearly exact opposites. Just look at the fallout from this last recession for an example.
(On a side note, capitalism is defined by the legal enforcement of property rights. Abolishing intellectual property is probably the exact opposite of capitalism. The word you're looking for is "market".)
Property rights, as far as capitalism is concerned, are an artifact of law, not some fundamental philosophical truth. For capitalism to exist in a given domain, the government has to enforce property rights in that domain. That includes everything from cap-and-trade (where there are property rights to air emissions), water rights, real estate, equity in businesses, and yes, even copyrights and patents. Whether any of this actually constitutes "property" in a philosophical sense is an especially worthless genre of philosophical argument.
really, so you would be totally OK if someone copied your article and slapped their name on it, or took your fictional characters and wrote some stories where they are made to be the scum of the earth?
“You shall be solely responsible for your own Content and the consequences of submitting and publishing your Content on the Service. You affirm, represent, and warrant that you own or have the necessary licenses, rights, consents, and permissions to publish Content you submit; and you license to YouTube all patent, trademark, trade secret, copyright or other proprietary rights in and to such Content for publication on the Service pursuant to these Terms of Service.
For clarity, you retain all of your ownership rights in your Content. However, by submitting Content to YouTube, you hereby grant YouTube a worldwide, non-exclusive, royalty-free, sublicenseable and transferable license to use, reproduce, distribute, prepare derivative works of, display, and perform the Content in connection with the Service and YouTube's (and its successors' and affiliates') business, including without limitation for promoting and redistributing part or all of the Service (and derivative works thereof) in any media formats and through any media channels. You also hereby grant each user of the Service a non-exclusive license to access your Content through the Service, and to use, reproduce, distribute, display and perform such Content as permitted through the functionality of the Service and under these Terms of Service. The above licenses granted by you in video Content you submit to the Service terminate within a commercially reasonable time after you remove or delete your videos from the Service. You understand and agree, however, that YouTube may retain, but not display, distribute, or perform, server copies of your videos that have been removed or deleted. The above licenses granted by you in user comments you submit are perpetual and irrevocable.”
That's how YouTube does it, because YouTube is a channel for individuals to share their own content. Wikipedia, on the other hand, requires CC-BY-SA/GFDL licensing, because Wikipedia is a project to develop a knowledge base.
The question is whether StackOverflow is closer to YouTube or Wikipedia. I think it's closer to Wikipedia because it's a curated reference source, not just a medium for self-expression.
I am torn. Why bother CC licensing if they didn't want you to do this, you know? My first instinct is "What a jerk!" but they're also spreading more knowledge around, which is something I can't really fault.
They're not spreading knowledge around, though. By copying the content without adding anything, they're removing impressions from the real site, which is where all of the relevant related content lives: comments, upvotes, discussion, etc. They're preventing the knowledge from getting out there.
blekko has that feature. every result has a "spam" button underneath it. Click the link, and the host will be added to your personal /spam slashtag. Everything on your /spam list gets negated from all of your results by default.
Very handy. I put ehow.com on mine and never see results from them.
At one time Google did provide this: if you were logged in, there was an [X] option to remove that result from your searches, as well as a voting up and down mechanism. I think it was just an experimental feature, but I wish they would have kept it and expanded on it. If Google is reading... please bring it back.
One idea that comes to mind to deal with the wikipedia / stackoverflow problem is result clustering. With Google News, they have done a pretty good job of clustering articles on a single story. They are getting better at deriving the original source in many cases. The simple act of duplicate detection should enable them to identify sites that scrape content and show them as duplicate results.
In the interests of results diversity, you don't want the same content repeated ten times on the first page, although this has the side effect of pushing the original source onto the second page if you guess wrong.
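To make the duplicate-detection idea concrete, here is a rough sketch using word shingles and Jaccard similarity. This is a textbook technique I'm swapping in for illustration, not a claim about how Google or Google News actually does it. The threshold, function names, and example URLs are all assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_duplicates(pages, threshold=0.8):
    """Greedily group pages whose shingles are similar to a cluster's first page.

    pages is a list of (url, text) pairs; returns a list of lists of URLs.
    """
    clusters = []  # each entry: (shingle set of the first page, [urls])
    for url, text in pages:
        s = shingles(text)
        for rep, urls in clusters:
            if jaccard(s, rep) >= threshold:
                urls.append(url)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append((s, [url]))
    return [urls for _, urls in clusters]

pages = [
    ("original.example/q1", "how do I reverse a list in python without using the built in reverse method"),
    ("scraper.example/q1", "how do I reverse a list in python without using the built in reverse method"),
    ("other.example/a", "best dishwasher reviews for small kitchens and apartments"),
]
clusters = cluster_duplicates(pages)
```

Note that this only says which pages are copies of each other. Deciding which member of a cluster is the original source is the hard part mentioned above, and nothing here attempts it.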
Problem is that this was tried before, with SearchWiki. When it launched, it was widely derided as being useless and distracting. Its usage numbers didn't show widespread adoption. And then when it was removed, there was much rejoicing.
Now, it's possible that SearchWiki just needed a few more iterations, and with a few details changed, could be a big success. There have been a few other recent launches that were tried years ago, didn't work then, but had a few more iterations and now are big successes. I could at least raise the issue. But unless I can tell a convincing story about why people would use this when they didn't use SearchWiki, it may be an uphill battle to get resources devoted to this.
I think SearchWiki solved a different problem. It was about, approximately, globally curated results for certain searches. What's being described is personalized curation, at least that's how I read it. I'd definitely be into such a feature, and it really doesn't seem like a tremendous undertaking to make it something opt-in via Labs. I also recall SearchWiki being about specific results, which is not desired behavior for the personal curation experience I have in mind.
Similarly, I'd like a blacklist or advanced search operator that, for certain queries, would allow me to exclude all sites with AdSense on them. Might have to wait for Blekko or Bing to provide that, though.
This is something a browser extension could do (though not as well as the engine itself, of course). There is a Chrome extension called "Google Blacklist" but it didn't seem to have any effect when I tried it out. Perhaps others will have better luck.
The only reason we don't have it is we don't have accounts.
It's not even really a complaint. I am glad that when I show up on DDG (which I still do several times per day) I get the same high-quality results without regard to who I am.
I'm also happy to take requests to ban these stupid sites for everyone.
The problem is that I have, for example, en.wikipedia.org marked as spam, simply so that their juice doesn't overwhelm my search results. It makes sense for me, but I suspect it's not even close to what your average user wants or expects.
In any case, thanks for the recent addition of non-Google options for searches when DDG runs out of results. Small as it may seem, I consider that a major step in the right direction.
A user marking a website as spam is hardly a problem.
Haven't you used Gmail? If I tag a site as spam, DDG shouldn't show it to me. If a thousand users mark it as spam... then it starts to be clear that it is a shady site, and DDG should eliminate it from its system.
Uhm, but now we have a vector for script kiddies to ban sites from a... who knows, maybe in some years... major web search.
Perhaps definitively removing a site should be done by a human operator.
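A sketch of how those two tiers might fit together: a flag hides the site right away for the person who flagged it, while global removal only ever gets as far as a human review queue. The class and method names are hypothetical, and the threshold of 1000 is just the figure from the comment above.

```python
from collections import defaultdict

REVIEW_THRESHOLD = 1000  # illustrative figure only, taken from the discussion

class SpamReports:
    def __init__(self, review_threshold=REVIEW_THRESHOLD):
        self.review_threshold = review_threshold
        self.reporters = defaultdict(set)  # domain -> set of user ids who flagged it

    def report(self, user_id, domain):
        """Record a flag; repeat flags from the same user count only once."""
        self.reporters[domain].add(user_id)

    def hidden_for(self, user_id, domain):
        """A flagged domain is hidden immediately, but only for users who flagged it."""
        return user_id in self.reporters.get(domain, set())

    def review_queue(self):
        """Domains with enough distinct reporters to hand to a human operator."""
        return sorted(
            d for d, users in self.reporters.items()
            if len(users) >= self.review_threshold
        )
```

Counting distinct accounts stops one user from flagging the same site a thousand times, but it does nothing against someone with a thousand accounts, which is why the last step stays manual in this sketch.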
This issue in a roundabout kind of way touches on Facebook.
The issue of social search has a lot of mindshare. Some think it is the future of search. I disagree.
One of the things that made search successful and useful early on was scale. Instead of having to go to the library or ask your friends, you can effectively canvass the connected world.
I find the notion that friends' recommendations will replace that as nothing short of bizarre. It's like a huge step backwards. The argument is that you can filter out the garbage as your social graph will provide a level of curation.
Let me give you a concrete example. If I wanted to buy a camera I'd still need to go to dpreview and other sites. It's highly likely that my friends don't really know a lot about this (but some will have an opinion anyway).
This same idea of human curation is behind such sites as Mahalo and the garbage sites themselves to a degree. Of course at some point computers will be powerful enough to generate this garbage content.
Blekko's idea of slash tags is interesting (to a degree) but if it's successful it's easily reproducible. Google is still in the box seat here but of course that's no barrier to a link-baiting TC title.
Personally I'm an optimist. I believe that, much like email spam, the garbage from AC, DM and others is a transitional problem (email spam is basically a solved problem now if you use a half-decent email provider). If they succeed we won't be able to find anything. I don't believe that'll happen so these services are therefore doomed.
So betting on Demand Media is (to quote Tyler) like betting on the Mayans (meaning betting they're right about the world ending in 2012: it doesn't really matter if you're right).
There are other options besides social or text-based algorithms. My company, Pikimal, is pursuing one - we're pulling together facts on items and are allowing people to weight those facts to dynamically create recommendations. We've added some "user rating" facets but are just barely past our minimal viable product.
I very much agree with you that social is limited. It's a filter which avoids spammers, but it also filters out experts and users. It also solves some of the problems introduced by the generic nature of search algorithms.
I'm betting my time on the idea that the solution which wins will combine an understanding of the product space, the value of new features, a current understanding of price, and which can be customized transparently to the needs of the user.
I'd divide the possible filtering processes into three approaches.
* Throw all your questions to all your friends and all the review sites you "trust". This means any third party site you trust gets a rather excessive ability to spam you - even sites with good user reviews on them are trying to push pop-up windows on me. Just because I've gotten good info from X once doesn't mean I want any more from it.
* Throw friend recommendations and trusted sites to a third party "meta-filterer" who organizes things for you. You'd have to really trust that site and essentially there's no reason they'd be better than Google.
* Do the filtering yourself. Most people essentially do that now. I'm working on a project to create tools to automate and improve this process. Create your own relevance and topic-weighting system that adaptively filters all the other filters. I believe that this kind of approach is eventually going to be needed. Not so much because each person can or should do all their topic-relevance-weighting but because this approach would keep the other systems honest.
Eventually, people are going to realize that both their social graph and their content-relevancy-weightings/algorithm are far too personal to farm out unquestioningly to a third party. The present social networking system is like AOL email in 1992, except with the added provision that the provider can look at and alter your emails as part of the service. I'd envision this stabilizing into something like the present email system, where webmail exists but has to work more or less the same as email to your personal computer.
Indeed. For such filtering to work, a significant portion of your friends should be searching the entire network and not just filtering through friends. But this might just work out to an early-adopter/follower divide. Friend network filtering will work better for the followers. The early-adopters/leaders will continue to use other means of search and will feed the friend-network.
That assumes the same group of people will be the early adopters for everything, from recipes, to gardening tools, to what video card to get. If your peers don't happen to be knowledgeable about what you're seeking, then you're SOL.
Your best bet ends up being to simply include everybody, which improves your chances of finding an expert opinion, regardless of the obscurity of the query.
The problem with 'social' search is that it reduces the universe of documents to what is known to the members on the platform. There is already data that is not crawled by most of the search engines, which itself is a problem. This only makes it worse.
"Google does provide an option to search within a date range, but these are the dates when website was indexed rather than created; which means the results are practically useless."
I believe the author is mistaken on this point. Quick proof is to do a search for [matt cutts] and you'll see the root page of my blog. Click "More search tools" on the left and click the "Past week" link. Now you'll only see pages created in the last week, even though lots of pages on my site were indexed in the last week.
BTW, why is it that Google has not opened up its core search engine to third party developers so that their code can be used to bring up some of the search results by default (without requiring the user to subscribe to third party features)?
Most people have never used Google's Subscribed Links feature because it is not enabled by default.
User feedback can be used to determine which third party code to use in which contexts. Spam/unhelpful features would be detected quickly. There would be intense competition among third party developers for highly desired features.
Something like this could give you Wolfram Alpha like features among other things such as custom UIs for various searches (e.g., travel).
The Google App Engine could be used for computation. You could pay third party developers by how often their code is used in search results.
Just speaking for me personally, I'd be a fan of trying ideas like that. The potential danger would be that some code could introduce latency or other things that would make the search experience worse instead of better.
There are huge risks with this approach, but there's also a lot of potential. Search could look quite different.
It would be an ideal place to experiment with and profit from novel search ideas. For example, third party developers may experiment with query-induced flash mobs where people who just performed a similar query could collaborate in real-time to find the information they need.
Finding ways to prevent such an ecosystem from descending into absolute chaos would be a fascinating challenge.
It seems more likely that the average developer would introduce latency when searching through a database of billions of text documents than when making an application that makes a fart sound when you press a button. Just sayin'.
Indeed, this is true when using the search tool in the sidebar which sets the cd_min and cd_max parameters in the query string. However, if you were to use the daterange operator directly in the search input box, you'd see all the content indexed in the last week.
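For anyone who wants to try the operator form: as far as I know, daterange: expects Julian day numbers rather than calendar dates, so treat that as an assumption and check it before relying on it. A small sketch that builds such a query (the function names are mine):

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal and the Julian Day Number.
# Check: date(2000, 1, 1).toordinal() == 730120, and that day's JDN is 2451545.
JDN_OFFSET = 1721425

def julian_day_number(d):
    """Convert a calendar date to its integer Julian Day Number."""
    return d.toordinal() + JDN_OFFSET

def daterange_query(terms, start, end):
    """Append a daterange: operator covering start..end to the search terms."""
    return f"{terms} daterange:{julian_day_number(start)}-{julian_day_number(end)}"

q = daterange_query("matt cutts", date(2011, 1, 1), date(2011, 1, 7))
```

This only builds the query string. Per the comment above, the operator filters on index date, while the sidebar tool's cd_min and cd_max parameters behave differently.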
The date range is one of Google's greatest features. It's not perfect but it is great. I would like to see "past 2 years" added to the list. Sometimes 1 year is not enough and it is too time consuming to type a custom date range. But it is wonderful at filtering out 5 year old content. This is a feature I use every time I am looking for code samples on the Internet.
This is exactly what blogger Paul Kedrosky found when trying to buy a dishwasher. He wrote about how he began Googleing for information…and Googleing…and Googleing. He couldn’t make head or tail of the results. Paul concluded that “the entire web is spam when it comes to major appliance reviews”.
So I happen to know somebody who is taking a small section of the home appliance market and creating content around it -- reviews, news, advice, a place for other consumers to talk to each other.
Of course to do this you need to have income, so they are going to use some sort of ad-supported model.
My question is very simple: is their project a spam site or not? To some, I guess it would qualify. To others, not.
You see, there are two questions when it comes to search results: 1) Am I being presented results that match the query I entered? and 2) Am I being presented results that match what I want to know?
These are two entirely different things. A third-grader looking for information on a movie star might find a games page with all sorts of information on that star -- all sponsored by some kind of adsensey stuff. And he's very happy. A researcher typing in the same question gets the same page? He's pissed.
There is no universal answer for any one question. It's all dependent on the culture, education, and intent of the user -- all of which are not easily communicated to a search engine.
Look -- this is a real problem. I hate it. Sucks to go to pages you don't like. All I'm saying is that it's more complicated than "we need a new Google" Finding what you want exactly when you want it is a difficult and non-trivial problem. We just got lucky in that Google found a simple algorithm that can be helpful in some situations. It may be that we're seeing the natural end of the usefulness of that algorithm.
To me that depends a LOT on how they present the advertisements and what they do on pages where they don't have information for a product. My biggest complaint with so-called review sites is that they present more advertisements than content. They also tend to have automatically generated pages for every model number you can imagine, including incorrect ones that they receive searches for. On those pages there tend to be links to shopping sites and prices for unrelated appliances and products. I absolutely consider that to be spam because those pages are content free.
The problem with Google searches lately, and especially for things like appliances, is that the spammers and content mills are clearly winning. In my current search for a washer and dryer the manufacturer's page was on page four or five of the results. The first few pages were flooded with bogus content pages, sale pages and unrelated pages. Trying to filter out shopping sites and explicitly target specific keywords and filter others doesn't help. There are a growing number of search topics for which Google is simply broken.
In regards to the style of the pages, I completely agree with you -- probably because we have similar tastes in content. Others, however, might not. I'm not saying that flashing text and floating images are okay, just that the threshold between "mildly annoying" and "total shit" is a personal thing.
The problem is, as you point out, that most people, most of the time, are beginning to see results they don't need or like when they type a search. This is a big problem for both searchers and the companies that provide search. If you create an algorithm for directing people's behavior (a search engine) folks are going to game it. You and I might not like it, but "gaming the way people do things" is called marketing in any other context and has been around for hundreds of years.
This leads me to suspect that no simple (or even complex) system of finding things for people is ever going to work for an extended period of time. It's a radar vs. radar detector problem. It's a natural competitive situation.
But it doesn't have to be all bad. From competition and fitness criteria comes evolution. Spammers and search engines will probably be a key part of how AI evolves. It'll be neat to see if we move beyond Bayes -- and if so, how would that work?
The one thing you bring up that's interesting is what to do with bad searches. How do you deal with a mis-typed part number? Should a system know which part number you have? If so, how would that be done?
I think the spammers covering all the misspellings are doing a service -- as long as the site isn't obnoxious and provides the user with the information they are looking for. We think of it as a failure of Google, but in fact it looks like a win: thousands of little spammers trying to find all the mistakes I make and providing content for them -- as long as they have my best interests in mind (and are not trying to trick me). I'll happily look at an advertisement for a Ford Explorer in return for valuable information on my 1978 dishwasher that I couldn't read the entire part number for. And I hate ads. I like that scenario a lot more than looking for a favorite mp3 for a cell phone ringer and spending the next 3 hours in spammer hell.
> The one thing you bring up that's interesting is what to do with bad searches. How do you deal with a mis-typed part number? Should a system know which part number you have? If so, how would that be done?
There's lots of ways to handle this but doing a fuzzy search for similar or possibly related part numbers should be easy enough. The user can then be presented with those search results. If the model/part number isn't found you could even provide them the option of adding content for that part if the site takes user generated content. I don't think I've seen any site take this approach but instead go with the spammy show a ton of ads approach instead.
Appliance models and part numbers is actually pretty interesting. A couple of people I know built and maintain a simple desktop application for smaller appliance retailers and parts/service companies. The application contains a database of all valid part numbers issued by every major appliance vendor for the last twenty years or so. This information is updated about once per week by the manufacturers and they supply this information freely to anyone that wants it or to members of specific programs. Some of this information is provided via faxes or emails which sucks but data entry can be farmed out to temps. This company aggregates the data and provides it as a service to their customers. The application can do full or partial matches for part numbers and can filter based on appliance type. If a small two man team that doesn't even work on the project full time can successfully manage that I don't see why the big web based sites are so full of bogus content and spam.
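A rough illustration of the fuzzy-match idea using Python's standard difflib. This is not how the desktop application described above works (I have no idea what it uses). The part numbers are invented, and the cutoff is an arbitrary starting point.

```python
import difflib

# Invented part numbers, stored without spaces or dashes. Illustration only.
KNOWN_PARTS = ["DW1978A100", "DW1978A101", "DW1980B200", "WM2001C300"]

def normalize(part):
    """Uppercase and drop non-alphanumerics so 'dw-1978-a100' becomes 'DW1978A100'."""
    return "".join(ch for ch in part.upper() if ch.isalnum())

def suggest_parts(query, known=KNOWN_PARTS, n=3, cutoff=0.7):
    """Return the exact match if there is one, else up to n close matches."""
    q = normalize(query)
    if q in known:
        return [q]
    return difflib.get_close_matches(q, known, n=n, cutoff=cutoff)

exact = suggest_parts("dw-1978-a100")
near = suggest_parts("DW1978A10O")  # letter O typed instead of a zero
none = suggest_parts("ZZZZ")
```

When nothing clears the cutoff the function returns an empty list, which is the point where a site could offer the "add content for this part" option instead of a page of ads.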
> There's lots of ways to handle this but doing a fuzzy search for similar or possibly related part numbers should be easy enough. The user can then be presented with those search results. If the model/part number isn't found you could even provide them the option of adding content for that part if the site takes user generated content. I don't think I've seen any site take this approach but instead go with the spammy show a ton of ads approach instead.
Fuzzy logic searches would be awesome.
The problem here, of course, is that the site doesn't own the search program. They can only influence it in certain predefined ways. So if you're building a site for dishwashers from the 1970s and you know that folks consistently misspell some brand name? You either provide a page for that misspelling that Google can crawl or those folks don't get content. Assuming you're doing a quality site, folks who can't spell need content as much as those who can. Yet if you provide a page based on a misspelling folks will yell "spammer!". It puts you in a bind. There's no answer everybody is going to be happy with.
I think people can tell whether or not site owners are trying to help them out or just trying to trick them using Google. At least I hope so. I know that, as much as I hate ads, I'd be happy if I never saw one again for the rest of my life. I have to be careful not to take that personal opinion and apply it to all site creators, however. There's nothing wrong with noticing that folks are looking for something, can't find it, and providing content in that area.
In a lot of ways Google is a victim of their own success. The net was so new, the algorithm so cool, that it looked a lot like magic. People got used to the magic and forgot that it's just a computer program somewhere. I think we may expect too much.
Sure, but it can also have to do with different people perceiving inputs differently. While I lament the content duplicators with a passion, I deal with it myself with longer search queries. Grouping and requiring, excluding particular domains (like efreedom), and so on. I know we all probably do this to some extent, but I've found that, sure, you have to exclude 10 different domains, but it does winnow down the results in a useful search. I'd rather be able to search with three words, but for now those days are over.
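If you end up typing the same exclusions over and over, a few lines of script can build the query for you. A small sketch; the function name is made up and "efreedom.com" is only my guess at the domain. The only search syntax it relies on is quoted phrases and the -site: operator.

```python
def build_query(terms, exclude_domains=(), required_phrases=()):
    """Join search terms, quoted required phrases, and -site: exclusions."""
    parts = [terms]
    parts += [f'"{p}"' for p in required_phrases]
    parts += [f"-site:{d}" for d in exclude_domains]
    return " ".join(parts)

q = build_query(
    "python reverse list",
    exclude_domains=["efreedom.com"],
    required_phrases=["in place"],
)
```

Keeping the ten or so excluded domains in one list at least means the three-word search is all you have to type each time.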
"Unfortunately, it isn’t just appliance reviews that are the problem. Almost any popular search term will take you into seedy neighborhoods." Quantification? Detailed examples? Link to his research? All missing in the original article.
My thought when I read the Kedrosky thing months ago was very similar. Many times the reason you aren't getting good Google results is that there isn't good indexable content. Consumer Reports is a leader in this space and they are behind a paywall. I started exploring the possibility of creating a site for appliance reviews as well.
It will be interesting to see how this impacts the Android/iOS battle. Search revenue funds almost all of Google's other activities so if people start using other search engines or find alternate ways to get their content it could impact the level they can spend on phones.
With a push to a mobile first world the Android model is especially sensitive to spam. On a full size browser you have a lot more context and results for a given search. 5 Results may be spam, but you can work around them. If the average phone screen shows 3-5 results and all of them are spam you will quickly find alternate tools.
Google ignoring spam is like Microsoft ignoring the cloud.
It won't. Whoever comes up with a better search engine is going to get a large check from Google, and a monster check if he can make MS and Google go into a bidding war over the business.
A better search engine is not what will do Google in, because they would understand the danger it posed to them. What will do Google in is a business which they don't understand that can kill search engines as a place to do business.
Ignoring the damage of incentivized false-positives in searching our highly-connected global compendium of human knowledge and activity is at least as stupid as major producers of low cost consumer information goods and services failing to capitalize on the on-demand delivery and storage capacities of said compendium. Avoid both, please.
This is like writing stuff for the Voyager's Golden Record.
The article calls out two specific companies as "landfill in the garbage websites that you find all over the web." Reasonable people can disagree over whether such content is truly spam or low-quality content, and thus how to respond.
What is the difference and how should they respond? It seems to be a rising frustration among power users that Google is increasingly becoming a wasteland populated by spam. For example, Marco Arment recently commented on his podcast how hard it was to find answers to simple questions on Google these days. He was saying that the content farms have basically created a page for every PHP function with thin content and rendered it useless. For a company whose goal is to index all human information it is a pretty big warning flag.
What is the appropriate user response? Go to Stack Overflow? Find a branded knowledge base like O'Reilly's Safari? I'm genuinely curious to know what we can do.
Disagree on the PHP function thing.
For nearly all function names the php.net page is the top result, even when there is a C function of the same name.
Occasionally there is a w3schools or similar close to the top, but it's not like those guys have just wholesale ripped the docs.
I was referring to how Google should respond to content farms. Historically, Google has been willing to take manual action on webspam. With the rest of search quality and ranking, Google tries to use algorithms as much as we can. So the distinction of whether something is spam vs. low-quality is an important one within Google.
If we were to pick a random Demand Media or Associated Content page, what are the odds a reasonable person would prefer its writing to a less-cynically-optimized, less-ad-drenched alternative page that the test page has bumped out of the top results?
Sounds more like a semantic excuse to justify the proliferation of adsense riddled spam content mills that are chasing keyword search volume and making Google a hefty sum each year rather than a legitimate conundrum.
I am not opposed to the content mills per se, mainly because I think that they aren't doing anything "destructive", i.e. comment spamming, phishing, etc. They're using the tools and data that Google freely provides to produce laser targeted content that the algorithm eats up and ranks high because of their domain authority and onsite interlinking. It's essentially a battle between them and Google, and clearly Google is fine with them doing what they do or else they would have stopped them years ago.
What most (power) users are opposed to are scraper sites that reuse other sites' content to rank higher than the original content source, and tactics like eHow uses where they have 10 different articles about how to tie your shoe, but each one has a title that matches a different long-tail version of the search query.
Again though, the issue is that Google isn't reacting to and clearing out content spam, most likely because the sites add to Google's bottom line, and a spam engineer modifying the algo to remove the powerhouse content mill sites could actually hurt Google's revenue.
Also, when we complain about search quality, we're a vocal micro-minority. Most people, like you said, find these sites useful, and haven't even thought about the implications of these content mill sites.
"I was referring to how Google should respond to content farms. Historically, Google has been willing to take manual action on webspam. With the rest of search quality and ranking, Google tries to use algorithms as much as we can. So the distinction of whether something is spam vs. low-quality is an important one within Google." - Matt_Cutts
I know you are probably asked to respond about specific spam cases constantly, so don't take this as me demanding an answer for this specific instance. However, eHow is clearly leveraging their domain authority here to scrounge up the traffic for each different long tail variation of the term "How to Tie Your Shoelaces".
The reason they're targeting each of these phrases with a different page of content is because of the data Google gives them (and all of us) about who is searching for what and how many times per month, coupled with the fact that they have a mega powerful domain which, when a new page of content is added to it that uses an exact keyword in its title, that page will rank top 5 in Google almost every time.
Therefore, the data that they're using to come up with these keywords to feed their gaggle of writers is the related keyphrases data provided by your keyword tool. Algorithmically, this should be easily detectable, as you guys have the list of related keyword data that they're using in the first place.
Why then, are they in the top 5 for each of these keywords? Are 3+ different guides on how to tie shoe laces really necessary? Shouldn't 1 page be ranking for all 3+ of these tight variations? Shouldn't dozens of related pages of content targeting minute keyword variations be something relatively easy to detect?
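To make the "relatively easy to detect" claim concrete, here is a rough sketch of one way to flag a domain that publishes several pages whose titles are tight variations of each other. It is only an illustration: the stopword list, the crude plural stripping, the 0.6 similarity threshold and the three-page minimum are all assumptions of mine, not anything Google is known to use, and it would still miss variations like "shoe laces" vs "shoelaces".

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical stopword list; a real system would need something far richer.
STOPWORDS = {"how", "to", "a", "an", "the", "your", "my", "of", "for", "do", "i"}


def title_tokens(title):
    """Lowercase a title, drop stopwords, and crudely strip a trailing plural 's'."""
    tokens = set()
    for word in re.findall(r"[a-z0-9]+", title.lower()):
        if word in STOPWORDS:
            continue
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        tokens.add(word)
    return frozenset(tokens)


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_clusters(pages, threshold=0.6, min_pages=3):
    """pages: iterable of (url, title) pairs.

    Returns {domain: [list_of_urls, ...]} for every group of at least
    min_pages pages on one domain whose titles are near-duplicates.
    """
    by_domain = defaultdict(list)
    for url, title in pages:
        by_domain[urlparse(url).netloc].append((url, title_tokens(title)))

    flagged = {}
    for domain, items in by_domain.items():
        clusters = []  # each cluster is a list of (url, tokens)
        for url, tokens in items:
            for cluster in clusters:
                # Compare against the first page that started the cluster.
                if jaccard(tokens, cluster[0][1]) >= threshold:
                    cluster.append((url, tokens))
                    break
            else:
                clusters.append([(url, tokens)])
        big = [[u for u, _ in c] for c in clusters if len(c) >= min_pages]
        if big:
            flagged[domain] = big
    return flagged
```

Fed three shoelace guides from one (made-up) domain plus an unrelated page, this groups the three guides together and leaves a domain with a single such page alone.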
Seeing multiple Adsense units on these obviously SEO-fueled pages I've linked to leads me to believe there's at least a little bit of truth to what you quoted me saying.
"Seeing multiple Adsense units on these obviously SEO-fueled pages I've linked to leads me to believe there's at least a little bit of truth to what you quoted me saying."
Speaking as someone who has worked at Google for ~11 years and worked on spam at Google for ~10 years, I can tell you that running AdSense doesn't get you any kind of special consideration in Google's rankings. You don't have to believe me, but it's true. :)
Matt: Google historically has tried to do most everything algorithmically. blekko does allow you to identify content farms, but blekko is more human based response. Google is having an active debate about this. If you can’t algorithmically identify a content farm, is it still ok to take action and remove a site?"
The other relevant write-up was at http://www.seroundtable.com/archives/023229.html and they transcribed the discussion as
Barry Schwartz: Q: Brian asked, what is google doing in terms of content farms?
Barry Schwartz: A: Matt fed this Q to Brian earlier ... hehhehe
Barry Schwartz: Tricky, Matt's team is in charge of web spam. If web spam doesn't last long in the index, what do they do? So a content farm is the bare min someone can do to get in to the index, but its borderline
Barry Schwartz: Some people in Google dont consider content farms as web spam
Barry Schwartz: They have been a little worried about people passing judgement on sites if it is a content farm a useful site.
Barry Schwartz: Think of Mahalo, Wikia, Blekko
Barry Schwartz: Those sites provide a curated experience
Barry Schwartz: It is a really interesting tension here, they don't want to bring Humans into the mix... They will let computers do it
Barry Schwartz: This is an active debate
Barry Schwartz: May Day, at least partially, was a first pass at this.
Barry Schwartz: If you can't algorithmically detect content farms, then do you take manual action?
Barry Schwartz: This is the problem they are thinking
Barry Schwartz: So if they do anything on this, they will update their guidelines
Barry Schwartz: This is an active debate in Google and we will see where we go
Barry Schwartz: Someone asked, Matt, what side are you on?
Brian Ussery (@beussery):
Matt says users are angry with content farms
Barry Schwartz: Matt said, users are not happy with content farms so he wants them out of the index."
Matt, thanks for responding. I guess I'm not exactly sure where to end this back and forth, but your latest response does open up a few questions of mine.
I've never argued that running Adsense helps a site rank higher, I know that it doesn't. But I do believe that sites like eHow, who presumably make Google millions of dollars a year, are given a free pass to pursue content spam like I posted in my previous comment without any sort of repercussions that we can see. They're leveraging their domain authority and producing very low quality articles to target obscure long tail variations of keywords to keep getting that traffic.
What concerns me is "If you can’t algorithmically identify a content farm, is it still ok to take action and remove a site"... is the issue that the algorithms aren't sophisticated enough to catch these content mills spitting out article after article of low quality, long-tail targeted content, or that you guys have thrown in the towel and believe that if the algos aren't throwing flags, then the sites are fine?
I posted up an example of a content mill type situation in my last response. To most people, a manual review should throw up a warning flag if the goal was to identify people targeting keywords rather than trying to help people. The top 5 rankings for each of those pages shows that neither algorithmic nor manual measures are in place to deal with such a situation.
I have 10+ content sites targeting random niches. I know how the SEO game works. I know dozens of internet marketers who have dozens of their own sites each who know how to game the algo to rank high with low quality content sites like these. It's obvious people are taking advantage of the algorithm, but it doesn't appear to be drastically improving anytime soon.
I do appreciate the time and effort you have put into your responses. If you'd like to talk privately, I would love to. I'll try and watch the video you suggested tonight.
I wouldn't argue that just because current algos aren't throwing flags, the sites must be fine. We read TechCrunch and HN, hear the complaints, and see searches that we want to be better.
The challenge (in my mind, at least) is how to improve the algorithms more and when it's appropriate to say "This is low enough quality that it's actually spam, and thus we're willing to look at manual action." On the bright side, we've actually got a potential algorithm idea that we're exploring now.
I think identifying low quality content is important, yes. But the topic I've brought up is dealing with somewhat decent quality content (all of the guides do explain how to tie shoelaces) that are individually targeting subtle longtail keyword variations.
It's keyword variation content spam using hand written content and curated by very specific keyword data. So that seems to be a different algo trigger than a quality trigger.
Tell me, how exactly is writing a sensationalized article that targets one of the Internet's oldest and largest communities to get fed by CPM advertising any different than what they decry? People have said this time and time again, but they never seem to debut let alone promise any sort of technology to address the issue. They just leave that end of the deal up in the air. As if to say that it's o.k. to spin topics as long as they strike a social nerve, but those who're less graceful at the craft are undeserving of the benefits which they themselves reap.
If the search giants had any balls they'd cut the "Internet Marketing" community off at the knees. Because the money making methods pushed by that community either don't work or are unsustainable, so they're entirely reliant on a steady stream of new recruits. If they want to promote gaming your system don't let them reap any benefits from it.
For many things there is utility to social content finding. Sites like HN work for news/discussion, and Twitter and Facebook work for a random amalgam of things. I'm interested in the future of social search startups that somehow curate content from friends. In the articles example- asking friends if they liked their Dishwasher, and if yes what brand it is. That's the most like how people IRL make these decisions. I know there are some startups in this space as well, hope they do well!
Having just bought a new washer and dryer, I can honestly say that my friends' opinions were not a factor in the decision at all, even though we did speak about it before I made the purchase. The main reason is that, other than some superficial elements of the appliances, my friends can't really speak intelligently about their own appliances. They tend not to remember, beyond the initial purchase point, what many of the features even are. For example, I asked a friend if his washer contained a heater, and while it does, he didn't even have a clue that it did.
I have a relative working at Home Depot and we just discussed this topic at length given my own recent purchase. Most of their appliance customers come in with some idea of how much money they want to spend and what basic feature set they would like. Then they are looking for a knowledgeable sales associate to explain to them the differences and benefits of the various models. They may have "heard from a friend" that a particular model was good or a particular feature was good but even that information is usually incomplete or incorrect. I would fully expect "social" results for these types of queries to result in even more misinformation.
Google did try this. I don't think it ever left the lab, but I used it for a while. I think they may have canned it; I just realized I haven't seen it in a while. It was called SearchWiki. I think it was originally just supposed to apply to your own results, but I'm sure they analyzed the tendencies to see if it would've been helpful to use overall.
Honestly, such a system would be ridiculously easy to game with a botnet, so there needs to be significant work in the area beforehand.
I think the 'X' was live for maybe a month or two. I figured it was one or both of two things: a resource/algorithm problem that prevented them from continuing in a fruitful way, or a bit of user survey they could use in aggregate as preference weighting on the indexes/results. That is, not as a function of personal filtering, but to see if there were commonalities in what people considered to be bad sites. This may be further supported by the fact that the lifetime of the feature may have been too short for gamers to exploit it.
The problem with that is, again, spam. How do you differentiate fake votes from real ones? You can do per-IP limiting, and probably other things I'm not aware of (please comment about them), but it's still pretty open to abuse I'd say.
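For what it's worth, the per-IP limiting part is the easy bit. A minimal sketch, assuming votes arrive as (ip, url, value) tuples (a format made up for this example) and that only the latest vote from each IP on each URL should count:

```python
from collections import defaultdict


def tally_votes(votes):
    """votes: iterable of (ip, url, value) tuples, value being +1 or -1.

    Only the most recent vote from each IP on each URL is counted, so one
    address hammering the button cannot move the total by more than 1.
    """
    latest = {}
    for ip, url, value in votes:
        if value not in (1, -1):
            continue  # ignore malformed votes
        latest[(ip, url)] = value  # later votes overwrite earlier ones

    totals = defaultdict(int)
    for (_ip, url), value in latest.items():
        totals[url] += value
    return dict(totals)
```

This only stops a single address from stuffing the ballot. A botnet with thousands of distinct IPs sails straight through, which is exactly why it stays open to abuse.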
Google has proven to be very effective data collectors (see "What they know" stuff from WSJ e.g. http://online.wsj.com/article/SB1000142405274870330970457541...) - it may on the one hand be surprising they don't use this data more effectively to identify and thwart spam. OTOH, they're making a lot of money on spammy SERPs.
I'd even be OK with "Thumbs Up/Down this result just for your own personal book-keeping", so that you can promote/demote results in your own search history (try searching for "_____ tutorial photoshop" and you'll get tons of splog links before you eventually hit a golden result. It'd be nice to save time later).
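The personal book-keeping version needs very little machinery. A rough sketch, assuming the votes are stored as a simple mapping from URL to +1 or -1 (my own choice of format) and are applied as a re-sort on top of whatever order the engine returns:

```python
def rerank(results, votes):
    """results: list of URLs in the search engine's original order.
    votes: dict mapping a URL to +1 (thumbs up) or -1 (thumbs down).

    Thumbs-up results float to the top, thumbs-down results sink to the
    bottom, and everything else stays in between. sorted() is stable, so
    the engine's order is kept within each of the three groups.
    """
    return sorted(results, key=lambda url: -votes.get(url, 0))


results = ["http://splog.example/1", "http://ok.example/", "http://gold.example/", "http://meh.example/"]
votes = {"http://gold.example/": 1, "http://splog.example/1": -1}
print(rerank(results, votes))
```

Keying the votes on hostname instead of full URL would turn the same few lines into the personal domain blacklist asked for at the top of the thread.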
Google briefly had a feature like this in their results, but they've removed it. All that's left is the Star system, which isn't quite the same.
How would it be possible to discover content without searching? If I want to discover content about something I have to tell the computer what that something is about, that would be searching. If you are talking about directories, then I wouldn't use that.
Let me see if I correctly understand the learned professor's article. In his view, the problem is that a user using a free search engine to find information will find a lot of information about people who want to sell products and services, gaining money by exerting their time and effort. What he hopes to obtain for free is email addresses of persons to whom he wants to send his survey, so that he can use their time and effort without compensating them to produce something of value to him. Exactly how is this a problem?
People who actively like to be contacted by random persons surfing the Internet make their contact information readily available (and answer questions sent through those publicly visible contact channels). But to many other persons, not being readily visible on the Internet is a feature rather than a bug. (Disclaimer: my contact information is readily visible on the Internet, so readily visible that it has been used by point-of-view pushers on Wikipedia to give me harassing telephone calls.)
I think he was actually trying to piece together the background stories for the people who didn't respond to his emails, not find their emails. I don't think the info he was searching for was particularly private either.
For instance, trying to find out the company a CEO worked at before their current one. The problem is that content copiers will produce so many copies of the PR announcement for their current job, it's impossible to find the announcement for their previous job. I've tried doing this exact search and it's very frustrating.
I often find that things that really don't exist wind up with spam results. For instance, if you type "free ipad" into Google you will likely get thousands of search results, all spam because free iPads don't actually exist.
Similarly, contact and personal information for CEOs of major corporations does not exist online either and any search will turn up spam.
One more example, to add to the many. If you get a genuine wrong number call from somebody who made a simple mistake and type their caller ID into the internet, you'll just get a bunch of reverse phone lookup spam, while if you search for a phone number of a known telemarketer or bill collector, you'll likely get a full dossier on that company.
It's interesting that the way of "gaming" Google appears to be in having thousands of people generating SEO friendly content. I think Google's problem is that it's pushed SEO to the point where the definition of Spam depends either on a subjective view of what kind of site the user is looking for, or it's just mildly worse than something else that's out there (e.g. When I search for something coding related and get one of the stackoverflow scrapers).
Where do we go from here? Well, I don't think the answer is just a radically new way of indexing/ranking websites. That might work in the short term but the spammers will soon catch up. The answer probably lies in a combination of better language interpretation, context sensitivity using browsing history and location, and user profiling based on the social graph and search history. All of which google seems to be working on.
I love Google products, but I can't help but agree. I'm currently trying to find a colour laser printer that has good performance (quality vs speed) with a reasonable running cost over the life of the printer (at least a few years).
All I'm getting is either the manufacturer's slant (PR) or spam sites all harvesting the same reviews.
I stick with Google because it largely works well, but when I know what I want to see and that it must exist but cannot find it... then I find myself looking elsewhere all the time. DDG and Blekko I use in these cases, but even they're not solving these kinds of needs.
I've found them nearly useless. Way too many people giving one star reviews because their 15" laptop doesn't fit in a 13" bag. If you take the time to read all the reviews, you can weed out the idiots, but sifting out the haters and the astroturfers can take longer than just going to a store and fondling the merchandise.
Too many products without reviews. No control for the reviews.
Example... the HP CP4025 appears to be good (I have an HP Z800 workstation so thought it was worth checking HP for the hell of it)... yet I haven't found a balanced and comparable review for it. I'd like to see it set against other small office printers and to see the average cost per print.
Why would it be so difficult for Google to filter out spam sites? E.g. DuckDuckGo filters out eHow.com results, because they are low quality and tend to be spammy.
Oh of course, it's not in Google's interest to do this, because they make money from the spam sites. So I don't expect Google to really "solve" this problem.... their trick is to stay useful enough that users don't abandon them, but allow enough spam into the search results to provide revenue. A tricky balance...
But that is different than saying the prioritization of this as a "problem" is or is not affected by AdSense revenue. What we, as a bunch of highly technical people, see as a problem, Google's ranking decisions treat as, at best, neutral and, at worst, a positive.
It may not affect ranking (and I do believe that), but I would certainly be willing to bet it affects them (and similar sites that aren't AdSense-based) not being de-ranked.
Professor, you could've proved your point by linking to at least one example of how Blekko found a founder's work and listed it by date (as the task required); instead you have hashtags on health, finance, etc.
The truth is that nobody has arranged that information in the way you want. If it existed at all, that venture database where you found the 500 companies would've been the natural place to look... CrunchBase
One of the new things I am working on with unscatter.com is getting quicker access to reviews and blog posts using the blekko api. The next release will be a major change as I've dumped most of the current search providers in favor of blekko and have moved realtime search to its own page with analysis by providing lists of links in the realtime feed.
Nothing is released yet unfortunately. The site is officially a hobby for me write more but I hope to have the new stuff up in the next week or two. I may just hide the realtime stuff and get the blekko feeds up sooner rather than later.
Now that I am focusing on building the site to fit my needs, getting up-to-date info about products and technology, the bulk of my personal searches, is the top priority. Have to admit the blekko api has helped.
In the meantime, I would suggest that the slash tags /reviews and /blogs with /date on blekko would be very helpful if you are doing product searches. With unscatter I am really only providing shortcuts for them with additional UI tweaks.
Disclaimer: I am in no way associated with blekko other than having been given permission to use their api for a personal project.
I just tried two test queries on blekko and google. Small sample, but there did seem to be fewer link-bait results on blekko. The issue is whether their results are close to being as up to date as google's results.
I was interested that blekko seems to have done a lot with a modest amount of funding.
Also, I wonder if they are getting some monetization with the association with Facebook.
Consumer Reports is probably about as objective a source as you can find, but I don't believe they are without their biases. They tend to give a lot of weight to things like value and reliability, and less to aesthetics, though the latter may be an important factor for some consumers. For tech products, they review them from the standpoint of an "average consumer" and probably won't evaluate factors that matter a lot to many readers here.
They are also politically left-leaning, if that matters to you.
Hey, so what you are basically saying is, "the best computer algorithms in the world" (you know, Google has like > 578690 Ph. D's) are not good enough to have effective search, so we should introduce the human element.
Fair enough. There is the Open Directory Project (which is pretty old) and of course there is Facebook, Twitter, and other, human-curated services. Starting a whole new company to do search and compete with Google (and Bing)? Seems like a waste of time as Google can just copy what you are doing and incorporate it into its already massive site (complete with traffic, audience, and lots of other goodies). Instead, why not get Google to add more social recommendation and feedback features?
How about a P2P search/bookmarking platform where peers could publish search/bookmarking histories ranked by like/dislike/spam votes which other peers can subscribe to. Publishing peers can also be ranked according to number of subscribers. Actually P2P curation could be the next level up from raw centralised search. Is there anything like this out there already?
Not only that, but Yahoo didn't think search was important, so it's not even a real try of enhancing search with curation.
The truth is, if some company comes up with a better search engine, whatever ideas are behind it are not going to sound like an obvious win up front; if they did, then Google would already be doing that. Instead they'll have to create a search engine that is better, but somehow antithetical to Google's business model so that they can't just copy it, because there's no way for a startup to come up with enough resources to stay materially ahead of Google in pure search. And of course that's only half the battle; then you have to be better enough that users can be bothered to switch (or a browser deal coup).
Personally I haven't found the spam problem to be nearly as bad as the echo chamber makes out. I think Silicon Valley types just have a good imagination about how good it could be.
I've been wanting to build a decentralized curated social search (buzzword soup, I know). The idea is that security has to be based on trust, and adversarial search is a security problem. Here's my idea: it starts with delicious-style bookmarks and a social graph. The search engine indexes the pages you've bookmarked so your bookmarks are searchable. AND, you can expand your search to include your friends' bookmarks. Obviously it wouldn't be comprehensive, but it would work really well for shared interests like programming.
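A toy version of that idea fits in a few lines. This is only a sketch under my own assumptions: the `BookmarkIndex` class and its method names are invented for illustration, everything lives in memory on one machine (so nothing decentralized about it yet), and a query matches a page only if every query word appears in it.

```python
import re
from collections import defaultdict


def words(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BookmarkIndex:
    """Hypothetical in-memory index of bookmarked pages, one index per user."""

    def __init__(self):
        # user -> word -> set of bookmarked URLs containing that word
        self.index = defaultdict(lambda: defaultdict(set))
        # user -> set of friends (the social graph)
        self.friends = defaultdict(set)

    def add_friend(self, user, friend):
        self.friends[user].add(friend)

    def bookmark(self, user, url, page_text):
        """Index the full text of a page the user has bookmarked."""
        for word in set(words(page_text)):
            self.index[user][word].add(url)

    def search(self, user, query, include_friends=False):
        """Return URLs whose text contains every word of the query."""
        terms = words(query)
        if not terms:
            return set()
        users = {user}
        if include_friends:
            users |= self.friends[user]
        hits = set()
        for u in users:
            per_term = [self.index[u].get(t, set()) for t in terms]
            hits |= set.intersection(*per_term)
        return hits
```

Searching with `include_friends=False` only sees your own bookmarks; flipping it on widens the search one hop out into the social graph, which is where the trust comes from.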
But it would be useless if you're the first person in your social graph to have a need for particular information; similarly, for it to be very useful you would want your users to push almost every website they visit into it, or any website which includes useful information (otherwise you would get too many searches filled with junk, or just with nothing). It's also useless if a user doesn't have friends using it (as compared to other programs; see below), and so bootstrapping up would be a bit hard.
Essentially, you're talking about something like Chrome's history search (it indexes the content of every page you visit, and allows you to do full-text searches of them), but with the ability to expand the search to other users' indexes. You'll want to make it entirely painless for users to add pages - a keyboard shortcut at most - and integrate into the browser as much as possible. It's also probably a good idea to add some mechanism to identify users with reliable indexes and allow others to use them; a karma system could be used for this (add karma if a result is useful, remove it if it's not - similar to how HN works with comments). Users with really low karma (indicating that they're pushing lots of spam sites into the service) would have their results biased against in full-index searches, or outright removed without them being aware (similar to how users on HN can be killed, so that they see all their posts normally but no one else does).
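The karma part could be sketched roughly like this. The weighting formula and the cutoff of -5 are arbitrary assumptions picked for the example, and `rank_by_karma` is a made-up name, not part of any existing service:

```python
def rank_by_karma(hits, karma, ban_below=-5):
    """hits: dict mapping a URL to the set of users whose index contains it.
    karma: dict mapping a user to an integer karma score (default 0).

    Users at or below ban_below are silently ignored (the "killed" case).
    Every other user adds 1 + max(karma, 0) to the score of each URL they
    indexed, so low-karma users still count, just for very little.
    """
    scores = {}
    for url, users in hits.items():
        trusted = [u for u in users if karma.get(u, 0) > ban_below]
        if not trusted:
            continue  # only banned users vouch for this URL: drop it
        scores[url] = sum(1 + max(karma.get(u, 0), 0) for u in trusted)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))
```

A page that only a heavily downvoted user has indexed never shows up for anyone else, while that user's own view of their index can be left untouched.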
Disclaimer: I'm not quite awake, so take this advice with a huge grain of salt. It's just my thoughts on the idea.
True, but another part of the issue is a pure algorithmic approach can be gamed, inside a search company human curation doesn't scale, and community curation has scaling problems / seeding problems / can be gamed.
I do think that some human curation, if for nothing else to mark sites as copy spam, might be workable. Pattern matching is still humanity's territory.
Google's current incentives align with those content sites more than the searcher. Those sites provide a lot of revenue for Google. Until Google starts losing searchers as a result of those sites (unlikely, because my mom is unlikely to care that eHow is massively gaming things), those sites are going to stick around.