A Chrome extension to avoid the Stack Overflow ripoffs (chrome.google.com)
126 points by jlangenauer on Dec 8, 2010 | 48 comments

Sites like these are, in my opinion, the scourge of the internet. There is a lot of talk nowadays about curated search engines displacing machine-generated search engines but I tend to think this goes too far. A search engine that could reliably determine the authoritative source of duplicated content and only include that source would be killer. Seems within the realm of possible... Anyone working on that?

Look up information cascades.

We're doing something in a similar vein for LazyReadr. The idea is to merge news about the same story together; an important step from there will be deciding which is the authoritative source to display. Going from that to effective search isn't a large leap.

Thanks for the tip. My search turned up this resource, which looks awesome:


I suspect Google isn't doing this for legal reasons. They probably have to take care not to look discriminatory in any way.

It shouldn't be difficult to detect. Whichever site receives content first is likely the definitive source.
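To make the "first seen wins" idea concrete, here is a minimal sketch. It assumes the crawler records a first-crawled timestamp per URL and that copies are near-verbatim; the function names and the hash-based fingerprint are illustrative assumptions, not how any real search engine does it.

```python
import hashlib


def normalize(text):
    # Collapse whitespace and case so trivial reformatting does not hide a copy.
    return " ".join(text.lower().split())


def pick_originals(pages):
    """pages: iterable of (url, first_crawled, text) tuples.

    Returns {content fingerprint: url of the earliest-crawled copy}.
    """
    earliest = {}
    for url, first_crawled, text in pages:
        key = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if key not in earliest or first_crawled < earliest[key][0]:
            earliest[key] = (first_crawled, url)
    return {key: url for key, (_, url) in earliest.items()}
```

The catch is that crawl time is only a proxy for publication time: a copy that happens to be crawled before the original would win under this rule.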

Also, in many cases (eg. this one) the duplicated content even links to the original source, so even simple pagerank should put stackoverflow significantly higher than its duplicates.
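A toy illustration of that PageRank argument, assuming a three-page web where two copies each link back to the original. This is the textbook power-iteration form with a damping factor of 0.85, not Google's actual ranking, and the page names are made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical graph: both copies attribute the original, which links to nobody.
graph = {"original": [], "copy_a": ["original"], "copy_b": ["original"]}
scores = pagerank(graph)
```

In this model the original ends up with the highest score, so whatever is pushing the copies above it in practice has to be coming from other ranking signals.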

see http://webmasters.stackexchange.com/questions/6556/does-the-... for the primary outcome of this investigation

How can we prevent these sites from ranking well in the first place?

Use http://dukgo.com instead of its less quality-motivated big brothers. If google, et al. see duckduckgo usage spikes, they will likely implement similar filtration measures. Then they will post blog updates about how awesome they are for adding said measures to their service. And I will read about it after searching with duckduckgo.

Sounds a bit like Opera vs the major browsers.

I use duckduckgo as the default search engine with Chrome, and it works really nicely. The key to making the switch easy: preface your query with a bang, like `!g my-query' or `!bing my-query', to search on Google or Bing. Useful, because for me Google is still better at searching stuff in German.

Just using `! my-query' gives you I'm-feeling-lucky semantics.

Have a look at http://duckduckgo.com/bang.html for an exhaustive list. !hn searches Hacker News via searchyc.com.
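For anyone curious how the bang prefix could work, here is a rough client-side sketch. It only handles the two bangs mentioned above, only as a prefix, and the routing table is my own assumption; DuckDuckGo does this on its servers and supports far more.

```python
from urllib.parse import quote_plus

# Hypothetical routing table covering only the bangs mentioned in this thread.
BANGS = {
    "g": "https://www.google.com/search?q=",
    "bing": "https://www.bing.com/search?q=",
}
DEFAULT = "https://duckduckgo.com/?q="


def route(query):
    """Return the search URL a prefixed-bang query would be sent to."""
    query = query.strip()
    if query.startswith("!"):
        bang, _, rest = query[1:].partition(" ")
        if bang in BANGS and rest.strip():
            return BANGS[bang] + quote_plus(rest.strip())
    # Unknown or missing bang: fall through to the default engine.
    return DEFAULT + quote_plus(query)
```

So `route("!g my-query")` goes to Google, while a plain query stays on DuckDuckGo.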

Often times, efreedom ranks higher than StackOverflow, and in some cases SO isn't even on the first page (for some reason the official source 'misses' with my search terms). In those cases, I open up the efreedom link and click through to SO. This seems to be happening more and more.

EDIT: I see that the extension actually redirects to SO. So in a way the presence of those sites when I wouldn't normally see SO results is a good thing. Nice.

>> In those cases, I open up the efreedom link and click through to SO. This seems to be happening more and more.

I completely understand why you're doing that, but you should know that's seen by the Goog as a big +1 for efreedom. All they know is you clicked on that result and didn't come back because it didn't answer your question.

It's sites like these that have made me wish I could downvote or mark-as-spam from the search results. Why can't I tell Google that I never want to see results from certain URLs ever again?

I would love to see this feature. Preferably as a way for search results to be "voted-down" by multiple users, with an aggregated score (this would probably be heavily gamed/spammed by black hat SEO people, and become useless anyway).

At the very least, if I am logged into my gmail account, I should be able to hide certain sites from being returned in my personal results.

You used to be able to do this using the Search Wiki feature

Wait, let me see if I understand this right. Stack Overflow doesn't use AdSense. Other sites scrape Stack Overflow and surround the ripped-off content with ads from AdSense. And you're wondering why Google ranks its customers' sites higher than its non-customers'?

A correction: these sites don't scrape Stack Overflow's content, they download and use it directly and legitimately. Stack Overflow content is cc-wiki licensed and released in full data dumps every month. As long as the sites link to stackoverflow.com and otherwise comply with the cc-wiki requirements, it's legit.

Google Blacklist can be used to remove any arbitrary site: https://chrome.google.com/extensions/detail/hbodbmhopadphblo...

THANK GOD. I was getting really pissed off while trying to meet a pretty hard deadline yet trying to look up iOS esoterica.

DuckDuckGo does filter some of these sorts of sites. Not sure about these in particular, but I did try searching for "NSFetchedResultsController" there and none of these sites were in the results.

Actually, I never thought I'd find one, but DuckDuckGo is a search engine that every programmer should use. In fact, I've found that DuckDuckGo can be better than Google in a lot of cases.

Google is not currently optimized for technical people - if anything, it's anti-optimized. It shouldn't be hard to beat it for technical queries.

Judging by the permissions, it redirects for a whopping three sites. Why not just not click on the links for those sites?

Because, for reasons known to the engineers toiling at Google, and not me, these sites will show in the search results, and the actual StackOverflow answer will not be shown at all.

SEO is non-deterministic enough that I hesitate to even speculate, but the clean markup on sites like eFreedom is probably a lot of their advantage. Stack Overflow buries its content in quite a bit of presentational markup. Everything else being equal, it makes sense that the tighter, more focused page should rank better than the larger, more convoluted page containing the same content.

Yes, it's very very bizarre and it drives me crazy. We've tried 3 or 4 different things to fix it and nothing seems to take. Note that our attribution terms do require a link back primarily for this reason, and even the sites who attribute back to us totally legally still have this problem. It truly does feel like a Google bug, honestly. See related discussion at http://webmasters.stackexchange.com/questions/5385/page-appe... . I'm all ears if anyone knows of a way to fix this.

Why not just yank the creative commons licence and replace with one that explicitly does not allow scraping?

That would be the fastest way in my book. I've never worked out why SO allows it in the first place. Is it just to appear open and web 2.0-y, or is there a business reason? It's a proper business now, users are loyal, cancel the licence. I can't think of a single person who would say 'oh, but I much preferred to read those spam sites'.

Yes, absolutely.

As someone who has contributed a fair amount of content to SO, I would wholeheartedly support modifying the CC license on the content I've contributed. I much prefer the idea of that to the idea of allowing my answers to help build dens o' spam like eFreedom.

Now that I think of it, my answers going straight-to-spam is a nontrivial detriment to my contributing content to SO.

Glad to see I'm not the only one.

But I really posted this comment in reply to say thanks for the blog posts on jquery/asp.net. They really got me going in the right direction - fantastic stuff.

Thank you for the kind words. It's always great to hear that someone's been able to extrapolate a useful approach/direction out of my rambling.

Do what you can to get links to your questions with the question title as anchor text.

1) Is it possible to require the question title as anchor text in the creative commons link that points to the question (and not a nofollow link obviously)?

2) Make a widget for embedding a question and chosen answer on a web page, along with a link back to the question, a la google maps.

3) Make a wordpress plugin that takes a stack overflow link and turns it into the question and chosen answer, with a link back to stack overflow.

That's been my experience as well. These sites have been a royal PITA and I'm glad to see this extension.

I don't know what the deal is, but it's weird on SO's part. If the issue is, as a sibling comment says, presentation html / js confusing G's crawler, then you email your account manager at Google and it will almost certainly be completely kosher to show the G scraper different material than is displayed, as long as the difference is true to the intent of the searcher. So you can present the G crawler just questions and answers with no javascript and no additional markup.

Good idea, but I have no idea who our account manager is at Google. Do we even have an 'account manager'? What do they manage exactly? We don't do any AdSense. I'll try to find out. But I'm kinda skeptical that this approach (serve googlebot one set of content, everyone else something else) is something that Google would actually want to encourage at scale on any site.

Cloaking (the name for faking out the google-bot) is NOT cool by Google. There are ways to accomplish the same thing without cloaking, for example making sure content loads first, keeping markup and presentation away from the main HTML, etc.

Are you sure? I thought they were generally ok with it as long as it was the same information the user sees, just presented differently to help Google index it better.

No, cloaking is a violation of our quality guidelines.

Obvious question, but have you reached out to Matt Cutts yet?

I'm looking into this a bit now. It's not a clearcut situation, in that SO has a license that allows copying. Jeff, drop me an email and let's talk about it more (I followed you on Twitter so you can DM me).

The specific example at http://webmasters.stackexchange.com/questions/5385/page-appe... is ranking SO at #1 for me now, but if you send some more examples I'm happy to chat with people in the search quality team about this.

I'll try, but Matt is a busy guy.

He is, but he's exceptionally helpful and generous with his time. Has my utmost respect.

@codinghorror - any update from your conversation with Matt_Cutts? I'm curious about this issue from a justice standpoint, as continuously seeing efreedom results in the Google rankings just doesn't sit well with me.

I looked around eFreedom's site a bit and they are providing some additional add-on value (translations etc.) so that may also be the case. In any case, best of luck in getting SO pages ranked better.

The auto-translations are specifically against Google's TOS, just FYI. Beyond that, Matt is looking into some specific oddities we found, and stuff was forwarded on to the Google search quality team. Not sure what will come of it, but Matt Cutts is awesome!

It looks like the robots.txt on those translation sub domains is disallowing all crawlers.
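If you want to check a claim like this yourself, Python's standard `urllib.robotparser` can evaluate a robots.txt body. A small sketch; the helper name and the example host are placeholders, and the rules shown are the generic "disallow everything" form rather than a copy of any real site's file.

```python
from urllib.robotparser import RobotFileParser


def crawl_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Generic "block all crawlers" rules (an assumed example, not a real file).
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
blocked = not crawl_allowed(BLOCK_ALL, "Googlebot", "http://translated.example/page")
```

To test a live site you would fetch its `/robots.txt` first and pass the text in; the parser also has `set_url()` and `read()` for that.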

I really can't get over how a blatantly spammy website (Adsense plastered over pages) just reorganizing SO content can have such a huge traffic velocity - check this alexa chart: http://www.alexa.com/siteinfo/efreedom.com

Wouldn't it be great if there was a way to do this for any site you didn't want to see in the search results? Is it already possible with Google personalized?
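The filtering part of that is simple enough to sketch, assuming you already have the result URLs in hand (say, in an extension or a userscript). The helper names and the example paths in the comment are hypothetical.

```python
from urllib.parse import urlsplit


def is_blocked(url, blocked_domains):
    # Match the domain itself or any subdomain, but not lookalike hosts.
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


def filter_results(urls, blocked_domains):
    """Drop every result whose host is on the personal block list."""
    return [u for u in urls if not is_blocked(u, blocked_domains)]


# Example with made-up paths:
# filter_results(["http://stackoverflow.com/questions/1",
#                 "http://www.efreedom.com/Question/1"], {"efreedom.com"})
```

Note this only hides results on your side; it does nothing to tell the search engine that the site is unwanted.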
