Hacker News

An obvious improvement to Google whose absence shocks the hell out of me would be this:

Personal domain blacklist.

There's a lot of spammy bullshit on the web and Google seems to have given up on keeping this away from me. Fine. But for my specific searches, there's usually a handful of offenders who, if I never, ever saw them again, it would improve my search experience by an order of magnitude.

So let me personalize search by blacklisting these clowns. Why can't I filter my search results so that when I search for a programming issue, I never see these assholes from "Efreedom" who scrape and republish Stack Overflow?

I don't, personally, need an algorithmic solution to spam. Just let me define spam for my personal searches and, for me, the problem is mostly solved.

(Also blacklisted: Yahoo Answers, Experts Exchange.)
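The filtering itself is trivial. As a hypothetical sketch (the domains are just the offenders named above; the function names are made up for illustration), the whole feature is a host check:

```python
from urllib.parse import urlparse

# Hypothetical personal blacklist, seeded with the offenders named in this thread.
BLACKLIST = {"efreedom.com", "answers.yahoo.com", "experts-exchange.com"}

def is_blacklisted(url, blacklist=BLACKLIST):
    """True if the URL's host is a blacklisted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blacklist)

def filter_results(urls, blacklist=BLACKLIST):
    """Drop search results whose host is on the personal blacklist."""
    return [u for u in urls if not is_blacklisted(u, blacklist)]
```

The subdomain check matters: blocking efreedom.com should also kill www.efreedom.com.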

If there was a blacklist option, then one could subscribe to a blacklist maintainer like one does for Firefox Adblock -- that would be really, really nice.

I am actually surprised there is no Labs application for this, unless there is a business case against it.

If this gets upvoted enough I will build a browser plugin that does this.

"Will code for karma points"

There is already an FF extension that does this. It's called OptimizeGoogle and they need help keeping it up to date. I've offered to help but the code is quite strange (came from an older extension called CustomizeGoogle) and I'm not sure it's worth saving.

A cross-browser effort that implements a few key features from OptimizeGoogle would be a very good idea. I'd be up for that.

Which browser? Safari, Chrome, Opera and Firefox all support extensions/plug-ins.

Maybe IE9 does, too, but that's not important. :)

People might take you more seriously if you bothered to write a little in your about section about how you're qualified to do so.

Because he says he can?

Typically it's OK to err on the side of caution, but when someone offers to do a bunch of work if you just indicate that you'd like it done, the safe bet is to assume they are in fact qualified; after all, their reputation is on the line in public.

I admit that I was being too harsh. I think I was just having a grumpy day and let it spill over into my comments.

Definitely an interesting idea.

The first thought that came to mind was what happens when I disagree with a couple of items on one of these 3rd party blacklists?

Then I thought, FORK IT and make the changes you want. You could even merge in lists from other people. Github for blacklists?
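As code, that fork-and-merge workflow is just set arithmetic: union the lists you subscribe to, drop your local exceptions, add your local entries. A hypothetical sketch:

```python
def merge_blacklists(subscribed, remove=(), extra=()):
    """Build a personal blacklist from subscribed lists ("forks"):
    union the lists, drop local exceptions, add local entries."""
    merged = set()
    for blacklist in subscribed:
        merged |= set(blacklist)
    return (merged - set(remove)) | set(extra)
```
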

Just a small note, SO puts all of its content under a Creative Commons license[1][2], which they are (in my non-lawyerly opinion) following.

Now, this doesn't mean that filtering them wouldn't be useful to you, since at first glance it appears they're solely a duplicate. Just pointing out that they're not actually doing anything wrong, and they're (probably) not scraping.[3]

SO has specifically said that this is okay.[4]

1: http://wiki.creativecommons.org/Case_Studies/StackOverflow.c...

2: http://creativecommons.org/licenses/by-sa/2.5/

3: http://blog.stackoverflow.com/2009/06/stack-overflow-creativ...

4: http://blog.stackoverflow.com/2009/06/attribution-required/

> SO has specifically said that this is okay.

It doesn't look like Jeff is that okay with it, especially when it comes at the cost of Stack Overflow's own ranking:

> Sorry, this is absolutely necessary, otherwise we get demolished by scrapers using our own content in Google ranking.

This is from a question about Stack Overflow's SEO strategy:


Yeah, it's not a _great_ situation. I'd probably filter them out too. But then again, maybe not. This is a complicated topic, and I'm still not quite awake...

I mean, their use of SO's content may be legal but it doesn't mean they aren't dicks. It's a wholesale ripoff of another site's content that adds absolutely no value to anyone but the publishers themselves. It inconveniences users and it harms the good work of Stack Overflow by robbing them of rankings they deserve.

In the same way, just because it isn't illegal for me to call someone's mother bad names doesn't mean I should do it; I can follow the letter of the law 100% and still be an asshole. Overall, my policy is that it's best not to be an asshole, and it annoys me when others can't share that basic ethos.

The same happens with Wikipedia as well. It's free-content, because that's sort of the point of the project. And reusing that content is great and encouraged. But just rehosting the exact contents of en.wikipedia.org with ads slapped on is a bit lame. Legal, but it's not any sort of interesting reuse, just adding more noise to the internet.

Wikipedia gets so much traffic from google that the harm there is minimal compared to what's happening to stackoverflow.

Truth. That's why I'm waffling. On one hand, it sucks that they're solely copying things. This is never something that I'd personally do.

That said, you could make an argument that the value they're adding is SEO and promotion; it's pretty impressive to be able to out-rank SO...

You know, I am starting to think more and more that copyright should ALLOW peer-to-peer sharing but not republishing. That's where the line should be drawn. There should be a clear definition of publishing, e.g. serving content upon request to anyone immediately on demand. That way, if the authorities can download something copyrighted from a public source which is not the original author, they can go prosecute that source. Just a thought. Ideas?

It's just easier to eliminate intellectual property altogether.

It's actually not easier to eliminate intellectual property altogether. IP laws haven't been enforceable for years, but so far that's resulted in revenues declining across certain industries rather than going away completely. Think of all the change in the past 15 years the entertainment industry has faced and imagine how much bigger in magnitude--and how utterly sudden and catastrophic--immediate IP repeal would end up being.

They deserve it. This is capitalism: evolve or die. I won't shed a tear.

New media will make it work anyway. IP is not needed for content producers to survive and even thrive.

I'm not saying whether it's a good or bad idea, I'm saying whether it's easy or not. There's a difference between "creative destruction which will end up good in the long run" and "easy"--in fact, they're nearly exact opposites. Just look at the fallout from this last recession for an example.

(On a side note, capitalism is defined by the legal enforcement of property rights. Abolishing intellectual property is probably the exact opposite of capitalism. The word you're looking for is "market".)

Ah, this is a longer discussion than I'd like to have at the moment, but you're still assuming that IP is actually 'property.' It's not, at least in my mind.

You're still right though: market would be a better fit. Markets and capitalism are pretty much interchangeable in my mind, which is why I made the slip.

Property rights, as far as capitalism is concerned, are an artifact of law, not some fundamental philosophical truth. For capitalism to exist in a given domain, the government has to enforce property rights in that domain. That includes everything from cap-and-trade (where there are property rights to air emissions), water rights, real estate, equity in businesses, and yes, even copyrights and patents. Whether any of this actually constitutes "property" in a philosophical sense is an especially worthless genre of philosophical argument.

Really? So you would be totally OK if someone copied your article and slapped their name on it, or took your fictional characters and wrote some stories where they are made out to be the scum of the earth?


too bad it's not the case for many other people :)

Relieving your users of the copyright in their own published content, hosted on your own platform, is very dubious territory legally.

It's fine that they have an expressed policy that says it's okay, but I'd keep it at that and not refer to terminology like CC licenses.

Huh? All content on the site is CC licensed. You agree to allow your content to be released under CC when you sign up for the site. It's perfectly reasonable to discuss things in the terms of CC.

There's nothing dubious about this legally at all.

Like EULAs, license agreements, ToS's, etc.

I prefer how YouTube handles it[1]:

“You shall be solely responsible for your own Content and the consequences of submitting and publishing your Content on the Service. You affirm, represent, and warrant that you own or have the necessary licenses, rights, consents, and permissions to publish Content you submit; and you license to YouTube all patent, trademark, trade secret, copyright or other proprietary rights in and to such Content for publication on the Service pursuant to these Terms of Service.

For clarity, you retain all of your ownership rights in your Content. However, by submitting Content to YouTube, you hereby grant YouTube a worldwide, non-exclusive, royalty-free, sublicenseable and transferable license to use, reproduce, distribute, prepare derivative works of, display, and perform the Content in connection with the Service and YouTube's (and its successors' and affiliates') business, including without limitation for promoting and redistributing part or all of the Service (and derivative works thereof) in any media formats and through any media channels. You also hereby grant each user of the Service a non-exclusive license to access your Content through the Service, and to use, reproduce, distribute, display and perform such Content as permitted through the functionality of the Service and under these Terms of Service. The above licenses granted by you in video Content you submit to the Service terminate within a commercially reasonable time after you remove or delete your videos from the Service. You understand and agree, however, that YouTube may retain, but not display, distribute, or perform, server copies of your videos that have been removed or deleted. The above licenses granted by you in user comments you submit are perpetual and irrevocable.”

[1]: https://www.youtube.com/t/terms

(Can someone tell me the <pre> syntax or something appropriate for a blockquote?)

> (Can someone tell me the <pre> syntax or something appropriate for a blockquote?)

    prefix with four spaces

There used to be a help button next to the text box for replies. Where has it gone?!

I don't know, but here's the link: http://news.ycombinator.com/formatdoc

Hmm, I indented it before I submitted it, but maybe that's not supported, or Sublime Text doesn't indent by four spaces.

Regardless, thanks for the tip.

That's how YouTube does it, because YouTube is a channel for individuals to share their own content. Wikipedia, on the other hand, requires CC-BY-SA/GFDL licensing, because Wikipedia is a project to develop a knowledge base.

The question is whether StackOverflow is closer to YouTube or Wikipedia. I think it's closer to Wikipedia because it's a curated reference source, not just a medium for self-expression.

I know where you're coming from, but I'd rather say that Wikipedia articles have editors, not authors.

The articles are in a constant flux of change, and I don't know if anyone deserves more attribution than others for contributing to an article.

Knol might be a more relevant example, but I haven't really checked it out in a while. (Who has, really?)

Legal but unethical.

I am torn. Why bother CC licensing if they didn't want you to do this, you know? My first instinct is "What a jerk!" but they're also spreading more knowledge around, which is something I can't really fault.

They're not spreading knowledge around, though. By copying the content without adding anything, they're removing impressions from the real site, which is where all of the relevant related content lives: comments, upvotes, discussion, etc. They're preventing the knowledge from getting out there.

ehhh I think we'll just have to disagree on this. Content that was once in one place is now in multiple places.

I recognize I may be being overly simplistic.

Slightly off-topic, but if you use Chrome, there is the 'stackoverflowerizer' extension[1]:

> Always redirect to stackoverflow from pages that just copy content, like efreedom, questionhub, answerspice.

[1] https://chrome.google.com/extensions/detail/gledhololmniapej...

blekko has that feature. every result has a "spam" button underneath it. Click the link, and the host will be added to your personal /spam slashtag. Everything on your /spam list gets negated from all of your results by default.

Very handy. I put ehow.com on mine and never see results from them.

At one time Google did provide this: if you were logged in, there was an [X] option to remove that result from your searches, as well as a voting up and down mechanism. I think it was just an experimental feature, but I wish they would have kept it and expanded on it. If Google is reading... please bring it back.

I remember that. The trouble was that it was result-specific, not site-specific. There wasn't a way to kill an entire site, at least that I was aware of.

One idea that comes to mind to deal with the wikipedia / stackoverflow problem is result clustering. With Google News, they have done a pretty good job of clustering articles on a single story. They are getting better at deriving the original source in many cases. The simple act of duplicate detection should enable them to identify sites that scrape content and show them as duplicate results.

In the interests of results diversity, you don't want the same content repeated ten times on the first page, although this has the side effect of pushing the original source onto the second page if you guess wrong.
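One classic way to approximate that duplicate detection is word shingling plus Jaccard similarity. A toy Python sketch (production engines use far more scalable hashing-based variants; this is only to make the idea concrete):

```python
def shingles(text, k=3):
    """Set of k-word shingles (overlapping word n-grams) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity (intersection over union) of two sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def looks_like_scrape(original, candidate, threshold=0.8):
    """Flag a candidate page whose text overlaps heavily with the original,
    so the two can be clustered as duplicates in the results."""
    return jaccard(shingles(original), shingles(candidate)) >= threshold
```

A wholesale rip of a Stack Overflow page scores near 1.0 against the original; an unrelated page scores near 0.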

I use the OptimizeGoogle (and previously CustomizeGoogle) Firefox plugin for this:


You can kind-of do this manually:

Someone could probably get a script going to add that to any queries you have. Unfortunately (for me), it breaks Instant.

A Google search is limited to 32 words, so it'd be difficult to block all the sites you wanted to.

I never knew that. Thanks for the heads-up!
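Such a query-rewriting script is only a few lines. A hypothetical sketch that appends `-site:` exclusions while staying under the 32-word limit (the blocked list is made up, and it's an assumption here that each operator counts as one word):

```python
BLOCKED = ["efreedom.com", "questionhub.com", "answerspice.com"]  # hypothetical list

def with_exclusions(query, blocked=BLOCKED, max_words=32):
    """Append -site: operators to a query, stopping before the
    32-word query limit is exceeded."""
    words = query.split()
    for domain in blocked:
        if len(words) >= max_words:
            break  # no room left for another exclusion
        words.append("-site:" + domain)
    return " ".join(words)
```
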

There must be someone at HN somewhere in the Google pyramid who can tell someone with decent clout about this proposal.

There must be some way Google's search engine could learn by looking at the blacklists people use.

Problem is that this was tried before, with SearchWiki. When it launched, it was widely derided as being useless and distracting. Its usage numbers didn't show widespread adoption. And then when it was removed, there was much rejoicing.

Now, it's possible that SearchWiki just needed a few more iterations, and with a few details changed, could be a big success. There have been a few other recent launches that were tried years ago, didn't work then, but had a few more iterations and now are big successes. I could at least raise the issue. But unless I can tell a convincing story about why people would use this when they didn't use SearchWiki, it may be an uphill battle to get resources devoted to this.

I think SearchWiki solved a different problem. It was about, approximately, globally curated results for certain searches. What's being described is personalized curation, at least that's how I read it. I'd definitely be into such a feature, and it really doesn't seem like a tremendous undertaking to make it something opt-in via Labs. I also recall SearchWiki being about specific results, which is not desired behavior for the personal curation experience I have in mind.

Amen. They could put it here: http://www.google.com/experimental/

Compare that to the GMail labs. It's pitiful.

They already have the exact opposite curation feature: the star system. And it's crazy.

When I search, and click one of the 10 results, and the result turns out to be satisfying, the last thing I want to do is click the back button and star it.

When the result turns out to be spam I necessarily have to hit the back button and try again. Staring me in the face is the now-purple spam link; let me X it.

Personal blacklists are the least Google could do, because my SERP is never going to be perfect. Feeding those blacklists back into the general SERP population is an interesting research project.

The searchwiki votes should have fed into a global spam filter. Search-specific curation is useless and personal spam lists fail to make use of economies of scale. Spam filtering is all about scale.

I would blacklist ezinearticles. It's an SEO farm with no other purpose than to target keywords and redirect traffic somewhere else. The value you get is minimal per click compared with Wikipedia.

I believe you could achieve this by creating a Google Custom Search Engine (CSE).

> I don't, personally, need an algorithmic solution to spam.

Not now you don't. But if everyone started using your approach then the spammers would adjust their behaviour and use many domains instead of one.

Which would kill their chances at gaining significant ranking for each of their spam domains.

Similarly, I'd like a blacklist or advanced search operator that, for certain queries, would allow me to exclude all sites with AdSense on them. Might have to wait for Blekko or Bing to provide that, though.
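An extension could approximate that AdSense filter client-side by looking for the ad script in each result's HTML. A crude, hypothetical sketch (the marker strings are AdSense's well-known script host and tag name, though the exact markup can vary):

```python
# Assumption: pages serving AdSense embed Google's ad script or tag;
# these markers are the well-known script host and JS object name.
ADSENSE_MARKERS = ("pagead2.googlesyndication.com", "adsbygoogle")

def has_adsense(html):
    """Crude client-side check for AdSense markup in a page's HTML."""
    return any(marker in html for marker in ADSENSE_MARKERS)
```
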

This is something a browser extension could do (though not as well as the engine itself, of course). There is a Chrome extension called "Google Blacklist"[1], but it didn't seem to have any effect when I tried it out. Perhaps others will have better luck.


Very good point. Scrapers really need to be dealt with and you should be able to define what you class as spam on a personal level in Google. It's astounding they haven't done it yet.

Personal domain blacklist.

For whatever it's worth, blekko has this. It's one of the main reasons I switched to blekko over Duck Duck Go for the majority of my searching.

The only reason we don't have it is we don't have accounts. I'd be happy to add it as a cookie setting for now if people would use it.

I'm also happy to take requests to ban these stupid sites for everyone.

The only reason we don't have it is we don't have accounts.

It's not even really a complaint. I am glad that when I show up on DDG (which I still do several times per day) I get the same high-quality results without regard to who I am.

I'm also happy to take requests to ban these stupid sites for everyone.

The problem is that I have, for example, en.wikipedia.org marked as spam, simply so that their juice doesn't overwhelm my search results. It makes sense for me, but I suspect it's not even close to what your average user wants or expects.

In any case, thanks for the recent addition of non-Google options for searches when DDG runs out of results. Small as it may seem, I consider that a major step in the right direction.

A user marking a site as spam is hardly a problem.

Haven't you used Gmail? If I tag a site as spam, DDG shouldn't show it to me. If a thousand users mark it as spam... then it starts to become clear that it's a shady site, and DDG should eliminate it from its system.
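A minimal model of that Gmail-style aggregation, as a hypothetical sketch: a flagged site disappears immediately for the user who flagged it, but is only escalated for global action once enough distinct users agree (the thousand-user threshold comes from the comment above; it's scaled down in the test):

```python
from collections import Counter

class SpamVotes:
    """Toy model: a flagged site is hidden for the flagging user at once;
    it is only escalated for global review after enough distinct users agree."""

    def __init__(self, review_threshold=1000):
        self.review_threshold = review_threshold
        self.votes = Counter()   # domain -> number of distinct users flagging it
        self.personal = {}       # user -> set of domains hidden for that user

    def flag(self, user, domain):
        hidden = self.personal.setdefault(user, set())
        if domain not in hidden:  # count each user at most once per domain
            hidden.add(domain)
            self.votes[domain] += 1

    def hidden_for(self, user, domain):
        return domain in self.personal.get(user, set())

    def needs_review(self, domain):
        return self.votes[domain] >= self.review_threshold
```
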

Uhm, but now we have a vector for script kiddies to ban sites from a... who knows, maybe in some years... major search engine.

Perhaps definitively removing a site should be done by a human operator.

I find that DDG does a pretty good job of filtering spam sites on its own. That's the biggest reason that I use it.
