

Ask HN: Seeding a new site with data from sites you don't own? - rufugee

I've been developing a Stack Overflow-like site for a niche market. The challenge with question and answer sites like these is that you really have to build up a critical mass of existing answers to attract any attention (from users or search engines, for that matter).

So, I've been thinking about "borrowing" existing questions/answers from similar sites. However, I'm concerned about whether I might get into legal hot water.

I recall that there was a bit of an outcry when it was revealed that www.thedailyplate.com was scraping data from existing sites, but now they're a very popular site with a deal with Lance Armstrong and livestrong.com, so I guess it worked out for them.

What does HN think? Yes, I'd feel a little slimy doing it, but is it a legal thing to do?
======
jacquesm
Look at the bottom of Stack Overflow's pages:

site design and logo is © 2009 stackoverflow.com llc; user contributed content
licensed under cc-wiki with attribution required

<http://creativecommons.org/licenses/by-sa/2.5/>

<http://blog.stackoverflow.com/2009/06/attribution-required/>

So it looks like as long as you play by their rules you'll be fine.

And if you don't play by their rules you've basically told the world now that
you did so with intent.

Also, to save you some time and them a lot of server load here is the data
ready for download:

<http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/>
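
A minimal sketch of what "playing by their rules" could look like when you render a borrowed answer, assuming the rules boil down to naming the source site, linking back to the original question and its author, and linking the license. Check the attribution post linked above for the actual requirements. The `BorrowedAnswer` fields and the `attribution_html` helper are hypothetical names, not anything from the data dump.

```python
from dataclasses import dataclass
from html import escape

# License URL taken from the page footer quoted above.
LICENSE_URL = "http://creativecommons.org/licenses/by-sa/2.5/"


@dataclass
class BorrowedAnswer:
    # Hypothetical fields; map them to whatever your import actually produces.
    source_site: str
    question_title: str
    question_url: str
    author_name: str
    author_url: str


def attribution_html(answer: BorrowedAnswer) -> str:
    """Build a footer that names the source and links the question, author, and license."""
    return (
        f'<p class="attribution">From {escape(answer.source_site)}: '
        f'<a href="{escape(answer.question_url)}">{escape(answer.question_title)}</a> '
        f'by <a href="{escape(answer.author_url)}">{escape(answer.author_name)}</a>, '
        f'licensed under <a href="{LICENSE_URL}">cc-by-sa 2.5</a>.</p>'
    )
```

Storing the source URLs next to each imported answer, rather than reconstructing them later, makes it harder to end up with content you can no longer attribute.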

~~~
rwolf
The sites you will be competing with (at least the ones cc'ing their user
content) would probably say something bombastic like "we're better, so go
ahead and try to use our content to beat us"--I can certainly imagine the kids
at Stack Overflow saying something macho like this on their radio show.

I say go for it. Us HNers have a major crush on Stack Overflow, and if you can
drive us to your site (without spamming the google front page a la Nabble),
you win. Bonus points for cc'ing your user content so the next fad can take
all of your business.

~~~
rufugee
Sorry if I was unclear...I'm not trying to compete with StackOverflow. I'm
trying to borrow answers from places like Yahoo Answers, etc. My site is a
question/answer site for a niche (think quilting, or folks who collect rocks,
or home schoolers...that sort of niche).

~~~
jacquesm
If you are going to copy - not borrow - stuff from Yahoo Answers you probably
should be warned that Yahoo has a small army of lawyers at their disposal who
could probably make your life quite miserable if you started copying their
information.

------
soychicka
I agree with Daniel, and I've had this discussion with others starting up new
sites focusing on UCC recently.

My personal opinion: the only time it's okay to scrape data is when it's
actually "data" - e.g., addresses, events, etc. Scraping conversations or
articles - content that people took the time and energy to craft - even though
it may be legal, just isn't right.

But I've come across this practice with increasing frequency over the past few
months... and every site I've come across that engages in this practice or,
even worse, uses it as their business model has been put on my blacklist
(e.g., BigResource).

Reasoning? It takes 10 sec. for a page to load. After I've seen 6 pages in a
row that have the same exact content, I've wasted over a minute of my life,
and want to throttle someone.

First impressions are the ones that last, and if the first impression a user
gets of your site is one that makes them vow not to use your site again,
you'll end up far less successful than if you were to wait and allow your
content to grow organically.

A better option: look for topics that haven't been covered elsewhere (or at
least topics that aren't easily found through google), and ask and answer the
questions yourself (maybe even using a different account so it looks 'real')
with a mind towards SEO.

And I've found this to be such an annoyance that yes, I'm cross posting this
response on my blog.

~~~
jacquesm
The only way I see that you could use that data in a constructive way is to
mine the hell out of it and to use it to add value to it somehow. That's going
to be pretty tricky but it certainly seems to be a better goal than to just
re-invent the wheel.

------
DanielStraight
This makes me seriously reconsider even posting to StackOverflow anymore. I
_HATE_ websites that just repost stuff from other sites (think Nabble). You
give the appearance of having a community of users (the only thing about sites
like StackOverflow which is actually useful), but you really don't. If you
find something cool on a site like that, there's nothing you can really do
about it. It's like a figure in a wax museum. It looks like the real thing,
but you can't interact with it. If you post a reply, you're going to get
nothing in return, because the community is at StackOverflow, not at the other
site.

To me, this is like lying to your users, and I don't see how anything good can
come of it. I never, NEVER use Nabble despite the fact that they infest Google
results like vermin. Please don't do this.

------
gtani
If you search searchyc.com, Bing, or Google for terms like "mashup copyright",
unfortunately most of the results are about music mashups.

<http://searchyc.com/copyright> Going through this is a legal education in itself.

<http://news.ycombinator.com/item?id=411555>

[http://www.boston.com/business/articles/2009/01/23/lawsuit_o...](http://www.boston.com/business/articles/2009/01/23/lawsuit_over_website_links_in_spotlight/)

And, oh yeah, IANAL, nor am I a biz dev person, an MBA, or anything like that.

