
Ask HN: Should I consider a startup based on scraped data? - un-devmox
A friend who is a sales associate within this particular industry complained to me about how hard and time consuming it can be to search for a particular item. He said if I could build a search engine that searches the top 500-1000 sites for this industry, it could be 'really' valuable. My target market for this search engine would be the owners and associates of the sites I would be scraping.

The data I would be scraping are images and their associated descriptions. I would only store and display thumbnail images. Without an image, the description would be fairly worthless. For each image/item, a link would lead directly to the original website.

One business model I am considering, and the most obvious, is a subscription based web app.

While at PyCon last month I showed a few people a prototype. One person, an employee at Google, said, "Be careful." He was alluding to potential copyright and legal issues. "But," I said, "I'm not really doing anything different than Google." He countered, "Google has lots of lawyers." Ahhhh, message heard loud and clear!

I understand, in general, copyright and fair use [0]. But, I don't want to be writing letters to the owners of the original content arguing this fact let alone wind up in court. What advice or experiences can you share that might be helpful?

[0] http://en.wikipedia.org/wiki/Fair_use
======
kjhughes
First of all, don't _scrape_, and don't call what you're doing _scraping._
Scraping immediately connotes theft in the sense of taking something which is
not meant to be taken.

Instead, _index_. Indexing, on the other hand, connotes supplementation in the
sense of adding value to that which is already there. Have the thumbnails,
excerpts of the descriptions, and whatever secret sauce you've not mentioned
add value to the owners' data. Provide traffic or some other measurable
benefit to them.

Don't rely merely on Fair Use (or weak interpretations of the doctrine).
Provide value to the data owners, and be ready to respect their wishes if they
choose not to accept the value proposition you offer.

------
bengali3
IANAL, but to keep your expenses low and get traction sooner, here's my
advice:

Unless you're going against obvious warnings for each site, then scrape first,
make it free, and ask questions later. If you're successful quickly enough, you
will be a force in your own right and your marketplace will be one where
everyone wants to remain listed. Speed and adoption win; stay under the radar
as long as you can. You want people to love your product so it doesn't get
pulled, or so that people want it back if it does. When you get notified,
respond immediately. Very
important: PROFIT LATER. Once you are taking payments, some could say you are
making money off of their data, and they'll want a piece of that money. If
it's a free service, fewer feathers to ruffle, less of a target. Cease & desist
will stop you from pulling THEIR data. Getting sued for the money you brought
in will ultimately stop you from pulling ANY data.

>a link would lead directly to the original website.

Track this heavily, this is the value you are adding to the data providers you
are scraping from. If they see business growth coming from your space, they'll
support you. Get allies early.

The innocent 'I wanted to build a tool to reduce headaches to help the
community' is the best defense here. (So don't post online anywhere stating
otherwise...) Trying to get approval from a large enough 'chunk' of the data
providers without some numbers behind you is a waste of precious time.

Good Luck!

~~~
un-devmox
Thank you!

>The innocent 'I wanted to build a tool to reduce headaches to help the
community' is best defense here.

Seems like the best defense as well as the truth. I hope they would see it
that way.

>Trying to get approval from a large enough 'chunk' of the data providers
without some numbers behind you is a waste of precious time.

That's part of my dilemma. It would be hard to get some sort of approval
otherwise.

------
Animats
Well, first obey "robots.txt".

Our SiteTruth system does some web scraping. It's looking mostly for the name
and address of the business behind the web site. We're open about this; we use
a user-agent string of "Sitetruth.com site rating system", list that on
"botsvsbrowsers.com" and what we do is documented on our web site. We've had
one complaint in five years, and that was because someone had a security
system which thought our system's behavior resembled some known attack.

About once a month, we check back with each site to see if their name or
address changed. We look at no more than 20 pages per site (if we haven't
found the business address looking in the obvious places, a human wouldn't
have either). So the traffic is very low. Most scraper-oriented sites hit
sites a lot harder than that, enough to be annoying.

We've seen some scraper blocking. We launch up to 3 HTTP requests to the same
site in parallel. A few sites used to refuse to respond if they received more
than three HTTP requests in 10 seconds. That seems to have stopped, though;
with some major browsers now using look-ahead fetching, that's become normal
browser behavior. More sites are using "robots.txt" to block all robots other
than Google, but it's under 1% of the several million web sites we examine.
We're not seeing problems from using our own user-agent string.

So I'd suggest 1) obey "robots.txt", 2) use your own user agent string that
clearly identifies you, and 3) don't hit sites very often. As for what you do
with the data, you need to talk to a lawyer and read _Feist vs. Rural
Telephone_.
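
The three suggestions above could be sketched roughly as follows. This is a minimal illustration and not SiteTruth's actual code: the user-agent string, the `PoliteFetcher` name, and the five-second delay are all assumptions made for the example, and robots.txt is parsed from text you have already downloaded so the sketch runs without network access.

```python
import time
from urllib import robotparser
from urllib.parse import urlparse

# Hypothetical user-agent string; use one that clearly identifies your service.
USER_AGENT = "ExampleIndexer/0.1 (+https://example.com/bot-info)"


class PoliteFetcher:
    """Decides whether, and how soon, a URL may be fetched."""

    def __init__(self, min_delay_seconds=5.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds  # assumed per-host delay; tune it
        self.clock = clock
        self.rules = {}     # host -> RobotFileParser
        self.last_hit = {}  # host -> time of the last request

    def load_robots(self, host, robots_txt):
        """Parse the robots.txt text already downloaded for `host`."""
        parser = robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """True only if the host's robots.txt permits our user agent."""
        parser = self.rules.get(urlparse(url).netloc)
        # No robots.txt loaded for this host yet: be conservative and refuse.
        return parser is not None and parser.can_fetch(USER_AGENT, url)

    def seconds_to_wait(self, url):
        """How long to sleep before hitting this host again."""
        last = self.last_hit.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_hit(self, url):
        """Remember when this host was last contacted."""
        self.last_hit[urlparse(url).netloc] = self.clock()
```

A crawler loop would call `allowed(url)` before each request, sleep for `seconds_to_wait(url)`, send the request with `USER_AGENT` in its headers, and then call `record_hit(url)`.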

------
davidjairala
From personal experience, it's quite the headache, even if you stay within
legal parameters, you will run into site owners who are less than thrilled
about what you're doing (possibly understandably so).

I ran into several people who wrote cease and desists, which I honored, and
into several others who started banning our IP addresses, disallowing us
specifically via robots.txt, etc. There are obviously ways to get around
these issues, but the main question is, morally, would you want to go around
them? Are you willing to go against website owners who flat out don't want you
scraping their data? Would you be willing to fight them legally for your right
to do so?

Ultimately, that's what it came down to for me: I just felt really crappy about
it and stopped.

~~~
runbycomment
Agreed that it can be a headache, but wanted to offer an alternative
perspective.

Personally, I feel that inclusion in Google constitutes public access to the
data. As long as I'm not logged into an account on their system, I feel
ethically justified about scraping their data.

In other words, I do not feel compelled to respect robots.txt if that file
does not also block googlebot.

Legally it may be another issue, but ethically I consider inclusion in Google
as an announcement that this information is public.

~~~
fencepost
Ignoring/bypassing robots.txt is probably a bad idea unless you're going to
never even look for it and are going to try to plead incompetence if someone
comes after you.

In the early stages you probably won't be robots.txt'd because you're
insignificant.

In later stages, you're hoping to not be robots.txt'd because you're providing
a worthwhile service not just for users but for the site.

At neither stage should you force companies that want you not indexing their
content to go beyond basic means (robots.txt) because the more serious
measures are all going to cost them more money (tracking / blocking your IPs,
C&D, DMCA requests to your provider requesting that the entire site be taken
down because there are thousands of infringing items, lawsuits seeking
(damages | court costs | costs for dealing with your circumvention of
technical measures to keep you out of the site), finding of friendly
prosecutors, etc.).

You don't want to go down that more expensive road.

~~~
runbycomment
Also worth mentioning: as long as you're scraping facts and combining them in
a novel way, copyright law is much less relevant.

This operates in what I consider a legal grey area. Not making it obvious that
you're scraping, only scraping public information, transforming the results,
and proxying your requests all contribute to lowering the legal profile (which
is my only concern, as I feel I am acting within my own ethical limits).

~~~
anseljh
Eek. This is only kinda true. You ought to talk to a copyright lawyer and get
a handle on derivative works and data compilations. You can get started by
reading this Supreme Court case:

Feist Publications, Inc. v. Rural Tel. Service Co., 499 U.S. 340 (1991)
[https://casetext.com/case/feist-publications-inc-v-rural-tel...](https://casetext.com/case/feist-publications-inc-v-rural-telephone-service-company-inc)

Disclaimer: IAAL but IAN _Y_ L.

------
btown
> My target market for this search engine would be the owners and associates
> of the sites I would be scraping.

If the product is for competitive analysis or price-comparison purposes, which
is the only conclusion I can draw from that sentence (why else would you
scrape your peers?)... then Market Leader A is highly incentivized to try to
shut down any provider that feeds _their_ content in an actionable way to
their smaller competitors B and C. Even if A could theoretically benefit from
B and C's information just as much, B and C have more to gain than A does, and
that's dangerous to A. And A does have an argument that their proprietary
content is not being used under fair use. It might not stand up in court, but
their legal department can still make your life a living hell, and if they
deem the threat large enough, they probably have enough resources to bleed you
dry without breaking a sweat.

Perhaps the potential upside of addressing this market is worth the legal
risk. I am not a lawyer. But as soon as you get reasonably big, you'll paint a
target on your back.

------
MehdiEG
This is the sort of "startup" that I've seen commonly done by self-proclaimed
"serial entrepreneurs".

Hire a developer for next-to-nothing / hour in the Philippines, India or China.
Get them to build a quick-and-dirty scraping tool that's focused on a specific
industry. Then try to flog it to slightly shady businesses. Try to stay under
the radar for as long as you can and make as much money as you can while
you're there. Sooner or later, you'll get busted and shut down - no big loss
to you.

The people I've seen do that typically have a dozen or so such "startups"
going at any one time and they just keep shutting one down to start another.

This is not the sort of startup that will get you the fame and respect of the
tech startup world. But it can certainly make you money if you have the
"right" mindset. Just don't bet the farm on it.

------
larrydag
Here are my opinions on the matter.

1) Build an MVP prototype with the scraped data. Don't worry about the business
model. Yet, VERY IMPORTANT: make sure you are allowed to scrape the data in the
first place. Work out an agreement that you are interested in the data but
don't give away your methods.

2) Pitch the idea FIRST AND ONLY to the data owners. Suggest to them the
usefulness of their data. They may want to invest in YOU to build it out. If
the data owners are hard to approach then reach out to mentors that have
network connections.

3) Fall back and last resort is to build up your own data. This will be tough
and tricky. You might have to build your own search engine (or similar type
data feeding app). You at least own the data.

In conclusion, content ownership is king in the online media world. Make sure
you follow the appropriate channels. Talk to the data owners about interest in
their data. Get agreements in place for access without giving away
proprietary methods.

~~~
un-devmox
Great advice! Would you build the MVP first before pitching to the data
owners? My prototype only indexed (one time scrape) 10 sites and still relies
on a fair bit of imagination from the business owner. I'm thinking an MVP
would have to index at least 100 or so sites before being at all useful.

~~~
larrydag
It depends on your definition of MVP. I believe MVP is just enough to show you
have a working concept that could have the potential for revenue. Since I'm a
data guy I'm always going to say more data is better.

------
declan
If you honor robots.txt or provide a straightforward way for sites to opt-out
of your search engine, you're in better shape than you would be otherwise.

Google honors robots.txt but few site owners enable it because of the cost of
delisting. By contrast, the cost of delisting from your specialized search
engine is low, so you might see some of your content dry up.

In the U.S., at least, you do not have the legal right to connect to a site if
the owner has requested that you stop -- see eBay v. Bidder's Edge. Fair use
has nothing to do with that point (fair use deals with what use you can make
of the information once you obtain it, not with any right to obtain it in the
first place).

Talking to a lawyer is always good advice.

------
theaccordance
I would be very hesitant to invest or subscribe to a product that solely
relies on data scraping. You're asking for trouble if you don't obtain
permission first to include another company's data within your product.

The one legal case that always comes to mind in terms of data scraping is
Craigslist Inc. v. 3Taps Inc.
[http://en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc](http://en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc).

~~~
nodelessness
There are entire businesses that have full 150-member teams that do "big data"
work that is essentially just scraping a tonne of data off of many places on
the web.

If the company's data is an aggregate from many different sources, how can
the original sources' claim be established?

~~~
fragmede
Plagiarism/copyright infringement can be proven through the use of 'trap
streets' - wait for the site to scrape the fake data off your site, which
could only have come from you, since you made it up.

[http://en.wikipedia.org/wiki/Trap_street](http://en.wikipedia.org/wiki/Trap_street)
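
The mechanics of that check might look something like the sketch below. The function names and the sample entries are hypothetical, and the sketch only shows that a made-up entry reappeared in someone else's dataset; whether that amounts to infringement is a separate legal question.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial reformatting doesn't hide a match."""
    return " ".join(text.lower().split())


def plant_canaries(real_items, canaries):
    """Return the listing to publish: real items plus made-up canary entries."""
    return list(real_items) + list(canaries)


def find_copied_canaries(canaries, suspect_items):
    """Return the canary entries that also appear in another site's dataset."""
    suspect = {normalize(item) for item in suspect_items}
    return [c for c in canaries if normalize(c) in suspect]
```

Because the canary entries were invented, any that turn up elsewhere can only have come from your published listing.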

~~~
pbhjpbhj
>Plagiarism/copyright infringement can be proven through the use of 'trap
streets' //

Copyright doesn't protect data it protects presentation. On a map the part of
the map with the trap street has usually been traced, the presentation is
copied. If you use the same map to compile a street listing then you've not
copied you've used the information embedded in that presentation.

If I create a webpage with all the event information held on a particular pin-
board then that is not copying, if I add a thumbnail or other image of each
poster that is normally copying. The information is free (Database Rights like
EC Directive 96/9/EC notwithstanding).

Plagiarism is not generally illegal except as it imposes on personal
contracts/agreements and on other IPR (eg copyright). For example I can recite
an out-of-copyright work verbatim -- that is a work in the public domain -- on
my website with no attribution (or even a fake one) and there is generally no
tort or crime committed regardless of how morally wrong most people would find
that.

This is not legal advice.

~~~
dragonwriter
> Copyright doesn't protect data it protects presentation. On a map the part
> of the map with the trap street has usually been traced, the presentation is
> copied. If you use the same map to compile a street listing then you've not
> copied you've used the information embedded in that presentation.

It also protects collections of information, and it's quite possible for
repackaging the same collection of information with a different presentation
to be found to be a derivative work. ISTR cases related to copyrighted medical
code sets and the like where this was the case.

~~~
pbhjpbhj
>its quite possible for repackaging the same collection of information with a
different presentation to be found to be a derivative work //

If you do it by copying a creative work.

The collection must be deemed to be a creative work. The information held in a
medical code would be unlikely to be merely factual.

WRT USC see
[http://en.wikipedia.org/wiki/Database_and_Collections_of_Inf...](http://en.wikipedia.org/wiki/Database_and_Collections_of_Information_Misappropriation_Act)
for example.

------
mandeepj
A few tools for you:

[https://import.io/](https://import.io/) - totally free and scrapes data very
quickly

[http://espion.io/](http://espion.io/) - automated headless browser for
scraping data

[https://www.kimonolabs.com/](https://www.kimonolabs.com/) - turns websites
into data APIs

Note - You can't embed images into your site and expect them to be loaded from
another site. Site owners can block this type of behavior to avoid overuse of
bandwidth.

------
waffle_ss
I say go for it. I'm building that exact kind of application right now in my
spare time targeted towards firearms and ammunition (should be launching in
the next couple months). I've contacted a couple sites and one of them even
gave me a dedicated JSON feed that I could use instead of scraping, although I
opted not to use it over data integrity concerns.

I'm being very careful to write polite crawlers, but if a site really doesn't
want me to crawl their site, I would of course de-list them.

Your site model might be a bit different since you say you're targeting the
retailers as users, but I don't anticipate much trouble from my approach as
I'm targeting the consumers and simply driving them towards the retailers'
sites. If anything it's like free advertising for the retailers' products.

edit: also if you're really targeting retailers, Semantics3[1] might already
be doing what you're planning to do (depending on the industry)

[1]: [https://www.semantics3.com/](https://www.semantics3.com/)

------
egze
There are thousands of startups that scrape data and are quite successful. A
certain job listing site comes to mind. Don't worry too much about getting
sued.

------
femto113
This isn't legal advice, just practical advice.

1) Find a way to market this as win-win for you and the scraped sites. If
you're perceived as a net benefit to all involved, you will probably succeed.
If you're not then you won't (for any of a number of reasons, including legal
conflicts).

2) It is immeasurably easier to get forgiveness than permission, so I would
not even try for the latter. That said, you should honor any predeclared
restrictions like robots.txt or clear terms of service.

3) Test out traction and interest as quickly and cheaply as possible. Launch
as soon as you've got something usable (don't sweat whether it is "useful", as
that is not really your decision to make).

------
davemel37
I once posted a scraping gig on getafreelancer and got a terrifying private
message from a detective in Kentucky, which in turn got my account banned.

Turns out the site owner's brother was a Supreme Court Judge in Kentucky.

Legal or not... be prepared to piss off some people, and some of those people
might even have political clout.

I guess you have to break a few eggs to make an omelet. Good Luck.

------
crdb
You might want to look at the YC-backed company Semantics3, which has indexed
60 million unique products and over 4 billion URLs... all their data is
available as APIs with pricing proportional to the number of API calls:
[https://www.semantics3.com/](https://www.semantics3.com/)

------
sourabh86
I have two android apps which depend on scraped data. I took permission in one
case (good people at basecamp did not mind!) and did not require any
permission for another because I was showing data only to the intended user
(just in a handy way). My learning...Never be totally dependent on someone
else's website/product. The second website went down around 15-20 days back
because of some country wide server upgradation activity and my app
installs/rating are going down since then. All those people who were giving 5
stars and praising the app are now abusing it with one star!

------
meritt
Are you certain something like this doesn't exist already? If there are
500-1000 sites I gotta imagine someone has already built this. Shopping feed /
aggregators are nothing new.

e.g. [http://searchenginewatch.com/sew/study/2097413/shopping-engi...](http://searchenginewatch.com/sew/study/2097413/shopping-engines)

------
gargarplex
Consider a business model like 'Magic' where people pay you to search, and
your employees leverage your internal system (built on scraping) to deliver
excellent results.

~~~
un-devmox
Thanks, I am also considering that type of model as well.

------
kaolinite
As ever, I'm not a lawyer - you should talk to one. However:

I suspect that if you ever become big enough to start getting legal threats
from those sites, you'll already be in a pretty good place. I wouldn't worry
about legal stuff yet, the main problem is actually building the thing. That
said, make sure you set up a limited liability company and, as far as I know,
you should be safe.

As you're scraping 500-1000 different websites, if one or two complain, you
can just remove them from the website. They probably won't want to anyway if
their competitors are on there too.

You should probably make sure you have a link on the website to a
complaints/takedown page too.

~~~
wspeirs
I agree, I'd "go for it" and worry when it becomes a real problem. If you grow
fast enough, then everyone will want to be on there (maybe you could even sell
access to the #1 spot) much like Google. If you don't get any traction, then
no one will notice or care.

I'd chalk this up as a "good problem to have"... then again, Grooveshark
probably thought the same thing:
[https://news.ycombinator.com/item?id=9468476](https://news.ycombinator.com/item?id=9468476)

------
markbnj
I spent the last two years building, deploying, and maintaining a pretty large
custom search engine based on scraping. I agree with most/all of the business
comments made in the thread. From a technical perspective the main thing to
keep in mind is that scraping is a dirty process, more so when you're scraping
from smaller firms that often have out of date and quite horrible sites. It's
not something you can build and just run. Sites will break, fail to respond,
change their markup, etc. Your system has to be very tolerant, or you'll be in
babysitting mode 24 x 7.
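
One way to get that kind of tolerance is to isolate each site's failures so a single broken site never stops the whole run. The sketch below assumes you supply your own `fetch` and `parse` callables; `scrape_all` and its parameters are hypothetical names, and a real system would add backoff and alerting on top.

```python
def scrape_all(sites, fetch, parse, max_attempts=3):
    """Scrape every site, recording failures instead of aborting the run.

    `fetch(site)` returns a page and `parse(page)` returns a list of items;
    both are supplied by the caller and may raise any exception.
    """
    results, failures = {}, {}
    for site in sites:
        for attempt in range(1, max_attempts + 1):
            try:
                items = parse(fetch(site))
                if not items:
                    # An empty result usually means the markup changed.
                    raise ValueError("parsed zero items; markup may have changed")
                results[site] = items
                failures.pop(site, None)  # succeeded after earlier failures
                break
            except Exception as exc:  # deliberately broad: sites fail in many ways
                failures[site] = f"attempt {attempt}: {exc}"
    return results, failures
```

The `failures` mapping gives you a list of sites to review by hand, rather than a crashed job.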

------
taylorwc
> He said if I could build a search engine that searches the top 500-1000
> sites for this industry, it could be 'really' valuable

Let's say that you can get past any legal issues with scraping... Don't dive
into a startup based solely on this anecdote. Figure out what you can do to
size the market. He says it's valuable. How valuable? Do customers understand
why they need this? Are they spending any money on something similar today? How
would you target them and sell to them?

I'd treat these questions as equally important as the legality when it comes
to "should I start?"

------
wrath
I've built several businesses that either relied in part on scraping/indexing
websites or relied solely on scraping/indexing websites. We never achieved the
success of Google but we did get large enough to be noticed by some sites
(Amazon for example). We did face legal issues but of a different kind. There
were a few bugs early on that made us hit websites too much and we did receive
a couple of cease and desist letters. We fixed our problem and explained the
situation to the site owner and everything was resolved.

The only "fair use" type issue that we encountered was using logos from
websites, e.g. displaying the logos of the websites we indexed on our site.
Once again, nothing serious came of it. I believe our marketing department
removed the logos and put text instead.

Personally, I wouldn't worry about these issues until it becomes a problem.
When it becomes a problem it means you're on to something and you're
disruptive enough to get some attention. It's a good problem to have IMO.

------
flog
I built a product around the Twitter firehose, which was a publicly published,
accessible, terms-of-service'd data stream... then they killed our (and many
others') products when they closed up access. Keep that in mind, and add on
the risk that you're probably not even allowed access to scraped data, and I
would suggest the answer is "No."

------
meshko
How about you... talk to a lawyer?

~~~
GFischer
That's good advice, but you also have to know when it's good to ignore legal
advice (it depends on how risk averse you are I guess :) ).

Many startups flout current laws and are very successful (see AirBnB or Uber).
I think PG wrote something on this (mostly on the "hackers beat the system"
sense).

------
jabagonuts
I don't know about the legal aspects, but if the sites you are scraping do not
want to be scraped, it can turn into an arms race. They figure out a way to
block your scrapers, you figure out a way around it, they block you again and
so on. Even if it is legal, there are plenty of other things I'd rather spend
my time building.

------
kujenga
What this boils down to is incentives. Most of the issues with copyright that
bring on legal action come from sites that aggregate data which they did not
create, and then market in some way to make advertising revenue off of it.
Sites that aggregate and repost news stories, for example, fall under this
category because they end up taking advertising revenue from the sites which
they draw their content from. Content creators in this area will then
aggressively go after these sites because they hurt the bottom line.

On the other hand, your concept sounds like it would draw business to this
industry, so the incentives may very well align with the very companies whose
data you are scraping. I worked on a concept for a startup where we had
similar concerns, but in our research we never had any issues come up because our aims
were aligned with the providers of the data that we were scraping.

------
swehner
Can you just set up a custom search engine with the URLs of the sites for the
industry? [https://cse.google.com/cse/](https://cse.google.com/cse/)

Your effort would be just to paste in those URLs, no need to develop /
maintain any site of your own.

------
maxxxxx
In a sense that's what Google does.

Some years ago I did a project that involved scraping and we got some letters
from lawyers and got blocked from some websites. Make sure to know what the
legal situation is, otherwise lawyers' letters can be scary.

------
tacostakohashi
It sounds like you have a good handle on the legalities, but on a more
practical level if the sites you are scraping don't want to be scraped, it
would be pretty easy for them to block you, obfuscate/change the page
structure at any time to make your scraping impractical, etc. Of course, you
will be able to play along too by obfuscating your source address and
improving your scraping, but it could turn into a time consuming game of walls
and ladders.

Of course your startup idea may still be worthwhile, but in the longer term
you'll be at the mercy of the content owners (who might even be fine with it,
or want to acquire you).

------
btbuildem
If you're small, no one will notice / be worried that you're scraping their
data, and it won't be worth their while to sue you.

If you do make it big, hopefully you will have enough profits to play the
lawyer game.

------
georgespencer
Ahhh, you aren't scraping, you're building an aggregator! Google _does_ have a
bunch of lawyers but the hardest part is building and selling something. Solve
any perceived illegality later.

------
ian_d
I think getting sued is really dependent on the size of the company you're
scraping vs. what kind of 'business' you're cutting out from under them.
There's a number of sites like [https://gripsweat.com](https://gripsweat.com)
(mine) that are important to collectors/niche users but are basically too small
to bother with otherwise.

~~~
moepstar
Where are you getting the data from and how? API or via scraping?

------
jbrisson
Apart from the legal aspects, a problem I see is this: the day you have X
subscribers paying to get your (aggregated) content, what if some sources
(playing cat/mouse with you, or not) refactor their web sites (basically
F*#king your data pipes)? Then you'll have to turn around quite fast because
you'll have tens and tens of customers yelling at you.

------
adanto6840
Mind adding your email address to your profile, or a throw-away email? Or just
shoot me an email (email is in my profile).

I've been down this path before and would love to chat and am happy to provide
any help or guidance that I can. I have no "golden" answers FWIW, but I do
know the positives and negatives, and have even been to Federal court WRT
scraping. :-)

------
GFischer
IANAL, but I considered something of the sort (a site aggregating real estate
listings), and was immediately warned of the legal implications.

[http://fairuse.stanford.edu/overview/website-permissions/lin...](http://fairuse.stanford.edu/overview/website-permissions/linking/)

------
aditiyaa1
There are already existing providers for this kind of service.
Ex: [http://kapowsoftware.com/](http://kapowsoftware.com/)

I am not sure, what is different in your approach?

------
polyomino
Start with scraped data, then when they start blocking you (legally or
otherwise) create a predictive market.

Then, you're a platform for other people's scrapers and you'll provide
perfect(ish) data.

------
jayzalowitz
Speaking as someone whose previous startup did something not unlike what you
are going after:

You don't need to worry if you respect the robots.txt and such.

