Ask HN: Should I consider a startup based on scraped data?
111 points by un-devmox on May 5, 2015 | 70 comments
A friend who is a sales associate within this particular industry complained to me about how hard and time consuming it can be to search for a particular item. He said if I could build a search engine that searches the top 500-1000 sites for this industry, it could be 'really' valuable. My target market for this search engine would be the owners and associates of the sites I would be scraping.

The data I would be scraping are images and their associated descriptions. I would only store and display thumbnail images. Without an image, the description would be fairly worthless. For each image/item, a link would lead directly to the original website.

One business model I am considering, and the most obvious, is a subscription based web app.

While at PyCon last month I showed a few people a prototype. One person, an employee at Google, said, "Be careful." He was alluding to potential copyright and legal issues. "But," I said, "I'm not really doing anything different than Google." He countered, "Google has lots of lawyers." Ahhhh, message heard loud and clear!

I understand, in general, copyright and fair use [0]. But I don't want to be writing letters to the owners of the original content arguing this fact, let alone wind up in court. What advice or experiences can you share that might be helpful?

[0] http://en.wikipedia.org/wiki/Fair_use

First of all, don't scrape, and don't call what you're doing scraping. Scraping immediately connotes theft in the sense of taking something which is not meant to be taken.

Instead, index. Indexing, on the other hand, connotes supplementation in the sense of adding value to that which is already there. Have the thumbnails, excerpts of the descriptions, and whatever secret sauce you've not mentioned add value to the owners' data. Provide traffic or some other measurable benefit to them.

Don't rely merely on Fair Use (or weak interpretations of the doctrine). Provide value to the data owners, and be ready to respect their wishes if they choose not to accept the value proposition you offer.

IANAL, but to keep your expenses low and get traction sooner, here's my advice:

Unless you're going against obvious warnings for each site, scrape first, make it free, and ask questions later. If you're successful quickly enough, you will be a force in your own right and your marketplace will be one where everyone wants to remain listed. Speed & adoption wins; stay under the radar as long as you can. You want people to love your product so it doesn't get pulled and/or makes people want it back. When you get notified, respond immediately. Very important: PROFIT LATER. Once you are taking payments, some could say you are making money off of their data, and they'll want a piece of that money. If it's a free service, there are fewer feathers to ruffle and you're less of a target. Cease & desist will stop you from pulling THEIR data. Getting sued for the money you brought in will ultimately stop you from pulling ANY data.

>a link would lead directly to the original website.

Track this heavily, this is the value you are adding to the data providers you are scraping from. If they see business growth coming from your space, they'll support you. Get allies early.
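A minimal sketch of what tracking those click-throughs could look like, assuming each outbound link passes through your own redirect handler first. The `OutboundClickLog` name and the example URLs are hypothetical, and a real service would persist the counts rather than keep them in memory.

```python
from collections import Counter
from urllib.parse import urlparse


class OutboundClickLog:
    """Counts click-throughs per destination site (in memory only)."""

    def __init__(self):
        self._clicks = Counter()

    def record(self, destination_url):
        # Group by hostname so every item page on one site counts together.
        host = urlparse(destination_url).hostname or "unknown"
        self._clicks[host] += 1

    def report(self):
        """Return (host, clicks) pairs, most referrals first."""
        return self._clicks.most_common()
```

A per-site report like this is the kind of number you could show a data provider when making the case that you send them traffic.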

The innocent 'I wanted to build a tool to reduce headaches to help the community' is the best defense here. (So don't post online anywhere stating otherwise...) Trying to get approval from a large enough 'chunk' of the data providers without some numbers behind you is a waste of precious time.

Good Luck!

Thank you!

>The innocent 'I wanted to build a tool to reduce headaches to help the community' is best defense here.

Seems like the best defense as well as the truth. I hope they would see it that way.

>Trying to get approval from a large enough 'chunk' of the data providers without some numbers behind you is a waste of precious time.

That's part of my dilemma. It would be hard to get some sort of approval otherwise.

Well, first obey "robots.txt".

Our SiteTruth system does some web scraping. It's looking mostly for the name and address of the business behind the web site. We're open about this; we use a user-agent string of "Sitetruth.com site rating system", list that on "botsvsbrowsers.com" and what we do is documented on our web site. We've had one complaint in five years, and that was because someone had a security system which thought our system's behavior resembled some known attack.

About once a month, we check back with each site to see if their name or address changed. We look at no more than 20 pages per site (if we haven't found the business address looking in the obvious places, a human wouldn't have either). So the traffic is very low. Most scraper-oriented sites hit sites a lot harder than that, enough to be annoying.

We've seen some scraper blocking. We launch up to 3 HTTP requests to the same site in parallel. A few sites used to refuse to respond if they receive more than three HTTP requests in 10 seconds. That seems to have stopped, though; with some major browsers now using look-ahead fetching, that's become normal browser behavior. More sites are using "robots.txt" to block all robots other than Google, but it's under 1% of the several million web sites we examine. We're not seeing problems from using our own user-agent string.

So I'd suggest 1) obey "robots.txt", 2) use your own user agent string that clearly identifies you, and 3) don't hit sites very often. As for what you do with the data, you need to talk to a lawyer and read Feist vs. Rural Telephone.
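A minimal sketch of those three suggestions in Python, standard library only. The user-agent string, the `PoliteCrawlPolicy` name, and the 10-second default delay are illustrative assumptions, not SiteTruth's actual code. The robots.txt body is passed in as a string so the sketch runs without network access.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical user agent that clearly identifies the crawler.
USER_AGENT = "ExampleIndexBot"


class PoliteCrawlPolicy:
    """Decides whether and when a URL may be fetched."""

    def __init__(self, min_delay_seconds=10.0):
        self.min_delay_seconds = min_delay_seconds
        self._robots = {}      # host -> RobotFileParser
        self._last_fetch = {}  # host -> time of the last request

    def load_robots(self, host, robots_txt):
        """Register the robots.txt body already downloaded for a host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def allowed(self, url):
        """True only if the host's robots.txt permits our user agent."""
        parser = self._robots.get(urlparse(url).netloc)
        # Unknown host: refuse until its robots.txt has been loaded.
        return parser is not None and parser.can_fetch(USER_AGENT, url)

    def seconds_until_ready(self, url, now=None):
        """How long to wait before hitting this host again."""
        now = time.monotonic() if now is None else now
        last = self._last_fetch.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay_seconds - (now - last))

    def record_fetch(self, url, now=None):
        """Note that a request to this URL's host was just made."""
        now = time.monotonic() if now is None else now
        self._last_fetch[urlparse(url).netloc] = now
```

The crawler would call `allowed()` and `seconds_until_ready()` before each request and send `USER_AGENT` in the request headers.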

From personal experience, it's quite the headache: even if you stay within legal parameters, you will run into site owners who are less than thrilled about what you're doing (possibly understandably so).

I ran into several people who wrote cease and desists, which I honored, and into several others who started banning our IP addresses, disallowing us specifically via robots.txt, etc. There are obviously ways to get around these issues, but the main question is, morally, would you want to go around them? Are you willing to go against website owners who flat out don't want you scraping their data? Would you be willing to fight them legally for your right to do so?

Ultimately, that's what it came down to for me: I just felt really crappy about it and stopped.

Agreed that it can be a headache, but wanted to offer an alternative perspective.

Personally, I feel that inclusion in Google constitutes public access to the data. As long as I'm not logged into an account on their system, I feel ethically justified about scraping their data.

In other words, I do not feel compelled to respect robots.txt if that file does not also block googlebot.

Legally it may be another issue, but ethically I consider inclusion in Google as an announcement that this information is public.

Ignoring/bypassing robots.txt is probably a bad idea unless you're going to never even look for it and are going to try to plead incompetence if someone comes after you.

In the early stages you probably won't be robots.txt'd because you're insignificant.

In later stages, you're hoping to not be robots.txt'd because you're providing a worthwhile service not just for users but for the site.

At neither stage should you force companies that want you not indexing their content to go beyond basic means (robots.txt) because the more serious measures are all going to cost them more money (tracking / blocking your IPs, C&D, DMCA requests to your provider requesting that the entire site be taken down because there are thousands of infringing items, lawsuits seeking (damages | court costs | costs for dealing with your circumvention of technical measures to keep you out of the site), finding of friendly prosecutors, etc.).

You don't want to go down that more expensive road.

Also worth mentioning: as long as you're scraping facts and combining them in a novel way, copyright law is much less relevant.

This operates in what I consider a legal grey area. Not making it obvious that you're scraping, only scraping public information, transforming the results, and proxying your requests all contribute to lowering the legal profile (which is my only concern, as I feel I am acting within my own ethical limits).

Eek. This is only kinda true. You ought to talk to a copyright lawyer and get a handle on derivative works and data compilations. You can get started by reading this Supreme Court case:

Feist Publications, Inc. v. Rural Tel. Service Co., 499 U.S. 340 (1991) https://casetext.com/case/feist-publications-inc-v-rural-tel...

Disclaimer: IAAL but IANYL.

In your opinion, how much of that situation's complexity is eliminated simply by scraping the Google cache of a site?

I also wonder how possible it is to hide behind proxies, especially if they are owned by entities in other countries. If a site I'm scraping is unable to identify who does the scraping, it seems difficult for them to prove "this guy uses our data and must be scraping us".

The more you have to jump through hoops to get the data (or hide that you're getting it or that you're the one getting it), the more it sounds like doing this for the wrong reasons.

Also, since this is presumably something you're going to be doing as a hobby (money creates trails), the unfortunate reality is that "right" and "wrong" in copyright law matter much less than "Oh crap, I'm being sued for $500k in $further_away(New York|California), how do I defend this?" That's why you don't ignore the polite way of saying "go away" which is robots.txt or the rude way which is a C&D - if a lawsuit (the mean way) is the first communication you have from a company, odds are pretty good that an attorney can help because judges are busy and don't want lawsuits to be the first thing unhappy companies try.

I understand what you're saying, we just come from very different perspectives. Most of my results are after significant transformation and combination, resulting in models to test against. I'm not very concerned with copyright violation, as I rarely (never?) re-publish copyrighted information.

Have there been any court cases where a person scraping public information has been found in the wrong? I know of the LinkedIn case from Jan 2014, but in that case the offenders were creating LI accounts to scrape private information. I believe that Craigslist lost its case against e.g. padmapper, didn't they?

While I respect what you're saying in your first sentence, I view it differently. Setting aside the legal issues, I see it as someone trying to control use in a public space. I don't consider that a valid reason -- if it's public, I can consume it. Avoiding detection is a reaction to sites trying to create rules that I interpret as invalid.

If a company tried to block off a public road without legal backing, I would consider it not only my right but also my duty to traverse that road. [mediocre analogy, but it does represent my opinion fairly accurately.]

The things that jump out at me there are "that I interpret as invalid" and "Have there been any court cases where a person scraping public information has been found in the wrong?"

Tackling the second one first, I'd like to rephrase that: "Have all the court cases where a person was scraping public information been found in their favor and they were awarded all attorney fees and expenses?"

As far as "that I interpret as invalid" the courts exist to decide between varying interpretations of rights and laws. I've never heard that "inexpensively" was expected to be part of that description. I'm not saying that you're wrong - I'm just saying that there's a significant difference between "I'm taking on a coding and data analysis project" and "I'm taking on a coding and data analysis project with a big helping of legal distractions."

I'm not fully up on the Craigslist vs padmapper/3taps case - was it ever actually fully decided? And how much did fighting that case cost 3taps? Looking at the statement on their website it doesn't sound all that victorious, and I can't help but suspect that even ignoring whatever financial impact there was the distraction and demands of the case must have had a serious effect on any projects 3taps was working on (or considering and back-burnering) during that time.

As a counterexample since you said you were going to be keeping and displaying thumbnails, I'll toss out the artwork from "Kind of Bloop" (see http://waxy.org/2011/06/kind_of_screwed/) which was a highly-pixelated (and maybe only 8-color?) transformation of a photo of Miles Davis. TL;DR, Andy Baio ended up paying ~$32k to settle the case not because he thought he was wrong but because it was the least expensive option.

I'm not saying don't do it - I'm just saying that you should go into it with your eyes open and don't do things that will exacerbate any non-technical problems you may run into. That may be a chilling effect, but at least you can bring a coat.

Your public road analogy is very wrong. A better analogy would be a private road with a sign saying "Google streetview welcome. runbycomment stay out." Would you feel entitled to drive down the private road? Would the owner allowing Google to drive down the road make you feel entitled to do it?

We aren't discussing a public space. We're talking about a private server. They pay for hosting and bandwidth. Why do you feel entitled to use it?

Why is ignoring robots.txt a bad idea? The information's being made publicly accessible...

For the crass and practical reason, because A) Anyone can sue for anything (caveat: as long as it's not so egregiously stupid as to get them slapped down by a judge) B) techies' definitions of "egregiously stupid" and judges' definitions of "egregiously stupid" may not have very much overlap

As a simple example imagine that the owner of a local shop REALLY didn't like you to the extent that he had his door painted with six-inch letters at eye level "PNathan KEEP OUT!" It's a publicly accessible shop, but if you walk in and he calls the police, will your having ignored that sign make a difference in their interactions with you? How about if you've both ignored that sign and come in wearing a disguise?

I like the analogy.

> ...will your having ignored that sign make a difference in their interactions with you?

For sure.

> How about if you've both ignored that sign and come in wearing a disguise?

Not if he finds out. But the disguise will make it even worse if he does find out.

So it boils down to: Can you hide yourself well enough to not be detected (this includes being detected by showing information that was presumably crawled, rather than by someone detecting the process of crawling)? It is a risk that you may take by weighing assumed loss (court case) against gain (money from using crawled data).

I may add: A clever "data provider" will inject some hidden beacons into their data that make it easy for them to later detect that data on other websites. So actually you can always be detected, because you must have crawled that data from them.

Just to make sure I understand your reply correctly, are you saying that if a robots.txt file disallows your specific crawler but allows googlebot you'd see no problem with crawling it?

> My target market for this search engine would be the owners and associates of the sites I would be scraping.

If the product is for competitive analysis or price-comparison purposes, which is the only conclusion I can draw from that sentence (why else would you scrape your peers?)... then Market Leader A is highly incentivized to try to shut down any provider that feeds their content in an actionable way to their smaller competitors B and C. Even if A could theoretically benefit from B and C's information just as much, B and C have more to gain than A does, and that's dangerous to A. And A does have an argument that their proprietary content is not being used under fair use. It might not stand up in court, but their legal department can still make your life a living hell, and if they deem the threat large enough, they probably have enough resources to bleed you dry without breaking a sweat.

Perhaps the potential upside of addressing this market is worth the legal risk. I am not a lawyer. But as soon as you get reasonably big, you'll paint a target on your back.

This is the sort of "startup" that I've seen commonly done by self-proclaimed "serial entrepreneurs".

Hire a developer for next-to-nothing / hour in the Philippines, India or China. Get them to build a quick-and-dirty scraping tool that's focused on a specific industry. Then try to flog it to slightly shady businesses. Try to stay under the radar for as long as you can and make as much money as you can while you're there. Sooner or later, you'll get busted and shut down - no big loss to you.

The people I've seen do that typically have a dozen or so such "startups" going at any one time and they just keep shutting one down to start another.

This is not the sort of startup that will get you the fame and respect of the tech startup world. But it can certainly make you money if you have the "right" mindset. Just don't bet the farm on it.

Here are my opinions on the matter.

1) Build an MVP prototype with the scraped data. Don't worry about the business model. Yet, VERY IMPORTANT: make sure you are allowed to scrape the data in the first place. Work out an agreement that you are interested in the data but don't give away your methods.

2) Pitch the idea FIRST AND ONLY to the data owners. Suggest to them the usefulness of their data. They may want to invest in YOU to build it out. If the data owners are hard to approach, then reach out to mentors who have network connections.

3) The fallback and last resort is to build up your own data. This will be tough and tricky. You might have to build your own search engine (or similar type of data-feeding app). You at least own the data.

In conclusion, content ownership is king in the online media world. Make sure you follow the appropriate channels. Talk to the data owners about interest in their data. Get agreements in place for access without giving away proprietary methods.

Great advice! Would you build the MVP first before pitching to the data owners? My prototype only indexed (one time scrape) 10 sites and still relies on a fair bit of imagination from the business owner. I'm thinking an MVP would have to index at least 100 or so sites before being at all useful.

It depends on your definition of MVP. I believe MVP is just enough to show you have a working concept that could have the potential for revenue. Since I'm a data guy I'm always going to say more data is better.

If you honor robots.txt or provide a straightforward way for sites to opt-out of your search engine, you're in better shape than you would be otherwise.

Google honors robots.txt but few site owners enable it because of the cost of delisting. By contrast, the cost of delisting from your specialized search engine is low, so you might see some of your content dry up.

In the U.S., at least, you do not have the legal right to connect to a site if the owner has requested that you stop -- see eBay v. Bidder's Edge. Fair use has nothing to do with that point (fair use deals with what use you can make of the information once you obtain it, not with any right to obtain it in the first place).

Talking to a lawyer is always good advice.

I would be very hesitant to invest or subscribe to a product that solely relies on data scraping. You're asking for trouble if you don't obtain permission first to include another company's data within your product.

The one legal case that always comes to mind in terms of data scraping is Craigslist Inc. v. 3Taps Inc. http://en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc.

There are entire businesses with full 150-member teams that do "big data" work that is essentially just scraping a tonne of data off of many places on the web.

If the company's data is an aggregate from many different sources, how can the original sources' claims be established?

Plagiarism/copyright infringement can be proven through the use of 'trap streets' - wait for the site to scrape the fake data off your site, which could only have come from you, since you made it up.
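A rough sketch of how a data owner might check for planted entries in someone else's listings. The function names and the sample phrase are made up for illustration, and matching on normalized substrings is just one simple choice among many.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial reformatting does not hide a match."""
    return " ".join(text.lower().split())


def find_trap_entries(trap_phrases, scraped_descriptions):
    """Return the planted phrases that show up in another site's descriptions."""
    haystack = [normalize(description) for description in scraped_descriptions]
    hits = []
    for phrase in trap_phrases:
        needle = normalize(phrase)
        # A fabricated phrase can only have been copied from the site that invented it.
        if any(needle in text for text in haystack):
            hits.append(phrase)
    return hits
```

For example, if the invented phrase "Model ZX-4471 Quillon" appears only on your site, finding it in a competitor's index is a sign the data was copied from you.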


>Plagiarism/copyright infringement can be proven through the use of 'trap streets' //

Copyright doesn't protect data it protects presentation. On a map the part of the map with the trap street has usually been traced, the presentation is copied. If you use the same map to compile a street listing then you've not copied you've used the information embedded in that presentation.

If I create a webpage with all the event information held on a particular pin-board then that is not copying; if I add a thumbnail or other image of each poster, that is normally copying. The information is free (Database Rights like EC Directive 96/9/EC notwithstanding).

Plagiarism is not generally illegal except as it imposes on personal contracts/agreements and on other IPR (eg copyright). For example I can recite an out-of-copyright work verbatim -- that is a work in the public domain -- on my website with no attribution (or even a fake one) and there is generally no tort or crime committed regardless of how morally wrong most people would find that.

This is not legal advice.

> Copyright doesn't protect data it protects presentation. On a map the part of the map with the trap street has usually been traced, the presentation is copied. If you use the same map to compile a street listing then you've not copied you've used the information embedded in that presentation.

It also protects collections of information, and it's quite possible for repackaging the same collection of information with a different presentation to be found to be a derivative work. ISTR cases related to copyrighted medical code sets and the like where this was the case.

>it's quite possible for repackaging the same collection of information with a different presentation to be found to be a derivative work //

If you do it by copying a creative work.

The collection must be deemed to be a creative work. The information held in a medical code would be unlikely to be merely factual.

WRT USC see http://en.wikipedia.org/wiki/Database_and_Collections_of_Inf... for example.

A lawyer would probably be a better person to answer this question, but if I had to venture a guess:

1. Identifying products that the source company is the sole distributor of

2. Logs that are able to identify the crawler and what it accessed.

As far as I know, CL "lost" on that one - as part of the proceedings they claimed that the ads posted by users were CL's property, people freaked out about it, and the court set a precedent that the data are the property of those who posted it (ie, CL doesn't own the content of the postings). So 3taps (and Padmapper, which CL's legal team initially attacked) keep ticking along.

That case is still alive and kicking, actually. Craigslist just filed a motion to amend its complaint (again).

Full docket: https://www.pacerpro.com/cases/158665

Motion: https://s3.amazonaws.com/pacer-documents/N.D.%20Cal.%2012-cv...

Thank you - good to know of the new developments.

> A friend who is a sales associate ..

I don't think the target user will care where the data comes from if it saves/makes them money.

> I don't think the target user will care where the data comes from if it saves/makes them money.

I may also be scraping the targeted user or targeted user boss' website. That's part of the conundrum.

If you are potentially giving them free marketing and possibly driving them more/better sales then it shouldn't be an issue, right? If they think their sales will be cut because of your tool, they will raise flags. If they think it's a great resource, they'll share it.

Trying to relate: I worked with a transportation brokerage firm. Many fleet 'operators' did retail and wholesale sales (ie to brokers like my co & other fleet operators). It was an interesting dynamic, but most of the sales team was only worried about sales and quality. For the fleet operators, retail clients were more profitable, but wholesale made up the volume.

One issue could be if you are showing a 'price estimate' where they would not like one shown, but ask your sources what their issues would be. Your user base is the asset and your ally, make sure they are comfortable with what they're getting in return.

Few tools for you -

https://import.io/ - totally free and scrapes data very quickly

http://espion.io/ - automated headless browser for scraping data

https://www.kimonolabs.com/ - turns websites into data APIs

Note - You can't embed images into your site and expect them to be loaded from another site. Site owners can block this type of behavior to avoid overuse of bandwidth.

I say go for it. I'm building that exact kind of application right now in my spare time targeted towards firearms and ammunition (should be launching in the next couple months). I've contacted a couple sites and one of them even gave me a dedicated JSON feed that I could use instead of scraping, although I opted not to use it over data integrity concerns.

I'm being very careful to write polite crawlers, but if a site really doesn't want me to crawl their site, I would of course de-list them.

Your site model might be a bit different since you say you're targeting the retailers as users, but I don't anticipate much trouble from my approach as I'm targeting the consumers and simply driving them towards the retailers' sites. If anything it's like free advertising for the retailers' products.

edit: also if you're really targeting retailers, Semantics3[1] might already be doing what you're planning to do (depending on the industry)

[1]: https://www.semantics3.com/

There are thousands of startups that scrape data and are quite successful. A certain job listing site comes to mind. Don't worry too much about getting sued.

This isn't legal advice, just practical advice.

1) Find a way to market this as win-win for you and the scraped sites. If you're perceived as a net benefit to all involved, you will probably succeed. If you're not then you won't (for any of a number of reasons, including legal conflicts).

2) It is immeasurably easier to get forgiveness than permission, so I would not even try for the latter. That said you should honor any predeclared restrictions like robots.txt or clear terms of service.

3) Test out traction and interest as quickly and cheaply as possible. Launch as soon as you've got something usable (don't sweat whether it is "useful", as that is not really your decision to make).

I once posted a scraping gig on getafreelancer and got a terrifying private message from a detective in Kentucky, which in turn got my account banned.

Turns out the site owner's brother was a Supreme Court Judge in Kentucky.

Legal or Not...Be prepared to piss off some people, and some of those people might even have political clout.

I guess you have to break a few eggs to make an omelet. Good Luck.

You might want to look at the YC-backed company Semantics3, which has indexed 60 million unique products and over 4 billion URLs... all their data is available as APIs with pricing proportional to the number of API calls: https://www.semantics3.com/

I have two android apps which depend on scraped data. I took permission in one case (good people at basecamp did not mind!) and did not require any permission for another because I was showing data only to the intended user (just in a handy way). My learning...Never be totally dependent on someone else's website/product. The second website went down around 15-20 days back because of some country wide server upgradation activity and my app installs/rating are going down since then. All those people who were giving 5 stars and praising the app are now abusing it with one star!

Are you certain something like this doesn't exist already? If there are 500-1000 sites I gotta imagine someone has already built this. Shopping feed / aggregators are nothing new.

e.g. http://searchenginewatch.com/sew/study/2097413/shopping-engi...

Consider a business model like 'Magic' where people pay you to search, and your employees leverage your internal system (built on scraping) to deliver excellent results.

Thanks, I am also considering that type of model as well.

As ever, I'm not a lawyer - you should talk to one. However:

I suspect that if you ever become big enough to start getting legal threats from those sites, you'll already be in a pretty good place. I wouldn't worry about legal stuff yet, the main problem is actually building the thing. That said, make sure you set up a limited liability company and, as far as I know, you should be safe.

As you're scraping 500-1000 different websites, if one or two complain, you can just remove them from the website. They probably won't want to be removed anyway if their competitors are on there too.

You should probably make sure you have a link on the website to a complaints/takedown page too.

I agree, I'd "go for it" and worry when it becomes a real problem. If you grow fast enough, then everyone will want to be on there (maybe you could even sell access to the #1 spot) much like Google. If you don't get any traction, then no one will notice or care.

I'd chalk this up as a "good problem to have"... then again, Grooveshark probably thought the same thing: https://news.ycombinator.com/item?id=9468476

I spent the last two years building, deploying, and maintaining a pretty large custom search engine based on scraping. I agree with most/all of the business comments made in the thread. From a technical perspective the main thing to keep in mind is that scraping is a dirty process, more so when you're scraping from smaller firms that often have out of date and quite horrible sites. It's not something you can build and just run. Sites will break, fail to respond, change their markup, etc. Your system has to be very tolerant, or you'll be in babysitting mode 24 x 7.
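One way to read "tolerant" is that a single broken site should never stop the whole run. A minimal sketch under that assumption follows. `crawl_all` and its parameters are hypothetical names, and `fetch` stands in for whatever per-site scraping function you already have.

```python
import time


def crawl_all(sites, fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Fetch every site, retrying with backoff and isolating failures.

    `fetch` is a caller-supplied function that takes a site name and
    returns parsed items, or raises on any error.
    """
    results, failures = {}, {}
    for site in sites:
        for attempt in range(1, max_attempts + 1):
            try:
                results[site] = fetch(site)
                break
            except Exception as exc:  # broad on purpose: timeouts, changed markup, etc.
                if attempt == max_attempts:
                    # Give up on this site only; the rest of the run continues.
                    failures[site] = repr(exc)
                else:
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    sleep(base_delay * 2 ** (attempt - 1))
    return results, failures
```

The `failures` dict gives you a list of sites to look at later, instead of a crashed job to babysit.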

> He said if I could build a search engine that searches the top 500-1000 sites for this industry, it could be 'really' valuable

Let's say that you can get past any legal issues with scraping... Don't dive into a startup based solely on this anecdote. Figure out what you can do to size the market. He says it's valuable. How valuable? Do customers understand why they need this? Are they spending any money on something similar today? How would you target them and sell to them?

I'd treat these questions as equally important as the legality when it comes to "should I start?"

I've built several businesses that relied either in part or solely on scraping/indexing websites. We never achieved the success of Google but we did get large enough to be noticed by some sites (Amazon for example). We did face legal issues but of a different kind. There were a few bugs early on that made us hit websites too much and we did receive a couple of cease and desist letters. We fixed our problem and explained the situation to the site owner and everything was resolved.

The only "fair use" type issue that we encountered was using logos from websites, e.g. displaying the logos of the websites we indexed on our site. Once again, nothing serious came of it. I believe our marketing department removed the logos and put text instead.

Personally, I wouldn't worry about these issues until it becomes a problem. When it becomes a problem it means you're on to something and you're disruptive enough to get some attention. It's a good problem to have IMO.

I built a product around the Twitter firehose, which was a publicly published, accessible, terms-of-service'd data stream... then they killed our (and many others') products when they closed up access. Keep that in mind, and add on the risk that you're probably not even allowed access to scraped data, and I would suggest the answer is "No."

How about you... talk to a lawyer?

That's good advice, but you also have to know when it's good to ignore legal advice (it depends on how risk averse you are I guess :) ).

Many startups flout current laws and are very successful (see Airbnb or Uber). I think PG wrote something on this (mostly in the "hackers beat the system" sense).

I don't know about the legal aspects, but if the sites you are scraping do not want to be scraped, it can turn into an arms race. They figure out a way to block your scrapers, you figure out a way around it, they block you again and so on. Even if it is legal, there are plenty of other things I'd rather spend my time building.

What this boils down to is incentives. Most of the issues with copyright that bring on legal action come from sites that aggregate data which they did not create, and then market in some way to make advertising revenue off of it. Sites that aggregate and repost news stories, for example, fall under this category because they end up taking advertising revenue from the sites which they draw their content from. Content creators in this area will then aggressively go after these sites because they hurt the bottom line.

On the other hand, your concept sounds like it would draw business to this industry, so the incentives may very well align with the very companies whose data you are scraping. I worked on a concept for a startup where we had similar concerns, but in our research we never had any issues come up because our aims were aligned with the providers of the data that we were scraping.

Can you just set up a custom search engine with the URLs of the sites for the industry? https://cse.google.com/cse/

Your effort would be just to paste in those URLs, no need to develop / maintain any site of your own.

In a sense that's what Google does.

Some years ago I did a project that involved scraping and we got some letters from lawyers and got blocked from some websites. Make sure to know what the legal situation is, otherwise lawyers' letters can be scary.

It sounds like you have a good handle on the legalities, but on a more practical level if the sites you are scraping don't want to be scraped, it would be pretty easy for them to block you, obfuscate/change the page structure at any time to make your scraping impractical, etc. Of course, you will be able to play along too by obfuscating your source address and improving your scraping, but it could turn into a time consuming game of walls and ladders.

Of course your startup idea may still be worthwhile, but in the longer term you'll be at the mercy of the content owners (who might even be fine with it, or want to acquire you).

If you're small, no one will notice / be worried that you're scraping their data, and it won't be worth their while to sue you.

If you do make it big, hopefully you will have enough profits to play the lawyer game.

Ahhh, you aren't scraping, you're building an aggregator! Google does have a bunch of lawyers but the hardest part is building and selling something. Solve any perceived illegality later.

I think getting sued is really dependent on the size of the company you're scraping vs. what kind of 'business' you're cutting out from under them. There are a number of sites like https://gripsweat.com (mine) that are important to collectors/niche users but are basically too small to bother with otherwise.

Where are you getting the data from and how? API or via scraping?

Apart from the legal aspects, a problem I see is the day you have X subscribers paying to get your (aggregated) content: what if some sources (playing cat/mouse with you, or not) refactor their web sites (basically F*#king your data pipes)? Then you'll have to turn around quite fast because you'll have tens and tens of customers yelling at you.
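
One way to hear about a refactored site before your customers do is a sanity check on every crawl: if a site that used to yield complete records suddenly yields mostly empty ones, its markup probably changed. A rough Python sketch follows, assuming each scraped item is a dict; the field names `image_url` and `description` and the 0.8 threshold are hypothetical.

```python
def extraction_health(records, required=("image_url", "description")):
    """Fraction of records that have every required field non-empty."""
    if not records:
        return 0.0  # nothing extracted at all counts as fully broken
    complete = sum(1 for r in records if all(r.get(f) for f in required))
    return complete / len(records)


def sites_needing_attention(results_by_site, min_health=0.8):
    """Return the sites whose latest crawl looks broken, worst first."""
    scores = {site: extraction_health(recs) for site, recs in results_by_site.items()}
    broken = [site for site, score in scores.items() if score < min_health]
    return sorted(broken, key=lambda site: scores[site])
```

Running this after each crawl gives you a short "fix these parsers today" list, which is a lot cheaper than finding out from support email.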

Mind adding your email address to your profile, or a throw-away email? Or just shoot me an email (email is in my profile).

I've been down this path before and would love to chat and am happy to provide any help or guidance that I can. I have no "golden" answers FWIW, but I do know the positives and negatives, and have even been to Federal court WRT scraping. :-)

IANAL, but I considered something of the sort (a site aggregating real estate listings), and was immediately warned of the legal implications.


There are already existing providers for this kind of service. Ex: http://kapowsoftware.com/

I am not sure what is different in your approach?

Start with scraped data, then when they start blocking you (legally or otherwise) create a predictive market.

Then, you're a platform for other people's scrapers and you'll provide perfect(ish) data.

As someone whose previous startup did something not unlike what you are going after:

You don't need to worry if you respect the robots.txt and such.
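
If it helps, checking robots.txt takes only a few lines with Python's standard `urllib.robotparser`. This is a minimal sketch: the helper name `build_robots_checker` and the user agent string are made up for illustration, and fetching the robots.txt file itself is left out.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent):
    """Parse a robots.txt body and return (allowed, crawl_delay).

    `allowed(url)` says whether `user_agent` may fetch that URL;
    `crawl_delay` is the Crawl-delay value for that agent, or None.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed, parser.crawl_delay(user_agent)


# Hypothetical robots.txt content for an example site.
SAMPLE = """User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

allowed, delay = build_robots_checker(SAMPLE, "example-indexer")
print(allowed("http://shop.example/items/1"))
print(allowed("http://shop.example/private/x"))
print(delay)
```

Skipping every URL where `allowed` is false, and waiting at least `delay` seconds between hits when a delay is given, covers the "and such" part for most small sites.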
