
Ask HN: How Much Can I Scrape? - brentr
I am working on a financial software project. I have written code in Python to get all of the historical price data for each stock in the S&P 500. I have tested the code using an input file of five ticker symbols and the code runs perfectly. I would like to get data for all 500 stocks in the S&P 500, but I don't know how well collecting this much data would go over with Yahoo. I have implemented my program so that it only sends out one request per minute, but I am still worried about turning my system loose.

Has anyone else done something similar? For those of you who run your own sites, how do you view scraping? Should I contact someone at Yahoo first?
======
m0nty
Disguise your scrapes as a browser: include an Internet Explorer or Firefox
User-Agent string. Randomize the times between scrapes so it looks more like a
human being doing it. Make sure the scraper takes "coffee breaks" every now
and then. Run the service from several servers at once, if you have them. I
would guess your program is fairly low-overhead (mine always have been), so
contact friends and ask to use their servers or home PCs. Extra credit for
designing a cloud-like infrastructure where PCs could come and go without
missing any data :)
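In Python, the first few points might look something like this (a minimal sketch; the User-Agent string and helper names are just examples, not anything Yahoo-specific):

```python
import random
import time
import urllib.request

# Example browser User-Agent string; any current IE/Firefox string works.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 6.1; rv:10.0) "
              "Gecko/20100101 Firefox/10.0")

def polite_delay(base=60, jitter=30, break_chance=0.05, break_len=600):
    """Return a randomized delay: base +/- jitter seconds, with an
    occasional long 'coffee break'."""
    if random.random() < break_chance:
        return break_len + random.uniform(0, jitter)
    return base + random.uniform(-jitter, jitter)

def fetch(url):
    """Fetch a URL while presenting a browser User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
    return urllib.request.urlopen(req).read()

# In the main loop you would sleep between requests:
# for symbol in symbols:
#     data = fetch(quote_url(symbol))
#     time.sleep(polite_delay())
```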

~~~
lacker
Might be easier just to use Tor than to use multiple servers. I think the only
thing that will matter is # of requests per IP address per time period. It
won't be a human looking at the logs, it'll be an automated process, so it's
unlikely that randomizing the time between scrapes will matter.

------
spc476
How far back? EODData (<http://eoddata.com/>) has 15 years of pricing
information for $20 for a number of exchanges, which would certainly save time
and isn't horribly expensive for what you get.

~~~
oakmac
In my experience, the only reliable source of historical stock data is the
exchange itself. Nasdaq.com records go back to January 2nd, 1990.

------
enomar
Hate to be obvious, but reading their API terms of service might be a good
place to start...

------
lacker
If you contact someone at Yahoo, the response will be, do not scrape us in any
way.

The problem with financial data is that Yahoo (like most other sites where you
might find this data) doesn't generate this data themselves. They license it
from other companies, and the licensing agreement typically prohibits or
greatly restricts Yahoo's ability to provide the data to third parties.

That said, if Yahoo is not aware that you are scraping them, they cannot stop
you. They certainly do have anti-scraper algorithms (you will start getting
http 999 errors) but they will not kick in until you cross some invisible
threshold. You can probably use Tor with no problem.

Although, if you get large enough that someone notices, you will probably get
some sort of cease & desist letter. Depending on your goals that might not be
a problem for you.

------
oakmac
I did this exact same thing a few years ago, except scraping the data from
nasdaq.com using Perl. I hit their site once every 2 seconds, across roughly
2500 stocks, every day for about 6 months. I would only grab as much data as I
needed. They never contacted me or blocked my IP address. I also had a friend
who was doing the same thing for a longer period of time.

From experience, I would not recommend getting your data from Yahoo. I looked
at them first, but their data is just not as good as the source.

If you would like more information or my notes on how I reverse-engineered the
nasdaq.com URL scheme please send me an email.

~~~
oakmac
I updated my profile with my email address.

------
kaens
If I had a site that was scrape-worthy, I wouldn't care about it if the people
were respectful about it (wait a second or two in between requests, don't
hammer my server).

From the business side, I could see them getting a bit grumpy about it, but if
it's publicly available information, and there's nothing in their TOS about
it, I don't see how they could do anything about it - again, unless you're
being a dick with your scraper.

Does anyone know off the top of their head if there are any relevant court
cases dealing with scraping?

------
mikkom
I've downloaded all the data they have for the S&P 500 many times (I did it
with parallel processes, spawning one download about every 0.1 seconds). They
block your connections if you download too fast.

If they give out CSV exports, as they do, there is no reason why someone
wouldn't download them for personal use.

I guess you already know about the CSV download but if you don't, here is a
link about it: <http://www.diytraders.com/content/view/25/43/>

I would, however, never use them in a commercial product, if that's what you
are asking.

------
xefyr
If you're really concerned about it you can go through an anonymizing proxy
service. But, as has been said, if you have the time, spacing out your
requests should work fine too.

------
qhoxie
I got blocked during RailsRumble for pulling too much from Y! Finance. We did
not have the time to throttle it.

You should at least try to email them and see if your restrictions can be
loosened.
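Throttling is simple to bolt on after the fact; a minimal sketch (the class name is hypothetical) that enforces a minimum gap between successive requests:

```python
import time

class Throttle:
    """Enforce a minimum interval between successive requests."""
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last = None

    def wait(self):
        """Block until at least min_interval seconds have passed
        since the previous call."""
        now = time.monotonic()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                time.sleep(remaining)
        self.last = time.monotonic()

# Usage: call throttle.wait() before each request in the scrape loop.
```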

------
dpmorel
We scraped Yahoo Mail for about 6 months quite heavily. We had to keep it at a
>5 minute timer otherwise we got captchas during the auth process, or we got
locked out for 24 hours with error 999.

We now have a formal agreement with Yahoo, but during the process Yahoo
indicated they had an informal open policy on scraping. Note that they have an
initiative to open up all services within the next year or so (google Yahoo
Open Strategy to read more about timelines).

------
redorb
At 1 request per minute, I don't think Yahoo would even notice.

~~~
aneesh
True, but they might notice 1440 requests in the same day from the same IP.

~~~
flashgordon
Actually, the way I worked around this with both Yahoo and Google was to
round-robin my requests over a list of 1500 publicly available proxy servers
:D
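A sketch of that round-robin rotation in Python (the proxy addresses here are placeholders, not real servers):

```python
import itertools
import urllib.request

# Placeholder proxy list; in practice this would be the 1500 public proxies.
PROXIES = ["203.0.113.1:8080", "203.0.113.2:8080", "203.0.113.3:8080"]
proxy_cycle = itertools.cycle(PROXIES)

def fetch_via_next_proxy(url):
    """Send each request through the next proxy in the rotation."""
    proxy = next(proxy_cycle)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy}))
    return opener.open(url).read()
```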

------
bgtony
eBay v. Bidder's Edge is particularly interesting, as is verticalone.com, now
yodlee.com.

Trespass to chattels is an old common-law tort (with roots in Roman law) that
dictates what should be done if you trespass on my land and hurt one of my
cattle, and it is used as the core of most cases involving scraping in
unauthenticated environments (like Y! Finance). They can come get you, not for
taking data, but for costing them money to support the response volumes you
demand. The magic number is $5,000, at which point it becomes a felony (or at
least that was the threatening rhetoric, which is a different story
altogether). You scrape, hurt their cow for $5k, and it is not a question of
restitution, but of punishment. And in each case the scraper is typically
viewed as a "thief"... not a label that inspires lighter punishments. See:
www.biddersedge.com. Yeah, exactly, nuked from orbit... and there, in a
nutshell, is the risk inherent in scraping. All a scrapee has to do is wait
for you to pass $5k... while they consider the PR ramifications of the whole
thing... how much bandwidth and resources need to be used before the public
will sympathize? $10k? $20k? $30k before they are lauded as a hero for
removing the thieving vermin?

Insidious really, scraping and scrapers are being "set up the bomb" here... to
not be viewed as enabling the liberation of data, but rather as thieves of the
resources necessary to deliver that data to the general public. Using trespass
to chattels as a precedent is therefore a brilliant stroke... apparently, they
can be taught. Or, to put it another way, scrapers aren't napster users in
dorm rooms, they are felony thieves of public resources.

Yeah, we all know that cease and desist and all other legal remedies are
jurisdictionally challenged - the net doesn’t stop at international borders.
And, historically, it seems that other countries turn a deaf ear to most cyber
crime excepting, of course, for credit card fraud.

Also, limit scraping via tor. Tor has a legitimate use which scraper volumes
would impact. Of course, there are tor nets set up for "illegitimate" use...
and they let anybody in, including folks like me, who then map all tor exit
nodes used by scrapers and interdict em all...

And, don't forget steganography... you take data (even through tor or rotating
proxies) and redisplay it, google can find it and I can ask google to tell me
where it is. Scrapers, even as very clever data middlemen, will get the
squeeze from both sides as scrapees discover where their data is being
displayed and utilize legal means to go after those storefronts, who will of
course first provide name, rank and serial number of the scraper that provided
them the data...

And what about copyrights? Lots of legal precedent here, be careful with image
redisplay. Mine field here...

------
toddcw
This might help: <http://blog.screen-scraper.com/2007/03/01/how-to-surf-and-screen-scrape-anonymously/>

------
ca98am79
What kind of data do you want? Real-time, or end of day? If you just want
end-of-day data, it is simple. Just write a script that collects it in the
middle of the night and stores it in your database. I don't think they mind at
all if you just do it once a day for all of the stocks - I know people who
have been doing it for years. They use this:

<http://www.gummy-stuff.org/Yahoo-data.htm>

If you want real-time data, good luck. It will cost you.
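The download format that page documents is a plain CSV endpoint; a sketch of building the historical-quotes URL it describes (parameter names per that page; note the month parameters are zero-based):

```python
import urllib.parse

def history_url(symbol, start, end):
    """Build the table.csv URL for daily history between two dates.
    start/end are (year, month, day) tuples."""
    (y1, m1, d1), (y2, m2, d2) = start, end
    params = {
        "s": symbol,
        "a": m1 - 1, "b": d1, "c": y1,   # start month (zero-based), day, year
        "d": m2 - 1, "e": d2, "f": y2,   # end month (zero-based), day, year
        "g": "d",                        # 'd' = daily data
        "ignore": ".csv",
    }
    return ("http://ichart.finance.yahoo.com/table.csv?"
            + urllib.parse.urlencode(params))
```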

------
hotpockets
I don't think you have anything to worry about. I've scraped Yahoo finance
before at about a 1 second request rate, using perl's YahooFinance module.

------
yawl
Do not crawl Yahoo too fast, otherwise you will get a 999 error, which means
you will be banned temporarily.

Search 'yahoo 999' for details.
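A sketch of backing off when the 999 shows up (here `fetch` is a stand-in for whatever function does your HTTP request and returns a status code and body):

```python
import time

def fetch_with_backoff(fetch, url, max_tries=5, base_delay=60):
    """Retry on Yahoo's 999 'slow down' status, doubling the wait
    each time. `fetch(url)` should return (status_code, body)."""
    delay = base_delay
    for attempt in range(max_tries):
        status, body = fetch(url)
        if status != 999:
            return body
        time.sleep(delay)
        delay *= 2
    raise RuntimeError("still throttled after %d tries" % max_tries)
```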

