
There is a closet industry for scraping any sort of data that can move markets: Fed, crop, weather, employment, etc.

Anything that is released at a certain time on a fixed calendar, you can bet that multiple parties are trying to scrape it as fast as possible.

If you can scrape this data (the easy part), put it in a structured format (somewhat hard), and deliver it in under a few seconds (this is where you get paid), then you can almost name your price.
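To make the "structured format" step concrete, here is a rough sketch, not anyone's production pipeline. It assumes a release arrives as plain "name: value" lines, which real agencies rarely guarantee. The sample text, the numbers in it, and the function names `parse_release` and `structure_and_serialize` are all made up for illustration.

```python
import json
import re
import time

# Hypothetical release text with made-up numbers; real releases differ per
# agency and would each need their own parser.
SAMPLE_RELEASE = """\
Nonfarm payrolls: +175,000
Unemployment rate: 6.6%
"""

# Matches lines like "Name: +1,234.5%" and captures name, number, and optional % sign.
LINE_RE = re.compile(
    r"^(?P<name>[^:]+):\s*(?P<value>[+-]?[\d,]*\.?\d+)\s*(?P<unit>%?)\s*$"
)


def parse_release(text):
    """Turn 'name: value' lines into a list of structured records."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a data line
        records.append({
            "name": match.group("name").strip(),
            "value": float(match.group("value").replace(",", "")),
            "unit": match.group("unit") or None,
        })
    return records


def structure_and_serialize(text):
    """Parse and serialize to JSON, returning the payload and elapsed milliseconds."""
    start = time.perf_counter()
    payload = json.dumps(parse_release(text))
    elapsed_ms = (time.perf_counter() - start) * 1000
    return payload, elapsed_ms
```

Timing the parse separately from the fetch is deliberate: it shows how much of the latency budget the formatting step eats, as opposed to the network.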

It's an interesting niche that hasn't been computerized yet.

If you can't get the speed then the first 2 steps can still be useful to the large number of funds that are springing up using "deep learning" techniques to build a portfolio over timelines of weeks to months.

To answer the question of: > Wouldn't this require a huge network of various proxy IPs to constantly fetch new data from the site without being flagged and blacklisted?

This is why I gave the caveat of only looking at data that comes out at certain times. That way you only have to hit the server once, when the data comes out, or at least a few hundred times in the seconds leading up to the data's release :)
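The "only poll around the release time" idea could look roughly like this. It is a sketch under assumptions: `fetch` is a hypothetical caller-supplied function that returns None until the data is published, and the lead time, interval, and timeout defaults are arbitrary illustrative values. With a 2 second lead and a 10 ms interval, that works out to about 200 requests before the release.

```python
import time


def wait_for_release(fetch, release_ts, lead_s=2.0, interval_s=0.01,
                     timeout_s=10.0, clock=time.time, sleep=time.sleep):
    """Sleep until shortly before release_ts, then poll fetch() until it returns data.

    fetch is a hypothetical callable that returns None until the release is
    published. clock and sleep are injectable so the loop can be tested
    without waiting in real time. Returns (data, number_of_requests).
    """
    # Stay idle until the polling window opens, so the server sees no
    # traffic from us outside the seconds around the release.
    delay = release_ts - lead_s - clock()
    if delay > 0:
        sleep(delay)

    attempts = 0
    deadline = release_ts + timeout_s
    while clock() < deadline:
        attempts += 1
        data = fetch()
        if data is not None:
            return data, attempts
        sleep(interval_s)

    # Gave up: the release was late or the page never changed.
    return None, attempts
```

In practice `fetch` would wrap whatever HTTP client you use and decide what counts as "published" (a new timestamp, a changed hash, a 200 instead of a 404).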




That is a fairly surprising opportunity. I have experience monitoring/scraping thousands of government websites for a different purpose. Considering some government sites have a round trip of well over 5 seconds, it seems like it'd be a fun challenge to parse, format, and deliver the data that fast.

What types of data formatting are you talking about here? Would it require a unique template for each individual site?


> deliver it in under a few seconds(this is where you get paid) then you can almost name your price.

Wouldn't this require a huge network of various proxy IPs to constantly fetch new data from the site without being flagged and blacklisted?

Or do you mean the time from scraping the data to delivering it, i.e. under 3 seconds end to end?


My understanding is that you need to deliver the data with a latency measured in milliseconds, and even then that might not be fast enough, because some traders get direct access to the releases. Here are a couple of articles in the WSJ:

"Speed Traders Get an Edge" - Feb 6, 2014 - http://online.wsj.com/news/articles/SB1000142405270230445090...

"Firm Stops Giving High-Speed Traders Direct Access to Releases" - Feb 20, 2014 - http://online.wsj.com/news/articles/SB1000142405270230377550...


A bit off topic, but if I were to scrape such data without the intention of selling it, and instead used it myself... How fast are stock markets? Surely, if I knew the price of a stock would keep increasing over the next few days, then buying it, say, 1 hour after the news hit a major news site would still profit me? If not, why? I mean, surely you can find someone selling that stock at all times, no?

edit: replaced mysql with myself


> It's an interesting niche that hasn't been computerized yet.

That's quite an assertion. I'm certain it has been.



