I have been web scraping for several months and am starting to teach it at Meetups. I'm lucky enough to work for a company that has a few pre-crawled copies of the web that I can query against and a distributed processing platform to speed up any scraping I do.
I'm running out of ideas for what to build, though. I build scrapers to produce content for the company based on the data and insights I find. The projects are usually in marketing verticals, such as finding all websites that use feedback tools (I search for their JavaScript widgets) and running analysis on that info.
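The widget-detection approach boils down to searching crawled HTML for a tool's loader script. A minimal sketch of that idea, assuming made-up widget names and domains (`examplefeedback.com` and `surveypop.io` are placeholders, not real vendors):

```python
import re

# Hypothetical script-src patterns for the feedback tools we track.
# Real patterns would come from inspecting each vendor's embed snippet.
WIDGET_PATTERNS = {
    "ExampleFeedback": re.compile(
        r'src=["\'][^"\']*widget\.examplefeedback\.com', re.I),
    "SurveyPop": re.compile(
        r'src=["\'][^"\']*cdn\.surveypop\.io/loader\.js', re.I),
}

def detect_widgets(html: str) -> list:
    """Return names of known widgets whose loader script appears in the page."""
    return [name for name, pat in WIDGET_PATTERNS.items() if pat.search(html)]
```

Run over a pre-crawled copy of the web, a pass like this gives you a list of domains per vendor, which is the raw material for the marketing analysis.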
So if you had these resources, what would you be looking for? I love building tools that help people, so any feedback or ideas would be great!
I'm also open to hearing what you would scrape from the live web. I find that the pre-crawled copies are best for broad analysis, while for specific sites or pieces of information I use the live web.
For anything released at a fixed time on a published calendar, you can bet that multiple parties are racing to scrape it as fast as possible.
If you can scrape this data (the easy part), put it in a structured format (somewhat hard), and deliver it in under a few seconds (this is where you get paid), then you can almost name your price.
It's an interesting niche that hasn't been computerized yet.
If you can't get the speed, the first two steps can still be useful to the large number of funds springing up that use "deep learning" techniques to build portfolios over timelines of weeks to months.
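The structure-and-deliver steps can be sketched in a few lines. This is only an illustration, assuming the release arrives as a simple "Key: Value" text block; the field names and format are invented, and a real feed would need its own parser:

```python
import json
import time

def parse_release(raw: str) -> dict:
    """Turn a raw 'Key: Value' release into a structured record."""
    record = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(":")
        record[key.strip().lower().replace(" ", "_")] = value.strip()
    return record

def deliver(record: dict, fetched_at: float) -> str:
    """Serialize the record with its end-to-end latency attached.

    The latency number is the part clients actually pay for.
    """
    record["latency_seconds"] = round(time.time() - fetched_at, 3)
    return json.dumps(record)
```

Keeping the parser separate from delivery also lets you reuse it for the slower weeks-to-months funds, where latency doesn't matter but clean structured history does.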
To answer the question:

> Wouldn't this require a huge network of various proxy IPs to constantly fetch new data from the site without being flagged and blacklisted?
This is why I gave the caveat of only looking at data that comes out at fixed times. That way you only have to hit the server once, when the data comes out, or at least a few hundred times in the seconds leading up to the data's release. :)
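A minimal sketch of that burst pattern: do nothing all day, then poll tightly in the final seconds before a known release time, returning as soon as the page changes from a pre-release baseline. The fetcher is passed in as a callable (e.g. a `urllib.request.urlopen(url).read()` wrapper); the lead time and interval are placeholder guesses, not tuned values:

```python
import time

def burst_fetch(fetch, release_ts, lead_seconds=2.0, interval=0.05, timeout=5.0):
    """Poll `fetch()` from `lead_seconds` before `release_ts` until the body changes.

    `fetch` is any zero-argument callable returning the page body as bytes,
    e.g. lambda: urllib.request.urlopen(url, timeout=5).read()
    """
    # Snapshot taken before the release; this is the "one hit" baseline.
    baseline = fetch()
    # Sleep until just before the scheduled release time.
    wait = release_ts - lead_seconds - time.time()
    if wait > 0:
        time.sleep(wait)
    # Tight polling loop for the few seconds around the release.
    deadline = release_ts + timeout
    while time.time() < deadline:
        body = fetch()
        if body != baseline:   # new data is up
            return body
        time.sleep(interval)   # brief pause between requests
    return None                # release never appeared within the window
```

Because all requests land in a window of a few seconds from one IP, there's no sustained polling pattern for the server to flag, which is the point of the caveat above.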