- Product pricing data: Many companies collect pricing data from e-commerce sites. Latency and temporal trends are important here. Believe it or not, there are still profitable companies out there that hire people to manually scrape websites and input data into a database.
- Various analyses based on job listing data: Similar to what you do by looking at which websites contain certain widgets, you can parse job listings (using NLP) to find out which technologies are used by which companies. Several startups are doing this. Great data for bizdev and sales. You can also use job data to understand technology hiring trends, understand the long-term strategies of competitors, or use it as a signal for the health of a company.
- News data + NLP: Crawling news data and understanding facts mentioned in news (using Natural Language Processing) in real time is used in many industries: finance, M&A, etc.
- People data: Crawl public LinkedIn and Twitter profiles to understand when people are switching jobs/careers, etc.
- Real-estate data: Understand pricing trends and merge information from similar listings found on various real estate listing websites.
- Merging signals and information from different sources: For example, crawl company websites, Crunchbase, news articles related to the company, and LinkedIn profiles of employees, then combine all the information found in the various sources to arrive at a meaningful structured representation. This is not limited to companies; you can probably think of other use cases.
In general, I think there is a lot of untapped potential and useful data in combining the capabilities of large-scale web scraping, Natural Language Processing, and information fusion / entity resolution.
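To make the fusion / entity resolution idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the record fields, the `normalize_name` and `fuse` helpers, and the suffix list are all hypothetical, and matching on a normalized name is far cruder than what a real entity resolution system would do.

```python
import re
from collections import defaultdict

# Hypothetical list of legal suffixes to ignore when comparing company names.
_SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}


def normalize_name(name):
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def fuse(records):
    """Group records by normalized name and merge their fields.

    The first non-None value seen for a field wins; the originating
    sources are collected in a "sources" list.
    """
    merged = defaultdict(dict)
    for rec in records:
        entity = merged[normalize_name(rec["name"])]
        for field, value in rec.items():
            if field == "source":
                entity.setdefault("sources", []).append(value)
            elif value is not None:
                entity.setdefault(field, value)
    return dict(merged)


# Made-up records standing in for data crawled from two different sources.
records = [
    {"source": "website", "name": "Acme, Inc.", "url": "https://acme.example"},
    {"source": "news", "name": "ACME Inc", "employees": 120},
]
fused = fuse(records)
```

In practice you would want fuzzy matching and conflict handling rather than "first value wins", but the overall shape (normalize, group, merge) stays roughly the same.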
Getting changing data with low latency (and exposing it as a stream) is still very difficult, and there are lots of interesting use cases as well.
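As a rough illustration of the "expose changes as a stream" part, one simple approach is to hash each fetched page and only emit it when the hash differs from the last one seen. This is just a sketch under my own assumptions: `change_stream` is a hypothetical helper, the snapshots are made up, and the hard part (fetching often enough, at scale, without getting blocked) is not shown.

```python
import hashlib


def change_stream(snapshots, seen=None):
    """Yield (url, content) only when the content differs from the last
    version seen for that URL.

    `snapshots` is any iterable of (url, content) pairs, e.g. the output
    of a crawler that revisits pages; `seen` maps url -> last digest.
    """
    seen = {} if seen is None else seen
    for url, content in snapshots:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if seen.get(url) != digest:
            seen[url] = digest
            yield url, content


# Made-up snapshots: page "a" is fetched three times and changes once.
snapshots = [("a", "price: 10"), ("a", "price: 10"), ("a", "price: 12"), ("b", "price: 5")]
changes = list(change_stream(snapshots))
```

Hashing the raw page is noisy on real sites (ads, timestamps), so you would usually hash the extracted fields instead.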
Hope this helps. Also, feel free to send me an email (in my profile) if you want to have a chat or exchange more ideas. Seems like we're working on similar things.
As for web-scale crawling of particular verticals such as products and news, you might want to try: http://www.diffbot.com/products/automatic/
We're planning on releasing support for jobs, companies, and people later.
(disclosure: I work there)
If interested, take a look at this paper (and related ones): http://www.cs.ubc.ca/~murphyk/Papers/kv-kdd14.pdf
Are you distinguishing between "I found that there are no" and "I didn't find any so far"?
Which ones that came close have you rejected, and why?
I don't remember the names of all the projects that I've looked at, but the main ones were Nutch, Heritrix, Scrapy and crawler4j. I've come across several companies/startups that have built their crawlers in-house for the same reasons (e.g. http://blog.semantics3.com/how-we-built-our-almost-distribut...).