1. Reformatting and content archival (lag times of hours to days are no problem).
As an example, I put together a site to archive the comments of a ridiculously prolific commenter on a site I follow. I needed the content of his comments, as well as the tree structure, to shake out all the irrelevant comments and leave only the necessary context. Real time isn't an issue; up until recently it ran on a weekly cron job. Now it's daily.
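To make the pruning concrete, here's a minimal sketch of the idea in Python (the Comment structure and field names are hypothetical, not from my actual site): keep every comment by the target author plus the ancestors needed as context, and drop everything else.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Comment:
        author: str
        text: str
        children: list["Comment"] = field(default_factory=list)

    def prune(node: Comment, target: str) -> Optional[Comment]:
        """Keep the target author's comments, plus ancestors needed as context."""
        # Recurse first: a node survives if it's by the target author,
        # or if any of its descendants survived.
        kept = [c for c in (prune(child, target) for child in node.children) if c]
        if node.author == target or kept:
            return Comment(node.author, node.text, kept)
        return None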
2. Aggregating and structuring data from disparate sources (real time can make you money).
I work in real estate. Leasing websites are shitty, and the information companies are expensive and also kinda shitty. Where possible we scrape the websites for building availability, but a lot of the time that data is buried in PDFs. For a lot of business domains, being able to scrape data in a structured way from PDFs would be killer if you could do it (rough sketch of what I mean below)! I guarantee the industries chollida1 mentioned want the hell out of this too. We enter the PDFs manually. :(
Updates go in monthly cycles, so timeliness isn't a huge issue. Lag times of ~3-5 business days are just fine, especially for the things that need to be manually entered.
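For what it's worth, here's roughly what that extraction could look like for a text-based PDF using pdfplumber (one library that can pull tables out of PDFs; the column layout here is invented):

    import pdfplumber

    def extract_availability(path):
        """Pull table rows out of a text-based availability PDF."""
        rows = []
        with pdfplumber.open(path) as pdf:
            for page in pdf.pages:
                for table in page.extract_tables():
                    for cells in table:
                        # Hypothetical layout: suite, square footage, asking rate.
                        if cells and len(cells) >= 3 and cells[0]:
                            rows.append({"suite": cells[0],
                                         "sqft": cells[1],
                                         "rate": cells[2]})
        return rows

Scanned PDFs are another story, since there's no text layer to pull from; those need OCR first.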
This is exactly the sort of scraping that Priceonomics is doing. They charge $2k/site/month. Hopefully y'all are making that much.
3. Bespoke, one-shot versions of #2.
One-shot data imports, typically to initially populate a database. I've done a ton of these and I hate them. An example is a farmers market project I worked on. We got our hands on a shitty national database of farmers markets; I ended up writing a custom parser that worked in ~85% of cases, and we manually cleaned up the rest. The thing that sucks about one-shot scrape jobs from bad sources is that they almost always mean manual cleanup. It's just not worth it to write code that works 100% of the time when it will only be used once.
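The pattern is always the same, something like this sketch (field names and input format are made up): parse what you can, and dump the failures into a separate file for the manual-cleanup pile.

    import csv

    def parse_row(row):
        """Pull out the fields we care about; raise on anything suspicious."""
        name, city, state = row[0].strip(), row[1].strip(), row[2].strip()
        if not (name and city and len(state) == 2):
            raise ValueError("unparseable row")
        return {"name": name, "city": city, "state": state}

    with open("markets_raw.csv") as src, \
         open("markets_clean.csv", "w", newline="") as good, \
         open("markets_manual.csv", "w", newline="") as bad:
        writer = csv.DictWriter(good, fieldnames=["name", "city", "state"])
        writer.writeheader()
        rejects = csv.writer(bad)
        for row in csv.reader(src):
            try:
                writer.writerow(parse_row(row))
            except (ValueError, IndexError):
                rejects.writerow(row)  # goes to the manual-cleanup pile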
Make any part of structuring scraped data easier and you guys are awesome!
Import.io is one example, and I think there's another, more recent YC-backed one. I tried using import.io a little while back, but without much joy.
Having private access to Scrape.it, I can say that it focuses strictly on being a great tool for scraping websites without costing a fortune. I know the founders, and they are extremely dedicated to making a tool that can handle pretty much anything you throw at it: AJAX, single-page apps, crawling selected links and all sub-links after the page. They've just begun adding login and form support, so you should be able to play with those very soon as well. It only supports CSV output at the moment, but hopefully they'll make something like API output available.
Ask HN: Anybody got a visual scraping service they like?
Why wouldn't it work for PDFs?
If you're able to get the file itself, you should be able to OCR it...
Is there anything obvious that I am missing in regards to PDFs?
Again, depending on the application, the mixed quality of OCR isn't always a deal breaker, but it's not always as simple as it might appear.
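For reference, a minimal sketch of the render-then-OCR pipeline, assuming pdf2image (which wraps poppler) and pytesseract are installed:

    from pdf2image import convert_from_path
    import pytesseract

    def ocr_pdf(path):
        """Render each PDF page to an image, then run Tesseract over it."""
        pages = convert_from_path(path, dpi=300)
        return "\n".join(pytesseract.image_to_string(page) for page in pages)

Whether the output is clean enough to structure afterwards is exactly the mixed-quality problem above.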
You may want to give that a try if you haven't looked at it before.
Can you mention some of those domains? I'm interested. I worked on one such project earlier, for a financial startup.
I saw a service recently that emails app store reviews/ratings for a fee. Not sure if they're scraping or getting the reviews some other way. The same idea could be extended to lots of things, like Amazon reviews. Not sure about the legal stuff, though.