
I've had three primary uses of web scraping. The hard part for me has never been speed. Getting the results structured is somewhere between easy and hideously complicated.

1. Reformatting and content archival (lag times of hours to days are no prob).

As an example, I put together a site to archive comments of a ridiculously prolific commenter on a site I follow. I needed the content of his comments, as well as the tree structure to shake out all the irrelevant comments leaving only the necessary context. Real time isn't an issue. Up until recently it ran on a weekly cron job. Now it's daily.
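
The pruning described above (keep one commenter's posts plus the parents needed for context) can be sketched roughly like this. It is a minimal illustration, not the archive site's actual code: the `Comment` class, its fields, and the `prune` function are all hypothetical names, and it assumes the thread has already been scraped into a tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Comment:
    # Hypothetical shape for an already-scraped comment.
    author: str
    text: str
    replies: List["Comment"] = field(default_factory=list)


def prune(comment: Comment, target: str) -> Optional[Comment]:
    """Return a copy of the subtree that keeps only `target`'s comments and
    the ancestors leading to them; return None if nothing is worth keeping."""
    kept = [p for p in (prune(r, target) for r in comment.replies) if p is not None]
    if comment.author == target or kept:
        return Comment(comment.author, comment.text, kept)
    return None


thread = Comment("alice", "root post", [
    Comment("bob", "unrelated chatter"),
    Comment("carol", "a question", [
        Comment("target", "the prolific answer", [
            Comment("dave", "thanks"),
        ]),
    ]),
])
pruned = prune(thread, "target")
```

Note the design choice: replies underneath the target's comment are dropped unless the target posts again further down, so only the context leading up to each comment survives.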

2. Aggregating and structuring data from disparate sources (real time can make you money).

I work in real estate. Leasing websites are shitty and the information companies are expensive and also kinda shitty. Where possible we scrape the websites for building availability, but a lot of the time that data is buried in PDFs. For a lot of business domains, being able to scrape data in a structured way from PDFs would be killer if you could do it! I guarantee the industries chollida1 mentioned want the hell out of this too. We enter the PDFs manually. :(

Updates go in monthly cycles, so timeliness isn't a huge issue. Lag times of ~3-5 business days are just fine, especially for the things that need to be manually entered.

This is exactly the sort of scraping that Priceonomics is doing [1]. They charge $2k/site/month. Hopefully y'all are making that much.

3. Bespoke, one shot versions of #2.

One shot data imports, typically to initially populate a database. I've done a ton of these and I hate them. An example is a farmers market project I worked on. We got our hands on a shitty national database of farmers markets. I ended up writing a custom parser that worked in ~85% of cases and we manually cleaned up the rest. The thing that sucks about one shot scrape jobs from bad sources is that it almost always means manual cleanup. It's just not worth it to write code that works 100% of the time when it will only be used once.
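
A rough sketch of that "parse what you can, hand-fix the rest" pattern follows. The record layout (`Name | City, ST | hours`), the `RECORD` regex, and the `parse_all` function are made up for illustration; the real database format isn't described here.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical record layout: "Name | City, ST | opening hours"
RECORD = re.compile(
    r"^\s*(?P<name>[^|]+?)\s*\|\s*(?P<city>[^|,]+?),\s*(?P<state>[A-Z]{2})"
    r"\s*\|\s*(?P<hours>.+?)\s*$"
)


def parse_all(lines: List[str]) -> Tuple[List[Dict[str, str]], List[str]]:
    """Split input into cleanly parsed records and lines needing manual cleanup."""
    parsed: List[Dict[str, str]] = []
    needs_review: List[str] = []
    for line in lines:
        match = RECORD.match(line)
        if match:
            parsed.append(match.groupdict())
        else:
            # Don't guess: anything off-pattern goes to the human queue.
            needs_review.append(line)
    return parsed, needs_review


sample = [
    "Downtown Market | Springfield, IL | Sat 8am-1pm",
    "Riverside mkt - saturdays, call for info",
]
parsed, needs_review = parse_all(sample)
```

Keeping the rejects in their own list makes it easy to see what fraction parsed and to stop improving the regex once manual entry is cheaper.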

Make any part of structuring scraped data easier and you guys are awesome!

[1] http://priceonomics.com/data-services/

There are services that cover at least part of what you mentioned. These effectively provide you a tool to visually build a scraper and then they automate the scraping in the background, creating an API or spreadsheet of the data.

Import.io is one example, and I think there's another more recent YC-backed one. I tried using import.io a little while back, but without much joy.

I think import.io is buggy, to say the least. Having used it in the past to scrape some websites, I found it a pain to work with. Kimonolabs is still very lacking in its ability to handle different websites; it is limited to a certain portion of the web. It looks like they are more about creating APIs that people are supposed to find interesting and valuable, but like the topic of this question, a dataset seems to be valuable only to someone who has a direct need for it. By itself it would be of little interest.

Having private access to Scrape.it, I can say that it focuses strictly on being a great tool that can scrape websites without costing a fortune. I know the founders and they are extremely dedicated to making a tool that can handle pretty much anything you throw at it, like AJAX, single-page apps, and crawling selected links and all sub-links after the page. They've just begun adding login and form support, so you should be able to play with those very soon. It only supports CSV output at the moment, but hopefully they can make something like API output available.

We have a non-tech intern and import.io looks like a great tool to get him chewing up data. I'm playing with it now. Why didn't it work out for you? Beyond the wrapped browser interface being a little funky lol. (Edit: eugh, selecting data for import is really clunky.)

Ask HN: Anybody got a visual scraping service they like?

It was the data extraction and selection process I couldn't get to work. I was trying to scrape a particular search on autotrader.co.uk (I wanted more up to date results than their daily emails provide, and I wanted to filter out cars that had been written off). I don't remember all the details, but I followed the tutorial video and got to the stage where you select a single item that matches your criteria and it's supposed to extrapolate from there. However I just seemed to be stuck in an infinite loop of it asking me to do this.

I found you often have to select two, then it figures it out. I assumed it was probably because of alternating odd/even row CSS classes.
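
To illustrate that guess (and it is only a guess about the cause, not a description of how import.io works internally): if a selector is generalized from a single example row, it may lock onto that row's class and miss every other row. The HTML and the `RowCollector` / `rows_matching` names below are invented for the demo.

```python
from html.parser import HTMLParser
from typing import List, Set


class RowCollector(HTMLParser):
    """Collect the text of every <tr> whose class list includes one of `wanted`."""

    def __init__(self, wanted: Set[str]) -> None:
        super().__init__()
        self.wanted = wanted
        self.rows: List[str] = []
        self._in_row = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            classes = set((dict(attrs).get("class") or "").split())
            self._in_row = bool(classes & self.wanted)
            if self._in_row:
                self.rows.append("")

    def handle_endtag(self, tag):
        if tag == "tr":
            self._in_row = False

    def handle_data(self, data):
        if self._in_row:
            self.rows[-1] += data.strip()


def rows_matching(html: str, wanted: Set[str]) -> List[str]:
    collector = RowCollector(wanted)
    collector.feed(html)
    return collector.rows


# Made-up listing markup with alternating row classes.
HTML = """
<table>
  <tr class="odd"><td>Ford Fiesta</td></tr>
  <tr class="even"><td>Vauxhall Corsa</td></tr>
  <tr class="odd"><td>VW Golf</td></tr>
</table>
"""

one_example = rows_matching(HTML, {"odd"})           # learned from one row
two_examples = rows_matching(HTML, {"odd", "even"})  # learned from two rows
```

With only the `odd` class, the middle row is skipped; giving a second example from an `even` row is enough to cover the whole table.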

Thank you for the great feedback. I have a real estate background as well and keep wanting to find a project that would benefit that industry. It feels like a lot of what's out there is stuck in the past. I would love to help fix that. Sounds like I have a new project!

Regarding 2)

Why wouldn't it work for PDFs? If you're able to get the file itself, you should be able to OCR it...

Is there anything obvious that I am missing in regards to PDFs?

I've worked with OCRed PDFs. The main thing that should be obvious is that OCR results range from poor to horrendous. It takes a lot of manual cleanup if a high degree of accuracy is required. Or, depending on why you want the text, you can adjust expectations or add layers of software such as fuzzy search algorithms to deal with the issues.

Again depending on the application, the mixed quality of OCR isn't always a deal breaker, but it's not always as simple as it might appear.
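
As a small sketch of the fuzzy-matching layer, here is one way to snap noisy OCR tokens onto a known vocabulary using Python's standard `difflib`. The vocabulary, the 0.8 cutoff, and the `correct_token` helper are assumptions chosen for the example, not anything from a real pipeline.

```python
import difflib
from typing import List, Optional


def correct_token(token: str, vocabulary: List[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the closest known word for an OCR token, or None if nothing is
    similar enough (those tokens still need a human)."""
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical vocabulary for a leasing availability report.
VOCAB = ["Available", "Leased", "Sublease"]

fixed = [correct_token(t, VOCAB) for t in ["Avai1able", "Lcased", "xyzzy"]]
```

This only helps when the set of expected values is small and known in advance; free text and numbers still need the manual pass. Matching is case-sensitive, so normalize case first if the OCR output is inconsistent.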

It's not the text that's the issue, it's the structure. PDFs have nowhere near as much structure as markup. You end up having to do this for dozens of layouts and it gets hurty really fast:


There are computer vision libraries that automatically extract tables from PDFs. For example, http://ieg.ifs.tuwien.ac.at/projects/pdf2table/.

You may want to give that a try if you haven't looked at it before.

>For a lot of business domains, being able to scrape data in a structured way from PDFs would be killer

Can you mention some of those domains? I'm interested. I worked on one such project earlier, for a financial startup.

What about legal implications? Do you get permission from the sites you crawl?

Legality of scraping is a subtle issue - I wrote up my take on it here: https://blog.scraperwiki.com/2012/04/is-scraping-legal/


Thank you for answering. Though private and internal, you are still using the data for profit, is that correct? Does it mean that one can't sell the data, but can use it for analysis etc and still profit from it?

I saw a service recently that emails app store reviews/ratings for a fee. Not sure if they are scraping or getting the reviews some other way. The same idea can be extended to lots of things like Amazon reviews etc. Not sure of the legal stuff though.

I deleted it 'cause I'm not comfortable having those details online. An extremely long story short, in our domain we're using the scraped data exactly as the owners intend, albeit via machines instead of people. Consult a lawyer.

How can I contact you?

My username at google's email service.
