
For the last 7 years I have worked at a company that does specialized job listing web scraping. And on nearly a weekly basis I encounter other programmers who say, "pssssh, I could do that in a weekend."

It does seem disgustingly easy, but once you move from "getting" the data to "understanding" the data it becomes a beastly nightmare. So thank you to the OP for helping raise "give scrapers some credit" awareness.

Except scrapers that don't respect robots.txt or meta noindex - a pox on their houses....
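For anyone unsure what respecting those two signals involves, here is a minimal Python sketch using only the standard library. The function names (`allowed_by_robots`, `has_meta_noindex`) and the `examplebot` user agent in the usage note are made up for illustration, and a real crawler would also need to fetch and cache robots.txt per host.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class _RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag contains "noindex"."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True


def has_meta_noindex(html: str) -> bool:
    """Return True if the page asks robots not to index it."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```

With a robots.txt of `User-agent: *` followed by `Disallow: /private/`, the first function should refuse `http://example.com/private/x` for `examplebot` and allow `http://example.com/jobs`.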




But there will always be hordes of new, excitable hackers who fail to appreciate the complexity of finishing real-world tasks (and I'm not talking only intelligent scraping). The "awareness" is ephemeral.

Of course, by the time they come around, they will have driven down market prices as well as hackers' image on the whole, with their "Easy! 2 days max, here's the expected cost", invariably followed by "Give me an extra week or two, I'm almost there".

Eventually they'll learn, then come and vent on HN. The circle of life.


It isn't just upstart programmers. How often have we all had discussions with a manager or client who wants something like this done quick and doesn't believe it could have many complexities? "There you go, you got the data, now all you need to do is make the program understand it. Easy!"


Damn. I'm just really glad to know I'm not the only one.

One of my company's biggest sellers is a tool that does web scraping. And since we charge a monthly fee for it (I don't think most people realize that our server fees alone run nearly $2,000 a month for this, and that's with amazing deals on web hosting), we get a lot of "Pfft--how hard could that really be to build?"

The answer is: It's easy--if you're only scraping one site, in small quantities. The minute you try to support Unicode (we have), or bundle it all into a report, or keep an ongoing cache of what your scraper found, or heck, scale it to millions of scrapes per month, all cached and indexed--now you are looking at something difficult.

We have a talented team, we're 16 months into this, and we've since seen competitors pop up all over the place. They all get a few customers and then run into scaling issues. It's one of those markets that seems seductively easy, but really isn't. And since there are many competitors lowballing, we've had to focus even more development time on awesome features to stay ahead of the curve.

I wish articles like this woke people up. Unfortunately, once you're committed and have a few customers, it's easy to rationalize going deeper. But web scraping can be a tricky industry to turn a profit in, especially if you are relying on a small and fickle customer base.


> And since there are many competitors lowballing, we've had to focus even more development time on awesome features to stay ahead of the curve.

That is always the way, in any industry (those producing real holdable products as well as those of us producing code and other less tangible output). If you make something good and charge fairly for it, someone else will make a lower-quality version and sell it cheaper, and people will start expecting you to match that price without dropping any quality. The problem with software is that you often can't see the corners that have been cut until much later than with physical products, where you might have a chance of spotting the shoddy workmanship before you hand over any money. That makes competing on quality difficult (the competition can make the same quality claims as you can, whether they are true or not), so you end up having to compete on features.


People should realize that search engines are essentially just huge, generic scrapers.

I've written my share of scrapers and they are almost always a major PITA unless you're just doing a drive-by for one or two DOM elements. Scraping often ends up being extremely time-consuming, although there is rarely any significant challenge to it. Just a lot of monotonous, mind-numbing, tedious work.

My advice to most freelancers is: avoid scrapers until you get some teammates to share in the agony.


> And on nearly a weekly basis I encounter other programmers who say, "pssssh, I could do that in a weekend."

Hehe, I guess some of those also ask: "Why can't you just parse the HTML with a regex?" ;-)

http://stackoverflow.com/questions/1732348/regex-match-open-...
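To make the regex point concrete, here is a small Python sketch comparing a naive pattern with the standard library's `html.parser`. The sample markup, the `NAIVE_LINK_RE` pattern, and the `LinkCollector` class are invented for illustration; real pages are usually far messier than this.

```python
import re
from html.parser import HTMLParser

# A typical "weekend" regex: assumes lowercase tags, double quotes,
# and href as the first and only attribute.
NAIVE_LINK_RE = re.compile(r'<a href="(.*?)">')


class LinkCollector(HTMLParser):
    """Collects href values from anchor tags, however they are written."""

    def __init__(self) -> None:
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_via_regex(html: str) -> list:
    return NAIVE_LINK_RE.findall(html)


def links_via_parser(html: str) -> list:
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


SAMPLE = (
    '<a href="/jobs/1">One</a>\n'
    "<a class='title' href='/jobs/2'>Two</a>\n"
    '<A HREF="/jobs/3" >Three</A>\n'
)
```

On `SAMPLE`, the regex finds only `/jobs/1`, while the parser finds all three links. Patching the regex for each new variation is where the weekend goes.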


I spent some time about a year ago fussing with getting a web spider/scraper going; it was simple enough to download data, but actually putting it into a database that was domain-, content-, and time-aware was impressively complex and I put it on the backburner.
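As a rough idea of what domain-, content-, and time-aware storage might involve, here is a minimal Python sketch using `sqlite3`. It is not the commenter's design: the `snapshots` table, its columns, and the `record_fetch` helper are all assumptions made for illustration, and it ignores real-world problems such as URL canonicalization and near-duplicate content.

```python
import hashlib
import sqlite3
from urllib.parse import urlsplit

# Hypothetical schema: one row per distinct version of a page.
SCHEMA = """
CREATE TABLE IF NOT EXISTS snapshots (
    id           INTEGER PRIMARY KEY,
    domain       TEXT NOT NULL,
    url          TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    body         TEXT NOT NULL,
    fetched_at   TEXT NOT NULL  -- ISO 8601 timestamp supplied by the caller
);
CREATE INDEX IF NOT EXISTS idx_snapshots_url_time
    ON snapshots (url, fetched_at);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_fetch(conn: sqlite3.Connection, url: str, body: str,
                 fetched_at: str) -> bool:
    """Store a snapshot unless the body matches the latest one for this URL.

    Returns True if a new row was written, False if nothing changed.
    """
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM snapshots WHERE url = ? "
        "ORDER BY fetched_at DESC, id DESC LIMIT 1",
        (url,),
    ).fetchone()
    if row is not None and row[0] == digest:
        return False
    domain = urlsplit(url).hostname or ""
    conn.execute(
        "INSERT INTO snapshots (domain, url, content_hash, body, fetched_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (domain, url, digest, body, fetched_at),
    )
    conn.commit()
    return True
```

Hashing the body means repeated fetches of an unchanged page add no rows, so the table records when content changed instead of every visit.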



