
ScraperWiki - an online tool to make scraping simpler and more collaborative - phreeza
http://scraperwiki.com/
======
10ren
The tutorials as live demos are great. You should say "live demo" somewhere,
because "tutorial" makes it sound like text.

The front page is a bit split-personality: you have the idea of data mining,
and the idea of sharing data sets. I think you should pick one to be the
dominant focus, with the other present as an aside. It might make sense to
focus on producers at the beginning ("scrape data and share it"), and once
you've built up datasets, switch the emphasis to consumers ("Wikipedia of
datasets"). Maybe it's worthwhile to check how similar producer-consumer
websites did it at the beginning: Wikipedia, YouTube, Flickr.

BTW: What do you think of an interface like Excel has for web scraping: it
displays the webpage, and you select the bit you want visually (no coding, not
even HTML)?

------
keefe
I hesitate to say I'm working on a startup, but I've been working on a piece
of software for a few years now. One of the key components is a scraper, so I
have pretty serious interest in this topic.

It looks like they have thought things through pretty well, but I looked
around and didn't find interesting data or useful code.

Screen scraping is when a bot pulls the visual data from the screen and
analyzes it. A bunch of tutorials talk about web scraping but call it screen
scraping.

_Built-in source control_

really???

~~~
extension
_Screen scraping is when a bot pulls the visual data from the screen and
analyzes it. A bunch of tutorials talk about web scraping but call it screen
scraping_

A more useful definition would be "extracting structured data from human
interfaces"

~~~
keefe
Except it seems to imply pulling data from interfaces other than the screen,
so screen scraping is a subset of UI scraping. I still see this as distinct
from scraping data from web pages, because that's basically an entity
extraction problem which lies underneath the UI scraping problem.

------
slig
See also Dapper: <http://open.dapper.net/> It's visual and very easy if you
don't need anything fancy.

~~~
tsycho
Wow... just checked this out, created a simple "dapp"... it was pretty
awesome... I had never heard of Dapper before.

Thanks for the link :)

------
DanielBMarkham
This is a great idea. I haven't dug into the site yet, so forgive the
perhaps-dumb question: is there a standard meta-language for scraping sites? I
don't think XPath works with funky HTML. So does anybody have something that
would universally describe how to, say, get a user's pictures from Flickr? Or
the latest comments from Digg?

Something like that -- where the community could then develop cross-platform
tools to implement it -- would truly be ground-breaking.

(I don't think this is related to the nature of the HTML/regex debate, but I'd
be happy to be educated.)

Too many providers are taking data that the user knows intuitively that they
own and sticking it behind a wall. Anything that helps get that data back out
is awesome in my book.

~~~
amccloud
Parsley is great for this. <http://github.com/fizx/parsley> For Python:
<http://github.com/fizx/pyparsley>

There used to be a website for sharing "parselet" scripts.
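
For reference, a parselet is just JSON mapping output field names to CSS/XPath
selectors. Written out as a Python dict (going from memory of the Parsley
README, so treat the details as approximate), one might look like this:

    # Keys name output fields; values are CSS or XPath selectors.
    # "key(selector)" scopes the nested fields to each match of that
    # selector; a list means "repeat"; "@href" extracts an attribute.
    parselet = {
        "title": "h1",
        "stories(div.story)": [{
            "headline": "h2 a",
            "url": "h2 a @href",
        }],
    }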

~~~
DanielBMarkham
I find it amazing that somebody came up with a universal language for data
extraction and nothing is being done with it.

~~~
fizx
Hi, I wrote Parsley.

We (tectonic and I) also wrote an in-browser IDE, and a JavaScript-driven web
crawler that runs on headless Firefoxes on EC2. We wrote something similar to
ScraperWiki that integrated a simpler version of the IDE.

We got a couple of consulting deals building smart web crawlers for clients,
and about 15 passionate open-source users. I think the primary problem is that
unless you're scraping many sites, it's easier to write 50 lines in
Ruby/Python/$language_you_already_know than to learn a new language and cut it
to 5 lines.

If there's interest in digging this project up, please contact me at
kyle@kylemaxwell.com.

~~~
DanielBMarkham
I'd like something in .NET that I could throw a parselet/blob/widget/thingy at
and it would return a list of important stuff from a website that I was
authenticated on. Others could do the same from other platforms using the same
setup. And then the parselets would be independent of the platform. Ideally it
would be integrated into the browser or the O/S. This means when I sign up for
your service X, I also download my widget for it. I can then access the data
anytime I like without having to play inside your walled garden.

Not sure I'd want to pursue it right now, but shit, it'd be a game changer on
all kinds of levels. Especially if it allowed two-way communication.

EDIT: In fact, I know just how I'd deploy it -- as a shell mod in Windows. You
have a drive Y: which is really a Norton backup; no reason why you can't have
drives and folders that represent any sort of online service you have that
stores your stuff -- FB updates, Twitter, etc. Let the O/S worry about
synchronization and all of that. Why read 27 ads for FB games when you can
just go to X:\Facebook\FriendsStatus and read all of the unadorned status
updates, which is why you visit anyway?

~~~
fizx
This would be really easy on Unix with FUSE. It's been a while since I've done
Windows development.
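
Something like this rough sketch with the fusepy bindings (my library choice,
not anything from the thread; fetch_statuses() is a hypothetical stand-in for
a real API call):

    import errno
    import stat
    import time
    from fuse import FUSE, FuseOSError, Operations

    def fetch_statuses():
        # Hypothetical: a real version would hit the service's API
        # and cache the results.
        return {"alice.txt": b"Out hiking today.\n",
                "bob.txt": b"Shipped the release!\n"}

    class StatusFS(Operations):
        # Read-only filesystem exposing each status update as a file.
        def __init__(self):
            self.files = fetch_statuses()

        def readdir(self, path, fh):
            return [".", ".."] + list(self.files)

        def getattr(self, path, fh=None):
            now = time.time()
            if path == "/":
                return dict(st_mode=stat.S_IFDIR | 0o555, st_nlink=2,
                            st_ctime=now, st_mtime=now, st_atime=now)
            name = path.lstrip("/")
            if name not in self.files:
                raise FuseOSError(errno.ENOENT)
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                        st_size=len(self.files[name]),
                        st_ctime=now, st_mtime=now, st_atime=now)

        def read(self, path, size, offset, fh):
            data = self.files[path.lstrip("/")]
            return data[offset:offset + size]

    if __name__ == "__main__":
        FUSE(StatusFS(), "/mnt/statuses", foreground=True)

After mounting, "cat /mnt/statuses/alice.txt" reads a status update like any
other file.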

~~~
DanielBMarkham
Ain't nothing but a thing. Assuming there's a C library that would fetch the
page, apply the widget, and return the list, it's probably 3-4 days of work.

Using my Markham's Estimating Tool (double the number and go to the next
higher unit), that comes out at 6-8 weeks.

Windows shell extensions are all COM nonsense. You have to know the magic
numbers and know where to put them. The rest is pretty straightforward code.

------
extension
This is a nifty idea. I wonder how an application developer would use it.
Would you pull data directly from this site, using the most recent scraper
code? That would require fixed schemas for the scrapers, or at least backwards
compatible schemas.

This might make your app robust against changing data formats, but would also
leave it vulnerable to vandalism, which could potentially be far more damaging
for this site than for something like Wikipedia.

But, unlike wiki pages, scrapers don't need to improve incrementally; they
either work or they don't. So a community verification step before a new
scraper goes into "production" might be a feasible way to deal with vandalism.

------
tszming
Typo in the description of this page: <http://scraperwiki.com/editor/new/PHP>
(i.e., search for "Python").

Anyway, the IDE approach looks very, very nice. I have done a similar thing
before, but using Selenium/jQuery:
<http://github.com/tszming/Selenium-Google-Scrapper>. I still believe my
jQuery approach is more flexible for screen scraping :)

------
dorifornova
I recommend reading this series of posts on how to compare and choose web
scraping tools; it's dedicated to executives taking charge of projects that
entail scraping information from one or more websites.
<http://www.fornova.net/blog/?p=18>

------
shortformblog
Ooh. As a journalist who dabbles in data, this could be quite awesome to keep
around. I'll keep this in mind.

~~~
dustrider
They run Hacks and Hackers days for journos, recently in Liverpool and
Birmingham. I believe they've got videos of them up too.

We're actually talking to them about hosting one in South London in the next
month or two.

Not sure what their plans in the US are, but it could be worth dropping the
idea to some fellow journos and seeing if there's an organisation willing to
act as host.

EDIT: Forgot to mention, if you're interested in data you might also want to
take a look at the open data initiative (data.gov.uk) and the Guardian's Data
Store: <http://www.guardian.co.uk/data-store>. Apologies for the UK focus.

~~~
natrius
<http://hackshackers.com/>

------
pronoiac
I'm intrigued by this. I once crawled a wiki that later went away, and the
database was lost too. I've been looking for other people recreating a wiki
from HTML (including the edit pages, that is) for years since then, without
success.

~~~
Vivtek
I'd be happy to, if I understand you correctly that you want somebody to do
the parsing into a database-ready format, or a database you've already got
hosted somewhere. Email's in my profile.

------
crazydiamond
Ruby has a library named scrubyt

<http://scrubyt.org/>

~~~
astrofinch
Python has

<http://www.crummy.com/software/BeautifulSoup/>
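
For anyone who hasn't tried it, a minimal taste of the (2010-era, Python 2)
BeautifulSoup 3 API -- the URL is just a placeholder:

    # Fetch a page and print every link; BeautifulSoup copes with
    # fairly broken HTML along the way.
    import urllib2
    from BeautifulSoup import BeautifulSoup

    html = urllib2.urlopen("http://example.com/").read()
    soup = BeautifulSoup(html)
    for a in soup.findAll("a", href=True):
        print a["href"]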

~~~
evilduck
I was under the impression BeautifulSoup is no longer actively maintained.

(Edit: Which makes me sad, because I loved their documentation writing style)

~~~
krakensden
It's not :(

------
krakensden
If you want or need maximum tolerance for broken HTML, you can use Selenium to
scrape: as long as the markup doesn't break Firefox, it won't break your
script. (I've had some bad experiences with BeautifulSoup.)
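
A rough sketch of that approach with the Selenium WebDriver Python bindings
(URL and selector are placeholders):

    # Let a real Firefox parse the page, then query the live DOM.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get("http://example.com/")
        for row in driver.find_elements_by_css_selector("table tr"):
            print(row.text)
    finally:
        driver.quit()

The trade-off is speed: you're paying for a full browser render on every page.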

------
cmurphycode
Great idea. I'm convinced that we can find a way to tap into all of the
wonderful data on the Internet. Right now it's just too hard, which is part of
why reading a well-researched article is such a delight.

------
Sukotto
Please add signup via OpenID

------
cmelbye
Cool idea. I wish it had Ruby with the hpricot library, though.

