

Delicious's Data Policy is Like Setting a Museum on Fire - cwan
http://www.readwriteweb.com/archives/deliciouss_data_policy_is_like_setting_a_museum_on.php

======
PaulHoule
Back in 2004 I wanted to use Kleinberg's hubs-and-authorities algorithm on
Delicious, so I ran a crawler on it anyway, despite the robots.txt file. I
got blocked, and when I complained, I got an email from the founder telling me
to buzz off.
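
For reference, the hubs-and-authorities iteration itself is simple; here's a
toy sketch in Python (the graph is made up for illustration; on Delicious the
edges would be something like user-to-bookmark links):

    # Minimal hubs-and-authorities (HITS) sketch over a made-up link graph.
    graph = {
        "a": ["c", "d"],   # node -> nodes it links to
        "b": ["c"],
        "c": ["d"],
        "d": [],
    }

    nodes = list(graph)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(50):  # iterate until the scores stabilize
        # Authority score: sum of hub scores of nodes that link to you.
        auth = {n: sum(hub[m] for m in nodes if n in graph[m]) for n in nodes}
        # Hub score: sum of authority scores of nodes you link to.
        hub = {n: sum(auth[m] for m in graph[n]) for n in nodes}
        # Normalize so the scores don't blow up.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm

    print(sorted(auth.items(), key=lambda kv: -kv[1]))  # top authorities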

I've long seen the no-crawling policy of Delicious, plus the Roach Motel API
that was all about getting people to put their data in but not about letting
people get it out, as the dark side of "Web 2.0". We often hear about an API as
if it were a gift, but it's often a self-serving effort to take our data and
give nothing back in return.

~~~
tibbon
Could you use EC2, proxies, and Tor (horrid bandwidth, of course) to get
around some of the rate limiting?

~~~
PaulHoule
I've done that kind of stuff and, against an advanced opponent, you tend to
lose. (Although you can roll the average webmaster.)

Remember that IP addresses have a market price of about $3/month, and that's
what an honest proxy costs. Honest proxy providers rent machines in data
centers and have them bind to a wide range of addresses, all in the same
netblock. If you're coming from 20 different addresses in a netblock (paying
$60 a month), you still look suspicious. These guys might have machines in
several data centers, but they can't put you into hundreds of different
netblocks.
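
To make the detection side concrete, here's a toy sketch of how a site might
flag clients whose "different" IPs all cluster in one netblock (the addresses
are made up from documentation ranges, and the threshold is arbitrary):

    import ipaddress
    from collections import Counter

    # Hypothetical client addresses seen hammering the site.
    clients = ["203.0.113.5", "203.0.113.17", "203.0.113.88", "198.51.100.2"]

    # Collapse each address to its /24 netblock and count occurrences.
    blocks = Counter(
        ipaddress.ip_network(addr + "/24", strict=False) for addr in clients
    )

    for block, hits in blocks.items():
        if hits >= 3:  # many "distinct" clients, one netblock: suspicious
            print("suspicious netblock:", block)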

The economics might get better if you're sharing the proxies with other
people, but those other people are up to the Devil's work, and are busting
their asses 24/7 getting those IP addresses into everybody's block lists.

As for Tor, quite a few organizations block or limit Tor traffic... Databases
of active Tor gateways are available, and sites like Wikipedia use them...
Wikipedia won't let you make anonymous edits from Tor, because they don't like
dealing with griefers who use Tor.
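
A sketch of how such a check works (the exit-list URL is the one the Tor
Project publishes; treat the exact URL and format as an assumption):

    import urllib.request

    # The Tor Project publishes a plain-text list of exit addresses, one
    # per line; this URL is an assumption and may change.
    EXIT_LIST = "https://check.torproject.org/torbulkexitlist"

    with urllib.request.urlopen(EXIT_LIST) as resp:
        tor_exits = set(resp.read().decode().split())

    def is_tor_exit(ip):
        # e.g. reject anonymous edits, as Wikipedia does, when this is True
        return ip in tor_exits

    print(is_tor_exit("203.0.113.5"))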

Now, some people will use hacked machines as proxy servers. A botnet can
create a nearly undetectable cloud of IP addresses, but as far as I'm
concerned, use of a botnet is an ethical line I won't cross.

~~~
tibbon
Sounds like some fun stuff to try when profiling a site during a security
screening. I'll have to keep this in mind.

------
ams6110
The loss would be a shame, but calling it a "sick tragedy" is a bit of a
stretch. Unmaintained, the data will get stale fairly rapidly, and it won't
take long for another service to step in. There's a vacuum here, and someone
will fill it.

~~~
wslh
It's called history. The web is not only about the new; in a hundred years all
this data will be important to others.

~~~
drivingmenuts
Seriously? Somehow I doubt the presence or absence of lolcats and 4chan
will make much difference to future generations.

We're already drowning in data. We need to start making some executive
decisions about what's important and what's not. If it turns out we're wrong -
we'll deal with it.

~~~
jerf
Are you a historian?

Find one and ask them how many parts of their body they'd give to spend even
ten minutes in the town square listening to the mundanities you so casually
consign to the bit bucket.

There's more information there than you think, more than you can even see,
because you are a product of the time that generated it.

~~~
xyzzyb
Very much this. Big events are well covered and documented. It's the ephemera
that I find most fascinating.

<http://rs6.loc.gov/ammem/rbpehtml/pegenre.html>

------
tibbon
Err, _if_ I were writing a scraper today, I'd just ignore the robots.txt.

I dunno if Geocities had a similar robots.txt, but it didn't stop several
groups from archiving it (which was the right thing to do in either scenario).

~~~
PaulHoule
Well, I've worked for organizations that had active defenses against
crawlers... Make too many HTTP requests an hour and ~poof~ a deny directive
goes in the .htaccess file, or if they really like to play a rough game
they'll firewall you.
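
A toy sketch of that kind of defense (the window, threshold, and .htaccess
path are all made up; the Deny syntax assumes Apache 2.2-style config as
described above):

    import time
    from collections import defaultdict, deque

    WINDOW = 3600                # one hour, in seconds
    LIMIT = 1000                 # made-up requests-per-hour threshold
    recent = defaultdict(deque)  # ip -> timestamps of recent requests

    def record_request(ip, now=None):
        now = now if now is not None else time.time()
        q = recent[ip]
        q.append(now)
        while q and q[0] < now - WINDOW:  # drop events outside the window
            q.popleft()
        if len(q) > LIMIT:
            ban(ip)

    def ban(ip):
        # Append a deny directive to a hypothetical .htaccess file.
        with open("/var/www/.htaccess", "a") as f:
            f.write("Deny from %s\n" % ip)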

I know Delicious had active defenses because I ran afoul of them.

~~~
kevindication
And really, slowing your connections down to rates that are measured in baud
would be the most fun.

------
adambyrtek
I'm not a native English speaker, but shouldn't the phrase be "library on
fire" instead of "museum on fire"? The analogy comes from the ancient Library
of Alexandria, which was set on fire by Caesar.

~~~
J3L2404
From Wikipedia:

'...both the pagan historian Ammianus Marcellinus and the Christian historian
Orosius wrote that the Bibliotheca Alexandrina had been destroyed by Caesar's
fire. The anonymous author of the Alexandrian Wars writes that the fires
Caesar's soldiers had set to burn the Egyptian navy in the port of Alexandria
went as far as burning a store full of papyri located near the port. However,
the geographical study of the location of the historical Bibliotheca
Alexandrina in the neighborhood of Bruchion suggests that this store cannot
have been the Great Library. It is most probable here that these historians
confused the two Greek words bibliothekas, which means “set of books”, with
bibliotheka, which means library. As a result, they thought that what had been
recorded earlier concerning the burning of some books stored near the port
constituted the burning of the famous Alexandrian Library.'

------
_corbett
I'm a huge user of Delicious, with 2906 bookmarks and 3100 tags. In fact, if I
were to pick one Web 2.0 site, it would be this simple, straightforward one. I
use it to organize new lines of research (into an intellectual matter or
something as inane as a hotel) and to keep track of anything I find
interesting or useful on the web, particularly those things which took more
than a few minutes of googling to discover. It's a huge supplement to my
memory.

Really sad to see it go... If Yahoo had asked me to pay, I'd happily have done
so (I pay for Flickr, Spotify, Last.fm, RTM and many other oft-used services
happily).

------
iuguy
Yes, it's a PITA. I subscribe to certain tags via RSS to get interesting stuff
to read, and the loss of that is quite big for me. Saying it's like setting a
museum on fire is a bit too far, though.

------
akshayubhat
I can only think of using 80legs as a crawler, since it's distributed enough to
make sure that you don't run into any IP-address-based rate limiting. But
it's just a guess.

~~~
slig2
You probably can't do it, because 80legs respects robots.txt AFAIK.

------
unexpected
I don't understand why Yahoo doesn't try charging for it. Monetizing it was
tough, definitely, but it's been shown that actually CHARGING customers (as
opposed to going with a straight advertising model) can work.

If you're going to shut it down anyway, what's the harm in trying? Maybe have
a "stay of execution" for a quarter - tell users you're going to charge
$10/month for the service, and see how many users sign up. If you can break
even, why not keep it?

~~~
bruceboughton
Wouldn't charging for it give users an expectation that the service would stay
around for longer than a quarter? What if it's still not profitable? Now you
have to close down with paying customers.

~~~
unexpected
You're right - but given the situation, I think you could outline user
expectations and see what happens. Services that charge money close down all
the time.

I was envisioning something like reddit gold. They have more
users/subscribers, and seemed to have a lot of success with their
monetization.

------
rb2k_
> Nope, Yahoo! blocks all automated extraction of data from Delicious.

Uhmmm... this worked for me:

    curl --user user:password -o DeliciousBackup.xml https://api.del.icio.us/v1/posts/all

I only have 280 links on there, so maybe it is limited somehow. I really hope
it is; otherwise this would be a REALLY poor job on the part of ReadWriteWeb.

~~~
zephyrfalcon
That's only to export your own bookmarks. The author was talking about
extracting all (or any non-trivial number) of the bookmarks that are publicly
available.

------
glebk
This Python API for Delicious could give you a place to start:

<http://www.michael-noll.com/projects/delicious-python-api/>

Unfortunately, Delicious will throttle you if you hit the service more often
than once a second, so you might not be able to get too much valuable
information.
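
Whatever client you use, the safe pattern is the same: stay at or under one
request per second. A minimal sketch of that throttling (the URLs are
placeholders; substitute whatever pages or feeds you're actually fetching):

    import time
    import urllib.request

    def fetch_politely(urls, delay=1.0):
        # Delicious reportedly throttles anything faster than one
        # request per second, so pad each request out to the delay.
        for url in urls:
            start = time.time()
            with urllib.request.urlopen(url) as resp:
                yield url, resp.read()
            elapsed = time.time() - start
            if elapsed < delay:
                time.sleep(delay - elapsed)

    # Placeholder URLs for illustration only.
    for url, body in fetch_politely(["http://example.com/page1",
                                     "http://example.com/page2"]):
        print(url, len(body))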

------
chapel
I took a peek through the site and I really don't see a way to scrape
everything, or even most stuff off of their site. You can get the 200 pages of
the most recent bookmarks for any particular tag, but that seems to be about
it.

~~~
jonknee
User pages go back much farther, perhaps all the way. So if you can find a
decently large set of users you should be able to come away with a large
chunk.

~~~
mikeklaas
It's possible, but it would take years at the level of rate limiting they do.

------
agentultra
It's not terribly difficult to back up your bookmarks using the API. I wrote a
script a while ago that does just that and dumps everything into a neat little
SQLite DB.

I'm sad to see Delicious go, as it's a great collaborative tool and has awesome
powers when combined with Instapaper.

(BTW, if anyone wants a copy of my script, you can get in touch with me through
my site listed in my profile.)
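
Roughly, the approach looks like this (a sketch, not the exact script;
credentials are placeholders, and the endpoint and XML attributes follow the
v1 posts/all API quoted elsewhere in the thread):

    import sqlite3
    import urllib.request
    import xml.etree.ElementTree as ET

    USER, PASSWORD = "user", "password"  # your own credentials

    # posts/all returns <post href=... description=... tag=... time=...>
    # elements for every bookmark in your account.
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, "https://api.del.icio.us/", USER, PASSWORD)
    opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))
    xml_data = opener.open("https://api.del.icio.us/v1/posts/all").read()

    db = sqlite3.connect("delicious.db")
    db.execute("""CREATE TABLE IF NOT EXISTS bookmarks
                  (href TEXT, description TEXT, tags TEXT, time TEXT)""")
    for post in ET.fromstring(xml_data).iter("post"):
        db.execute("INSERT INTO bookmarks VALUES (?, ?, ?, ?)",
                   (post.get("href"), post.get("description"),
                    post.get("tag"), post.get("time")))
    db.commit()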

~~~
docgnome
They also have an export tool.

------
Stwerner
Funny, I was playing around with scraping bookmarks off delicious a while ago
with a rails app.

~~~
tdoggette
Now is the time to post that on GitHub, if it works at all.

------
cilantro
I hope Posterous is hard at work building a Delicious clone!

edit: Not completely sure why you all hate this comment, but fwiw I was being
sincere not snarky.

