CKAN - The CPAN of data (ckan.net)
77 points by ig1 on April 6, 2011 | 14 comments



Every time something interesting that I want to read shows up on the front page, and clicking on it yields a spinning tab while the other end flaps about in agony, I can't help but wonder how many administrators know how to use ab(8). The rate of dead sites on HN is really a shocker, since commodity cloud that can stand up to HN's load -- as opposed to the shared hosting of yesterdecade -- is so widely available. I would have thought that Slashdotting would be a historical problem by now...

Is it not common sense to test the hell out of something before someone who would submit it to HN is even aware of its existence? WordPress is pretty bad for this (since most admins follow the directions and don't bother tweaking), and I've heard that Drupal can be too. Without tweaking and caching stuff, you'll fall over quick in front of this many eyeballs.

I know, easy for me to say.


Hi, I'm one of the CKAN devs. Just wanted to say the site is fully functional again (we've upped the caching to be a bit more aggressive).

As a side note we have indeed tested with ab :) Our problem is we continue to find the AWS instances we use somewhat unpredictable in their response to load (largely due, we believe, to the fluctuations in CPU "stealing" as load varies across the other instances that share the same physical box).

Anyway, if you want to know more about CKAN, have a look at http://ckan.org/about.


I have a bit of experience with Xen. If you're actually seeing a whole lot of steal (how much?), that's a bad sign because it means you're on a box with a lot of contention. In an ideal world, Xen should steal very little from you. I'm burning all four cores available to me on one of my personal Linodes, and the platform is barely stealing anything. Here's vmstat -s and uptime from that Linode for comparison:

       409198 non-nice user cpu ticks
     60878563 nice user cpu ticks
       166987 system cpu ticks
    811571786 idle cpu ticks
      4486779 IO-wait cpu ticks
           25 IRQ cpu ticks
        15388 softirq cpu ticks
       766577 stolen cpu ticks

    12:06:10 up 13 days, 14:11,  3 users,  load average: 4.00, 4.01, 4.05

I've had the pedal to the floor for a couple of days on the CPU, and only 766 kticks have been stolen (total) since I booted. If you're seeing a lot more steal than that, your host is working pretty hard to schedule the domUs fairly.
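To put the numbers above in proportion, here is a minimal Python sketch that parses the CPU tick lines from the `vmstat -s` output and works out what share of all ticks were stolen. The function names are made up for illustration, and it assumes the "N label cpu ticks" line layout shown above. On these figures the stolen share comes out under 0.1% of all ticks.

```python
# Sketch only: parse_cpu_ticks and steal_share are hypothetical helper names.
# Assumes the "<count> <label> cpu ticks" layout of the vmstat -s output above.

VMSTAT_OUTPUT = """\
       409198 non-nice user cpu ticks
     60878563 nice user cpu ticks
       166987 system cpu ticks
    811571786 idle cpu ticks
      4486779 IO-wait cpu ticks
           25 IRQ cpu ticks
        15388 softirq cpu ticks
       766577 stolen cpu ticks
"""


def parse_cpu_ticks(text):
    """Return a dict mapping each CPU tick label to its count."""
    suffix = "cpu ticks"
    ticks = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split the count from the rest of the line
        if len(parts) == 2 and parts[1].endswith(suffix):
            label = parts[1][: -len(suffix)].strip()
            ticks[label] = int(parts[0])
    return ticks


def steal_share(ticks):
    """Fraction of all counted CPU ticks that the hypervisor stole."""
    total = sum(ticks.values())
    return ticks.get("stolen", 0) / total if total else 0.0


if __name__ == "__main__":
    ticks = parse_cpu_ticks(VMSTAT_OUTPUT)
    print("stolen share of all ticks: {:.4%}".format(steal_share(ticks)))
```

Since most of those ticks are idle time, the share of busy time that was stolen is higher than the overall figure, so it is worth comparing like with like when you look at your own instances.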

Wouldn't dare to assume that I know better how to run your operation than you do, just sharing my experiences with Xen. Netflix had a solution to this -- unfortunate that it was necessary, but a solution nonetheless -- which was to monitor steal closely and spin up a new instance if it skyrocketed: http://blog.sciencelogic.com/netflix-steals-time-in-the-clou...
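For anyone who wants to try the monitoring idea, here is a rough Python sketch of watching steal on a Linux guest. It is not Netflix's implementation, just the general shape: sample the aggregate `cpu` line of `/proc/stat` twice and compute steal as a percentage of the ticks elapsed in between. The 10% threshold, the 5 second interval, and the "replace the instance" action are all placeholder assumptions.

```python
import time

# Hypothetical values chosen for illustration, not recommendations.
STEAL_ALERT_PERCENT = 10.0
SAMPLE_INTERVAL_SECONDS = 5


def read_cpu_line(path="/proc/stat"):
    """Return the aggregate 'cpu' line from /proc/stat (Linux only)."""
    with open(path) as stat_file:
        for line in stat_file:
            if line.startswith("cpu "):
                return line
    raise RuntimeError("no aggregate cpu line found in " + path)


def parse_cpu_line(line):
    """Parse 'cpu user nice system idle iowait irq softirq steal ...' into ints."""
    return [int(field) for field in line.split()[1:]]


def steal_percent(before_line, after_line):
    """Steal as a percentage of all CPU ticks elapsed between two samples."""
    before = parse_cpu_line(before_line)
    after = parse_cpu_line(after_line)
    deltas = [b - a for a, b in zip(before, after)]
    # Only the first eight fields are summed: guest time is already
    # included in user/nice, so adding it would double count.
    total = sum(deltas[:8])
    if total <= 0 or len(deltas) < 8:
        return 0.0
    return 100.0 * deltas[7] / total


if __name__ == "__main__":
    previous = read_cpu_line()
    while True:
        time.sleep(SAMPLE_INTERVAL_SECONDS)
        current = read_cpu_line()
        percent = steal_percent(previous, current)
        previous = current
        if percent > STEAL_ALERT_PERCENT:
            # Placeholder: a real setup would call its own provisioning tooling here.
            print("steal at {:.1f}%, consider replacing this instance".format(percent))
```

In practice you would probably want several consecutive high samples before acting, so that one noisy interval does not trigger a replacement.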

Given the opportunity, I'd like to point out that I meant no disrespect in my original comment, if it wasn't clear. I was speaking in general and not about CKAN specifically, a fact lost on those mindlessly downvoting me.


No disrespect was taken. The Hacker News coverage came as a big surprise. We like to keep caching mostly turned off, and we know this is a risk. This is because we do not want the possibility of any stale data, as this is annoying to the type of users we have. We are working on a better cache invalidation scheme, but this has not been a big priority.

Your feedback is appreciated, thank you.

Edit: Our amount of steal was much much higher than that.


Have you considered implementing some sort of script to scan some of the large biological databases and add links/metadata for the datasets they contain?

Looking at what's in CKAN now, it seems that it's mostly datasets that are a bit more easily understood than most of the biological data that's out there, but at the same time indexing and accessing biological data is a HUGE problem for researchers in this field.

There are currently some big databases such as the data stored by the UCSC genome browser (genome.ucsc.edu/downloads.html) and all sorts of expression/small RNA data available from GEO (Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/), and lots of other slightly more esoteric databases like flybase.org, which specializes in fruit fly data.

Truly doing a proper job of indexing/classifying all of this is a close-to-impossible task (and in many cases requires specialized knowledge), but there are an absurd number of publicly available biological datasets out there. If you wanted to rapidly expand the number of entries you have you could use a script to index one or two of the big databases like GEO, and fill in the metadata from what they already have.

Of course, I can also understand why you might prefer to have the majority of the datasets in your site be the sort of thing most people (or at least, non-biologists) can interpret vs. something that's highly specialized like this. Not to mention, keeping up with all the new data, and properly filling in all the metadata could be a real can of worms.


Sorry for the late reply. I sadly do not understand the concerns of this field very well. There are many very large datasets referenced on CKAN, mainly links to huge triple stores. There are many biological datasets as well, e.g. flybase as mentioned. These triple stores are too big to do any decent dynamic linking against, which is a big shame.

If you get the opportunity, could you repost this to ckan-discuss@lists.okfn.org? There are people on that list who understand these issues far better than I do, and they would love to hear from anyone interested.


On google's cache: http://webcache.googleusercontent.com/search?q=cache:yNyANy-...

And I bet the poster (ig1) is not the site's admin.


I never spoke to ig1 in my comment. "You'll" there was in the general.


You are correct.


seems to be back up now... the http headers suggest caching with a time of >30 mins is now in effect... i think caching and cache poisoning is generally hard to get right for dynamic data-driven sites...


We actually started using the Czech instance of CKAN, http://cz.ckan.net/, to publish information about the Government datasets that are available.


Interesting side note. CPAN was itself a copy of CTAN (TeX's repository).

CPAN - The CTAN of Perl.


Is this just a rival to Infochimps or is there something different with CKAN that I've missed?


I wish the tags page actually included frequency counts and could be sorted by frequency.



