In our spare time, we're researching this dataset in detail. Here are some questions that we're interested in. Would love to hear other ideas and to have folks dig into the data. I think this dataset may be of interest to hackers, researchers and marketers.
1. Are the trajectories (e.g. rank vs time) for all popular posts of the same shape? They look ~logarithmic.
2. Are there identifiable clusters when you look in 4d space (rank vs points vs comments vs time)?
3. How does the impact of a post depend quantitatively on its respective cohort? I.e., what's a good model to normalize performance based on what else was happening that day?
4. What fraction of posts have comment threads that are "hijacked" by the first comment? Is there a quantitative way to find this, perhaps by looking at (2) above?
5. What are more detailed metrics to collapse "performance" of a post onto a single number?
6. How does performance on HN compare to reddit, etc?
7. How is the HN community different from other communities, if at all?
8. Given the time-dependent data, can we create a good estimator for the number of active HN users per day? Or can we at least create a relative ranking of the number of unique users between different days?
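For question 1, here is a rough sketch of how one might test the "~logarithmic" hunch: fit rank ≈ a + b·ln(t) by ordinary least squares and look at R². The function name `fit_log_curve` and the (time, rank) input layout are my own assumptions for illustration, not anything defined by the dataset.

```python
import math

def fit_log_curve(times, ranks):
    """Least-squares fit of rank ~ a + b * ln(t). Returns (a, b, r_squared).

    times must be positive (e.g. minutes since submission); both arguments
    are hypothetical inputs, not a format defined by the dataset.
    """
    if len(times) != len(ranks) or len(times) < 2:
        raise ValueError("need at least two (time, rank) pairs of equal length")
    xs = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ranks) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ranks))
    if sxx == 0:
        raise ValueError("all times are identical")
    b = sxy / sxx                      # slope in log-time
    a = mean_y - b * mean_x            # intercept
    ss_tot = sum((y - mean_y) ** 2 for y in ranks)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ranks))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return a, b, r_squared
```

If the fitted R² is similar across popular posts and the slopes cluster, that would support the "same shape" idea; a poor fit on many posts would suggest the log form is only a visual impression.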
# of comments vs points would be interesting, and if you are willing to crawl the comments then diversity or complexity of commentary would also be interesting, e.g. several multi-threaded discussions vs. a string of 'this is awesome' comments on some popular but shallow topic (e.g. Hubble space telescope imagery or somesuch, which tends to attract much admiration but not necessarily a lot of discussion).
Very cool. I like the way that you show 'means vs time' on the left panel and then you can dig into the actual distribution on the right panel. FYI, I think that, e.g., http://weather-explorer.com/history/country/US/state/WA/city... should read "Daily High Distribution" rather than "Average Daily High" or similar. The mean trend line is on the left panel, and the right shows the entire distribution. I'd be curious to see also what the 2nd and 3rd moments look like vs time, to see if the weather has an equal 'spread' month over month or if it tightens up for certain periods of the year.
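To make the moments suggestion concrete, here is a minimal sketch that groups readings by month and computes the mean, the variance (2nd central moment) and the skewness (standardized 3rd moment). The `(month, daily_high)` input layout and the name `monthly_moments` are assumptions of mine; I don't know how the site stores its data.

```python
from collections import defaultdict

def monthly_moments(readings):
    """readings: iterable of (month, daily_high) pairs (hypothetical layout).

    Returns {month: (mean, variance, skewness)} using population moments.
    """
    by_month = defaultdict(list)
    for month, value in readings:
        by_month[month].append(value)

    result = {}
    for month, values in by_month.items():
        n = len(values)
        mean = sum(values) / n
        m2 = sum((v - mean) ** 2 for v in values) / n   # 2nd central moment
        m3 = sum((v - mean) ** 3 for v in values) / n   # 3rd central moment
        # Skewness is undefined when every value is identical; report 0.0 then.
        skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
        result[month] = (mean, m2, skew)
    return result
```

Plotting the variance per month next to the mean trend would show directly whether the spread tightens up at certain times of year.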
Also, you need to drop in a full post with commentary, analogous to what you did with your "learning python" post. More feedback about tools, resources, learning sites, etc.
Sure, nobody claims that CouchDB is the first to have replication. MVCC was invented decades back, and hash trees have been around forever. I believe those two combined are compelling, and if you follow, e.g., Riak, you'll see they're heading down the same path.
0) CouchDB will never lose your data. Period. Not many other stores are 'append only, copy on write'. If your data is transient, you may not care about that, but many apps expect the DB to never lose or corrupt data. Take it down with 'kill -9'? No problem, it's guaranteed to be consistent on disk.
1) I think document DBs are as good as or better than a key value store like Riak. It's great to have the choice, at a later point in time, to reach inside your documents, build indexes, etc.
2) The biggest wart with CouchDB from a scaling standpoint is that the only options are single server, master-slave, and master-master. There is no Dynamo-style clustering, à la Cassandra, Riak, etc. We added that in our own stack in '09 and it's finally hit the Apache CouchDB repo in a refined state. You'll see it in Apache CouchDB 2.0.
3) Finally, the biggest wart from a usability standpoint is the need to build materialized views. Ad hoc queries are painful. In Apache CouchDB most folks use Elastic Search in conjunction. In Cloudant we embedded Lucene into each cluster node so you can do the obvious things: 'GET http://...?q=name:"Mik*" AND age:[25 TO 34] & sort...'
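To illustrate why 'append only' in point 0 survives a 'kill -9', here is a toy model. This is NOT CouchDB's actual file format; the `AppendOnlyLog` class, the JSON records and the "COMMIT" marker are all invented for illustration. The idea it sketches is only that old bytes are never overwritten, and recovery ignores anything written after the last commit point.

```python
import json

class AppendOnlyLog:
    """Toy append-only store: updates never overwrite earlier records."""

    def __init__(self):
        self._lines = []  # stands in for a file opened in append mode

    def put(self, key, value):
        # Append the new version, then a commit marker; old versions stay put.
        self._lines.append(json.dumps({"key": key, "value": value}))
        self._lines.append("COMMIT")

    def crash_during_write(self, key, value):
        # Simulate a kill -9 after the data record but before the commit marker.
        self._lines.append(json.dumps({"key": key, "value": value}))

    def recover(self):
        """Replay the log, ignoring anything after the last commit marker."""
        state, pending = {}, []
        for line in self._lines:
            if line == "COMMIT":
                for record in pending:
                    state[record["key"]] = record["value"]
                pending = []
            else:
                pending.append(json.loads(line))
        return state  # uncommitted trailing records are simply dropped
```

Because a half-finished write only ever adds an uncommitted tail, the last committed state is always recoverable in this model.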
Good points. Replication is certainly not painless to set up, and I've had trouble with continuous mode simply failing. I'm sure those warts will be worked out though.
Lucene and Elastic Search go a long way, but it is just one more service to configure and maintain. That's been an annoyance for me. If Lucene could be built right into CouchDB that would be a major improvement.
However, that still doesn't let me cherry pick the values I want from a document. When the app is in a dynamic language, the cost of deserialization can add up.
Building some kind of xpath expressions to pull out specific parts of the doc would free up developers from spending as much time writing views, and would likely be much more performant to have that operation take place server-side. Maybe that should be an Elastic Search feature though and not Couch.
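As a rough illustration of what I mean by cherry-picking values, here is a sketch of a dotted-path projection over a JSON-like document. The `pluck` and `project` names and the dotted-path syntax are hypothetical; neither CouchDB nor Elastic Search is being described here. Done server-side, something like this would let only the requested values cross the wire.

```python
def pluck(doc, path):
    """Return the value at a dotted path such as 'address.city' or 'tags.0'.

    Returns None when any step of the path is missing.
    """
    current = doc
    for part in path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif isinstance(current, list) and part.isdigit() and int(part) < len(current):
            current = current[int(part)]
        else:
            return None
    return current

def project(doc, paths):
    """Build a small dict holding only the requested paths."""
    return {path: pluck(doc, path) for path in paths}
```

The client would then deserialize a handful of fields instead of the whole document, which is where the savings in a dynamic language would come from.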
There is indeed a deterministic but arbitrary choice of the "winner", but there is a big difference with other systems -- no data is discarded. The hash tree branches and both edits are saved. You can choose which branch of the tree you want to start editing, and you can listen to a feed for documents that have conflicting edits and resolve them according to your application logic. The real choice is to never discard data.
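A toy sketch of "deterministic winner, nothing discarded": every replica applies the same rule to the same set of conflicting leaf revisions, so they all agree on the winner without coordinating, and the losing edits remain available. The dict layout and the tie-break rule used here (deeper history first, then the larger revision id) are assumptions for illustration; check the CouchDB docs for the exact rule.

```python
def pick_winner(leaf_revs):
    """Deterministically choose one leaf from a list of conflicting revisions.

    Each leaf is a dict with 'depth' (history length) and 'rev_id' (string);
    this layout is hypothetical.
    """
    return max(leaf_revs, key=lambda rev: (rev["depth"], rev["rev_id"]))

def resolve(leaf_revs):
    """Return (winner, conflicts); the losing branches are kept, not deleted."""
    winner = pick_winner(leaf_revs)
    conflicts = [rev for rev in leaf_revs if rev is not winner]
    return winner, conflicts
```

The application can then walk the conflicts list and merge or pick a branch according to its own logic.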
I've only just started developing for Android (Foundbite's next version) so can't really give an opinion. Visual Studio is great to work with though. There are only 4 screen resolutions for Windows Phone so in that sense it's easier to develop the UI.