I'm really impressed by test_unit hitting #2. Most people who wander off into Ruby bring their TDD enthusiasm back to the other languages they use. That's quite an impact on the entire industry when you think about it. Kudos to all involved for that :)
Thanks! The name's "test/unit", but I won't hold it against you ;-). It's pretty sweet that people have gotten so much mileage out of code I wrote, and if it's helped us as a profession - even a little bit - to write better code, I'm thrilled.
20MB/sec is pretty dreadful, and 2 hours is not timely for that quantity of data.
I'm guessing you're paying a high price for the convenience of using a database? For the kind of query you did, I'd run grep on the command line against the source, possibly combined with a summarizing program written in Ruby.
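The summarizing program mentioned here might look something like the sketch below. It's a hypothetical example, not the commenter's actual tooling: it assumes the gem sources are unpacked under a local directory and that the query of interest is counting `require` targets across `.rb` files.

```ruby
# Hypothetical summarizer: tally `require` targets across a source tree.
# The directory layout and the require-counting query are assumptions.
require 'find'

def count_requires(root)
  counts = Hash.new(0)
  Find.find(root) do |path|
    next unless path.end_with?('.rb')
    File.foreach(path) do |line|
      # Match lines like: require 'time' / require "json"
      counts[$1] += 1 if line =~ /^\s*require\s+['"]([^'"]+)['"]/
    end
  end
  counts
end

# Usage (path is illustrative):
#   counts = count_requires('/mnt/gem_sources')
#   counts.sort_by { |_, n| -n }.first(10).each { |name, n| puts "#{n}\t#{name}" }
```

The regexp only catches literal, line-leading requires, which is roughly what a grep-based pipeline would see as well.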
I love the power of the unix shell. find | xargs cat | grep would get the job done just as well if you had all the source in an accessible file tree, but I don't think you'd see any performance increase, as the bottleneck is still random reads from EBS.
The single 3k IOPS EBS volume being used delivers a theoretical max of about 24MB/s with 8k pages. I'm fine living with 20MB/s in practice.
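The arithmetic behind that ceiling, assuming every I/O operation reads one full 8 KiB Postgres page:

```ruby
iops       = 3_000      # provisioned IOPS on the EBS volume
page_bytes = 8 * 1024   # Postgres default page size (8 KiB)

max_bytes_per_sec = iops * page_bytes   # => 24576000
max_bytes_per_sec / 1_000_000.0         # => 24.576 (decimal MB/s)
```

So roughly 24MB/s is the best case when the volume is IOPS-bound, and the observed 20MB/s sits close to it.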
In fact, Postgres does inline (de)compression and optimizes for sequential reads, so it's likely the shell would be slower for this workload, given the apples-to-oranges characteristics. I'd love to see performance tests making this sort of comparison; they're always educational.
Even with a database it's dreadful. At 20MB/sec, they'd have to value their waiting time very low before it'd be cheaper/faster to buy a small server outright and put a couple of SSDs in it, if they do this kind of analysis more than a couple of times.
Or even load it up with enough memory to keep everything in RAM during normal operations. I can't remember the last time I worked on a system that did less than a couple of hundred MB/sec... And we generally buy servers in the $3k-$6k range, so nothing ridiculous.
Clarifying (since I think your post is easy to misinterpret if someone does not follow the link): The Time class is part of core Ruby. Requiring 'time' from the stdlib adds some additional methods. Thus it is not necessary to require 'time' to use the Time class, but requiring it _does_ add functionality.
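A quick illustration of the split: `Time.now` works with no require at all, while parsing helpers like `Time.parse` and `Time.iso8601` are among the methods the stdlib require adds.

```ruby
# Core Ruby: the Time class is available without any require.
Time.now.class        # => Time

# Stdlib 'time' adds parsing/formatting helpers such as
# Time.parse, Time.iso8601, and Time#iso8601:
require 'time'
Time.parse('2013-01-01 12:00:00').hour       # => 12
Time.iso8601('2013-01-01T12:00:00Z').utc?    # => true
```

Calling `Time.parse` before the require raises NoMethodError, which is usually how people discover the distinction.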
Very interesting. I'm now wondering whether Travis CI collects any code usage statistics? I'd imagine they would have a more application-centric view of the rubygems ecosystem. Also, since they actually execute code, they could potentially collect data on constants and method calls, I believe.