The most obvious questions:
What language are they using to run ExtractBot?
How did they identify the expensive function?
Where was this expensive function? (A CSS bot is mentioned; is this their own code, or did they use a lib, in which case the fix would be of interest to others?)
Is the ExtractBot home page demo form purposefully broken at present due to HN load, or just broken for me?
Without knowing those things, some guesses: It's a scraper in Node or Ruby, it uses a load of existing libs that were not written with performance in mind. Those libs pull apart HTML and extract text values which are returned in a JSON doc (or something). They wondered why they were running so hot, managed (intentionally? fortuitously?) to spot some loop that needn't exist, or an expensive function that could be avoided by some cheap caching.
My initial assumption is that the 2.2M pages per ~18h are the main workload. This is also supported by the chart at the bottom, outside of the 18h timespan there is hardly any baseload. The blog additionally gives the following facts: 18 c1.medium instances and ~60% utilization after the optimization (taken from the chart).
Now this allows us to calculate the time per page. First, the total workload per day: num_machines * cpu_time_per_machine = 18 machines * (18 h * 0.6) ≈ 194 h of processing per day.
At page level this is then 194 h / 2.2M ≈ 317 ms per page.
This feels really slow, and should even be multiplied by two to get the time per CPU core (the machines have two cores)! I would guess that the underlying architecture is probably either Node.js or Ruby. Based on these performance characteristics, the minimum cost for this kind of analysis is $25 per day. For customers this means that on average the value per 1k analyzed pages should be at least $1.13. I think this is only possible with very selective and targeted scraping, given that this only includes extracting raw text/fragments from the webpages and does not include further processing.
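The arithmetic above can be checked quickly; every input below is a figure quoted from the post or read off its chart, nothing else is assumed:

```python
# Back-of-envelope check of the estimate above, using only the numbers
# from the post: 18 c1.medium instances, an ~18h daily workload window,
# ~60% utilization, and 2.2M pages per day.
machines = 18
hours_active = 18
utilization = 0.6
pages_per_day = 2.2e6

cpu_hours_per_day = machines * hours_active * utilization  # machine-hours
ms_per_page = cpu_hours_per_day * 3600 * 1000 / pages_per_day

print(f"{cpu_hours_per_day:.0f} machine-hours per day")  # ~194
print(f"{ms_per_page:.0f} ms per page")
```

Using the unrounded 194.4 machine-hours gives ~318 ms rather than 317 ms; same ballpark.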
" 18 c1.medium (2 cores x 2.5 units) = 3,700 per second"
90 GHz at 3,700 per second works out to roughly 24 million CPU cycles per parsing routine.
Sounds like bloated JS, Python, or server-side PHP.
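That cycle count can be reproduced under two assumptions that are mine, not stated in the thread: that one EC2 compute unit is roughly 1 GHz, and that "3,700 per second" is the page throughput:

```python
# Reconstructing the parent comment's arithmetic. Assumptions (mine, not
# from the thread): 1 EC2 compute unit ~ 1 GHz of clock, and "3,700 per
# second" is the number of pages parsed per second.
ecu_total = 18 * 2 * 2.5              # 18 c1.medium, 2 cores x 2.5 ECU = 90
cycles_per_second = ecu_total * 1e9   # ~90 GHz aggregate
pages_per_second = 3700

cycles_per_page = cycles_per_second / pages_per_second
print(f"~{cycles_per_page / 1e6:.0f}M cycles per parsing routine")  # ~24M
```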
Firstly - my first HN front-page, yay!
So, this was a little unexpected to say the least. As has already been pointed out, this post was written about 14 months ago now, and yes details are a little light. I'll ignore the usual HN hospitality and answer a couple of the more pressing questions:
1. this was a very early MVP at the time, it was not a production-ready piece of enterprise software, so no, it was not built for out-and-out speed in the first instance.
2. yes there were probably much better options than c1.medium on AWS, but see (1).
3. yes it uses off-the-shelf libs, see (1).
4. no the website is not meant to work, I never got round to finishing it up, see (1).
5. sadly, I don't have the original git commit to reference (I was at my private repo limit and removed it) but yes, it was essentially a simple 1-line optimisation. IIRC it was something being evaluated in a loop that didn't need to be. Very mundane indeed. No tools used to identify it, I just knew it was being called a lot.
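For anyone curious what "something being evaluated in a loop that didn't need to be" typically looks like, here is a generic sketch of loop-invariant hoisting. The names and the regex workload are entirely made up, not ExtractBot's code (and note that Python's `re` module caches compiled patterns internally, so this particular cost is milder in Python than in some other stacks):

```python
import re

# Hypothetical before/after for the kind of one-line fix described above:
# an expression whose value never changes, evaluated on every iteration.

def extract_slow(pages, pattern):
    results = []
    for page in pages:
        compiled = re.compile(pattern)  # re-evaluated every iteration
        results.extend(compiled.findall(page))
    return results

def extract_fast(pages, pattern):
    compiled = re.compile(pattern)      # hoisted: evaluated once
    results = []
    for page in pages:
        results.extend(compiled.findall(page))
    return results
```

Both functions return the same matches; only the per-iteration cost differs.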
Ironically, the post was meant as linkbait to drum up a bit of interest in the tool and see if it was worth developing (hence the tabloid title). It didn't get any traction and so the project kinda halted.
That seems sad if true.
Some details about what tools you used to find the culprit, or how you optimised it, might be more interesting.
Without reading the article, I figured there was only one likely explanation—the function gets called a lot.
But you can optimize their usage in your code.
They probably could save that amount of money just by trying different machines/providers.
I wonder what's the annual worldwide cost of parsing XML for example.
Unless it's for something very specific, of course.
Example pricing (from OVH):
(You'd probably need to do some homework to find the most price-effective platform for your workload. For CPU-intensive workloads, I usually start with https://www.cpubenchmark.net/, given that OVH gives the exact reference of the CPU they're using.)
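A minimal sketch of that homework, comparing offers by dollars per unit of benchmark score. The prices and CPU-mark figures below are invented placeholders, not real OVH or benchmark data:

```python
# Compare hosting offers by dollars per 1,000 CPU benchmark points.
# All numbers are made-up placeholders for illustration only.
offers = {
    "offer_a": {"monthly_usd": 50.0, "cpu_mark": 8000},
    "offer_b": {"monthly_usd": 90.0, "cpu_mark": 20000},
}

def usd_per_kilomark(offer):
    return offer["monthly_usd"] / (offer["cpu_mark"] / 1000)

best = min(offers, key=lambda name: usd_per_kilomark(offers[name]))
print(best, usd_per_kilomark(offers[best]))  # offer_b 4.5
```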
Thankfully, there are a plethora of dedicated server providers out there that can easily beat Amazon's uptime record while still being considerably more cost-effective, without going as cheap as OVH.
There was also a rule about always showing your work. ;) The details of the actual optimization are less important than the means by which one measured where the CPU time was going, and how the expected result was confirmed.
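As an example of "showing your work", here's one standard way to measure where the CPU time is going: a profiler. This uses Python's built-in cProfile on a stand-in workload; nothing here is from the actual ExtractBot codebase:

```python
import cProfile
import io
import pstats

# Stand-in for an expensive parsing routine.
def parse_page(html):
    return [line for line in html.splitlines() if "<p>" in line]

def run(pages):
    return [parse_page(p) for p in pages]

profiler = cProfile.Profile()
profiler.enable()
run(["<p>hi</p>\n<div>x</div>"] * 1000)
profiler.disable()

# Print the five functions with the highest cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

The sorted report points straight at the hottest call paths, which is usually enough to spot a function being called far more often than expected.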
Many people are happy to play for internet points, whether they're SO points, number of patches submitted, bugs fixed, or something else. I'd be really happy if performance was one of the games they can play.
It's not an either-or. Maybe the best long-term approach would be to write it in a high-level style favouring readability, productivity and adaptability, while also being conscious not to use features that could lead to performance problems. Maybe you'd write it in a compiled (native/efficient-VM) language with garbage collection and be mindful of memory layout, instead of writing it in Ruby and then having to rewrite some parts in another, generally faster language later. Then you also wouldn't have that "Ugh, now I need to bust out C++/C/Go/whatever and rewrite that stuff... oh never mind, it's probably fast enough" moment when you start to run into performance problems.
 More precisely an implementation of a language which is compiled...
 Though maybe Ruby is compiled in the canonical implementation, for all I know.
No details: nothing on how they tracked down the issue, no mention of how they captured the timing. It kind of read like the status reports I send each week to my boss. But lighter.
I'm currently using Munin, but would like something more cluster-oriented (as in seeing the big picture, not individual servers).