The most obvious questions:
What language are they using to run ExtractBot?
How did they identify the expensive function?
Where did this expensive function live? CSS bot is mentioned: is that their own code, or did they use a lib where the fix would be of interest to others?
Is the ExtractBot home page demo form purposefully broken at present due to HN load, or just broken for me?
Without knowing those things, some guesses: It's a scraper in Node or Ruby that uses a load of existing libs not written with performance in mind. Those libs pull apart HTML and extract text values, which are returned in a JSON doc (or something). They wondered why they were running so hot and managed (intentionally? fortuitously?) to spot some loop that needn't exist, or an expensive function that could be avoided with some cheap caching.
My initial assumption is that the 2.2M pages per ~18h are the main workload. This is also supported by the chart at the bottom: outside of the 18h timespan there is hardly any baseload. The blog additionally gives the following facts: 18 c1.medium instances and ~60% utilization after the optimization (taken from the chart).
Now this allows us to calculate the time per page. First, the time for the total workload per day is num_machines * cpu_time_per_machine = 18 machines * (18h * 0.6) = 194h of processing per day.
On page level this is then 194h / 2.2M = 317ms per page.
This feels really slow, and it should even be multiplied by two to get the time per CPU core (the machines have two CPU cores each)! I would guess that the underlying architecture is probably either Node.js or Ruby. Based on these performance characteristics, the minimum cost for this kind of analysis is $25 per day. For customers this means that on average the value per 1k analyzed pages should be at least about $0.011 ($25 / 2.2M pages, or roughly $1.13 per 100k pages). I think this is only possible with very selective and targeted scraping, given that this only includes extracting raw text/fragments from the webpages and does not include further processing.
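For anyone who wants to check the arithmetic, here is a rough sketch in Python. The inputs are just the figures quoted above (18 machines, ~18h of load, ~60% utilization, 2.2M pages, 2 cores per machine); the function and variable names are made up for illustration and are not from the blog post.

```python
# Back-of-envelope check of the per-page time estimate.
# All inputs are the numbers read off the blog post and its chart;
# the names below are hypothetical.

NUM_MACHINES = 18          # c1.medium instances
ACTIVE_HOURS = 18          # hours per day with real load
UTILIZATION = 0.6          # ~60% CPU after the optimization
PAGES_PER_DAY = 2_200_000  # pages processed in the ~18h window
CORES_PER_MACHINE = 2      # c1.medium has two cores


def machine_hours_per_day() -> float:
    """Busy machine-hours per day across the whole fleet."""
    return NUM_MACHINES * ACTIVE_HOURS * UTILIZATION


def ms_per_page() -> float:
    """Average busy machine time per page, in milliseconds."""
    return machine_hours_per_day() * 3600 * 1000 / PAGES_PER_DAY


def core_ms_per_page() -> float:
    """Same figure counted per CPU core instead of per machine."""
    return ms_per_page() * CORES_PER_MACHINE


if __name__ == "__main__":
    print(f"machine-hours per day: {machine_hours_per_day():.1f}")
    print(f"ms per page (machine): {ms_per_page():.0f}")
    print(f"ms per page (core):    {core_ms_per_page():.0f}")
```

This lands at roughly 194 machine-hours and a bit under 320ms per page, or about twice that in core time, which matches the estimate above to within rounding.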
"18 c1.medium (2 cores x 2.5 units) = 3,700 per second"
18 x 2 x 2.5 = 90 compute units, call it 90GHz, for 3,700 per second
90GHz / 3,700 = roughly 24 million CPU cycles per parsing routine
sounds like bloated js, python, or server side php
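The same cycle count as a quick Python sketch. It assumes one EC2 compute unit is worth about 1GHz (Amazon described a unit as roughly a 1.0-1.2GHz 2007-era core, so this is on the low side), and the names are made up for illustration.

```python
# Rough cycles-per-routine estimate from the quoted fleet size and rate.
# HZ_PER_UNIT is an assumption: 1 compute unit ~ 1 GHz.

MACHINES = 18
CORES_PER_MACHINE = 2
UNITS_PER_CORE = 2.5
HZ_PER_UNIT = 1e9        # assumed clock equivalent of one compute unit
RATE_PER_SECOND = 3700   # throughput quoted from the post


def total_hz() -> float:
    """Aggregate clock rate of the whole fleet under the 1 GHz/unit assumption."""
    return MACHINES * CORES_PER_MACHINE * UNITS_PER_CORE * HZ_PER_UNIT


def cycles_per_routine() -> float:
    """CPU cycles available per unit of work at the quoted rate."""
    return total_hz() / RATE_PER_SECOND


if __name__ == "__main__":
    print(f"fleet: {total_hz() / 1e9:.0f} GHz")
    print(f"cycles per routine: {cycles_per_routine() / 1e6:.1f} million")
```

Note this treats the fleet as fully busy; at lower utilization the cycles actually spent per routine would be proportionally smaller.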