We also believe in measuring everything we can. We're interacting with many APIs across many boxes, and statsd + graphite are the tools we use to understand what's happening at runtime.
Graphite has a lot of warts, but it's really powerful once you get used to it. There are plenty of pretty interfaces you can put over graphite, but nothing really matches it for ease of ad-hoc queries.
Typically I'll use graphite to view ad-hoc metrics and build reports. When I find I'm repeatedly viewing a particular graphite report, I'll "hard-code" it in gdash for the rest of the team.
We use this combo to track thousands of separate metrics and we've been pretty happy with it so far.
Implementation was easy. statsd is pretty simple to deploy and graphite wasn't too difficult either. To add statsd reporting to your code, it's essentially one line to create the statsd socket, another line to declare each timer or counter, and another to increment it. I think we spent more time deciding what to name each metric than we did implementing the reporting in this project.
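To give a feel for how little code that is, here's a minimal sketch in Python that talks to statsd directly over UDP using its plain-text line format (`name:value|c` for counters, `name:value|ms` for timers). The address assumes statsd is listening on localhost on its default port 8125, and the metric names and helper functions are made up for illustration; a real project would more likely use an existing statsd client library.

```python
import socket
import time

# Assumption: statsd is running locally on its default UDP port.
STATSD_ADDR = ("127.0.0.1", 8125)

# One line to create the socket. UDP is fire-and-forget, so this never
# blocks on (or even notices) a missing statsd server.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def format_counter(name, delta=1):
    """Build a statsd counter line, e.g. 'api.requests:1|c'."""
    return f"{name}:{delta}|c"


def format_timer(name, millis):
    """Build a statsd timer line in whole milliseconds, e.g. 'api.latency:12|ms'."""
    return f"{name}:{int(millis)}|ms"


def send(line):
    """Send one metric line; metrics should never take the app down."""
    try:
        sock.sendto(line.encode("ascii"), STATSD_ADDR)
    except OSError:
        pass


# Hypothetical metric names, just to show the shape of a call site.
start = time.monotonic()
send(format_counter("api.requests"))
elapsed_ms = (time.monotonic() - start) * 1000
send(format_timer("api.latency", elapsed_ms))
```

Because the transport is UDP, a dropped packet or a dead statsd box costs you a data point rather than a request, which is a big part of why it's safe to sprinkle these calls everywhere.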
Now that I'm at dotCloud, I'm working with a much larger distributed system and we use it here also. We liked it enough to build some statsd hooks into the RPC layer we use for just about everything. Now every time a component makes a remote procedure call, a counter for that call is incremented and the response time is sent to statsd. It's been very useful for troubleshooting odd behaviors and correlating events across the platform.
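The idea behind that kind of hook can be sketched as a decorator that wraps a call, bumps a counter, and reports the elapsed time. This is not dotCloud's actual RPC code; the `instrumented` decorator, the `emit` callback, and the metric naming scheme are all hypothetical, and `emit` stands in for whatever function sends a line to statsd.

```python
import functools
import time


def instrumented(metric_prefix, emit):
    """Wrap a function so each call emits a counter and a timer.

    `emit` is any callable that accepts one statsd-formatted line
    (hypothetical here; in practice it would send the line over UDP).
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            name = f"{metric_prefix}.{fn.__name__}"
            emit(f"{name}.calls:1|c")  # count the call up front
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                # Report the timing even if the call raised.
                elapsed_ms = (time.monotonic() - start) * 1000
                emit(f"{name}.time:{elapsed_ms:.0f}|ms")
        return wrapper
    return decorator


# Stand-in for a real statsd sender, so the emitted lines can be inspected.
captured = []


@instrumented("rpc", captured.append)
def get_user(user_id):
    """Hypothetical remote call; a real one would go over the network."""
    return {"id": user_id}
```

Calling `get_user(7)` appends two lines to `captured`: a counter for `rpc.get_user.calls` and a timer for `rpc.get_user.time`. Putting the hook at the RPC layer means every component gets these metrics without anyone having to remember to add them.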
As people who work with complex distributed systems, we can't know exactly what they're doing. We'll think we know, and sometimes we'll be close. Other times we'll think we know, and then we'll wake up at 2AM because something failed horribly. By being able to monitor the system's behavior (sometimes in gross detail), we can get a little closer to knowing what's really going on.