People working on distributed tracing systems all tend to eventually come up with a similar architecture; that's true even going back to the IPS research system built at the University of Wisconsin in the late 1980s, the earliest distributed tracing system I know of.
They all tend to do tracing via minimalistic, low-overhead logging of RPC calls between machines, implemented in low-level libraries that application developers can mostly ignore. These trace systems seem to be good at uncovering latency bottlenecks. I don't know what success systems like Zipkin or Google's Dapper may or may not have had in areas outside of latency analysis.
We're using this at Twitter to better understand usage patterns for services upstream and downstream. For example, Gizmoduck, the user store at Twitter, is backed by memcache, with some disk-based storage behind the cache. While we can view individual traces that hit memcache, the aggregate info shows us both the proportion of traffic from services calling Gizmoduck and the proportion of time Gizmoduck spends in memcache versus the backing store.
Furthermore, it can be useful for flagging unusual behavior. If a service's aggregate durations have changed since yesterday, perhaps that's something we want to look at. Or if the ratio of traffic from some upstream service doubles, that's interesting to know.
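To make that concrete, here's a toy sketch of the kind of day-over-day check I mean; the aggregate numbers would come out of Zipkin's rollups, and the function name and threshold here are just made up for illustration:

    // Flag a service whose aggregate duration moved more than `threshold`
    // (e.g. 50%) relative to yesterday. Inputs are hypothetical rollup values.
    def flagIfChanged(serviceName: String,
                      todayAvgMs: Double,
                      yesterdayAvgMs: Double,
                      threshold: Double = 0.5): Option[String] = {
      val change = math.abs(todayAvgMs - yesterdayAvgMs) / yesterdayAvgMs
      if (change > threshold)
        Some(f"$serviceName: aggregate duration moved ${change * 100}%.0f%% since yesterday")
      else
        None
    }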
As for where you hook into the stack, it's definitely a tradeoff: the lower level you go, the wider your coverage, but the data also becomes more generic and potentially less actionable.
I've worked on building a similar system at AppNeta (it's basically a commercial Google Dapper). Here are some slides from a lightning talk I gave about distributed tracing at Surge last year: http://www.slideshare.net/dkuebrich/distributed-tracing-in-5...
How do you propagate the Zipkin info from thread to thread inside your code? An example: a request comes in, we generate a new request ID and pass it down into the processing code. Part of this code executes an async call to an external service with essentially a callback (in reality, it's a scala.concurrent.Future executing on an arbitrary context, or an Akka actor) - how do you properly rehydrate the Zipkin info when the response comes back? The only way we could think of is some sort of fiendishly complex custom ExecutionContext that inspects the thread-local state at creation time and recreates it in the thread running the callback, or just having pretty much every method take an implicit context parameter. Neither of those solutions worked well, so we've largely bailed on the concept for the bits of our code that don't execute in a linear/blocking fashion.
for every single async interaction.
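To sketch what that wrapping-ExecutionContext idea would look like (this is only an illustration; the TraceContext type and a plain ThreadLocal holder are stand-ins for Zipkin's actual trace state):

    import scala.concurrent.ExecutionContext

    // Hypothetical trace state; Zipkin's real trace/span IDs would live here.
    case class TraceContext(traceId: Long, spanId: Long)

    object TraceLocal {
      private val holder = new ThreadLocal[Option[TraceContext]] {
        override def initialValue(): Option[TraceContext] = None
      }
      def get: Option[TraceContext] = holder.get()
      def set(ctx: Option[TraceContext]): Unit = holder.set(ctx)
    }

    // An ExecutionContext that snapshots the caller's trace context when work
    // is submitted and restores it on whatever thread runs the callback.
    class TracedExecutionContext(underlying: ExecutionContext) extends ExecutionContext {
      def execute(runnable: Runnable): Unit = {
        val captured = TraceLocal.get               // snapshot at submission time
        underlying.execute(new Runnable {
          def run(): Unit = {
            val saved = TraceLocal.get
            TraceLocal.set(captured)                // rehydrate in the callback thread
            try runnable.run()
            finally TraceLocal.set(saved)           // don't leak state into pooled threads
          }
        })
      }
      def reportFailure(cause: Throwable): Unit = underlying.reportFailure(cause)
    }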
BTW, to the readers at home, note how almost all our core infrastructure is open-sourced.
Could you please elaborate on that? Does this mean that Twitter sees no danger in exposing these? Which parts of the infrastructure would be considered secret sauce and not open source? Or does it not matter when the company is as big as Twitter, since the core strength lies in user base, not technical infrastructure? What does Twitter primarily seek to achieve when it open sources its stuff? Talent acquisition or brand image or other benefits of open source such as collaboration?
You may be surprised (amazed?) to learn that, internally, 100% of composition happens in this manner. We have a massive code base, and we've not seen this be an issue.
Further, we've worked with the Scala community to standardize the idea of an "execution context", which helps make these ideas portable and keeps the particulars of the implementation transparent to arbitrary producers and consumers of futures, so long as they comply with the standard Scala future API. (Twitter futures will when we migrate to Scala 2.10.)
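For illustration, here's roughly what that looks like from the caller's side under the standardized API; the service calls are hypothetical, and any conforming futures implementation could sit underneath:

    import scala.concurrent.{ExecutionContext, Future}

    // Hypothetical service calls; the point is that callers compose futures
    // without caring which implementation or thread pool runs them, because
    // the ExecutionContext is just passed along implicitly.
    def fetchUser(id: Long)(implicit ec: ExecutionContext): Future[String] =
      Future(s"user-$id")

    def fetchTimeline(user: String)(implicit ec: ExecutionContext): Future[Seq[String]] =
      Future(Seq(s"tweet by $user"))

    def timelineFor(id: Long)(implicit ec: ExecutionContext): Future[Seq[String]] =
      for {
        user   <- fetchUser(id)
        tweets <- fetchTimeline(user)
      } yield tweets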
This isn't a unique problem, but using consistent libraries goes a long way (which works well internally at Twitter).
If you want to record and forward traces inside your own services, that's relatively straightforward (though not entirely trivial) to do. There are other implementations beyond the Finagle one in development by the community - I would suggest hopping onto the zipkin-users Google Group.
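As a rough sketch of the "forward" half, assuming Zipkin's B3-style propagation headers and a made-up Span type (your HTTP client would attach these to outgoing requests):

    // Attach trace identifiers to an outgoing request's headers so the next
    // service can continue the same trace. Header names follow Zipkin's B3
    // convention; the Span type here is a hypothetical placeholder.
    case class Span(traceId: String, spanId: String,
                    parentSpanId: Option[String], sampled: Boolean)

    def withTraceHeaders(headers: Map[String, String], span: Span): Map[String, String] =
      headers ++ Map(
        "X-B3-TraceId" -> span.traceId,
        "X-B3-SpanId"  -> span.spanId,
        "X-B3-Sampled" -> (if (span.sampled) "1" else "0")
      ) ++ span.parentSpanId.map("X-B3-ParentSpanId" -> _)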
"We closely modelled Zipkin after the Google Dapper paper"
And yeah, Errormator is not a tracing solution at all.