Hacker News new | past | comments | ask | show | jobs | submit login
Twitter's Tool For Debugging At Scale (twitter.github.io)
137 points by cdl on July 25, 2013 | hide | past | web | favorite | 28 comments

Hey, developer on zipkin here. Happy to answer any questions you have.

I'm happy to see the video of the Zipkin presentation at Strange Loop was put up a few months ago ( http://www.infoq.com/presentations/Zipkin ).

People working on distributed tracing systems all tend to eventually come up with a similar architecture. Even going back to the IPS research system made at University of Wisconsin in the late 1980s (that's the earliest distributed tracing system I know of).

They all tend to do tracing via minimalistic, low overhead logging of RPC calls between machines. They tend to do tracing via low-level libraries which application developers can ignore. The trace systems seems to be good at uncovering latency bottlenecks. I am ignorant of what success systems like Zipkin or Google's Dapper may or may not have had in areas outside of latency checks.

One of the projects I've been working on is doing more aggregate analysis of zipkin data: https://github.com/twitter/zipkin/pull/276

We're using this at Twitter to better understand usage patterns for services upstream and downstream. For example, Gizmoduck, the user store at twitter is backed by memcache, and some disk-based storage behind the cache. While we can view individual traces that hit by memcache, the aggregate info shows us both the proportion of traffic for services calling Gizmoduck, as well as the proportion of time Gizmoduck spends in memcache versus the backend store.

Furthermore, it can be useful for notifying of unusual behavior. If a service's aggregate durations has changed since yesterday, perhaps that's something we want to look at. Or if the ratio of traffic from some upstream service doubles, that's interesting to know.

Latency isn't the only thing--if that's all you wanted, you wouldn't need the tracing, just log the latency at each step. The real amazing thing is the structure of requests. Having a record of distibuted N+1 problems is pretty amazing, for instance.

As for where you hook into the stack, it's definitely a tradeoff. The lower level you go, the wider your coverage, but it also becomes more generic and potentially less actionable data.

I've worked on building a similar system at AppNeta (it's basically commercial Google Dapper). Here's some slides from a lightning talk I gave about distributed tracing at Surge last year: http://www.slideshare.net/dkuebrich/distributed-tracing-in-5...

We built something similar last year for our internal services, but ran into problems when dealing with asynchronous code and threading context.

How do you propagate the Zipkin info from thread to thread inside your code? An example - a request comes in, we generate a new request ID and pass this down into the processing code. Part of this code executes an async call to an external service with essentially a callback (in reality, it's a scala.concurrent.Future executing on an arbitrary context or an akka actor) - how do you properly rehydrate the Zipkin info when the response comes back? The only way we could think of is some sort of fiendishly complex custom ExecutionContext that inspected the thread local state at creation time and recreates that in the thread running the callback, or just have pretty much every method take an implicit context parameter. Neither of those solutions worked well, so we've largely bailed on the concept for the bits of our code that don't execute in a linear/blocking fashion.

We actually built something quite similar to that, but how do you avoid having to artificially insert a save() and restore(context) call all over the place? It looks to me like you'd have to do something like val context = Local.save() val eventuallyFoo = someServiceClient.makeCall("data").map { result => Local.restore(context) //do work }

for every single async interaction.

That's what we do, but it's built into the RPC library we use, finagle: https://github.com/twitter/finagle/blob/master/finagle-core/...

BTW, to the readers at home, note how almost all our core infrastructure is open-sourced.

That's intriguing.

Could you please elaborate on that? Does this mean that Twitter sees no danger in exposing these? Which parts of the infrastructure would be considered secret sauce and not open source? Or does it not matter when the company is as big as Twitter, since the core strength lies in user base, not technical infrastructure? What does Twitter primarily seek to achieve when it open sources its stuff? Talent acquisition or brand image or other benefits of open source such as collaboration?

More or less 100% of asynchronous composition is done via Futures[1] whose default implementation[2] does this for you.

[1] https://github.com/twitter/util/blob/master/util-core/src/ma... [2] https://github.com/twitter/util/blob/master/util-core/src/ma...

I mean this as politely and constructively as possible, but this looks like it has quite a few easy to fall into failure modes. If I'm not using a twitter Future, or if I've been given a Future from somebody else, it would look like I need to ensure to surround every compositional operation with this save and restore, otherwise my context is permanently lost. Additionally, I have to trust that any code I hand a closure to will do the right thing, otherwise my context is potentially lost since I have no guarantee on which thread that closure will actually be executed on. It seems like it would be virtually impossible to avoid this happening at least once in even a simple codebase, and when context is lost, it happens silently.

You are right that this fails when you move outside of the model. However, we don't.

You may be surprised (amazed?) to learn that, internally, 100% of composition happens in this manner. We have a massive code base, and we've not seen this be an issue.

Further, we've worked with the Scala community to standardize the idea of an "execution context" which helps make these ideas portable, the particular of the implementation transparent to arbitrary producers and consumers of futures, so long as they comply to the standard Scala future API. (Twitter futures will when we migrate to Scala 2.10.)

This is true, and I look forward to the Typesafe guys fixing both the default implicit ExecutionContext as well as any contexts that Akka creates. I actually implemented an ExecutionContext that does exactly what is described above, but ultimately we had to abandon it, since pretty much any library that deals with scala standard Futures in 2.10 has places where we could not provide our custom context. I can't wait until you guys get that ported upstream, because until then, I've explicitly banned the use of thread locals across our entire stack.

Yes, if you "go off the reservation" and outside of the Twitter Future, you will lose your automatic trace identifiers.

This isn't a unique problem, but using consistent libraries goes a long way (which works well internally at Twitter).

(For fun, I just searched through our entire codebase. The only manipulation of locals anywhere is in the util library.)

Could you explain a bit more how it works because, if a see the big picture, I fail to see how it could be useful outside Twitter's stack. I mean it seems to heavily rely on Finagle, right ? Please correct and explain if I'm wrong.

Zipkin depends on the services in the stack to be able to capture trace identifiers, record and send to the collector timing information, and forward trace identifiers for outbound calls. This is easy in Finagle as all of this logic is already abstracted away for most of the protocols. Note that Finagle is open source so its ready for use today.

If you wanted to record and forward traces inside your own services, this is relatively straightforward (though not super trivial) to do. There are other implementations beyond the Finagle one in development by the community - I would suggest hopping on to the zipkin-users Google Group.

How is this different from, say, sending straces or something similar to Splunk?

Zipkin is designed to piece together operations that are distributed across a fleet of systems. IE: nginx calls your rails server which calls memcache then calls your login server which hits a mysql database. It's most useful when your serving architecture is distributed across many many machines and many many different services.

Hey guys, the images on the project page are missing. Here's a Pull Request that fixes that: https://github.com/twitter/zipkin/pull/278

Merged thanks

Hey forgot images in the other files. Sent another one. Sorry 'bout that. https://github.com/twitter/zipkin/pull/279

How is this not just a clone of Dapper?

Very first paragraph of website:

"We closely modelled Zipkin after the Google Dapper paper"


an open source version of New Relic?

New relic is not a distributed tracing system.

if you are looking at "something like new relic" than https://errormator.com is an interesting solution - doing some things differently and shifting focus on differnt aspects.

And yeah errormator is not a tracing solution at all.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact