Wednesday, November 12, 2008

X-Trace

What's the X stand for? Extensible? Extra?

Anyhow, X-Trace is Rodrigo and George's work on pushing tracing into each of the relevant layers to get a view of the entire system, which makes debugging easier. It does this in the obvious way, by carrying trace metadata along with each request as it crosses layers and hops, and it has the obvious problems.
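To make the "obvious way" concrete, here is a toy sketch of the idea as I understand it: every operation carries a small piece of metadata (a task ID plus an operation ID), each layer copies the task ID forward, and each operation reports who its parent was so the whole task can be stitched together later. All the names here (`Metadata`, `start_task`, `propagate`, `REPORTS`) are made up for illustration; this is not X-Trace's actual API.

```python
import itertools
from dataclasses import dataclass

_op_ids = itertools.count(1)   # hypothetical source of unique operation IDs
REPORTS = []                   # stand-in for wherever trace reports get collected


@dataclass(frozen=True)
class Metadata:
    task_id: str   # shared by every operation belonging to one task
    op_id: int     # unique per operation


def _report(md, parent, layer, label):
    # Each operation records its parent so the task tree can be rebuilt offline.
    REPORTS.append({
        "task": md.task_id,
        "op": md.op_id,
        "parent": parent.op_id if parent else None,
        "layer": layer,
        "label": label,
    })


def start_task(task_id, layer, label):
    """Create the root metadata for a new task (e.g. an incoming request)."""
    md = Metadata(task_id, next(_op_ids))
    _report(md, None, layer, label)
    return md


def propagate(parent, layer, label):
    """Copy the task ID into a new operation, down a layer or to the next hop."""
    md = Metadata(parent.task_id, next(_op_ids))
    _report(md, parent, layer, label)
    return md


def handle_http_request(task_id):
    http_md = start_task(task_id, "HTTP", "GET /index.html")
    tcp_md = propagate(http_md, "TCP", "send segment")        # down one layer
    propagate(tcp_md, "IP", "forward packet")                 # down another layer
    propagate(http_md, "HTTP", "proxy forwards request")      # next hop, same layer
    return http_md


def task_tree(task_id):
    """Return (parent_op, op, layer) edges for one task."""
    return [(r["parent"], r["op"], r["layer"])
            for r in REPORTS if r["task"] == task_id]
```

Calling `handle_http_request("t1")` and then `task_tree("t1")` gives four edges that all share one task ID, which is the cross-layer view the paper is after. The catch, of course, is that every layer has to agree to carry the metadata.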

The big problem here is adoption. To use X-Trace, you have to instrument your system, which means writing specific code in each protocol you want to trace. That's a huge headache. However, they deal with this well, allowing (limited) incremental adoption and focusing on an area (datacenters) that has the top-down control needed to push through such a sweeping change.

Performance is another problem. They analyzed the performance hit of their daemons and showed it's not a huge problem. However, their Apache trace took a 15% hit. Even assuming that number is on the high side because it's research code, that's still huge. The obvious fix, to me, is to instrument only a limited set of things, or to leave tracing off until you hit an error and then try to reproduce it on a limited number of servers running the service. I think they eventually did this as well.
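Roughly what I have in mind, as a sketch: a switch that keeps tracing off (or samples one request in N) and flips to tracing everything once an error shows up. `TraceSwitch`, `serve`, and the `handler(req, trace)` signature are all hypothetical; I'm not claiming this is how X-Trace does it.

```python
class TraceSwitch:
    """Decides per request whether to pay the tracing cost."""

    def __init__(self, sample_every=0):
        self.sample_every = sample_every   # 0 means tracing is off
        self._seen = 0

    def should_trace(self):
        self._seen += 1
        if self.sample_every <= 0:
            return False
        # Trace every Nth request; N == 1 traces everything.
        return self._seen % self.sample_every == 0

    def on_error(self):
        # After an error, trace every request so the bug can be reproduced.
        self.sample_every = 1


def serve(switch, requests, handler):
    """Run handler on each request and return the requests that were traced."""
    traced = []
    for req in requests:
        trace = switch.should_trace()
        try:
            handler(req, trace)
        except Exception:
            switch.on_error()
        if trace:
            traced.append(req)
    return traced
```

With tracing off and a handler that fails on request 3, only requests 4 and 5 out of 1 through 5 end up traced, so you only pay the overhead once something has actually gone wrong. The downside is that the original failure itself goes untraced, which is why you'd still need to reproduce it.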

I wonder how related this is to debugging on multicores. I argued with Andy about wanting a distributed GDB, and this is getting into the ballpark of such a system.

1 comment:

Randy H. Katz said...

X is for cross layer.

Their experience was that instrumenting protocols to make them X-traceable wasn't that bad, though it does require a reasonable implementation of the protocol. Some implementations are more difficult to instrument than others.