Scope of OpenTracing, scope of API repositories, RPC or Tracing #33
@kriskowal this is important stuff, thanks for reaching out... I almost wonder if it would make more sense to have a quick call or something (happy to take notes and summarize takeaways here)? If you're up for that I'm happy to schedule a hangout for all who are interested. Otherwise I can reply on github. Let me know.
Agreed, this is important: context propagation is the feature for an open tracing standard. Would be interested in joining such a discussion.
Want to write my view on this, as similar discussions are happening elsewhere. First, I want to clarify the terminology, not because mine is better, but to avoid ambiguity.

I will refer to request context as a context propagated in-process, such as Go's context.Context.

I will refer to distributed context as a context propagated between processes, across RPC boundaries. Data stored in that context must be serializable, and once added to the context it remains visible both in-process and propagates to all levels of the distributed call tree. Or, to put it in tracing terms: data in the distributed context is propagated to all future children of the current Span and their descendants.

Finally, there is the RPC request itself, which is a logical abstraction of the message sent on the wire during a remote call. The format of the message is specific to a given RPC framework, but we expect it to support the transmission of the distributed context, either as opaque data or as part of a "headers" map.

Note that the request context and the request provide the propagation fabric, in-process and between processes respectively. The various properties mentioned in the original post may belong to one or more of the contexts described above.
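To make the two kinds of context concrete, here is a minimal sketch; all names are hypothetical and not part of any proposed API:

```python
class RequestContext:
    """In-process context: may hold arbitrary, non-serializable objects.

    Propagated only within the process, e.g. via thread-locals or by being
    passed explicitly, like Go's context.Context.
    """
    def __init__(self, span, local_items=None):
        self.span = span                      # the current tracing span
        self.local_items = local_items or {}  # e.g. loggers, open handles


class DistributedContext:
    """Distributed context: string-to-string data that crosses RPC boundaries.

    Once an item is set, it is visible in-process and in all future children
    of the current span and their descendants.
    """
    def __init__(self, items=None):
        self.items = dict(items or {})        # values must be serializable

    def child(self):
        # A child request inherits a copy of the parent's items; writes made
        # by the child are not visible back up in the parent.
        return DistributedContext(self.items)
```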
I can see the temptation to put things like propagation of security tokens here, for lack of a propagation API. It is also tempting because this project began life named the distributed context propagation API! DCP is commonly implemented in RPC APIs, but I wouldn't go so far as to think they are cursed to never be useful outside, or that an effort similar to what we are doing here for tracing couldn't be done for context propagation. It certainly feels like an accidental conflation to stick duties like auth-token propagation in a tracing API. Each new area of responsibility we consider (logging, attachment handling, arbitrary context propagation, etc.) adds weight to the effort and should be carefully considered. I have heard numerous complaints that people can't keep up as it is.
I've mentioned in other places that DCP shouldn't be promoted for general application-domain coding. Putting infrastructural things like the correlation-id, user-id, or auth-token into the DCP makes a lot of sense to me; infrastructural systems like Zipkin, Kibana, and what-not can then effectively be used together. The word "infrastructural" here carries the emphasis, and there is no clear-cut definition of it. Some users will use DCP more on the application side of things. That's OK. DCP is a valuable addition in my opinion, but I'm asking that we make sure documented examples focus on the obviously infrastructural. When users do come along and use DCP for a crap-load of application-domain stuff (instead of using properly designed parameterisation) and their tracing system grinds to a halt because of the added payload, we can say: "look, DCP really isn't designed to be doing normal application parameterisation for you".
To me, the question is not what kind of data people put in the DCP, but how DCP is achieved. If people find it useful to propagate an auth-token, it's their business. If they want to stuff 1MB of application data, ultimately that's their business too; all we can do is provide guidelines, which we did in the API for Trace Attributes.

@adriancole I think DCP implementations by RPC APIs are inherently non-portable. I can build my whole stack around Finagle, or gRPC, or TChannel, and then suddenly I need to talk to Cassandra, or Redis, or the Hadoop stack; then what? DCP has to be vendor-neutral, which rules out RPC APIs. But I can see a state where OpenTracing is available at all levels, including RPC frameworks and big-data systems, so applications have a clear API for saving some value to a DCP and retrieving it five levels down in the SOA. And because OpenTracing defines the encoding API, it provides an interop bridge between different RPC APIs (a rough sketch follows at the end of this comment).

I do not dispute that DCP can exist without tracing, while tracing cannot exist without DCP. I just think that DCP by itself does not have a strong enough incentive for all frameworks to implement it, whereas tracing has a better story and can kill two birds with one stone.
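A rough sketch of that interop bridge, with invented method names (not the actual OpenTracing API): the point is that the tracer, rather than any one RPC framework, owns the wire encoding of the context.

```python
# Illustrative only: decode_from_headers/encode_to_headers are invented names.

def bridge_http_to_kafka(tracer, http_request, kafka_producer):
    # Inbound over HTTP: recover the distributed context from the headers.
    ctx = tracer.decode_from_headers(http_request.headers)

    # Outbound over Kafka: the same tracer re-encodes the same context, so
    # the two frameworks interoperate without knowing about each other.
    out_headers = tracer.encode_to_headers(ctx)
    kafka_producer.send("downstream-topic", value=b"...", headers=out_headers)
```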
Let me be frank, but keep in mind it's directed ultimately at what we express, not what we provide.

No. If you provide an API, from there the customer is always right.

At the application domain level, this is shockingly terrible systems design, and I hope that we do not promote such a thing. It's a promotion of bringing the past headaches of global variables into an even more complex world of distributed programming; I shudder from fingers to toes.
How do you propose we do that, aside from a stern warning that we already have in the API docs?
Again, are you talking about documentation or the actual API? I don't see how what is being discussed here is different from, say, grpc's Context, which warns against abusing it but does not have any API-level capability of preventing people from doing stupid things.
Basically, but clear, not stern. Provide examples that are clearly infrastructural and can't be confused with application-code examples. Provide documentation that clearly states the intended usage for DCP. You can't enforce, or make explicit, everything in the API. A good example is the avoidance of null parameters: often the best you can do is clearly document that nulls are not acceptable input, and in the few situations where they are, annotate those parameters with @Nullable.
@yurishkuro I think we agree about context propagation, and I think we are very close to having a complete API for it already. The remaining piece is a standard for implicit request context propagation, which I'll make the case for here.

The reason I think it valuable to consider standards around what is above referred to as request context has to do with interoperability/compatibility of instrumentation. If instrumentation is happening separately in a library vs. the framework it is used in, the instrumenters of each need to agree on where they will find the request context. As an example, one could imagine that an application framework is instrumented by placing request context in some sort of framework-specific request-global field. If an HTTP client library or ORM is instrumented for use in conjunction with this framework, it must now somehow be framework-aware. This presents a serious reusability challenge to module-by-module instrumentation.

It could be decided that solving this problem is outside the scope of OT 1.0 because all users are expected to manually instrument their applications. This might make sense in terms of scope-limiting, but it will pose a serious challenge to library integration and adoption. I'd prefer a future where libraries can come with OT hooks built in (a la DTrace), allowing for reusable instrumentation. As far as I can tell, this requires some way for instrumentation developers to agree on a request context propagation mechanism.

In TraceView, we've gone with thread-locals in the majority of cases, monkey-patching extensively for evented systems and thread pools in order to keep the request context propagated. (In some cases, like the nginx or Apache modules, the work is simple enough that per-request structs are more than sufficient for this request context storage.) However, OT should not dictate a particular storage mechanism. It seems that the biggest variable in which request context propagation mechanism is appropriate is the concurrency model, which varies by application. Instead of demanding a particular context store, we might provide an API that allows instrumentation to discover where a request context is stored: it could be initialized by frameworks (the things that start or join traces) and then read or written by any instrumented library downstream.

This isn't well thought out yet, but for discussion purposes, an example API might look like the sketch below (names purely illustrative):
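```python
# Names are placeholders; what matters is the level of indirection: the app
# chooses *where* context lives, and instrumentation discovers it here.

class RequestContextManager:
    """Global registry for the application's in-process propagation scheme."""

    _handler = None  # chosen once, at application init time

    @classmethod
    def set_handler(cls, handler):
        cls._handler = handler

    @classmethod
    def get_context(cls):
        return cls._handler.get_current_context()

    @classmethod
    def set_context(cls, context):
        cls._handler.set_current_context(context)
```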
You can imagine handlers that read/write a thread-local, or, in an evented framework, pull from a global which can manage per-request metadata; a thread-local variant is sketched below. This might be the wrong design. But it seems important to have some sort of interface if we want to make instrumentation of libraries as portable as tracers. Thoughts?
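For instance, a thread-local handler for a thread-per-request server might look like this (again only a sketch against the hypothetical API above):

```python
import threading

class ThreadLocalContextHandler:
    """Stores the current request context per thread.

    Suits thread-per-request servers; an evented framework would register a
    different handler, e.g. one backed by Tornado's StackContext.
    """
    def __init__(self):
        self._storage = threading.local()

    def get_current_context(self):
        return getattr(self._storage, "context", None)

    def set_current_context(self, context):
        self._storage.context = context

# Chosen once by the application at init time:
# RequestContextManager.set_handler(ThreadLocalContextHandler())
```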
@dankosaur I agree that we should have some sort of a story about in-process propagation, although I would be very hesitant to make it a blocker for OT v1.0.

We at Uber are trying to solve this exact problem, e.g. in one scenario where a Python, Tornado-based server makes a downstream call over TChannel (Uber's RPC). The server is instrumented implicitly via monkey-patching (lib here), and the context is propagated via a thread-local + Tornado StackContext combination (request_context.py). TChannel has no idea about it, because its own propagation mechanism is explicit (in all languages). We haven't quite solved it, but the approach I want to take is to have a static registration on the TChannel object (set at app init time) that takes a hook that can retrieve the tracing context from somewhere (sketched below). This sounds very much like your proposal.

What's interesting about this approach is that the decision about the actual method of in-process propagation is in the hands of the application itself. That's important because it affects how the application does multi-threading internally. It would be good to have a way for the handler-based approach to also work with explicit propagation.

One other complication in this overall problem is that in order to support things like …

I would like to stop at this point and wait for @kriskowal to respond on whether in-process propagation is the direction he wants to explore in this issue or spin it into another one, since the original question was broader in scope, and I would like to settle that one first.
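A sketch of that static-registration idea; register_context_provider is invented here, not a real TChannel API, and the provider simply reuses the hypothetical handler registry from above.

```python
# Hypothetical: TChannel stays ignorant of *how* context is propagated
# in-process; it only knows whom to ask before making an outbound call.

def tornado_context_provider():
    # Fetch whatever tracing context the monkey-patched Tornado layer
    # stashed in its thread-local + StackContext storage.
    return RequestContextManager.get_context()

def install_tracing_hook(tchannel_instance):
    # Done once, at application init time (method name invented here).
    tchannel_instance.register_context_provider(tornado_context_provider)
```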
Yes, let's see if it makes sense to split this out. I think it will be difficult to combine explicit & implicit propagation, requiring a bit of extra work on the part of the explicit-propagation implementer. Something like invoking the …
If you want to be an OpenTracing implementation, you now have to implement both; two birds with one stone, perhaps, but just saying that maybe not everyone can or will. An alternate route is to develop a DCP API separately from the tracer.
@adriancole I'm not convinced that play is "free". Regardless of extended use-cases for …, I can see the argument for limited scope: some will want to approach this as they might a statsd-type solution, manually instrumenting their app and without holding much hope for an ecosystem of instrumentation. Given the limited set of instrumentation likely to be available at OT 1.0, that will certainly be the majority use-case at the beginning.

On the flip side, if there is a default notion of implicit context propagation (which can be disabled/configured for manual explicit propagation), that would provide both a model and potentially out-of-the-box convenience for instrumentors in many languages.
@adriancole a few thoughts: …
Practical idea: we could (?) add a "capabilities" section to the sort of implementation-introspection call proposed here: opentracing/opentracing.io#33 ... That would remove the "pay to play" requirement. (The other initial "capabilities" candidate, in my mind, would be human-readable log messages, btw.) It would also allow us to be clear about what must be implemented (i.e., what's not a negotiable capability), as those features would not be present in the set of optional capabilities.
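For illustration, a capabilities call might look like this; the method and key names are invented, and set_trace_attribute only assumes the Trace Attributes API mentioned earlier in this thread:

```python
class ExampleTracer:
    def get_capabilities(self):
        # Only negotiable features are listed; anything mandatory is omitted
        # precisely because it must always be present.
        return {
            "trace_attribute_propagation": True,
            "human_readable_log_messages": False,
        }

# Instrumentation can then degrade gracefully instead of paying to play:
def maybe_set_attribute(tracer, span, key, value):
    if tracer.get_capabilities().get("trace_attribute_propagation"):
        span.set_trace_attribute(key, value)
```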
By way of summary, @yurishkuro proposes nomenclature for “request”, “request context”, “distributed context propagation”, and “in-process context propagation”. I’ll track these definitions hereafter.

@yurishkuro proposes two alternatives for solving “distributed context propagation”: one (A) is to piggy-back a key-value store API on spans; the other (B) is to provide spans alone and leave the concern of incorporating them in “requests” and “traces” as an exercise for RPC libraries.

I am in favor of the latter (B) because I am not convinced of the utility or wisdom of fanning out an arbitrary key-value store to all downstream requests. I believe it unwise because it establishes an ad-hoc global namespace shared by multiple services (as argued by @michaelsembwever and @bensigelman). This will cause request sizes to bloat as they go downstream.

I also don’t believe DCP over requests will be useful. There are specific properties of “request contexts” (properties of incoming requests) that should be propagated in very specific ways to “requests” (properties of outgoing requests). Copying is seldom sufficient. There are also corresponding response and response-context properties, each with their own semantics, particularly when merging multiple responses into an aggregate response. Solving the problem of RPC propagation and submitting ad-hoc key-values scoped to spans to a trace collector collectively take the wind out of propagating context with a key-value store on requests. RPC libraries would also miss some performance opportunities if they had to serialize and deserialize this bundle at every hop.

@dankosaur is in favor of solving context propagation. Would you be satisfied if we took a layered approach, where OpenTracing focused on tracing and another collaborative, open-source project used OpenTracing as part of a comprehensive RPC library with pluggable transport protocols and pluggable encoding schemes?

@bensigelman I am open to a conference call. I believe @yurishkuro may be in the best position to organize a chat.
Let's have a Hangouts call this week; please pick the time slots that work: http://doodle.com/poll/5kvcvag2vkga38fz
Thanks, @yurishkuro: I am confirming that the early time slots are 1pm PT, not 1pm ET, correct?
@bensigelman correct - I had the timezone enabled on the poll, so you should see your local time. For me the first slot is 4pm EST.
Nice comment, @kriskowal. It leaves me torn between adding the DCP functionality via …

We share concerns around an ad-hoc global namespace across services. At the same time, effort is underway to reduce the concepts involved for the end-user to just tracing; it is, and should be, a very simple domain to understand. Furthermore the end-user knows this and, already having to deal with a horde of different libraries, will often have very limited patience toward how complex tracing instrumentation is.

I was thinking the …

To deal with the risk of "request sizes bloating as they go downstream", a rough idea off the top of my head is that the specification implies a limit on the character length of the cascading tags map, along the lines that this is a limited context across services for tracing/system-level stuff. (There could be a way, like a system variable or something not so visible to the main API, to increase this limit if the end-user absolutely had to.)

The concern around request-->response changes in tags is an interesting one. Kinda hoping that isn't involved in the OpenTracing DCP proposal.
Useful background reference on Distributed Context Propagation, from Rodrigo Fonseca, co-designer of X-Trace: http://www.cs.brown.edu/~rfonseca/pubs/fonseca-tracing-1973.pptx
@kriskowal thanks for the summary. To clarify on my end: I am not sure that an RPC-library-based solution is the right approach. Certain companies have taken the step of standardizing around a single RPC framework, using it for both the send and receive sides of all services. However, in many cases I have seen, the distributed app is a hodgepodge of different libraries for communication over various channels (HTTP, AMQP, Kafka, etc.). If there's a basic standard for context propagation in-process, each RPC system can interoperate.

With the current APIs, a large part of DCP is already provided for: tracers can figure out how to marshal that context for the wire, and RPC libraries will have the responsibility to implement the APIs that do so. The missing piece is in-process propagation. That's why I suggest providing an API by which different RPC libraries, frameworks, and other instrumented libraries can agree on how context is implicitly propagated in-process.

I don't think it is necessary to force DCP schema complexity on the user as part of this: I suspect in the majority of use-cases the context (or "baggage", per the Fonseca paper) will only be used for propagating the span ID. The schema only needs to be considered to the extent that a user is explicitly making use of it, or an RPC library that does. This is similar to the expectations with HTTP headers today.

As for the bloat issue: it makes sense to introduce safeguards, and a tracer (which takes care of marshalling the context for propagation) has a good interception point, sketched below.
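A sketch of such a safeguard at the marshalling step; the limit and all names are invented for illustration, and the context is assumed to be string-to-string:

```python
MAX_CONTEXT_CHARS = 512  # illustrative; a real spec would pick its own limit

def marshal_context(distributed_context):
    """Tracer-side interception point: every outbound copy of the
    distributed context passes through here, so limits are enforceable."""
    total = sum(len(k) + len(v) for k, v in distributed_context.items.items())
    if total > MAX_CONTEXT_CHARS:
        raise ValueError("distributed context exceeds %d chars; refusing to "
                         "propagate" % MAX_CONTEXT_CHARS)
    # The RPC library renders this however its wire format requires.
    return dict(distributed_context.items)
```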
@kriskowal, one correction, regarding the view that fanning out an arbitrary key-value store to downstream requests is unwise:

I don't think it's unwise per se; I think it's risky, and it should come with caveat emptor warnings for both implementors and users. I like the idea of a "capabilities" feature for OpenTracing in general, and this would be a perfect motivating use case. The thing is, naming aside, this is a really easy feature to implement if the core tracing machinery is assumed to be there already (i.e., the situation we find ourselves in).
I agree @bensigelman. Tracing is a subset of the DCP problem, so it makes sense to expose some of the machinery we're using to solve it.
Re video call, the only common time slot was this Fri, Jan 29, 4:00 PM EST. I will send invites. @michaelsembwever and @dkuebric - do you want to drop me a note to ys at uber.com? I don't have your emails.
Thanks, I often learn from them.

I am definitely getting more of this, especially as the several … I'm also becoming increasingly aware of the simplicity afforded by …
Yeah. Not sure how others feel, but a capabilities struct – used responsibly! – seems like a net positive to me. (Side note: it would also help with, e.g., your concern about the noop impls and trace attributes: opentracing/opentracing.io#27 (comment))
Hello. I started a Node.js Zipkin client/server project a bit back (& switched to using Uber's new Thrift library!), and have been very interested in the collision of tracing and log-based compute; this ticket represents a compelling budding promise to me. I'd greatly appreciate being able to show up to better understand others' takes on this; if it's not too much of an ask, I'd like to be included in tomorrow's hangout. I take it the 4:00 time is PST? [ED: EST!] It's my username at gmail.com.
@rektide I'll add you. Do you have a link to that client/server project?
The collector made it the furthest, but between the holiday season and laying eyes on OpenTracing, traction fell away, & I'm tempted to change tack on it (very interested in applying Apache Flink).
We on Uber’s RPC group are embarking on a project to generalize context propagation from incoming requests to outgoing requests. We would like to start a discussion about the architecture and scope of open tracing.
@yurishkuro tells us that this undertaking is within the charter of the open tracing working group, which appears to:
“Provide a library in various languages that has common types for trace spans and contexts, and the logic for propagating incoming spans to outgoing spans. Provide a sufficiently general library that bindings for arbitrary clients, servers, and trace reporters can be developed independently and swapped as need arises.”
Instrumenting specific clients (e.g., the Node.js built-in HTTP client), servers (e.g., the built-in Go HTTP server), and trace reporters (e.g., one in Python for Zipkin) is out of scope but some will likely be contributed to the commons anyway. Wire formats for transmitting spans and annotations appear to be out of scope.
Apart from dealing with trace propagation, the tracing library provides an affordance for simple context propagation, copying attributes from incoming spans to outgoing spans. For generalized RPC context propagation, copying from incoming to outgoing requests will not be sufficient for most attributes, and will be undesirable for performance in other cases, as each hop would bloat fan-out requests.
A context propagation library will also be in a good position to forward cancellation to the downstream call graph and perhaps even abort local work.
We have a few options for stacking this architecture. The tracing library as it stands provides a tightly coupled solution for general context propagation, but does not address cases like “timeout” propagation. The common case is that each kind of request context needs unique logic for propagation; one example is sketched below.
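A concrete instance of that unique logic, sketched with invented names: a timeout cannot simply be copied from the incoming request to the outgoing one, because each hop must forward only the remaining budget.

```python
import time

def forward_deadline(incoming_deadline, local_slack=0.005):
    # Copying the parent's timeout verbatim would let a child outlive its
    # parent; forward the remaining budget, minus slack for local work.
    remaining = incoming_deadline - time.time() - local_slack
    if remaining <= 0:
        # Also the natural place to trigger cancellation of local work and
        # skip the downstream call entirely.
        raise TimeoutError("request deadline already exhausted")
    return time.time() + remaining  # deadline for the outgoing request
```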