@RomanDzhabarov On the overhead, I don't mean computational, and +1 for the batching of course! What I mostly mean is that I like the way LightStep's Recorder is used here. What it does is express to the Tracer implementation the right way, in the context of a given application, to send out-of-band data through http. This is very useful. Now let's assume that you have a Zipkin C++ Tracer and it implements OT. Given the lack of standardisation on the recorder, the Zipkin Tracer will likely use a (likely blocking) library (e.g. cURL) to send HTTP and will force you to link against it. Then you will have to hack the Zipkin implementation to also allow a Recorder pattern that fits the Envoy async way of working, and even then it's highly likely you would still have to link in Zipkin's other HTTP-sending mechanism. This whole mess is the overhead I'm talking about. I think from what I see here, an API could look roughly like this:
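A minimal sketch of such an API, with hypothetical names (`HttpRequest`, `HttpTransporter`, `Recorder`): the application supplies the transport, and the tracer's recorder merely delegates to it, so the tracer never links an HTTP client of its own.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical types for illustration only.
struct HttpRequest {
    std::string url;
    std::string body;
};

// Supplied by the host application: it knows the right way, in its own
// context, to send out-of-band data (async, batched, via its event loop).
using HttpTransporter =
    std::function<void(const HttpRequest&, std::function<void(int /*status*/)>)>;

// A Tracer implementation accepts the transporter at construction time
// instead of linking its own (possibly blocking) HTTP client.
class Recorder {
  public:
    explicit Recorder(HttpTransporter transport) : m_transport(std::move(transport)) {}

    void recordSpan(std::string encodedSpan) {
        // Delegate the actual I/O to the application-provided transport.
        m_transport(HttpRequest{"/api/v1/spans", std::move(encodedSpan)},
                    [](int /*status*/) { /* retry/drop policy would go here */ });
    }

  private:
    HttpTransporter m_transport;
};
```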
Note that implementations don't have to use this, since it might be taken in the constructor, which is outside the main API.
I do like @jmacd's idea. For C++ at least, I would take it one step further and enrich the existing Span interface to require an injection hook on the Span itself (sketched below). By doing so, we can specialize the type system but always have a reliable fallback. None of this would require runtime checks on behalf of the implementor; C++ would do it for us via double dispatch. (No guarantees this compiles, but I hope it gets the point across.)
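A rough sketch of that double-dispatch shape, with illustrative names only (`Carrier`, `ZipkinSpan`): the first dispatch is virtual on the Span's dynamic type, the second is plain overload resolution on the Carrier, so no runtime type checks are needed.

```cpp
#include <iostream>

class Span;
class ZipkinSpan;

class Carrier {
  public:
    virtual ~Carrier() = default;
    // Reliable fallback that works for any Span.
    virtual void inject(const Span& span);
    // Specialization hook a carrier may override for a concrete type.
    virtual void inject(const ZipkinSpan& span);
};

class Span {
  public:
    virtual ~Span() = default;
    // First dispatch: virtual on the Span's dynamic type.
    virtual void injectInto(Carrier& carrier) const { carrier.inject(*this); }
};

class ZipkinSpan : public Span {
  public:
    // Second dispatch: overload resolution statically picks
    // Carrier::inject(const ZipkinSpan&) here.
    void injectInto(Carrier& carrier) const override { carrier.inject(*this); }
};

void Carrier::inject(const Span&) { std::cout << "generic fallback\n"; }
void Carrier::inject(const ZipkinSpan& span) {
    inject(static_cast<const Span&>(span));  // default to the generic fallback
}
```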
This was one of my original thoughts when tackling the API, but it required that the Span interface be enhanced. I thought the spec was pretty rigid at the time and wasn't sure how welcome an addition to the base span interface would be, but it does clean up some things nicely. I'm not convinced I've thought through all of the ramifications yet; I'm most concerned about when we have different tracers interoperating, but I'll save all of that for Tuesday.
Upon further review, I'm not sure my previous idea would work either. Spans would still need to know about all the carriers and vice versa. The physical organization of the libraries would still be a problem :(
Submitted a virtual inheritance rough draft in #7. Trade-offs:

Pros:

Cons:
I didn't want to close this PR, but I'd hate to split the discussions across two of them. I'll let the opentracing gurus decide where the discussions should continue.
First, to summarize what was a fun and productive hangout today with @jquinn47, @SaintDubious, @lookfwd, @yurishkuro, @RomanDzhabarov, @jmacd, @tedsuo, and myself:
Anyway, @jquinn47: thank you for #7! IMO it's a worthwhile tradeoff. I feel odd cutting off the (valuable) discussion in this PR... let's wait for other opinions about #6 vs #7 and we can go from there.
Thank you @bensigelman for arranging the hangout yesterday, that was extremely useful. And thank you @jquinn47 for quickly giving us a new API to consider. My attempt to summarize the sticking point over Carriers and Spans:
For us, there is no way to change Span types across our system. There are too many services, transport mechanisms, and languages; there is no realistic way to upgrade them all. So we have to enforce exactly one Span type. With the virtual approach we have to play policy police with every Carrier implementer and say "You MUST cast this virtual Span to a concrete type and you may NEVER use anything else." Eventually some team will decide that within their island of services they are going to use some OTHER Span type, eventually that will leak out, and the whole system becomes unstable. We feel that the virtual approach allows for a flexibility that is dangerous to the system. It would be very helpful for us if someone has some real-world examples of when they might want to use this feature.
The way I perceived it, there was no definite "the Span (or anything) has to be there at runtime." What I felt was mostly that there's a "cognitive overhead" to templates: the more we can narrow their usage, the better. Let's keep the
To clarify this - I guess that
One more thing I would like to clarify for the entire discussion that, I believe, is quite C++ specific. We have to set the expectations right for the OT C++ spec/library. Please spend a few seconds to have a look at the source code/header file of a C++ standard library component. I guess my point is quite clear.

I think that implicitly, and partially because of other languages like Java, Go and Python, we have unrealistic expectations of what good, simple, quality C++ code looks like. Even large companies might have C++ experts that don't use templates, and that's perfectly fine and expected and lovely for application code. Not for library code. For example, let's have a quick look at google test/google mock, e.g. here. I can't see anything that looks like what's so nicely described in the documentation, and I seriously doubt anyone could create C++ library code that meets google mock's objectives by using just inheritance.
If I understand correctly what you mean by "changing Span type", the situation mostly arises with higher order tracers. These are some scenarios where higher order tracers are needed:
On the topic of custom carriers, as always I find it hard to follow abstract discussions about what could happen, and prefer to take concrete use cases and see how they are solved in the proposed API. The specific use case for a custom carrier I have is the following. Uber uses tchannel, a kind-of RPC framework & binary wire protocol, which has built-in support for zipkin-like tracing at the binary format level. Using those message slots is more efficient than passing tracing data via headers. So the instrumentation in various tchannel libraries (in different languages) performs inject/extract as a two-step process (sketched below).
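A hedged sketch of that two-step shape in C++, with hypothetical field and key names (tchannel's real frame layout may differ): the tracer first injects into a plain string-map carrier, then the tchannel library copies those entries into the frame's dedicated slots.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical slot layout: the binary frame reserves space for exactly
// these tracing fields.
struct TChannelTracingFields {
    std::uint64_t traceId  = 0;
    std::uint64_t spanId   = 0;
    std::uint64_t parentId = 0;
    std::uint8_t  flags    = 0;
};

// Step 1 (not shown): the tracer injects into a plain string-map carrier.
// Step 2: the tchannel library copies those entries into the frame slots.
TChannelTracingFields toFrameSlots(const std::map<std::string, std::string>& carrier) {
    TChannelTracingFields fields;
    fields.traceId  = std::stoull(carrier.at("trace-id"), nullptr, 16);
    fields.spanId   = std::stoull(carrier.at("span-id"), nullptr, 16);
    fields.parentId = std::stoull(carrier.at("parent-id"), nullptr, 16);
    fields.flags    = static_cast<std::uint8_t>(std::stoul(carrier.at("flags")));
    return fields;
}
```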
There was a sticky point about the interface of the carrier. In Go it wasn't much of a problem due to duck typing. In Python/Java we simply used a dict/Map as the carrier to pass the 4 fields. Another concrete example of custom carriers often arises in Java, where various http frameworks use different types to represent an http request. So what most instrumentations do is use the TextMap (or rather HttpHeaders) OT format with a wrapper around the specific framework's http request type that masquerades that request object as the standard text map OT carrier.
Trying to keep the conversation here, but this is a response to @yurishkuro's comment on #7. This is rehashing what we discussed yesterday. The problem with the approach you're suggesting, in a statically built language like C++, is the way the libraries are physically laid out. If I have a tracer that needs to know about different carriers, today we're asked to write something like this:
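A rough reconstruction of the shape being criticized, with illustrative carrier names: the tracer interface enumerates every concrete carrier it could ever meet.

```cpp
// Illustrative names only.
class SpanContext;
class TChannelCarrier;
class ProtobufOverTcpCarrier;
class BlueMessageCarrier;

// One overload per concrete carrier: the tracer library now has a
// physical dependency on every RPC mechanism in the organization.
class Tracer {
  public:
    virtual ~Tracer() = default;
    virtual void inject(const SpanContext& context, TChannelCarrier& carrier) = 0;
    virtual void inject(const SpanContext& context, ProtobufOverTcpCarrier& carrier) = 0;
    virtual void inject(const SpanContext& context, BlueMessageCarrier& carrier) = 0;
};
```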
This library is going to be built one of two ways:

1. The tracer defines the carriers in its implementation.
2. The carriers are defined outside of the tracer implementation.

In both situations, we have real-world build problems with this library at scale. In case 1, if I have an application that only sends 'Blue' messages but I want the tracer implementation, I'm now forced to link against all of the different carrier libraries that the tracer needs. Using your concrete example: if you only had tchannel in use in your applications, but the tracer implementation also supported protobuf-over-tcp carriers, would you want to link against that library? In case 2, if the tchannel carrier is defined outside of the tracer implementation, the tracer still needs to link against it, and the link lines start to sprawl across every carrier library in the organization. Now what I did in #7 was try to do what @jmacd was suggesting: push those details into the carriers, so that the link lines look more appropriate for your task.
Now, ignoring the CRTP vs virtual debate, I believe there is a fundamental problem with the spec: I have not yet seen how a migration path should work for switching out tracer implementations at scale. A lot of our discussion has been about features used to accommodate this use case, but I don't see a clean way to pull it off. The demos are nifty for testing a corner of the interface, but I'd posit they're not practical at scale, once multiple applications begin communicating with each other over a set protocol.

Say we're using one tracer (e.g. zipkin) and want to move everyone to another (e.g. jaeger). For this to work, we first need a multiplexing tracer (see the sketch after this comment). All of the original tasks need to be rebuilt with a multiplexor first, but only emitting the original format. Now once everyone has that linked in, we can start to make the multiplexor emit both formats. Once that's done, we then need to require the new format alone before the old one can be retired.

This is a nightmare to pull off. I think it's trickier than traditional library rollouts, because every application is able to communicate with every other application over this tracing protocol; it's one dense graph. All of the carriers used throughout an entire organization will need to be coordinated throughout this entire process. The alternative is to have the tracer implementation do it all, which has all the negative side effects described in my previous response to Yuri.

This is the line in the spec that is resonating the most with me:
Since using multiple tracers has significant maintenance side effects, and there is no reason to assume more than one tracer is operating system-wide to begin with, I begin to ask: why is all of this flexibility put in place? I'm going to try my hardest to avoid the scenario I've described above anyway. I hope this helps drive the discussion. I really don't care about CRTP vs virtual, to be honest. The semantics of how these tasks are upgraded and interoperate are loosely defined; that's what I'm most interested in tightening up. I believe the design of the classes would come naturally once the problem is well formed and we're all talking about the same thing.
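For concreteness, a minimal sketch of the multiplexing tracer such a migration would need (illustrative names, not the proposed API): every inject fans out, so during the migration window each request carries both the old and the new propagation formats side by side.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Illustrative names only.
class SpanContext;

class Tracer {
  public:
    virtual ~Tracer() = default;
    // Writes this tracer's propagation format into the carrier.
    virtual void inject(const SpanContext& context, class TextMapWriter& carrier) = 0;
};

class MultiplexingTracer : public Tracer {
  public:
    void addTracer(std::shared_ptr<Tracer> tracer) {
        m_tracers.push_back(std::move(tracer));
    }

    // Fan out: each wrapped tracer writes its own format.
    void inject(const SpanContext& context, TextMapWriter& carrier) override {
        for (const auto& tracer : m_tracers) {
            tracer->inject(context, carrier);
        }
    }

  private:
    std::vector<std::shared_ptr<Tracer>> m_tracers;
};
```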
As Ben mentioned on the call, should this be considered the fault of the spec, or the nature of the problem itself? How would you solve this if there were no OpenTracing and you had 10k apps instrumented with a bespoke API? Fundamentally there are only two ways to solve this problem as far as I can see: either (a) follow the process you described and go through a migration period when two tracers are running in the apps, or (b) try to make the two tracing implementations interoperate. In this specific example (ironically), Jaeger tracers have the ability to read incoming Zipkin B3 headers, and the Jaeger backend has the ability to accept the Zipkin data model for spans, so one could get away with interop instead of dual tracers. But that's only because Jaeger was originally built on parts of Zipkin and we had similar data models and wire propagation formats. If you're trying to swap tracing systems that have more significant differences (very simple example: one uses uint64 trace IDs, the other string IDs), then interop is not possible, and dual-deployment migration is the only path. Also, if we avoid extreme examples of 10k apps (per the 80-20 rule, the majority of companies have far fewer microservices, say fewer than 100), then the dual-tracer approach becomes a lot more feasible and not as time-consuming.
I'd agree with your last point. I am (for better or worse) part of the 20% there.
You mean this comment: #7 (comment). I do not believe it introduces any library linkage issues, because the custom carrier is defined purely in the specific RPC library and the tracer does not need to know about it. The carrier in that case is just an adapter to transform the internal representation of the framework's remote request (like servlet.HttpRequest) to the interface required for the Format, like TextMapReader/TextMapWriter in Go. I am not sure if CRTP can do that; I'd think you need virtual methods for that. This usage of custom carriers is by far the most prevalent. Even Go OT uses it as part of the official API, see HTTPHeaderCarrier, which wraps http.Header as a plain TextMap carrier. The other example I gave, with TChannel, is indeed where the library linkage issue comes into view. That is not just a custom carrier, but also a custom Format. This is a very rare use case as far as I know, and we managed to solve it without having cross-dependencies by agreeing to use a standard (widely available) interface for the carrier itself; e.g. in Java we'd simply use a plain Map.
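A hedged C++ rendering of that adapter-style carrier, with illustrative names (`TextMapWriter`, `FrameworkHttpRequest`): the adapter lives with the framework's instrumentation, so the tracer never sees the concrete request type.

```cpp
#include <string>

// Illustrative carrier interface.
class TextMapWriter {
  public:
    virtual ~TextMapWriter() = default;
    virtual void set(const std::string& key, const std::string& value) = 0;
};

// Some framework's request type that we do not control.
struct FrameworkHttpRequest {
    void addHeader(const std::string& /*name*/, const std::string& /*value*/) {
        // framework-specific
    }
};

// Masquerades the framework request as a standard text-map carrier.
class HttpRequestCarrier : public TextMapWriter {
  public:
    explicit HttpRequestCarrier(FrameworkHttpRequest& request) : m_request(request) {}

    void set(const std::string& key, const std::string& value) override {
        m_request.addHeader(key, value);
    }

  private:
    FrameworkHttpRequest& m_request;
};
```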
To make this discussion more hands-on, I implemented here an ugly throw-away draft for:
To compile & run:
I don't want to speak too much for @SaintDubious and @lookfwd, so if they disagree, prepare to hear back from them :) I've personally gone full circle on this in my head at least 3 times now. Re-reading @yurishkuro's comments, I may be looking at what I see as problems too harshly through my 'Bloomberg' goggles. I'm still seeing issues with regard to where in the code Span implementations are exposed, and which components are responsible for knowing about those details. I've got many teams to convince to get this spread out, and if we can't come to a consensus here, I believe it will be even more difficult as I add more cooks to the opentracing kitchen.

We (myself, @SaintDubious, and @lookfwd) will have to be picky about the first tracer we deploy at large. We will be experimenting on a small scale over the next few weeks (likely with the CRTP interface since we already have it, then replacing it) to iron out a few kinks with other teams and see what speed bumps are waiting for us. CRTP has already shown itself to be problematic; we hit a template compiler bug on Solaris and have seen otherwise experienced developers complain about the cognitive overload that @bensigelman was wary of. While @lookfwd, myself, and probably STL implementors may be fine with it, we're likely in the minority. In my opinion, those are enough red flags to nix it. I would love to see what @jmacd does in the

I'm certain the implementation we move forward with is going to need to be "version aware" and that we only make changes in our organization in a backwards-compatible way. That's my problem to deal with and not necessarily OT's, but if the spec is still very fluid that would make adoption harder. I was hoping the spec would have a recommendation for that, but @yurishkuro's arguments are convincing; that's likely an intractable problem to ask this spec to solve.
@jquinn47, after our last chat, I came to appreciate another dimension of complexity, i.e. service ownership. That was the "final nail", and I'm perfectly convinced that there's a layer of OT that we will have to adopt and stick with, unchanged, for a couple of years. It's impossible to coordinate all those people to even just relink and redeploy. I don't think we should miss all the developments and opportunities in the tracing industry because of this, though. "All problems in computer science can be solved by another level of indirection", and this brings in the concept of OT "virtualisation" I vaguely mentioned (continued after the image).

You have a very lightweight Tracer that "talks OT v1.0" (and will stick to it, likely, for a decade). It does no transformation or anything clever whatsoever. Maybe it could have some tiny or moderate (configurable by polling the ot-proxy) intelligence around sampling, but that's about it. All that data gets forwarded to an OT proxy that must be installed on every server. You can update and configure the proxy however you like. It instantiates as many OT Tracers as it needs to talk to any implementations, and those Tracers might be of any version of OT. It might have migration logic that decides routing. You own this, so you do whatever you want. You (1 owner) roll out 1 service (ot-proxy) to (all) N servers, and it's a brand new day without other services ever having to change. Would something like this solve the problem?

Note: by getting the HOT in here right, we can prove the API is composable. If we can implement a proxy, then we can prove it's forwardable, i.e. that any actions can be replicated at another place/time. Things that could prevent the latter are e.g. if we can't explicitly set the timestamps for each operation. Another issue might be that proxies should be able to communicate ID mappings with each other (
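A minimal sketch of the in-process half of that idea, with hypothetical names (`ProxyConnection`, `ForwardingTracer`); the real split between the frozen in-process layer and the upgradable ot-proxy would be richer.

```cpp
#include <string>

// Hypothetical local transport to the per-host ot-proxy, e.g. over a
// unix domain socket; fire-and-forget from the application's view.
class ProxyConnection {
  public:
    void send(const std::string& /*encodedEvent*/) { /* transport-specific */ }
};

// Speaks a frozen OT v1.0 surface and does no transformation; all the
// clever routing/migration logic lives in the ot-proxy, which can be
// upgraded independently of the N applications.
class ForwardingTracer {
  public:
    explicit ForwardingTracer(ProxyConnection& proxy) : m_proxy(proxy) {}

    void finishSpan(const std::string& encodedSpan) {
        if (m_sampled) {
            m_proxy.send(encodedSpan);
        }
    }

  private:
    ProxyConnection& m_proxy;
    bool m_sampled = true;  // could be refreshed by polling the ot-proxy
};
```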
One thing I'd like to highlight is that it is only an api design choice that ends up conflating how to contribute to in-band propagation with how to contribute to out-of-band data. While we say things like "10 years" etc., keep in mind that the java library is version 0.something, and it would be highly optimistic, if not silly, for people to assume version 1.0 of OpenTracing is an LTS idea. There's a general rule in open source I've noticed, which is that nothing is really stable until its second year of experience. Some of these sentiments should hopefully steer next versions of OpenTracing.
Here are some thoughts on the in-band + out-of-band apis which may or may not help:
* Propagation interop is the most important interop in distributed tracing. There exist today different systems that talk B3, and there are single systems that talk B3 or "similar", where similar is some slightly different variant. Inter-process propagation is a separate problem from out-of-band data. Unless propagation works, traces are broken, and broken traces can be difficult, if possible at all, to stitch back together. When traces are joined properly, the out-of-band data reported by whichever tracer library or transport can be reconstructed into one or more trace depots. For this reason, I think propagation (inject/extract) formats, and possibly apis, are fundamental to distributed tracing.
* Propagation is a multi-system concern, and deserves a system-level abstraction. In Zipkin we spend time on [B3](https://github.com/openzipkin/b3-propagation), as it is the most important thing, more important than which tracer api is in use. Not only is it easier to pin down, it is used across systems which share no code at all. Maybe this year there will be a [cleaner format](https://groups.google.com/forum/#!topic/distributed-tracing/3pW5moQ0yyE)... who knows. If that happens, and as Yuri mentioned, the high priority would be at a transport abstraction: how to link together traces which use different propagation formats. OpenTracing is right to have an api exposed to support multi-system concerns, though in practice its non-goal of interop has pushed testing to end implementors. For example, Jaeger has system tests to ensure their propagation across library variants. This is where the rubber hits the road: the system abstraction. The library code is usually really uninteresting; plus, most tracers I've seen at least somewhat punt on dynamic type support.
* Propagation apis happen to be linked to the Tracer api in OpenTracing 1.0; it is mostly harmless. Not everyone wanted the current form of the tracer api, particularly the part about inject/extract being dynamically resolved (me raises hand). When the decision was made, it was made like that, and I don't think many of the arguments raised here were absent then. Maybe a future OT api will make a different choice, especially as different people are involved now. Note that in practice, span mutation and propagation are often implemented decoupled. For example, both Jaeger and Brave have independent Extractor/Injector types that can fix both the form of propagation and the implementation of it. In Brave, it is a completely separate api (which we stitch back to OT in openzipkin-contrib/brave-opentracing#14, since OT requires it). You can make instrumentation that avoids the OT propagation api, but that would significantly impact re-use. I really hope the next OT version is more modular with regard to highly decoupled features such as propagation (in or out of process).
@adriancole: the propagation interop pain point is painful for you as someone who maintains a tracing system that is dealing with version skew across numerous forks/etc. The lack of standard tracing instrumentation is the pain point I hear over and over again when speaking with (50+? 100+?) engineering orgs that are integrating tracing systems. TL;DR it seems you are conflating the needs of people who spend their time implementing tracing systems with the needs of people who spend their time trying to integrate with them.

OpenTracing does indeed separate the concern of in-band propagation formats from the concern of describing the semantics of application software. In fact, that's mostly the point. As software complexity grows, we need some common way of describing behavior while making minimal assumptions about consumers of those descriptions. But indeed: distinct tracing systems have distinct in-band propagation formats (even distinct versions of tracing systems have distinct in-band propagation formats).

FWIW, I agree wholeheartedly with @adriancole about the red-flag-ish-ness of talking about "10 years" with any API that is in its early stages. I feel relatively comfortable making "conceptual" guarantees about most of the core abstractions in OT, but true compile-time compatibility between whatever is in this PR (or #7) and
> @adriancole: the propagation interop pain point is painful for *you* as someone who maintains a tracing system that is dealing with version skew across numerous forks/etc. The lack of standard tracing instrumentation is the pain point I hear over and over again when speaking with (50+? 100+?) engineering orgs that are integrating tracing systems.
>
> TL;DR it seems you are conflating the needs of people who spend their time implementing tracing systems with the needs of people who spend their time trying to integrate with them.
Version skew and numerous forks (to the degree that's even true) is no more a part of my focus today than any time I've been outside a monorepo. I *happen* to also do a workshop occasionally for implementors, *because* I spend most of my time with integrators. Integrators show up organically in my world. I don't go company to company; employees of companies (and random people) show up from every time zone. It isn't organized.

My goal was to highlight that propagation is a worthwhile thing to focus on, as if it is busted, your traces are busted. In other words, I'm encouraging people to accept vs perpetually reject negative feedback about decisions made.
For what it's worth, I don't go company to company either, Adrian... I probably should, but have been too busy. Most of these conversations I've had are ad hoc things. It's a little hard to read the tea leaves, but I think (?) you are suggesting that I'm the "people" who should "accept" rather than "perpetually reject" negative feedback. If so, I always welcome feedback, positive or negative, about anything I'm involved with. That said, if that "feedback" is actually in the form of advice to 3rd parties, it seems rather important to present the other side of the argument. (Which, in this case, is about whether (a) instrumentation of application code or (b) consistency of in-band propagation formats is the sticking point for distributed tracing deployments in the wild.)
> It's a little hard to read the tea leaves, but I think (?) you are suggesting that I'm the "people" who should "accept" rather than "perpetually reject" negative feedback. If so, I always welcome feedback, positive or negative, about anything I'm involved with.
The challenge I'm making is indeed pointed at you, not because of your opinion on the topic being discussed, but rather because you are the person who offers meetings in response to people questioning things.

My 2p is that when something comes up where people are participating in something like C++, but are not privy to the historical reasons for something, go ahead and find or open an issue for them! The issue description could include the rationale which is otherwise lost, being held across N issues. This also helps others who aren't watching threads like this or able to attend F2F meetings.

There's already a spec repo, so most of this is just applying it: https://github.com/opentracing/specification/issues

The only really new part is the explicit idea that progress is possible (i.e. an api past 1.0, like a github tag or milestone), so that fresh people have a role in shaping the future.
"10 years" <- This will happen anyway... it would happen with any API we would create internally or Open Source
That's the main reason I was pushing towards using OT instead of something proprietary. OT is way more mature conceptually compared to any v1.0 we would make in-house in such a short time. The proxy idea above means we can decouple instrumentation code in apps that might be stuck on v1.0 for a long time, and still evolve with the OT API (behind the proxy) without having to relink/redeploy all apps.
After much internal discussion, @jquinn47 and I have decided not to press for this PR anymore. So unless someone really wants to argue for it (which I strongly doubt), it is safe to close this. While we continue to like the static polymorphism and the concrete type of the span when passed to the carrier, the CRTP is annoying to work with (as @bensigelman feared it would be).
@SaintDubious @jquinn47 @lookfwd Thanks for the note, I was wondering how your conversation was proceeding. So what do you think about something like #7?
@bensigelman We like the direction of #7, with the proxy note above, but we should have more solid evidence in ~2 weeks' time.
Closing this issue, as the discussion has moved on to #7.
This is a significant rewrite of the current opentracing-cpp implementation. I've been working on this with @lookfwd and @SaintDubious internally for the last couple weeks.
Major Changes
I took issue with the original implementation in a few significant ways:
Library Organization
The original implementation required tracers to `dynamic_cast` `Writers` and `Readers` into a concrete type before they could be used. This is an anti-pattern. We would go down one of two roads:

- Organizations pull down an open source (e.g., Zipkin) implementation, and it is very difficult for them to add their own carriers to it. If I have an `OrganizationTransportRPC`, there is no way that's getting pushed back into the open source implementation.
- An organization writes their own implementation, and the new library needs to be dependent on every RPC mechanism in the organization that may need to use the tracing library, with obvious side effects for code size and dependency management.
The rewrite instead makes heavy use of CRTP templates to define Reader/Writer interfaces. Functors can be passed into the API at compile time without affecting the base library, and implementations still have an interface to rely on for a more OO approach.
Allocations
The original implementation always used `new`/`delete` and exposed raw pointers to calling code. Without an interface to pass pointers back to the Tracer, this removes the ability to use custom allocators or object pools for managing spans/span contexts.
Added the `Tracer::cleanup()` method for returning resources to the implementation. Introduced a `StringRef` to avoid unnecessary creation of temporary `std::string`s at the interface level.
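A minimal sketch of the `StringRef` idea (illustrative, not necessarily the exact class in this PR): a non-owning (pointer, length) view so both string literals and `std::string`s can cross the interface without allocating.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

class StringRef {
  public:
    // Implicit constructors are intentional: both literals and
    // std::strings convert without creating a temporary std::string.
    StringRef(const char* data) : m_data(data), m_length(std::strlen(data)) {}
    StringRef(const std::string& str) : m_data(str.data()), m_length(str.size()) {}

    const char* data() const { return m_data; }
    std::size_t length() const { return m_length; }

  private:
    const char* m_data;   // not owned; caller guarantees lifetime
    std::size_t m_length;
};
```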
V-Table hits
In high-throughput scenarios, the v-table is not desirable. If we're planning on providing an API for performant C++ applications, we would ideally be able to avoid as much overhead as possible in the applications we're instrumenting.
We've worked around this with a compile-time pattern for polymorphism, sketched below.
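A minimal sketch of that pattern (CRTP), with illustrative names: the interface is bound at compile time, so the instrumented hot path pays no v-table indirection and calls can be inlined.

```cpp
// The interface template dispatches statically to the implementation.
template <typename Implementation>
class GenericSpan {
  public:
    void finish() {
        // Resolved at compile time; typically inlined, no virtual call.
        static_cast<Implementation*>(this)->finishImpl();
    }
};

class NoopSpan : public GenericSpan<NoopSpan> {
  public:
    void finishImpl() { /* no-op */ }
};
```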
Spec concerns
We were missing a few things in the spec:
Minor Changes
This is obviously a large change and warrants a major version bump. We've kept @lookfwd in the loop the whole time, as he's definitely our most active opentracing advocate. I believe he's on board but I'd want him to chime in here too.