Support long running multi tenant workflows #4133

sganapathy1312 · 2022-01-29T12:45:18Z

Is your feature request related to a problem? Please describe.
We manage deployments as a series of stages, with each stage delegated to a service provider interface. Our processing layer is Vert.x and our backend is PostgreSQL. We are exploring visibility options for an in-flight deployment workflow. We evaluated OpenTelemetry and ran into a few open questions:

The main Vert.x job scans the DB, picks up all pending deployments, and initiates new service provider tasks. We want to create or reuse a span for each pending deployment. Similarly, we want to create or reuse a span for each service provider task in its Vert.x job.
We want to be able to publish a started span (not just a finished one) via a span processor to our backends.
We want to be resilient to process restarts and be able to create a span with a custom start time (when the in-flight service provider or deployment record was first created).

Describe the solution you'd like

Provide the ability to switch contexts based on the record currently being processed.
Provide the ability to create spans with custom start times.
Provide the ability to publish started spans (not just completed spans) to the collector.

Describe alternatives you've considered

Since each task within our deployment workflow is homogeneous, metrics with attributes would be a great fit. But for a general-purpose workflow management system, we'd be better off with a native tracing system that lets users monitor an in-flight trace in the backend.

jkwatson · 2022-01-29T18:07:20Z

So, at least 2 of these things are definitely already possible.

> Provide the ability to create spans with custom start times.

This is already possible with the existing APIs.
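As a rough illustration of the custom start time, the sketch below derives the values you would hand to the tracing API from a record's creation time. In the OpenTelemetry Java API the relevant call is `SpanBuilder.setStartTimestamp`, which accepts either an `Instant` or a `long` plus a `TimeUnit`; it appears only in a comment here so the snippet compiles without the library. The class `RecordTimestamps`, its methods, and the span name are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: turns a record's creation time into span timing data. */
final class RecordTimestamps {
  private RecordTimestamps() {}

  /** Epoch nanoseconds for the moment the record was first created. */
  static long toEpochNanos(Instant createdAt) {
    return TimeUnit.SECONDS.toNanos(createdAt.getEpochSecond()) + createdAt.getNano();
  }

  /** Time the record waited between creation and being picked up by the job. */
  static Duration queuingDelay(Instant createdAt, Instant pickedUpAt) {
    return Duration.between(createdAt, pickedUpAt);
  }

  // With the OpenTelemetry API on the classpath, the span would be backdated with:
  //   Span span = tracer.spanBuilder("deployment")   // span name is an assumption
  //       .setStartTimestamp(createdAt)              // an Instant read from the record
  //       .startSpan();
}
```

The queuing delay could then be recorded as an attribute or event on the backdated span.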

> Provide the ability to publish started spans (not just completed spans) to the collector.

This can be done with a custom SpanProcessor and SpanExporter implementation. The collector side is definitely outside the scope of this project. If you need in-flight span support from the collector, please open up an issue in the specification for it.

> Provide the ability to switch contexts based on the record currently being processed.

I don't exactly understand what this means, but you can manage your Context instances however you like. Is there a specific feature you're looking for here from the context APIs?

sganapathy1312 · 2022-01-30T01:37:17Z

> I don't exactly understand what this means, but you can manage your Context instances however you like. Is there a specific feature you're looking for here from the context APIs?

In a particular run, the job could process several in-flight records, checking whether a terminal state has been reached or whether new events can be recorded. I was hoping we could switch the current span based on the record currently being processed (sorry if using "context" there caused some confusion). This would allow us to record events such as queuing delays, and extra attributes such as any errors observed in the service provider delegate.

sganapathy1312 · 2022-01-30T01:39:37Z

Thanks for the clarification on the other two requests! I recently started exploring OpenTelemetry. It's an awesome project and is changing the way I think about observability.

jkwatson · 2022-01-30T04:16:09Z

> > I don't exactly understand what this means, but you can manage your Context instances however you like. Is there a specific feature you're looking for here from the context APIs?
>
> In a particular run, the job could process several in-flight records, checking whether a terminal state has been reached or whether new events can be recorded. I was hoping we could switch the current span based on the record currently being processed (sorry if using "context" there caused some confusion). This would allow us to record events such as queuing delays, and extra attributes such as any errors observed in the service provider delegate.

I think this could be done pretty easily with some sort of Map of record -> span, but it's not something that the core APIs would probably support out of the box, unless it solves some sort of general problem (and went through the specification process to be added to the official APIs).
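A minimal sketch of such a map, assuming records are identified by a string ID. The type parameter `S` stands in for `io.opentelemetry.api.trace.Span` so the snippet has no dependencies; `RecordSpanRegistry` and its method names are hypothetical. Note that an in-memory map like this is lost when the process restarts.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical registry holding one in-flight span per record ID. */
final class RecordSpanRegistry<S> {
  private final Map<String, S> spansByRecordId = new ConcurrentHashMap<>();
  private final Function<String, S> spanFactory;

  /** spanFactory starts a new span for a record ID, for example via a Tracer. */
  RecordSpanRegistry(Function<String, S> spanFactory) {
    this.spanFactory = spanFactory;
  }

  /** Returns the span already tracked for the record, or starts and tracks a new one. */
  S spanFor(String recordId) {
    return spansByRecordId.computeIfAbsent(recordId, spanFactory);
  }

  /** Stops tracking a record that reached a terminal state; the caller ends the span. */
  S remove(String recordId) {
    return spansByRecordId.remove(recordId);
  }

  /** Number of records that currently have a tracked span. */
  int inFlightCount() {
    return spansByRecordId.size();
  }
}
```

Each job run would call `spanFor` with the record's ID before recording events, then call `remove` and end the returned span once the record reaches a terminal state.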

anuraaga · 2022-01-30T04:35:11Z

A common pattern is to attach a Context to a record if it goes through multiple stages. Maybe something along these lines:

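import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;

// pipeline, stage1, stage2, and logic() belong to a hypothetical event processing API.
pipeline.register(stage1, event -> {
  // Start the span once and store the Context that carries it on the event.
  Span span = tracer.spanBuilder("process-event").startSpan(); // span name is a placeholder
  Context context = Context.current().with(span);
  event.setAttribute(Context.class, context);
  try (Scope ignored = context.makeCurrent()) {
    logic();
  }
});

pipeline.register(stage2, event -> {
  // Restore the Context stored by stage1 so the same span is current again.
  Context context = event.getAttribute(Context.class);
  try (Scope ignored = context.makeCurrent()) {
    logic();
  } finally {
    // End the span once the last stage is done with it.
    Span.fromContext(context).end();
  }
});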

This is just a hypothetical event processing API, but it shows a pattern common to most of them: being able to attach arbitrary objects to an event or record. You set the Context at the beginning and read it in each stage of processing. Maybe this provides some pointers for what you are trying to achieve.

sganapathy1312 · 2022-01-30T04:43:01Z

For us that would mean serializing the context and storing it in the DB along with the record state, then deserializing it and making it current. Thanks! That's a very nice idea.
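One way to picture the serialization step is to store the span's identity as a W3C `traceparent` string in a column next to the record state. The sketch below hand-rolls that format only to stay dependency-free; in the OpenTelemetry Java API, a `TextMapPropagator` such as `W3CTraceContextPropagator` performs the equivalent inject and extract against a carrier. `StoredTraceContext` and its members are hypothetical. As far as I understand, only identifiers survive this round trip, so a restored context identifies the span (for parenting or linking) rather than resuming the original span object.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value stored in a DB column next to the record state. */
final class StoredTraceContext {
  // W3C traceparent, version 00: 32 hex trace ID, 16 hex span ID, 2 hex flags.
  private static final Pattern TRACEPARENT =
      Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

  final String traceId;
  final String spanId;
  final boolean sampled;

  StoredTraceContext(String traceId, String spanId, boolean sampled) {
    this.traceId = traceId;
    this.spanId = spanId;
    this.sampled = sampled;
  }

  /** Serializes to the string that would be written to the DB. */
  String toTraceparent() {
    return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
  }

  /** Parses a stored value; empty if it is missing or malformed. */
  static Optional<StoredTraceContext> parse(String stored) {
    if (stored == null) {
      return Optional.empty();
    }
    Matcher m = TRACEPARENT.matcher(stored);
    // All-zero trace or span IDs are invalid per the W3C Trace Context spec.
    if (!m.matches() || m.group(1).matches("0+") || m.group(2).matches("0+")) {
      return Optional.empty();
    }
    boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
    return Optional.of(new StoredTraceContext(m.group(1), m.group(2), sampled));
  }
}
```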

sganapathy1312 · 2022-01-30T04:46:42Z

In general, workflow frameworks like Step Functions and SWF, which provide this kind of cross-task view in their UIs, can be too heavyweight and expensive in some cases. For scenarios where our workflows are long-running but just a linked list of steps, there could be an argument for providing the following out of the box:

Publishing in flight spans
A way to dynamically switch spans

For the second request, the approach suggested by @anuraaga makes a lot of sense and is easy to adopt.

anuraaga · 2022-01-30T04:55:53Z

I think this is the issue in the spec related to exporting in-progress spans:

open-telemetry/opentelemetry-specification#373

The SDK itself is ready for this, as @jkwatson mentions: a SpanProcessor that exports on both onStart and onEnd would do the export. The backend would need to render the onStart data in a reasonable way and support mutating that span when the onEnd data arrives. I'm not sure if this is common among backends.
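To make the backend requirement concrete, here is a minimal sketch of a store that treats the onStart and onEnd exports as two versions of the same span and keeps only the latest. `SpanSnapshot` and `LatestSnapshotStore` are hypothetical stand-ins, not OpenTelemetry or backend APIs; a real backend would presumably key on trace ID plus span ID and carry the full span data.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical snapshot of a span, exported once at start and once at end. */
final class SpanSnapshot {
  final String spanId;
  final String name;
  final long startEpochNanos;
  final long endEpochNanos; // 0 while the span is still in flight

  SpanSnapshot(String spanId, String name, long startEpochNanos, long endEpochNanos) {
    this.spanId = spanId;
    this.name = name;
    this.startEpochNanos = startEpochNanos;
    this.endEpochNanos = endEpochNanos;
  }

  boolean isEnded() {
    return endEpochNanos > 0;
  }
}

/** Hypothetical backend store keeping only the latest snapshot per span ID. */
final class LatestSnapshotStore {
  private final Map<String, SpanSnapshot> bySpanId = new LinkedHashMap<>();

  void accept(SpanSnapshot incoming) {
    SpanSnapshot existing = bySpanId.get(incoming.spanId);
    // A late or retried "started" snapshot must not overwrite the finished span.
    if (existing != null && existing.isEnded() && !incoming.isEnded()) {
      return;
    }
    bySpanId.put(incoming.spanId, incoming);
  }

  Collection<SpanSnapshot> spans() {
    return bySpanId.values();
  }
}
```

A backend without this kind of upsert would show the start and end exports as two separate spans with the same ID.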

sganapathy1312 · 2022-01-30T05:11:24Z

Gotcha.

I just thought of another gap, even in the existing workflow systems. While Step Functions (the one I've worked with) provides great insight at a per-workflow level, it still treats the step response as unstructured data, so we're unable to gain insight across workflows. Thus, in multi-tenant systems where each request results in a workflow for provisioning, deployments, and other long-running operations, it's not easy to measure things across workflows.

For our restricted use case (each step is homogeneous and does the exact same thing, but on different inputs), I'm thinking of leveraging OpenTelemetry metrics with attributes. Since our collector publishes to a backend that supports easy querying as well, I'm thinking of adopting this for now.

adelel1 · 2022-07-01T09:25:02Z

> The SDK itself is ready for this as @jkwatson mentions, a SpanProcessor that exports on both onStart and onEnd would do the export. The backend would need to render the onStart data in a reasonable way and support mutation of that span which would happen with onEnd. Not sure if this is common among backends

Does anyone know of any backends that support this? I know Jaeger doesn't. It displays both spans and adds a warning to the second one, stating that it's a duplicate.

sganapathy1312 added the Feature Request label Jan 29, 2022

adelel1 mentioned this issue Jun 30, 2022

Support receiving duplicate spans keeping only the latest jaegertracing/jaeger#3790

Closed
