Support long running multi tenant workflows #4133

Open
sganapathy1312 opened this issue Jan 29, 2022 · 10 comments
Labels
Feature Request Suggest an idea for this project

Comments

@sganapathy1312
sganapathy1312 commented Jan 29, 2022

Is your feature request related to a problem? Please describe.
We manage deployments as a series of stages, with each stage delegated to a service provider interface. Our processing layer is Vert.x and our backend is PostgreSQL. While exploring visibility options for in-flight deployment workflows, we evaluated OpenTelemetry and ran into a few open questions:

  • The main Vert.x job scans the database, picks up all pending deployments and initiates new service provider tasks. We want to create or reuse a span for each pending deployment. Similarly, we want to create or reuse a span for each service provider task in its Vert.x job.
  • We want to be able to publish a started span { not just a finished one } to our backends via a span processor.
  • We want to be resilient to process restarts and be able to create a span with a custom start time { when the in-flight service provider/deployment record was first created }.

Describe the solution you'd like

  • Provide the ability to switch contexts based on the record currently being processed.
  • Provide the ability to create spans with custom start times.
  • Provide the ability to publish started spans { and not just completed spans } to the collector.

Describe alternatives you've considered

Since each task within our deployment workflow is homogeneous, metrics with attributes would be a great fit. But for a general-purpose workflow management system, we'd be better off with a native tracing system that lets users monitor an in-flight trace in the backend.


@sganapathy1312 sganapathy1312 added the Feature Request Suggest an idea for this project label Jan 29, 2022
@jkwatson
Contributor

So, at least 2 of these things are definitely already possible.

> Provide the ability to create spans with custom start times.

This is already possible with the existing APIs.

> Provide ability to publish started spans { and not just completed spans } to the collector.

This can be done with a custom SpanProcessor and SpanExporter implementation. The collector side is definitely outside the scope of this project. If you need in-flight span support from the collector, please open up an issue in the specification for it.

> Provide ability to switch contexts based on the current record being processed.

I don't exactly understand what this means, but you can manage your Context instances however you like. Is there a specific feature you're looking for here from the context APIs?

@sganapathy1312
Author

> I don't exactly understand what this means, but you can manage your Context instances however you like. Is there a specific feature you're looking for here from the context APIs?

In a particular run of the job, it could process several in-flight records, checking whether a terminal state has been reached or whether new events can be recorded. I was hoping we could switch the current span based on the record currently being processed { sorry if using "context" there caused some confusion }. This would allow us to record events such as queuing delays, and extra attributes such as any errors observed in the service provider delegate.

@sganapathy1312
Author

Thanks for the clarification on the other two requests! I recently started exploring OpenTelemetry. It's an awesome project and is changing the way I think about observability.

@jkwatson
Contributor

> > I don't exactly understand what this means, but you can manage your Context instances however you like. Is there a specific feature you're looking for here from the context APIs?
>
> In a particular run of the job, it could process several in-flight records, checking whether a terminal state has been reached or whether new events can be recorded. I was hoping we could switch the current span based on the record currently being processed { sorry if using "context" there caused some confusion }. This would allow us to record events such as queuing delays, and extra attributes such as any errors observed in the service provider delegate.

I think this could be done pretty easily with some sort of Map of record -> span, but it's not something that the core APIs would probably support out of the box, unless it solves some sort of general problem (and went through the specification process to be added to the official APIs).
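
A minimal sketch of the "Map of record -> span" idea, using only the JDK. The class and method names (`RecordSpanRegistry`, `getOrStart`, `finish`) are hypothetical, and the span type `S` is left generic rather than tied to the OpenTelemetry classes. With the real API, the factory passed to the constructor could presumably build the span with `tracer.spanBuilder(...).setStartTimestamp(createdAt).startSpan()`, which would also cover the custom start time request. The map lives in memory only, so after a process restart the span would be started again (with a new span id) from the record's creation time.

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;
import java.util.function.Consumer;

/**
 * Keeps one span-like handle per in-flight record so that each job run can
 * look up the span belonging to the record it is currently processing.
 * S is whatever span type the application uses (hypothetical placeholder).
 */
final class RecordSpanRegistry<S> {
  private final Map<String, S> spansByRecordId = new ConcurrentHashMap<>();
  private final BiFunction<String, Instant, S> spanFactory;

  /** spanFactory receives the record id and the record's creation time. */
  RecordSpanRegistry(BiFunction<String, Instant, S> spanFactory) {
    this.spanFactory = spanFactory;
  }

  /** Returns the existing span for the record, or starts one at the record's creation time. */
  S getOrStart(String recordId, Instant recordCreatedAt) {
    return spansByRecordId.computeIfAbsent(
        recordId, id -> spanFactory.apply(id, recordCreatedAt));
  }

  /** Removes the span once the record is terminal and hands it to endSpan; false if unknown. */
  boolean finish(String recordId, Consumer<S> endSpan) {
    S span = spansByRecordId.remove(recordId);
    if (span == null) {
      return false;
    }
    endSpan.accept(span);
    return true;
  }

  /** Number of records that currently have an open span. */
  int inFlightCount() {
    return spansByRecordId.size();
  }
}
```

Each job run would call `getOrStart` for the record it is looking at, add events or attributes to the returned span, and call `finish` when the record reaches a terminal state.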

@anuraaga
Contributor

A common pattern is to attach a Context to a record if it goes through multiple stages. Maybe something along these lines:

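// Span, Context and Scope are io.opentelemetry.api.trace.Span,
// io.opentelemetry.context.Context and io.opentelemetry.context.Scope.
pipeline.register(stage1, event -> {
  // Start a span for this record ("stage1" is a placeholder span name)
  // and store the resulting Context on the event.
  Span span = tracer.spanBuilder("stage1").startSpan();
  Context context = Context.current().with(span);
  event.setAttribute(Context.class, context);
  try (Scope ignored = context.makeCurrent()) {
    logic();
  }
});

pipeline.register(stage2, event -> {
  // Restore the Context stored by stage1 and make it current again.
  Context context = event.getAttribute(Context.class);
  try (Scope ignored = context.makeCurrent()) {
    logic();
  }
  // End the span in whichever stage is last (assumed to be stage2 here).
  Span.fromContext(context).end();
});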

This is just a hypothetical event-processing API, but it shows a pattern that is common to most of them: being able to attach arbitrary objects to an event or record. You would set the Context at the beginning and read it back in each stage of processing. Maybe this provides some pointers for what you are trying to achieve.

@sganapathy1312
Author

For us that would mean serializing the context and storing it in the db along with the record state, then deserializing it and making it current. Thanks! That's a very nice idea.
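
One way to store a context next to the record state is the W3C `traceparent` string (`00-<trace id>-<span id>-<flags>`). The sketch below uses only the JDK to show what would end up in a text column and how it could be read back; `StoredTraceParent` is a hypothetical name. In practice the OpenTelemetry propagators (for example `W3CTraceContextPropagator`) can produce and consume this format, so this is an illustration of the stored value rather than a replacement for them. It carries only the trace identity, not other Context entries such as baggage.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Trace identity in W3C traceparent form, suitable for a text column (hypothetical helper). */
final class StoredTraceParent {
  // Version "00", 32 hex chars of trace id, 16 hex chars of span id, 2 hex chars of flags.
  private static final Pattern LAYOUT =
      Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

  final String traceId;
  final String spanId;
  final boolean sampled;

  StoredTraceParent(String traceId, String spanId, boolean sampled) {
    this.traceId = traceId;
    this.spanId = spanId;
    this.sampled = sampled;
  }

  /** Builds the string that is stored alongside the record state. */
  String serialize() {
    return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
  }

  /** Parses a stored value; throws if it does not match the version-00 layout. */
  static StoredTraceParent parse(String stored) {
    Matcher m = LAYOUT.matcher(stored);
    if (!m.matches()) {
      throw new IllegalArgumentException("not a traceparent value: " + stored);
    }
    int flags = Integer.parseInt(m.group(3), 16);
    // Bit 0 of the flags byte is the "sampled" flag.
    return new StoredTraceParent(m.group(1), m.group(2), (flags & 0x01) != 0);
  }
}
```

The parsed trace and span ids are what a later job run would need in order to continue the same trace for that record.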

@sganapathy1312
Author

sganapathy1312 commented Jan 30, 2022

In general, workflow frameworks like Step Functions and SWF, which provide this kind of cross-task view in their UIs, can be too heavyweight and expensive in some cases. For scenarios where our workflows are long-running but just a linked list of steps, there could be an argument for providing the following out of the box:

  • Publishing in flight spans
  • A way to dynamically switch spans

For the second request, the approach suggested by @anuraaga makes a lot of sense and is easy to adopt.

@anuraaga
Contributor

I think this is the issue in the spec related to exporting in-progress spans:

open-telemetry/opentelemetry-specification#373

The SDK itself is ready for this, as @jkwatson mentions: a SpanProcessor that exports on both onStart and onEnd would do the export. The backend would need to render the onStart data in a reasonable way and support mutation of that span, which would happen with onEnd. Not sure if this is common among backends.
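
To make the backend requirement concrete, here is a small JDK-only sketch of a store that accepts the same span twice, once at start and once at end, and treats the second export as an update instead of a duplicate. `InFlightSpanStore` and `Snapshot` are hypothetical names; this is not how any particular backend works, only an illustration of the "support mutation" part under the assumption that a span is identified by its trace id and span id.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical backend-side store that accepts a span at start and again at end. */
final class InFlightSpanStore {
  /** Minimal span snapshot; endEpochNanos is null while the span is still in flight. */
  static final class Snapshot {
    final String traceId;
    final String spanId;
    final String name;
    final long startEpochNanos;
    final Long endEpochNanos;

    Snapshot(String traceId, String spanId, String name,
             long startEpochNanos, Long endEpochNanos) {
      this.traceId = traceId;
      this.spanId = spanId;
      this.name = name;
      this.startEpochNanos = startEpochNanos;
      this.endEpochNanos = endEpochNanos;
    }

    boolean inFlight() {
      return endEpochNanos == null;
    }
  }

  private final Map<String, Snapshot> snapshotsById = new LinkedHashMap<>();

  /** Inserts the snapshot, or replaces an earlier one with the same trace and span id. */
  synchronized void upsert(Snapshot snapshot) {
    String key = snapshot.traceId + ":" + snapshot.spanId;
    Snapshot existing = snapshotsById.get(key);
    // A late "started" snapshot must not overwrite a span that has already ended.
    if (existing != null && !existing.inFlight() && snapshot.inFlight()) {
      return;
    }
    snapshotsById.put(key, snapshot);
  }

  /** Returns the latest snapshot for the span, or null if none was received. */
  synchronized Snapshot get(String traceId, String spanId) {
    return snapshotsById.get(traceId + ":" + spanId);
  }

  synchronized int size() {
    return snapshotsById.size();
  }
}
```

The onStart export would call `upsert` with no end time and the onEnd export would call it again with the end time set, leaving a single entry per span.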

@sganapathy1312
Author

Gotcha.

I just thought of another gap, even in the existing workflow systems. While Step Functions { the one I've worked with } provides great insights at a per-workflow level, it still treats the step response as unstructured data, so we're unable to gain insights across workflows. In multi-tenant systems where each request results in a workflow for provisioning, deployments and other long-running operations, there seems to be a gap if we want to measure things across workflows.

For our restricted use case { each step is homogeneous and does the exact same thing but on different inputs }, I'm thinking of leveraging OpenTelemetry metrics with attributes. Since our collector publishes to a backend that supports easy querying as well, I'm thinking of adopting this for now.

@adelel1
adelel1 commented Jul 1, 2022

> The SDK itself is ready for this, as @jkwatson mentions: a SpanProcessor that exports on both onStart and onEnd would do the export. The backend would need to render the onStart data in a reasonable way and support mutation of that span, which would happen with onEnd. Not sure if this is common among backends.

Does anyone know of any backends that support this? I know Jaeger doesn't. It displays both spans and adds a warning to the second one, stating it's a duplicate.
