BatchSpanProcessor that uses provided ScheduledExecutorService #3036

piotr-sumo · 2021-03-17T13:37:51Z

Oberon00 · 2021-03-17T15:36:16Z

Thank you for this PR! Please fill in the PR description, e.g. "Fixes #2980" if that is the case.

Oberon00 · 2021-03-17T15:39:49Z

.../src/main/java/io/opentelemetry/sdk/extension/trace/export/ExecutorServiceSpanProcessor.java

+    this.ownsExecutorService = ownsExecutorService;
+    this.executorService = executorService;
+    this.future =
+        executorService.scheduleAtFixedRate(


I think it would be better to use scheduleWithFixedDelay to avoid getting in a busy loop if export takes too long. That may mean dropped spans, but this is IMHO usually preferable to taking too much CPU time away.
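
To illustrate the difference, here is a minimal sketch (not code from this PR; `SchedulingGapDemo`, `measureGaps` and the sleep-based stand-in for an export are made up for the example). It measures the idle time between the end of one run and the start of the next when the task takes longer than the configured interval. With `scheduleAtFixedRate` the overdue runs should start almost back to back, while `scheduleWithFixedDelay` always leaves at least the configured delay between runs.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class SchedulingGapDemo {

  /**
   * Runs a deliberately slow task for roughly totalMillis and returns the idle gaps
   * (in nanoseconds) between the end of one run and the start of the next.
   */
  public static List<Long> measureGaps(
      boolean fixedDelay, long taskMillis, long intervalMillis, long totalMillis) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    List<Long> gaps = new CopyOnWriteArrayList<>();
    Long[] lastEnd = {null}; // only touched by the single worker thread

    Runnable slowExport =
        () -> {
          long start = System.nanoTime();
          if (lastEnd[0] != null) {
            gaps.add(start - lastEnd[0]);
          }
          sleepQuietly(taskMillis); // stands in for an export that overruns the interval
          lastEnd[0] = System.nanoTime();
        };

    ScheduledFuture<?> future =
        fixedDelay
            ? executor.scheduleWithFixedDelay(slowExport, 0, intervalMillis, TimeUnit.MILLISECONDS)
            : executor.scheduleAtFixedRate(slowExport, 0, intervalMillis, TimeUnit.MILLISECONDS);

    sleepQuietly(totalMillis);
    future.cancel(false); // let a run that is in progress finish
    executor.shutdown();
    try {
      executor.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return gaps;
  }

  private static void sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    // The task takes 50 ms but the interval is only 20 ms.
    System.out.println("fixed rate gaps (ns):  " + measureGaps(false, 50, 20, 400));
    System.out.println("fixed delay gaps (ns): " + measureGaps(true, 50, 20, 400));
  }
}
```

The exact numbers depend on the machine, so none are quoted here; only the lower bound for the fixed-delay case follows from the executor's contract.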

jkwatson · 2021-03-17T17:16:45Z

Since we're getting so many different variant PRs for updates to the BSP (I think we're up to 4!), it might make sense to put them all into an incubator module so people have options and can try them all out to see what works best.

sdk/trace/src/main/java/io/opentelemetry/sdk/trace/export/WorkerBase.java

piotr-sumo · 2021-03-18T09:26:31Z

@jkwatson I have a question about the incubator module. Is it something I should do in this PR?

jkwatson · 2021-03-18T13:35:32Z

> @jkwatson I have a question about the incubator module. Is it something I should do in this PR?

If it's a general question about it, a Discussion might be better so it's more visible.

jkwatson · 2021-03-19T16:39:35Z

sdk-extensions/tracing-incubator/build.gradle.kts

@@ -13,6 +13,7 @@ extra["moduleName"] = "io.opentelemetry.sdk.extension.trace.incubator"
 dependencies {
    api(project(":api:all"))
    api(project(":sdk:all"))
+    api(project(":sdk:metrics"))


I think we want to leave metrics as an implementation detail for now, like we have in the core sdk trace module.

jkwatson · 2021-03-19T16:50:42Z

I'm interested to see how the various benchmarks run against this variant. Would you mind also copying over the benchmarks for the BatchSpanProcessor? And could you run them against this variant and compare the results with running them against the BSP on main right now?

piotr-sumo · 2021-03-22T08:03:44Z

@jkwatson sure, I will run benchmarks and post results here.


piotr-sumo · 2021-03-22T08:36:15Z

@jkwatson I've tried to run jmh but unfortunately I've encountered melix/jmh-gradle-plugin#175

Execution failed for task ':sdk:all:jmh'.
> A failure occurred while executing me.champeau.gradle.IsolatedRunner
   > Error while instantiating tests: unable to set 'list' on Runner. This plugin version doesn't seem to be compatible with JMH 1.25. Please report to the plugin authors at https://github.com/melix/jmh-gradle-plugin/.

I have JDK 15 installed on my machine.
How do you run microbenchmarks from your machine?

piotr-sumo · 2021-03-22T14:12:15Z

@jkwatson It turns out that one needs JDK 11 to run benchmarks. I've run them and it turns out that the performance of this variant is a little bit worse than the original BatchSpanProcessor

piotr-sumo · 2021-03-23T08:28:11Z

@jkwatson I've merged main to trigger test re-run. Could you now take a look at benchmark results and my PR?

jkwatson · 2021-03-23T14:59:24Z

> @jkwatson I've merged main to trigger test re-run. Could you now take a look at benchmark results and my PR?

I'm on vacation this week. I'll look next week if @anuraaga can't get to it.

piotr-sumo · 2021-03-26T11:32:00Z

@anuraaga could you take a look at this PR?

anuraaga · 2021-03-27T06:41:00Z

...java/io/opentelemetry/sdk/extension/incubator/trace/ExecutorServiceSpanProcessorBuilder.java

+   */
+  public ExecutorServiceSpanProcessorBuilder setWorkerScheduleInterval(Duration interval) {
+    requireNonNull(interval, "interval");
+    return setWorkerScheduleInterval(interval.toMillis(), TimeUnit.MILLISECONDS);


May as well use nanos
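
A sketch of what that could look like, assuming the builder keeps the interval internally as nanoseconds (the class name `IntervalBuilderSketch`, the getter and the 5 second default are placeholders, not the PR's actual code). Converting through `toMillis()` silently truncates sub-millisecond parts, whereas `toNanos()` keeps them.

```java
import java.time.Duration;
import java.util.Objects;
import java.util.concurrent.TimeUnit;

public class IntervalBuilderSketch {
  // Placeholder default, chosen only for this example.
  private long workerScheduleIntervalNanos = TimeUnit.SECONDS.toNanos(5);

  public IntervalBuilderSketch setWorkerScheduleInterval(long interval, TimeUnit unit) {
    Objects.requireNonNull(unit, "unit");
    if (interval < 0) {
      throw new IllegalArgumentException("interval must be non-negative");
    }
    workerScheduleIntervalNanos = unit.toNanos(interval);
    return this;
  }

  public IntervalBuilderSketch setWorkerScheduleInterval(Duration interval) {
    Objects.requireNonNull(interval, "interval");
    // toNanos() keeps sub-millisecond precision that toMillis() would drop.
    return setWorkerScheduleInterval(interval.toNanos(), TimeUnit.NANOSECONDS);
  }

  public long getWorkerScheduleIntervalNanos() {
    return workerScheduleIntervalNanos;
  }

  public static void main(String[] args) {
    IntervalBuilderSketch builder =
        new IntervalBuilderSketch().setWorkerScheduleInterval(Duration.ofNanos(1_500_000));
    System.out.println(builder.getWorkerScheduleIntervalNanos());
  }
}
```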

anuraaga · 2021-03-27T06:47:44Z

...acing-incubator/src/main/java/io/opentelemetry/sdk/extension/incubator/trace/WorkerBase.java

+
+    try {
+      final CompletableResultCode result = spanExporter.export(new ArrayList<>(batch));
+      result.join(exporterTimeoutNanos, TimeUnit.NANOSECONDS);


Sorry that it took reading code to come up with this basic question - is there a use case for sharing an ExecutorService if we're doing blocking I/O? For example, if we try to share a single thread between the span processor and the interval metric reader, then while we're waiting for spans to be exported, this thread is asleep doing nothing while metrics can't be processed. Or if we use a two-thread executor, then we may as well have used separate single-thread executors instead of opening up a can of worms for blocking I/O to affect other processes. @Oberon00 what do you think?

We had an asynchronous BSP at one point and for such a processor, sharing an executor makes a lot of sense.

> while we're waiting for spans to be exported, this thread is asleep doing nothing while metrics can't be processed

The current implementation of the IntervalMetricReader does metric collection and export on the same thread, so this can really be a problem. But if it did metric export and collection on separate threads, it could share only the metric exporting thread, not the collection thread.

> We had an asynchronous BSP at one point

I think the join will need to be removed and the BSP needs to use the whenComplete mechanism for this implementation to become more robust (to maintain the timeout, maybe additionally use schedule to signal the whenComplete callback to cancel the whenComplete).

This will also mean that we can no longer use scheduleWithFixedDelay but need to schedule() the next export ourselves whenever the last whenComplete is done.

@anuraaga Does that sound reasonable?

> it could share only the metric exporting thread, not the collection thread.

I guess even with this, if we're not doing the I/O simultaneously it could cause issues with the export throughput. Pretty sure that's what you're thinking too just confirming :)

> This will also mean that we can no longer use scheduleWithFixedDelay but need to schedule() the next export ourselves whenever the last whenComplete is done.

Yeah this seems ok to me - unless I'm missing something I think it's a prerequisite for having exporters share threads.

> it could cause issues with the export throughput

I'm not sure how serious this is. Even if metrics and spans have to wait for each other, it depends on the intervals if this is a problem. In the worst case where the thread is always "busy" (at least waiting for I/O to complete) and metrics and spans cause about the same traffic, even when using two threads (one for metrics, one for spans) you get at most 2x improvement.

@anuraaga @Oberon00 thank you for the detailed explanation. I will rework my SpanProcessor.
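
For reference, the non-blocking shape discussed in this thread could be sketched roughly as follows. This is not the PR's implementation: `AsyncExportLoop`, `AsyncExporter` and `runDemo` are hypothetical names, spans are plain strings, and `CompletableFuture` stands in for `CompletableResultCode` so that the example is self-contained. The worker never calls `join`; a separately scheduled task enforces the timeout, and the next export is scheduled from the `whenComplete` callback instead of using `scheduleWithFixedDelay`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AsyncExportLoop {
  /** Hypothetical async exporter; CompletableFuture stands in for CompletableResultCode. */
  public interface AsyncExporter {
    CompletableFuture<Void> export(List<String> batch);
  }

  private final ScheduledExecutorService executor;
  private final AsyncExporter exporter;
  private final Queue<String> queue = new ConcurrentLinkedQueue<>();
  private final long delayMillis;
  private final long timeoutMillis;
  private final int maxBatchSize;
  private volatile boolean shutdown;

  public AsyncExportLoop(ScheduledExecutorService executor, AsyncExporter exporter,
      long delayMillis, long timeoutMillis, int maxBatchSize) {
    this.executor = executor;
    this.exporter = exporter;
    this.delayMillis = delayMillis;
    this.timeoutMillis = timeoutMillis;
    this.maxBatchSize = maxBatchSize;
  }

  public void add(String span) { queue.offer(span); }

  public void start() { trySchedule(this::exportOnce, delayMillis); }

  public void shutdown() { shutdown = true; }

  /** Returns null if we are shut down or the shared executor no longer accepts tasks. */
  private ScheduledFuture<?> trySchedule(Runnable task, long millis) {
    if (shutdown) {
      return null;
    }
    try {
      return executor.schedule(task, millis, TimeUnit.MILLISECONDS);
    } catch (RejectedExecutionException e) {
      return null;
    }
  }

  private void exportOnce() {
    List<String> batch = new ArrayList<>();
    String span;
    while (batch.size() < maxBatchSize && (span = queue.poll()) != null) {
      batch.add(span);
    }
    if (batch.isEmpty()) {
      trySchedule(this::exportOnce, delayMillis);
      return;
    }
    // No join(): the shared thread is free again as soon as export() returns.
    // (A real implementation would also guard against export() throwing.)
    CompletableFuture<Void> result = exporter.export(batch);
    ScheduledFuture<?> timeout = trySchedule(
        () -> { result.completeExceptionally(new TimeoutException("export timed out")); },
        timeoutMillis);
    result.whenComplete((ignored, error) -> {
      if (timeout != null) {
        timeout.cancel(false);
      }
      trySchedule(this::exportOnce, delayMillis); // schedule the next export ourselves
    });
  }

  /** Small demo: exports three spans in batches of two and returns them in export order. */
  public static List<String> runDemo() {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    List<String> exported = new CopyOnWriteArrayList<>();
    CountDownLatch done = new CountDownLatch(3);
    AsyncExporter exporter = batch -> CompletableFuture.runAsync(() -> {
      exported.addAll(batch); // pretend this is I/O finishing on another thread
      batch.forEach(s -> done.countDown());
    });
    AsyncExportLoop loop = new AsyncExportLoop(executor, exporter, 10, 1000, 2);
    loop.add("a");
    loop.add("b");
    loop.add("c");
    loop.start();
    try {
      done.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    loop.shutdown();
    executor.shutdownNow();
    return exported;
  }

  public static void main(String[] args) {
    System.out.println(runDemo());
  }
}
```

Because the next export is only scheduled once the previous one has completed or timed out, exports from one processor never overlap, yet the shared thread stays available to other users while the I/O is in flight.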


jkwatson · 2021-03-30T16:50:26Z

FYI on the benchmarks...the CPU-based benchmarks should be unchanged, and this is intentional, as they really aren't supposed to be useful without a profiler hooked up.

...r/src/jmh/java/io/opentelemetry/sdk/extension/incubator/trace/BatchSpanProcessorMetrics.java

anuraaga

This seems reasonable to me, thanks!

Oberon00 · 2021-03-31T07:59:49Z

...c/main/java/io/opentelemetry/sdk/extension/incubator/trace/ExecutorServiceSpanProcessor.java

+    public void run() {
+      // nextExportTime is set for the first time in the constructor
+
+      continueWork.set(true);


Does this need to be an atomic field? It seems that a local boolean variable would work just as well.

@Oberon00 I don't know what would happen if the next run of the worker Runnable is scheduled on a different thread than the previous one. If it is ok to have a local boolean variable, I will update the code.

But it looks like the boolean is never used outside this method, except in shutdown, and there it seems to change nothing as you already check isShutdown too. But I may have misread the logic.

isShutdown - whether the SpanProcessor should shut down. If true, then no more work should be done.
continueWork - whether the current run (loop) should continue. If false, the loop ends for now and will run again at some point in the future.

The question is: If continueWork.set(false); is done in WorkerBase.shutdown, is isShutdown also already true? If so, then that access to continueWork could be removed. And then we have the situation that at the beginning of run continueWork is always set before it is read and continueWork is never used outside the method. In that case it can be transformed without change of behavior to a local boolean.

@Oberon00 you are correct. I've converted continueWork to a local variable.
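
A simplified sketch of the resulting shape (names other than `isShutdown` and `continueWork` are invented, and adding to a list stands in for the actual export): the shutdown flag stays shared between threads, while `continueWork` is set before it is read on every invocation of `run()` and is never visible outside it, so a plain local works even if consecutive runs land on different executor threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class WorkerLoopSketch implements Runnable {
  private final Queue<String> queue = new ConcurrentLinkedQueue<>();
  // Stand-in for the exporter; only accessed from a single thread in this sketch.
  private final List<List<String>> exportedBatches = new ArrayList<>();
  // Shared state: written by shutdown(), read by whichever thread executes run().
  private final AtomicBoolean isShutdown = new AtomicBoolean(false);
  private final int maxBatchSize;

  public WorkerLoopSketch(int maxBatchSize) {
    this.maxBatchSize = maxBatchSize;
  }

  public void add(String span) { queue.offer(span); }

  public void shutdown() { isShutdown.set(true); }

  public List<List<String>> exportedBatches() { return exportedBatches; }

  @Override
  public void run() {
    // Per-invocation state: no other thread ever needs to see it.
    boolean continueWork = true;
    while (continueWork && !isShutdown.get()) {
      List<String> batch = new ArrayList<>();
      String span;
      while (batch.size() < maxBatchSize && (span = queue.poll()) != null) {
        batch.add(span);
      }
      if (batch.isEmpty()) {
        continueWork = false; // nothing left; the executor will call run() again later
      } else {
        exportedBatches.add(batch);
      }
    }
  }
}
```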

jkwatson · 2021-04-01T17:08:45Z

...acing-incubator/src/main/java/io/opentelemetry/sdk/extension/incubator/trace/WorkerBase.java

+import java.util.logging.Level;
+import java.util.logging.Logger;
+
+abstract class WorkerBase implements Runnable {


It's probably ok since this is just in the incubator, but this class is confusing to me. It's unclear how one would use it in general to implement a Batching SpanProcessor. The responsibilities between the abstract class and a concrete implementation are not particularly well defined, as far as I can tell.

I'd also prefer it if we can figure out a solution that doesn't involve inheritance. Can the functionality in this class be something that is either a) usable directly, without extending it or b) this class is concrete and takes an implementation of some strategy interface in order to do its work?

At the absolute minimum, this class would need detailed documentation on how to use it properly, as I suspect that there are many gotchas that have to be done just right for it to work correctly.

This class can reduce the amount of copied code from BatchSpanProcessor worker.
Let me think about other solutions you've proposed above.

@jkwatson I've removed WorkerBase class. I've introduced WorkerExporter which contains all methods useful for exporting a batch of spans.

I think that WorkerExporter might be more useful than WorkerBase.
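
For comparison, option (b) from the review comment above, a concrete class that takes a strategy, might look something like this. It is only a sketch under assumed names (`BatchExportStrategy`, `BatchWorker`) and does not reflect how `WorkerExporter` is actually structured in the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class StrategyWorkerSketch {
  /** Hypothetical strategy interface: the only part that varies between processors. */
  public interface BatchExportStrategy {
    void exportBatch(List<String> batch);
  }

  /** Concrete and final: callers compose it with a strategy instead of extending it. */
  public static final class BatchWorker implements Runnable {
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private final int maxBatchSize;
    private final BatchExportStrategy strategy;

    public BatchWorker(int maxBatchSize, BatchExportStrategy strategy) {
      this.maxBatchSize = maxBatchSize;
      this.strategy = strategy;
    }

    public void enqueue(String span) { queue.offer(span); }

    @Override
    public void run() {
      List<String> batch = new ArrayList<>();
      String span;
      while ((span = queue.poll()) != null) {
        batch.add(span);
        if (batch.size() == maxBatchSize) {
          strategy.exportBatch(batch);
          batch = new ArrayList<>();
        }
      }
      if (!batch.isEmpty()) {
        strategy.exportBatch(batch); // final partial batch
      }
    }
  }

  /** Demo: five spans with a batch size of two yield batches of 2, 2 and 1. */
  public static List<List<String>> runDemo() {
    List<List<String>> batches = new ArrayList<>();
    BatchWorker worker = new BatchWorker(2, batches::add);
    for (String span : new String[] {"a", "b", "c", "d", "e"}) {
      worker.enqueue(span);
    }
    worker.run();
    return batches;
  }

  public static void main(String[] args) {
    System.out.println(runDemo());
  }
}
```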

jkwatson · 2021-04-12T15:42:22Z

...c/main/java/io/opentelemetry/sdk/extension/incubator/trace/ExecutorServiceSpanProcessor.java

+          .setUpdater(
+              result ->
+                  result.observe(
+                      queue.size(), Labels.of(spanProcessorTypeLabel, spanProcessorTypeValue)))


might as well make these labels statically allocated, so we don't have to recreate them every collection cycle.

jkwatson · 2021-04-12T15:43:12Z

...c/main/java/io/opentelemetry/sdk/extension/incubator/trace/ExecutorServiceSpanProcessor.java

+        int maxExportBatchSize,
+        long exporterTimeoutNanos,
+        BlockingQueue<ReadableSpan> queue,
+        String spanProcessorTypeLabel,


Is there a good reason to have these part of the constructor? Aren't they always the same? I'd rather just have them be constants, probably in the enclosing class.

jkwatson · 2021-04-12T15:51:09Z

...c/main/java/io/opentelemetry/sdk/extension/incubator/trace/ExecutorServiceSpanProcessor.java

+          if (lastElement != null) {
+            batch.add(lastElement.toSpanData());
+            // drain queue
+            queue.take();


Rather than peek() then later a take(), why not just poll() initially and skip the extra step?

I didn't want the thread to be blocked when the queue is empty. With jctools I can poll() and skip the extra step.
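
Worth noting for this thread: on a `java.util.concurrent.BlockingQueue`, only `take()` and the timed `poll(timeout, unit)` block; the no-argument `poll()` returns `null` immediately when the queue is empty, and `drainTo(collection, maxElements)` never blocks either. A small sketch (class and method names are made up, strings stand in for spans):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class NonBlockingDrain {

  /** Collects up to maxBatchSize elements using the non-blocking poll(). */
  public static List<String> pollBatch(BlockingQueue<String> queue, int maxBatchSize) {
    List<String> batch = new ArrayList<>(maxBatchSize);
    String span;
    // poll() without a timeout returns null right away on an empty queue.
    while (batch.size() < maxBatchSize && (span = queue.poll()) != null) {
      batch.add(span);
    }
    return batch;
  }

  /** Same result in a single call; drainTo does not block either. */
  public static List<String> drainBatch(BlockingQueue<String> queue, int maxBatchSize) {
    List<String> batch = new ArrayList<>(maxBatchSize);
    queue.drainTo(batch, maxBatchSize);
    return batch;
  }

  public static void main(String[] args) {
    BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
    queue.add("a");
    queue.add("b");
    queue.add("c");
    System.out.println(pollBatch(queue, 2));
    System.out.println(drainBatch(queue, 2));
    System.out.println(pollBatch(queue, 2)); // queue is empty by now, returns without blocking
  }
}
```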

jkwatson · 2021-04-12T15:55:24Z

...c/main/java/io/opentelemetry/sdk/extension/incubator/trace/ExecutorServiceSpanProcessor.java

+          ReadableSpan lastElement = queue.peek();
+          if (lastElement != null) {
+            batch.add(lastElement.toSpanData());
+            // drain queue


this comment doesn't seem accurate. You're not draining the queue, you're just removing the element you just peek()ed at, correct?
Also, in the CPU-optimization work that was done for the main BSP, we definitely found a significant overhead in working with an ArrayBlockingQueue this way. Any thoughts on using the same signalling technique and the jctools queue here?

I'll try to plug in jctools queue.

all tests pass 👍

jkwatson · 2021-04-12T15:58:28Z

...g-incubator/src/main/java/io/opentelemetry/sdk/extension/incubator/trace/WorkerExporter.java

+  private final AtomicReference<CompletableResultCode> flushRequested;
+  private final int maxExportBatchSize;
+
+  WorkerExporter(


Seems weird to have a public class with all public methods with a non-public constructor. Does this class need to be public right now?

nope, it can be in package scope

jkwatson · 2021-04-12T16:20:45Z

...g-incubator/src/main/java/io/opentelemetry/sdk/extension/incubator/trace/WorkerExporter.java

+      ScheduledExecutorService executorService,
+      Logger logger,
+      long exporterTimeoutNanos,
+      BoundLongCounter exportedSpans,


rename to "exportedSpanCounter"

jkwatson · 2021-04-12T16:25:32Z

...g-incubator/src/main/java/io/opentelemetry/sdk/extension/incubator/trace/WorkerExporter.java

+  private final Logger logger;
+  private final long exporterTimeoutNanos;
+  private final BoundLongCounter exportedSpans;
+  private final AtomicReference<CompletableResultCode> flushRequested;


The contract on what this is for and how it's used needs to be very carefully documented if we're going to have this class be public. I might start by renaming it to something like "flushSignal" or something that makes it clear that it's a 2-way communication pathway between the span processor and this worker.
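
One way to picture such a two-way contract, as a rough sketch only: the processor publishes a flush request into the shared reference, the worker picks it up, flushes, clears the slot and completes the request. All names here are illustrative, `CompletableFuture` stands in for `CompletableResultCode`, and the worker is driven by hand through `runOnce()` rather than by an executor.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;

public class FlushSignalSketch {
  // null means "no flush pending"; a non-null value is a request awaiting the worker.
  private final AtomicReference<CompletableFuture<Void>> flushSignal = new AtomicReference<>();
  private final Queue<String> queue = new ConcurrentLinkedQueue<>();
  private final List<String> exported = new CopyOnWriteArrayList<>();

  public void add(String span) { queue.offer(span); }

  public List<String> exported() { return exported; }

  /** Processor side: publish a flush request, or join the one that is already pending. */
  public CompletableFuture<Void> forceFlush() {
    flushSignal.compareAndSet(null, new CompletableFuture<>());
    CompletableFuture<Void> pending = flushSignal.get();
    // The worker may have completed and cleared the request in between;
    // in that case the flush has already happened.
    return pending != null ? pending : CompletableFuture.<Void>completedFuture(null);
  }

  /** Worker side: called on every pass of the worker loop. */
  public void runOnce() {
    CompletableFuture<Void> request = flushSignal.get();
    if (request == null) {
      return; // no flush requested
    }
    String span;
    while ((span = queue.poll()) != null) {
      exported.add(span); // stands in for exporting the whole queue
    }
    flushSignal.compareAndSet(request, null); // clear the slot before signalling back
    request.complete(null);
  }

  public static void main(String[] args) {
    FlushSignalSketch sketch = new FlushSignalSketch();
    sketch.add("a");
    sketch.add("b");
    CompletableFuture<Void> flush = sketch.forceFlush();
    sketch.runOnce();
    System.out.println(flush.isDone() + " " + sketch.exported());
  }
}
```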

piotr-sumo · 2021-04-14T08:03:27Z

@jkwatson thank you for the code review. I will address your comments as soon as I have time at work.

piotr-sumo · 2021-04-20T10:36:56Z

@jkwatson I've addressed all comments. Could you re-review?

jkwatson

Let's give it a try in the incubator!

BatchSpanProcessor that uses provided ScheduledExecutorService

f8eb30a

piotr-sumo requested review from anuraaga, arminru, bogdandrutu, carlosalberto, jkwatson, Oberon00, pavolloffay, thisthat and tylerbenson as code owners March 17, 2021 13:37

piotr-sumo added 3 commits March 17, 2021 15:00

Fixing spotless violations

6b6060a

Checkstyle fix

f974774

Spotless fix

50143a3

Oberon00 reviewed Mar 17, 2021

jkwatson reviewed Mar 17, 2021

sdk/trace/src/main/java/io/opentelemetry/sdk/trace/export/WorkerBase.java (outdated, resolved)

piotr-sumo added 3 commits March 18, 2021 09:03

Fixing spotless violations

53d49f0

Fixing checkstyle violations

2ffa3ba

Fixing spotless violations

61eeec1

piotr-sumo added 2 commits March 19, 2021 11:43

Moving the code to the incubator module

036c03f

Delete obsolete gradle module

16a1e35

jkwatson reviewed Mar 19, 2021

Merge branch 'main' of https://github.com/open-telemetry/opentelemetry-java into executor-service-batch-span-processor

02a951e

anuraaga reviewed Mar 27, 2021

piotr-sumo added 4 commits March 30, 2021 14:56

Better thread utilisation

2a9fa7e

Merge branch 'main' of github.com:piotr-sumo/opentelemetry-java into executor-service-batch-span-processor

3bdd671

Merge branch 'main' of https://github.com/open-telemetry/opentelemetry-java into executor-service-batch-span-processor

34d32c6

After merge main

92c63fa

jkwatson reviewed Mar 30, 2021

...r/src/jmh/java/io/opentelemetry/sdk/extension/incubator/trace/BatchSpanProcessorMetrics.java (resolved)

anuraaga approved these changes Mar 31, 2021

Oberon00 reviewed Mar 31, 2021

Convert AtomicBoolean into local variable

3850782

jkwatson reviewed Apr 1, 2021

Remove WorkerBase and introduce WorkerExporter

40a217f

jkwatson reviewed Apr 12, 2021

More code review fixes

39a4aff

jkwatson approved these changes Apr 20, 2021

jkwatson merged commit 68b047e into open-telemetry:main Apr 20, 2021

jack-berg mentioned this pull request Nov 12, 2021

Remove deprecated ExecutorServiceSpanProcessor #3864

Merged

This was referenced Dec 19, 2021

Temurin JDK #4011

Merged

use Eclipse Temurin JDK docker image #4012

Merged
