
Option to Skip Error Trace in Prometheus Exporter #36887

Open
NishantSarraff opened this issue Dec 18, 2024 · 6 comments
Assignees: dashpole
Labels: question (Further information is requested)

Comments

@NishantSarraff

Component(s)

exporter/prometheus

Is your feature request related to a problem? Please describe.

Yes, the feature request is related to a problem I am experiencing with the OpenTelemetry Collector.

In my use case, I am pushing customer responses into metrics using the Go OTLP gRPC client:

"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"

otlpmetricgrpc.New(ctx,
	otlpmetricgrpc.WithGRPCConn(conn),
)
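For context, here is a minimal sketch of how such an exporter is typically wired into the SDK. This is illustrative only: the endpoint parameter, the insecure credentials, the 5-second interval, and the use of a PeriodicReader are assumptions, not details taken from the issue.

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// newMeterProvider builds a MeterProvider that exports over OTLP/gRPC.
// Endpoint, credentials, and interval are placeholders for illustration.
func newMeterProvider(ctx context.Context, endpoint string) (*sdkmetric.MeterProvider, error) {
	// grpc.NewClient requires a recent grpc-go; grpc.Dial works similarly on older versions.
	conn, err := grpc.NewClient(endpoint,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}

	exporter, err := otlpmetricgrpc.New(ctx, otlpmetricgrpc.WithGRPCConn(conn))
	if err != nil {
		return nil, err
	}

	// The PeriodicReader re-exports the accumulated (cumulative) state on
	// every interval, so a data point that fails to marshal can produce the
	// same error on each export, independent of the retry settings.
	reader := sdkmetric.NewPeriodicReader(exporter,
		sdkmetric.WithInterval(5*time.Second))

	return sdkmetric.NewMeterProvider(sdkmetric.WithReader(reader)), nil
}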

and my observability architecture is based on Prometheus scraping. I am using the following configuration for my OpenTelemetry Collector:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "${OTLP_GRPC_ENDPOINT}"  # Default OTLP gRPC port
      http:
        endpoint: "${OTLP_HTTP_ENDPOINT}"

processors:
  batch:
    send_batch_size: 100  # Optional: adjust batch settings as needed
    timeout: 1s

exporters:
  prometheus:
    endpoint: "${PROMETHEUS_ENDPOINT}"  # Endpoint where Prometheus will scrape metrics
    namespace: "${PROMETHEUS_NAMESPACE}"  # Set your desired namespace/prefix here
    const_labels:
      otlp: "${OTLP_LABEL}"
    send_timestamps: ${SEND_TIMESTAMPS}
    metric_expiration: ${METRIC_EXPIRATION}
    enable_open_metrics: true
    add_metric_suffixes: false
    resource_to_telemetry_conversion:
      enabled: false
  debug:
    verbosity: detailed
    sampling_thereafter: 200

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, prometheus]
  telemetry:
    metrics:
      level: ${TELEMETRY_METRICS_LEVEL}
    logs:
      level: "${LOGS_LEVEL}"
      encoding: "${LOGS_ENCODING}"  # Optional: "console" or "json"

When I push a label with an invalid UTF-8 value, for example the label "label4" with the value "invalidutf8\x80", I receive the following error continuously:

20:36:01 failed to upload metrics: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8
20:36:06 failed to upload metrics: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8
20:36:11 failed to upload metrics: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8
20:36:16 failed to upload metrics: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8
....

Image: otel/opentelemetry-collector-contrib:0.109.0

Even though I have pushed the invalid label only once, I continue to receive this error for hours. Additionally, after encountering this error, no new metric operations are registered on the OpenTelemetry side. This issue prevents the system from recovering and processing subsequent valid metrics.

Describe the solution you'd like

If possible, I would like to drop the error trace. I tried setting the RetryConfig:

exporter, err = otlpmetricgrpc.New(ctx,
	otlpmetricgrpc.WithGRPCConn(conn),
	otlpmetricgrpc.WithRetry(otlpmetricgrpc.RetryConfig{
		Enabled:         false,
		InitialInterval: 1 * time.Second,
		MaxInterval:     1 * time.Second,
		MaxElapsedTime:  1 * time.Second,
	}),
)

but I am still getting the error every 5 seconds.
I also tried setting

retry_on_failure:
  enabled: true

under exporters.prometheus, but the otel sidecar rejects it as an invalid configuration, presumably because the pull-based prometheus exporter does not support that setting.

Describe alternatives you've considered

An alternative could be replacing the invalid UTF-8 characters with some other character, but I believe this solution is already under consideration. One way to do this on the client side is sketched below.
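For illustration, client-side sanitization could look like the following. sanitizeAttr is a hypothetical helper, not an existing collector or SDK feature; it simply replaces invalid UTF-8 bytes before the value is attached to a data point.

import (
	"strings"

	"go.opentelemetry.io/otel/attribute"
)

// sanitizeAttr is a hypothetical helper: it replaces any invalid UTF-8
// bytes in a label value with the Unicode replacement character (U+FFFD)
// so the OTLP payload always marshals cleanly.
func sanitizeAttr(key, value string) attribute.KeyValue {
	return attribute.String(key, strings.ToValidUTF8(value, "\uFFFD"))
}

// Usage (assuming an existing counter):
//   counter.Add(ctx, 1, metric.WithAttributes(sanitizeAttr("label4", userInput)))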

Additional context

I am using otel/opentelemetry-collector-contrib:0.109.0, and I also tried the latest versions but got stuck in the same situation.

@NishantSarraff NishantSarraff added enhancement New feature or request needs triage New item requiring triage labels Dec 18, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dashpole
Contributor

The error is coming from your client, right? It looks like the otlpmetricgrpc exporter is unable to push metrics over gRPC because the payload contains invalid UTF-8. I don't think this is an issue with the Prometheus exporter.
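For reference, a hedged sketch of where this message would surface on the application side: assuming the periodic export path reports failures through the SDK's global error handler, the repeating "failed to upload metrics" line can be intercepted or rerouted there. The handler body below is illustrative only.

import (
	"log"

	"go.opentelemetry.io/otel"
)

func init() {
	// Route SDK/exporter errors through a custom handler instead of the
	// default stderr logger; export failures from the periodic reader
	// typically arrive here.
	otel.SetErrorHandler(otel.ErrorHandlerFunc(func(err error) {
		log.Printf("otel export error: %v", err)
	}))
}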

@dashpole dashpole self-assigned this Dec 18, 2024
@dashpole dashpole added question Further information is requested and removed needs triage New item requiring triage enhancement New feature or request exporter/prometheus labels Dec 18, 2024
@NishantSarraff
Author

I may be wrong, but if it were a client issue, then this should have worked:

otlpmetricgrpc.WithRetry(otlpmetricgrpc.RetryConfig{
	Enabled:         false,
	InitialInterval: 1 * time.Second,
	MaxInterval:     1 * time.Second,
	MaxElapsedTime:  1 * time.Second,
}),

@dashpole
Contributor

Marshaling should be happening in the client. Do you see the log in the application logs, or in the collector logs?

@NishantSarraff
Author

NishantSarraff commented Dec 18, 2024

I did get these logs on the application side, but on the collector side the debug logs stop coming as well. What I observed is that every 5 seconds the collector publishes this log:

Resource SchemaURL: https://opentelemetry.io/schemas/1.26.0
Resource attributes:
     -> service.name: Str(NAMESPACE_)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope NAMESPACE_ 
Metric #0
Descriptor:
     -> Name: NAMESPACE_METRIC_NAME
     -> Description: 
     -> Unit: 
     -> DataType: Sum
     -> IsMonotonic: true
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
     -> label1: Str(value1)
StartTimestamp: 
Timestamp: 
Value: 1
        {"kind": "exporter", "data_type": "metrics", "name": "debug"}
TIMESTAMP     debug   [email protected]/accumulator.go:79   accumulating metric: NAMESPACE_METRIC_NAME      {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
TIMESTAMP    info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
 TIMESTAMP       info    ResourceMetrics #0
 
 ..... AGAIN 
 ..... AGAIN

but once I pushed the invalid UTF-8 string as the value for one of the labels, I stopped getting these logs.

@ArthurSens
Member

The log you're showing is what the debug exporter logs when receiving metrics. Since it stopped logging, it shows that the collector is no longer receiving metrics.

You can also take a look at the metrics that the OTLP receiver exposes. I can't remember the name exactly, I'm sorry 😅, but if they stopped increasing, that tells us your app is failing to send metrics, not that the collector is failing to receive them.
