Conversation

dricross (Contributor) commented Oct 24, 2025

Description of the issue

Integrate a new algorithm for converting OTel classic histograms to CloudWatch Values/Counts. This significantly improves percentile metrics for classic histograms.

Description of changes

Integrate with the new algorithm in the contrib repo: amazon-contributing/opentelemetry-collector-contrib#376

Associated integ test update to fix histogram definitions: aws/amazon-cloudwatch-agent-test#617

New Algorithm

This algorithm converts each bucket of the input histogram into (at most) 10 value/count pairs, aka "inner buckets". The values of the inner buckets are spread evenly across the bucket's span. The counts of the inner buckets are determined by an exponential mapping: counts are weighted more heavily toward one side of the bucket, according to an exponential function, based on how the density of the neighboring buckets is changing.
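The authoritative implementation lives in the contrib PR above; the following is only a rough sketch of the inner-bucket idea, where innerBuckets and the skew parameter (standing in for the density-derived exponential weighting) are hypothetical names:

package histogram

import "math"

// innerBuckets splits one histogram bucket (lower, upper] holding `count`
// observations into at most 10 value/count pairs. skew = 0 spreads counts
// evenly; positive skew pushes counts toward the upper edge, negative skew
// toward the lower edge.
func innerBuckets(lower, upper float64, count uint64, skew float64) (values, counts []float64) {
	const n = 10
	span := upper - lower
	// exponential weights w_i = exp(skew * i/n), normalized below so the
	// inner-bucket counts sum to the original bucket count
	weights := make([]float64, n)
	total := 0.0
	for i := 0; i < n; i++ {
		weights[i] = math.Exp(skew * float64(i) / n)
		total += weights[i]
	}
	for i := 0; i < n; i++ {
		// values sit at the midpoints of n even slices of the bucket span
		values = append(values, lower+span*(float64(i)+0.5)/n)
		counts = append(counts, float64(count)*weights[i]/total)
	}
	return values, counts
}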

Aggregation

Currently, the cloudwatch output plugin converts input histograms to the internal type RegularDistribution, which contains a series of value/count pairs in a map. When a new histogram datapoint for the same metric arrives within the aggregation interval, it is converted to a RegularDistribution and the two resulting maps are combined, e.g. (where weight is 1):

// merge bucket counts from the incoming distribution into the aggregate,
// scaling each count by the incoming datapoint's weight
for bucketNumber, bucketCounts := range fromDistribution.buckets {
	regularDist.buckets[bucketNumber] += bucketCounts * weight
}

With this new conversion algorithm, we delay converting the OTel histogram to a series of value/count pairs until the aggregation interval is complete, to avoid ballooning the number of unique values in the aggregated metric datapoint. Instead, the original OTel histogram datapoint is preserved, and incoming histogram datapoints are merged into it, similar to how the OTel deltatocumulative processor works: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/deltatocumulativeprocessor/internal/data/add.go#L46. This makes the aggregation logic a little messy; a sketch of the merge follows.
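A simplified sketch of such a merge, in the spirit of the deltatocumulative processor's add.go, assuming both datapoints share identical explicit bounds (the real logic must handle boundary mismatches; mergeHistogramDataPoint is a hypothetical name):

package aggregation

import "go.opentelemetry.io/collector/pdata/pmetric"

// mergeHistogramDataPoint folds an incoming delta datapoint into the
// accumulated one for the same metric and aggregation interval.
func mergeHistogramDataPoint(acc, in pmetric.HistogramDataPoint) {
	acc.SetCount(acc.Count() + in.Count())
	if acc.HasSum() && in.HasSum() {
		acc.SetSum(acc.Sum() + in.Sum())
	}
	if in.HasMin() && (!acc.HasMin() || in.Min() < acc.Min()) {
		acc.SetMin(in.Min())
	}
	if in.HasMax() && (!acc.HasMax() || in.Max() > acc.Max()) {
		acc.SetMax(in.Max())
	}
	// add per-bucket counts; assumes the bucket boundaries already match
	for i := 0; i < in.BucketCounts().Len() && i < acc.BucketCounts().Len(); i++ {
		acc.BucketCounts().SetAt(i, acc.BucketCounts().At(i)+in.BucketCounts().At(i))
	}
}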

Adapter Receiver Histogram Metrics

The adapter receiver uses OTel histograms as a way to shepherd a series of values/counts to CloudWatch. As an example, StatsD histograms are ingested and stored as values/counts using a map in a distribution type. The distribution is converted to an OTel histogram to send it down the metric pipeline. This conversion stores the values as bucket bounds (which isn't quite right) and the counts as bucket counts. This causes len(bounds) == len(counts), violating the OTel histogram format, which requires len(counts) == len(bounds) + 1. Additionally, the "values" are not monotonically increasing, which also violates the OTel histogram format.

All the adapter receiver wants to do is send the input values/counts to CloudWatch so that percentile metrics are available. To preserve this behavior, the new conversion algorithm and aggregation logic are bypassed for adapter receiver histogram metrics. This is achieved by marking adapter receiver histogram datapoints with a special attribute, as sketched below.
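A minimal sketch of the marking/bypass mechanism; the constant's string value and both helper names are hypothetical (the real constant is constants.AdapterReceiverAttribute, as the diff hunks further down show):

package adapter

import "go.opentelemetry.io/collector/pdata/pmetric"

// stand-in for constants.AdapterReceiverAttribute; the actual value is an
// assumption
const adapterReceiverAttribute = "adapter_receiver"

// markAdapterHistogram tags a datapoint so the exporter keeps the legacy
// values/counts handling for it.
func markAdapterHistogram(dp pmetric.HistogramDataPoint) {
	dp.Attributes().PutStr(adapterReceiverAttribute, "")
}

// isAdapterHistogram is the exporter-side check that routes tagged
// datapoints around the new conversion and aggregation logic.
func isAdapterHistogram(dp pmetric.HistogramDataPoint) bool {
	_, ok := dp.Attributes().Get(adapterReceiverAttribute)
	return ok
}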

Summary

  • [Unchanged] Gauge/Counter metrics are aggregated using the RegularDistribution data type and output as value/count pairs at the end of the aggregation interval.
  • [Unchanged] ExponentialHistogram metrics are aggregated using the ExpHistogramDistribution data type and output as value/count pairs at the end of the aggregation interval.
  • [Unchanged] Adapter receiver classic histogram metrics are aggregated using the RegularDistribution data type and output as value/count pairs at the end of the aggregation interval.
  • [NEW] Classic histogram metrics (besides those originating from the adapter receiver) are aggregated using OTel's pmetric.HistogramDataPoint type. At the end of the aggregation interval, the aggregated datapoint is converted from an OTel histogram to Values/Counts using the new classic histogram mapping algorithm from the opentelemetry-collector-contrib repo.

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Use the new tools in the contrib repo to send histogram test cases to CloudWatch and then retrieve the percentile metrics. See amazon-contributing/opentelemetry-collector-contrib#376 for more details.

The existing histogram test fails because it sends invalid OTel histograms to the agent, which the agent will now drop. Updated test repo: aws/amazon-cloudwatch-agent-test#617

Full integ test run: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/20035810152

Requirements

  1. Run make fmt and make fmt-sh
  2. Run make lint

Integration Tests

To run integration tests against this PR, add the ready for testing label.

@dricross dricross force-pushed the dricross/classichistograms branch 3 times, most recently from 16632ac to c10ab8f Compare October 27, 2025 19:38
@dricross dricross marked this pull request as ready for review October 27, 2025 19:58
@dricross dricross requested a review from a team as a code owner October 27, 2025 19:58
@dricross dricross added the ready for testing Indicates this PR is ready for integration tests to run label Oct 27, 2025
github-actions bot commented Nov 7, 2025

This PR was marked stale due to lack of activity.

@github-actions github-actions bot added the Stale label Nov 7, 2025
go.mod Outdated
Comment on lines 103 to 105
replace go.opentelemetry.io/otel/sdk => go.opentelemetry.io/otel/sdk v1.35.0

replace go.opentelemetry.io/collector/pdata => go.opentelemetry.io/collector/pdata v1.30.0
Contributor:

Can you add comments explaining why we need to pin to this specific version?

Contributor Author:

I don't recall why I needed to pin these, so I tried removing them. The pin for go.opentelemetry.io/collector/pdata can go away and it compiles just fine. The pin for go.opentelemetry.io/otel/sdk seems to be related to an issue with our translator:

go: finding module for package go.opentelemetry.io/otel/sdk/internal/internaltest
go: github.com/aws/amazon-cloudwatch-agent/translator/translate/otel imports
        go.opentelemetry.io/collector/service imports
        go.opentelemetry.io/contrib/otelconf/v0.3.0 imports
        go.opentelemetry.io/otel/sdk/log tested by
        go.opentelemetry.io/otel/sdk/log.test imports
        go.opentelemetry.io/otel/sdk/internal/internaltest: module go.opentelemetry.io/otel/sdk@latest found (v1.38.0), but does not contain package go.opentelemetry.io/otel/sdk/internal/internaltest

I think the issue here is that we're still on OTel collector version v0.124.1/v1.30.0, which uses OTel version v1.35.0 (see https://github.com/open-telemetry/opentelemetry-collector/blob/cf18559059736136ae78dcdef87548291ba6c8aa/service/go.mod#L52). Now go mod tidy upgrades us to v1.38.0 (it was v1.37.0 when this PR was written), which includes changes to the internals of the SDK that are incompatible with our collector version. I'm wondering if we should start adding explicit replaces when we do the OTel bump, so we're deliberate about which versions of the OTel dependencies we use.
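For example, the surviving pin could carry a comment along these lines (wording illustrative):

// Pin otel/sdk to the version used by collector v0.124.1/v1.30.0. Without
// this pin, go mod tidy resolves v1.38.0, whose SDK internals (e.g.
// go.opentelemetry.io/otel/sdk/internal/internaltest) are incompatible with
// our collector version.
replace go.opentelemetry.io/otel/sdk => go.opentelemetry.io/otel/sdk v1.35.0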

Comment on lines 564 to 568
s := cloudwatch.StatisticSet{}
s.SetMaximum(cwhist.Maximum())
s.SetMinimum(cwhist.Minimum())
s.SetSampleCount(cwhist.SampleCount())
s.SetSum(cwhist.Sum())
Contributor:

What's this used for? It seems like it gets replaced on line 585 before it's ever used.

Contributor Author:

Ah good point, this isn't used. I'll remove it.

Comment on lines 585 to 597
s := cloudwatch.StatisticSet{}
s.SetMaximum(cwhist.Maximum())
s.SetMinimum(cwhist.Minimum())
s.SetSum(0.0)
// Only assign `Sum` if this is the first split to make sure the total sum of the datapoints after aggregation is correct.
if i == 0 {
s.SetSum(cwhist.Sum())
}
count := 0.0
for _, c := range batchCounts {
count += c
}
s.SetSampleCount(count)
Contributor:

Shouldn't this be created one level up? Doesn't seem specific to the dimension datum expansion.

Contributor Author:

You mean outside of the for loop? That makes sense. No need to re-create the statistic set.

batchValues := values[i:end]
batchCounts := counts[i:end]

for index, dimensions := range dimensionsList {
Contributor:

Can this and the buildMetricDatumDist dimension expansion logic be shared?

Contributor Author:

If we move the creation of the statistic set outside of the for loop like you suggested, then yeah, we should be able to share this logic.
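An illustrative sketch of that refactor, reusing names from the snippet above (the enclosing split loop over i is assumed):

// build the StatisticSet once per split; it depends only on the split
// index i, not on the dimension set, so every dimension datum can reuse it
s := cloudwatch.StatisticSet{}
s.SetMaximum(cwhist.Maximum())
s.SetMinimum(cwhist.Minimum())
s.SetSum(0.0)
if i == 0 {
	// only the first split carries the total sum so the aggregated sum stays correct
	s.SetSum(cwhist.Sum())
}
count := 0.0
for _, c := range batchCounts {
	count += c
}
s.SetSampleCount(count)

for range dimensionsList {
	// build one datum per dimension set, sharing s; this is the part that
	// could be factored out and shared with buildMetricDatumDist
}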

@github-actions github-actions bot removed the Stale label Dec 3, 2025
@dricross dricross force-pushed the dricross/classichistograms branch from e6bc078 to 3c86421 Compare December 8, 2025 14:24
}

// ConsumeMetrics queues metrics to be published to CW.
// ConsumeMetrics queues metrics to be published to CW.588
Contributor:

nit: typo

addTagsToAttributes(h.Attributes(), tags)

// add special attribute to tell the exporter to handle this histogram datapoint differently
h.Attributes().PutStr(constants.AdapterReceiverAttribute, "")
Contributor:

How do we ensure that this doesn't show up by accident if we export this metric using a different exporter (like the awsemfexporter)?

return true
}
// we don't want to export the adapter receiver attribute
if k == constants.AdapterReceiverAttribute {
Contributor:

Can we add a unit test for this?
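One possible shape for such a test; it is a self-contained sketch of the assertion rather than a test of the real exporter function, and the attribute value is a stand-in for constants.AdapterReceiverAttribute:

package cloudwatch

import (
	"testing"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

const adapterReceiverAttribute = "adapter_receiver" // stand-in value, assumption

func TestAdapterReceiverAttributeFiltered(t *testing.T) {
	attrs := pcommon.NewMap()
	attrs.PutStr(adapterReceiverAttribute, "")
	attrs.PutStr("InstanceId", "i-0123456789")

	kept := make(map[string]string)
	attrs.Range(func(k string, v pcommon.Value) bool {
		// mirrors the exporter's filter shown above: skip the marker attribute
		if k == adapterReceiverAttribute {
			return true
		}
		kept[k] = v.Str()
		return true
	})

	if _, ok := kept[adapterReceiverAttribute]; ok {
		t.Errorf("adapter receiver attribute must not be exported")
	}
	if kept["InstanceId"] != "i-0123456789" {
		t.Errorf("expected regular attributes to be preserved")
	}
}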
