Use new classic histogram conversion algorithm #1919
base: main
Conversation
Force-pushed from 16632ac to c10ab8f
This PR was marked stale due to lack of activity.
go.mod (Outdated)
```
replace go.opentelemetry.io/otel/sdk => go.opentelemetry.io/otel/sdk v1.35.0

replace go.opentelemetry.io/collector/pdata => go.opentelemetry.io/collector/pdata v1.30.0
```
Can you add comments explaining why we need to pin to this specific version?
I don't recall why I needed to pin these, so I tried removing them. The pin for go.opentelemetry.io/collector/pdata can go away and it compiles just fine. The pin for go.opentelemetry.io/otel/sdk seems to be related to an issue with our translator:
```
go: finding module for package go.opentelemetry.io/otel/sdk/internal/internaltest
go: github.com/aws/amazon-cloudwatch-agent/translator/translate/otel imports
	go.opentelemetry.io/collector/service imports
	go.opentelemetry.io/contrib/otelconf/v0.3.0 imports
	go.opentelemetry.io/otel/sdk/log tested by
	go.opentelemetry.io/otel/sdk/log.test imports
	go.opentelemetry.io/otel/sdk/internal/internaltest: module go.opentelemetry.io/otel/sdk@latest found (v1.38.0), but does not contain package go.opentelemetry.io/otel/sdk/internal/internaltest
```
I think the issue here is that we're still on OTel collector version v0.124.1/v1.30.0, which uses OTel version v1.35.0 (see https://github.com/open-telemetry/opentelemetry-collector/blob/cf18559059736136ae78dcdef87548291ba6c8aa/service/go.mod#L52). Now `go mod tidy` upgrades us to v1.38.0 (it was v1.37.0 when the PR was written), which includes changes to the internals of the SDK that are incompatible with our collector version. I'm wondering if we should start adding explicit replaces when we do the OTel bump, to be very explicit about which versions of the OTel dependencies we are using.
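If the pin has to stay, a commented replace along these lines (a sketch; the comment wording is assumed, not from the PR) would capture that reasoning in go.mod:

```
// Pin the OTel SDK to the version that collector v0.124.1/v1.30.0 was built
// against; go mod tidy otherwise upgrades to v1.38.0, whose SDK internals are
// incompatible with our collector version.
replace go.opentelemetry.io/otel/sdk => go.opentelemetry.io/otel/sdk v1.35.0
```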
```go
s := cloudwatch.StatisticSet{}
s.SetMaximum(cwhist.Maximum())
s.SetMinimum(cwhist.Minimum())
s.SetSampleCount(cwhist.SampleCount())
s.SetSum(cwhist.Sum())
```
What's this used for? It seems like it gets replaced on line 585 before it's ever used.
Ah good point, this isn't used. I'll remove it.
```go
s := cloudwatch.StatisticSet{}
s.SetMaximum(cwhist.Maximum())
s.SetMinimum(cwhist.Minimum())
s.SetSum(0.0)
// Only assign `Sum` if this is the first split to make sure the total sum of the datapoints after aggregation is correct.
if i == 0 {
	s.SetSum(cwhist.Sum())
}
count := 0.0
for _, c := range batchCounts {
	count += c
}
s.SetSampleCount(count)
```
Shouldn't this be created one level up? Doesn't seem specific to the dimension datum expansion.
You mean outside of the for loop? That makes sense. No need to re-create the statistic set.
```go
batchValues := values[i:end]
batchCounts := counts[i:end]

for index, dimensions := range dimensionsList {
```
Can this and the buildMetricDatumDist dimension expansion logic be shared?
If we move the formation of the statistic set outside of the for loop like you suggested, then yeah we should be able to share this logic.
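For illustration, a sketch of the refactor being agreed on here, not the PR's actual code (`cwHistogram` is an assumed stand-in for the histogram datapoint type in the diff; `cloudwatch.StatisticSet` and its setters follow the diff):

```go
// cwHistogram is an assumed stand-in for the histogram datapoint type used in
// the diff above.
type cwHistogram interface {
	Maximum() float64
	Minimum() float64
	Sum() float64
}

// buildStatisticSet hoists the StatisticSet construction out of the
// per-dimension loop, so the dimension expansion that follows it can be
// shared with buildMetricDatumDist. splitIndex is the batch split index
// (i in the diff).
func buildStatisticSet(cwhist cwHistogram, batchCounts []float64, splitIndex int) cloudwatch.StatisticSet {
	s := cloudwatch.StatisticSet{}
	s.SetMaximum(cwhist.Maximum())
	s.SetMinimum(cwhist.Minimum())
	s.SetSum(0.0)
	// Only the first split carries Sum so the aggregated total stays correct.
	if splitIndex == 0 {
		s.SetSum(cwhist.Sum())
	}
	count := 0.0
	for _, c := range batchCounts {
		count += c
	}
	s.SetSampleCount(count)
	return s
}
```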
Force-pushed from e6bc078 to 3c86421
```go
}

// ConsumeMetrics queues metrics to be published to CW.
// ConsumeMetrics queues metrics to be published to CW.588
```
nit: typo
```go
addTagsToAttributes(h.Attributes(), tags)

// add special attribute to tell the exporter to handle this histogram datapoint differently
h.Attributes().PutStr(constants.AdapterReceiverAttribute, "")
```
How do we ensure that this doesn't show up by accident if we export this metric using a different exporter (like the awsemfexporter)?
```go
	return true
}
// we don't want to export the adapter receiver attribute
if k == constants.AdapterReceiverAttribute {
```
Can we add a unit test for this?
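A test along these lines might do it (a sketch only; `buildDimensions` is a hypothetical stand-in for the agent function that converts attributes into CloudWatch dimensions, and the assertion assumes the SDK v1 `Dimension` type where `Name` is a `*string`):

```go
import (
	"testing"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

func TestAdapterReceiverAttributeNotExported(t *testing.T) {
	attrs := pcommon.NewMap()
	attrs.PutStr(constants.AdapterReceiverAttribute, "")
	attrs.PutStr("host", "i-0123456789")

	// buildDimensions is a hypothetical stand-in for the code under test.
	dims := buildDimensions(attrs)
	for _, d := range dims {
		if *d.Name == constants.AdapterReceiverAttribute {
			t.Error("adapter receiver attribute leaked into exported dimensions")
		}
	}
}
```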
Description of the issue
Integrate with the new algorithm for converting OTel classic histograms to CloudWatch Values/Counts. This significantly improves percentile metrics for classic histogram metrics.
Description of changes
Integrate with new algorithm in contrib repo: amazon-contributing/opentelemetry-collector-contrib#376
Associated integ test update to fix histogram definitions: aws/amazon-cloudwatch-agent-test#617
New Algorithm
This algorithm converts each bucket of the input histogram into (at most) 10 value/count data pairs, aka "inner buckets". The values of the inner buckets are spread evenly across the bucket span. The counts of the inner buckets are determined using an exponential mapping algorithm: counts are weighted more heavily to one side according to an exponential function, depending on how the density of the nearby buckets is changing.
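As a rough illustration of the spreading step, here is a simplified sketch. It is not the contrib implementation: the exponential count-weighting is replaced with a uniform split for brevity, and `splitBucket` is a name invented for this example.

```go
package main

import "fmt"

// splitBucket splits one histogram bucket into at most 10 evenly spaced
// "inner buckets". The real algorithm additionally weights the counts with an
// exponential function based on neighboring bucket density; here the count is
// split uniformly.
func splitBucket(lower, upper float64, count uint64) (values, counts []float64) {
	const maxInner = 10
	n := maxInner
	if count < uint64(n) {
		n = int(count)
	}
	if n == 0 {
		return nil, nil
	}
	width := (upper - lower) / float64(n)
	for i := 0; i < n; i++ {
		// Place each inner bucket's value at the midpoint of its sub-span.
		values = append(values, lower+width*(float64(i)+0.5))
		counts = append(counts, float64(count)/float64(n))
	}
	return values, counts
}

func main() {
	// A bucket (10, 20] holding 25 samples becomes 10 value/count pairs.
	values, counts := splitBucket(10, 20, 25)
	fmt.Println(values, counts)
}
```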
Aggregation
Currently, the cloudwatch output plugin converts input histograms to the internal type RegularDistribution, which contains a series of value/count pairs in a map. When a new histogram datapoint for the same metric arrives within the aggregation interval, it is converted to a RegularDistribution and the two resultant maps are combined, with each incoming count scaled by `weight` (1 in this case).

With this new conversion algorithm, we are going to delay converting the OTel histogram to a series of value/count pairs until the aggregation interval is complete, to avoid ballooning the number of unique values in the aggregated metric datapoint. Instead, the original OTel histogram datapoint is preserved and incoming histogram datapoints are merged, similar to how the OTel deltatocumulative processor works: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/deltatocumulativeprocessor/internal/data/add.go#L46. This makes the aggregation logic a little messy.
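For illustration, hedged sketches of the two merge shapes described above; neither is the agent's actual code, and both function names are invented for this example:

```go
// mergeDistributions is the shape of the existing map-based combination
// (weight is 1 in the example above).
func mergeDistributions(agg, incoming map[float64]float64, weight float64) {
	for value, count := range incoming {
		agg[value] += count * weight
	}
}

// mergeBucketCounts mirrors the deltatocumulative-style merge used by the new
// approach: add bucket counts element-wise. Real code must first verify that
// both datapoints share identical explicit bounds, and must also combine Sum,
// Count, Min, and Max.
func mergeBucketCounts(agg, incoming []uint64) {
	for i, c := range incoming {
		agg[i] += c
	}
}
```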
Adapter Receiver Histogram Metrics
The adapter receiver uses OTel histograms as a way to shepherd a series of values/counts to CloudWatch. As an example, StatsD histograms are ingested and stored as values/counts using a map in a distribution type. The distribution is converted to an OTel histogram to send it down the metric pipeline. This conversion stores the values as bucket bounds (which isn't quite right) and the counts as bucket counts. This causes len(bounds) == len(counts), which violates the OTel histogram format (explicit-bounds histograms require len(counts) == len(bounds) + 1). Additionally, the order of the "values" is not monotonically increasing, which also violates the OTel histogram format.
All the adapter receiver wants to do is send the input values/counts to CloudWatch so that percentile metrics are available. To keep this functionality, the new conversion algorithm and aggregation logic will be bypassed for adapter receiver histogram metrics. This is achieved by marking adapter receiver histogram datapoints with a special attribute.
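A sketch of how that marker round trip could look (the key string here is an assumption; the real one is constants.AdapterReceiverAttribute in the agent, and both function names are invented for this example):

```go
import "go.opentelemetry.io/collector/pdata/pmetric"

// Assumed key; the real value is defined in the agent's constants package as
// constants.AdapterReceiverAttribute.
const adapterReceiverAttribute = "adapter_receiver"

// markAdapterDatapoint tags a datapoint so the exporter bypasses the new
// conversion path; only the key's presence matters, so the value is empty.
func markAdapterDatapoint(dp pmetric.HistogramDataPoint) {
	dp.Attributes().PutStr(adapterReceiverAttribute, "")
}

// isAdapterDatapoint reports whether the marker is present and strips it so
// it is never exported as a CloudWatch dimension.
func isAdapterDatapoint(dp pmetric.HistogramDataPoint) bool {
	if _, ok := dp.Attributes().Get(adapterReceiverAttribute); ok {
		dp.Attributes().Remove(adapterReceiverAttribute)
		return true
	}
	return false
}
```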
Summary
- Classic histograms (before this PR): converted to the RegularDistribution data type and output as value/count pairs at the end of the aggregation interval.
- Exponential histograms (unchanged): converted to the ExpHistogramDistribution data type and output as value/count pairs at the end of the aggregation interval.
- Adapter receiver histograms (unchanged): converted to the RegularDistribution data type and output as value/count pairs at the end of the aggregation interval.
- Classic histograms (after this PR): kept as the pmetric.HistogramDataPoint type during aggregation. The aggregated datapoint is converted from OTel histogram to Values/Counts using the new classic histogram mapping algorithm from the opentelemetry-collector-contrib repo at the end of the aggregation interval.

License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
Use the new tools in the contrib repo to send histogram test cases to CloudWatch and then retrieve the percentile metrics. See amazon-contributing/opentelemetry-collector-contrib#376 for more details.
The existing histogram test fails because it sends invalid OTel histograms to the agent, which the agent will now drop. Updated test repo: aws/amazon-cloudwatch-agent-test#617
Full integ test run: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/20035810152
Requirements
- Run `make fmt` and `make fmt-sh`
- Run `make lint`

Integration Tests
To run integration tests against this PR, add the `ready for testing` label.