
Conversation

@sonikaarora

@sonikaarora sonikaarora commented Jan 29, 2026

Summary

This PR verifies and documents that queue-proxy instances can only report statistics for their own pod and cannot spoof metrics for other pods.

Changes

  • Added TestProtobufStatsReporterPodNameImmutable: Verifies that the pod name remains immutable across multiple stat reports
  • Added TestProtobufStatsReporterMultiplePods: Verifies that separate queue-proxy instances maintain independent pod identities
  • Enhanced documentation: Added security property documentation to ProtobufStatsReporter explaining immutable pod identity binding

Verification

The implementation was inspected and verified to ensure:

  1. Pod name is set once at construction from Kubernetes environment (env.ServingPod)
  2. The podName field is private and has no setter methods
  3. All statistics reports use this immutable pod identity
  4. The autoscaler correctly attributes metrics to the appropriate pod and revision
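
To make the property concrete, here is a minimal sketch of that pattern (hypothetical names, not the actual ProtobufStatsReporter source):

```go
// Illustrative sketch only: it mirrors the properties listed above. The pod
// name is captured once at construction, kept in an unexported field with no
// setter, and stamped onto every report.
package queue

type statsReporter struct {
	podName string // set once in newStatsReporter; no setter exists
}

// newStatsReporter binds the reporter to the pod named by the Kubernetes
// environment (e.g. the value behind env.ServingPod).
func newStatsReporter(podName string) *statsReporter {
	return &statsReporter{podName: podName}
}

// stat is a hypothetical stand-in for the reported metrics message.
type stat struct {
	PodName                   string
	RequestCount              float64
	AverageConcurrentRequests float64
}

// report attributes every stat to the reporter's own pod; callers have no way
// to substitute a different pod name.
func (r *statsReporter) report(requestCount, concurrency float64) stat {
	return stat{
		PodName:                   r.podName,
		RequestCount:              requestCount,
		AverageConcurrentRequests: concurrency,
	}
}
```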

Test Results

All tests pass including the new security verification tests:

=== RUN   TestProtobufStatsReporterPodNameImmutable
--- PASS: TestProtobufStatsReporterPodNameImmutable (0.00s)
=== RUN   TestProtobufStatsReporterMultiplePods
--- PASS: TestProtobufStatsReporterMultiplePods (0.00s)

Coverage: 97.1% of statements in pkg/queue

Backward Compatibility

✅ No breaking changes
✅ All existing tests pass
✅ No changes to public APIs

Related: #11015

This change verifies and documents that queue-proxy instances can only
report statistics for their own pod and cannot spoof metrics for other pods.

Changes:
- Added TestProtobufStatsReporterPodNameImmutable to verify pod name
  immutability across multiple stat reports
- Added TestProtobufStatsReporterMultiplePods to verify separate pod
  identities are maintained
- Enhanced ProtobufStatsReporter documentation explaining the security
  property of immutable pod identity binding

The podName field is set once at construction from the Kubernetes
environment and cannot be modified, ensuring correct metric attribution
for autoscaling decisions.
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 29, 2026
@knative-prow

knative-prow bot commented Jan 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sonikaarora
Once this PR has been reviewed and has the lgtm label, please assign dprotaso for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot requested review from gauron99, linkvt and skonto January 29, 2026 06:00
@knative-prow knative-prow bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 29, 2026
@knative-prow

knative-prow bot commented Jan 29, 2026

Hi @sonikaarora. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sonikaarora sonikaarora marked this pull request as ready for review January 29, 2026 06:03
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 29, 2026
@knative-prow knative-prow bot requested a review from dprotaso January 29, 2026 06:04
@linkvt
Contributor

linkvt commented Jan 29, 2026

Hi,
thanks for your PR. I just checked it and the referenced issue, and I don't think this is what was meant in the issue description.

If I understand your tests right, they basically only check that Go works correctly and doesn't mix up properties across multiple instances of the same struct - IMO there's no need to test this 😀.

Considering that the issue kind is documentation, and given this quote from the issue, I think this PR is not the correct approach to solving it:

Inspect code and verify that when autoscaler calls the queue proxy, all the statistics from the queue-proxy are associated with the Revision and Pod associated with the queue proxy.

@linkvt
Contributor

linkvt commented Jan 29, 2026

/hold

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 29, 2026
@sonikaarora sonikaarora marked this pull request as draft January 29, 2026 17:58
@knative-prow knative-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 29, 2026
This addresses reviewer feedback to properly verify the end-to-end
autoscaler → queue-proxy interaction, not just struct field immutability.

Changes:
- Added TestScrapeOnlyScrapesPodsForCorrectRevision to verify that the
  autoscaler only scrapes pods belonging to the target revision
- Added makePodsWithRevision helper to create pods with specific revision labels
- Enhanced ProtobufStatsReporter documentation to explain the complete
  security model including revision attribution and network isolation
- Kept existing unit tests as they verify pod name immutability at the
  queue-proxy level

The test demonstrates that:
1. PodAccessor filters pods by serving.RevisionLabelKey
2. The scraper only requests pod IPs for the target revision
3. Stats from other revisions are never scraped
4. This ensures proper isolation for autoscaling decisions
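
For illustration, a rough sketch of what such a label-based helper might look like (hypothetical code, not the actual test; serving.RevisionLabelKey corresponds to the serving.knative.dev/revision label):

```go
package metrics_test

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// makePodsWithRevision is a hypothetical helper in the spirit of the commit
// message: it creates n running pods labelled with the given revision, which
// is the label PodAccessor filters on when listing scrape targets.
func makePodsWithRevision(namespace, revision string, n int) []*corev1.Pod {
	pods := make([]*corev1.Pod, 0, n)
	for i := 0; i < n; i++ {
		pods = append(pods, &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("%s-pod-%d", revision, i),
				Namespace: namespace,
				Labels: map[string]string{
					// serving.RevisionLabelKey
					"serving.knative.dev/revision": revision,
				},
			},
			Status: corev1.PodStatus{
				Phase: corev1.PodRunning,
				PodIP: fmt.Sprintf("10.0.0.%d", i+1),
			},
		})
	}
	return pods
}
```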
@sonikaarora sonikaarora force-pushed the verify-queue-proxy-pod-statistics branch from d3b0f24 to 85fc1c7 Compare January 31, 2026 03:53
@sonikaarora sonikaarora marked this pull request as ready for review January 31, 2026 03:54
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2026
@knative-prow knative-prow bot requested a review from dsimansk January 31, 2026 03:54
@sonikaarora
Author

Hi @linkvt,

Thanks for the feedback! You're absolutely right - my initial tests were too basic.

I've updated the PR to address your concern. The key change is a new integration test: TestScrapeOnlyScrapesPodsForCorrectRevision (in pkg/autoscaler/metrics/stats_scraper_test.go).

This test verifies the complete autoscaler → queue-proxy flow:

  • Creates pods for two different revisions (revision-a and revision-b)
  • Configures the autoscaler scraper to target only revision-a
  • Verifies that only revision-a pods are scraped (by checking HTTP requests made)
  • Confirms that revision-b pods are never contacted

This directly addresses the issue requirement: "when autoscaler calls the queue proxy, all the statistics from the queue-proxy are associated with the Revision and Pod".

The original unit tests remain (they verify pod-level immutability) but are now complemented by this end-to-end test showing revision isolation via Kubernetes labels and PodAccessor filtering.

Would appreciate another look when you have time. Thanks!

@dprotaso
Member

dprotaso commented Feb 2, 2026

/ok-to-test

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 2, 2026
@dprotaso
Member

dprotaso commented Feb 2, 2026

/assign @evankanderson for input

I read the title of the issue and thought this was about the autoscaler verifying that the metrics it receives are coming from a specific pod/revision. But seeing the label about documentation and the body of the issue, I'm unsure about the requirements.

@knative-prow

knative-prow bot commented Feb 2, 2026

@dprotaso: GitHub didn't allow me to assign the following users: for, input.

Note that only knative members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


In response to this:

/assign @evankanderson for input

I read the title of the issue and thought this was about the autoscaler verifying that the metrics it receives are coming from a specific pod/revision. But seeing the label about documentation and the body of the issue, I'm unsure about the requirements.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


// TestProtobufStatsReporterMultiplePods verifies that different queue-proxy instances
// maintain separate pod identities and cannot report for each other.
func TestProtobufStatsReporterMultiplePods(t *testing.T) {
Member


I think we can drop this test. It's not really doing anything more than TestProtobufStatsReporterPodNameImmutable

// that statistics are correctly attributed to the appropriate revision.
//
// Related to: https://github.com/knative/serving/issues/11015
func TestScrapeOnlyScrapesPodsForCorrectRevision(t *testing.T) {
Member


The test scraper only really supports scraping one revision's pods - so we shouldn't seed stats for revision B

// Create a scraper specifically for revision-a
// The fake client will return stats for all 4 pods, but the scraper
// should only scrape the 2 pods belonging to revision-a
allStats := append(statsRevA, statsRevB...)
Member


Stats here shouldn't include statsRevB because the test scraper returns stats in the order of invocation

For example - if I were to call scrape 4 times, you'd get revision B stats being returned by the scraper. This would mean the mock is not behaving like the real scraper.

@evankanderson
Member

Yes, the original issue is to verify that when the autoscaler is reading metrics from a specific pod. In particular, https://github.com/knative/serving/blob/main/pkg/autoscaler/metrics/stats_scraper.go#L326 passes a stat object deserialized from the remote "queue-proxy" at https://github.com/knative/serving/blob/main/pkg/autoscaler/metrics/http_scrape_client.go#L80.

We can't actually tell whether it's a queue proxy or something malicious; we do prevent a DoS by limiting the buffer size. We should also prevent it from replying for a different Knative service, but I don't think anyone spent the time to analyze the call stack to see if this was a reasonable concern. We seem to have a serviceScraper for each Knative revision (if I read the code correctly), which seems to be using a Knative-created k8s Service to locate the pods. It looks like the Stat message may not include namespace / revision information, so this may be a moot attack vector.
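
As a generic illustration of the buffer-limiting pattern mentioned above (hypothetical names and limit, not the actual http_scrape_client code):

```go
package scrape

import (
	"fmt"
	"io"
	"net/http"
)

// maxStatBytes is a hypothetical cap on how much we will read from a scrape
// target before decoding, so a malicious endpoint cannot exhaust memory.
const maxStatBytes = 32 * 1024

// readBoundedBody reads at most maxStatBytes from the response and rejects
// anything larger, instead of trusting the remote end to be a well-behaved
// queue-proxy.
func readBoundedBody(resp *http.Response) ([]byte, error) {
	defer resp.Body.Close()
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxStatBytes+1))
	if err != nil {
		return nil, err
	}
	if len(body) > maxStatBytes {
		return nil, fmt.Errorf("scrape response larger than %d bytes", maxStatBytes)
	}
	return body, nil
}
```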

@evankanderson
Member

I've read the code. It appears that the autoscaler works as follows:

  1. MetricCollector implements Collector for interfacing with the k8s watcher interface, and MetricClient for interfacing with the rest of the scaling system.
  2. Internally, MetricCollector keeps a map of NamespacedName revisions to stats collectors. This map is updated by watching the Kubernetes apiserver for updates to the autoscaling.internal.knative.dev/v1alpha1/Metric custom resource.
  3. For each Metric resource, the autoscaler constructs a separate watcher on the pods specified by the k8s Service reference in the Metric (same-namespace). The watcher tracks the lifecycle of a StatsScraper and the metric aggregations associated with the Metric.
  4. Stat reports are never accumulated outside a Metric custom resource.
  5. The rest of the autoscaling infrastructure requests scaling values based on the NamespacedName from item 2; if there is no corresponding Metric object, an ErrNotCollecting will be returned.
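
As a rough sketch of the structure described in points 1-5 (hypothetical names, not the actual MetricCollector code):

```go
package autoscaler

import (
	"errors"
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// collection is a hypothetical stand-in for the per-Metric state: the scraper
// lifecycle and the aggregated stats for exactly one revision (points 3-4).
type collection struct {
	scrapeTarget string // the Knative-created k8s Service referenced by the Metric
	// aggregated Stat reports live here and nowhere else
}

// collector mirrors point 2: one entry per Metric custom resource, keyed by
// the revision's namespace/name and maintained from apiserver watch events.
type collector struct {
	mu          sync.RWMutex
	collections map[types.NamespacedName]*collection
}

var errNotCollecting = errors.New("no Metric resource is being collected for this key")

// get mirrors point 5: scaling queries resolve purely by key, so stats
// reported for one revision can never feed another revision's entry.
func (c *collector) get(key types.NamespacedName) (*collection, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	col, ok := c.collections[key]
	if !ok {
		return nil, errNotCollecting
	}
	return col, nil
}
```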

Based on this structure, the most damage an attacker with edit access to all types in a namespace (the normal edit role does not have Metric permissions) could do would be:

  • Create additional Metric objects -- add some additional work to the autoscaler.
  • Change a Metric's linked k8s Service to point to a non-queue-proxy pod (either in their own namespace or someone else's).
    • For a within-namespace pod, you could report additional load and cause additional scaling
    • For an outside-namespace pod, you might also be able to intuit some measure of load / scale. To do this, you would need to be able to enumerate the Pod IPs of the target namespace to fill them into the Service's Endpoints. But if you can already enumerate the Pod IPs of the target namespace, you already have some scaling information...

In particular, it would not be possible to use access to the Metric or Stat responses from an attacker controlled pod to change the scaling for a Revision which is outside the attacker's namespace.

@evankanderson
Member

That's not to say that these might not be useful changes, but the "investigate" part of the original issue really was "investigate" before proposing fixes.

@dprotaso
Member

dprotaso commented Feb 2, 2026

FWIW - I'm good to merge the PR (with the requested changes) though I'd say it doesn't fix the target issue.

