
Conversation

@sonikaarora

@sonikaarora sonikaarora commented Jan 29, 2026

Summary

This PR verifies and documents that queue-proxy instances can only report statistics for their own pod and cannot spoof metrics for other pods.

Changes

  • Added TestProtobufStatsReporterPodNameImmutable: Verifies that the pod name remains immutable across multiple stat reports
  • Added TestProtobufStatsReporterMultiplePods: Verifies that separate queue-proxy instances maintain independent pod identities
  • Enhanced documentation: Added security property documentation to ProtobufStatsReporter explaining immutable pod identity binding

Verification

The implementation was inspected and verified to ensure:

  1. Pod name is set once at construction from Kubernetes environment (env.ServingPod)
  2. The podName field is private and has no setter methods
  3. All statistics reports use this immutable pod identity
  4. The autoscaler correctly attributes metrics to the appropriate pod and revision
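
To make the property concrete, here is a minimal sketch of that pattern (hypothetical names, not the actual ProtobufStatsReporter source):

```go
// Illustrative sketch only: it mirrors the properties listed above. The pod
// name is captured once at construction, kept in an unexported field with no
// setter, and stamped onto every report.
package queue

type statsReporter struct {
	podName string // set once in newStatsReporter; no setter exists
}

// newStatsReporter binds the reporter to the pod named by the Kubernetes
// environment (e.g. the value behind env.ServingPod).
func newStatsReporter(podName string) *statsReporter {
	return &statsReporter{podName: podName}
}

// stat is a hypothetical stand-in for the reported metrics message.
type stat struct {
	PodName                   string
	RequestCount              float64
	AverageConcurrentRequests float64
}

// report attributes every stat to the reporter's own pod; callers have no way
// to substitute a different pod name.
func (r *statsReporter) report(requestCount, concurrency float64) stat {
	return stat{
		PodName:                   r.podName,
		RequestCount:              requestCount,
		AverageConcurrentRequests: concurrency,
	}
}
```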

Test Results

All tests pass including the new security verification tests:

=== RUN   TestProtobufStatsReporterPodNameImmutable
--- PASS: TestProtobufStatsReporterPodNameImmutable (0.00s)
=== RUN   TestProtobufStatsReporterMultiplePods
--- PASS: TestProtobufStatsReporterMultiplePods (0.00s)

Coverage: 97.1% of statements in pkg/queue

Backward Compatibility

✅ No breaking changes
✅ All existing tests pass
✅ No changes to public APIs

Related: #11015

This change verifies and documents that queue-proxy instances can only
report statistics for their own pod and cannot spoof metrics for other pods.

Changes:
- Added TestProtobufStatsReporterPodNameImmutable to verify pod name
  immutability across multiple stat reports
- Added TestProtobufStatsReporterMultiplePods to verify separate pod
  identities are maintained
- Enhanced ProtobufStatsReporter documentation explaining the security
  property of immutable pod identity binding

The podName field is set once at construction from the Kubernetes
environment and cannot be modified, ensuring correct metric attribution
for autoscaling decisions.
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 29, 2026
@knative-prow

knative-prow bot commented Jan 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sonikaarora
Once this PR has been reviewed and has the lgtm label, please assign dprotaso for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot requested review from gauron99, linkvt and skonto January 29, 2026 06:00
@knative-prow knative-prow bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 29, 2026
@knative-prow

knative-prow bot commented Jan 29, 2026

Hi @sonikaarora. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sonikaarora sonikaarora marked this pull request as ready for review January 29, 2026 06:03
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 29, 2026
@knative-prow knative-prow bot requested a review from dprotaso January 29, 2026 06:04
@linkvt
Contributor

linkvt commented Jan 29, 2026

Hi,
thanks for your PR. I just checked it and the referenced issue, and I don't think this is what was meant in the issue description.

If I understand your tests right, they basically only check that Go works correctly and doesn't mix up properties across multiple instances of the same struct - IMO there's no need to test this 😀.

Considering that the issue kind is documentation, and given this quote from the issue, I think this PR is not the correct approach to solving it:

Inspect code and verify that when autoscaler calls the queue proxy, all the statistics from the queue-proxy are associated with the Revision and Pod associated with the queue proxy.

@linkvt
Contributor

linkvt commented Jan 29, 2026

/hold

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 29, 2026
@sonikaarora sonikaarora marked this pull request as draft January 29, 2026 17:58
@knative-prow knative-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 29, 2026
This addresses reviewer feedback to properly verify the end-to-end
autoscaler → queue-proxy interaction, not just struct field immutability.

Changes:
- Added TestScrapeOnlyScrapesPodsForCorrectRevision to verify that the
  autoscaler only scrapes pods belonging to the target revision
- Added makePodsWithRevision helper to create pods with specific revision labels
- Enhanced ProtobufStatsReporter documentation to explain the complete
  security model including revision attribution and network isolation
- Kept existing unit tests as they verify pod name immutability at the
  queue-proxy level

The test demonstrates that:
1. PodAccessor filters pods by serving.RevisionLabelKey
2. The scraper only requests pod IPs for the target revision
3. Stats from other revisions are never scraped
4. This ensures proper isolation for autoscaling decisions
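
For illustration, a rough sketch of what such a label-based helper might look like (hypothetical code, not the actual test; serving.RevisionLabelKey corresponds to the serving.knative.dev/revision label):

```go
package metrics_test

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// makePodsWithRevision is a hypothetical helper in the spirit of the commit
// message: it creates n running pods labelled with the given revision, which
// is the label PodAccessor filters on when listing scrape targets.
func makePodsWithRevision(namespace, revision string, n int) []*corev1.Pod {
	pods := make([]*corev1.Pod, 0, n)
	for i := 0; i < n; i++ {
		pods = append(pods, &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("%s-pod-%d", revision, i),
				Namespace: namespace,
				Labels: map[string]string{
					// serving.RevisionLabelKey
					"serving.knative.dev/revision": revision,
				},
			},
			Status: corev1.PodStatus{
				Phase: corev1.PodRunning,
				PodIP: fmt.Sprintf("10.0.0.%d", i+1),
			},
		})
	}
	return pods
}
```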
@sonikaarora sonikaarora force-pushed the verify-queue-proxy-pod-statistics branch from d3b0f24 to 85fc1c7 Compare January 31, 2026 03:53
@sonikaarora sonikaarora marked this pull request as ready for review January 31, 2026 03:54
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2026
@knative-prow knative-prow bot requested a review from dsimansk January 31, 2026 03:54
@sonikaarora
Author

Hi @linkvt,

Thanks for the feedback! You're absolutely right - my initial tests were too basic.

I've updated the PR to address your concern. The key change is a new integration test: TestScrapeOnlyScrapesPodsForCorrectRevision (in pkg/autoscaler/metrics/stats_scraper_test.go).

This test verifies the complete autoscaler → queue-proxy flow:

  • Creates pods for two different revisions (revision-a and revision-b)
  • Configures the autoscaler scraper to target only revision-a
  • Verifies that only revision-a pods are scraped (by checking HTTP requests made)
  • Confirms that revision-b pods are never contacted

This directly addresses the issue requirement: "when autoscaler calls the queue proxy, all the statistics from the queue-proxy are associated with the Revision and Pod".

The original unit tests remain (they verify pod-level immutability) but are now complemented by this end-to-end test showing revision isolation via Kubernetes labels and PodAccessor filtering.

Would appreciate another look when you have time. Thanks!

@dprotaso
Member

dprotaso commented Feb 2, 2026

/ok-to-test

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 2, 2026
@dprotaso
Member

dprotaso commented Feb 2, 2026

/assign @evankanderson for input

I read the title of the issue and thought this was about the autoscaler verifying that the metrics it receives are coming from a specific pod/revision. But seeing the label about documentation and the body of the issue, I'm unsure about the requirements.

@knative-prow

knative-prow bot commented Feb 2, 2026

@dprotaso: GitHub didn't allow me to assign the following users: for, input.

Note that only knative members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


In response to this:

/assign @evankanderson for input

I read the title of the issue and thought this was about the autoscaler verifying that the metrics it receives are coming from a specific pod/revision. But seeing the label about documentation and the body of the issue, I'm unsure about the requirements.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


// TestProtobufStatsReporterMultiplePods verifies that different queue-proxy instances
// maintain separate pod identities and cannot report for each other.
func TestProtobufStatsReporterMultiplePods(t *testing.T) {
Member


I think we can drop this test. It's not really doing anything more than TestProtobufStatsReporterPodNameImmutable

// that statistics are correctly attributed to the appropriate revision.
//
// Related to: https://github.com/knative/serving/issues/11015
func TestScrapeOnlyScrapesPodsForCorrectRevision(t *testing.T) {
Member


The test scraper only really supports scraping one revision's pods - so we shouldn't seed stats for revision B

// Create a scraper specifically for revision-a
// The fake client will return stats for all 4 pods, but the scraper
// should only scrape the 2 pods belonging to revision-a
allStats := append(statsRevA, statsRevB...)
Member


Stats here shouldn't include statsRevB because the test scraper returns stats in the order of invocation

For example - if I were to call scrape 4 times, you'd get revision B stats being returned by the scraper. This would mean the mock is not behaving like the real scraper.

@evankanderson
Member

Yes, the original issue is to verify that when the autoscaler is reading metrics from a specific pod. In particular, https://github.com/knative/serving/blob/main/pkg/autoscaler/metrics/stats_scraper.go#L326 passes a stat object deserialized from the remote "queue-proxy" at https://github.com/knative/serving/blob/main/pkg/autoscaler/metrics/http_scrape_client.go#L80.

We can't actually tell whether it's a queue proxy or something malicious; we do prevent a DoS by limiting the buffer size. We should also prevent it from replying for a different Knative service, but I don't think anyone spent the time to analyze the call stack to see if this was a reasonable concern. We seem to have a serviceScraper for each Knative revision (if I read the code correctly), which seems to be using a Knative-created k8s Service to locate the pods. It looks like the Stat message may not include namespace / revision information, so this may be a moot attack vector.
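
As a generic illustration of the buffer-limiting pattern mentioned above (hypothetical names and limit, not the actual http_scrape_client code):

```go
package scrape

import (
	"fmt"
	"io"
	"net/http"
)

// maxStatBytes is a hypothetical cap on how much we will read from a scrape
// target before decoding, so a malicious endpoint cannot exhaust memory.
const maxStatBytes = 32 * 1024

// readBoundedBody reads at most maxStatBytes from the response and rejects
// anything larger, instead of trusting the remote end to be a well-behaved
// queue-proxy.
func readBoundedBody(resp *http.Response) ([]byte, error) {
	defer resp.Body.Close()
	body, err := io.ReadAll(io.LimitReader(resp.Body, maxStatBytes+1))
	if err != nil {
		return nil, err
	}
	if len(body) > maxStatBytes {
		return nil, fmt.Errorf("scrape response larger than %d bytes", maxStatBytes)
	}
	return body, nil
}
```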

@evankanderson
Member

I've read the code. It appears that the autoscaler works as follows:

  1. MetricCollector implements Collector for interfacing with the k8s watcher interface, and MetricClient for interfacing with the rest of the scaling system.
  2. Internally, MetricCollector keeps a map of NamespacedName revisions to stats collectors. This map is updated by watching the Kubernetes apiserver for updates to the autoscaling.internal.knative.dev/v1alpha1/Metric custom resource.
  3. For each Metric resource, the autoscaler constructs a separate watcher on the pods specified by the k8s Service reference in the Metric (same-namespace). The watcher tracks the lifecycle of a StatsScraper and the metric aggregations associated with the Metric.
  4. Stat reports are never accumulated outside a Metric custom resource.
  5. The rest of the autoscaling infrastructure requests scaling values based on the NamespacedName from item 2; if there is no corresponding Metric object, an ErrNotCollecting will be returned.
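
As a rough sketch of the structure described in points 1-5 (hypothetical names, not the actual MetricCollector code):

```go
package autoscaler

import (
	"errors"
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// collection is a hypothetical stand-in for the per-Metric state: the scraper
// lifecycle and the aggregated stats for exactly one revision (points 3-4).
type collection struct {
	scrapeTarget string // the Knative-created k8s Service referenced by the Metric
	// aggregated Stat reports live here and nowhere else
}

// collector mirrors point 2: one entry per Metric custom resource, keyed by
// the revision's namespace/name and maintained from apiserver watch events.
type collector struct {
	mu          sync.RWMutex
	collections map[types.NamespacedName]*collection
}

var errNotCollecting = errors.New("no Metric resource is being collected for this key")

// get mirrors point 5: scaling queries resolve purely by key, so stats
// reported for one revision can never feed another revision's entry.
func (c *collector) get(key types.NamespacedName) (*collection, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	col, ok := c.collections[key]
	if !ok {
		return nil, errNotCollecting
	}
	return col, nil
}
```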

Based on this structure, the most damage an attacker with edit access to all types in a namespace (the normal edit role does not have Metric permissions) could do would be:

  • Create additional Metric objects -- add some additional work to the autoscaler.
  • Change a Metric's linked k8s Service to point to a non-queue-proxy pod (either in their own namespace or someone else's).
    • For a within-namespace pod, you could report additional load and cause additional scaling
    • For an outside-namespace pod, you might also be able to intuit some measure of load / scale. To do this, you would need to be able to enumerate the Pod IPs of the target namespace to fill them into the Service's Endpoints. But if you can already enumerate the Pod IPs of the target namespace, you already have some scaling information...

In particular, it would not be possible to use access to the Metric or Stat responses from an attacker controlled pod to change the scaling for a Revision which is outside the attacker's namespace.

@evankanderson
Member

That's not to say that these might not be useful changes, but the "investigate" part of the original issue really was "investigate" before proposing fixes.

@dprotaso
Member

dprotaso commented Feb 2, 2026

FWIW - I'm good to merge the PR (with the requested changes) though I'd say it doesn't fix the target issue.

