Skip to content

🐛 Add teardown on revision archival#2502

Open
perdasilva wants to merge 1 commit intooperator-framework:mainfrom
perdasilva:fix-cer-archival-teardown
Open

🐛 Add teardown on revision archival#2502
perdasilva wants to merge 1 commit intooperator-framework:mainfrom
perdasilva:fix-cer-archival-teardown

Conversation

@perdasilva
Copy link
Contributor

@perdasilva perdasilva commented Feb 11, 2026

Description

When a ClusterExtensionRevision transitions to the Archived lifecycle state, invoke the revision engine's Teardown method to clean up managed resources that are no longer part of the active revision. This ensures resources removed between bundle versions (e.g. a ConfigMap present in v1.0.0 but absent in v1.2.0) are deleted from the cluster.

  • Split teardown() into delete() (CER deletion) and archive() (CER archival); only archive() calls revisionEngine.Teardown()
  • Move RevisionEngine creation and watch establishment before the archive check so they are available for both archive and reconcile paths
  • Handle incomplete teardown (requeue after 5s) and teardown errors (propagate for controller retry)
  • Rename rev variable to cer for consistency with the type name
  • Add e2e step ClusterExtensionRevision "<name>" phase objects are not found or not owned by the revision: fetches all objects from the CER's phases and asserts each one either does not exist on the cluster or does not list the CER in its ownerReferences
  • Change extensionObjects in scenarioContext from []client.Object to [][]client.Object to track objects per revision rollout; add GetClusterExtensionObjectsForRevision helper
  • Replace the specific configmap/dummy-configmap not-found assertion in the update e2e scenario with the new generic phase-objects step

Test plan

  • Unit tests pass (Test_ClusterExtensionRevisionReconciler_Reconcile_ArchivalAndDeletion)
  • New unit test: requeue when archived revision teardown is incomplete
  • New unit test: error propagation when archived revision teardown fails
  • New unit test: error propagation when factory fails to create engine during archived teardown
  • E2e: "Each update creates a new revision" verifies all phase objects from the archived CER are either deleted or no longer owned by the revision

🤖 Generated with Claude Code

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

Copilot AI review requested due to automatic review settings February 11, 2026 13:49
@netlify
Copy link

netlify bot commented Feb 11, 2026

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit ea6fe54
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/69970f8bb55b0a0008c8c248
😎 Deploy Preview https://deploy-preview-2502--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@openshift-ci
Copy link

openshift-ci bot commented Feb 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign perdasilva for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates ClusterExtensionRevision teardown behavior to actively tear down Boxcutter-managed resources when a revision is archived, and extends test coverage/data to validate resource removal across bundle versions.

Changes:

  • Invoke RevisionEngine.Teardown during teardown for archived revisions, with retry + requeue when teardown is incomplete.
  • Expand deletion/archival reconciliation tests to cover incomplete teardown and teardown failures (including expected requeue behavior).
  • Add a dummy ConfigMap manifest to the v1.0.0 test bundle to validate removal of resources that disappear in later bundle versions.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
internal/operator-controller/controllers/clusterextensionrevision_controller.go Calls Boxcutter teardown for archived revisions and requeues while teardown is incomplete.
internal/operator-controller/controllers/clusterextensionrevision_controller_test.go Adds test cases for archived teardown retry/error paths and asserts reconcile results.
testdata/images/bundles/test-operator/v1.0.0/manifests/dummy.configmap.yaml Introduces a version-specific manifest used to test removal of resources across upgrades.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov
Copy link

codecov bot commented Feb 11, 2026

Codecov Report

❌ Patch coverage is 75.55556% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.22%. Comparing base (1fd37fa) to head (ea6fe54).

Files with missing lines Patch % Lines
...controllers/clusterextensionrevision_controller.go 75.55% 5 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2502      +/-   ##
==========================================
+ Coverage   69.46%   73.22%   +3.76%     
==========================================
  Files         103      103              
  Lines        8509     8524      +15     
==========================================
+ Hits         5911     6242     +331     
+ Misses       2129     1801     -328     
- Partials      469      481      +12     
Flag Coverage Δ
e2e 45.50% <0.00%> (-0.39%) ⬇️
experimental-e2e 53.64% <55.55%> (+40.87%) ⬆️
unit 57.84% <64.44%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@perdasilva perdasilva force-pushed the fix-cer-archival-teardown branch from 06afd4a to 5ca3f6e Compare February 11, 2026 15:01
@perdasilva perdasilva changed the title 🐛 Teardown revision on archival 🐛 Add teardown on revision archival Feb 12, 2026
Copilot AI review requested due to automatic review settings February 16, 2026 12:10
@perdasilva perdasilva force-pushed the fix-cer-archival-teardown branch from 5ca3f6e to 6e1e134 Compare February 16, 2026 12:10
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@perdasilva perdasilva force-pushed the fix-cer-archival-teardown branch from 6e1e134 to 7c6037b Compare February 17, 2026 11:33
Copilot AI review requested due to automatic review settings February 17, 2026 14:04
@perdasilva perdasilva force-pushed the fix-cer-archival-teardown branch from 7c6037b to b51648b Compare February 17, 2026 14:04
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

@pedjak pedjak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if two revisions are sharing the same resource? how do ensure that we do not delete something accidentally?

setRetryingConditions(rev, err.Error())
return ctrl.Result{}, fmt.Errorf("converting to boxcutter revision: %v", err)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to move this further down?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

depends on whether we want to remove revision creation from the deletion path or not. Since we want the garbage collector to handle deletion, it could avoid transient (or maybe even persistent) errors from getting in the way. If so, it might also make sense to move it closer to where revision is first needed

}

func (c *ClusterExtensionRevisionReconciler) archive(ctx context.Context, cer *ocv1.ClusterExtensionRevision) (machinery.RevisionTeardownResult, error) {
revision, _, err := c.toBoxcutterRevision(ctx, cer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we convert this multiple times, the caller can pass it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was a suggestion by copilot to remove revision creation from the deletion path

if err != nil {
return nil, fmt.Errorf("converting to boxcutter revision: %v", err)
}
revisionEngine, err := c.RevisionEngineFactory.CreateRevisionEngine(ctx, cer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revision engine is also already available at the caller side?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a similar argument as with creating the revision. It's not needed in the deletion path. Maybe we could just separate the deletion and archival paths further. Then we only repeat finalizer removal.
Actually, what do you think about removing the owner ref from archived revisions? They would still carry the owner labels we apply, so they are still attached. But, it gives a cleaner kubectl lineage output XD and might avoid CE reconciliations (although we don't really expect any changes to archived revisions...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The revision engine also has a dependency on the service account, which might be deleted together with the revision. So, it's probably safer to not have revision engine creation on the deletion path.

And ClusterExtensionRevision "${NAME}-2" reports Available as True with Reason ProbesSucceeded
And ClusterExtensionRevision "${NAME}-1" is archived
# dummy-config map exists only in v1.0.0 - once the v1.0.0 revision is archived, it should be gone from the cluster
And resource "configmap/dummy-configmap" is eventually not found
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check if other resources are there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move that check to a more general upgrade test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, there is overlap between the resources. So, some of the v1.0.0 resources would still be present.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, there is overlap between the resources. So, some of the v1.0.0 resources would still be present.

right, should have steps that checks for that explicitly in this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude and I updated things to check that the revision-1 resources either don't exist or are no longer in the object's ownerrefs

@perdasilva perdasilva force-pushed the fix-cer-archival-teardown branch from b51648b to fdcc6d9 Compare February 18, 2026 13:45
Copilot AI review requested due to automatic review settings February 18, 2026 14:31
@perdasilva perdasilva force-pushed the fix-cer-archival-teardown branch from fdcc6d9 to 9105af6 Compare February 18, 2026 14:31
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}

func (c *ClusterExtensionRevisionReconciler) reconcile(ctx context.Context, rev *ocv1.ClusterExtensionRevision) (ctrl.Result, error) {
func (c *ClusterExtensionRevisionReconciler) reconcile(ctx context.Context, cer *ocv1.ClusterExtensionRevision) (ctrl.Result, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is really needed in this PR to rename arg from rev to cer? It makes the diff bigger than it needs to be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it reads better, especially when you have two parallel revision objects. It makes it easier to know that you're operating over the CER vs the Boxcutter revision. It might even be worth going forward and renaming revision to boxcutterRevision or bcRev, or something like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it would be helpful, I could revert it for this PR and do a follow up with just that change

metricsResponse map[string]string

extensionObjects []client.Object
extensionObjects [][]client.Object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you document the semantic of indices?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh whoops - I was supposed to have removed that. My bad.

@perdasilva perdasilva force-pushed the fix-cer-archival-teardown branch from 9105af6 to b306567 Compare February 19, 2026 09:31
Copilot AI review requested due to automatic review settings February 19, 2026 13:06
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

internal/operator-controller/controllers/clusterextensionrevision_controller_test.go:659

  • The "teardown finalizer is removed" test case doesn’t set a DeletionTimestamp or LifecycleState=Archived, but the reconciler only removes the teardown finalizer in delete() (when DeletionTimestamp is set) or after archival teardown completes. As written, this test appears to be exercising the normal reconcile path and should keep the finalizer, so the assertion is likely incorrect.

Suggestion: set rev1.DeletionTimestamp in existingObjs() (to cover the delete path) or adjust the expectation to match the non-deletion reconcile behavior.

			name:           "teardown finalizer is removed",
			revisionResult: mockRevisionResult{},
			existingObjs: func() []client.Object {
				ext := newTestClusterExtension()
				rev1 := newTestClusterExtensionRevision(t, clusterExtensionRevisionName, ext, testScheme)
				rev1.Finalizers = []string{
					"olm.operatorframework.io/teardown",
				}
				return []client.Object{ext, rev1}
			},

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if err := c.TrackingCache.Free(ctx, cer); err != nil {
markAsAvailableUnknown(cer, ocv1.ClusterExtensionRevisionReasonReconciling, err.Error())
return ctrl.Result{}, fmt.Errorf("error stopping informers: %v", err)
}
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconcile() currently calls ensureFinalizer() unconditionally for non-deleted revisions, including revisions already in LifecycleState=Archived. Once an archived revision completes teardown and archive() removes the finalizer, the next reconcile will re-add the finalizer (via ensureFinalizer) and then remove it again in archive(), generating update events and potentially causing an infinite reconcile loop / API churn on archived revisions.

Suggestion: only add the teardown finalizer for non-archived (active) revisions, or gate ensureFinalizer so it does not re-add the finalizer when LifecycleState=Archived and teardown is already complete. Another option is to check controllerutil.ContainsFinalizer and if archived+no-finalizer, skip archive/teardown entirely and return.

Suggested change
}
}
// If the teardown finalizer has already been removed, teardown/archive is complete.
// In that case, skip calling archive() to avoid unnecessary updates and reconcile churn.
if !controllerutil.ContainsFinalizer(cer, clusterExtensionRevisionTeardownFinalizer) {
return ctrl.Result{}, nil
}

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've restructured things to only ensureFinalizer in the Active code path

Comment on lines +585 to +595
obj := phaseObj.Object
kind := obj.GetKind()
name := obj.GetName()
namespace := obj.GetNamespace()

waitFor(ctx, func() bool {
args := []string{"get", kind, name, "--ignore-not-found", "-o", "json"}
if namespace != "" {
args = append(args, "-n", namespace)
}
out, err := k8sClient(args...)
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ClusterExtensionRevisionObjectsNotFoundOrNotOwned, phase objects may omit metadata.namespace (common for namespaced manifests; the controller applies them into spec.namespace/install namespace). When namespace == "", this step runs kubectl get <kind> <name> without -n, which will query the kubectl context’s default namespace and can incorrectly report the object as "not found" even if it exists in ${TEST_NAMESPACE}.

Suggestion: when namespace is empty, resolve the correct namespace (e.g., default to the scenario’s ${TEST_NAMESPACE} / install namespace annotation on the CER), and also handle truly cluster-scoped objects by falling back to a non-namespaced get if the namespaced get fails due to scope.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be fine as the revision resources should always contain the namespace they are to be placed in - and they won't necessarily always be the install namespace

When a ClusterExtensionRevision transitions to the Archived lifecycle
state, invoke the revision engine's Teardown method to clean up managed
resources that are no longer part of the active revision. This ensures
resources removed between bundle versions (e.g. a ConfigMap present in
v1.0.0 but absent in v1.2.0) are deleted from the cluster.

Changes:
- Split teardown() into delete() (CER deletion) and archive() (CER
  archival); only archive() calls revisionEngine.Teardown()
- Move RevisionEngine creation and watch establishment before the
  archive check so they are available for both archive and reconcile
  paths
- Handle incomplete teardown (requeue after 5s) and teardown errors
  (return error for controller retry)
- Rename rev variable to cer for consistency with the type name
- Update unit tests: rename to ArchivalAndDeletion, add test cases for
  incomplete teardown requeue, teardown error propagation, and factory
  creation error during archived teardown
- Add dummy-configmap to v1.0.0 test bundle (absent in v1.2.0) to
  validate teardown on archival in e2e tests
- Add e2e step ClusterExtensionRevision phase objects are not found or
  not owned by the revision: fetches all objects from the CER's phases
  and asserts each one either does not exist or does not list the CER
  in its ownerReferences
- Change extensionObjects in scenarioContext from []client.Object to
  [][]client.Object to track objects per revision rollout; add
  GetClusterExtensionObjectsForRevision helper
- Replace specific configmap-not-found assertion in the update e2e
  scenario with the new generic phase-objects step

Signed-off-by: Per Goncalves da Silva <pegoncal@redhat.com>
Signed-off-by: Per G. da Silva <pegoncal@redhat.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Per G. da Silva <pegoncal@redhat.com>
@perdasilva perdasilva force-pushed the fix-cer-archival-teardown branch from d9f672a to ea6fe54 Compare February 19, 2026 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments