fix: record failed lifecycle operations #253

jason-lynch · 2026-01-19T20:07:29Z

Summary

Adds the ability to record partial failures from resource lifecycle methods by adding a new ResourceData.Error field. This field is populated when an error occurs during Create and Update events. In this case, we'll persist failed resource back to the state before we terminate the plan. Combined with the existing NeedsCreate field, we can record when an operation needs to be retried on the next run. Recording partial failures also ensures resources can be properly deleted if the user opts to delete rather than retry.

Changes

This PR is split into two commits:

A refactor-only commit that moves some resource event logic from the workflows package into the resource package to consolidate that logic and make it easier to unit test.
A commit that:

Adds Error field to ResourceData
Adds a case in planCreates to ensure failed updates are retried
Modifies the event apply logic to allow for partial failures from Create and Update events
Modifies Workflows.applyEvents to propagate lifecycle errors after persisting the resource back to state

Testing

To test the issue from the ticket:

# create a database with an invalid restore config
cp1-req create-database <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "restore_config": {
      "source_database_id": "storefront",
      "source_node_name": "n1",
      "source_database_name": "storefront",
      "repository": {
        "id": "foo",
        "type": "posix",
        "base_path": "/backups"
      }
    },
    "nodes": [{ "name": "n1", "host_ids": ["host-1"] }]
  }
}
EOF

# it will take about 5 minutes for the database to fail
# NOTE: you can reduce this timeout in server/internal/orchestrator/swarm/postgres_service.go if you don't
# want to wait.

# after it fails, delete the database
cp1-req delete-database storefront | cp-follow-task

PLAT-212

Summary by CodeRabbit

New Features
- Introduced event-driven resource lifecycle handling with improved error propagation and state synchronization.
- Added explicit resource error reporting so lifecycle failures surface in planning and apply flows.
- Simplified apply execution and centralized event logging for clearer operation traces.
Bug Fixes
- Fixed an issue preventing database deletion when Swarm service creation failed.
Tests
- Added comprehensive tests covering event handling, error paths, and lifecycle scenarios.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2026-01-19T20:07:44Z

📝 Walkthrough

Walkthrough

Adds an event-driven resource state system: new Event type with Apply handlers for refresh/create/update/delete, propagates resource operation errors into ResourceData.Error, updates planner/tests to emit error events, and integrates event application into workflow activities and logging.

Changes

Cohort / File(s)	Summary
Event System Core `server/internal/resource/event.go`, `server/internal/resource/event_test.go`	New public `Event` type, `EventType`/`EventReason` constants, `Apply(ctx, rc)` dispatching to refresh/create/update/delete handlers, `ResourceError()` accessor, and tests covering event behavior and error propagation.
ResourceData Enhancement `server/internal/resource/resource.go`	Added `Error string` to `ResourceData` and preserved it in `Clone()`.
State Management `server/internal/resource/state.go`, `server/internal/resource/state_test.go`	Removed prior inline event declarations from state.go; planner now emits Update events with HasError when `ResourceData.Error` is set. Tests refactored to simulate NotFound and Error conditions and validate plan outcomes.
Workflow Integration `server/internal/workflows/activities/apply_event.go`, `server/internal/workflows/common.go`	ApplyEvent flow simplified to call `event.Apply()`; `logResourceEvent` signature changed to accept `resource.Event` and `resource.Context`; common.go now collects non-refresh `ResourceError()` results.
Changelog `changes/unreleased/Fixed-20260119-095716.yaml`	New unreleased changelog entry documenting a fix: prevent DB deletion blockage when Swarm service creation fails.

Sequence Diagram

sequenceDiagram
    participant WA as Workflow Activity
    participant Evt as Event
    participant RC as Resource Context
    participant Res as Resource Implementation
    participant RD as ResourceData

    WA->>Evt: Apply(ctx, rc)
    Evt->>Evt: switch on Type

    alt Create
        Evt->>RC: load resource handler
        Evt->>Res: Create(ctx)
        alt success
            Res-->>RD: updated state
        else failure
            Res-->>Evt: error
            Evt->>RD: set Error, NeedsRecreate
        end
    else Update
        Evt->>Res: Update(ctx)
        alt success
            Res-->>RD: updated state
        else failure
            Res-->>Evt: error
            Evt->>RD: set Error
        end
    else Delete
        Evt->>Res: Delete(ctx)
        alt success
            Res-->>RD: removed
        else failure
            Res-->>Evt: error (propagate)
        end
    else Refresh
        Evt->>Res: Refresh(ctx)
        Evt->>RD: preserve NeedsRecreate/Error or surface refresh error
    end

    Evt->>RC: ToResourceData(resource)
    Evt-->>WA: return error or nil

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

feat: scoped tasks #248: Updates the same logResourceEvent signature and call sites in server/internal/workflows/activities/apply_event.go.

Suggested reviewers

rshoemaker

Poem

🐰 Hop, a gentle event on the way,
Errors tucked into ResourceData's clay,
Apply calls run straight, no goroutine maze,
Handlers hum in tidy, test-covered blaze —
A rabbit nods, and hops in praise.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix: record failed lifecycle operations' accurately and concisely describes the main change: recording failures from resource lifecycle methods through a new Error field.
Description check	✅ Passed	The PR description is well-structured and covers all required sections of the template with clear explanations of the changes, testing instructions, and the corresponding issue reference.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

server/internal/workflows/activities/apply_event.go (1)
136-167: Task logs can show success even when Create/Update fails.
event.Apply captures Create/Update failures in Resource.Error without returning an error, so the current log path marks these as success. This makes task logs misleading while the plan still fails later. Consider logging ResourceError() as an error entry but not returning it, so state persistence isn’t blocked.
✅ Suggested fix (log resource error without failing activity)
 	start := time.Now()
-	applyErr := event.Apply(ctx, rc)
+	applyErr := event.Apply(ctx, rc)
+	resourceErr := event.ResourceError()
 	duration := time.Since(start)

 	fields["duration_ms"] = duration.Milliseconds()

 	if applyErr != nil {
 		fields["success"] = false
 		fields["error"] = applyErr.Error()
@@
 		return applyErr
 	}
+
+	if resourceErr != nil {
+		fields["success"] = false
+		fields["error"] = resourceErr.Error()
+		err := log(task.LogEntry{
+			Message: fmt.Sprintf("error while %s resource %s", verb, resourceIdentifier),
+			Fields:  fields,
+		})
+		if err != nil {
+			return fmt.Errorf("failed to record event error: %w", err)
+		}
+		return nil
+	}

 	fields["success"] = true

🤖 Fix all issues with AI agents

In `@server/internal/resource/event_test.go`:
- Around line 127-133: The success branch in the test currently only checks
resource equality and can miss unexpected errors from event.Apply; update the
else branch to assert.NoError(t, err) before asserting equality so that when
tc.expectedErr == "" you first verify err == nil (call assert.NoError with the
test context t and the err returned by event.Apply) and then check that
event.Resource equals expected; reference event.Apply, err, event.Resource and
tc.expectedErr when locating the change.

🧹 Nitpick comments (1)

server/internal/resource/event.go (1)
45-65: Optional: guard against nil Event/Resource before registry lookup.
This prevents a possible nil dereference and improves error messaging if upstream ever passes a malformed event.
🔧 Suggested defensive check
 func (e *Event) Apply(ctx context.Context, rc *Context) error {
+	if e == nil || e.Resource == nil {
+		return errors.New("event has no resource data")
+	}
 	resource, err := rc.Registry.Resource(e.Resource)
 	if err != nil {
 		return err
 	}

server/internal/resource/event_test.go

Adds a new `Event.Apply` method that's responsible for calling the correct lifecycle method on the resource, managing the resource data fields, and determining when to return an error. This change helps to consolidate the resource model logic in the `resource` package. PLAT-212

Adds the ability to record partial failures from resource lifecycle methods by adding a new `ResourceData.Error` field. This field is populated when an error occurs during `Create` and `Update` events. In this case, we'll persist failed resource back to the state before we terminate the plan. Combined with the existing `NeedsCreate` field, we can record when an operation needs to be retried on the next run. Recording partial failures also makes it so resources can be properly deleted if the user opts to perform a delete rather than a retry. PLAT-212

coderabbitai bot reviewed Jan 19, 2026

View reviewed changes

server/internal/resource/event_test.go Show resolved Hide resolved

jason-lynch added 2 commits January 19, 2026 15:17

jason-lynch force-pushed the fix/PLAT-212/record-failed-creates branch from 71606db to 95f2bbd Compare January 19, 2026 20:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: record failed lifecycle operations #253

fix: record failed lifecycle operations #253

Uh oh!

jason-lynch commented Jan 19, 2026 •

edited

Loading

Uh oh!

coderabbitai bot commented Jan 19, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix: record failed lifecycle operations #253

Are you sure you want to change the base?

fix: record failed lifecycle operations #253

Uh oh!

Conversation

jason-lynch commented Jan 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Jan 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jason-lynch commented Jan 19, 2026 •

edited

Loading

coderabbitai bot commented Jan 19, 2026 •

edited

Loading