Skip to content

Conversation

@jason-lynch
Copy link
Member

@jason-lynch jason-lynch commented Jan 19, 2026

Summary

Adds the ability to record partial failures from resource lifecycle methods by adding a new ResourceData.Error field. This field is populated when an error occurs during Create and Update events. In this case, we'll persist failed resource back to the state before we terminate the plan. Combined with the existing NeedsCreate field, we can record when an operation needs to be retried on the next run. Recording partial failures also ensures resources can be properly deleted if the user opts to delete rather than retry.

Changes

This PR is split into two commits:

  1. A refactor-only commit that moves some resource event logic from the workflows package into the resource package to consolidate that logic and make it easier to unit test.
  2. A commit that:
  • Adds Error field to ResourceData
  • Adds a case in planCreates to ensure failed updates are retried
  • Modifies the event apply logic to allow for partial failures from Create and Update events
  • Modifies Workflows.applyEvents to propagate lifecycle errors after persisting the resource back to state

Testing

To test the issue from the ticket:

# create a database with an invalid restore config
cp1-req create-database <<EOF | cp-follow-task
{
  "id": "storefront",
  "spec": {
    "database_name": "storefront",
    "database_users": [
      {
        "username": "admin",
        "password": "password",
        "db_owner": true,
        "attributes": ["SUPERUSER", "LOGIN"]
      }
    ],
    "restore_config": {
      "source_database_id": "storefront",
      "source_node_name": "n1",
      "source_database_name": "storefront",
      "repository": {
        "id": "foo",
        "type": "posix",
        "base_path": "/backups"
      }
    },
    "nodes": [{ "name": "n1", "host_ids": ["host-1"] }]
  }
}
EOF

# it will take about 5 minutes for the database to fail
# NOTE: you can reduce this timeout in server/internal/orchestrator/swarm/postgres_service.go if you don't
# want to wait.

# after it fails, delete the database
cp1-req delete-database storefront | cp-follow-task

PLAT-212

Summary by CodeRabbit

  • New Features

    • Introduced event-driven resource lifecycle handling with improved error propagation and state synchronization.
    • Added explicit resource error reporting so lifecycle failures surface in planning and apply flows.
    • Simplified apply execution and centralized event logging for clearer operation traces.
  • Bug Fixes

    • Fixed an issue preventing database deletion when Swarm service creation failed.
  • Tests

    • Added comprehensive tests covering event handling, error paths, and lifecycle scenarios.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Jan 19, 2026

📝 Walkthrough

Walkthrough

Adds an event-driven resource state system: new Event type with Apply handlers for refresh/create/update/delete, propagates resource operation errors into ResourceData.Error, updates planner/tests to emit error events, and integrates event application into workflow activities and logging.

Changes

Cohort / File(s) Summary
Event System Core
server/internal/resource/event.go, server/internal/resource/event_test.go
New public Event type, EventType/EventReason constants, Apply(ctx, rc) dispatching to refresh/create/update/delete handlers, ResourceError() accessor, and tests covering event behavior and error propagation.
ResourceData Enhancement
server/internal/resource/resource.go
Added Error string to ResourceData and preserved it in Clone().
State Management
server/internal/resource/state.go, server/internal/resource/state_test.go
Removed prior inline event declarations from state.go; planner now emits Update events with HasError when ResourceData.Error is set. Tests refactored to simulate NotFound and Error conditions and validate plan outcomes.
Workflow Integration
server/internal/workflows/activities/apply_event.go, server/internal/workflows/common.go
ApplyEvent flow simplified to call event.Apply(); logResourceEvent signature changed to accept *resource.Event and *resource.Context; common.go now collects non-refresh ResourceError() results.
Changelog
changes/unreleased/Fixed-20260119-095716.yaml
New unreleased changelog entry documenting a fix: prevent DB deletion blockage when Swarm service creation fails.

Sequence Diagram

sequenceDiagram
    participant WA as Workflow Activity
    participant Evt as Event
    participant RC as Resource Context
    participant Res as Resource Implementation
    participant RD as ResourceData

    WA->>Evt: Apply(ctx, rc)
    Evt->>Evt: switch on Type

    alt Create
        Evt->>RC: load resource handler
        Evt->>Res: Create(ctx)
        alt success
            Res-->>RD: updated state
        else failure
            Res-->>Evt: error
            Evt->>RD: set Error, NeedsRecreate
        end
    else Update
        Evt->>Res: Update(ctx)
        alt success
            Res-->>RD: updated state
        else failure
            Res-->>Evt: error
            Evt->>RD: set Error
        end
    else Delete
        Evt->>Res: Delete(ctx)
        alt success
            Res-->>RD: removed
        else failure
            Res-->>Evt: error (propagate)
        end
    else Refresh
        Evt->>Res: Refresh(ctx)
        Evt->>RD: preserve NeedsRecreate/Error or surface refresh error
    end

    Evt->>RC: ToResourceData(resource)
    Evt-->>WA: return error or nil
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • feat: scoped tasks #248: Updates the same logResourceEvent signature and call sites in server/internal/workflows/activities/apply_event.go.

Suggested reviewers

  • rshoemaker

Poem

🐰 Hop, a gentle event on the way,
Errors tucked into ResourceData's clay,
Apply calls run straight, no goroutine maze,
Handlers hum in tidy, test-covered blaze —
A rabbit nods, and hops in praise.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: record failed lifecycle operations' accurately and concisely describes the main change: recording failures from resource lifecycle methods through a new Error field.
Description check ✅ Passed The PR description is well-structured and covers all required sections of the template with clear explanations of the changes, testing instructions, and the corresponding issue reference.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/internal/workflows/activities/apply_event.go (1)

136-167: Task logs can show success even when Create/Update fails.
event.Apply captures Create/Update failures in Resource.Error without returning an error, so the current log path marks these as success. This makes task logs misleading while the plan still fails later. Consider logging ResourceError() as an error entry but not returning it, so state persistence isn’t blocked.

✅ Suggested fix (log resource error without failing activity)
 	start := time.Now()
-	applyErr := event.Apply(ctx, rc)
+	applyErr := event.Apply(ctx, rc)
+	resourceErr := event.ResourceError()
 	duration := time.Since(start)

 	fields["duration_ms"] = duration.Milliseconds()

 	if applyErr != nil {
 		fields["success"] = false
 		fields["error"] = applyErr.Error()
@@
 		return applyErr
 	}
+
+	if resourceErr != nil {
+		fields["success"] = false
+		fields["error"] = resourceErr.Error()
+		err := log(task.LogEntry{
+			Message: fmt.Sprintf("error while %s resource %s", verb, resourceIdentifier),
+			Fields:  fields,
+		})
+		if err != nil {
+			return fmt.Errorf("failed to record event error: %w", err)
+		}
+		return nil
+	}

 	fields["success"] = true
🤖 Fix all issues with AI agents
In `@server/internal/resource/event_test.go`:
- Around line 127-133: The success branch in the test currently only checks
resource equality and can miss unexpected errors from event.Apply; update the
else branch to assert.NoError(t, err) before asserting equality so that when
tc.expectedErr == "" you first verify err == nil (call assert.NoError with the
test context t and the err returned by event.Apply) and then check that
event.Resource equals expected; reference event.Apply, err, event.Resource and
tc.expectedErr when locating the change.
🧹 Nitpick comments (1)
server/internal/resource/event.go (1)

45-65: Optional: guard against nil Event/Resource before registry lookup.
This prevents a possible nil dereference and improves error messaging if upstream ever passes a malformed event.

🔧 Suggested defensive check
 func (e *Event) Apply(ctx context.Context, rc *Context) error {
+	if e == nil || e.Resource == nil {
+		return errors.New("event has no resource data")
+	}
 	resource, err := rc.Registry.Resource(e.Resource)
 	if err != nil {
 		return err
 	}

Adds a new `Event.Apply` method that's responsible for calling the
correct lifecycle method on the resource, managing the resource data
fields, and determining when to return an error. This change helps to
consolidate the resource model logic in the `resource` package.

PLAT-212
Adds the ability to record partial failures from resource lifecycle
methods by adding a new `ResourceData.Error` field. This field is
populated when an error occurs during `Create` and `Update` events. In
this case, we'll persist failed resource back to the state before we
terminate the plan. Combined with the existing `NeedsCreate` field, we
can record when an operation needs to be retried on the next run.
Recording partial failures also makes it so resources can be properly
deleted if the user opts to perform a delete rather than a retry.

PLAT-212
@jason-lynch jason-lynch force-pushed the fix/PLAT-212/record-failed-creates branch from 71606db to 95f2bbd Compare January 19, 2026 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants