After upgrading CRUD to 1.7.0, we observed unstable behavior on storage nodes when the expirationd role is enabled. During role application, the fiber executing apply() may be killed unexpectedly. This happens while CRUD is performing storage calls in “fast mode”. As a result, role application is interrupted and fails. The issue is caused by how CRUD currently identifies and terminates its internal fibers.
Root cause
CRUD relies on fiber.name() to identify its internal “fast-mode” fibers and kills fibers by matching their name. Treating the name as a fiber's identity is unsafe:
- The changed fiber name persists after the CRUD request has completed.
- As a result, CRUD may later kill a different, unrelated fiber.
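
The failure mode above can be illustrated with a small, self-contained model. This is a sketch in Python, not CRUD's actual code: the Fiber class, the FAST_NAME marker value, and all function names are hypothetical. It also assumes that the fiber which served the fast-mode request is later reused for unrelated work such as apply(), which the report does not state explicitly. Tracking fibers by id, shown for contrast, is only one possible direction and not a confirmed fix.

```python
import itertools

# Hypothetical marker name; the real value used by CRUD is not given in the report.
FAST_NAME = "crud-fast"


class Fiber:
    """Toy stand-in for a worker fiber that can be reused across tasks."""

    _next_id = itertools.count(1)

    def __init__(self, name):
        self.id = next(Fiber._next_id)
        self.name = name


def run_fast_request(fiber, tracked_ids):
    """Serve one fast-mode request on the given fiber."""
    fiber.name = FAST_NAME       # rename the fiber to mark it as fast-mode
    tracked_ids.add(fiber.id)    # explicit registration while the request is in flight
    # ... the storage call would run here ...
    tracked_ids.discard(fiber.id)
    # The name is deliberately not restored: this models the stale name
    # that persists after the request has completed.


def victims_by_name(fibers):
    """Unsafe selection: every fiber whose name matches the marker."""
    return [f for f in fibers if f.name == FAST_NAME]


def victims_by_id(fibers, tracked_ids):
    """Selection by explicit registration: only fibers with a request in flight."""
    return [f for f in fibers if f.id in tracked_ids]


def demo():
    tracked = set()
    worker = Fiber("worker")
    run_fast_request(worker, tracked)
    # The same fiber is now assumed to be running unrelated work,
    # for example a role's apply(), while still carrying the stale name.
    return victims_by_name([worker]), victims_by_id([worker], tracked)
```

In this model, demo() selects the reused worker for termination when matching by name and selects nothing when matching by registered id, because the registration is removed when the request ends while the name is not.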