Commit ac37544
monit: investigating tests again - using copilot on this one (#11255)
* add monit version to successful exit
* install the standard monit - if 5.34, then bail out
* add 3sec wait after service restart
- that restart happens exactly before the task receiving the SIGTERM, so maybe, just maybe, it just needs time to get ready for the party
* wait for monit initialisation after restart
* monit tests: check service-specific status in readiness wait
The wait task was checking 'monit status' (general), but the actual
failing command is 'monit status -B httpd_echo' (service-specific).
This causes a race where general status succeeds but service queries
fail. Update to check the exact command format that will be used.
* monit tests: remove 5.34.x version restriction
The version restriction was based on incorrect diagnosis. The actual
issue was the readiness check validating general status instead of
service-specific queries. Now that we check the correct command
format, the tests should work across all monit versions.
* monit tests: add stabilization delay after readiness check
After the readiness check succeeds, add a 1-second pause before
running actual tests. Monit 5.34.x and 5.35 appear to have a
concurrency issue where rapid successive 'monit status -B' calls
can cause hangs even though the first call succeeds.
* monit tests: add retry logic for state changes to handle monit daemon hangs
Monit daemon has an intermittent concurrency bug across versions 5.27-5.35
where 'monit status -B' commands can hang (receiving SIGTERM) even after
the daemon has successfully responded to previous queries. This appears
to be a monit daemon issue, not a timing problem.
Add retry logic with 2-second delays to the state change task to work
around these intermittent hangs. Skip retries if the failure is not
SIGTERM (rc=-15) to avoid masking real errors.
* monit tests: capture and display monit.log for debugging
Add tasks in the always block to capture and display the monit log file.
This will help diagnose the intermittent hanging issues by showing what
monit daemon was doing when 'monit status -B' commands hang.
* monit tests: enable verbose logging (-v flag)
Modify the monit systemd service to start with -v flag for verbose
logging. This should provide more detailed information in the monit
log about what's happening when status commands hang.
* monit: add 0.5s delay after state change command
After extensive testing and analysis with verbose logging enabled, identified
that monit's HTTP interface can become temporarily unresponsive immediately
after processing state change commands (stop, start, restart, etc.).
This manifests as intermittent SIGTERM (rc=-15) failures when the module
calls 'monit status -B <service>' to verify the state change. The issue
affects all monit versions tested (5.27-5.35) and is intermittent, suggesting
a race condition or brief lock in monit's HTTP request handling.
Verbose logging confirmed:
- State change commands complete successfully
- HTTP server reports as 'started'
- But subsequent status checks can hang without any log entry
Adding a 0.5 second sleep after sending state change commands gives the
monit daemon time to fully process the command and become responsive again
before the first status verification check.
This complements the existing readiness check after daemon restart and
the retry logic for SIGTERM failures in the tests.
* tests(monit): remove workarounds after module race condition fix
After 10+ successful CI runs with no SIGTERM failures, removing test-level
workarounds that are now redundant due to the 0.5s delay fix in the module:
- Remove 1-second stabilization pause after daemon restart
The module's built-in 0.5s delay after state changes makes this unnecessary
- Remove retry logic for SIGTERM failures in state change tests
The race condition is now prevented at the module level
- Remove verbose logging setup and log capture
Verbose mode didn't log HTTP requests, so it didn't help diagnose the issue
and adds unnecessary overhead
Kept the readiness check with retries after daemon restart - still needed
to validate daemon is responsive after service restart (different scenario
than the state change race condition).
* restore tasks/main.yml
* monit tests: reduce readiness check retries from 60 to 10
After successful CI runs, observed that monit daemon becomes responsive
within 1-2 seconds after restart. The readiness check typically passes
on the first attempt.
Reducing from 60 retries (30s timeout) to 10 retries (5s timeout) is
more appropriate and allows tests to fail faster if something is
genuinely broken.
* add changelog frag
* Update changelogs/fragments/11255-monit-integrationtests.yml
Co-authored-by: Felix Fontein <[email protected]>
---------
Co-authored-by: Felix Fontein <[email protected]>1 parent a977c6f commit ac37544
File tree
3 files changed
+21
-1
lines changed- changelogs/fragments
- plugins/modules
- tests/integration/targets/monit/tasks
3 files changed
+21
-1
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
152 | 152 | | |
153 | 153 | | |
154 | 154 | | |
155 | | - | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
156 | 161 | | |
157 | 162 | | |
158 | 163 | | |
| |||
272 | 277 | | |
273 | 278 | | |
274 | 279 | | |
| 280 | + | |
| 281 | + | |
| 282 | + | |
275 | 283 | | |
276 | 284 | | |
277 | 285 | | |
| |||
Lines changed: 8 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
52 | 52 | | |
53 | 53 | | |
54 | 54 | | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
55 | 63 | | |
56 | 64 | | |
57 | 65 | | |
| |||
0 commit comments