Skip to content

Commit ac37544

Browse files
russozfelixfontein
andauthored
monit: investigating tests again - using copilot on this one (#11255)
* add monit version to successful exit * install the standard monit - if 5.34, then bail out * add 3sec wait after service restart - that restart happens exactly before the task receiving the SIGTERM, so maybe, just maybe, it just needs time to get ready for the party * wait for monit initialisation after restart * monit tests: check service-specific status in readiness wait The wait task was checking 'monit status' (general), but the actual failing command is 'monit status -B httpd_echo' (service-specific). This causes a race where general status succeeds but service queries fail. Update to check the exact command format that will be used. * monit tests: remove 5.34.x version restriction The version restriction was based on incorrect diagnosis. The actual issue was the readiness check validating general status instead of service-specific queries. Now that we check the correct command format, the tests should work across all monit versions. * monit tests: add stabilization delay after readiness check After the readiness check succeeds, add a 1-second pause before running actual tests. Monit 5.34.x and 5.35 appear to have a concurrency issue where rapid successive 'monit status -B' calls can cause hangs even though the first call succeeds. * monit tests: add retry logic for state changes to handle monit daemon hangs Monit daemon has an intermittent concurrency bug across versions 5.27-5.35 where 'monit status -B' commands can hang (receiving SIGTERM) even after the daemon has successfully responded to previous queries. This appears to be a monit daemon issue, not a timing problem. Add retry logic with 2-second delays to the state change task to work around these intermittent hangs. Skip retries if the failure is not SIGTERM (rc=-15) to avoid masking real errors. * monit tests: capture and display monit.log for debugging Add tasks in the always block to capture and display the monit log file. This will help diagnose the intermittent hanging issues by showing what monit daemon was doing when 'monit status -B' commands hang. * monit tests: enable verbose logging (-v flag) Modify the monit systemd service to start with -v flag for verbose logging. This should provide more detailed information in the monit log about what's happening when status commands hang. * monit: add 0.5s delay after state change command After extensive testing and analysis with verbose logging enabled, identified that monit's HTTP interface can become temporarily unresponsive immediately after processing state change commands (stop, start, restart, etc.). This manifests as intermittent SIGTERM (rc=-15) failures when the module calls 'monit status -B <service>' to verify the state change. The issue affects all monit versions tested (5.27-5.35) and is intermittent, suggesting a race condition or brief lock in monit's HTTP request handling. Verbose logging confirmed: - State change commands complete successfully - HTTP server reports as 'started' - But subsequent status checks can hang without any log entry Adding a 0.5 second sleep after sending state change commands gives the monit daemon time to fully process the command and become responsive again before the first status verification check. This complements the existing readiness check after daemon restart and the retry logic for SIGTERM failures in the tests. * tests(monit): remove workarounds after module race condition fix After 10+ successful CI runs with no SIGTERM failures, removing test-level workarounds that are now redundant due to the 0.5s delay fix in the module: - Remove 1-second stabilization pause after daemon restart The module's built-in 0.5s delay after state changes makes this unnecessary - Remove retry logic for SIGTERM failures in state change tests The race condition is now prevented at the module level - Remove verbose logging setup and log capture Verbose mode didn't log HTTP requests, so it didn't help diagnose the issue and adds unnecessary overhead Kept the readiness check with retries after daemon restart - still needed to validate daemon is responsive after service restart (different scenario than the state change race condition). * restore tasks/main.yml * monit tests: reduce readiness check retries from 60 to 10 After successful CI runs, observed that monit daemon becomes responsive within 1-2 seconds after restart. The readiness check typically passes on the first attempt. Reducing from 60 retries (30s timeout) to 10 retries (5s timeout) is more appropriate and allows tests to fail faster if something is genuinely broken. * add changelog frag * Update changelogs/fragments/11255-monit-integrationtests.yml Co-authored-by: Felix Fontein <[email protected]> --------- Co-authored-by: Felix Fontein <[email protected]>
1 parent a977c6f commit ac37544

File tree

3 files changed

+21
-1
lines changed

3 files changed

+21
-1
lines changed
Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
minor_changes:
2+
- monit - add ``monit_version`` return value also when the module has succeeded (https://github.com/ansible-collections/community.general/pull/11255).
3+
bugfixes:
4+
- monit - add delay of 0.5 seconds after state change and check for status (https://github.com/ansible-collections/community.general/pull/11255).

plugins/modules/monit.py

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -152,7 +152,12 @@ def exit_fail(self, msg, status=None, **kwargs):
152152
self.module.fail_json(**kwargs)
153153

154154
def exit_success(self, state):
155-
self.module.exit_json(changed=True, name=self.process_name, state=state)
155+
self.module.exit_json(
156+
changed=True,
157+
name=self.process_name,
158+
monit_version=self._raw_version,
159+
state=state,
160+
)
156161

157162
@property
158163
def command_args(self):
@@ -272,6 +277,9 @@ def present(self):
272277
def change_state(self, state: str, expected_status: StatusValue, invert_expected: bool | None = None):
273278
current_status = self.get_status()
274279
self.run_command(STATE_COMMAND_MAP[state])
280+
# Give monit daemon a moment to process the command before checking status
281+
# to avoid race condition where HTTP interface may be temporarily unresponsive
282+
time.sleep(0.5)
275283
status = self.wait_for_status_change(current_status)
276284
status = self.wait_for_monit_to_stop_pending(status)
277285
status_match = status.state == expected_status

tests/integration/targets/monit/tasks/test_reload_present.yml

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -52,6 +52,14 @@
5252
name: monit
5353
state: restarted
5454

55+
- name: wait for monit daemon to be responsive after restart
56+
command: monit status -B httpd_echo
57+
register: monit_status_check
58+
changed_when: false
59+
until: monit_status_check.rc == 0
60+
retries: 10
61+
delay: 0.5
62+
5563
- name: test process present again
5664
monit:
5765
name: httpd_echo

0 commit comments

Comments
 (0)