Skip to content

Conversation

@russoz
Copy link
Collaborator

@russoz russoz commented Dec 3, 2025

SUMMARY

This PR will try to leverage Copilot to find why monit tests fail/pass inconsistently.

References:

ISSUE TYPE
  • Test Pull Request
COMPONENT NAME

monit

@ansibullbot
Copy link
Collaborator

@ansibullbot ansibullbot added WIP Work in progress module module plugins plugin (any type) small_patch Hopefully easy to review labels Dec 3, 2025
@russoz
Copy link
Collaborator Author

russoz commented Dec 3, 2025

ha, copilot cannot be used and the PR comes from one repo to another... 🤦

@russoz russoz closed this Dec 3, 2025
@russoz russoz reopened this Dec 3, 2025
@russoz russoz closed this Dec 3, 2025
@russoz russoz reopened this Dec 3, 2025
@russoz russoz closed this Dec 3, 2025
@russoz russoz reopened this Dec 3, 2025
@ansibullbot ansibullbot removed the small_patch Hopefully easy to review label Dec 3, 2025
@russoz
Copy link
Collaborator Author

russoz commented Dec 3, 2025

I managed to get Copilot to work locally in VSCode and to analyse the issue. It seems the problem is with a specific range of versions of Monit (5.34.x). Here is Copilot's summary:

Investigation Results: Monit 5.34.x Regression

After analyzing the CI failures, I've identified the root cause is a known bug in monit version 5.34.x.

Test Failure Analysis

Build #165134 (20251203.7) Results:

Platform Monit Version Result Error
Alpine 3.21 - 2 5.34.2 ❌ FAILED rc: -15 (SIGTERM)
Debian 13 Trixie - 2 5.34.3 ❌ FAILED rc: -15 (SIGTERM)
Alpine 3.20 - 2 5.33.0 ✅ SUCCESS -
Alpine 3.22 - 2 5.35.0 ✅ SUCCESS -
Debian 11 Bullseye - 2 5.27.2 ✅ SUCCESS -
Debian 12 Bookworm - 2 5.33.0 ✅ SUCCESS -

Upstream Bug Evidence

According to the official monit changelog:

Version 5.34.1 (Oct 4 2024):

  • Fixed: The check program now avoids leaving zombie processes between cycles.
    • This fix introduced a regression

Version 5.35.0 (Apr 10 2025) - Fixes:

  1. Fixed: Monit may hang on exit, if a program executed by check program was pending while monit tried to quit (regression from Monit 5.34.1).
  2. Fixed: Issue #1115: Service checks may sporadically hang if the PID is recycled by the system quickly, e.g. due to PID randomization (regression from Monit 5.34.1).

Conclusion

The test failures are NOT caused by:

  • ❌ Timing issues in our tests
  • ❌ Race conditions in our code
  • ❌ Module logic bugs

The failures are correctly identifying platforms running the broken monit versions (5.34.1, 5.34.2, 5.34.3, 5.34.4). The tests appropriately receive SIGTERM when monit hangs during operation.

The tests are marked as unstable and are only required to pass when the module itself is modified. Since these failures are due to a known upstream bug (not our code), the intermittent failures on 5.34.x platforms are expected and acceptable.

Recommendation

No changes needed to the monit module or tests. The current unstable marker is appropriate for these integration tests.

@russoz
Copy link
Collaborator Author

russoz commented Dec 3, 2025

Copilot is quick to make strong statements and declarations, I would take those with a pinch of salt, but it does look like there is something there. Next step is to try blocking those versions from being installed and see what happens.

@ansibullbot ansibullbot added integration tests/integration tests tests labels Dec 3, 2025
@russoz russoz force-pushed the monit-test-copilot branch from 58badc8 to e01a548 Compare December 3, 2025 08:28
@ansibullbot ansibullbot added needs_ci This PR requires CI testing to be performed. Please close and re-open this PR to trigger CI and removed needs_ci This PR requires CI testing to be performed. Please close and re-open this PR to trigger CI labels Dec 3, 2025
@russoz
Copy link
Collaborator Author

russoz commented Dec 3, 2025

Ok, changes the integration test to skip the test when the installed monit is 5.34.x.

First run, all checks passed.

@russoz russoz closed this Dec 3, 2025
@russoz russoz reopened this Dec 3, 2025
@russoz
Copy link
Collaborator Author

russoz commented Dec 3, 2025

Second run, all passed.

@russoz
Copy link
Collaborator Author

russoz commented Dec 3, 2025

I have sampled the logs - the tests are running normally when monit is not 5.34, for both lower and higher versions. Will close and reopen a couple of times more, for the sake of statistics, but it is looking good.

@russoz russoz closed this Dec 3, 2025
@russoz russoz reopened this Dec 3, 2025
@russoz
Copy link
Collaborator Author

russoz commented Dec 3, 2025

Round 3, all checks passed

@russoz
Copy link
Collaborator Author

russoz commented Dec 3, 2025

Round 9 (with delay): Ansible devel, Ubuntu 24.04

... and there goes our nice theory down the drain. Exact same error, in the exact same place. Monit 5.33 so it's not the version - though it looks like the 5.34 jobs were big time offenders.

I will resume this tomorrow.

@russoz russoz closed this Dec 3, 2025
@russoz russoz reopened this Dec 3, 2025
@russoz
Copy link
Collaborator Author

russoz commented Dec 3, 2025

Round 10, all checks passed.

@russoz russoz closed this Dec 3, 2025
@russoz russoz reopened this Dec 3, 2025
@russoz
Copy link
Collaborator Author

russoz commented Dec 3, 2025

Round 11, all checks passed.

@russoz russoz closed this Dec 3, 2025
@russoz russoz reopened this Dec 3, 2025
@russoz
Copy link
Collaborator Author

russoz commented Dec 4, 2025

Round 12, FAIL with SIGTERM(rc=-15)

  • Ubuntu 24.04 - Monit 5.33
  • Debian 11 - Monit 5.27 (!!!!)

So, this is most likely NOT a problem with monit.

The wait task was checking 'monit status' (general), but the actual
failing command is 'monit status -B httpd_echo' (service-specific).
This causes a race where general status succeeds but service queries
fail. Update to check the exact command format that will be used.
@russoz
Copy link
Collaborator Author

russoz commented Dec 8, 2025

Round 13, all checks passed.

Again!

@russoz russoz closed this Dec 8, 2025
@russoz russoz reopened this Dec 8, 2025
@russoz
Copy link
Collaborator Author

russoz commented Dec 8, 2025

Round 14, all checks passed

@russoz russoz closed this Dec 8, 2025
@russoz russoz reopened this Dec 8, 2025
The version restriction was based on incorrect diagnosis. The actual
issue was the readiness check validating general status instead of
service-specific queries. Now that we check the correct command
format, the tests should work across all monit versions.
@russoz
Copy link
Collaborator Author

russoz commented Dec 8, 2025

Round 15 failed for debian 13 and alpine 3.21 (both running monit 5.34.x)

@russoz russoz closed this Dec 8, 2025
@russoz russoz reopened this Dec 8, 2025
@russoz
Copy link
Collaborator Author

russoz commented Dec 8, 2025

Round 16 - failed for debian 13 (5.34.3)

@russoz russoz closed this Dec 8, 2025
@russoz russoz reopened this Dec 8, 2025
@russoz
Copy link
Collaborator Author

russoz commented Dec 8, 2025

Round 17 - FAIL, Alpine 3.22 (Monit 5.35)

@russoz russoz closed this Dec 8, 2025
@russoz russoz reopened this Dec 8, 2025
@russoz
Copy link
Collaborator Author

russoz commented Dec 8, 2025

Round 18 - FAIL for Debian 13 (Monit 5.34)

After the readiness check succeeds, add a 1-second pause before
running actual tests. Monit 5.34.x and 5.35 appear to have a
concurrency issue where rapid successive 'monit status -B' calls
can cause hangs even though the first call succeeds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug This issue/PR relates to a bug integration tests/integration module module plugins plugin (any type) tests tests WIP Work in progress

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants