Sentry
- On Monday, check all the alerts from the past weekend
- Fix recurring alerts
- Deploy fixes to production for issues that cause major disruption or complete downtime
- Verify no alerts are being triggered anymore
- More complicated issues should be brought to the team and prioritized correctly.
- Clean up all past alerts so we have an easy-to-navigate dashboard
- Link GitHub and Sentry issues
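
As an illustration of the Monday check, the time range to review can be computed as follows. This is only a sketch: the helper name past_weekend_window is hypothetical, it assumes "weekend" means Saturday 00:00 up to Monday 00:00 in local time, and it does not talk to Sentry at all.

```python
from datetime import date, datetime, time, timedelta


def past_weekend_window(today: date) -> tuple[datetime, datetime]:
    """Return (start, end) covering the most recent full weekend.

    start is Saturday 00:00, end is the following Monday 00:00 (exclusive).
    Hypothetical helper; adjust if your weekend boundaries differ.
    """
    # date.weekday(): Monday == 0 ... Sunday == 6
    monday = today - timedelta(days=today.weekday())
    saturday = monday - timedelta(days=2)
    return (
        datetime.combine(saturday, time.min),
        datetime.combine(monday, time.min),
    )
```

The resulting pair can be used as the date filter when browsing the issue list.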
Stuck PostgreSQL transactions issue
There is an ongoing issue with PostgreSQL transactions getting stuck when the database server restarts or clients lose connection to it. If you see repeating events such as OperationalError - (psycopg2.OperationalError) server closed the connection unexpectedly or PendingRollbackError - Can't reconnect until invalid transaction is rolled back, do the following:
- Check the status of the postgres pod, respin it if necessary
- Respin any pods (service, workers) that generated the error events
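
The triage above can be sketched as a small filter over event titles. This is a rough illustration, not part of our tooling: the (pod name, event title) input shape, the function names, and the pod names in the usage example are all assumptions; only the two matched substrings come from the error messages quoted above.

```python
# Substrings taken from the two error messages quoted above.
STUCK_TRANSACTION_MARKERS = (
    "server closed the connection unexpectedly",
    "Can't reconnect until invalid transaction is rolled back",
)


def is_stuck_transaction_event(title: str) -> bool:
    """True if the event title looks like a stuck-transaction symptom."""
    return any(marker in title for marker in STUCK_TRANSACTION_MARKERS)


def pods_to_respin(events):
    """Return the sorted, de-duplicated pod names that reported a symptom.

    `events` is an iterable of (pod_name, event_title) pairs; this shape
    is hypothetical and would have to be built from the Sentry events.
    """
    return sorted(
        {pod for pod, title in events if is_stuck_transaction_event(title)}
    )
```

For example, given events from hypothetical pods packit-worker-0 (a PendingRollbackError) and packit-worker-1 (an unrelated KeyError), only packit-worker-0 would be listed for a respin.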
Grafana
React to alerts arriving through email: check the SLO monitoring page (Packit section) and reply to the email so others know what is happening. Suggest updates to the alert thresholds if needed.
Watch our other two Grafana dashboards as well:
SLO1 issues investigation
We are investigating SLO1 issues. They could be related to short-running tasks taking more than half a minute to complete.
When looking at the Celery monitoring dashboard, pay attention to short-running tasks and how long they took to complete.
For the moment, we can report misbehaving tasks here.
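
To make the half-minute criterion concrete, here is a minimal sketch that picks out task runs exceeding it. The TaskRun type, the slow_runs function, and the idea of feeding it durations exported from the dashboard are assumptions made for illustration; only the 30-second threshold comes from the text above.

```python
from typing import Iterable, NamedTuple

SLOW_THRESHOLD_SECONDS = 30.0  # "half a minute" from the text above


class TaskRun(NamedTuple):
    """One completed task run (hypothetical record shape)."""

    name: str
    duration_seconds: float


def slow_runs(
    runs: Iterable[TaskRun],
    threshold: float = SLOW_THRESHOLD_SECONDS,
) -> list[TaskRun]:
    """Return runs that took strictly longer than the threshold, slowest first."""
    return sorted(
        (run for run in runs if run.duration_seconds > threshold),
        key=lambda run: run.duration_seconds,
        reverse=True,
    )
```

A run of exactly 30 seconds is not flagged, since the text speaks of tasks taking more than half a minute.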
CI/Zuul
You are responsible throughout the week for keeping the CI green, that is, for finding systematic CI failures and driving their resolution.
A CI system can have an outage. For problems related to Zuul, please reach out to the team in #sf-ops on matrix.org or in the #team-rhos-dfg-pcinfra Slack channel.
pre-commit-ci
Once the pre-commit-ci user creates updates to our pre-commit configs, take care of the pull requests:
OpenShift
If you think there's something wrong with the OpenShift instance we're running in:
- Automotive cluster - ask in packit-auto-shared-infra in internal Google chat or mailto [email protected]
- Managed Platform Plus - ask in #help-it-cloud-openshift in internal Slack