Chief of Monitors

## [Sentry](https://red-hat-it.sentry.io/issues/?project=1767823)

 * On Monday, check all the alerts from the past weekend
 * Fix recurring alerts
   * Deploy fixes to production for issues that cause major disruption or complete downtime
   * Verify no alerts are being triggered anymore
 * More complicated issues should be brought to the team and prioritized correctly.
 * Clean up all past alerts so the dashboard stays easy to navigate
 * Link GitHub and Sentry issues

### Stuck PostgreSQL transactions issue
There is an ongoing [issue](https://github.com/packit/packit-service/issues/2954) with PostgreSQL transactions getting stuck when the database server restarts or clients lose connection to it. If you see repeating events such as `OperationalError - (psycopg2.OperationalError) server closed the connection unexpectedly` or `PendingRollbackError - Can't reconnect until invalid transaction is rolled back`, do the following:
 * Check the status of the `postgres` pod, respin it if necessary
 * Respin any pods (service, workers) that generated the error events
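
When many events pile up, it can help to sort out which pods produced one of the two error signatures quoted above. The following is a minimal sketch, not part of the Packit tooling: it assumes you have already collected events as `(pod_name, event_title)` pairs by hand or from an export, and the pod names in the usage example are hypothetical.

```python
import re

# The two error signatures quoted in this section.
STUCK_TRANSACTION_PATTERNS = [
    re.compile(r"OperationalError.*server closed the connection unexpectedly"),
    re.compile(
        r"PendingRollbackError.*"
        r"Can't reconnect until invalid transaction is rolled back"
    ),
]


def is_stuck_transaction_event(title: str) -> bool:
    """Return True if the event title matches a known stuck-transaction error."""
    return any(pattern.search(title) for pattern in STUCK_TRANSACTION_PATTERNS)


def pods_to_respin(events):
    """Return the sorted, de-duplicated names of pods that emitted such events.

    `events` is an iterable of (pod_name, event_title) pairs; this shape is
    an assumption made for the sketch, not a Sentry data format.
    """
    return sorted(
        {pod for pod, title in events if is_stuck_transaction_event(title)}
    )


if __name__ == "__main__":
    # Hypothetical pod names, for illustration only.
    sample = [
        ("worker-a", "OperationalError - (psycopg2.OperationalError) "
                     "server closed the connection unexpectedly"),
        ("service-b", "PendingRollbackError - Can't reconnect until "
                      "invalid transaction is rolled back"),
        ("worker-c", "KeyError - 'foo'"),
    ]
    print(pods_to_respin(sample))
```

The sketch only narrows down the candidates; check the `postgres` pod first, as described in the list above.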

## [Grafana](http://metrics.osci.redhat.com)
React to alerts arriving through email: check the [SLO monitoring page](http://metrics.osci.redhat.com/d/VXwH27XMk/rhel-infrastructure-and-service-health?orgId=1) (Packit section) and reply to the email so others know what is happening. Suggest updates to the alert thresholds if needed.

Watch our other two Grafana dashboards as well:
* [Celery Monitoring](http://metrics.osci.redhat.com/d/3OBI1flGz/celery-monitoring-with-flower?orgId=1&from=now-24h&to=now&refresh=10s)
* [Packit Monitoring](http://metrics.osci.redhat.com/d/iuAkAaWMk/packit?orgId=1)

### SLO1 issues investigation
We are investigating SLO1 issues. They could be related to _short running tasks_ taking more than half a minute to complete.
When looking at the Celery monitoring dashboard, pay attention to _short running tasks_ and how long they took to complete.
For now, report any misbehavior in [this issue](https://github.com/packit/packit-service/issues/1996).
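
If you copy task runtimes out of the dashboard, the half-minute check can be expressed as a small filter. This is a minimal sketch under stated assumptions: the `TaskRun` record, its field names, and the task names in the usage example are hypothetical, and the 30-second limit is taken from the "half a minute" mentioned above.

```python
from dataclasses import dataclass

# "Half a minute", as described in this section.
SHORT_TASK_LIMIT_SECONDS = 30.0


@dataclass
class TaskRun:
    """One observed run of a short running task (hypothetical record shape)."""
    name: str
    runtime_seconds: float


def slow_short_running_tasks(runs, limit=SHORT_TASK_LIMIT_SECONDS):
    """Return the runs whose runtime exceeds the limit, slowest first."""
    slow = [run for run in runs if run.runtime_seconds > limit]
    return sorted(slow, key=lambda run: run.runtime_seconds, reverse=True)


if __name__ == "__main__":
    # Hypothetical task names and runtimes, for illustration only.
    observed = [
        TaskRun("task-a", 4.2),
        TaskRun("task-b", 41.0),
        TaskRun("task-c", 95.5),
    ]
    for run in slow_short_running_tasks(observed):
        print(f"{run.name}: {run.runtime_seconds:.1f}s")
```

Runs flagged this way are candidates to mention when reporting in the linked issue.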

## [CI/Zuul](https://gitlab.com/softwarefactory-project/centosinfra-prod/packit-service-config)

Throughout the week, you are responsible for keeping the CI green: look for systematic CI failures and drive their resolution.

A CI system can have an outage. For problems related to Zuul, please reach out to the team in the `#sf-ops` room on matrix.org or in the `#team-rhos-dfg-pcinfra` Slack channel.

## pre-commit-ci

When the pre-commit-ci user opens pull requests that update our pre-commit configs, [take care of the pull requests](https://github.com/search?q=org%3Apackit+pre-commit+autoupdate&type=pullrequests&ref=advsearch&state=open).

## OpenShift

If you think there's something wrong with the OpenShift instance we're running in:
* Automotive cluster - ask in `packit-auto-shared-infra` in the internal Google Chat or email auto-packit-shared-infra@redhat.com
* Managed Platform Plus - ask in `#help-it-cloud-openshift` in the internal Slack
