-
Notifications
You must be signed in to change notification settings - Fork 67
[inventory] Add unhealthy zpools from each sled #9615
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
illumos-utils/src/zpool.rs
Outdated
| pub struct UnhealthyZpoolsResult { | ||
| pub zpools: Vec<String>, | ||
| pub errors: Vec<String>, | ||
| pub time_of_status: Option<DateTime<Utc>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be optional? If we have a result, it must have been collected at some time, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is meant for when this runs on non-illumos environements
The idea is that there is no time of status because no command was ever run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also there is a point in time where health monitor may not have run yet -> #9589 (comment)
illumos-utils/src/zpool.rs
Outdated
| #[derive(Debug, Clone, PartialEq, Eq, Deserialize, Serialize, JsonSchema)] | ||
| #[serde(rename_all = "snake_case")] | ||
| pub struct UnhealthyZpoolsResult { | ||
| pub zpools: Vec<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and forth between just having a list of unhealthy zpools, or associating each zpool with it's state. In the end I went with listing the zpools only, but I'm not convinced. We'll be including the information of the health checks in the support bundle, and it'd be useful for them to be able to see what state each zpool is in. Thoughts? @davepacheco @jgallagher
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Associating each zpool with its state sounds good to me; having an explicit entry for "this zpool was healthy" seems safer than inferring "any zpool that isn't explicitly listed as unhealthy must have been healthy".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I was thinking of only including the "unhealthy" zpools with their associated statuses in this list. Similarly with the svcs_in_maintenance I only added the services in maintenance with their associated zones. If I were to include the "healthy" zpools then it wouldn't really be consistent with the services in maintenance no?
My take on the health checks is to only report on things that are in an unhealthy state. Thoughts?
|
Thanks for taking a look @jgallagher @davepacheco ! I think I've addressed all of the comments and this is ready for another review. I also incorporated the feedback from #9589 (comment) here. I have updated the PR's description to show the output of |
|
In the end I decided to make the SMF services in maintenance as an |
| if let Some(svcs) = &health_monitor.smf_services_in_maintenance { | ||
| match svcs { | ||
| Ok(svcs) => { | ||
| if !svcs.is_empty() { | ||
| if let Some(time_of_status) = &svcs.time_of_status { | ||
| writeln!( | ||
| indent2, | ||
| "SMF services in maintenance at {}:", | ||
| time_of_status.to_rfc3339_opts( | ||
| SecondsFormat::Millis, | ||
| /* use_z */ true, | ||
| ) | ||
| )?; | ||
| } | ||
| let mut indent3 = | ||
| IndentWriter::new(" ", &mut indent2); | ||
| for svc in &svcs.services { | ||
| writeln!(indent3, "{svc}")?; | ||
| } | ||
| } | ||
| } | ||
| } | ||
| Err(e) => { | ||
| writeln!( | ||
| indent2, | ||
| "failed to retrieve SMF services in maintenance: {e}" | ||
| )?; | ||
| Err(e) => { | ||
| writeln!( | ||
| indent2, | ||
| "failed to retrieve SMF services in maintenance: {e}" | ||
| )?; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am changing all of this in https://github.com/oxidecomputer/omicron/pull/9589/files#diff-7aa92821ab1567814797c276e521417c9ecb6ee700d2b76679ea5e3daedc5904R1105-R1183 anyway. These changes are only to not have any compilation errors with the new Option<Result<SvcsInMaintenanceResult, String>> type
This is the second PR for #9412
This commit adds a list of unhealthy zpools to the sled agent inventory's health monitor. In a follow-up PR this information will be added to the DB
Successful unhealthy zpool and SMF service in maintenance retrieval:
Response contains errors:
All zpools and services are healthy
Health monitor checks aren't running: