Description
Is there an existing issue for this?
- I have searched the existing issues
Describe the bug
For the health check, the OpenAI resource looks at https://status.openai.com/api/v2/status.json and checks the indicator property. If the indicator is minor, the health check is degraded. A major indicator results in an unhealthy resource.
    "major" => HealthCheckResult.Unhealthy(description.Length > 0 ? description : "Major service outage."),
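For readers unfamiliar with the status payload, the mapping described above can be sketched roughly as follows. This is an illustration, not the actual Aspire source: `SketchStatus` and `StatusIndicatorSketch` are hypothetical names (with `SketchStatus` standing in for `HealthStatus`), and the handling of `none`, `critical`, and unrecognized indicators is an assumption. Only the `minor` and `major` cases come from the behavior reported in this issue.

```csharp
using System;
using System.Text.Json;

// Hypothetical stand-in for HealthStatus, so the sketch needs no extra packages.
public enum SketchStatus { Healthy, Degraded, Unhealthy }

public static class StatusIndicatorSketch
{
    // Maps the page-wide "status.indicator" value from status.json to a health status.
    public static SketchStatus Map(string statusJson)
    {
        using var doc = JsonDocument.Parse(statusJson);
        var indicator = doc.RootElement
            .GetProperty("status")
            .GetProperty("indicator")
            .GetString();

        return indicator switch
        {
            "none" => SketchStatus.Healthy,    // assumption: no incident means healthy
            "minor" => SketchStatus.Degraded,  // as described in this issue
            "major" => SketchStatus.Unhealthy, // as described in this issue
            _ => SketchStatus.Unhealthy        // assumption: "critical" and unknown values
        };
    }
}
```

Because the indicator is page-wide, a `minor` value produced by a ChatGPT-only incident maps to the same result as a `minor` value produced by an API incident.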
The problem is that https://status.openai.com/ represents more than the OpenAI API. It also reacts to ChatGPT and Sora problems.
Two days ago, a major ChatGPT issue prevented the OpenAI resource from starting. Fortunately, I am only in dev. It would have been a big problem in production if the resources had been restarted for any reason (e.g., a deploy). The ChatGPT issue lasted for more than 10 hours, so this would not have been a momentary problem.
Today, there is a minor ChatGPT issue.
The result is that the OpenAI resource is running in a degraded state, which prevented my web resource from starting. I was able to get around that by removing the .WaitFor("open-ai") call, which isn't ideal.
The crux of this issue is that an OpenAI resource will be either unhealthy or degraded based on the status of something unrelated. The OpenAI status page was not designed to be a health check for the API specifically, so it takes down Aspire OpenAI resources as collateral damage.
tl;dr - If Sora is down, Aspire OpenAI resources won't start.
Expected Behavior
Either make the health check more granular, or don't mark the resource unhealthy and instead let dependent resources run with some kind of warning that there might be an issue.
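One possible shape for a more granular check, offered as a sketch under stated assumptions rather than a proposed implementation: it assumes the status page also exposes a Statuspage-style components payload (a `components` array whose entries have `name` and `status` fields) and that one component corresponds to the API. Neither is confirmed for status.openai.com. `ComponentStatus`, `ComponentHealthSketch`, and the component name passed in are all hypothetical.

```csharp
using System;
using System.Text.Json;

// Hypothetical stand-in for HealthStatus, so the sketch needs no extra packages.
public enum ComponentStatus { Healthy, Degraded, Unhealthy }

public static class ComponentHealthSketch
{
    // Assumed payload shape (not confirmed for status.openai.com):
    // { "components": [ { "name": "API", "status": "operational" }, ... ] }
    public static ComponentStatus ForComponent(string componentsJson, string componentName)
    {
        using var doc = JsonDocument.Parse(componentsJson);

        foreach (var component in doc.RootElement.GetProperty("components").EnumerateArray())
        {
            var name = component.GetProperty("name").GetString();
            if (!string.Equals(name, componentName, StringComparison.OrdinalIgnoreCase))
            {
                continue; // Skip unrelated products such as ChatGPT or Sora.
            }

            return component.GetProperty("status").GetString() switch
            {
                "operational" => ComponentStatus.Healthy,
                "degraded_performance" => ComponentStatus.Degraded,
                "partial_outage" => ComponentStatus.Degraded,
                "major_outage" => ComponentStatus.Unhealthy,
                _ => ComponentStatus.Degraded // assumption: unknown status is only a warning
            };
        }

        // assumption: if the component is not listed, report Degraded rather than Unhealthy.
        return ComponentStatus.Degraded;
    }
}
```

With this approach, a major outage on a ChatGPT component would not affect the result for the API component. Note that, as described above, a degraded state currently still blocks dependents that wait on the resource, so the warning-only behavior would need separate handling.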
Steps To Reproduce
During outages, https://status.openai.com/api/v2/status.json doesn't appear to contain anything that differentiates between API, ChatGPT, and Sora issues.
Exceptions (if any)
No response
.NET Version info
No response
Anything else?
No response