"oc observe" emits "wait: no child processes" errors, terminates after max errors #17743
Comments
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
/remove-lifecycle stale This is still very much an issue. 65 restarts in ~13h of pod uptime in one case. |
@smarterclayton I believe that what may be happening is that we are using `cmd.Run` here <https://github.com/openshift/origin/blob/master/pkg/oc/cli/cmd/observe/observe.go#L673>, which starts the process and then waits for it to complete. If the process finishes and exits before the wait happens, we end up with the error `wait: no child processes`. If this is the case, I believe it would be safe to ignore these errors, not having them count towards the `--maximum-errors` limit. cc @soltysh for input |
Ok |
@juanvallejo go for it, I was looking at |
Automatic merge from submit-queue (batch tested with PRs 18953, 18992).
ignore wait4 errs in observe max-err-count
Fixes #17743
These errors should not count against the --maximum-errors count, as the process has cleanly run and exited before the wait4 syscall is made.
cc @smarterclayton @soltysh |
@juanvallejo, thank you very much for the fix. I can confirm that we're no longer seeing pod restarts--zero in ~5h of pod uptime compared to up to 65 in ~13h pod uptime before. |
Any protips for those of us not on 3.10+ (where this seems to have landed) ? |
We're using the openshift/observe Docker image to run an object observer in a pod. The observer invokes a Python script for events and resyncs. Every once in a while, seemingly nondeterministically, the observer emits `error: wait: no child processes`. After the number of errors given via `--maximum-errors` (default 20), the observer terminates. It appears to be a case of incorrect handling of `wait4` return values and errors.

Version
The specific base image version we use is `docker.io/openshift/observe@sha256:d66eb70a2b1d372932924b5b6a71f7503a519b55fb3575a7d4378caedc197eb1`. The issue has existed for at least three weeks, however, and possibly longer. Version of OpenShift client within the container:

Steps To Reproduce
Unfortunately I have failed to find a reliable reproduction case. We see around 1-3 restarts per 24 hours, i.e. up to around 60 errors per day, in multiple observers and on different clusters. Arguments given in deployment config:
Output:
System calls (attaching `strace` seems to trigger more errors):

Current Result
`oc observe` terminates after 20 cases of `wait: no child processes`.

Expected Result
`oc observe` does not terminate for too many cases of ECHILD.