"oc observe" emits "wait: no child processes" errors, terminates after max errors #17743

Closed
hansmi opened this issue Dec 13, 2017 · 7 comments

hansmi commented Dec 13, 2017

We're using the openshift/observe Docker image to run an object observer in a pod. The observer invokes a Python script for events and resyncs. Every once in a while, seemingly non-deterministically, the observer emits error: wait: no child processes. Once the number of errors allowed via --maximum-errors (20 by default) is reached, the observer terminates. It appears to be a case of incorrect handling of wait4 return values and errors.
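
For illustration, here is a minimal Go sketch (not the oc observe source; it merely reproduces the same error string) of one way exec.Cmd.Run can report this: if anything else in the process reaps the child first, Run's internal wait4 gets ECHILD.

// wait_echild.go: illustrative only. A competing reaper in the same process
// collects exited children on SIGCHLD, racing with exec.Cmd.Run's own wait4.
// When the goroutine wins the race, Run returns "wait: no child processes".
package main

import (
    "fmt"
    "os"
    "os/exec"
    "os/signal"
    "syscall"
)

func main() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGCHLD)
    go func() {
        for range sigs {
            var ws syscall.WaitStatus
            // Reap whichever child exited; this may beat cmd.Run's own wait4.
            syscall.Wait4(-1, &ws, 0, nil)
        }
    }()

    for i := 0; i < 100; i++ {
        if err := exec.Command("true").Run(); err != nil {
            // May intermittently print: error: wait: no child processes
            fmt.Fprintln(os.Stderr, "error:", err)
        }
    }
}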

Version

The specific base image version we use is docker.io/openshift/observe@sha256:d66eb70a2b1d372932924b5b6a71f7503a519b55fb3575a7d4378caedc197eb1. The issue has existed for at least three weeks, and possibly longer. Version of the OpenShift client within the container:

oc v3.9.0-alpha.0+61e6f25-90
kubernetes v1.8.1+0d5291c
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://172.30.0.1:443
openshift v3.6.173.0.49
kubernetes v1.6.1+5115d708d7

Steps To Reproduce

Unfortunately I have failed to find a reliable reproduction case. We see around 1-3 restarts per 24 hours, i.e. up to around 60 errors per day, across multiple observers and on different clusters. Arguments given in the deployment config:

"args": [
    "quota",
    "--all-namespaces",
    "--listen-addr=0.0.0.0:8080",
    "--object-env-var=OBJ",
    "--no-headers",
    "--type-env-var=EVENT_TYPE",
    "--resync-period=300s",
    "--",
    "/opt/observer/quota-to-influxdb.py"
]
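
For reference, assuming the image entrypoint is "oc observe", this corresponds roughly to the following invocation:

oc observe quota --all-namespaces --listen-addr=0.0.0.0:8080 \
    --object-env-var=OBJ --no-headers --type-env-var=EVENT_TYPE \
    --resync-period=300s -- /opt/observer/quota-to-influxdb.py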

Output:

# oc -n … logs observer-quota-… | grep error
error: wait: no child processes
error: wait: no child processes
error: wait: no child processes

System calls (attaching strace seems to trigger more errors):

# strace -e wait4,write -p 58120
strace: Process 58120 attached
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3179, si_uid=1000030000, si_status=0, si_utime=13, si_stime=7} ---
wait4(3179, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 133362}, ru_stime={0, 84067}, ...}) = 3179
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3186, si_uid=1000030000, si_status=0, si_utime=16, si_stime=5} ---
wait4(3186, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 172137}, ru_stime={0, 62312}, ...}) = 3186
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3193, si_uid=1000030000, si_status=0, si_utime=17, si_stime=5} ---
wait4(3193, 0xc421107254, 0, 0xc420860090) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3200, si_uid=1000030000, si_status=0, si_utime=15, si_stime=4} ---
wait4(3200, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 158272}, ru_stime={0, 50962}, ...}) = 3200
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3207, si_uid=1000030000, si_status=0, si_utime=15, si_stime=8} ---
wait4(3207, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 157145}, ru_stime={0, 97880}, ...}) = 3207
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3214, si_uid=1000030000, si_status=0, si_utime=17, si_stime=6} ---
wait4(3214, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 179208}, ru_stime={0, 74215}, ...}) = 3214
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3221, si_uid=1000030000, si_status=0, si_utime=16, si_stime=5} ---
wait4(3221, 0xc421107254, 0, 0xc42099e5a0) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3228, si_uid=1000030000, si_status=0, si_utime=17, si_stime=6} ---
wait4(3228, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 183290}, ru_stime={0, 75607}, ...}) = 3228
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3235, si_uid=1000030000, si_status=0, si_utime=16, si_stime=5} ---
wait4(3235, 0xc421107254, 0, 0xc4211c0000) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3242, si_uid=1000030000, si_status=0, si_utime=17, si_stime=5} ---
wait4(3242, 0xc421107254, 0, 0xc4211c0090) = -1 ECHILD (No child processes)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3249, si_uid=1000030000, si_status=0, si_utime=14, si_stime=5} ---
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3256, si_uid=1000030000, si_status=0, si_utime=13, si_stime=5} ---
wait4(3256, 0xc421107254, 0, 0xc4211ea000) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3263, si_uid=1000030000, si_status=0, si_utime=14, si_stime=6} ---
wait4(3263, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 155364}, ru_stime={0, 79734}, ...}) = 3263
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3270, si_uid=1000030000, si_status=0, si_utime=15, si_stime=6} ---
wait4(3270, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 159903}, ru_stime={0, 74061}, ...}) = 3270
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3277, si_uid=1000030000, si_status=0, si_utime=12, si_stime=5} ---
wait4(3277, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 132053}, ru_stime={0, 70879}, ...}) = 3277
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3284, si_uid=1000030000, si_status=0, si_utime=14, si_stime=4} ---
wait4(3284, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 148668}, ru_stime={0, 61723}, ...}) = 3284
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3291, si_uid=1000030000, si_status=0, si_utime=15, si_stime=5} ---
wait4(3291, 0xc421107254, 0, 0xc4211c02d0) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3298, si_uid=1000030000, si_status=0, si_utime=16, si_stime=5} ---
wait4(3298, 0xc421107254, 0, 0xc4211c0360) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3305, si_uid=1000030000, si_status=0, si_utime=16, si_stime=6} ---
wait4(3305, 0xc421107254, 0, 0xc420860ab0) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3312, si_uid=1000030000, si_status=0, si_utime=16, si_stime=7} ---
wait4(3312, 0xc421107254, 0, 0xc4211ea120) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3319, si_uid=1000030000, si_status=0, si_utime=14, si_stime=6} ---
wait4(3319, 0xc421107254, 0, 0xc4211ea1b0) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3326, si_uid=1000030000, si_status=0, si_utime=14, si_stime=6} ---
wait4(3326, 0xc421107254, 0, 0xc4211c03f0) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3333, si_uid=1000030000, si_status=0, si_utime=14, si_stime=5} ---
wait4(3333, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 149265}, ru_stime={0, 66994}, ...}) = 3333

Current Result

oc observe terminates after 20 cases of wait: no child processes.

Expected Result

oc observe does not terminate due to repeated cases of ECHILD, since the observed command ran and exited successfully.

pweil- added the component/cli, kind/bug, and priority/P2 labels on Dec 13, 2017
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label on Mar 13, 2018
hansmi commented Mar 14, 2018

/remove-lifecycle stale

This is still very much an issue. 65 restarts in ~13h of pod uptime in one case.

openshift-ci-robot removed the lifecycle/stale label on Mar 14, 2018
juanvallejo commented Mar 14, 2018

@smarterclayton I believe that what may be happening is that we are using cmd.Run here, which starts the process and then waits for it to complete. If the process finishes and is reaped before the wait happens, we end up with the error wait: no child processes. If this is the case, I believe it would be safe to ignore these errors and not have them count toward the --maximum-errors limit.
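
Roughly what I have in mind (an illustrative sketch, not the actual observe code; the helper name is made up, and it assumes Run surfaces the wait4 failure as an *os.SyscallError wrapping ECHILD, which matches the "wait: no child processes" message above):

package main

import (
    "fmt"
    "os"
    "os/exec"
    "syscall"
)

// runOnce runs the observed command and treats wait4's ECHILD as success,
// so it would not count toward the --maximum-errors limit.
func runOnce(cmd *exec.Cmd) error {
    err := cmd.Run()
    if se, ok := err.(*os.SyscallError); ok && se.Err == syscall.ECHILD {
        // The child ran and was already reaped; only the final wait failed.
        return nil
    }
    return err
}

func main() {
    if err := runOnce(exec.Command("true")); err != nil {
        fmt.Fprintln(os.Stderr, "error:", err)
    }
}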

cc @soltysh for input

smarterclayton commented Mar 15, 2018 via email

soltysh commented Mar 15, 2018

@juanvallejo go for it, I was looking at os/exec and haven't figured out anything smarter.

openshift-merge-robot added a commit that referenced this issue Mar 19, 2018
Automatic merge from submit-queue (batch tested with PRs 18953, 18992).

ignore wait4 errs in observe max-err-count

Fixes #17743

These errors should not count against the --maximum-errors limit, as the process has cleanly run and exited before the wait4 syscall is made.

cc @smarterclayton @soltysh
hansmi commented Mar 20, 2018

@juanvallejo, thank you very much for the fix. I can confirm that we're no longer seeing pod restarts: zero in ~5h of pod uptime, compared to up to 65 in ~13h of pod uptime before.

gregswift commented Jun 29, 2018

Any pro tips for those of us not on 3.10+ (where this seems to have landed)?
