"oc observe" emits "wait: no child processes" errors, terminates after max errors #17743

Closed
hansmi opened this issue Dec 13, 2017 · 7 comments

hansmi commented Dec 13, 2017

We're using the openshift/observe Docker image to run an object observer in a pod. The observer invokes a Python script for events and resyncs. Every once in a while, seemingly non-deterministically, the observer emits error: wait: no child processes. Once the number of errors allowed via --maximum-errors (20 by default) is reached, the observer terminates. It appears to be a case of incorrect handling of wait4 return values and errors.
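
For illustration, here is a minimal Go sketch (not the oc observe source; it merely reproduces the same error string) of one way exec.Cmd.Run can report this: if anything else in the process reaps the child first, Run's internal wait4 gets ECHILD.

// wait_echild.go: illustrative only. A competing reaper in the same process
// collects exited children on SIGCHLD, racing with exec.Cmd.Run's own wait4.
// When the goroutine wins the race, Run returns "wait: no child processes".
package main

import (
    "fmt"
    "os"
    "os/exec"
    "os/signal"
    "syscall"
)

func main() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGCHLD)
    go func() {
        for range sigs {
            var ws syscall.WaitStatus
            // Reap whichever child exited; this may beat cmd.Run's own wait4.
            syscall.Wait4(-1, &ws, 0, nil)
        }
    }()

    for i := 0; i < 100; i++ {
        if err := exec.Command("true").Run(); err != nil {
            // May intermittently print: error: wait: no child processes
            fmt.Fprintln(os.Stderr, "error:", err)
        }
    }
}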

Version

The specific base image version we use is docker.io/openshift/observe@sha256:d66eb70a2b1d372932924b5b6a71f7503a519b55fb3575a7d4378caedc197eb1. The issue has existed for at least three weeks, and possibly longer. Version of the OpenShift client within the container:

oc v3.9.0-alpha.0+61e6f25-90
kubernetes v1.8.1+0d5291c
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://172.30.0.1:443
openshift v3.6.173.0.49
kubernetes v1.6.1+5115d708d7

Steps To Reproduce

Unfortunately I have failed to find a reliable reproduction case. We see around 1-3 restarts per 24 hours, i.e. up to around 60 errors per day, across multiple observers and on different clusters. Arguments given in the deployment config:

"args": [
    "quota",
    "--all-namespaces",
    "--listen-addr=0.0.0.0:8080",
    "--object-env-var=OBJ",
    "--no-headers",
    "--type-env-var=EVENT_TYPE",
    "--resync-period=300s",
    "--",
    "/opt/observer/quota-to-influxdb.py"
]
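
For reference, assuming the image entrypoint is "oc observe", this corresponds roughly to the following invocation:

oc observe quota --all-namespaces --listen-addr=0.0.0.0:8080 \
    --object-env-var=OBJ --no-headers --type-env-var=EVENT_TYPE \
    --resync-period=300s -- /opt/observer/quota-to-influxdb.py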

Output:

# oc -n … logs observer-quota-… | grep error
error: wait: no child processes
error: wait: no child processes
error: wait: no child processes

System calls (attaching strace seems to trigger more errors):

# strace -e wait4,write -p 58120
strace: Process 58120 attached
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3179, si_uid=1000030000, si_status=0, si_utime=13, si_stime=7} ---
wait4(3179, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 133362}, ru_stime={0, 84067}, ...}) = 3179
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3186, si_uid=1000030000, si_status=0, si_utime=16, si_stime=5} ---
wait4(3186, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 172137}, ru_stime={0, 62312}, ...}) = 3186
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3193, si_uid=1000030000, si_status=0, si_utime=17, si_stime=5} ---
wait4(3193, 0xc421107254, 0, 0xc420860090) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3200, si_uid=1000030000, si_status=0, si_utime=15, si_stime=4} ---
wait4(3200, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 158272}, ru_stime={0, 50962}, ...}) = 3200
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3207, si_uid=1000030000, si_status=0, si_utime=15, si_stime=8} ---
wait4(3207, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 157145}, ru_stime={0, 97880}, ...}) = 3207
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3214, si_uid=1000030000, si_status=0, si_utime=17, si_stime=6} ---
wait4(3214, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 179208}, ru_stime={0, 74215}, ...}) = 3214
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3221, si_uid=1000030000, si_status=0, si_utime=16, si_stime=5} ---
wait4(3221, 0xc421107254, 0, 0xc42099e5a0) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3228, si_uid=1000030000, si_status=0, si_utime=17, si_stime=6} ---
wait4(3228, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 183290}, ru_stime={0, 75607}, ...}) = 3228
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3235, si_uid=1000030000, si_status=0, si_utime=16, si_stime=5} ---
wait4(3235, 0xc421107254, 0, 0xc4211c0000) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3242, si_uid=1000030000, si_status=0, si_utime=17, si_stime=5} ---
wait4(3242, 0xc421107254, 0, 0xc4211c0090) = -1 ECHILD (No child processes)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3249, si_uid=1000030000, si_status=0, si_utime=14, si_stime=5} ---
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3256, si_uid=1000030000, si_status=0, si_utime=13, si_stime=5} ---
wait4(3256, 0xc421107254, 0, 0xc4211ea000) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3263, si_uid=1000030000, si_status=0, si_utime=14, si_stime=6} ---
wait4(3263, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 155364}, ru_stime={0, 79734}, ...}) = 3263
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3270, si_uid=1000030000, si_status=0, si_utime=15, si_stime=6} ---
wait4(3270, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 159903}, ru_stime={0, 74061}, ...}) = 3270
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3277, si_uid=1000030000, si_status=0, si_utime=12, si_stime=5} ---
wait4(3277, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 132053}, ru_stime={0, 70879}, ...}) = 3277
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3284, si_uid=1000030000, si_status=0, si_utime=14, si_stime=4} ---
wait4(3284, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 148668}, ru_stime={0, 61723}, ...}) = 3284
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3291, si_uid=1000030000, si_status=0, si_utime=15, si_stime=5} ---
wait4(3291, 0xc421107254, 0, 0xc4211c02d0) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3298, si_uid=1000030000, si_status=0, si_utime=16, si_stime=5} ---
wait4(3298, 0xc421107254, 0, 0xc4211c0360) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3305, si_uid=1000030000, si_status=0, si_utime=16, si_stime=6} ---
wait4(3305, 0xc421107254, 0, 0xc420860ab0) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3312, si_uid=1000030000, si_status=0, si_utime=16, si_stime=7} ---
wait4(3312, 0xc421107254, 0, 0xc4211ea120) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3319, si_uid=1000030000, si_status=0, si_utime=14, si_stime=6} ---
wait4(3319, 0xc421107254, 0, 0xc4211ea1b0) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3326, si_uid=1000030000, si_status=0, si_utime=14, si_stime=6} ---
wait4(3326, 0xc421107254, 0, 0xc4211c03f0) = -1 ECHILD (No child processes)
write(2, "error: wait: no child processes\n", 32) = 32
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3333, si_uid=1000030000, si_status=0, si_utime=14, si_stime=5} ---
wait4(3333, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 149265}, ru_stime={0, 66994}, ...}) = 3333

Current Result

oc observe terminates after 20 cases of wait: no child processes.

Expected Result

oc observe does not terminate due to repeated cases of ECHILD, since the observed command ran and exited successfully.

pweil- added the component/cli, kind/bug, and priority/P2 labels on Dec 13, 2017
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label on Mar 13, 2018
hansmi commented Mar 14, 2018

/remove-lifecycle stale

This is still very much an issue. 65 restarts in ~13h of pod uptime in one case.

openshift-ci-robot removed the lifecycle/stale label on Mar 14, 2018
juanvallejo commented Mar 14, 2018

@smarterclayton I believe that what may be happening is that we are using cmd.Run here, which starts the process and then waits for it to complete. If the process finishes and is reaped before the wait happens, we end up with the error wait: no child processes. If this is the case, I believe it would be safe to ignore these errors and not have them count toward the --maximum-errors limit.
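
Roughly what I have in mind (an illustrative sketch, not the actual observe code; the helper name is made up, and it assumes Run surfaces the wait4 failure as an *os.SyscallError wrapping ECHILD, which matches the "wait: no child processes" message above):

package main

import (
    "fmt"
    "os"
    "os/exec"
    "syscall"
)

// runOnce runs the observed command and treats wait4's ECHILD as success,
// so it would not count toward the --maximum-errors limit.
func runOnce(cmd *exec.Cmd) error {
    err := cmd.Run()
    if se, ok := err.(*os.SyscallError); ok && se.Err == syscall.ECHILD {
        // The child ran and was already reaped; only the final wait failed.
        return nil
    }
    return err
}

func main() {
    if err := runOnce(exec.Command("true")); err != nil {
        fmt.Fprintln(os.Stderr, "error:", err)
    }
}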

cc @soltysh for input

smarterclayton commented Mar 15, 2018 via email

soltysh commented Mar 15, 2018

@juanvallejo go for it, I was looking at os/exec and haven't figured out anything smarter.

openshift-merge-robot added a commit that referenced this issue Mar 19, 2018
Automatic merge from submit-queue (batch tested with PRs 18953, 18992).

ignore wait4 errs in observe max-err-count

Fixes #17743

These errors should not count against the --maximum-errors limit, as the process has cleanly run and exited before the wait4 syscall is made.

cc @smarterclayton @soltysh
hansmi commented Mar 20, 2018

@juanvallejo, thank you very much for the fix. I can confirm that we're no longer seeing pod restarts: zero in ~5h of pod uptime, compared to up to 65 in ~13h of pod uptime before.

gregswift commented Jun 29, 2018

Any pro tips for those of us not on 3.10+ (where this seems to have landed)?
