Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Openshift router pod (docker container) doesn't reap zombie processes #1242

Closed
ramr opened this issue Mar 6, 2015 · 16 comments
Closed

Openshift router pod (docker container) doesn't reap zombie processes #1242

ramr opened this issue Mar 6, 2015 · 16 comments

Comments

@ramr
Copy link
Contributor

ramr commented Mar 6, 2015

The openshift router pod running the openshift/origin-haproxy-router:v* image does not reap the haproxy zombie processes. Its probably not a big deal for now - but we will hit process limits when this list of unreaped children grows - dependent on how often the /var/lib/haproxy/reload-haproxy reload script is called.

[vagrant@openshiftdev ~]$ sudo nsenter -m -u -n -i -p -t $(sudo docker inspect --format "{{ .State.Pid }}" "$(sudo docker ps | grep origin-haproxy-router | awk '{print $NF}')")    
[root@router-1-paurv /]# ps -ef | cat    
UID        PID  PPID  C STIME TTY          TIME CMD    
root         1     0  0 01:09 ?        00:00:01 /usr/bin/openshift-router --template=/var/lib/haproxy/conf/haproxy-config.template --reload=/var/lib/haproxy/reload-haproxy    
haproxy     15     1  0 01:09 ?        00:00:00 [haproxy] <defunct>    
haproxy     19     1  0 01:09 ?        00:00:00 [haproxy] <defunct>    
haproxy     23     1  0 01:09 ?        00:00:00 [haproxy] <defunct>    
haproxy     27     1  0 01:09 ?        00:00:00 [haproxy] <defunct>    
haproxy     31     1  0 01:09 ?        00:00:00 [haproxy] <defunct>    
haproxy     35     1  0 01:09 ?        00:00:00 [haproxy] <defunct>    
haproxy     39     1  0 01:09 ?        00:00:00 [haproxy] <defunct>    
haproxy     43     1  0 01:10 ?        00:00:00 [haproxy] <defunct>    
haproxy     80     1  0 01:49 ?        00:00:00 /usr/sbin/haproxy -f /var/lib/haproxy/conf/haproxy.config -p /var/lib/haproxy/run/haproxy.pid -sf 43    
root        81     0  0 01:56 ?        00:00:00 -bash    
root        97    81  0 01:57 ?        00:00:00 ps -ef    
root        98    81  0 01:57 ?        00:00:00 cat    
@ramr
Copy link
Contributor Author

ramr commented Mar 6, 2015

@pweil- fyi.

@smarterclayton
Copy link
Contributor

Sigh

@pweil-
Copy link

pweil- commented Mar 6, 2015

Hrm, well that's troublesome since we're checking for a pid file and sending finish signals in the restart if we found it:

if [ -n "$old_pid" ]; then
  /usr/sbin/haproxy -f $config_file -p $pid_file -sf $old_pid
else
  /usr/sbin/haproxy -f $config_file -p $pid_file
fi

@pweil-
Copy link

pweil- commented Mar 6, 2015

Some thoughts:

  1. the plugin is the entry point for the image, it executes the reload commands
  2. the plugin could be modified to reap child processes - this introduces os specific code with syscall
  3. maybe the docker container can be set up differently so that haproxy is running under an init process that can reap?

cc @pmorie

@smarterclayton
Copy link
Contributor

In general, the 3rd option is most sane. We have systemd container for this, also supervisor and a few others. None of them are ideal. The important thing is that haproxy must know it is being restarted.

There may be a golang library already for acting as pid 1 - you should investigate that.

@pweil-
Copy link

pweil- commented Mar 6, 2015

@smarterclayton

HAProxy knows it is being restarted, the old pid is stopped and is defunct. I think the issue is the parent process (the router plugin) doesn't care and isn't gathering the exit code so it lingers until the parent process stops (which running under init would do if it inherited the child).

But, I can do this without anything crazy or OS specific code if I have the restart script pass back the old pid and do something like this in a go routine:

proc, e := os.FindProcess(int(pid))
proc.Wait()

if that seems reasonable. Preliminary testing seems to work fine. Full test code: master...pweil-:router-reaper

I haven't run across any "do it for you" solutions. I did find a similar issue that led me to the SIGCHLD solution that was OS specific. I will look again since I was mainly searching for reaping with our process acting as pid 1.

@smarterclayton
Copy link
Contributor

There's a golang process manager written by brad fitz. It's worth a look.

On Mar 6, 2015, at 12:29 PM, Paul [email protected] wrote:

@smarterclayton

HAProxy knows it is being restarted, the old pid is stopped and is defunct. I think the issue is the parent process (the router plugin) doesn't care and isn't gathering the exit code so it lingers until the parent process stops (which running under init would do if it inherited the child).

But, I can do this without anything crazy or OS specific code if I have the restart script pass back the old pid and do something like this in a go routine:

proc, e := os.FindProcess(int(pid))
proc.Wait()
if that seems reasonable. Preliminary testing seems to work fine.

I haven't run across any "do it for you" solutions. I did find a similar issue that led me to the SIGCHLD solution that was OS specific. I will look again since I was mainly searching for reaping with our process acting as pid 1.


Reply to this email directly or view it on GitHub.

@pweil-
Copy link

pweil- commented Mar 6, 2015

checking out dinit, and baseimage-docker. Will check that one out too.

@smarterclayton
Copy link
Contributor

BaseImage docker is a terrible pattern

On Mar 6, 2015, at 2:35 PM, Paul [email protected] wrote:

checking out dinit, and baseimage-docker. Will check that one out too.


Reply to this email directly or view it on GitHub.

@pweil-
Copy link

pweil- commented Mar 6, 2015

rm -Rf pauls_brain/baseimage-docker

@smarterclayton
Copy link
Contributor

My preferred solution here would be a simple addition to our Go binaries which detect when they are PID 1 and enable a simple minimal behavior for reaping (Wait on child processes in a goroutine) and proper signal handling. Docker may already have this from the old dockerinit code - I'd be surprised if someone hasn't handled it. @mrunalp you aware of anything like that?

@mrunalp
Copy link
Member

mrunalp commented Mar 9, 2015

@smarterclayton delved into that last week in lib container. Should be possible to put something together. I will take a closer look at the router tomorrow and see what we can do.

@pweil-
Copy link

pweil- commented Mar 9, 2015

@smarterclayton @mrunalp - that doesn't seem too difficult. I will wait to see if Mrunal knows of something already done. I looked at runsit, it seems to do the trick but makes setting up the image a little less obvious. Still solvable but if we'd prefer a simpler solution (I would) then I won't spend more time diving in.

@mrunalp - let me know if there is anything I can do to help

@mrunalp
Copy link
Member

mrunalp commented Mar 9, 2015

@pweil- I am going to prototype what @smarterclayton suggested; basically make the process act as a simple init that reaps zombies. If it works reliably, then we can use that else look for a real init.

@pweil-
Copy link

pweil- commented Mar 9, 2015

@mrunalp - this may not be the ideal implementation since it doesn't catch /any/ child process but it seemed to work fine for me as a general solution. master...pweil-:router-reaper . It was my prototype for option 2 in the comment above

@smarterclayton
Copy link
Contributor

Fixt

jboyd01 pushed a commit to jboyd01/origin that referenced this issue Sep 20, 2017
…service-catalog/' changes from ae6b643caf..50e234de83

50e234de83 origin build: add origin tooling
092d7f8 Fix typos and resource names in walkthrough e2e logs (openshift#1237)
d25bd11 Archive the old agenda doc, link to new one (openshift#1243)
6192d14 fix lint errors (openshift#1242)
d103dad Fix lint errors and regenerate openapi (openshift#1238)
e9328d3 Broker Relist (openshift#1183)
b0f3222 Correct the reasons and messages set on the ready condition during async polling (openshift#1235)
d2bb82f Re-enable the href checker (openshift#1232)
2c29654 Use feature gates in controller-manager (openshift#1231)
699eab9 switch build to go1.9 (openshift#1155)
7529ed8 broker resource secret authorization checking (openshift#1186)
50d9bdf v0.0.20 chart updates (openshift#1228)
REVERT: ae6b643caf Use oc adm instead of oadm which might not exist in various installations.
REVERT: 66a4eb2a2c Update instructions... will remove once documented elsewhere
REVERT: 1b704d1530 replace build context setup with init containers
REVERT: ee4df18c7f hack/lib: dedup os::util::host_platform and os::build::host_platform
REVERT: 1cd6dfa998 origin: Switch out owners to Red Hatters
REVERT: 664f4d318f Add instructions for syncing repos
REVERT: 2f2cdd546b origin-build: delete files with colon in them
REVERT: cdf8b12848 origin-build: don't build user-broker
REVERT: ebfede9056 origin build: add _output to .gitignore
REVERT: 55412c7e3d origin build: make build-go and build-cross work
REVERT: 68c74ff4ae origin build: modify hard coded path
REVERT: 3d41a217f6 origin build: add origin tooling

git-subtree-dir: cmd/service-catalog/go/src/github.com/kubernetes-incubator/service-catalog
git-subtree-split: 50e234de836b5e7c9e3d7d763847b99a0f0ea500
jboyd01 pushed a commit to jboyd01/origin that referenced this issue Sep 21, 2017
…service-catalog/' changes from ae6b643caf..06b897d198

06b897d198 origin build: add origin tooling
092d7f8 Fix typos and resource names in walkthrough e2e logs (openshift#1237)
d25bd11 Archive the old agenda doc, link to new one (openshift#1243)
6192d14 fix lint errors (openshift#1242)
d103dad Fix lint errors and regenerate openapi (openshift#1238)
e9328d3 Broker Relist (openshift#1183)
b0f3222 Correct the reasons and messages set on the ready condition during async polling (openshift#1235)
d2bb82f Re-enable the href checker (openshift#1232)
2c29654 Use feature gates in controller-manager (openshift#1231)
699eab9 switch build to go1.9 (openshift#1155)
7529ed8 broker resource secret authorization checking (openshift#1186)
50d9bdf v0.0.20 chart updates (openshift#1228)
REVERT: ae6b643caf Use oc adm instead of oadm which might not exist in various installations.
REVERT: 66a4eb2a2c Update instructions... will remove once documented elsewhere
REVERT: 1b704d1530 replace build context setup with init containers
REVERT: ee4df18c7f hack/lib: dedup os::util::host_platform and os::build::host_platform
REVERT: 1cd6dfa998 origin: Switch out owners to Red Hatters
REVERT: 664f4d318f Add instructions for syncing repos
REVERT: 2f2cdd546b origin-build: delete files with colon in them
REVERT: cdf8b12848 origin-build: don't build user-broker
REVERT: ebfede9056 origin build: add _output to .gitignore
REVERT: 55412c7e3d origin build: make build-go and build-cross work
REVERT: 68c74ff4ae origin build: modify hard coded path
REVERT: 3d41a217f6 origin build: add origin tooling

git-subtree-dir: cmd/service-catalog/go/src/github.com/kubernetes-incubator/service-catalog
git-subtree-split: 06b897d1988a5a3c035c5a971c15b97cbc732918
jpeeler pushed a commit to jpeeler/origin that referenced this issue Feb 1, 2018
sjenning pushed a commit to sjenning/origin that referenced this issue Jul 30, 2018
[3.8] Ensure Continue token is proxied for openshift RBAC list calls
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants