Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Only restart containers from systemd #1511

Merged
merged 3 commits into from
Jun 23, 2021
Merged

Only restart containers from systemd #1511

merged 3 commits into from
Jun 23, 2021

Conversation

pothos
Copy link
Member

@pothos pothos commented Jun 22, 2021

  • Only restart containers from systemd
    Restarting etcd or the kubelet container from the Docker daemon
    introduces a race because the "docker logs" command will terminate and
    then the systemd unit will also restart the container in parallel to
    the Docker daemon. This race causes problems like terminating the
    container again or failing to start it, but it also introduces
    additional issues because the ExecStartPre and ExecStopPost commands
    run while the container is running, which means the state can get
    corrupted. Even if the systemd unit would not restart in parallel, it
    is not wanted to restart the container from the Docker daemon because
    the systemd unit should first run the ExecStartPre/Post commands, e.g.,
    for the "etcd-rejoin" script.

    Only restart the containers from the systemd unit, not from the Docker
    daemon.

  • etcd: stop the service before modifying the etcd folder

    The etcd folder was modified while the etcd service was running which
    could lead to an earlier restart of the service because the service
    conditions for existing files are finally met and it may conflict
    with any commands in the ExecStopPost directive.

    First stop the etcd service unit before touching the files and only
    start the unit after the modification is done.

  • aws|packet: align etcd unit file with the shared controller module

    The shared etcd CLC systemd unit has additional conditions for startup
    which were missing in the aws and packet etcd systemd units. It seems
    that these conditions help to make the bootstap more robust.

    Port the startup conditions over to aws and packet to align the
    platforms (which also makes debugging more clear).

How to use

Testing done

Looking at CI

@pothos pothos mentioned this pull request Jun 22, 2021
@surajssd
Copy link
Member

@pothos if this hypothesis works, could you also update for other platforms?

Restarting etcd or the kubelet container from the Docker daemon
introduces a race because the "docker logs" command will terminate and
then the systemd unit will also restart the container in parallel to
the Docker daemon. This race causes problems like terminating the
container again or failing to start it, but it also introduces
additional issues because the ExecStartPre and ExecStopPost commands
run while the container is running, which means the state can get
corrupted. Even if the systemd unit would not restart in parallel, it
is not wanted to restart the container from the Docker daemon because
the systemd unit should first run the ExecStartPre/Post commands, e.g.,
for the "etcd-rejoin" script.

Only restart the containers from the systemd unit, not from the Docker
daemon.
@pothos pothos force-pushed the kai/restart-etcd branch from 04602f0 to 8c937c3 Compare June 22, 2021 14:58
@pothos pothos changed the title etcd: only restart the container from systemd Only restart containers from systemd Jun 22, 2021
@pothos
Copy link
Member Author

pothos commented Jun 22, 2021

Yes, looks the kubelet was also affected, updated it there, too.

@pothos pothos requested a review from surajssd June 22, 2021 15:07
@invidian
Copy link
Member

This should be mentioned in the upgrade guide.

The etcd folder was modified while the etcd service was running which
could lead to an earlier restart of the service because the service
conditions for existing files are finally met and it may conflict
with any commands in the ExecStopPost directive.

First stop the etcd service unit before touching the files and only
start the unit after the modification is done.
@pothos
Copy link
Member Author

pothos commented Jun 22, 2021

Still got a failure and tried one more improvement in another commit

The shared etcd CLC systemd unit has additional conditions for startup
which were missing in the aws and packet etcd systemd units. It seems
that these conditions help to make the bootstap more robust.

Port the startup conditions over to aws and packet to align the
platforms (which also makes debugging more clear).
Copy link
Member

@invidian invidian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Copy link
Member

@surajssd surajssd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@pothos pothos merged commit 2fa6cc4 into master Jun 23, 2021
@pothos pothos deleted the kai/restart-etcd branch June 23, 2021 09:09
@invidian invidian added the bug Something isn't working label Jul 16, 2021
@invidian invidian added this to the v0.9.0 milestone Jul 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants