Bug 1928581: validate the proxy by trying oc image info #2539
Conversation
@QiWang19: This pull request references Bugzilla bug 1928581, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: QiWang19. The full list of commands accepted by this bot can be found here.
What about proxy config changes made as a day 2 operation? Does the podman "test pull" code in this PR verify that image pulls work when modifying the cluster proxy config after installation?
/retest
/retest
Considering that the proxy config is a global config in the cluster, shouldn't this be fixed at the source, i.e. checking the proxy config when it gets applied/updated to the cluster? Consumers like the MCO come into the picture later on.
/retest
@sinnykumari From the Bugzilla discussion, when applied to the cluster the proxy will be validated by the CNO (comment 17). It is not an MCO change.
Ah ok, I may have been confused then, because this PR is making changes to MCO's MCC bootstrap mode. If CNO is going to validate the proxy (which makes sense to me), shouldn't this validation code be in the CNO repo?
If the bootstrap mode has an invalid proxy, the CNO pod will fail to launch since the CNO images cannot be pulled down.
@sinnykumari PTAL. Added validation for places after the installation.
/retest
/retest-required
@yuqi-zhang @palonsoro Could you review? The new commit installs openshift-clients and execs the oc command to fetch the CNO image pull spec.
@QiWang19 it looks good to me. Thanks!
@@ -12,6 +12,10 @@ COPY --from=builder /go/src/github.com/openshift/machine-config-operator/instroo
RUN cd / && tar xf /tmp/instroot.tar && rm -f /tmp/instroot.tar
COPY install /manifests

RUN dnf -y update && dnf -y reinstall shadow-utils && \
    dnf -y install skopeo && dnf -y install openshift-clients && \
It's a bit unfortunate to add a whole new copy of skopeo and oc into the image, because we have them right there on the host too... (And for that matter, it's actually useful to validate the proxy configuration from the host network namespace, since that's where most image pulling will be happening.)
Tricky to deal with without making the MCC privileged enough for host mounts though. But OTOH, the MCC really is privileged in a cluster sense anyways, so making it a privileged container (at least enough for host mounts) isn't really adding any new attack surface.
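For illustration only (this is not part of the PR): the kind of pod-spec change that "privileged enough for host mounts" implies, sketched with the upstream corev1 Go types. The container name, image, and mount paths are assumptions.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// mccHostAccess sketches giving the MCC the host network namespace plus a
// read-only mount of the host's binaries, so it could reuse the host's oc/skopeo.
func mccHostAccess() corev1.PodSpec {
	privileged := true
	return corev1.PodSpec{
		HostNetwork: true, // validate the proxy from the host network namespace
		Containers: []corev1.Container{{
			Name:            "machine-config-controller",
			Image:           "machine-config-operator", // placeholder pull spec
			SecurityContext: &corev1.SecurityContext{Privileged: &privileged},
			VolumeMounts: []corev1.VolumeMount{{
				Name:      "host-usr-bin",
				MountPath: "/host/usr/bin",
				ReadOnly:  true,
			}},
		}},
		Volumes: []corev1.Volume{{
			Name: "host-usr-bin",
			VolumeSource: corev1.VolumeSource{
				HostPath: &corev1.HostPathVolumeSource{Path: "/usr/bin"},
			},
		}},
	}
}
```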
@cgwalters I agree with the host network part, you made a great point here, that may be something worth considering.
However, regarding including the binaries, that's a usual burden that we are already paying a lot of times in a lot of components; nothing that should be surprising to us, given that OCP4 consists of a number of clusteroperators which deploy a number of operands, almost all of them running inside containers.
Making the MCC deployment require access to the host, and require the host to always have these binaries available, even when not crazy in practice, goes much against the spirit of having every component in a container so it is, well, self-contained.
A possible way of improvement here would be to make the MCO image derive from the tools image shipped in the release instead of base, because the tools image includes the correct version of oc already. Other images that require oc already do this and benefit from image layer de-duplication in what regards the oc client.
> A possible way of improvement here would be to make the MCO image derive from the tools image shipped in the release instead of base, because the tools image includes the correct version of oc already.

Yeah, that'd help, but doesn't get us out of also shipping skopeo, which today also vendors large parts of the container runtime again.
Hmm, do we actually need to use skopeo vs just forking oc?
Yes, we do. oc is only used to find out which image must be checked (the CNO image). Once found, the check is done with skopeo (oc doesn't provide a way to do it).
Wouldn't just running e.g. oc image info be sufficient?
mmm, as long as executing this successfully guarantees a successful pull, it would.
oc today vendors the docker Go library for interacting with registries, whereas skopeo uses the github.com/containers/image bits. But ultimately... I can't imagine a case where one worked but not the other.
Today oc's fetching is kind of load-bearing because it's where we have e.g. oc image mirror etc. that people use for disconnected.
I can't imagine a case where oc succeeds but skopeo (and/or podman/cri-o) would fail with respect to the proxy.
I agree, we can just run oc image info
Sorry for the delay. I think I would need to check in with the team on this as a more general discussion on how to proceed.
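To make the discussion concrete, here is a minimal sketch of the "just run oc image info" idea. This is not the code in this PR; the helper name and the proxy-via-environment handling are assumptions.

```go
package proxycheck

import (
	"fmt"
	"os"
	"os/exec"
)

// checkImagePullable runs `oc image info` against the given pull spec through
// the configured proxy; an error is taken to mean the proxy cannot reach the registry.
func checkImagePullable(pullSpec, httpProxy, httpsProxy string) error {
	cmd := exec.Command("oc", "image", "info", pullSpec)
	cmd.Env = append(os.Environ(),
		"HTTP_PROXY="+httpProxy,
		"HTTPS_PROXY="+httpsProxy,
	)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("proxy validation failed for %s: %w: %s", pullSpec, err, out)
	}
	return nil
}
```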
/retest-required
} else if err != nil {
	return err
}
if err := ctrlcommon.ProxyValidation(&proxy.Status, clusterversionCfg.Status.Desired.Version, icspRules); err != nil {
I feel like you could talk me into putting this in a separate place for "sanity checks" before config rollout/eviction/drain, but I have concerns about the proposed location in SyncRenderConfig() -- if the proxy config fails this test for whatever reason, we don't get a RenderConfig, which prevents the rest of the sync functions from running, and that is problematic for general cluster stability during normal operations (among other things, it affects certificate rotations).
Specifically, in a case where the proxy was "valid" when it was configured, but is down/unreachable/etc. at the time of the check, the MCO would degrade, wouldn't it?
I don't know that I have a spot picked out where it should go, because the MCO has typically not accepted preflight checks of this nature, but I'm kind of sympathetic here 😄
@yuqi-zhang could you help locate where the MCO deploys the proxy settings to the nodes, so that the proxy validation can be done there?
Right, so the exact details are a bit up for debate. The way it's set up in this PR right now is blocking at the operator level, which is potentially dangerous for reasons John has listed, and I agree that we should probably think about moving this to a consolidated "checking" location.
So, if we want to achieve the point of "don't roll out the proxy to nodes unless it most likely works", then it likely will have to happen at the controller level, between:
- https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/render/render_controller.go#L546, where the rendered MC is generated for a pool
- https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/node/node_controller.go#L846, where the config is rolled out to the MCP
So then this would be something like validateIncomingRenderedConfig before it gets rolled out to the pool, somehow, where right now we just validate the proxy, but it could be extended as a node-specific pre-flight check of some sort. We could even have a flag that enables/disables this, if we don't want to change default behaviour.
The other side of this issue is, as we move towards layering, what if I built a new format OS image with a proxy built into it somehow? This validation path would not catch that if done directly in the image.
One last extension thought on validation that's a bit more encompassing: have a (flag-enabled?) option to create an extra canary node on incoming updates to see if that node would break, before upgrading the rest of the nodes. That's a bit too far though.
In summary, I think this is a pretty complex topic. Right now I think maybe the safest option is at the controller level, but how exactly that would be done is a bit up in the air.
> The other side of this issue is, as we move towards layering, what if I built a new format OS image with a proxy built into it somehow? This validation path would not catch that if done directly in the image.

Ultimately I think what we want is ostreedev/ostree#2725 - basically, we try booting the new configuration, and roll back if kubelet isn't able to start.
@yuqi-zhang Please review. I need help with the operator code auto-generation regarding func (f *fixture) newController() (pkg/controller/render/render_controller_test.go).
> need help with the operator code auto-generation regarding the func (f *fixture) newController() (pkg/controller/render/render_controller_test.go)

Sorry, I don't quite follow, what is the issue with the code autogen?
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: QiWang19, rphillips. The full list of commands accepted by this bot can be found here.
@@ -183,6 +183,12 @@ func (b *Bootstrap) Run(destDir string) error {
	configs = append(configs, kconfigs...)
}

if releaseVersion, ok := cconfig.Annotations[ctrlcommon.ReleaseImageVersionAnnotationKey]; ok {
	if err := ctrlcommon.ProxyValidation(cconfig.Spec.Proxy, releaseVersion, icspRules); err != nil {
Hmm, just for my own curiosity, this will check via the bootstrap network on the bootstrap node right?
Is there a possibility that the bootstrap network is different? Does it even use the proxy you provide to the cluster?
Bootstrap network will be the masters network in most if not all the cases.
For example, in on-prem environments where the keepalived VIP is deployed, the kube-apiserver VIP is first assigned to the bootstrap and eventually moves to one of the masters, so bootstrap and masters must be in the same subnet for that to happen.
const (
	tagName           = "cluster-network-operator"
	imageInfo         = "adm release info %s --image-for %s"
	imageInfoWithICSP = "adm release info %s --image-for %s --icsp-file %s"
Could you explain how the ICSP affects the imagespec here?
The oc command can accept the ICSP file as an alternative source to retrieve the release image: https://github.com/openshift/oc/blob/3cdf3c29f0c109c94eb67124548a6b21fc5f6a22/pkg/cli/admin/release/info.go#L136.
If ICSPs have been configured on the cluster, I think we should pass them to oc when getting the release image and the CNO pull spec.
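As a rough illustration of how these format strings might be expanded and executed (the function name and the way the release image and ICSP file path are obtained are assumptions, not taken from the PR):

```go
package proxycheck

import (
	"fmt"
	"os/exec"
	"strings"
)

const (
	tagName           = "cluster-network-operator"
	imageInfo         = "adm release info %s --image-for %s"
	imageInfoWithICSP = "adm release info %s --image-for %s --icsp-file %s"
)

// cnoPullSpec resolves the CNO image pull spec from the release image,
// optionally passing an ICSP file so mirrored registries are honoured.
func cnoPullSpec(releaseImage, icspFile string) (string, error) {
	args := fmt.Sprintf(imageInfo, releaseImage, tagName)
	if icspFile != "" {
		args = fmt.Sprintf(imageInfoWithICSP, releaseImage, tagName, icspFile)
	}
	out, err := exec.Command("oc", strings.Split(args, " ")...).CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("failed to resolve CNO pull spec: %w: %s", err, out)
	}
	return strings.TrimSpace(string(out)), nil
}
```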
@@ -301,6 +303,25 @@ func (optr *Operator) syncRenderConfig(_ *renderConfig) error {
	}
}
spec.AdditionalTrustBundle = trustBundle
clusterversionCfg, err := optr.configClient.ConfigV1().ClusterVersions().Get(context.TODO(), "version", metav1.GetOptions{})
We are looking to remove these changes right?
}
if releaseVersion, ok := cc.Annotations[ctrlcommon.ReleaseImageVersionAnnotationKey]; ok {
	if err := ctrlcommon.ProxyValidation(cc.Spec.Proxy, releaseVersion, icspRules); err != nil {
		return err
so if we err out here, I think we just don't generate the rendered config, right? I feel like maybe we should still generate the rendered config, then have the node controller do the validation and fail there, so we can reference which rendered MC is failing
Put another way, we still would have an issue where the rendered config doesn't get generated if we do it here, I think?
Yes, that error is returned before the render config is generated and synced.
To let the node controller do the check, we can drop the validation from render_controller, and in the node controller add the validation before this line https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/node/node_controller.go#L846, something like:

cconfigs, err := ctrl.ccLister.List(labels.Everything())
if err != nil {
	return err
}
for _, cc := range cconfigs {
	retry(validation(cc)) // pseudocode: retry the proxy validation per controller config
}

What do you think?
I think somewhere in the sync MCP function could work.
Although, hmm, this does mean that every time we perform an update of any sort, for every node that gets synced, we re-check the proxy; even for scenarios that don't have any changes to the proxy, we re-validate, which seems like... a lot of unnecessary work.
So in that case, maybe having it as we do now is better, but only validating on a change to the proxy between old and new?
What do you think @jkyros? I'm leaning towards reducing the number of times we validate if there isn't a change in the proxy, just not sure where the best place to do so would be. From a logic perspective, I think maybe the render controller is easier, but it comes with the downside of not generating a new rendered MC.
I think in my view, the best place would be: after we generate the rendered MC, before we roll out to an MCP, we do a one-time check of whether the current->desired MC contains a proxy change, before allowing the node controller to roll out. That way, if there is an error, the user would see a new rendered MC, but the MCP would not start an upgrade because it is degraded on the proxy check (with a certain amount of retries). But I don't know how well that plugs into what we have now without it either 1. looking clunky or 2. adding a whole new interface of some sort to do so.
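One possible shape for the "only validate when the proxy changed" idea, purely as a sketch: the hook point, helper name, and exact signature were never settled in this PR; ProxyValidation is the helper the PR introduces.

```go
package preflight

import (
	"reflect"

	apioperatorsv1alpha1 "github.com/openshift/api/operator/v1alpha1"
	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
	ctrlcommon "github.com/openshift/machine-config-operator/pkg/controller/common"
)

// validateIfProxyChanged runs the (network-touching) proxy validation only when
// the proxy section actually differs between the old and new controller configs.
func validateIfProxyChanged(oldCC, newCC *mcfgv1.ControllerConfig, releaseVersion string, icspRules []*apioperatorsv1alpha1.ImageContentSourcePolicy) error {
	if reflect.DeepEqual(oldCC.Spec.Proxy, newCC.Spec.Proxy) {
		return nil // no proxy change in this rollout; skip the expensive check
	}
	return ctrlcommon.ProxyValidation(newCC.Spec.Proxy, releaseVersion, icspRules)
}
```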
if strings.Contains(string(rawOut), proxyErr) {
	return fmt.Errorf("invalid http proxy: %w: error running %s %s: %s", err, oc, strings.Join(args, " "), string(rawOut))
}
return fmt.Errorf("%w: error running %s %s: %s", err, oc, strings.Join(args, " "), string(rawOut))
Should we have some kind of retry here, for transient failures?
I guess we always retry via re-syncing technically, maybe it would be worth adding a requeue somewhere...
The issue I'm thinking about is, let's say the network is unstable, and we happen to fail the one validation, but the proxy is otherwise valid, what is the user experience there?
Yes, I agree. We can retry to deal with the risks.
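A minimal sketch of what that retry could look like, using the upstream wait.ExponentialBackoff helper; the backoff numbers are made up and the wrapper itself is hypothetical.

```go
package proxycheck

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// validateWithRetry retries a proxy validation function with exponential backoff
// so a single transient network failure does not immediately degrade the cluster.
func validateWithRetry(validate func() error) error {
	var lastErr error
	backoff := wait.Backoff{
		Steps:    4,               // total attempts
		Duration: 5 * time.Second, // initial delay
		Factor:   2.0,             // delay doubles each attempt
	}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if lastErr = validate(); lastErr != nil {
			return false, nil // not done yet; retry
		}
		return true, nil
	})
	if err != nil && lastErr != nil {
		return lastErr // all attempts failed; report the validation error
	}
	return err
}
```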
@QiWang19: The following tests failed:
The function signature change was made for getting the ICSP: https://github.com/openshift/machine-config-operator/pull/2539/files#diff-d38c494535eacf2f0876136ce2b6a6329c78e91d238f7cb2b8f75379427747c0R80
Hmm, it's been a long time since we last updated that test. What happens if you just try to manually update the test function with the additional necessary items? i.e. a
I've spent some time thinking through this general problem (validation of the configuration on the nodes), and I'd like to bring this up more for general discussion. First, I'd like to go back to the MCO mission statement:
The MCO was never designed to be, and I believe should still not be, the place where we provide such validation.

Side note: there is a sort of mitigation in place for "breaking updates": we roll out changes one node at a time (generally), and any singular node should always be replaceable.

Back to the point of this PR, proxy has always been a contentious issue. Fundamentally, the proxy object is not owned by the MCO. If any validation were to happen, the root object owner should be validating the object changes before it is provided to the cluster for consumption. If a user provides a broken proxy, shouldn't the change be rejected in the first place, instead of having it get all the way to the MCO generating a new config before saying: actually, your proxy changes aren't valid because the MCC container can't pull the CNO image? The CNO could have done that before we even got here, reducing the complexity of transit. The MCO would then react to it.

Side note again: more broadly speaking, I would also like to revisit the bug for a moment: up until this point https://bugzilla.redhat.com/show_bug.cgi?id=1928581#c12 we were still discussing how to properly validate at a CNO level, but right after, we did some component switching for which there is no context in the bug. Trevor makes some good points in https://bugzilla.redhat.com/show_bug.cgi?id=1928581#c17 and then we flipped it back to node. Did we ever get a chance to discuss this at a higher level?

Now, to also look at the flip side, "MCO does not do validation" is not a view that cannot be changed. OpenShift is constantly growing and adapting, such that if there is sufficient need to tackle a problem, I think we should consider it. As I see it, there are a few alternatives floating in mind:
And lastly, I feel bad for this writeup, since many people have put a lot of work into this PR, but after all the back and forth I am still leaning towards "this isn't something we should do in the MCO". I am happy to discuss this further in any context, and I am willing to change my mind.
Excellent writeup, I agree with most of it. My view on this is that what we really want is automated rollbacks. Basically, in this scenario:
A question here is whether we then try to reconcile again later. I think it'd make sense to do so, with a backoff so we only try the change again e.g. once a day at most or so?
But to be clear, I agree that in this specific instance it'd make sense to have the proxy config be validated by an owning component before it gets rolled out.
Thank you Jerry for adding all the context and reasoning, great explanation! 100% agree with it, and I will echo again that validation should be done at the source, not at the consumer level. This scales better and is less error prone, as the provider has better knowledge of what is correct.
I agree that it should have been the CNO and not the MCO that does this test. Honestly, I don't understand how the bug ended up in MCO in the first place. But if this is to be returned to the CNO, we need some higher-level coordination to make this possible. BTW, the PR may not be the best place for all of this discussion; the Bugzilla would be better.
I am in agreement with the latest discussion. Let's close this PR and document a procedure to test the settings. The proxy settings should be tested in a staging environment.
@QiWang19: This pull request references Bugzilla bug 1928581. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.
I agree with closing this as far as the MCO is concerned, because this should not (and never should have) been checked by the MCO. However, just relying on customer validation in a staging environment is not a correct approach, because mistakes will always happen. The main point of this bug was not to protect from the error itself, but from the fact that there is no sane way to recover from it once it has happened. If this issue arises again, we should open a bug against the CNO, which is where a proper solution should be placed.
Thanks everyone for the work and comments! Will try to continue tracking this in Jira so we don't lose the context.
Try to inspect the image with skopeo using the HTTP proxy config for proxy validation. If the skopeo command fails, do not render the proxy.
- What I did
Close Bug 1928581 https://bugzilla.redhat.com/show_bug.cgi?id=1928581
- How to verify it
- Description for the changelog
Add HTTP proxy validation.