-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ Multiple Language Support + Application Discovery in Hub #173
Conversation
Signed-off-by: Jeff Ortel <[email protected]>
Signed-off-by: Jeff Ortel <[email protected]>
Signed-off-by: Jeff Ortel <[email protected]>
Can we copy the details over here so that we can comment on specific details |
Signed-off-by: Jeff Ortel <[email protected]>
done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some concerns and nits.
I really don't think that we want to be in the business of scheduling pods beyond priority class; we should allow the platform to be responsible for the things that it is supposed to be responsible for, IMO.
selector: ! tag:Language= | ||
|
||
Java and Linux | ||
selector: tag:Language=Java && tag:OperatingSystem=Linux |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pretty concerned with a new format for selector, the concept of tag:
does not exist, and does not make it clear what I can do with this. I am worried that we keep running into each other with concepts, and considering that the hub is another way of running the CLI at scale, I believe that we need to make the UX as cohesive as possible between the two.
I would prefer that either we make the selectors specific (for tag, whatever else you are thinking of selcting on) or we just start with selector and assume that it will only ever work off of tags
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the systemic use of a generic concept like selector is a good thing and makes the UX more cohesive. The analysis rules are clearly selected using a label selectors. Extensions, they are clearly extension selectors (by nature of scope). The same is true for addons.
I don't see how introducing a completely new term/convention for expressing how addons and extensions are selected makes for a better UX. IMHO, it seems more confusing.
The tag: and platform: prefix provides for extensibility without changing the API and seems like a minor conceptual addition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, there is not reason to expect that extensions will only be selected by tags and addons will only be selected by tags. Given that both will need to be selected for source analysis, platform analysis and TBD.
For source analysis, the addon may be selected by tag.
For platform analysis, the addon will like be selected by platform (kind).
It's reasonable to anticipate the need the same flexibility for different types of discovery tasks.
The selector is a powerful, compact and extensible mechanism for supporting this and future needs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tag: and platform: prefix provides for extensibility without changing the API and seems like a minor conceptual addition.
I worry that this prefix will be expected in other selectors and will not exist; that is the confusion that I am worried about. This may not make sense as a worry, but I wanted to bring it up.
I agree that the <prefix>:
to selectors is a suitable mechanism for extending selectors as we move forward for "things" as you said.
There are things that I would want to make sure that we have if we do go forward (and I think that the worry above is worth ignoring).
- we should have good validation, for if the is valid. As you said, in the others it is assumed what it is and they are valid, but when we open it up, people can make mistakes that would be hard to track down (why isn't my thing being selected, OHH i am use language:Java instead of tag:Language=java as an example)
- we should have a discovery mechanism to know what is valid, and because these selectors are fields in k8s resources, and we want them generic, there is no good way (see csi drivers IMO). Maybe this is a UX that we have to live with, and just make sure that we have good documentation.
WDYT? does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Definitely. The implementation reports errors for an unknown selector prefix. And, if the tag: selector references an unknown tag category.
- Not sure I understand this exactly. I'd anticipated the extension and addon controllers would validate the k8s references in the CRs. Is that what you meant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome re: validation, can you add a bit about that in the enhancement and how that works? if it is there, and I missed it I am sorry.
I was kind of hoping that we started to think of a "documentation" screen. Maybe this is outside the scope of this enhancement. One of the things that we are trying to do in Kantra is exposed via open API what providers give you for rules, and I was hoping that for this, we could do something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the outcome here is that we need to track two things:
- A validation feature in the UI for when selecting the implicit selectors, that you can not use
<prefix>:
- A validation feature in kantra to do the same.
@ibolton336 @sjd78 @pranavgaikwad let me know if you folks see issues with that.
- `selector` - defines selector for inclusion in the addon task pod. | ||
- `metadata` - defines unstructured information. | ||
|
||
Changed _Addon_ CR (Spec): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can not understand the changes the spec.
Also if we are changing a k8s API, then we need to make sure that we understand what to do on upgrade with all the fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The incompatible changes are is part of a more holistic approach and better symmetry with extensions. The additions are needed for dynamic selection of addons we'll almost certainly needed for platform discovery (future). As for upgrade. We can include a conversion web hook and version our API but:
- web hooks are a pain.
- the API is at alpha.1
- the only addon CRs are ours which are installed/updated by our operator.
That said, if folks feel strongly, we can include a conversion web hook. But I really don't think it's worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After relooking I missed the -
and +
that you add that was not the -/+ from the diff, that is my fault I am sorry :)
If we are just adding fields and not changing fields or removing them, then we just need to handle the sane defaults (for the things that exist).
I agree webhook should be a last resort and is not necessary unless we are changing a field.
Field removal, on the other hand, is something that is not something we should do in one release. Instead, we should deprecate and deal with the consequences of them being left and removed in a future release. I know this is a pain, but is probably the right thing to do for our users. I think we need to come up with a plan here.
I would suggest that we should follow a similar process for our APIs that the hub API is providing as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I share the same philosophy here. However, I am all but certain there are no addon CRs in our namespace besides the (1) we install and manage. Mainly because there is no way of running other from the UI. Also, I would argue that an alpha1 version offer no guarantee against this kind of change. It's only as a matter of practicality that I propose we don't do as your suggesting.
Leaving the resources and image fields in the spec introduces the opportunity for anomolies. What if the image and resources are defined in both the (old) fields and the new container field? Seems like we would be creating a problem complexity unnecessarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed w/ Jeff offline:
Plan A:
we remove the fields, but turn on strict enforcing. This will enable the users who were using the API that the fields that were removed are no longer valid.
Plan B:
If there are issues with upgrade/downgrade and removing the fields, we will leave them in place but add validation that makes it so they can't be set, with a deprecation warning, and will hopefully be able to be removed in a future release.
If anyone has thoughts let us know :)
|
||
Pods are created with the appropriate _PriorityClass_ to ensure node scheduling matches. Task dependencies will be escalated as needed to prevent priority inversion. When (dep) tasks are already scheduled (Pending), the pod will be recreated referencing the updated priority class. | ||
|
||
When the nodes are saturated and task pods are phase=Pending, the task manager may reschedule (delete) task with lower priority in an effort for higher priority pods to be run by the node scheduler. PriorityClass preemption should handle this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am worried about anything that starts to try and reschedule things IIUC. If we are using PriorityClass already then leaving the pods in phase pending should be sufficient unless I am misunderstanding something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trust me, I pushed back on this as well.
Actually references to PriorityClass needs to be removed ( I thought I did).
The sections here about priority escalation and preemption are a necessary reaction to issues introduced by the automated discovery which is required by the "multi-provider feature". Mainly that analysis tasks cannot be delayed by discovery tasks.
But a few words about PriorityClasses. I looked pretty hard at them with high hopes.
A few problems:
- they are a cluster resource that affect all pods in all namepaces. I highly doubt any customer will let our operator install them (and should not). We would need to convince customers to add the classes we need and define them on the tackle CR. Customers would need to define a default (if one does not exist) to prevent our task pods from running at hiher priority than controllers. According to @rromannissen , requiring users to do this a pretty much a deal breaker.
- priority classes seem to only define a preference. I cannot find a claim in the documentation that the node scheduler will postpone scheduling lower priority pods (for which there is a node with available resources) to clear capacity for pending higher priority pods requiring more resources than available. This produces a priority inversion. In practice (what I observed) was that in a resource bound cluster, when many small pods are pending, larger, higher priority pods will are significantly delayed because it's easier to find small holes than large ones. Unfortunately, our analysis pods require a very large amount of memory which is likely to produce resource-based scheduling issues except on extremely large clusters.
Note: The tasking system already supports scheduling tasks (pods) in priority order. This is essential for installations where a ResourceQuota is placed in the konveyor namespace. According to @rromannissen, the quota will be common.
IMHO, this is the only complicated topic addressed by proposal. I'm continuing to run test to profile scheduling behaviors. I will add a use case to the Use Case section as told to me by @rromannissen .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is good information about priority classes; thanks for explaining that!
I want to push back on the need. IMO, the benefit is a premature optimization for a theoretical problem. From using Kantra to do language discovery, I am seeing times in milliseconds, and if we end up delaying by that time frame, it is hard for me to see the benefit as analysis is in the 10's of minutes, as far as I know.
To be clear, I would agree that this is a problem that is needed if the delay is in the minutes/10's of minutes.
Can we do this without the this, and come back if/when this does become an issue. Maybe we do something simple like an option to turn off automatic discovery for customers that do hit this issue as an escape hatch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The priority escalation features is simpler and only involves redeploying pods that are still pending. This seems straight forward and simply ensures that the pods submitted to the node scheduler match task priority. This is critical when a quota exists in the namespace.
To us, the preemption feature is the controversial one and what we pushed back on for these reasons:
- As you stated, It seems like a premature optimization.
- There is no way to determine which pods should be preempted. The pods blocked the analysis pod are just as likely to be belonging to other operators as our own. Even within our own, it impossible to tell. In all cases it's a best-effort attempt using a sledge hammer.
For these reasons, I made the preemption feature optional. Enabled with a setting. I agree we should consider disabling by default. Users can enable it should they experience problem described in the use case, they can enabled it. However, we'll need to convince @rromannissen.
I am not opposed to making automated discovery enabled/disabled via setting as well. Again, we'll need to get with @rromannissen as to the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the confusion here is what would be introducing the potential delay for the user requested analysis. My understanding is that language discovery does indeed take milliseconds to execute, but automated discovery would also include the technology stack/insights discovery, which implies the execution of an analysis via LSP and can effectively take minutes to complete. The end goal for automated discovery is to surface language and technology stack information at scale throughout the entire portfolio, so applications get automatically associated to their corresponding archetypes ASAP. But that shouldn't imply that the user is unable to do anything until automated tasks are finished, as that would lead to the system being perceived as unresponsive. The problem and the request are stated in #170
Signed-off-by: Jeff Ortel <[email protected]>
Signed-off-by: Jeff Ortel <[email protected]>
Signed-off-by: Jeff Ortel <[email protected]>
Signed-off-by: Jeff Ortel <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very detailed. One concern that was brought up when we were talking about the .NET support effort was how to properly get the container running on the windows node. Assuming that we have analyzer-addon and dotnet-external-provider images that can be run on a windows node, we need a way to:
- let the hub know that this extension/provider can be run on a specific os/platform
- let the hub know that an addon can be run on a specific os/platform
- mechanism for determining under which conditions the addon/provider will be run on a windows node.
- Targets (rules) may be filtered by language. | ||
- Targets (rules) may be pre-filtered by the UI when language is known. | ||
- Language defaults to Java (until automated discovery feature added). | ||
- User initiated task execution (Example: analysis) must be prioritized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think we need to talk in this enhancement about how we are going to deal with providers that support being run on different platforms (ie. linux/amd64
and windows/amd64
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the addon and extensions run on the same pod, the addon would need to selected by OS. The addon selector could be used for this. For example: tag:OperatingSystem=Windows
would select the Windows version of the analyzer addon.
Fields: | ||
- `addon` - declares compatibility with addons. | ||
- `container` - defines the extension container. | ||
- `selector` - defines selector for inclusion in the addon task pod. | ||
- `metadata` - defines unstructured information. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will need a way to expose the platform(s) an extension can run on.
Fields: | ||
- `kind` - declares compatibility with a _kind_ of task. | ||
- `container` - defines the addon container. | ||
- `selector` - defines addon selector for task (kind). | ||
- `metadata` - defines unstructured information. | ||
- `schema` - _FUTURE_ defines the Task.Data schema. | ||
|
||
For symmetry and completeness, the `metadata` field is added and the separate addon fields for the container are replaced with the entire k8s _Container_. This is mainly a general improvement while we're changing the CR. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will need a mechanism to declare the platform/os an addon image can be run on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pranavgaikwad @eemcmullan @JustinXHale
After talking with Jeff, there are some changes on the preemption that we are adding; after this, I am a +1 on this enhancement!
Signed-off-by: Jeff Ortel <[email protected]>
Signed-off-by: Jeff Ortel <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JustinXHale we need to add a new dialog/step in the analysis wizard to ask users if they want to prioritize their analysis over any background tasks that might be running and warn about the potential consequences. No big deal, we can figure it out offline and attach it to the enhancement afterwards. Aside from that, for me it's a +1 for the enhancement!
Ref: #155