Creating PMIx interface "classes" based on stability #179
Strawman Stability Classes Proposal

For the sake of discussion, one option is to fork the COSS and adapt it for both APIs (as opposed to RFCs) and the particulars of our community. First, I propose that once an RFC has been merged into pmix/RFCs, the interface shall be marked `experimental`. Second, I propose that once an interface has two third-party users, the interface gets marked `stable`. As the PMIx community grows and we want to temper the rate of churn/change, we could amend the process such that two third-party users move the interface from `experimental` to `widely used` rather than directly to `stable`. One major benefit that I believe COSS brings to the table is that working code and community interest are the driving factors behind the stability class of a particular interface.
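For concreteness, the COSS-style lifecycle being adapted here could be modeled as a tiny state machine. This is only a sketch: the class names (`experimental`, `widely used`, `stable`) and the two-third-party-user trigger are drawn from this discussion, and the promotion rule is a placeholder, not settled process.

```python
# Hypothetical sketch of the strawman lifecycle; the class names and
# the two-user trigger come from this thread and are not settled policy.

LEVELS = ["experimental", "widely used", "stable"]

def next_level(current, third_party_users):
    """Advance one stability class once an interface has two third-party users."""
    idx = LEVELS.index(current)
    if third_party_users >= 2 and idx < len(LEVELS) - 1:
        return LEVELS[idx + 1]
    return current

# A merged RFC enters the standard as "experimental".
level = "experimental"
level = next_level(level, third_party_users=2)  # -> "widely used"
level = next_level(level, third_party_users=2)  # -> "stable"
```

The appeal of this shape is that, as the original proposal says, working code and community interest (here, the user count) are what drive transitions.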
A few comments.

First, I think your proposal is a good one and well thought out. I like the idea of having "classes" of interfaces as it allows for innovation while providing a path to stability. I'm not sure of the best name for the taxonomy, but I'm sure you folks can hash that detail out.

Second, you'll need to work out a way to deal with the PMIx attributes. We adopted an approach aimed at creating somewhat generic interfaces and using key-value attributes to specify their behavior. The rationale behind that decision was a desire to avoid the common problem of communities modifying existing API definitions, or introducing new ones to deprecate/replace APIs, simply to support a slightly different new behavior. In our thinking, there should never ever be a "PMIx2_Get". Thus, in addition to having "classes" of APIs, you'll probably need a similar taxonomy for the attributes associated with each API. As you look through our RFCs, you'll probably see that the number of RFCs proposing new APIs has continued to drop - instead, they increasingly propose new attributes and behaviors for an existing API or combination of APIs. Our expectation is that this trend will continue into the future, so you may well have a stable API whose attributes are still evolving.

Third, I would advise not tying transition between classes to the number of implementations as this may prove too confining. One might envision a world where there are only two implementations, each addressing different objectives. Forcing an API/attribute to remain `experimental` in that situation seems unnecessarily restrictive.

Finally, the community actually had looked at COSS when considering what process to use and is practicing it to a degree. RFCs are written by one or two lead people, with contributions from others, posted as a PR to the RFC repo, and announced on the mailing list. The pending RFCs are cited again in each week's telecon agenda published to the mailing list as a reminder to the community.
We strive for a rough consensus on each RFC among participating members as represented either via email or issue comments, with final approval given at the weekly developer's telecon. Our rationale was based on the fact that we have a reasonably sized mailing list (approaching 65 members) of people monitoring what we do, but only a limited number of organizations actively involved in development of the standard and/or code. Thus, we took the "silence is lack of dissent" approach - i.e., we invite anyone on the mailing list or the call to voice dissent on an RFC. Any dissent must be addressed, either by adjustment or justification. In the event we cannot reach agreement, then the participants on the telecon make the final decision.

We drifted a bit away from the RFC repo over the last year or so as we didn't see much action there and shifted to a more integrated approach that basically posts the RFC as a PR against the standard's doc itself. This more closely mirrored what we saw in other standards bodies - i.e., you propose actual language to the standard, backed by a prototype implementation. I have no strong feelings either way on this - if anything, I have a slight preference for the current PR-against-the-standard method as it more directly exposes the precise language the proposer is asking to insert. The community is small enough for us to make that work so far - I think a couple of proposals have been shot down, but in general we are able to find a path that provides the proposer with what they wanted to accomplish while addressing the concerns of others. As the community grows, we may need something more formal - but I personally would rather defer that to a time when we find we need it.
Just one further FWIW: we actually have implemented this using the "query" command and some new attributes. You can ask for supported functions and attributes, getting back an array of results.
I totally agree. Thank you for making this point. I always forget to mention the attributes. I believe @abouteiller made a similar point in last week's concall.
I agree that this is not a useful criterion at this point in time. It is a topic to revisit once multiple implementations exist.
Excellent!
I also wanted to clarify something that came up in the concall that I don't think I addressed very well.
I like where this is going in general. I believe the trigger from experimental to stable should be 'time based' in a loose sense. For example, if an experimental feature has been around for 2 minor versions, has one implementation, and has not caused outrage (to be refined as maybe a number of parties voicing formal opposition), it should become stable when we publish the next minor version. That has the advantage of tying the progression into stable to a concrete milestone and anchoring the discussion. Note that 'stable' is not synonymous with mandatory. Many features are expected to remain 'optional', even after the group working on defining the mandatory core is done. Optional/mandatory is a property of an interface/attribute that may also be exploratory (or stable).
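As a rough encoding of the time-based trigger proposed above: the thresholds below are the illustrative ones from this comment, not agreed policy, and the function name is invented.

```python
def eligible_for_stable(minor_versions_survived, implementations, formal_objections):
    """Hypothetical promotion check: an experimental feature becomes
    stable at the next minor release if it has been around for two
    minor versions, has at least one implementation, and has drawn
    no formal opposition. All thresholds are placeholders."""
    return (minor_versions_survived >= 2
            and implementations >= 1
            and formal_objections == 0)

eligible_for_stable(2, 1, 0)  # promote at the next minor version
eligible_for_stable(1, 1, 0)  # too new; stays experimental
```

Tying the check to "the next minor version" is what anchors the decision to a concrete milestone rather than an open-ended debate.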
A couple of notes from the teleconf today:
What is the exact definition of a user when you say "Move from experimental to widely used when 2 or more specific 3rd party users integrating the interface"? As I read it, it sounds like if a company develops a series of products based on a new set of PMIx interfaces and attributes for a use case that is not pure HPC (i.e., not the traditional PMIx ecosystem), and has a very large pool of customers, the proposed interface/attribute will never transition to widely used. Is that correct and/or the intent?
The sense of the concall (as I understood it) was to prevent standardizing new features that are only of interest to one entity. In the example you describe, if the changes are internal and not visible to users, I'd say that's probably not a good candidate for "widely used" even if those new features are widely deployed. If the new feature is user-visible (particularly if it's an interface that will be interacting with other software) and several* users are using it, that's a good candidate for moving into "widely used". Others may have a different/more nuanced interpretation. But that was my understanding of the intent. *For small values of "several."
Okay, then I am clearly not a participant that fits the described intent, since I am not a pure HPC use case and will therefore be stuck in the experimental situation. Best of luck all, and I hope PMIx will be successful as a "new" standard.
I have not heard of a proposal to make relevance to HPC part of whether something gets labelled "widely used".
The proposed changes are a compromise between stability and agility. In a 100% agile standard, you can "move fast and break things"™️, so the standard would rapidly evolve to encompass a large number of use cases and features, which can be beneficial, but it can quickly run into a problem with unstable interfaces that constantly change, making using them like building a house on quicksand. In a 100% stable standard, backwards compatibility can never be broken, so the standard either experiences unbounded growth in size, increasing the maintenance burden and ultimately reducing usability, or glacial development, so new features are never added.

This proposal attempts to be a middle ground that has the best of both worlds. Experimental interfaces/attributes can still be accepted into the standard, increasing agility, but their stability is not guaranteed until they are "widely used", reducing maintenance burden (and even increasing agility in some cases, as I mentioned below).

In the case that you bring up, I believe any entity creating and using the interface is a great first step. I think the next step should be to encourage other users to leverage the same interface. If the interface is well-designed and generalized, then a second user should bear this out. If the interface is hyper-specialized to a particular user, then a second user attempting to leverage the interface will hopefully highlight this. If the latter is the case, then the experimental status leaves room to rework the interface before its stability is guaranteed.
The discussion around the stability classes, their names, and the mechanisms that move an interface/attribute from one class to the next are all still under discussion. Nothing is set in stone, and no changes have been made to any process yet. In fact, this is not even a PR yet; it is just a discussion in an issue. If you have other ideas for how stability should be handled, please share them.
Well, I guess the group (and I do not consider myself an actual member of the group anymore) will have to decide what is the best option. I am just an observer who asks questions at this point; I cannot personally try to influence one way or another. I do not consider myself as representing anything other than just me, so I cannot do more than ask questions and take answers for what they are: answers.
I think the general concern being raised is akin to the one discussed earlier regarding requirements to actually insert something into the standard. I personally don't accept that an "agile" organization necessarily means that "standard" interfaces will be broken and unstable. It does take some thought and commitment to the principle of having generic interface definitions and using "attributes" to control behavior, but we seem to have found a "comfort zone" that works pretty well (admittedly after a couple of failed attempts). I therefore believe that PMIx can move quickly while still preserving stability.

In addition to generic interfaces, we also have to commit to not requiring that everyone implement all interfaces and/or support for every attribute. There really is no reason to force every RM, PMIx lib, and/or programming library to implement everything, especially if their target market and/or user community doesn't need a particular feature. This principle is what enabled us to gain acceptance so quickly in the community and should only be set aside with great care. In reality, it is just a formalization of current common practice, as I noted earlier with MPI as the example.

It would be nice if people at least "stub out" each interface to return "not supported", but that can also be their call - either the user won't compile or they will find out the operation isn't supported at runtime. I can make arguments either way, which just means (to me) that this is something best left to each group's notion of "best practices". Note that the "server" doesn't have to stub things out for the function pointers it provides, as the current standard already states that any pointer not provided will be reported as "not supported".

The "experimental" vs "stable" class concept might serve as a vehicle for realizing these ideas - it's too early to really tell, as the devil is always in the details.
We certainly want to make the process easy enough that a small company in a non-traditional HPC market can get their features into the standard with the same level of effort that a major lab seeking an MPI-supporting feature would require. I don't see any reason why we can't come up with something that will work - just need to poke at it a bit, try to avoid setting rules based on absolute numbers (whether of users or implementers), etc.
@SteVwonder I was unclear: my point about non-HPC participants is that we will have fewer "users" (the definition of user is still unclear to me) and less support, at least at the very first, because the current PMIx community is 100% HPC. I did not mean that the proposed rules were explicitly designed against non-HPC users, but it seems clear to me that the rules currently discussed will indeed implicitly make it difficult to include participants that are not in the HPC field. Again, ultimately, it is not necessarily a problem; it is simply a choice from the community. Let me know if the point I am making is not clear enough. As for the agile methods leading to often breaking the code/standard, I cannot personally agree with that statement. Nowadays, many organizations rely on agile methods and fortunately, they still manage to deliver without breaking everything all the time. As for sharing ideas, in my mind it is what I am doing here. Do you have something else in mind?
I think that there are two notions in this ticket: (A) stability classes for interfaces/attributes, and (B) functionality groupings that slice the standard by use case.
The stability classes concept might work to address (A) in an agile-style model for accepting changes into the standard. That agile model has worked well for the PMIx community thus far and allows it to be responsive to emerging use cases. Defining what the classes are, how many, what they are called, and how to transition between them is, I think, all still under active discussion here. One question I had in re-reading this thread is whether we want to associate a backward compatibility guarantee with a given interface/attribute at different stability levels. I need to think a bit more about that.

For (B) we have talked about identifying slices in the standard according to use cases. A grouping chapter/appendix would help a user or RM interested in use case X to focus on the parts of the standard that are most relevant to that use case scenario. It would highlight the required vs. suggested vs. optional attributes that are needed to support that use case. It's likely that an attribute might be labeled as optional for one use case but required for another.

This morning I was kicking around this idea (it's a bit rough): if an RFP were trying to identify a required subset of the PMIx Standard, it could use the grouping appendices to articulate that. Something like "We require the following support as described in PMIx Standard version X.Y. All required functionality described in Use Cases ABC, DEF, and GHI. Support for use case XYZ is optional but suggested." If the standard changes in those sections from the time of the RFP to deployment (or over the course of support), then those providing that interface are vested in making sure that there is a transition path for those use cases and associated interfaces/attributes.
@jjhursey So if I understand correctly, the notion of "user" that was previously used will more or less be replaced by use cases. That is a really interesting suggestion. I will have to think more about it, but my raw reaction is that it may actually address my concerns.
Yeah. Maybe we can define "user" to mean either multiple user apps/libs or use cases.
Notes from Teleconf April 26, 2019:
Note on the issue title change: based on the April 26th telecon, we thought it's best to limit this particular issue to just the "stability classes/slices" and make a new issue for the "functionality classes/slices". |
To follow up on last week's concall, we discussed potential alternatives to `experimental`. One proposal was `unstable`. I'll throw some more terms into the mix as well.
I think we should be careful about connotations here. A new API or attribute is unlikely to be "unstable" - i.e., use of it shouldn't lead to unpredictable behavior. What I think you want is something that indicates more that it has been accepted on a provisional basis - this more accurately reflects its status. Maybe what you want would better fit just two stages: provisional, indicating it has been accepted (and thus won't be changing) but not on a permanent basis (i.e., acceptance must be renewed after some period of time based on usage and/or usefulness); and stable, indicating it is a permanent member of the standard.
I agree that we have to be careful about connotations. Labeling something as `unstable` could wrongly suggest unpredictable behavior.
Notes from Teleconf May 10, 2019:
Yeah. It appears that we are all suggesting names while operating under different assumptions as to the semantics of these levels. So as @jjhursey suggested, let's table the naming question for now, and just refer to the levels by number (L1, L2, L3).

Side-note: independent of but related to this conversation is the concept of returning "not supported" for any API/attribute. I want to make clear that this issue does not seek to change that. Any API/attribute at any level can still be "not supported" by any given implementation of the PMIx standard. The only interaction may be in "compliance"; for example, for an implementation to be 100% Level 3 PMIx-compliant, it will probably need to support every Level 3 API/attr. This can be a separate issue though, and probably only makes sense to discuss in detail once #182 is either closed or more progress has been made on that front.

I think the first thing to decide is: do we want to allow APIs into the standard that have varying levels of backwards compatibility guarantees? Or should every API in the standard have a permanent guarantee of backwards compatibility (extreme circumstances notwithstanding)?

If I understand correctly, in the current standardization process, there are no "stability classes" and only one form of stability: i.e., "modification of existing released APIs will only be permitted under extreme circumstances" [1]. One recent proposal from @rhc54 is a slight tweak on this that retains the idea of released APIs not being modified, but it splits the APIs into two levels. The original proposal in this issue removes the guarantee that no released APIs will be modified and instead reserves that guarantee for the highest level. The idea being that it would be useful to accept APIs into the standard without immediately guaranteeing permanent backwards compatibility. In this way, time and flexibility are given for APIs to be "put through their paces", and the lessons learned from real-world usage can be re-incorporated into the API design.
In the original proposal, there are three levels. APIs in any of the levels are a part of the standard, but each level has its own compatibility guarantees. One idea that was discussed during the telecon for the original proposal was to make the transition between levels time-based.

Note: about halfway through the issue (right before re-summarizing the current proposals) I dropped usage of "APIs/attributes" and stuck solely with APIs. This was intentional, to focus the discussion for now. In general, I think we should have processes for both, but maybe we should loop in attributes once we have made some headway on APIs.
Sorry for the giant wall of text. In case it wasn't clear, the "too long; didn't read" (tl;dr) of the above comment is:
I believe the proposals you have captured so far would best be served with the notion of a provisional API being non-permanent. We would need to define some period of time associated with provisional status and a deprecation procedure to assure users of it that they won't wake up some morning to find it "gone" - perhaps a period of two years? We can debate the proper time extent.
I second that thought. Attributes are, by their very nature, more ephemeral than APIs. The philosophy used so far has focused on APIs as the point of stability, using attributes to generate flexibility. Thus, the thought was that APIs should become immutable quickly while attributes may come-and-go a little more freely. When it comes time to deal with attributes, we'll have to spend a little more time thinking about this point. Perhaps the biggest issue will be defining some way of deciding whether or not a given attribute should be in the standard vs defined solely by the implementation or the host environment. This gets into the "not required to support" (i.e., there are lots of attributes in the standard but not every implementation or environment has to support them) vs the "non-portable" (i.e., this symbol doesn't exist in this environment, so your app won't even compile) question.
Ok. So (at least between the two of us) there is agreement as to the benefit of APIs that are not permanently backwards-compatible. (others should speak up if they disagree).
I think this opens up the next thing to try and agree on: what kind of non-permanent backwards compatibility is the right kind to include?

One proposal is to have every API in the standard be "solid" (i.e., the interface signature/semantics cannot change) as soon as it is accepted into the standard, but to provide two mechanisms: one for deprecating certain interfaces and ultimately removing them from the standard, and another for "solidifying" interfaces into permanent APIs. The second proposal, thus far, is to have interfaces start with more "malleable" compatibility guarantees (i.e., the interface signatures/semantics can change based on pre-defined rules/guidelines) and then "solidify" them into permanent APIs. The "solid"ness of the APIs when they enter the standard makes these two proposals mutually exclusive, but in general, there is no reason the second proposal could not also include a deprecation mechanism (and in fact, it should, if we decide to go that route).

I think the main benefit of the latter proposal is getting more "experimental"/"immature" interfaces as well as more niche interfaces into the standard sooner with "softer" compatibility guarantees, and then gradually "hardening" the guarantees as the interfaces mature and/or gain users/traction. Ultimately, though, is a standard the right place for such a thing, or is that better left to implementations to "experiment"? On the concall, we have discussed the option for "experimental"/"immature" interfaces to just reside in an implementation(s) until the interface is mature enough to warrant adding to the standard (at which point it should have very strong compatibility guarantees).

Where I get stuck is that PMIx is a bit non-traditional because there is only one implementation currently (AFAIK), where most standards have multiple implementations that they are trying to unify/standardize. So the criteria (in the multi-implementation case) can be connected to the number of implementors of an interface.
If we go this route for PMIx, I think the main thing that needs to be decided is: what should the criteria be for an interface to move from just residing in an implementation to also residing in the standard?
You always have to include a way for deprecating and removing interfaces - nothing lasts forever, not even "permanent" APIs. Usual method is to first warn of impending deprecation, then deprecate but leave in, and then remove. So it takes three revisions to go away.
Why not simplify this and just say that all interfaces are provisional when initially included in the standard - i.e., they are acceptable in principle, but the precise signature is subject to change for some period of time. Changes are done similar to deprecation - you start with a warning, then perhaps retain both signatures for a time (defining a #ifdef flag to select which one is operational), and then remove the old one. This again requires a minimum of three revisions to have the old definition go away.
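The three-revision cadence described above (warn, then deprecate while retaining both signatures, then remove) can be sketched as a small helper. The revision numbering and status strings here are my own illustration, not proposed standard text:

```python
def old_signature_status(warned_in, revision):
    """Status of a provisionally changed signature at a given standard
    revision, following the warn -> deprecate -> remove cadence.
    Revision numbers and status strings are illustrative only."""
    age = revision - warned_in
    if age < 0:
        return "active"
    if age == 0:
        return "warning issued"
    if age == 1:
        return "deprecated, both signatures retained"
    return "removed"

# A change warned about in revision 5 takes until revision 7 to land,
# i.e., a minimum of three revisions for the old definition to go away.
for rev in (4, 5, 6, 7):
    print(rev, old_signature_status(warned_in=5, revision=rev))
```

During the "both signatures retained" window, a C implementation could expose the operational signature behind something like an #ifdef, as suggested above.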
Yes, that is a major complication. At this time, we only know of one generalized implementation - i.e., an implementation intended to be used as a 3rd-party library - and that is the "reference implementation". We know of one and perhaps two other parties that are planning or working on their own environment-specific implementation, but these are customized to their environment and not intended for general use. Thus, any "extension" done by the reference implementation is going to become the equivalent of a modification to the standard, at least on a de facto basis. Adding those definitions into the standards doc actually serves as a "governing" operation on the reference implementation as it requires at least some oversight from the enviro-specific implementations. Otherwise, I suspect we will be hearing similar complaints again about how the reference implementation is driving the standard 😄
Sorry for the delayed response. It's been one of those weeks.
This sounds good to me. Just to make sure we are on the same page, you are proposing that "provisional" interfaces can be deprecated or changed (after three revisions), and "permanent" interfaces can only be deprecated (after three revisions)?

One suggestion from @kathrynmohror during today's phone call was to have the "least stable" interfaces (L1) not show up in the standards document by default, but if you include an "--L1" flag (or uncomment a latex macro, or something similar) when building the PDF, they would be included in the document. Just a thought as to how we can potentially include some newer interfaces without committing to as rigorous a process.

One other discussion during today's call was how intertwined this issue is with #181 and #183, particularly around the idea of interfaces transitioning between classes. Is the typical straw poll and two-week review process sufficient to transition an interface from provisional to permanent (or L2 to L3, or whatever terminology you want to use)? Or should moving to permanent status require a more rigorous process where votes are formally counted and recorded?
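The latex-macro variant of that suggestion could look something like the following in the standard's sources. The macro, chapter, and file names here are purely illustrative, not taken from the actual PMIx standard repository:

```latex
% In the preamble: uncomment (or define via the build system, e.g.
%   pdflatex "\def\IncludeLevelOne{}\input{standard}"
% ) to pull in the Level-1 (provisional) material.
% \newcommand{\IncludeLevelOne}{}

\ifdefined\IncludeLevelOne
  \chapter{Provisional (Level 1) Interfaces}
  \input{chap-level1}  % hypothetical source file
\fi
```

By default the `\ifdefined` branch is skipped, so the L1 material stays out of the published PDF unless the builder opts in.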
No worries - same here.
Yes - I think that makes sense as a distinction.
I'm not wild about that as L1 interfaces are still part of the standard and shouldn't be "hidden". However, I do believe that it would be appropriate to put them in a separate section of the standard so their status is clear. They would then move to the L2 section when approved for that transition.
True - hard not to be that way, I guess. If we are defining classes we have to decide how they differ, and that is going to be a question of process as opposed to substance.
Tough call. The problem you face with a vote-based decision process is adequate representation. When you have a restricted scope (e.g., MPI or OpenMP), it can be fairly easy to obtain a representative sample of the affected population. However, PMIx has a rather broad constituency spanning the gamut of resource managers to programming libraries and even application developers themselves. As a result, knowing that you have adequate representation from affected parties is somewhat problematic. On the other hand, a two-week review process might catch some parties during a vacation, for example, precluding their opportunity to participate in the decision. Perhaps the best compromise is to retain the decision criteria, but provide a longer review time to ensure adequate notice has been given so that affected parties have an opportunity to become aware of the proposed change in status? What if we modify the time requirement to be more like one or two quarters for shifting something from L1 to L2? I don't see how that would impact someone using that particular feature (it remains in the standard - only its status would be changing) while it provides a reasonable amount of time for someone to become aware of the proposal.
Notes from Teleconf May 31, 2019:
- experimental
- stable
- long term support (LTS)
A suggestion of 3 classes is included in PR #193
Question: Will PR #193 close this issue or is there more to do?
Per teleconf July 26, 2019 and Aug. 2, 2019 we think that this can be closed now that PR #193 has been merged. If there are outstanding issues to resolve this Issue can be reopened or (preferably) a new issue can be filed for discussion. |
Main Idea
Motivation
The main motivation for the stability "classes" is to enable the addition of new, prototype interfaces to the standard without immediately committing to backwards compatibility for those interfaces. Interfaces (and attributes) could start as experimental and slowly move toward more stability as their usefulness is demonstrated and confirmed by the community, which would correspond with increasing backwards-compatibility guarantees.
The stability classes could be combined with the functionality classes proposed in #182 to be even more precise, e.g., "we require all of the stable bootstrapping interfaces and the experimental fault-tolerance interfaces."
As a longer-term, potentially more controversial motivation, I could see this being useful for handling portability across implementations. For example, it may be useful to have an API for querying what "classes" a particular implementation supports, as an alternative to querying the availability of each individual function or key (#6). It could also simplify things for users. Rather than having to reason about and program their applications for the potential (un-)availability of each individual function/key, they could instead program for the (un-)availability of the more coarse-grained classes.
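A toy model of that coarse-grained query is below. The class names and the capability map are invented for illustration; an actual implementation would presumably answer such questions through PMIx's existing query mechanism rather than a static table.

```python
# Invented capability map: which stability classes this hypothetical
# implementation fully supports.
IMPLEMENTATION_SUPPORT = {
    "stable": True,
    "widely used": True,
    "experimental": False,
}

def supports_classes(required_classes):
    """Check support at the granularity of whole classes rather than
    probing every individual function or attribute key."""
    return all(IMPLEMENTATION_SUPPORT.get(c, False) for c in required_classes)

supports_classes(["stable"])                  # supported in this model
supports_classes(["stable", "experimental"])  # not fully supported here
```

The point of the sketch is the granularity: an application reasons about a handful of classes instead of the (un-)availability of every individual function or key.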
Prior Art
Rather than reinventing the wheel, I think we should leverage what others in the community are doing as much as possible. Below are some references to other projects that I believe are relevant.
Raw -> Draft -> Stable -> Deprecated -> Retired, as well as a `Deleted` state. An RFC starts in the `raw` state. Once working code exists, it moves to `draft`. Once third parties use it, it moves to `stable`. There is some nuance to the `deprecated`, `retired`, and `deleted` states, so I encourage you to consult the original doc for those, but basically, once an RFC is no longer useful, it moves to one of those states.

Things to Discuss
EDIT: I accidentally posted prematurely. Modified to complete my initial draft.
EDIT2: removed text about the "functionality classes" and referenced the new issue (#182)