Creating PMIx interface "classes" based on stability #179

SteVwonder · 2019-04-04T23:30:57Z

Main Idea

Formally specify the stability of PMIx function/attributes
Stability being things like: experimental, unstable, stable, and deprecated (terms TBD)

Motivation

The main motivation for the stability "classes" is to enable the addition of new, prototype interfaces to the standard without immediately committing to backwards compatibility for those interfaces. Interfaces (and attributes) could start as experimental and slowly moves towards more stability as their usefulness is demonstrated and confirmed by community, which would correspond with increasing backwards-compatibility guarantees.

The stability classes could be combined with the functionality classes proposed in #182 to be even more precise, e.g., "we require all of the stable bootstrapping interfaces and the experimental fault-tolerance interfaces."

As a longer-term, potentially more controversial motivation, I could see this being useful for handling portability across implementations. For example, it may be useful to have an API for querying what "classes" a particular implementation supports, as an alternative to querying the availability of each individual function or key (#6). It could also simplify things for users. Rather than having to reason about and program their applications for the potential (un-)availability of each individual function/key, they could instead program for the (un-)availability of the more coarse-grained classes.

Prior Art

Rather than reinventing the wheel, I think we should leverage what others in the community are doing as much as possible. Below are some references to other projects that I believe are relevant.

MediaWiki
- Uses experimental, unstable, stable, and deprecated their stability classes
- Changing stable interfaces requires a major version dump, changing unstable requires a minor version dump, and changing experimental interfaces can happen at any time
Scikit-bio
- Only has stable, experimental, deprecated (no unstable)
- Useful diagram of label transitions
ZeroMQ Consensus-Oriented Specification System (COSS)
- Mainly in reference to RFCs, but they use a lifecycle of Raw -> Draft -> Stable -> Deprecated -> Retired as well as a Deleted state
- Useful because it allows agility, doesn't require tons of formal processes, and emphasizes running code. RFCs change states based on objective measures rather than formal processes like voting.
- An RFC starts out as just text and in the raw state. Once working code exists, it moves to draft. Once third-parties use it, it moves to stable. There is some nuance to the deprecated, retired, and deleted states, so I encourage you to consult the original doc for those, but basically, once an RFC is no longer useful, it moves to one of those states.

Things to Discuss

Stability Classes
- Determine which ones to use
- Determine their semantics
- Determine the process from changing an interfaces stability class
- Label existing interfaces
Is there other relevant prior art that we should consider?

EDIT: I accidentally posted pre-maturely. Modified to complete my initial draft.
EDIT2: removed text about the "functionality classes" and referenced the new issue (#182)

The text was updated successfully, but these errors were encountered:

SteVwonder · 2019-04-05T00:53:16Z

Strawman Stability Classes Proposal

For the sake of discussion, one option is to fork the COSS and adapt it for both APIs (as opposed to RFCs) and the particulars of our community.

First, I propose that once an RFC has been merged into pmix/RFCs, the interface shall be marked experimental, which makes no compatibility guarantees across versions. Note, this assumes that before the RFC is merged that there is already one working prototype implementation.

Second, I propose that once an interface has two third-party users, the interface gets marked stable, which guarantees compatibility within the same major version. I propose two (rather than the one specified in COSS) third-party users since I expect our community to frequently "scratch our own itches". For example, I may propose a new I/O interface targeting burst buffers and then "scratch my own itch" by using it in an LLNL project like SCR. This seems necessary but not sufficient for marking an interface as stable. I think there should be another third-party user to demonstrate the generality and usefulness of the interface.

As the PMIx community grows and we want to temper the rate of churn/change, we could amend the process such that two third-party users moves the interface from experimental to unstable, which guarantees compatibility within the same minor version. Then, once two implementations of the interface exist (and potentially other criteria), the interface moves from unstable to stable, which as before guarantees compatibility within the same major version.

One major benefit that I believe COSS brings to the table is that working code and community interest are the driving factors behind the stability class of a particular interface.

rhc54 · 2019-04-05T02:59:09Z

A few comments. First, I think your proposal is a good one and well thought out. I like the idea of having "classes" of interfaces as it allows for innovation while providing a path to stability. I'm not sure of the best name for the taxonomy, but I'm sure you folks can hash that detail out.

Second, you'll need to work out a way to deal with the PMIx attributes. We adopted an approach aimed at creating somewhat generic interfaces and using key-value attributes to specify their behavior. The rationale behind that decision was a desire to avoid the common problem of community's modifying existing API definitions, or introducing new ones to deprecate/replace APIs, simply to support a slightly different new behavior. In our thinking, there should never ever be a "PMIx2_Get".

Thus, in addition to having "classes" of APIs, you'll probably need a similar taxonomy for the attributes associated with each API. As you look through our RFCs, you'll probably see that the number of RFCs proposing new APIs has continued to drop - instead, they increasingly propose new attributes and behaviors for an existing API or combination of APIs. Our expectation is that this trend will continue into the future, so you may well have a stable API with experimental attributes!

Third, I would advise not tying transition between classes to the number of implementations as this may prove too confining. One might envision a world where there are only two implementations, each addressing different objectives. Forcing an API/attribute to remain experimental simply because the other implementation isn't interested in that functionality seems too confining. I'd suggest simply reclassifying something once the community feels it has adequately matured - e.g., thru demonstrated adoption/use or some other means. I don't think you need to define hard criteria at this time.

Finally, the community actually had looked at COSS when considering what process to use and are practicing it to a degree. RFCs are written by one or two lead people, with contributions from others, posted as a PR to the RFC repo, and announced on the mailing list. The pending RFCs are cited again in each week's telecon agenda published to the mailing list as a reminder to the community.

We strive for a rough consensus on each RFC among participating members as represented either via email or issue comments with final approval given at the weekly developer's telecon. Our rationale was based on the fact that we have a reasonably sized mailing list (approaching 65 members) of people monitoring what we do, but only a limited number of organizations actively involved in development of the standard and/or code. Thus, we took the "silence is lack of dissent" approach - i.e., we invite anyone on the mailing list or the call to voice dissent on an RFC. Any dissent must be addressed, either by adjustment or justification. In the event we cannot reach agreement, then the participants on the telecon make the final decision.

We drifted a bit away from the RFC repo over the last year or so as we didn't see much action there and shifted to a more integrated approach that basically posts the RFC as a PR against the standard's doc itself. This more closely mirrored what we saw in other standards bodies - i.e., you propose actual language to the standard, backed by a prototype implementation. I have no strong feelings either way on this - if anything, I have a slight preference for the current PR-against-the-standard method as it more directly exposes the precise language the proposer is asking to insert.

The community is small enough for us to make that work so far - I think a couple of proposals have been shot down, but in general we are able to find a path that provides the proposer with what they wanted to accomplish while addressing the concerns of others. As the community grows, we may need something more formal - but I personally would rather defer that to a time when we find we need it.

rhc54 · 2019-04-05T15:04:14Z

For example, it may be useful to have an API for querying what "classes" a particular implementation supports, as an alternative to querying the availability of each individual function or key (#6).

Just one further FWIW: we actually have implemented this using the "query" command and some new attributes. You can ask for supported functions and attributes, getting back an array of pmix_regattr_t structs that contain both machine and human readable information. In the reference implementation, the pattrs command exercises that capability. Should be trivial to add the ability to request info on "classes" of functions.

SteVwonder · 2019-04-11T00:36:05Z

Thus, in addition to having "classes" of APIs, you'll probably need a similar taxonomy for the attributes associated with each API. As you look through our RFCs, you'll probably see that the number of RFCs proposing new APIs has continued to drop - instead, they increasingly propose new attributes and behaviors for an existing API or combination of APIs. Our expectation is that this trend will continue into the future, so you may well have a stable API with experimental attributes!

I totally agree. Thank you for making this point. I always forget to mention the attributes. I believe @abouteiller made a similar point in last week's concall.

Third, I would advise not tying transition between classes to the number of implementations as this may prove too confining....I don't think you need to define hard criteria at this time.

I agree that this not a useful criteria at this point in time. It is a topic to revisit once multiple implementations exist.

Just one further FWIW: we actually have implemented this using the "query" command and some new attributes. You can ask for supported functions and attributes, getting back an array of pmix_regattr_t structs that contain both machine and human readable information. In the reference implementation, the pattrs command exercises that capability. Should be trivial to add the ability to request info on "classes" of functions.

Excellent!

SteVwonder · 2019-04-11T02:50:16Z

Second, I propose that once an interface has two third-party users, the interface gets marked stable, which guarantees compatibility within the same major version.

I also wanted to clarify something that came up in the concall that I don't think I addressed very well. By user in the above, for the client-side API I meant projects like OpenMPI, Spectrum MPI, OpenSHMEM, or SCR, and for server-side API, I meant projects like SLURM.

abouteiller · 2019-04-11T15:43:07Z

I like where this is going in general. I believe the trigger from experimental to stable should be 'time based' in a loose sense. For example, if an experimental feature has been around for 2 minor versions, has one implementation, and has not caused outrage (to be refined as maybe a number of parties voicing formal opposition), it should become stable when we publish the next minor version. That has the advantage of tying the progression into stable to a concrete milestone and anchor the discussion.

Note that 'stable' is not synonymous with mandatory. Many features are expected to remain 'optional', even when the group working on defining the core that has to be mandatory is done. Optional/mandatory is a property of an interface/attribute that may also be exploratory (or stable).

jjhursey · 2019-04-12T16:21:43Z

A couple of notes from the teleconf today:

Movement between experimental to widely used to stable (three stages names still under discussion):
- Move to experimental when PR is merged
- Move from experimental to widely used when 2 or more specific 3rd party users integrating the interface.
- Move from widely used to stable when either more than one implementation adopts the interface or after a deterministic timeout (2 minor versions or 6 months or ...)
Use the PR voting process to signal the transition and voting to 'approve'/'accept' the transition
- A PR is made to transition the label in the document from X to Y. PR is linked to the original RFC.
- Eligibility rational provided in the PR (e.g., This is used by library ABC and RM XYZ, or this has been here for a year with lots of users)
- The standard voting process followed to accept the transition.
- Using this process provides nice traceability from the specific change back through the specific users and original RFC for the change.

gvallee · 2019-04-18T19:45:52Z

What is the exact definition of a user when you say "Move from experimental to widely used when 2 or more specific 3rd party users integrating the interface"? As I read it, it sounds like if a company developing a series of products based on a new set of PMIx interfaces and attributes for a use case that is not pure HPC (i.e., not the traditional PMIx ecosystem) and who has a very large pool of customers, the proposed interface/attribute will never transition to widely used. Is that correct and/or the intent?

rountree · 2019-04-18T20:15:27Z

The sense of the concall (as I understood it) was to prevent standarizing new features that are only of interest to one entity. In the example you describe, if the changes are internal and not visible to users, I'd say that's probably not a good candidate for "widely used" even if those new features are widely deployed. If the new feature is user-visible (particularly if it's an interface that will be interacting with other software) and several* users are using it, that's a good candidate for moving into "widely used".

Others may have a different/more nuanced interpretation. But that was my understanding of the intent.

*For small values of "several."

gvallee · 2019-04-18T20:20:52Z

Okay, then I am clearly not a participant that fits with the described intent since I am not a pure HPC use case and will therefore be stuck in the experimental situation. Best of luck all and I wish PMIx will be successful as a "new" standard.

SteVwonder · 2019-04-18T22:13:54Z

for a use case that is not pure HPC (i.e., not the traditional PMIx ecosystem)...
I am clearly not a participant that fits with the described intent since I am not a pure HPC

I have not heard of a proposal to make relevance to HPC part of whether something gets labelled experimental or widely used. The only proposals I have seen so far are in regards to either number of users and/or length of "time" (either walltime or number of versions) the interface/attribute has been in the standard. So I don't believe that any of the current proposals are disadvantageous to or negligent of non-HPC use-cases. If you believe differently, please point out those specific proposals so we can re-evaluate them.

it sounds like if a company develop a series of products based on a new set of PMIx interfaces and attributes for a use case that is not pure HPC (i.e., not the traditional PMIx ecosystem) and who has a very large pool of customers, the proposed interface/attribute will never transition to widely used.
I am not a pure HPC use case and will therefore be stuck in the experimental situation

The proposed changes are a compromise between stability and agility. In a 100% agile standard, you can "move fast and break things"™️ , so the standard would rapidly evolve to encompass a large number of use-cases and features, which can be beneficial, but it can quickly run into a problem with unstable interfaces that constantly change, making using them like building a house on quicksand. In a 100% stable standard, backwards-compatibility can never be broken, so the standard experiences an unbounded growth in size, increasing the maintenance burden and ultimately reducing usability, or glacial development, so new features are never added. This proposal attempts to be a middle ground that has the best of both worlds. Experimental interfaces/attributes can still be accepted into the standard, increasing agility, but their stability is not guaranteed until they are "widely used", reducing maintenance burden (and even increasing agility in some cases, as I mentioned below).

In the case that you bring up, I believe any entity creating and using the interface is a great first step. I think the next step should be to encourage other users to leverage the same interface. If the interface is well-designed and generalized, then a second user should bear this out. If the interface is hyper-specialized to a particular user, then a second user attempting to leverage the interface will hopefully highlight this. If the latter is the case, then the experimental label is actually a benefit, because the changes necessary to generalize the interface can be made with less effort than if it was already stable.

Best of luck all and I wish PMIx will be successful as a "new" standard.

The discussion around the stability classes, their names, and the mechanism that move an interface/attribute from one class to the next are all still under discussion. Nothing is set in stone, and no changes have been made to any process yet. In fact, this is not even a PR yet, it is just a discussion in an issue. If you have other ideas for how stability should be handled, please share them.

gvallee · 2019-04-18T22:44:29Z

Well I guess the group (and I do not consider myself an actual member of the group anymore) will have to decide what is the best option. I am just an observer who ask questions at this point; I cannot personally try to influence one way or another. I do not consider myself as representing anything other than just me, so I cannot do more than ask questions and take answers for what they are: answers.
Whatever the group will decide will be fine from my point of view. :)

rhc54 · 2019-04-19T00:22:12Z

I think the general concern being raised is akin to the one discussed earlier regarding requirements to actually insert something into the standard. I personally don't accept that an "agile" organization necessarily means that "standard" interfaces will be broken and unstable. It does take some thought and commitment to the principal of having generic interface definitions and using "attributes" to control behavior, but we seem to have found a "comfort zone" that works pretty well (admittedly after a couple of failed attempts). I therefore believe that PMIx can move quickly while still preserving stability.

In addition to generic interfaces, we also have to commit to not requiring that everyone implement all interfaces and/or support for every attribute. There really is no reason to force every RM, PMIx lib, and/or programming library to implement everything, especially if their target market and/or user community doesn't need a particular feature. This principal is what enabled us to gain acceptance so quickly in the community and should be only set aside with great care. In reality, it is just a formalization of current common practice, as I noted earlier with MPI as the example.

It would be nice if people at least "stub out" each interface to return "not supported", but that can also be their call - either the user won't compile or they will find out the operation isn't supported at runtime. I can make arguments either way, which just means (to me) that this is something best left to each group's notion of "best practices". Note that the "server" doesn't have to stub things out for the function pointers it provides as the current standard already states that any pointer not provided will be reported as "not supported".

The "experimental" vs "stable" class concept might serve as a vehicle for realizing these ideas - it's too early to really tell as the devil is always in the details. We certainly want to make the process easy enough that a small company in a non-traditional HPC market can get their features into the standard with the same level of effort that a major lab seeking an MPI-supporting feature would require. I don't see any reason why we can't come up with something that will work - just need to poke at it a bit, try to avoid setting rules based on absolute numbers (whether of users or implementers), etc.

gvallee · 2019-04-19T13:17:17Z

@SteVwonder I was unclear: my point about non-HPC participants is that we will have less "users" (the definition of user is still unclear to me) and less support, at least at very first because the current PMIx community is 100% HPC. I did not mean that the proposed rules were explicitly designed against non-HPC users but it seems clear to me that the rules that are currently discussed will indeed implicitly make it difficult to include participants that are not in the HPC field. Again, ultimately, it is not a necessarily a problem, it is simply a choice from the community. Let me know if the point I am making is not clear enough.

As for the agile methods leading to often breaking the code/standard, I cannot personally agree with that statement. Nowadays, many organizations rely on agile methods and fortunately, they still manage to deliver without breaking everything all the time.

As for sharing ideas, in my mind it is what I am doing here. Do you have something else in mind?

jjhursey · 2019-04-19T16:02:06Z

I think that there are two notions in this ticket:

A. Expressing stability of an interface or attribute as a means to communicate to the user adoption/availability and (maybe more importantly) backward compatibility guarantees by the community.
B. Expressing the importance of an interface or attribute to a given use case.

The stability classes concept might work to address (A) in an agile style model for accepting changes into the standard. That agile model has worked well for the PMIx community thus far and allows it to be responsive to emerging use cases. Defining what the classes are, how many, what they are called, and how to transition between them I think is all still under active discussion here. One question I had in re-reading this thread is if we want to associate a backward compatibility guarantee with a given interface/attribute at different stability levels. I need to think a bit more about that.

For (B) we have talked about identifying slices in the standard according to use cases. A grouping chapter/appendix that would help a user or RM interested in use case X to focus in the parts of the standard that are most relevant to that use case scenario. It would highlight the required vs suggested vs optional attributes that are needed to support that use case. It's likely that an attribute might be labeled as optional for one use case but required for another.

This morning I was kicking around this idea (it's a bit rough):
If we had such groupings around use cases, would that achieve part of the requirement around (A)? If an interface or attribute is associated with multiple use cases then it's importance to the community is more significant. So that interface/attribute is more broadly interpreted as 'stable' because there are more scenarios where it is useful in its current form. If it's only in one use case then it's stable for that use case. If the use case changes over time then those focused on it can determine the best way to handle backward compatibility concerns around it. If the change impacts another use case then the two groups need to reach some agreement.

If an RFP were trying to identify a required subset of the PMIx Standard they could use the grouping appendices to articulate that. Something like "We require the following support as described in PMIx Standard version X.Y. All required functionality described in Use Case ABC, DEF, and GHI. Support for use case XYZ is optional but suggested." If the standard changes in those section from the time of the RFP to deployment (or over the course of support) then those providing that interface are vested in making sure that there is a transition path for those use cases and associated interfaces/attributes.

gvallee · 2019-04-19T16:24:42Z

@jjhursey So if I understand correctly the notion of "user" that was previously used will more or less be replaced by use cases. That is a really interesting suggestion. I will have to think more about it, but my raw reaction is that it may actually address my concerns.

jjhursey · 2019-04-23T15:03:44Z

Yeah. Maybe we can have the definition of "user" to mean either multiple user apps/libs or use cases.

jjhursey · 2019-04-26T16:34:37Z

Notes from Teleconf April 26, 2019:

Discussed the three-phase proposal a bit more
- Once a PR is "Accepted" the interfaces are labeled somehow
  - Concern that "experimental" is too negative of an initial label. Consider other names or mechanisms (e.g., colors)
  - Is it possible for a PR to be "Accepted" into the standard at a different level, or must it first be marked as "new"/"experimental"?
- Move from "experimental" to "widely used" requires 2+ "users" of the interface, where "users" can be loosely defined.
  - A large following of a single implementation of the interface
  - Multiple use cases in the PMIx standard
  - Multiple consumers of the interface
- Move from "widely used" to "stable"
  - Time-based or implementation based
Discussed whether we should use an objective means to transition (N+ Users or X amount of time)
or a more subjective model (a solid argument to make it so).
General agreement that moving from one label to another should go through the normal PR process to allow for comment/discussion.
Lots of discussions still needed on this topic.

SteVwonder · 2019-05-02T20:37:49Z

Note on the issue title change: based on the April 26th telecon, we thought it's best to limit this particular issue to just the "stability classes/slices" and make a new issue for the "functionality classes/slices".

SteVwonder · 2019-05-09T23:01:00Z

To follow up on last weeks concall, we discussed potential alternatives to experimental, widely used, and stable.

One proposal was unstable -> widely used -> stable.
Another was after the phases of matter: gas -> liquid -> solid
Another was just numerals: level 1 -> level 2 -> level 3

I'll throw some more into the mix, maybe we used the terms alpha and beta?
So either alpha -> beta -> stable?
Or unstable -> beta -> stable?

rhc54 · 2019-05-10T00:14:35Z

I think we should be careful about connotations here. A new API or attribute is unlikely to be "unstable" - i.e., use of it shouldn't lead to unpredictable behavior. What I think you want is something that indicates more that it has been accepted on a provisional basis - this more accurately reflects its status.

Maybe what you want would better fit just two stages: provisional, indicating it has been accepted (and thus won't be changing) but not on a permanent basis (i.e., acceptance must be renewed after some period of time based on usage and/or usefulness); and stable, indicating it is a permanent member of the standard.

jjhursey · 2019-05-10T14:58:09Z

I agree that we have to be careful about connotations. Labeling something as unstable or experimental can imply that it's broken in some way and should be avoided. Where I think the intent of this discussion is to mark is as new in some way. I think provisional and stable might hit the right sentiment.

jjhursey · 2019-05-10T16:09:13Z

Notes from Teleconf May 10, 2019:

Suggested to table naming (or use generic level naming) and circle back to the meaning implied by the levels.
- @SteVwonder Will follow up with a comment with some suggestions from this thread and the conversation on the teleconf.

SteVwonder · 2019-05-14T03:28:47Z

Suggested to table naming (or use generic level naming) and circle back to the meaning implied by the levels.

Yeah. It appears that we are all suggesting names while operating under different assumptions as to the semantics of these levels. So as @jjhursey suggested, let's table the naming question for now, and just use level 1, level 2, ... level N (abbreviated as L1, L2, etc) as we discuss what the semantics should be behind them. Once we agree on that, we can circle back around and evaluate names.

Side-note: independent but related to this conversation is the concept of returning "not supported" for any APIs/attribute. I want to make clear that this issue does not seek to change that. Any APIs/attributes of any level can still be "not supported" by any given implementation of the PMIx standard. The only interaction may be in "compliance", for example, for an implementation to be 100% Level 3 PMIx-compliant, then it will probably need to support every Level 3 API/attr. This can be a separate issue though, and probably only makes sense to discuss in detail once #182 is either closed or more progress has been made on that front.

I think the first thing to decide is: do we want to allow APIs into the standard that have varying level(s) of backwards compatibility guarantees? Or should every API in the standard have a permanent guarantee of backwards compatibility (extreme circumstances notwithstanding).

If I understand correctly, in the current standardization process, there are no "stability classes" and only one form of stability: i.e., "modification of existing released APIs will only be permitted under extreme circumstances" [1].

One recent proposal from @rhc54 is a slight tweak on this that retains the idea of released APIs not being modified, but it splits the APIs into two levels. L1 APIs are only accepted into the standard provisionally and their "membership" must be re-affirmed at some interval of time to ensure continued usage and/or usefulness. Presumably there would be some deprecation mechanism such that these APIs are gracefully removed once they have been deemed to be no longer useful. L2 APIs are permanent members of the standard and (presumably) never deprecated. In this way, if a user writes code against a given API in the standard, they are guaranteed that for as long as the API is in the standard, their code will work (or at least compile). As I see it, this could be very powerful for "early adopters" of L1 APIs; they can play around with the new API and are guaranteed to not ever have to re-write their code during the time period that the API goes from L1 -> L2 (assuming it is not deprecated). The split into two levels also provides a new deprecation mechanism (over the existing process as I understand it) for APIs that are deemed to be no longer useful and not appropriate for transitioning to L2.

The original proposal in this issue removes the guarantee that no released APIs will be modified and instead reserves that guarantee for the highest level. The idea being that it would be useful to accept APIs into the standard without immediately guaranteeing permanent backwards-compatibility. In this way, time and flexibility are given for APIs to be "put through their paces" and the lessons learned from real-world usage can be re-incorporated into the API design. In the original proposal, there are three levels. APIs in any of the levels are a part of standard, but each level has its own compatibility guarnatees: L1 APIs have no compatibility guarantees (i.e., they can change at any time), L2 APIs have "moderate" compatibility guarantees, and L3 APIs have "strong" compatibility guarantees. The proposed semantics for L2 and L3 are flexible: L2s could change between minor versions and L3 could change between major versions, or if we want stronger guarantees, L2s could change only between major versions and L3s could be permanent (in the same way that the APIs are currently).

One idea that was discussed during the telecon for the original proposal was to make the L1 APIs (the ones with no compatibility guarantees) not a part of the standard and instead purely a part of the various PMIx implementations. The benefit being that the standard would only contains APIs that contain some kind of compatibility guarantee. The downside being that there is less "advertising" and "publicity" for L1 APIs, since their documentation will be fragmented across implementations. One interesting thing to note, dropping the L1 from the original proposal brings it much closer to the proposal that @rhc54 brought forward recently. They both have two levels, the higher level of both being permanent members of the standard with strong compatibility guarantees. The main difference being with the APIs in the lower level.

Note: about halfway through the issue (right before resummarizing the current proposals) I dropped usage of "APIs/attributes" and stuck solely with APIs. This was intentional to focus the discussion for now. In general, I think we should have processes for both, but maybe we should loop in attributes once we have made some headway on APIs.

SteVwonder · 2019-05-14T03:37:23Z

Sorry for the giant wall of text. In case it wasn't clear, the "too long; didn't read" (tl;dr) of the above comment is:

I think the first thing to decide is: do we want to allow APIs into the standard that have varying level(s) of backwards compatibility guarantees? Or should every API in the standard have a permanent guarantee of backwards compatibility (extreme circumstances notwithstanding).

rhc54 · 2019-05-15T00:36:38Z

I think the first thing to decide is: do we want to allow APIs into the standard that have varying level(s) of backwards compatibility guarantees? Or should every API in the standard have a permanent guarantee of backwards compatibility (extreme circumstances notwithstanding).

I believe the proposals you have captured so far would best be served with the notion of a provisional API being non-permanent. We would need to define some period of time associated with provisional status and a deprecation procedure to assure users of it that they won't wake up some morning to find it "gone" - perhaps a period of two years? We can debate the proper time extent.

Note: about halfway through the issue (right before resummarizing the current proposals) I dropped usage of "APIs/attributes" and stuck solely with APIs.

I second that thought. Attributes are, by their very nature, more ephemeral than APIs. The philosophy used so far has focused on APIs as the point of stability, using attributes to generate flexibility. Thus, the thought was that APIs should become immutable quickly while attributes may come-and-go a little more freely.

When it comes time to deal with attributes, we'll have to spend a little more time thinking about this point. Perhaps the biggest issue will be defining some way of deciding whether or not a given attribute should be in the standard vs defined solely by the implementation or the host environment. This gets into the "not required to support" (i.e., there are lots of attributes in the standard but not every implementation or environment has to support them) vs the "non-portable" (i.e., this symbol doesn't exist in this environment, so your app won't even compile) question.

SteVwonder · 2019-05-17T00:24:43Z

I believe the proposals you have captured so far would best be served with the notion of a provisional API being non-permanent.

Ok. So (at least between the two of us) there is agreement as to the benefit of APIs that are not permanently backwards-compatible. (others should speak up if they disagree).

We would need to define some period of time associated with provisional status and a deprecation procedure to assure users of it that they won't wake up some morning to find it "gone" - perhaps a period of two years? We can debate the proper time extent.

I think this opens up the next thing to try and agree on. What kind of non-permanent backwards-compatibility is the right kind to include. One proposal is to have every API in the standard be "solid" (i.e., the interface signature/semantics cannot change) as soon as it is accepted into the standard, but to provide two mechanisms: one for deprecating certain interfaces and ultimately removing them from the standard and another for "solidifying" interfaces into permanent APIs.

The second proposal, thus far, is to have interfaces start with more "malleable" compatility guarantees (i.e, the interface signatures/semantics can change based on pre-defined rules/guidelines) and then "solidify" them into permanent APIs.

The "solid"ness of the APIs when they enter the standard makes these two proposals mutually exclusive, but in general, there is no reason the second proposal could not also include a deprecation mechanism (and in fact, it should, if we decide to go that route).

I think the main benefit of the latter proposal is getting more "experimental"/"immature" interfaces as well as more niche interfaces into the standard sooner with "softer" compatibility guarantees, and then gradually "hardening" the guarantees as the interfaces mature and/or gain users/traction. Ultimately, though, is a standard the right place for such a thing, or is that better left to implementations to "experiment"? On the concall, we have discussed the option for "experimental"/"immature" interfaces to just reside in an implementation(s) until the interface is mature enough to warrant adding to the standard (at which point it should have very strong compatibility guarantees). Where I get stuck is that PMIx is a bit non-traditional because there is only one implementation currently (AFAIK), where most standards have multiple implementations that they are trying to unify/standardize. So the criteria (in the multi-implementation case) can be connected to the number of implementors of an interface. If we go this route for PMIx, I think the main thing that needs to be decided is what should the criteria be for an interface to move from just residing in an implementation to also residing in the standard?

rhc54 · 2019-05-17T01:15:32Z

to provide two mechanisms: one for deprecating certain interfaces and ultimately removing them from the standard and another for "solidifying" interfaces into permanent APIs.

You always have to include a way for deprecating and removing interfaces - nothing lasts forever, not even "permanent" APIs. Usual method is to first warn of impending deprecation, then deprecate but leave in, and then remove. So it takes three revisions to go away.

The second proposal, thus far, is to have interfaces start with more "malleable" compatility guarantees (i.e, the interface signatures/semantics can change based on pre-defined rules/guidelines) and then "solidify" them into permanent APIs.

Why not simplify this and just say that all interfaces are provisional when initially included in the standard - i.e., they are acceptable in principal, but the precise signature is subject to change for some period of time. Changes are done similar to deprecation - you start with a warning, then perhaps retain both signatures for a time (defining a #ifdef flag to select which one is operational), and then remove the old one. This again requires a minimum of three revisions to have the old definition go away.

Where I get stuck is that PMIx is a bit non-traditional because there is only one implementation currently (AFAIK), where most standards have multiple implementations that they are trying to unify/standardize.

Yes, that is a major complication. At this time, we only know of one generalized implementation - i.e., an implementation intended to be used as a 3rd-party library - and that is the "reference implementation". We know of one and perhaps two other parties that are planning or working on their own environment-specific implementation, but these are customized to their environment and not intended for general use.

Thus, any "extension" done by the reference implementation is going to become the equivalent of a modification to the standard, at least on a de facto basis. Adding those definitions into the standards doc actually serves as a "governing" operation on the reference implementation as it requires at least some oversight from the enviro-specific implementations. Otherwise, I suspect we will be hearing similar complaints again about how the reference implementation is driving the standard 😄

SteVwonder · 2019-05-24T23:42:21Z

Sorry for the delayed response. It's been one of those weeks.

Why not simplify this and just say that all interfaces are provisional when initially included in the standard - i.e., they are acceptable in principal, but the precise signature is subject to change for some period of time. Changes are done similar to deprecation - you start with a warning, then perhaps retain both signatures for a time (defining a #ifdef flag to select which one is operational), and then remove the old one. This again requires a minimum of three revisions to have the old definition go away.

This sounds good to me. Just to make sure we are one the same page, you are proposing that "provisional" interfaces can be deprecated or change (after three revisions), and "permanent" interfaces can only be deprecated (after three revisions)?

One suggestion from @kathrynmohror during today's phone call was to have the "least stable" interfaces (L1) not show up in the standards document by default, but if you include an "--L1" flag (or uncomment a latex macro, or something similar) when building the PDF, they would be included in the document. Just a thought as to how we can potentially include some newer interfaces without committing to as rigorous of a process.

One other discussion during today's call was how intertwined this issue with #181 and #183. Particularly around the idea of interfaces transitioning between classes. Is the typically straw poll and two-week review process sufficient to transition an interface from provisional to permanent (or L2 to L3, or whatever terminology you want to use)? Or should moving to permanent status require a more rigorous process where vote are formally counted and recorded?

rhc54 · 2019-05-25T08:38:34Z

Sorry for the delayed response. It's been one of those weeks.

No worries - same here.

This sounds good to me. Just to make sure we are one the same page, you are proposing that "provisional" interfaces can be deprecated or change (after three revisions), and "permanent" interfaces can only be deprecated (after three revisions)?

Yes - I think that makes sense as a distinction.

One suggestion from @kathrynmohror during today's phone call was to have the "least stable" interfaces (L1) not show up in the standards document by default, but if you include an "--L1" flag (or uncomment a latex macro, or something similar) when building the PDF, they would be included in the document. Just a thought as to how we can potentially include some newer interfaces without committing to as rigorous of a process.

I'm not wild about that as L1 interfaces are still part of the standard and shouldn't be "hidden". However, I do believe that it would be appropriate to put them in a separate section of the standard so their status is clear. They would then move to the L2 section when approved for that transition.

One other discussion during today's call was how intertwined this issue with #181 and #183.

True - hard not to be that way, I guess. If we are defining classes we have to decide how they differ, and that is going to be a question of process as opposed to substance.

Is the typically straw poll and two-week review process sufficient to transition an interface from provisional to permanent (or L2 to L3, or whatever terminology you want to use)? Or should moving to permanent status require a more rigorous process where vote are formally counted and recorded?

Tough call. The problem you face with a vote-based decision process is adequate representation. When you have a restricted scope (e.g., MPI or OpenMP), it can be fairly easy to obtain a representative sample of the affected population. However, PMIx has a rather broad constituency spanning the gamut of resource managers to programming libraries and even application developers themselves. As a result, knowing that you have adequate representation from affected parties is somewhat problematic.

On the other hand, a two-week review process might catch some parties during a vacation, for example, precluding their opportunity to participate in the decision.

Perhaps the best compromise is to retain the decision criteria, but provide a longer review time to ensure adequate notice has been given so that affected parties have an opportunity to become aware of the proposed change in status? What if we modify the time requirement to be more like one or two quarters for shifting something from L1 to L2? I don't see how that would impact someone using that particular feature (it remains in the standard - only its status would be changing) while it provides a reasonable amount of time for someone to become aware of the proposal.

jjhursey · 2019-05-31T16:41:48Z

Notes from Teleconf May 31, 2019:

Having a longer timeout to move from a lower stability level to a higher one (say L1 to L2) sounds good.
- Folks can still use the interface, and it gives more time for community review.
We should clearly define the depreciation process per level.
- Maybe lower stability levels have faster depreciation timelines than higher stability levels. It seems like there should be a correlation there.

abouteiller · 2019-06-06T20:08:30Z

experimental

new interface, may disappear (removed) or change (signature and behavior) in next minor without warning.
After 2, 3 minors (or tbd criterion as discussed above), an experimental interface automatically becomes stable, except if someone voice a formal concern, in which case the interface may remain experimental or be removed (specifics of the voting/disagreement resolution procedure is a separate topic).

stable

Signature is immutable
Semantic changes should be backward compatible (addition of keys, for example), or small errata only.
A stable interface may be deprecated in next minor, removed in next major.
Even after removal, the interface name should never be reused.

long term support (LTS)

an LTS interface is stable
in addition it may be deprecated only in next major, removed in next-next major
LTS interfaces are flagged as such by explicit proposal and accepted through (tbd) voting procedures.

jjhursey · 2019-06-28T15:50:31Z

A suggestion of 3 classes are included in PR #193

Provisional
Stable
Deprecated

jjhursey · 2019-07-25T13:49:25Z

Question: Will PR #193 close this issue or is there more to do?

SteVwonder · 2019-07-26T16:09:24Z

Question: Will PR #193 close this issue or is there more to do?

The consensus on the phone call was that #193 does close this issue.

jjhursey · 2019-08-02T16:00:44Z

Per teleconf July 26, 2019 and Aug. 2, 2019 we think that this can be closed now that PR #193 has been merged.

If there are outstanding issues to resolve this Issue can be reopened or (preferably) a new issue can be filed for discussion.

jjhursey added this to the PMIx v5 Standard milestone Apr 5, 2019

SteVwonder mentioned this issue Apr 11, 2019

Proposed Changes to Standardization Process #181

Closed

SteVwonder changed the title ~~Creating PMIx interface "classes" based on functionality and stability~~ Creating PMIx interface "classes" based on stability May 2, 2019

SteVwonder mentioned this issue May 2, 2019

Creating PMIx "classes/slices" based on functionality #182

Closed

jjhursey mentioned this issue May 4, 2019

Description of the PMIx standardization process #183

Closed

gvallee mentioned this issue May 31, 2019

Define mission statement and core functionalities #190

Open

jjhursey closed this as completed Aug 2, 2019

Creating PMIx interface "classes" based on stability #179

Creating PMIx interface "classes" based on stability #179

Comments

SteVwonder commented Apr 4, 2019 • edited Loading

Main Idea

Motivation

Prior Art

Things to Discuss

SteVwonder commented Apr 5, 2019

Strawman Stability Classes Proposal

rhc54 commented Apr 5, 2019

rhc54 commented Apr 5, 2019

SteVwonder commented Apr 11, 2019

SteVwonder commented Apr 11, 2019

abouteiller commented Apr 11, 2019

jjhursey commented Apr 12, 2019

gvallee commented Apr 18, 2019 • edited Loading

rountree commented Apr 18, 2019

gvallee commented Apr 18, 2019

SteVwonder commented Apr 18, 2019

gvallee commented Apr 18, 2019 • edited Loading

rhc54 commented Apr 19, 2019

gvallee commented Apr 19, 2019

jjhursey commented Apr 19, 2019

gvallee commented Apr 19, 2019

jjhursey commented Apr 23, 2019

jjhursey commented Apr 26, 2019

SteVwonder commented May 2, 2019

SteVwonder commented May 9, 2019

rhc54 commented May 10, 2019

jjhursey commented May 10, 2019

jjhursey commented May 10, 2019

SteVwonder commented May 14, 2019

SteVwonder commented May 14, 2019

rhc54 commented May 15, 2019

SteVwonder commented May 17, 2019

rhc54 commented May 17, 2019

SteVwonder commented May 24, 2019

rhc54 commented May 25, 2019

jjhursey commented May 31, 2019

abouteiller commented Jun 6, 2019 • edited Loading

experimental

stable

long term support (LTS)

jjhursey commented Jun 28, 2019

jjhursey commented Jul 25, 2019

SteVwonder commented Jul 26, 2019

jjhursey commented Aug 2, 2019

SteVwonder commented Apr 4, 2019 •

edited

Loading

gvallee commented Apr 18, 2019 •

edited

Loading

gvallee commented Apr 18, 2019 •

edited

Loading

abouteiller commented Jun 6, 2019 •

edited

Loading