
Relative replication order #415

Open
smolkaj opened this issue Feb 17, 2023 · 8 comments

Comments

@smolkaj
Member

smolkaj commented Feb 17, 2023

PSA/P4Runtime has a "packet replication engine" (PRE), e.g. for cloning and multicast.

I am interested in multicast cases where fairness matters. In hardware, packet replication for multicast often happens sequentially, meaning that multicast replicas are made one by one, and potentially egress the switch one by one. These relative delays don't matter in most use cases, but if you have a very large multicast tree (i.e. many levels of switches replicating the packets to their children in the tree) and really care about fairness, they add up.

With that motivation in mind, I am proposing adding a new relative_replication_order field to the Replica message, as follows:

// Used for replicas created for cloning and multicasting actions.
message Replica {
  uint32 egress_port = 1;
  uint32 instance = 2;

  // Specifies the relative order (with respect to other replicas within the same
  // group) in which this replica gets processed by the packet replication engine. 
  enum RelativeReplicationOrder {
    // Default. The order is arbitrary.
    UNSPECIFIED = 0;
    // The order of this replica is fixed and equal to its position within the
    // group. If this replica is at position i, it will be the ith replica to be
    // processed.
    FIXED = 1;
    // The order of this replica is subject to uniform shuffling with all other
    // replicas within the same group participating in shuffling.
    SHUFFLED = 2;
  }
  RelativeReplicationOrder relative_replication_order = 3;
}

With this, the switch control plane would regularly permute the multicast group members.
Permutations would be chosen uniformly at random from the set of permutations that have all FIXED replicas as fixed points.
The switch should preserve read-write symmetry, i.e. the reordering would be transparent to the controller and would not be visible when reading back the multicast group.
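As a sketch of the intended reordering (illustrative only; the tuple representation of replicas is hypothetical, not part of the P4Runtime API), the control plane would uniformly permute the SHUFFLED replicas among the positions they occupy, leaving all other replicas as fixed points:

```python
import random

def shuffle_group(replicas):
    """Return a new replica order for one multicast group.

    replicas: list of (egress_port, instance, order) tuples, where order is
    one of "UNSPECIFIED", "FIXED", "SHUFFLED". SHUFFLED replicas are
    uniformly permuted among the positions they occupy; everything else
    stays in place. (Hypothetical representation for illustration.)
    """
    positions = [i for i, r in enumerate(replicas) if r[2] == "SHUFFLED"]
    targets = positions[:]
    random.shuffle(targets)  # uniform over all permutations of the positions
    result = list(replicas)
    for src, dst in zip(positions, targets):
        result[dst] = replicas[src]
    return result
```

Because `random.shuffle` draws uniformly from the permutations of the SHUFFLED positions, the resulting order is uniform over exactly the permutations that have all FIXED (and UNSPECIFIED, in this sketch) replicas as fixed points.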


I would like to discuss with the community if this seems like a reasonable/general enough feature, or if people feel it's too exotic to be included in the standard. I imagine this functionality would be optional, i.e. switches would not have to support anything but the UNSPECIFIED replication order.

@jonathan-dilorenzo @kheradmandG for visibility.

@smolkaj
Member Author

smolkaj commented Feb 17, 2023

We discussed this in the WG meeting. Thanks for all the feedback! Summarizing some issues that came up:

Why is the ordering a per-replica property instead of a per-group property?

The per-replica property is more general and allows us to have multicast groups such as

F|S|S|S|F

where the first and last replicas are FIXED, but the three middle replicas are SHUFFLED. Thus the switch will always start by making a copy for the first replica, then process the next three replicas in random order, and always finish by making a copy for the last replica.

Does shuffling happen only among contiguous SHUFFLED replicas, or among all SHUFFLED replicas?

Consider

S|F|S

Should the first and the last replica be shuffled or not?

My proposal is that they should be shuffled. This is not based on a particular use case, but merely because it seems easier/more canonical from a mathematical perspective, in the following way: we may apply any permutation that has the FIXED replicas as fixed points.
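To make this concrete, a small sketch (illustrative only) enumerating the permutations this rule would allow for S|F|S: any permutation fixing position 1 is legal, so the two SHUFFLED replicas may swap even though they are not contiguous.

```python
from itertools import permutations

# Illustrative only: enumerate the permutations allowed for the replica
# pattern S|F|S under "shuffle among ALL SHUFFLED replicas, FIXED replicas
# are fixed points".
marks = ["S", "F", "S"]
fixed = [i for i, m in enumerate(marks) if m == "F"]
allowed = [p for p in permutations(range(len(marks)))
           if all(p[i] == i for i in fixed)]
print(allowed)  # the identity, and the swap of the two SHUFFLED positions
```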

Maybe there are reasons to prefer only contiguous shuffling though?

What is the semantics on hardware that can replicate in parallel?

Parallel replication should always be legal -- there is no difference between F and S in the parallel setting.

Does replication order even matter, given that there could also be delays due to buffering?

It does matter in the following cases:

  • we are replicating to a single port
  • we don't expect any buffering

Reproducibility

We should be cautious about introducing randomness, since it makes testing/reproducibility harder.
That being said:

  • The randomness here is arguably very limited/contained. It does not affect input-output behavior of the switch, at least in the absence of buffer drops. It only affects relative timing between output packets.
  • Replication order is already "random" (unspecified) today.

Backward compatibility

Targets that don't support the new property should return an UNIMPLEMENTED error. (However legacy targets may just ignore it.)

@smolkaj
Member Author

smolkaj commented Feb 17, 2023

@jafingerhut do you have a sense if this generalizes reasonably well to other platforms? Any other thoughts on this? Thanks!

@jafingerhut
Contributor

Sorry I missed the 2023-Feb-17 P4 API work group meeting. I could have asked some of these questions live instead of in a comment.

Your description of shuffled replicas is:

    // The order of this replica is subject to uniform shuffling with all other
    // replicas within the same group participating in shuffling.

I am curious about your intended behavior of "subject to uniform shuffling".

Do you mean that you REQUIRE that an implementation must pseudo-randomly pick one of the N! permutations of the N shuffled replicas in a group, and pick among those N! with equal probability?

If so, why would you want to require such behavior?

If that isn't what you mean by the shuffled replicas, would it be legal in your intended meaning for an implementation to simply always create replicas in the precise order given, regardless of what their RelativeReplicationOrder property values were?

I have seen several switch implementations that implemented a fixed predictable order for creating replicas. I haven't seen one that randomly picked replication orders before. That sounds like extra implementation work for a benefit that I do not yet understand.

@jafingerhut
Contributor

Oh, and another detail question to follow up: are you thinking that shuffling happens per packet in the data plane, i.e. that each packet replicated via the group might have its copies created in a different order than other packets replicated using that same multicast group?

Or are you thinking that at the time you modify the replication list of a group, at that time the shuffling is performed, and then all copies made using that multicast group will have the same relative order from that point onwards, until and unless the multicast group's replication list is modified?

@smolkaj
Member Author

smolkaj commented Feb 24, 2023

Thanks for the comments, @jafingerhut.

Do you mean that you REQUIRE that an implementation must pseudo-randomly pick one of the N! permutations of the N shuffled replicas in a group, and pick among those N! with equal probability?

Correct. (Note that this is efficiently implementable, no need to represent the distribution over N! permutations explicitly.)
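For instance (a generic sketch, not tied to any particular switch implementation), the standard Fisher-Yates shuffle draws a uniformly random permutation of N elements in O(N) time, without ever materializing the N!-sized distribution:

```python
import random

def fisher_yates(items, rng=random):
    """Return a uniformly random permutation of items in O(N) time."""
    items = list(items)
    for i in range(len(items) - 1, 0, -1):
        j = rng.randrange(i + 1)  # uniform over positions 0..i
        items[i], items[j] = items[j], items[i]
    return items
```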

If so, why would you want to require such behavior?

In short, to achieve fairness. I would like to ensure that "in expectation" (on average, across many input packets that get replicated via the same multicast group), no replica is advantaged over another in the sense of being processed first.

I can make this more precise and formal if that would be helpful.

would it be legal in your intended meaning for an implementation to simply always create replicas in the precise order given, regardless of what their RelativeReplicationOrder property values were?

No -- this would be legal for UNSPECIFIED and FIXED only.

I haven't seen one that randomly picked replication orders before. That sounds like extra implementation work

+1. To mitigate this, I suggest we make supporting this optional.

are you thinking that shuffling means that for each packet to be replicated in the data plane, each one might create copies in a different order from the other packets to be replicated using that multicast group?

Yes, exactly. We likely have to weaken this a bit: instead of the shuffling happening for every single packet, it may be sufficient for it to happen roughly every second.

@jafingerhut
Contributor

I have seen switch implementations where an occasional reordering of the replication list can lead to a transient effect: copies going to the same (output port, replication id) might briefly get reordered, e.g. packets P1 and P2 arriving at the switch might leave the switch to one or more (output port, replication id) destinations in the order P2 followed by P1, due to internal implementation details of the multicast replication.

If that is acceptable to you, then the idea of having the shuffling happening every second or so sounds like something that could fairly straightforwardly be implemented in device driver software, rather than in the hardware itself. It could even be implemented in the controller software, albeit with a potentially high rate of P4Runtime API messages required to achieve it.

@smolkaj
Member Author

smolkaj commented Feb 25, 2023

Interesting point about reordering; I will double-check that one.

could fairly straightforwardly be implemented in device driver software, rather than in the hardware itself

Yes, in fact that is what we have in mind.

It could even be implemented in the controller software, albeit with a potentially high rate of P4Runtime API messages required to achieve it.

That is also something we have thought about, but we don't love the amount of controller-visible churn this would cause.

@smolkaj
Member Author

smolkaj commented Mar 17, 2023

Just a heads up that this may no longer be required on our end, so unless someone else would like to pursue this at this time, we can put the discussion on hold for now.
