-
Notifications
You must be signed in to change notification settings - Fork 538
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[muxorch] Using bulker to program routes/neighbors during switchover #3148
Conversation
Uses entity bulker to program routes and neighbors during mux switchover. Mux switchover performance suffers when switching over with a large number of neighbors on the mux port. This uses the optimization of programming the neighbors and routes in bulk to avoid sequentially programming each. Signed-off-by: Nikola Dancejic <[email protected]>
/azpw run |
/AzurePipelines run |
Azure Pipelines successfully started running 1 pipeline(s). |
orchagent/neighorch.cpp
Outdated
* @brief Creates a neighbor add entry and adds it to bulker. | ||
* @param ctx NeighborBulkContext contains neighbor information and list of object statuses. | ||
*/ | ||
bool NeighOrch::addBulkNeighbor(NeighborBulkContext& ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it use bulk neighbor only for mux scenario or even for the existing path? I think we should enable it only for mux flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's only for mux flow for this PR
@dgsudharsan for viz |
Signed-off-by: Nikola Dancejic <[email protected]>
Signed-off-by: Nikola Dancejic <[email protected]>
397f058
to
87281fb
Compare
bulk switchover feature chages the sai_api calls in switchover flow. Because of this mux_rollback_ut tests fail. This PR adds bulk operations to mock_sai_neighbor and mock_sai_api, and adjusts the expected function calls in mux_rollback_ut. Signed-off-by: Nikola Dancejic <[email protected]>
87281fb
to
f259759
Compare
Dependant on PR sonic-net/sonic-sairedis#1373 |
still pulling from wrong sairedis pipeline, running again: |
/azpw run |
/AzurePipelines run |
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Nikola Dancejic <[email protected]>
Last run was still failing because the correct syncd was not installed for some reason. Ran mux tests locally with syncd package pulled from https://dev.azure.com/mssonic/build/_build/results?buildId=559999&view=results and only had one test_multi_nexthop test fail due to a small bug I was able to resolve. Tested with latest commit and all tests are passing. ================================================================= -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html |
/azpw run |
/AzurePipelines run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azpw run |
/AzurePipelines run |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you post the dualttor sonic-mgmt test results?
orchagent/muxorch.cpp
Outdated
while (it != neighbors_.end()) | ||
{ | ||
NextHopKey nh_key = NextHopKey(it->first, alias_); | ||
if (update_rt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is update_rt
checked here? Its already done above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, will fix that
orchagent/muxorch.cpp
Outdated
} | ||
|
||
it = neighbors_.begin(); | ||
while (it != neighbors_.end()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've introduced three loops now in the new code. Can we optimize it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took the loop out from the comment below, I believe the two left over are necessary, we need the neighbors enabled before we update routes to point to their next hops
orchagent/muxorch.cpp
Outdated
NextHopKey nh_key = NextHopKey(it->first, alias_); | ||
if (update_rt) | ||
{ | ||
updateTunnelRoute(nh_key, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we move this to original section and do this loop only if Bulk api fails. So we don't have to do it everytime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you're saying, I think I just wanted to make sure the order was maintained. I'll fix that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's safe to take out this loop entirely then, if bulk api fails we will fall back, which will remove the tunnel routes anyways
orchagent/neighorch.cpp
Outdated
PortsOrch* ports_orch = gDirectory.get<PortsOrch*>(); | ||
auto vlan_ports = ports_orch->getAllVlans(); | ||
|
||
for (auto vlan_port: vlan_ports) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of repeated code from existing addNeighbor
. I think we should only add the bulk function as discussed offline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reduced the amount of repeated code in latest commit
- using existing add/removeNeighbor functions to do bulker operations. - Created addNeighborPost and removeNeighborPost - Created addRoutes and removeRoutes to deal with bulk operations - Created enableNeighbors and disableNeighbors to handle bulk switchover. Signed-off-by: Nikola Dancejic <[email protected]>
Signed-off-by: Nikola Dancejic <[email protected]>
Signed-off-by: Nikola Dancejic <[email protected]>
orchagent/muxorch.cpp
Outdated
} | ||
|
||
updateTunnelRoute(nh_key, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be inside the update_rt
check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I'll move that back
orchagent/muxorch.cpp
Outdated
return false; | ||
} | ||
|
||
gRouteBulker.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we clear this even if its returned false above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look at this, it's not needed since we call clear from removeRoutes
orchagent/neighorch.cpp
Outdated
@@ -1060,6 +1081,7 @@ bool NeighOrch::addNeighbor(const NeighborEntry &neighborEntry, const MacAddress | |||
} | |||
else if (isHwConfigured(neighborEntry)) | |||
{ | |||
ctx.set_neigh_attr_count = (int)neighbor_attrs.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed
orchagent/muxorch.cpp
Outdated
} | ||
|
||
updateTunnelRoute(nh_key, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I'll move that back
orchagent/muxorch.cpp
Outdated
return false; | ||
} | ||
|
||
gRouteBulker.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look at this, it's not needed since we call clear from removeRoutes
|
||
SWSS_LOG_INFO("Checking neighbor remove entry status %s on %s.", ip_address.to_string().c_str(), m_syncdNeighbors[neighborEntry].mac.to_string().c_str()); | ||
|
||
if (isHwConfigured(neighborEntry)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a loot at this, we still need the HwConfigured check since this isn't checked for in the disableNeighbors patch
Signed-off-by: Nikola Dancejic <[email protected]>
Uses entity bulker to program routes and neighbors during mux
switchover. Mux switchover performance suffers when switching over with
a large number of neighbors on the mux port. This uses the optimization
of programming the neighbors and routes in bulk to avoid sequentially
programming each.
What I did
Changed mux switchover logic to use neighbor and bulk switchover instead of programming neighbors sequentially.
Why I did it
Testing shows this improves switchover time by an average of 30% at 128 neighbors.
How I verified it
added a test to vstests to test functionality of code, and tested performance on a dualtor lab testbed.
Details if related
ado:#25072867
bulk_switchover_test_results.xlsx