Do not apply QoS mapping object until it is resolved #3163

Merged
merged 2 commits into from
May 31, 2024


stephenxs
Collaborator

@stephenxs stephenxs commented May 23, 2024

What I did

Do not apply the global DSCP to TC map to the switch object until the mapping object has been created.

Why I did it

Fix issue: if orchagent handles the tables in the following order, it fails at step 1 and the configuration is never applied.

  1. PORT_QOS_MAP|global object
  2. DSCP_TO_TC object
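To make the deferral concrete, here is a minimal C++ sketch of the "retry until resolved" idea, assuming a consumer loop that keeps entries it cannot process yet and re-attempts them on a later pass. Everything in it (`TaskStatus`, `Task`, `QosModel`, `processAll`, the map name `AZURE`) is hypothetical and illustrative; it is not orchagent's actual code or API, and the SAI calls are replaced by plain bookkeeping.

```cpp
#include <cassert>
#include <deque>
#include <set>
#include <string>
#include <vector>

// Hypothetical status codes: "done" vs "keep the entry and retry later".
enum class TaskStatus { Success, NeedRetry };

// One illustrative CONFIG_DB-style update.
struct Task
{
    std::string table;  // "DSCP_TO_TC_MAP" or "PORT_QOS_MAP|global"
    std::string value;  // map name created or referenced, e.g. "AZURE"
};

class QosModel
{
public:
    // Name of the DSCP->TC map bound to the switch ("" if none yet).
    std::string boundSwitchMap;

    TaskStatus handle(const Task &t)
    {
        if (t.table == "DSCP_TO_TC_MAP")
        {
            // Stands in for creating the QoS map object.
            createdMaps_.insert(t.value);
            return TaskStatus::Success;
        }
        // Any other table is treated as PORT_QOS_MAP|global in this sketch.
        if (createdMaps_.count(t.value) == 0)
        {
            // Referenced map not created yet: defer instead of failing.
            return TaskStatus::NeedRetry;
        }
        // Stands in for setting the global map on the switch object.
        boundSwitchMap = t.value;
        return TaskStatus::Success;
    }

private:
    std::set<std::string> createdMaps_;
};

// Drain tasks in arrival order; entries that need a retry stay queued and are
// re-attempted on the next pass, up to maxPasses passes.
std::string processAll(const std::vector<Task> &tasks, int maxPasses = 10)
{
    assert(maxPasses > 0);
    QosModel model;
    std::deque<Task> pending(tasks.begin(), tasks.end());
    for (int pass = 0; pass < maxPasses && !pending.empty(); ++pass)
    {
        std::deque<Task> retry;
        for (const Task &t : pending)
        {
            if (model.handle(t) == TaskStatus::NeedRetry)
            {
                retry.push_back(t);
            }
        }
        pending.swap(retry);
    }
    return model.boundSwitchMap;
}
```

With the problematic order from the description, `processAll({Task{"PORT_QOS_MAP|global", "AZURE"}, Task{"DSCP_TO_TC_MAP", "AZURE"}})` still ends with `AZURE` bound, because the global entry is retried after the map has been created. Returning a retry status rather than a failure is what keeps the entry from being lost.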

How I verified it

Mock test and manual test

Details if related

@stephenxs stephenxs force-pushed the fix-global-qos-map branch 2 times, most recently from 47f3fff to f0e7809 May 23, 2024 07:53
@stephenxs
Collaborator Author

The VS test failures are not related to my change.

test_chassis_system_lag_id_allocator_table_full failed (1 runs remaining out of 2).
	<class 'AssertionError'>
	LAG ID allocator table full error is not returned
assert '0' == '1'
  - 0
  + 1
	[<TracebackEntry /agent/_work/1/s/tests/test_virtual_chassis.py:695>]
test_chassis_system_lag_id_allocator_table_full failed; it passed 0 out of the required 1 times.
	<class 'AssertionError'>
	LAG ID allocator table full error is not returned
assert '0' == '1'
  - 0
  + 1
	[<TracebackEntry /agent/_work/1/s/tests/test_virtual_chassis.py:695>]
test_chassis_system_lag_id_allocator_del_id failed (1 runs remaining out of 2).
	<class 'AssertionError'>
	Unexpected number of keys: expected=1, received=2 (('oid:0x200000000098c', 'oid:0x200000000098b')), table="ASIC_STATE:SAI_OBJECT_TYPE_LAG"
	[<TracebackEntry /agent/_work/1/s/tests/test_virtual_chassis.py:778>, <TracebackEntry /agent/_work/1/s/tests/dvslib/dvs_database.py:402>]
test_chassis_system_lag_id_allocator_del_id failed; it passed 0 out of the required 1 times.
	<class 'AssertionError'>
	Unexpected number of keys: expected=1, received=0 ([]), table="ASIC_STATE:SAI_OBJECT_TYPE_LAG_MEMBER"
	[<TracebackEntry /agent/_work/1/s/tests/test_virtual_chassis.py:763>, <TracebackEntry /agent/_work/1/s/tests/dvslib/dvs_database.py:402>]
test_chassis_add_remove_ports passed 1 out of the required 1 times. Success!
test_voq_egress_queue_counter passed 1 out of the required 1 times. Success!
test_chassis_wred_profile_on_system_ports passed 1 out of the required 1 times. Success!
test_nonflaky_dummy passed 1 out of the required 1 times. Success!

@stephenxs stephenxs marked this pull request as ready for review May 29, 2024 23:04
@stephenxs stephenxs requested a review from prsunny as a code owner May 29, 2024 23:04
@stephenxs
Collaborator Author

The VS test failed while installing .NET Core.

Hit:6 http://security.ubuntu.com/ubuntu focal-security InRelease
Fetched 3632 B in 1s (5025 B/s)
Reading package lists...
+ sudo apt-get install -y dotnet-sdk-7.0
Reading package lists...
##[debug]Agent environment resources - Disk: / Available 21392.00 MB out of 29598.00 MB, Memory: Used 1360.00 MB out of 32114.00 MB, CPU: Usage 10.12%
Building dependency tree...
Reading state information...
dotnet-sdk-7.0 is already the newest version (7.0.410-1).
0 upgraded, 0 newly installed, 0 to remove and 9 not upgraded.
+ sudo dotnet tool install dotnet-reportgenerator-globaltool --tool-path /usr/bin
Tool 'dotnet-reportgenerator-globaltool' is already installed.

##[debug]Exit code 1 received from tool '/usr/bin/bash'
##[debug]STDIO streams have closed for tool '/usr/bin/bash'
##[error]Bash exited with code '1'.
##[debug]Processed: ##vso[task.issue type=error;source=TaskInternal;]Bash exited with code '1'.
##[debug]task result: Failed
##[debug]Processed: ##vso[task.complete result=Failed;done=true;]

@bingwang-ms
Contributor

The change LGTM. Just wondering how the issue was triggered (PORT_QOS_MAP|global before DSCP_TO_TC_MAP)?

@stephenxs
Collaborator Author

> The change LGTM. Just wondering how the issue was triggered (PORT_QOS_MAP|global before DSCP_TO_TC_MAP)?

Theoretically, the order in which orchagent receives the tables from Redis DB is not guaranteed.
We observed it only once in regression testing.
I believe it occurred by chance.
But it can be reproduced by configuring PORT_QOS_MAP first and then the QoS mapping, with a delay in between.
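The reproduction described above (PORT_QOS_MAP first, then the QoS mapping after a delay) can be modeled roughly as follows. This is a hypothetical C++ sketch, not a test from the repository: each inner vector stands for one batch of updates arriving after a delay, and the pre-fix behaviour is assumed to be "treat the unresolved reference as a failure and discard the entry", which is our reading of "the configuration will never be applied". `Update`, `replay`, and the map name `AZURE` are illustrative names.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One illustrative update: which table it came from and the map it names.
struct Update
{
    std::string table;  // "PORT_QOS_MAP|global" or "DSCP_TO_TC_MAP"
    std::string map;    // map created or referenced, e.g. "AZURE"
};

// Returns the map bound to the switch after all batches are processed.
// deferUnresolved == false models the reported bug (entry discarded on
// failure); deferUnresolved == true keeps it for a later pass.
std::string replay(const std::vector<std::vector<Update>> &batches, bool deferUnresolved)
{
    assert(!batches.empty());
    std::set<std::string> created;
    std::string bound;
    std::vector<Update> deferred;

    for (const auto &batch : batches)
    {
        // Previously deferred entries are retried along with the new batch.
        std::vector<Update> work = deferred;
        work.insert(work.end(), batch.begin(), batch.end());
        deferred.clear();

        // Keep making passes while at least one entry was handled.
        bool progress = true;
        while (progress && !work.empty())
        {
            progress = false;
            std::vector<Update> next;
            for (const Update &u : work)
            {
                if (u.table == "DSCP_TO_TC_MAP")
                {
                    created.insert(u.map);
                    progress = true;
                }
                else if (created.count(u.map) != 0)
                {
                    bound = u.map;
                    progress = true;
                }
                else if (deferUnresolved)
                {
                    next.push_back(u);  // unresolved: retry later
                }
                // else: bug model, the failed entry is silently dropped
            }
            work.swap(next);
        }
        deferred = work;  // whatever is still unresolved waits for the next batch
    }
    return bound;
}
```

With the global entry arriving first, `replay(..., false)` leaves nothing bound while `replay(..., true)` binds `AZURE`; in the normal order both modes bind the map.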

@prsunny prsunny merged commit 6568193 into sonic-net:master May 31, 2024
17 checks passed
@stephenxs stephenxs deleted the fix-global-qos-map branch June 1, 2024 01:19
mssonicbld pushed a commit to mssonicbld/sonic-swss that referenced this pull request Jun 4, 2024
…ed (sonic-net#3163)
@mssonicbld
Collaborator

Cherry-pick PR to 202311: #3184

mssonicbld pushed a commit that referenced this pull request Jun 4, 2024
…ed (#3163)

7 participants