Unavailable UPS agent causes monitor crash in HWP Supervisor #770

Open
BrianJKoopman opened this issue Oct 7, 2024 · 1 comment
Labels: agent: hwp supervisor, bug (Something isn't working)

Comments


BrianJKoopman commented Oct 7, 2024

I was helping satp2 try to recover their HWP system this morning and found the supervisor agent in this state:

2024-09-26T17:09:50+0000 startup-op: launching monitor
2024-09-26T17:09:50+0000 start called for monitor
2024-09-26T17:09:50+0000 monitor:0 Status is now "starting".
2024-09-26T17:09:50+0000 startup-op: launching spin_control
2024-09-26T17:09:50+0000 start called for spin_control
2024-09-26T17:09:50+0000 spin_control:1 Status is now "starting".
2024-09-26T17:09:50+0000 monitor:0 Status is now "running".
2024-09-26T17:09:50+0000 spin_control:1 Status is now "running".
2024-09-26T17:09:55+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp2.power-ups-az.ops>'], {}]
2024-09-26T17:09:55+0000 Could not connect to client: power-ups-az
2024-09-26T17:09:55+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp2.power-iboot-hwp-2.ops>'], {}]
2024-09-26T17:09:56+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp2.power-iboot-hwp-2.ops>'], {}]
2024-09-26T17:09:56+0000 Error getting status: [0, 0, 0, 0, 'wamp.error.no_such_procedure', ['no callee registered for procedure <satp2.acu.ops>'], {}]
2024-09-26T17:09:56+0000 Could not connect to client: power-iboot-hwp-2
2024-09-26T17:09:56+0000 monitor:0 CRASH: [Failure instance: Traceback: <class 'ValueError'>: Could not find upsOutputSource OID
/usr/lib/python3.10/threading.py:1016:_bootstrap_inner
/usr/lib/python3.10/threading.py:953:run
/opt/venv/lib/python3.10/site-packages/twisted/_threads/_threadworker.py:49:work
/opt/venv/lib/python3.10/site-packages/twisted/_threads/_team.py:192:doWork
--- <exception caught here> ---
/opt/venv/lib/python3.10/site-packages/twisted/python/threadpool.py:269:inContext
/opt/venv/lib/python3.10/site-packages/twisted/python/threadpool.py:285:<lambda>
/opt/venv/lib/python3.10/site-packages/twisted/python/context.py:117:callWithContext
/opt/venv/lib/python3.10/site-packages/twisted/python/context.py:82:callWithContext
/opt/venv/lib/python3.10/site-packages/ocs/ocs_agent.py:984:_running_wrapper
/opt/venv/lib/python3.10/site-packages/socs/agents/hwp_supervisor/agent.py:1374:monitor
/opt/venv/lib/python3.10/site-packages/socs/agents/hwp_supervisor/agent.py:442:update_ups_state
]
2024-09-26T17:09:56+0000 monitor:0 Status is now "done".

It seems it wasn't able to connect to any of the clients, so when monitor goes to grab state info in update_ups_state it hits this raise, which it doesn't handle:

raise ValueError('Could not find upsOutputSource OID')

EDIT: This was on socs image: v0.5.1-22-g7d2f158-dev

BrianJKoopman added the "agent: hwp supervisor" and "bug" labels on Oct 7, 2024

jlashner commented Oct 8, 2024

Thanks for this. The correct behavior is probably to catch this in the monitor_state process and mark it as degraded... and also raise a flag to make sure none of the spin-up commands can run.
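A minimal sketch of what "catch it and mark it as degraded" could look like. This is not the socs implementation: `UPSState`, its fields, and the `read_output_source` callable are hypothetical stand-ins. Only the `ValueError` message is taken from the traceback above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class UPSState:
    """Hypothetical stand-in for the supervisor's UPS state (not the socs class)."""
    output_source: Optional[str] = None
    degraded: bool = False            # True when the most recent update failed
    last_error: Optional[str] = None  # message from the most recent failure


def update_ups_state(state: UPSState,
                     read_output_source: Callable[[], str]) -> None:
    """Refresh the UPS state; on a missing OID, flag it instead of raising."""
    try:
        state.output_source = read_output_source()
    except ValueError as err:
        # Keep the monitor loop alive and record why the data is stale.
        state.output_source = None
        state.degraded = True
        state.last_error = str(err)
    else:
        state.degraded = False
        state.last_error = None


def missing_oid() -> str:
    # Mimics the failure seen when the UPS agent is unavailable.
    raise ValueError('Could not find upsOutputSource OID')


state = UPSState()
update_ups_state(state, missing_oid)  # does not raise
```

With this shape, the caller can check `state.degraded` on each pass to report a degraded status and to block spin-up commands, and the flag clears by itself once the UPS agent comes back.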

I think it might make sense to move the safety check logic from the control-update function into properties of the HWPState object, such as spin_up_safe and grip_safe, which would check internal state variables like this and return a bool. (I don't think UPS state is currently checked anywhere before.)
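A rough sketch of the property idea, under stated assumptions. `HWPStateSketch` and all of its fields are hypothetical, and the safety conditions (including the 0.01 Hz threshold) are placeholders rather than the real supervisor logic. Only the property names `spin_up_safe` and `grip_safe` come from the comment above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HWPStateSketch:
    """Hypothetical, simplified stand-in for HWPState; field names are assumed."""
    ups_connected: bool = False
    ups_on_battery: Optional[bool] = None    # None means "unknown"
    rotation_freq_hz: Optional[float] = None  # None means "unknown"

    @property
    def ups_ok(self) -> bool:
        # Unknown or unreachable UPS state counts as not OK.
        return self.ups_connected and self.ups_on_battery is False

    @property
    def spin_up_safe(self) -> bool:
        # Assumption: spinning up requires a reachable UPS on mains power.
        return self.ups_ok

    @property
    def grip_safe(self) -> bool:
        # Assumption: gripping requires a rotor known to be stopped.
        # The 0.01 Hz threshold is an arbitrary placeholder.
        freq = self.rotation_freq_hz
        return freq is not None and abs(freq) < 0.01


unknown = HWPStateSketch()
healthy = HWPStateSketch(ups_connected=True, ups_on_battery=False,
                         rotation_freq_hz=0.0)
```

Because every property defaults to `False` when its inputs are unknown, an unavailable UPS agent would block spin-up commands rather than crash the monitor.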
