Sensible log behavior when redis is unavailable #15466

AlanCoding · 2024-08-26T02:14:37Z

SUMMARY

This picks up on the "redis" part of a prior PR #12698 and goes further.

That prior PR was too unfocused, trying to solve both the redis and receptor problems. Backing up, why am I looking at this problem in the first place? Because it gets in my way (personally) trying to diagnose other problems. Because when I look at logs, those logs are swamped due to 2 main reasons:

Problems with install ordering mean that dispatcher and other stuff runs before receptor is fully running and available
When redis goes down (which we do intentionally sometimes) various services absolutely scream in the logs about this fact

This PR is only concerned with the 2nd bullet point.

So when you get the log file you want (finally, after wading through the rest of the SOS report), you find that 95% of that log is stuff you don't want. Even worse, the noise is all "Traceback:" entries... which isn't great when what you're looking for is a stack trace of that format.

With that segway, here's a demo of the log behavior after taking redis down:

tools_redis_1     | 1:M 26 Aug 2024 02:03:54.472 # Redis is now ready to exit, bye bye...
tools_awx_1       | 2024-08-26 02:03:54,475 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Connection closed by server.
tools_awx_1       | 2024-08-26 02:03:54,474 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Connection closed by server.
tools_awx_1       | 2024-08-26 02:03:54,474 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Connection closed by server.
tools_awx_1       | 2024-08-26 02:03:54,474 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Connection closed by server.
tools_redis_1 exited with code 0
tools_awx_1       | 2024-08-26 02:03:57,650 WARNING  [-] awx.analytics.broadcast_websocket Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:03:59,498 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:03:59,498 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:03:59,498 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:03:59,502 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:02,707 WARNING  [-] awx.analytics.broadcast_websocket Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:02,894 WARNING  [-] awx.main.dispatch Redis connection error saving status data:
tools_awx_1       | Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | missed data:
tools_awx_1       | Recorded at: 2024-08-26 02:04:02 UTC 
tools_awx_1       | awx-1[pid:3477] workers total=4 min=4 max=631 
tools_awx_1       | .  worker[pid:3484] sent=0 qsize=0 rss=152.594MB [IDLE]
tools_awx_1       | .  worker[pid:3485] sent=0 qsize=0 rss=152.621MB [IDLE]
tools_awx_1       | .  worker[pid:3486] sent=0 qsize=0 rss=152.625MB [IDLE]
tools_awx_1       | .  worker[pid:3487] sent=0 qsize=0 rss=152.633MB [IDLE]
tools_awx_1       | 2024-08-26 02:04:04,524 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:04,523 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:04,530 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:04,534 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:07,762 WARNING  [-] awx.analytics.broadcast_websocket Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:08,819 WARNING  [-] awx.main.dispatch Redis connection error saving status data:
tools_awx_1       | Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | missed data:
tools_awx_1       | Recorded at: 2024-08-26 02:04:08 UTC 
tools_awx_1       | awx-1[pid:3477] workers total=4 min=4 max=631 
tools_awx_1       | .  worker[pid:3484] sent=0 qsize=0 rss=152.594MB [IDLE]
tools_awx_1       | .  worker[pid:3485] sent=0 qsize=0 rss=152.621MB [IDLE]
tools_awx_1       | .  worker[pid:3486] sent=0 qsize=0 rss=152.625MB [IDLE]
tools_awx_1       | .  worker[pid:3487] sent=0 qsize=0 rss=152.633MB [IDLE]
tools_awx_1       | 2024-08-26 02:04:08,869 WARNING  [-] awx.main.scheduler Redis connection error saving metrics for task_manager, error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:09,546 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:09,550 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:09,550 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:09,558 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:09,853 WARNING  [-] awx.main.scheduler Redis connection error saving metrics for dependency_manager, error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:12,811 WARNING  [-] awx.analytics.broadcast_websocket Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:14,569 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:14,572 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:14,572 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:14,583 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:14,818 WARNING  [-] awx.main.dispatch Redis connection error saving status data:
tools_awx_1       | Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | missed data:
tools_awx_1       | Recorded at: 2024-08-26 02:04:14 UTC 
tools_awx_1       | awx-1[pid:3477] workers total=4 min=4 max=631 
tools_awx_1       | .  worker[pid:3484] sent=1 finished=1 qsize=0 rss=159.254MB [IDLE]
tools_awx_1       | .  worker[pid:3485] sent=0 qsize=0 rss=152.621MB [IDLE]
tools_awx_1       | .  worker[pid:3486] sent=1 qsize=1 rss=159.078MB
tools_awx_1       |      - running cf27e9ed-2fd6-4a32-b675-ae1403e1ce0e dependency_manager(*[])
tools_awx_1       | .  worker[pid:3487] sent=0 qsize=0 rss=152.633MB [IDLE]
tools_awx_1       | 2024-08-26 02:04:14,823 WARNING  [-] awx.main.analytics Connection error in send_metrics: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:14,823 WARNING  [-] awx.main.analytics Connection error in send_metrics: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:17,856 WARNING  [-] awx.analytics.broadcast_websocket Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:19,593 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:19,595 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:19,595 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:19,609 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:22,823 WARNING  [-] awx.main.dispatch Redis connection error saving status data:
tools_awx_1       | Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | missed data:
tools_awx_1       | Recorded at: 2024-08-26 02:04:22 UTC 
tools_awx_1       | awx-1[pid:3477] workers total=4 min=4 max=631 
tools_awx_1       | .  worker[pid:3484] sent=1 finished=1 qsize=0 rss=159.254MB [IDLE]
tools_awx_1       | .  worker[pid:3485] sent=1 qsize=1 rss=158.738MB
tools_awx_1       |      - running 4e5206e4-31f0-48ad-92fb-4957ed1f5d54 awx_periodic_scheduler(*[])
tools_awx_1       | .  worker[pid:3486] sent=2 finished=1 qsize=1 rss=159.078MB
tools_awx_1       |      - running 8d600d28-831e-4306-8b63-ffb33e491cba send_subsystem_metrics(*[])
tools_awx_1       | .  worker[pid:3487] sent=0 qsize=0 rss=152.633MB [IDLE]
tools_awx_1       | 2024-08-26 02:04:22,833 WARNING  [-] awx.main.dispatch Redis connection error saving dispatcher metrics, error:
tools_awx_1       | Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:22,910 WARNING  [-] awx.analytics.broadcast_websocket Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:24,620 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:24,620 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:24,620 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:24,630 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:27,939 WARNING  [-] awx.analytics.broadcast_websocket Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:28,820 WARNING  [-] awx.main.dispatch Redis connection error saving status data:
tools_awx_1       | Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | missed data:
tools_awx_1       | Recorded at: 2024-08-26 02:04:28 UTC 
tools_awx_1       | awx-1[pid:3477] workers total=4 min=4 max=631 
tools_awx_1       | .  worker[pid:3484] sent=1 finished=1 qsize=0 rss=159.254MB [IDLE]
tools_awx_1       | .  worker[pid:3485] sent=1 qsize=1 rss=158.738MB
tools_awx_1       |      - running 4e5206e4-31f0-48ad-92fb-4957ed1f5d54 awx_periodic_scheduler(*[])
tools_awx_1       | .  worker[pid:3486] sent=2 finished=1 qsize=1 rss=159.078MB
tools_awx_1       |      - running 8d600d28-831e-4306-8b63-ffb33e491cba send_subsystem_metrics(*[])
tools_awx_1       | .  worker[pid:3487] sent=0 qsize=0 rss=152.633MB [IDLE]
tools_awx_1       | 2024-08-26 02:04:28,866 WARNING  [-] awx.main.scheduler Redis connection error saving metrics for task_manager, error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:29,650 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:29,655 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:29,653 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:29,651 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:29,836 WARNING  [-] awx.main.scheduler Redis connection error saving metrics for dependency_manager, error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:32,995 WARNING  [-] awx.analytics.broadcast_websocket Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:34,670 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:34,673 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:34,678 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:34,679 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:34,823 WARNING  [-] awx.main.dispatch Redis connection error saving status data:
tools_awx_1       | Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | missed data:
tools_awx_1       | Recorded at: 2024-08-26 02:04:34 UTC 
tools_awx_1       | awx-1[pid:3477] workers total=4 min=4 max=631 
tools_awx_1       | .  worker[pid:3484] sent=1 finished=1 qsize=0 rss=159.254MB [IDLE]
tools_awx_1       | .  worker[pid:3485] sent=1 finished=1 qsize=0 rss=158.738MB [IDLE]
tools_awx_1       | .  worker[pid:3486] sent=4 finished=3 qsize=1 rss=159.488MB
tools_awx_1       |      - running c5ec804f-f846-4226-aa48-84104e56d129 dependency_manager(*[])
tools_awx_1       | .  worker[pid:3487] sent=0 qsize=0 rss=152.633MB [IDLE]
tools_awx_1       | 2024-08-26 02:04:34,876 WARNING  [-] awx.main.analytics Connection error in send_metrics: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:34,877 WARNING  [-] awx.main.analytics Connection error in send_metrics: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:38,047 WARNING  [-] awx.analytics.broadcast_websocket Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:39,703 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:39,702 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:39,703 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:39,711 ERROR    [-] awx.main.commands.run_callback_receiver redis connection error: Error 2 connecting to unix socket: /var/run/redis/redis.sock. No such file or directory.
tools_awx_1       | 2024-08-26 02:04:45,860 WARNING  [-] awx.main.tasks.system Heartbeat skew - interval=62.9535, expected=60

This is noisy on a certain point, but that is actually an interesting point.

The dispatcher "statistics" are used for the --status command. So if we can't get the statistics to stash into redis... what do we do? Before, we would drop that data on the floor and then log a giant stack trace. But, since redis connection errors are a very well-known quality, better to show the details of the error, and then print the data that we're dropping. Right? There's a non-zero chance that we have a pool management bug while at the same time hitting this, confounding debugging even more.

ISSUE TYPE

Bug, Docs Fix or other nominal change

COMPONENT NAME

API

AlanCoding · 2024-08-28T14:25:35Z

started YOLO 10231

AlanCoding · 2024-08-28T19:09:14Z

From running a deployment, it looks like there's another loop not yet solved.

2024-08-28 15:00:04,792 INFO     [-] awx.main.dispatch Running worker dispatcher listening to queues ['tower_broadcast_all', 'tower_settings_change', 'automationcontroller-0']
2024-08-28 15:00:04,833 ERROR    [-] awx.main.dispatch Encountered unhandled error in dispatcher main loop
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/worker/base.py", line 244, in run
    self.worker.on_start()
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/dispatch/worker/task.py", line 141, in on_start
    dispatch_startup()
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/tasks/system.py", line 117, in dispatch_startup
    cluster_node_heartbeat()
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/tasks/system.py", line 575, in cluster_node_heartbeat
    inspect_execution_and_hop_nodes(instance_list)
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/tasks/system.py", line 514, in inspect_execution_and_hop_nodes
    mesh_status = ctl.simple_command('status')
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/receptorctl/socket_interface.py", line 81, in simple_command
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/receptorctl/socket_interface.py", line 99, in connect
    raise ValueError(f"Socket path does not exist: {path}")
ValueError: Socket path does not exist: /var/run/receptor/receptor.sock

I don't really want this either, so it'll probably take another commit. Let's see... it could be due to starting the dispatcher without redis available.

AlanCoding · 2024-08-28T20:16:23Z

Wait, never mind about my last error. That's receptor, not redis. Not in scope here.

AlanCoding · 2024-08-28T20:20:12Z

YOLO 10245

sonarcloud · 2024-09-23T15:30:56Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

github-actions bot added the component:api label Aug 26, 2024

AlanCoding requested a review from kdelee August 26, 2024 02:15

AlanCoding force-pushed the lost_redis branch from aa65627 to 6d3b3d9 Compare August 28, 2024 20:09

AlanCoding marked this pull request as ready for review August 28, 2024 20:19

AlanCoding requested a review from chrismeyersfsu September 23, 2024 15:28

AlanCoding assigned TheRealHaoLiu Sep 23, 2024

AlanCoding added 2 commits September 23, 2024 11:29

Sensible log behavior when redis is unavailable

1234ef2

Consistent behavior with dispatcher and callback

d9b6e0f

AlanCoding force-pushed the lost_redis branch from 3af35db to d9b6e0f Compare September 23, 2024 15:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sensible log behavior when redis is unavailable #15466

Sensible log behavior when redis is unavailable #15466

AlanCoding commented Aug 26, 2024 •

edited

Loading

AlanCoding commented Aug 28, 2024

AlanCoding commented Aug 28, 2024

AlanCoding commented Aug 28, 2024

AlanCoding commented Aug 28, 2024

sonarcloud bot commented Sep 23, 2024

Sensible log behavior when redis is unavailable #15466

Are you sure you want to change the base?

Sensible log behavior when redis is unavailable #15466

Conversation

AlanCoding commented Aug 26, 2024 • edited Loading

SUMMARY

ISSUE TYPE

COMPONENT NAME

AlanCoding commented Aug 28, 2024

AlanCoding commented Aug 28, 2024

AlanCoding commented Aug 28, 2024

AlanCoding commented Aug 28, 2024

sonarcloud bot commented Sep 23, 2024

Quality Gate passed

AlanCoding commented Aug 26, 2024 •

edited

Loading