Troubleshooting

A key skill for anyone doing operations is the ability to troubleshoot problems successfully.

Here we will go over a few steps you can take to quickly narrow problems down to their causes.

What is broken? First think about how it works in the most basic terms. Then build on that a list of the things that can break. From those, pick the ones that could cause the symptoms you see.

Example:
You cannot ping a server.
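The narrowing-down step for this example can be sketched in code. The candidate causes and observation names below are hypothetical, chosen only for illustration; a real list would come from your own understanding of how the system works.

```python
# Hypothetical candidate causes for "cannot ping a server", each paired with
# the observations we would expect to make if that cause were the real one.
CANDIDATES = {
    "server is powered off": {"ping_by_ip": False, "ssh": False},
    "ICMP blocked by a firewall": {"ping_by_ip": False, "ssh": True},
    "DNS resolution failure": {"ping_by_name": False, "ping_by_ip": True},
}


def consistent_causes(observed):
    """Return the candidate causes that no observation contradicts."""
    remaining = []
    for cause, expected in CANDIDATES.items():
        # A cause is ruled out only when something we actually observed
        # disagrees with what that cause predicts.
        contradicted = any(
            key in observed and observed[key] != value
            for key, value in expected.items()
        )
        if not contradicted:
            remaining.append(cause)
    return remaining


if __name__ == "__main__":
    # Ping by IP fails, but SSH to the same address works.
    print(consistent_causes({"ping_by_ip": False, "ssh": True}))
```

Each new observation removes the causes it contradicts, so the most useful next check is usually the one that splits the remaining candidates.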

You have a variety of tools at your fingertips to help work out the cause of a problem. Over time you will expand what is in your toolbelt, but to start with you must know how to use each of these:

  • top, vmstat, iostat, systat, sar, mpstat: These help you see the current state of the system. What is running? What is using CPU and memory? Is the disk being heavily used? There is a lot of information, and knowing how these tools work will help you pick out the bits you should focus on.
  • tcpdump, ngrep: If you suspect you have a network-related problem, tcpdump and ngrep can help you confirm it.
  • Eliminating variables
  • What changed recently?
  • Could any of the symptoms be red herrings?
  • Common culprits (is it plugged in?)
  • Look through your logs
  • Communicating during an outage
  • 'Talking Out-Loud' (IRC/GroupChat)
  • Communicating after an outage (postmortems)
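
As a rough illustration of the kind of numbers the system-state tools in the first bullet report, here is a minimal Python sketch. It is not a replacement for top or vmstat; the function name `system_snapshot` is made up for this example, and `os.getloadavg` is only available on Unix-like systems.

```python
import os
import shutil


def system_snapshot(path="/"):
    """Collect a few coarse health numbers for the local machine."""
    load1, load5, load15 = os.getloadavg()  # Unix only
    disk = shutil.disk_usage(path)
    used_pct = round(100 * disk.used / disk.total, 1) if disk.total else 0.0
    return {
        "cpu_count": os.cpu_count(),
        "load_1min": load1,
        "load_5min": load5,
        "load_15min": load15,
        "disk_used_pct": used_pct,
    }


if __name__ == "__main__":
    for name, value in system_snapshot().items():
        print(f"{name}: {value}")
```

A load average well above the CPU count, or a disk close to 100% used, is a hint about where to point the more detailed tools next.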

Problems can often be traced back to recent changes. A problem that starts around the time of a change is rarely a coincidence.
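
One way to look for recent changes on a single machine is to list files that were modified within some time window. The sketch below assumes that file modification times are a useful signal, which will not catch every kind of change (a deploy system or change log is usually a better source); the function name and the default window are illustrative.

```python
import os
import time


def recently_modified(root, within_seconds=24 * 3600, now=None):
    """Return (mtime, path) pairs for files under root changed recently, newest first."""
    now = time.time() if now is None else now
    cutoff = now - within_seconds
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime >= cutoff:
                hits.append((mtime, path))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    # Example: configuration files touched in the last day.
    for mtime, path in recently_modified("/etc"):
        print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime)), path)
```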

Over time you may find that a small set of errors cause a large portion of the problems you have to fix. Let's cause some of these problems and see how we identify and fix them.

(finding large files, and also finding deleted-but-open files)
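
A minimal sketch of the first half of that note, finding the largest files under a directory, might look like this in Python. The function name and the default count are illustrative.

```python
import heapq
import os


def largest_files(root, count=10):
    """Return up to `count` (size_bytes, path) pairs under root, biggest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so a link is counted by its
                # own size rather than the size of its target.
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            sizes.append((st.st_size, path))
    return heapq.nlargest(count, sizes)


if __name__ == "__main__":
    for size, path in largest_files("/var/log", count=5):
        print(f"{size:>12} {path}")
```

Deleted-but-open files have no directory entry, so a walk like this cannot see them. On systems with lsof installed, `lsof +L1` is commonly used to list open files whose link count has dropped to zero.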

Manifests as "disk full" when df claims you have disk space free.
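
The text does not name the cause here. One common explanation for this symptom, assumed in the sketch below, is that the filesystem has run out of inodes while still having free blocks. Comparing the two numbers side by side makes that case easy to spot; `os.statvfs` is Unix only, and the function name is illustrative.

```python
import os


def space_and_inodes(path="/"):
    """Report block and inode usage for the filesystem holding path."""
    st = os.statvfs(path)  # Unix only

    def pct_used(total, free):
        # Some filesystems report 0 total inodes; avoid dividing by zero.
        return round(100 * (total - free) / total, 1) if total else None

    return {
        "blocks_used_pct": pct_used(st.f_blocks, st.f_bfree),
        "inodes_used_pct": pct_used(st.f_files, st.f_ffree),
    }


if __name__ == "__main__":
    print(space_and_inodes("/"))
```

If the inode percentage is near 100 while the block percentage is not, look for directories holding very large numbers of small files.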

Being able to work successfully through a crisis is crucial to being a good operations person. For some it is a personality trait, but it can certainly be learned and is almost a requirement for many employers.

  • Situational Awareness (Mica Endsley)
  • Decision Making (NDM and RPD) - Klein
  • Communication (Common ground, Basic Compact, Assertiveness)
  • Team Working (Joint Activity, fundamentals of coordination and collaboration)
  • Leadership (before, during, after incidents) (Weick, Sutcliffe work on HROs)
  • Managing Stress
  • Coping with Fatigue
  • Training and Assessment Methods
  • Cognitive Psychology concerns (escalating scenarios, team-based troubleshooting)