-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathmainpage.doc
362 lines (277 loc) · 15 KB
/
mainpage.doc
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
/**
@mainpage
@section whatis What is Upkeeper.
Upkeeper is a process control, monitor, with auditing system.
Upkeeper consists of 4 logical components
- Buddy - The individual process monitor.
- Controller - The audit and control service.
- Libupkeeper - The client library.
- Clients - tools and utilities to interact with upkeeper.
upkeeper's is almost between DJB's daemontools, and freedesktop's systemd, in terms of functionality and capabilities. However, it adds the additional
capability of recording state changes of monitored processes to a data store, which can then be interogated, preferably via the API, for the purpose
troubleshooting, event correlation, metrics collection, trending, and analysis.
@section howitworks How the system works:
A high-level view of the interaction between components is as follows:
@dot
digraph highlevel {
rankdir=TD;
subgraph cluster_bc {
node [shape="record",style="filled",color="white"];
size="2,1";
style="filled";
color="lightgrey";
{
rank = same;
buddy[label = "Buddy"];
ctrl[label = "Controller"];
}
"ctrl" -> "buddy" -> "ctrl" [minlen=5];
}
lib[label = "Libupkeeper"];
client[label = "client", shape="record"];
"client" -> "lib";
"lib" -> "ctrl";
"ctrl" -> "lib";
"lib" -> "client";
}
@enddot
Controller is responsible for maintaining the configuration of buddy process, establishing the environment in which buddy runs, and invoking buddies.
Once a buddy has been invoked, it runs autonymously until terminated. Excessive care has been taken to ensure that a buddy will not terminate under normal, and even some rather extraordinary, circumstances.
Buddy is a simplistic-by-design component that primarily just sits in a loop and waits for the process its monitoring to change terminate, or otherwise catch a signal.
To communicate status changes with controller, buddy listens to a socket, which controller polls at a configurable interval. If controller dies for some reason, buddy has a user-configurably sized ringbuffer which will backlog status events, and report them once a controller becomes available. If a ringbuffer becomes >= 75% full, or if a buddy is attempting to terminate while having events in its ringbuffer, the buddy will attempt to contact a controller on the controller's domain socket, as a last-resort.
The environment buddy runs in is a directory structure with files and symlinks that buddy uses to invoke actions, and log output.
@dot
digraph tree
{
rankdir=LR;
ranksep=.1;
node [fontsize=11, height=.1, width=2.5 ];
edge [fontsize=11, arrowsize=.5];
buddy [label="buddy/\l" shape=record, fontcolor=blue, width=.75 ];
p_actions [ shape=point, width=.03, width=.03 ];
a_actions [ label="actions/\l", shape=record, fontcolor=blue, width=.75 ];
p_actions -> a_actions;
p_actions_start [shape=point, width=.03];
a_actions_start [ label=" start -> ../scripts/start\l", shape=note, style=filled, fillcolor="#a0e0a0" ];
p_actions_start -> a_actions_start;
p_actions_stop [shape=point, width=.03];
a_actions_stop [ label=" stop -> ../scripts/stop\l", shape=note, style=filled, fillcolor="#a0e0a0" ];
p_actions_stop -> a_actions_stop;
p_actions_reload [shape=point, width=.03];
a_actions_reload [ label=" reload -> ../scripts/reload\l", shape=note, style=filled, fillcolor="#a0e0a0" ];
p_actions_reload -> a_actions_reload;
p_actions_custom_action_01 [shape=point, width=.03];
a_actions_custom_action_01 [ label=" 01 -> ../scripts/<custom_action_01>\l", shape=note, style=filled, fillcolor="#a0e0a0" ];
p_actions_custom_action_01 -> a_actions_custom_action_01;
p_actions_custom_action_n [shape=point, width=.03];
a_actions_custom_action_n [ label=" ...\l", shape=note, style=filled, fillcolor="#a0e0a0" ];
p_actions_custom_action_n -> a_actions_custom_action_n;
p_actions_custom_action_32 [shape=point, width=.03];
a_actions_custom_action_32 [ label=" 32 -> ../scripts/<custom_action_32>\l", shape=note, style=filled, fillcolor="#a0e0a0" ];
p_actions_custom_action_32 -> a_actions_custom_action_32;
{
rank=same;
a_actions -> p_actions_start -> p_actions_stop -> p_actions_reload -> p_actions_custom_action_01 -> p_actions_custom_action_n -> p_actions_custom_action_32 [arrowhead=none, len=.1];
}
p_scripts [ shape=point, width=.03 ];
a_scripts [ label=" scripts/\l" shape=record, fontcolor=blue, width=.75 ];
p_scripts -> a_scripts;
p_scripts_start [shape=point, width=.03];
a_scripts_start [ label=" start\l" shape=note, style=filled fillcolor=lightblue ];
p_scripts_start -> a_scripts_start;
p_scripts_stop [shape=point, width=.03];
a_scripts_stop [ label=" stop\l" shape=note, style=filled fillcolor=lightblue ];
p_scripts_stop -> a_scripts_stop;
p_scripts_reload [shape=point, width=.03];
a_scripts_reload [ label=" reload\l" shape=note, style=filled fillcolor=lightblue ];
p_scripts_reload -> a_scripts_reload;
p_scripts_custom_action_01 [shape=point, width=.03];
a_scripts_custom_action_01 [ label=" <custom_action_01>\l" shape=note, style=filled fillcolor=lightblue ];
p_scripts_custom_action_01 -> a_scripts_custom_action_01;
p_scripts_custom_action_n [shape=point, width=.03];
a_scripts_custom_action_n [ label=" ...\l" shape=note, style=filled fillcolor=lightblue ];
p_scripts_custom_action_n -> a_scripts_custom_action_n;
p_scripts_custom_action_32 [shape=point, width=.03];
a_scripts_custom_action_32 [ label=" <custom_action_32>\l" shape=note, style=filled fillcolor=lightblue ];
p_scripts_custom_action_32 -> a_scripts_custom_action_32;
{
rank=same;
a_scripts -> p_scripts_start -> p_scripts_stop -> p_scripts_reload -> p_scripts_custom_action_01 -> p_scripts_custom_action_n -> p_scripts_custom_action_32 [arrowhead=none];
}
p_log [ shape=point, width=.03 ];
a_log [ label=" log/\l" shape=record, width=.75, fontcolor=blue ];
p_log -> a_log;
p_log_stderr_pipe [shape=point, width=.03];
a_log_stderr_pipe [ label=" stderr_pipe\l" shape=note, style=filled fillcolor=lightblue ];
p_log_stderr_pipe -> a_log_stderr_pipe;
p_log_stdout_pipe [shape=point, width=.03];
a_log_stdout_pipe [ label=" stdout_pipe\l" shape=note, style=filled fillcolor=lightblue ];
p_log_stdout_pipe -> a_log_stdout_pipe;
p_log_stderr [shape=point, width=.03];
a_log_stderr [ label=" stderr\l" shape=note, style=filled fillcolor=lightblue ];
p_log_stderr -> a_log_stderr;
p_log_stdout [shape=point, width=.03];
a_log_stdout [ label=" stdout\l" shape=note, style=filled fillcolor=lightblue ];
p_log_stdout -> a_log_stdout;
p_log_buddy [shape=point, width=.03];
a_log_buddy [ label=" buddy\l" shape=note, style=filled fillcolor=lightblue ];
p_log_buddy -> a_log_buddy;
{
rank=same;
a_log -> p_log_stderr_pipe -> p_log_stdout_pipe -> p_log_stderr -> p_log_stdout -> p_log_buddy [arrowhead=none];
}
{
rank=same;
buddy -> p_actions-> p_scripts -> p_log [arrowhead=none];
}
}
@enddot
Clients consist of things like "uptop", which is a status reporting tool that is similar to the common "top" utility. Clients communicate with controller via the API
provided in libupkeeper. Hence, you can easily create any client you require, including metrics reporting and/or monitoring and alerting systems.
To explain this, the following sequence chart demonstrates how the components interact:
Start a managed process:
@msc
cl [label="Client"],
l [label="libupkeeper"],
ct [label="Controller"],
b [label="Buddy"],
p [label="Managed Process"];
cl => l [label="upk_net_register_callback"];
cl => l [label="upk_net_client_connect"];
l >> ct [label="domain socket"];
...;
cl <<= l [label="established"];
cl => l [label="action request: start"];
l -> ct [label="upk_action_req"];
ct abox ct [label="write audit record"];
ct >> b [label="domain socket"];
ct -> b [label="UPK_CTRL_ACTION_START"];
b >> p [label="invoke process"];
ct <- b [label="buddy flush_ringbuffer"];
ct abox ct [label="write buddy info"];
l <- ct [label="report status of request"];
cl <<= l [label="handle reply"];
@endmsc
Whereas a read-only process, such as subscribing to, and receiving events over time would look like this:
@msc
cl [label="Client"],
l [label="libupkeeper"],
ct [label="Controller"];
cl => l [label="upk_net_register_callback"];
cl => l [label="upk_net_client_connect"];
l >> ct [label="domain socket"];
...;
cl <<= l [label="established"];
cl => l [label="subscription request"];
l -> ct [label="upk_subscription_req"];
ct abox ct [label="Controller polls buddies for status at a specified interval, and reports to subscribers status changes"];
...;
l <- ct [label="publish status changes"];
cl <<= l [label="handle status publications via callback"];
@endmsc
@section sampleconfig Sample Configuration.
@verbatim
{
// StateDir
// Path to variable state-dir for controller and buddies
"StateDir": "/usr/var/upkeeper",
// SvcConfigPath
// Path to location of service configuration files
"SvcConfigPath": "/usr/etc/upkeeper.d",
// SvcRunPath
// Path to location to setup and run buddies
"SvcRunPath": "/usr/var/upkeeper/buddies",
// Path to the buddy executable
"UpkBuddyPath": "/usr/libexec/upk_buddy",
// How frequently buddy sockets should be polled for events
// in seconds and fractions of a second
"BuddyPollingInterval": 0.5,
// ServiceDefaults:
"ServiceDefaults": {
// An array of strings (['foo','bar']) describing what this service provides. These are then used in ordering service
// startup via prerequisites. Service name, package, and UUID are implicitely added to the list of Provides.
"Provides": [],
// A valid UUID for the service, will be automatically generated if not provided.
"UUID": Null,
// A brief description of the service
"ShortDescription": Null,
// A longer and more complete description of the service
"LongDescription": Null,
// An array of strings (['foo','uuid#<...>']) of what must already be running (as named in 'Provides')
// before this service should start. The syntax <package-name>:<service-name> may be used to specify a service from
// a specific package. The syntax 'uuid#<uuid-string>' can be used to specify an explicit UUID.
"Prerequisites": [],
// Numeric start priority, used in lieu of, or in conjunction with Provides/Prerequisites to determine start order
"StartPriority": 0,
// Shutdown timeout before resorting to SIGKILL, Values < 0 will never SIGKILL
"KillTimeout": 60,
// Maximum number of times a process may fail in-a-row before its state is changed to down
// a negative value indicates to restart forever (and is the default)
"MaxConsecutiveFailures": -1,
// user-defined max number of restarts within restart window
"UserMaxRestarts": Null,
// User-defined restart window, in seconds
"UserRestartWindow": Null,
// duration, in seconds, to wait between respawn attempts
"UserRateLimit": Null,
// a flag to enable/disable adding a randomized jitter to the user_ratelimit
"RandomizeRateLimit": false,
// if controller and/or buddy is run euid root; which uid to run the service as
"SetUID": 0,
// if controller and/or buddy is run euid root; which gid to run the service as
"SetGID": 0,
// size of the ringbuffer to maintain in the buddy
"RingbufferSize": 64,
// number of times to retry connections to the controler when emergent actions
// occur in the buddy; (-1 for indefinate)
"ReconnectRetries": 10,
// command to exec for start, see 'StartScript'
"ExecStart": Null,
// script to start the monitored process; The default is 'exec %(EXEC_START)'
"StartScript": "#!/bin/sh\nexec %(EXEC_START)\n",
// command to exec for stop. Default: 'kill', see 'StopScript'
"ExecStop": "kill",
// Script to stop the monitored process. The default is 'exec %(EXEC_STOP) $1'; where $1 passed
// to it will be the pid of monitored process (and also the pgrp, and sid, unless the monitored process changed them
"StopScript": "#!/bin/sh\nexec %(EXEC_STOP) $1\n",
// command to exec for reload. Default: 'kill -HUP', see 'ReloadScript'
"ExecReload": "kill -HUP",
// Script to reload the monitored process. The default is 'exec %(EXEC_RELOAD) $1'; where $1 passed
// to it will be the pid of monitored process (and also the pgrp, and sid, unless the monitored process changed them
"ReloadScript": "#!/bin/sh\nexec %(EXEC_RELOAD) $1\n",
// A collection of key/value (JSON Object: {"foo":"bar"}) pairs where the key is the name of the action, and
// the value is the contents of a script script to run for that action
"CustomActions": {},
// optional script to pipe stdout to. For example: 'exec logger -p local0.notice'
"PipeStdoutScript": Null,
// optional script to pipe stderr to. For example: 'exec logger -p local0.warn'
"PipeStderrScript": Null,
// optional place to direct stdout.
// Note that if you pipe stdout elsewhere, this might never be written to, unless the thing you pipe to prints
// to stdout itself
"RedirectStdout": Null,
// optional place to direct stderr.
// Note that if you pipe stderr elsewhere, this might never be written to, unless the thing you pipe to prints
// to stderr itself
"RedirectStderr": Null,
// state the service should be set to initially. this is used only when a service is first configured.
"InitialState": "stopped",
// May be used by a package to instruct the controler to remove a configured service
// if the file defining that service ever disappears. Possibly useful in packaging
// to cleanup the controller on package removal. The default behavior is to ignore
// file removal, and require explicit manual removal of configured services
"UnconfigureOnFileRemoval": false,
// If the controller starts/restarts, and buddy has a service state set to 'stopped',
// but controller's data-store believes the service should be running, prefer
// buddy's world view, and update the data-store to reflect the stopped state.
// The default is to trust the data-store; which would cause the service to be started
"PreferBuddyStateForStopped": false,
// if the controller starts/restarts, and buddy has a service state set to 'running',
// but controller's data-store believes the service should be stopped, prefer
// buddy's world view, and update the data-store to reflect the running state.
// The default is to trust the buddy, which would leave the service running
"PreferBuddyStateForRunning": true,
},
}
@endverbatim
*/