=====================================================================
BELOW IS A RUNNING STREAM OF NOTES WHICH DON'T HAVE TO MAKE ANY SENSE
=====================================================================
So, we map one input field to one or more output fields.

For each output field we need to (see the sketch below):

    Know the output field name.
    Decide whether we have a value or not ("?"/"(none)"/"(null)"/etc.).
    Decide how to retrieve the value (str/interpret).
    Decide whether the value is a decimal number or a string.
    Escape the value according to the output format.
    Know whether we should add a separator or not.
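
To make that list concrete, here is a rough sketch of the per-output-field
decisions as a C structure. All the names are made up for illustration; only
the auparse retrieval functions (auparse_get_field_str() vs.
auparse_interpret_field()) are real.

    #include <stdbool.h>

    /* How the value should be typed and escaped in the output */
    enum field_value_type {
        FIELD_VALUE_TYPE_STRING,    /* escape and quote as a string */
        FIELD_VALUE_TYPE_DECIMAL    /* output as a bare decimal number */
    };

    /* Everything we need to know to emit one output field */
    struct field_output {
        const char             *name;       /* output field name */
        bool                    has_value;  /* false for "?", "(none)", "(null)", etc. */
        bool                    interpret;  /* auparse_interpret_field() vs. auparse_get_field_str() */
        enum field_value_type   type;       /* decides number vs. string escaping */
        bool                    add_sep;    /* whether a separator precedes this field */
    };
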
We need to output both the raw and the "interpreted" value for some fields.
For some fields it's easy to know what the "interpretation" output is; for
others we need to copy the complicated logic in auparse to actually know,
like the file opening mode argument decoding.

However, we can request field types from auparse, and perhaps those can help
us know what the interpreted field means. Still, it is a *source* field type,
not the interpretation.

We can get a type for each field and infer the meaning of the interpreted
value. However, we cannot do that statically, as some field names are reused
and so their type can change dynamically, depending on record type or such.

However, we can't infer the meaning of syscall arguments, since their type
isn't returned properly. We might need to fall back to general
"interpretation" of those.

So, theoretically we can use the field type to avoid storing the interpreted
value, to save on overhead. That seems manageable. Otherwise we won't be able
to keep the schema in sync with the logic in auparse.
So, for each input field we need to:

    Decide whether we have a value or not ("?"/"(none)"/"(null)"/etc.).
    Decide whether the value is a decimal number or a string.
    Escape the value according to the output format.
    Know whether we should add a separator or not.
How can we generate the schema:

    Get record types from auditd/lib/msg_typetab.h.
    List record elements named after record types;
    say they contain any of the field elements for a start.

    Get fields from audit-documentation/specs/fields/field-dictionary.csv.
    List field elements. Specify the "i" attribute as mandatory. Specify
    whether it is a string or a number based on the type from the CSV.
    Either specify the "r" attribute as optional for every field, with its
    type according to the CSV, similar to the "i" attribute, or try to
    figure out which fields don't have to have it based on... what?

    Wait, perhaps that's separate. We can still attach the type to either
    "r" or "i" and mark all "i" attributes optional. Can't we?

    We can add another column to the CSV file (aargh!) which specifies
    whether the value is actually interpreted, or whether it is simply
    unescaped or copied as is.

arg_num
arg_idx
arg_len_total
arg_len_read
(partially) formatted arg list gbuf
Do we need to split the stream into documents with n events each?
If we write to a log, then we'd rather write a single event per message.

What if we separate the document output and the event output?

Have the converter just output events, either single-line, or multi-line with
the initial and successive indent specified.

Have the converter's user wrap that into documents as necessary. One such
user can be the aushape program. Or perhaps we can have another module which
does a more complete job, once the approach used in aushape is figured out.

To stick to that idea we need to stop assuming anything about the order of
events being output and not attempt to arrange the separators and whitespace
that go before, between, or after them. Have no newlines at the end at all.
This will be one thing done well.
----------------
Command line sketches
aushape -l xml < input.log > output.xml
aushape -l xml input.log > output.xml
aushape -l xml -o stdout input.log > output.xml
aushape -l xml -o stdout < input.log > output.xml
aushape -l xml -o syslog --syslog-facility=authpriv --syslog-priority=info
aushape -l xml -o file --file-path=/var/log/aushape.log
aushape -l json --fold 4 --indent 4
aushape -l json --fold 0
aushape -l xml input.log >output.xml
aushape -l xml input.log output.xml
aushape -l json < input.log > output.json
aushape -o file -l xml input.log
aushape -o file -l xml input.log -
aushape -o
aushape -o file --file-path
Audispd plugin configuration doesn't support passing arguments well, so we'll
have to install a wrapper or a separate program, with all the necessary
arguments specified. For now we'll relax the audispd argument requirements.

Actually, what if we simply make several executables, instead of one?
The thing is, there are two separate tasks: converting a live stream, one
event at a time, or converting a complete stream as a whole.

How about this:

    aushape-stream
    aushape-convert

Or perhaps we should simply go ahead with that document wrapping logic that
we discarded before, only this time we actually write a whole document at a
time. Or rather have that concept.
We can say that the output can be discrete or continuous. If we're writing
to a discrete output, we always have to write whole documents. If we're
writing to a continuous output, we can write in arbitrary pieces.

So if we say to enclose all events in one document, and are writing to syslog,
then we'll have to accumulate all the events in memory and invoke syslog(3)
only when the input stream is closed. Well, duh, don't do that, unless you
know what you're doing. But if we have the same setup with a (continuous) file
or stdout output, then we can write out pieces as we want, no matter the
configuration.

So, for this to work, we'll have to have an intermediate aggregator
object/module, which would look at the type of the output and collect the
data before writing, if the output is discrete. It would also deal with making
documents out of all, every n, or no events (keeping them bare).
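
A tiny sketch of the document policy such an aggregator could carry (the
names are invented, not existing code):

    /* How the aggregator groups events into documents */
    enum doc_policy {
        DOC_POLICY_NONE,    /* no wrapping: output bare events */
        DOC_POLICY_EACH_N,  /* wrap every N events into a document */
        DOC_POLICY_ALL      /* one document spanning the whole input stream */
    };
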
So, we can have these outputs:

    file
    syslog
    journal

We can make those simply be "writers". And we can make a special kind of
writer, which expects to be given complete events and accepts another writer
to write documents to.

So we can have (see the interface sketch after the list):

    aushape_fd_writer - writes to an FD, continuous, expects whatever
    aushape_syslog_writer - writes to syslog, discrete, expects whatever
    aushape_journal_writer - writes to the journal, discrete, expects whatever

and

    aushape_doc_writer - writes to another writer, discrete, expects events

Let's try adding some more:

    aushape_http_writer - writes to fluentd, discrete, expects whatever
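
To illustrate the writer idea, here is one possible shape for the common
interface in C. This is only a sketch; none of these types exist as written,
and the continuous flag is just the discrete/continuous property above made
explicit.

    #include <stdbool.h>
    #include <stddef.h>

    struct aushape_writer;

    /* Per-implementation operations and properties of a writer */
    struct aushape_writer_type {
        bool continuous;    /* true: accepts arbitrary pieces; false: whole documents only */
        bool (*write)(struct aushape_writer *writer, const char *ptr, size_t len);
        void (*destroy)(struct aushape_writer *writer);
    };

    /* Common part of every writer; implementations embed this first */
    struct aushape_writer {
        const struct aushape_writer_type *type;
    };
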
Yet, perhaps we can stuff the document aggregation into the converter still.
That may simplify things.
There is a problem of writing the document prologue and epilogue for the case
of "all events in one document". When do we write them?

First of all, writing any events before or after the document doesn't make
sense, so these operations cannot be on the same level as feeding input into
the converter. I.e. there cannot be a function alongside aushape_conv_input.
If there was, then we would need to prohibit calling aushape_conv_input
before/after it. If we have to call these special functions only after
creation and before destruction, why not make them part of
creation/destruction?

The thing about making them a part of destruction is that we will either need
to be able to fail destruction, or to ignore the problem and go ahead with
freeing the converter. Even if we report a problem with finishing the document
during destruction, and go ahead and destroy it, there would be no way to
recover.

If we make these separate functions, they would still feel strange, because
they would only be necessary in that "all events in one doc" case.
OTOH, if we really want to recover from that, then we need them. Yet, how
would we recover, and which cases would need that? Well, at least debugging
would be nice.

Having the document formatting in the converter would also make the folding
and indenting logic more solid and simple. BTW, even if we put the document
formatting in a separate "writer", we would still need to deal with handling
finishing failure.

We can allow passing a "force" flag argument to aushape_conv_destroy, but it
would feel awkward. Perhaps separate functions are not that bad, after all.
They can be made thus:

    enum aushape_conv_rc aushape_conv_begin(struct aushape_conv *conv);
    enum aushape_conv_rc aushape_conv_end(struct aushape_conv *conv);

We will need to store converter state and refuse any calls out of this order
(sketched below):

    aushape_conv_begin
    aushape_conv_input|flush
    ...
    aushape_conv_end
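
A minimal sketch of enforcing that order, assuming a state field in the
converter and invented return-code names; only the begin/end prototypes above
come from these notes.

    /* Hypothetical return codes and state tracking, for illustration only */
    enum aushape_conv_rc {
        AUSHAPE_CONV_RC_OK,
        AUSHAPE_CONV_RC_INVALID_STATE
        /* ... */
    };

    enum aushape_conv_state {
        AUSHAPE_CONV_STATE_CREATED, /* after creation, before begin */
        AUSHAPE_CONV_STATE_BEGUN,   /* input/flush allowed */
        AUSHAPE_CONV_STATE_ENDED    /* only destruction allowed */
    };

    struct aushape_conv {
        enum aushape_conv_state state;
        /* ... output, format options, etc. ... */
    };

    enum aushape_conv_rc aushape_conv_begin(struct aushape_conv *conv)
    {
        if (conv->state != AUSHAPE_CONV_STATE_CREATED)
            return AUSHAPE_CONV_RC_INVALID_STATE;
        /* write the document prologue to the output here */
        conv->state = AUSHAPE_CONV_STATE_BEGUN;
        return AUSHAPE_CONV_RC_OK;
    }
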
----
If we're going to implement "writers", or perhaps something else besides the
converter and simple additional modules, we might need to return to a global
return code system, since it might be beneficial to know exactly what happened
below the converter. E.g. why the writer failed, instead of just a single
return code meaning "writer failed".
----
There is one good thing about implementing document wrapping in writers: we
can add other policies, like limiting document size in bytes, instead of in
events. With wrapping logic inside the converter we can't do that easily.
OTOH, we can implement size limiting inside the converter too.
----
We need to limit event size. The question is what to cut. At the top level we
have records *and* raw log lines. Sometimes we need to process *all* records,
because some records have data for other records, like the syscall record
having an "items" field specifying the number of "path" records logged, or the
execve records having an argument split between them.

What if we simply format everything in memory and then decide what to cut and
cut it before output? Sure, that can take a lot of memory, but it will be much
simpler, still faster than passing huge messages up the pipeline, and more
reliable than just dropping them. For this to work we need to quantize the
buffers somehow, so that we can go and count out the items that would fit.
This would make the truncating algorithm much simpler. But first, we might
need to deal with the separators and all that passing of the "first" argument
around.
----
If we want to have structured truncating, we need to define that structure.
What if we model the output as a prologue, an epilogue, and a series of items?
The prologue and epilogue cannot be removed then, but items can. Each item can
then have a priority which decides the order in which items are removed.
As most of the time we'll need the tail removed, and it's more human-friendly
to count up, the top priority would have the smallest value, i.e. 1 - top
priority, SIZE_MAX - lowest priority. Each item can be a piece of text, or
another, nested structure.

We can further abstract the prologue and epilogue to simply be top-priority
items that cannot be removed, e.g. priority zero. Then we don't have to know
anything more about them. Although, we would need a rule that items of
the same priority can only be removed together.
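
A sketch of that output model in C: every item is either a piece of text or a
nested list of further items, and carries a removal priority (0 = prologue/
epilogue, never removed; larger values are removed earlier). The type and
field names here are invented.

    #include <stddef.h>

    enum item_type {
        ITEM_TYPE_TEXT, /* a literal piece of formatted output */
        ITEM_TYPE_LIST  /* a nested series of further items */
    };

    struct item {
        enum item_type  type;
        size_t          prio;   /* 0 = never removed; bigger = removed earlier */
        union {
            struct {
                const char  *ptr;
                size_t       len;
            } text;
            struct {
                struct item *items;
                size_t       num;
            } list;
        } data;
    };
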
In fact, we can even model the whole truncating hierarchy simply by using the
priority value. The only difficulty is managing the value appropriately with
arbitrary-length nested lists. We can surrender a few top bits of a size_t for
the nesting level and still have plenty of space in the remaining bits. Let's
say we support 16 nesting levels (even though we use fewer than 8 currently):
then we need only four bits for that, and we still have 28 or 60 bits left.
Even with 28 bits we have space for 268435455 items, which is a lot.
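
For instance, the bit split could look like this (macro names are made up);
with a 32-bit size_t it leaves 28 bits, i.e. values up to 268435455, for the
priority within a level:

    #include <stddef.h>
    #include <limits.h>

    #define PRIO_LEVEL_BITS 4   /* up to 16 nesting levels */
    #define PRIO_VALUE_BITS (sizeof(size_t) * CHAR_BIT - PRIO_LEVEL_BITS)
    #define PRIO_VALUE_MAX  ((((size_t)1) << PRIO_VALUE_BITS) - 1)

    /* Pack a nesting level and an in-level priority into a single size_t */
    static inline size_t prio_pack(size_t level, size_t value)
    {
        return (level << PRIO_VALUE_BITS) | (value & PRIO_VALUE_MAX);
    }
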
However, we have the problem of locating maximums in that flat list, or tree,
which can take a while for a large event.

Alright, first, a tree would make more sense, as parts of the output can be
accumulated in a random order and we would need a tree anyway. Then, to prune
the excess, we would need to descend into the highest-priority branches first,
keep adding to a global counter, and stop when we exceed the maximum.

We can once again sacrifice memory for performance and have an array of
priority vs. size for each list, assuming we need at most as many priorities
as there are items, and that they grow steadily. Then we can do one sweep of
the item array, summing up the sizes for each priority, and then go down that
priority array, summing it up, and decide up to which priority level we want
to keep.
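
A sketch of that sweep, assuming sizes[] already holds the summed size of all
items at each priority (priority 0 being the non-removable prologue/epilogue);
it returns the first priority that no longer fits:

    #include <stddef.h>

    /* Return the first priority level to drop so that the kept levels fit
     * within "limit"; priorities [0, result) are kept. */
    static size_t prio_find_cutoff(const size_t *sizes, size_t num, size_t limit)
    {
        size_t total = 0;
        size_t prio;

        for (prio = 0; prio < num; prio++) {
            if (total + sizes[prio] > limit)
                break;
            total += sizes[prio];
        }
        return prio;
    }
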
We can start with the dumbest, slowest algorithm, though. Not that the one
above is very smart either, but we can use a simpler one still.
----
We're supposed to only *remove* a priority level completely, i.e. either
remove all nodes from a priority level or none.

One priority level can have nodes of different sizes, including text nodes,
which can only be trimmed away completely.

First, we can either remove or trim each node. What's the difference between
trimming a node to zero and removing it? How can we decide which nodes to
trim, and by how much, within a single priority level?

What if we make the zero priority level non-removable, and instead force the
caller to remove the node itself?
    While the priority level size is bigger than the limit
        For each node that is not known to be atomic
            Trim it proportionally to the reduction we need
            If it did trim
                Subtract the trim from the overall length
            else
                Mark it atomic
        If there are no more trimmable nodes, break
    Return the resulting length
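
And a rough C rendering of that loop, with an invented node type and trim
callback; "atomic" here means a node that refused to shrink any further, and
possible overflow in the proportional cut is ignored for brevity:

    #include <stdbool.h>
    #include <stddef.h>

    struct node {
        size_t  len;    /* current formatted length */
        bool    atomic; /* known not to trim any further */
        /* try to shrink the node towards target_len, return the new length */
        size_t (*trim)(struct node *node, size_t target_len);
    };

    /* Trim one priority level down towards "limit", return the resulting length */
    static size_t level_trim(struct node **nodes, size_t num, size_t limit)
    {
        size_t len = 0;
        size_t i;

        for (i = 0; i < num; i++)
            len += nodes[i]->len;

        while (len > limit) {
            bool trimmed = false;
            for (i = 0; i < num && len > limit; i++) {
                struct node *n = nodes[i];
                size_t cut, new_len;
                if (n->atomic || n->len == 0)
                    continue;
                /* ask for a cut proportional to the reduction still needed */
                cut = n->len * (len - limit) / len;
                if (cut == 0)
                    cut = 1;
                new_len = n->trim(n, n->len - cut);
                if (new_len < n->len) {
                    len -= n->len - new_len;
                    n->len = new_len;
                    trimmed = true;
                } else {
                    n->atomic = true;
                }
            }
            if (!trimmed)
                break;
        }
        return len;
    }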