The \eslmod{buffer} module provides an abstract layer for building
input parsers. Different types of input -- including files, standard
input, piped output from executed commands, C strings, and raw memory
-- can be handled efficiently in a single API and a single object, an
\ccode{ESL\_BUFFER}.
%The API is summarized in Table~\ref{tbl:buffer_api}.
The main rationale for \eslmod{buffer} is to enable multipass parsing
of any input, even a nonrewindable stream or pipe. A canonical problem
in sequence file parsing is that we need to know both the format
(FASTA or GenBank, for instance) and the alphabet (protein or nucleic
acid, for instance) in order to parse Easel-digitized sequence data
records. To write ``smart'' parsers that automagically determine the
file format and alphabet, so programs work transparently on lots of
different file types without users needing to specify them, we need
three-pass parsing: one pass to read raw data and determine the
format, a second pass to parse the format for sequence data and
determine its alphabet, and finally the actual parsing of digitized
sequences. Multiple pass parsing of a nonrewindable stream, such as
standard input or the output of a \ccode{gunzip} call, isn't possible
without extra support. The \eslmod{buffer} module standardizes that
support for all Easel input.
\subsection{Examples of using the buffer API}
Here's an example of using \eslmod{buffer} to read a file line by
line:
\input{cexcerpts/buffer_example}
This shows how to open an input, get each line sequentially, do
something to each line (here, count the number of x's), and close the
input. To compile this example, then run it on a file (any file would
do, but here, \ccode{esl\_buffer.c} itself):
\user{gcc -I. -o esl\_buffer\_example -DeslBUFFER\_EXAMPLE esl\_buffer.c easel.c -lm}
\user{./esl\_buffer\_example esl\_buffer.c}
\response{Counted 181 x's in 3080 lines.}
The most important thing to notice here is that the
\ccode{esl\_buffer\_Open()} function implements a standard Easel idiom
for finding input sources. If the \ccode{filename} argument is a
single dash '-', it will read from \ccode{stdin}. If the
\ccode{filename} argument ends in \ccode{.gz}, it will assume the file
is a \ccode{gzip}-compressed input, and it will decompress it on the
fly with \ccode{gzip -dc} before reading it. If it does not find the
\ccode{filename} relative to the current directory, and if the second
argument \ccode{envvar} (here \ccode{"TESTDIR"}) is non-\ccode{NULL},
it looks up the environment variable of that name, which should
contain a colon-delimited list of directories to search for
\ccode{filename}. Therefore all of the following commands will work
and give the same result:
\begin{userchunk}
% ./esl_buffer_example esl_buffer.c
\end{userchunk}
\begin{userchunk}
% cat esl_buffer.c | ./esl_buffer_example -
\end{userchunk}
\begin{userchunk}
% cp esl_buffer.c foo
% gzip foo
% ./esl_buffer_example foo.gz
\end{userchunk}
\begin{userchunk}
% cp esl_buffer.c ${HOME}/mydir2/baz
% export TESTDIR=${HOME}/mydir1:${HOME}/mydir2
% ./esl_buffer_example baz
\end{userchunk}
This idiomatic flexibility comes in handy when using biological data.
Data are often kept in standard directories on systems (for
example, we maintain a symlink \ccode{/misc/data0/databases/Uniprot}
on ours), so having applications look for directory path listings in
standardized environment variables can help users save a lot of typing
of long paths. Data files can be big, so it's convenient to be able to
compress them and not have to decompress them to use them. It's
convenient to have applications support the power of using UNIX
command invocations in pipes, chaining the output of one command into
the input of another, so it's nice to automatically have any
Easel-based application read from standard input.
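
The lookup rules described above can be sketched in plain C. This is
only an illustration of the idiom, not Easel's implementation: the
names \ccode{classify\_source()} and \ccode{nth\_path\_candidate()}
are hypothetical, the classification looks at the name alone, and a
real implementation would also test whether each candidate path exists
before opening it.

\begin{cchunk}
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum source_kind { SRC_STDIN, SRC_GZIP, SRC_FILE };

/* Decide how a filename argument would be read, by name alone. */
enum source_kind classify_source(const char *filename)
{
  size_t len;

  assert(filename != NULL);
  len = strlen(filename);
  if (strcmp(filename, "-") == 0)                        return SRC_STDIN;
  if (len > 3 && strcmp(filename + len - 3, ".gz") == 0) return SRC_GZIP;
  return SRC_FILE;
}

/* Build "<dir>/<filename>" for the idx'th directory (0-based) of a
 * colon-delimited list. Returns 1 on success, 0 if the list has no
 * such entry or the result does not fit in out[0..outsz-1].
 */
int nth_path_candidate(const char *dirlist, int idx, const char *filename,
                       char *out, size_t outsz)
{
  const char *s = dirlist;
  const char *e;
  int         i, w;

  assert(dirlist != NULL && filename != NULL && out != NULL && idx >= 0);
  for (i = 0; i < idx; i++) {
    s = strchr(s, ':');
    if (s == NULL) return 0;
    s++;                              /* step past the ':' */
  }
  e = strchr(s, ':');
  if (e == NULL) e = s + strlen(s);   /* last entry runs to end of list */
  w = snprintf(out, outsz, "%.*s/%s", (int) (e - s), s, filename);
  return (w >= 0 && (size_t) w < outsz);
}
\end{cchunk}

For example, with the directory list \ccode{/a/mydir1:/a/mydir2},
candidate 1 for \ccode{baz} is \ccode{/a/mydir2/baz}.
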
A couple of other things to notice about this example:
\begin{enumerate}
\item If \ccode{esl\_buffer\_Open()} fails, it still returns a
valid \ccode{ESL\_BUFFER} structure, which contains nothing except a
user-directed error message \ccode{bf->errmsg}. If you were going to
continue past this error, you'd want to \ccode{esl\_buffer\_Close()}
the buffer.
\item \ccode{esl\_buffer\_GetLine()} returns a pointer to the start of
the next line \ccode{p}, and its length in chars \ccode{n}
(exclusive of any newline character). It does \emph{not} return a
string: \ccode{p[n]} is \emph{not} a \ccode{NUL} byte
\verb+\0+. Standard C string functions, which expect
\ccode{NUL}-terminated strings, can't be used on \ccode{p}. The
reason is efficiency: the \ccode{ESL\_BUFFER} is potentially looking
at a read-only exact image of the input, and
\ccode{esl\_buffer\_GetLine()} is not wasting any time making a copy
of it. If you need a string, with an appended \verb+\0+ in the
right place, see \ccode{esl\_buffer\_FetchLineAsStr()}.
\end{enumerate}
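
Working with a pointer and a length instead of a string is a small
change in habit. A minimal sketch, independent of Easel (both function
names here are hypothetical): count a character in \ccode{p[0..n-1]}
using the length, and copy the bytes to a \verb+\0+-terminated string
only when one is really needed.

\begin{cchunk}
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Count occurrences of c in the n bytes p[0..n-1]. p need not be
 * NUL-terminated, so we use the length and never call strlen().
 */
size_t count_char(const char *p, size_t n, char c)
{
  size_t i, count = 0;

  assert(p != NULL || n == 0);
  for (i = 0; i < n; i++)
    if (p[i] == c) count++;
  return count;
}

/* Copy p[0..n-1] into a newly allocated NUL-terminated string.
 * Caller frees. Returns NULL on allocation failure.
 */
char *dup_as_str(const char *p, size_t n)
{
  char *s;

  assert(p != NULL || n == 0);
  s = malloc(n + 1);
  if (s == NULL) return NULL;
  if (n > 0) memcpy(s, p, n);
  s[n] = '\0';
  return s;
}
\end{cchunk}
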
\subsubsection{Reading tokens}
Because \ccode{ESL\_BUFFER} prefers to give you pointers into a
read-only image of the input, the standard C \ccode{strtok()} function
can't be used to define tokens (whitespace-delimited fields, for
example), because \ccode{strtok()} tries to write a \verb+\0+ byte
after each token it defines. Therefore \ccode{ESL\_BUFFER} provides
its own token parsing mechanism. Depending on whether or not you
include newline characters (\verb+\r\n+) in the list of separator
(delimiter) characters, it either ignores newlines altogether, or it
detects newlines separately and expects to find a known number of
tokens per line.
For example, our x counting program could be implemented to parse
every token instead of every line:
\input{cexcerpts/buffer_example2}
\user{gcc -I. -o esl\_buffer\_example2 -DeslBUFFER\_EXAMPLE2 esl\_buffer.c easel.c -lm}
\user{./esl\_buffer\_example2 esl\_buffer.c}
\response{Counted 181 x's in 14141 words.}
In the \ccode{esl\_buffer\_GetToken()} call, including \verb+\r\n+
with \verb+" \t"+ in the separators causes newlines to be treated as
delimiters, like any space or tab character. If you omit \verb+\r\n+
newline characters from the separators, then the parser detects them
specially anyway; when it sees a newline instead of a token, it
returns \ccode{eslEOL} and sets the point to the next character
following the newline. For example, we can count both lines and
tokens:
\input{cexcerpts/buffer_example3}
\user{gcc -I. -o esl\_buffer\_example3 -DeslBUFFER\_EXAMPLE3 esl\_buffer.c easel.c -lm}
\user{./esl\_buffer\_example3 esl\_buffer.c}
\response{Counted 181 x's in 14141 words on 3080 lines.}
What happens if the last line in a text file is missing its terminal
newline? In the example above, the number of lines would be one fewer;
the nonterminated last line wouldn't be
counted. \ccode{esl\_buffer\_GetToken()} would return \ccode{eslEOF}
on the last line of the file, rather than \ccode{eslEOL} followed by
\ccode{eslEOF} at its next call as it'd do if the newline were there.
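
One way for a caller to count an unterminated last line anyway is to
remember whether any token has been seen since the last
end-of-line. The sketch below works on a recorded sequence of return
codes rather than on a real buffer; the \ccode{TOK\_*} codes and
\ccode{count\_lines()} are hypothetical stand-ins, not Easel names.

\begin{cchunk}
#include <assert.h>
#include <stddef.h>

enum tok_status { TOK_OK, TOK_EOL, TOK_EOF };

/* Count lines from a sequence of tokenizer return codes. A line is
 * counted at each TOK_EOL, and also at the end of input if tokens
 * were seen since the last TOK_EOL (an unterminated final line).
 */
size_t count_lines(const enum tok_status *codes, size_t ncodes)
{
  size_t i, nlines = 0;
  int    pending = 0;          /* tokens seen on the current line? */

  assert(codes != NULL || ncodes == 0);
  for (i = 0; i < ncodes; i++) {
    if      (codes[i] == TOK_OK)  pending = 1;
    else if (codes[i] == TOK_EOL) { nlines++; pending = 0; }
    else                          break;        /* TOK_EOF */
  }
  if (pending) nlines++;
  return nlines;
}
\end{cchunk}
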
\subsubsection{Reading fixed-width binary input}
You can also read fixed-width binary input directly into storage,
including scalar variables, using the \ccode{esl\_buffer\_Read()}
call. This is similar to C's \ccode{fread()}:
\input{cexcerpts/buffer_example4}
The \ccode{Read()} call needs to know exactly how many bytes \ccode{n}
it will read. For variable-width binary input, see the
\ccode{esl\_buffer\_Get()}/\ccode{esl\_buffer\_Set()} calls.
In fact all inputs are treated by \ccode{ESL\_BUFFER} as binary
input. That is, platform-dependent newlines are not converted
automatically to C \verb+\n+ characters, as would happen when using
the C \ccode{stdio.h} library to read an input stream in ``text
mode''. You can freely mix different types of \ccode{esl\_buffer\_*}
parsing calls as you see appropriate.
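
The idea of a fixed-width read can be sketched on a plain byte
array. This is an illustration under simplifying assumptions (the
whole input is already in memory, and \ccode{read\_fixed()} is a
hypothetical name), not the \ccode{esl\_buffer\_Read()}
implementation. Because the bytes are copied as they are, byte order
is the caller's concern.

\begin{cchunk}
#include <assert.h>
#include <string.h>

/* Copy exactly nbytes from mem[*pos..] into dst and advance *pos.
 * Returns 0 on success, or -1 (copying nothing) if fewer than nbytes
 * remain in mem[0..n-1].
 */
int read_fixed(const unsigned char *mem, size_t n, size_t *pos,
               void *dst, size_t nbytes)
{
  assert(mem != NULL && pos != NULL && dst != NULL && *pos <= n);
  if (n - *pos < nbytes) return -1;
  memcpy(dst, mem + *pos, nbytes);
  *pos += nbytes;
  return 0;
}
\end{cchunk}
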
\subsubsection{A more complicated example, a FASTA parser}
An example of a simple FASTA parsing function:
\input{cexcerpts/buffer_example5a}
and an example of using that function in a program:
\input{cexcerpts/buffer_example5b}
One thing to note here is the use of \ccode{esl\_buffer\_Set()} to
push characters back into the parser. For example, when we look for
the starting '>', we do a raw \ccode{esl\_buffer\_Get()}, look at the
first character, then call \ccode{esl\_buffer\_Set()} with
\ccode{nused=1} to tell the parser we used 1 character of what it gave
us. This is an idiomatic usage of the
\ccode{esl\_buffer\_Get()}/\ccode{esl\_buffer\_Set()} pair. The
\ccode{esl\_buffer\_Get()} call doesn't even move the point until the
companion \ccode{esl\_buffer\_Set()} tells it where to move to.
The other idiomatic use of \ccode{esl\_buffer\_Set()} is to implement
a ``peek'' at a next line or a next token, using a
\ccode{esl\_buffer\_GetLine()}/\ccode{esl\_buffer\_Set()} or
\ccode{esl\_buffer\_GetToken()}/\ccode{esl\_buffer\_Set()}
combination. You see this when we're in the sequence reading loop, we
get a line, and we want to peek at its first character. If it's a '>'
we're seeing the start of the next sequence, so we want to return
while leaving the point on the '>'. To do this, we use
\ccode{esl\_buffer\_GetLine()} to get the line, and if the first char
is a '>' we use \ccode{esl\_buffer\_Set()} to push the line pointer
(with 0 used characters) back to the parser.
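
The get/set pairing can be illustrated with a toy cursor over a block
of memory. The names \ccode{struct cursor}, \ccode{cursor\_get()}, and
\ccode{cursor\_set()} are hypothetical; the sketch mirrors the idiom
only and omits all buffering.

\begin{cchunk}
#include <assert.h>
#include <stddef.h>

/* A toy cursor over an in-memory input; pos plays the role of the point. */
struct cursor {
  const char *mem;   /* the input image   */
  size_t      n;     /* its length        */
  size_t      pos;   /* the current point */
};

/* Hand out a pointer to the unread bytes and how many there are.
 * The point does not move.
 */
void cursor_get(const struct cursor *c, const char **p, size_t *avail)
{
  assert(c != NULL && p != NULL && avail != NULL);
  *p     = c->mem + c->pos;
  *avail = c->n - c->pos;
}

/* Tell the cursor that the caller used nused bytes starting at p.
 * nused = 0 pushes everything back, which turns the get into a peek.
 */
void cursor_set(struct cursor *c, const char *p, size_t nused)
{
  assert(c != NULL && p >= c->mem && p + nused <= c->mem + c->n);
  c->pos = (size_t) (p - c->mem) + nused;
}
\end{cchunk}

A peek is then \ccode{cursor\_get()} followed by
\ccode{cursor\_set()} with \ccode{nused=0}; consuming a leading '>' is
the same pair with \ccode{nused=1}.
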
You can also see examples here of using
\ccode{esl\_buffer\_FetchTokenAsStr()} and
\ccode{esl\_buffer\_FetchLineAsStr()} to copy the name and description
directly to allocated, \verb+\0+-terminated C strings. Note how they
interact: because \ccode{esl\_buffer\_FetchTokenAsStr()} moves the
point past any trailing separator characters to the start of the next
token, and because \ccode{esl\_buffer\_FetchLineAsStr()} doesn't need
the point to be at the start of a line, the
\ccode{esl\_buffer\_FetchLineAsStr()} call finds the description
without leading spaces or trailing newline (but with any trailing
spaces).
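
The net effect on a FASTA header line can be sketched without
Easel. The function \ccode{split\_header()} below is hypothetical; it
assumes it is given the bytes after the '>' and before the newline,
and that space and tab are the only separators.

\begin{cchunk}
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Split the n bytes of a header line into a name token and a
 * description. Both are returned as newly allocated NUL-terminated
 * strings that the caller frees. Returns 0 on success, -1 on
 * allocation failure.
 */
int split_header(const char *p, size_t n, char **name, char **desc)
{
  size_t i = 0, start, len;

  assert(p != NULL && name != NULL && desc != NULL);
  *name = *desc = NULL;

  while (i < n && (p[i] == ' ' || p[i] == '\t')) i++;  /* leading separators  */
  start = i;
  while (i < n && p[i] != ' ' && p[i] != '\t') i++;    /* the name token      */
  len = i - start;
  if ((*name = malloc(len + 1)) == NULL) return -1;
  memcpy(*name, p + start, len);
  (*name)[len] = '\0';

  while (i < n && (p[i] == ' ' || p[i] == '\t')) i++;  /* trailing separators */
  len = n - i;                                         /* rest of the line    */
  if ((*desc = malloc(len + 1)) == NULL) {
    free(*name);
    *name = NULL;
    return -1;
  }
  memcpy(*desc, p + i, len);
  (*desc)[len] = '\0';
  return 0;
}
\end{cchunk}
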
\subsection{Using anchors: caller-defined limits on random access}
The naive way to enable random access on a sequential stream is to
slurp the whole stream into memory. If the stream is large, this may
be very memory inefficient. Many parsers do not need full random
access, but instead need a limited form of it -- for instance, the
three-pass case of determining format and alphabet from the start of a
sequence file. \ccode{ESL\_BUFFER} allows the caller to set an
\emph{anchor}: a start point in the input that is not allowed
to go away until the caller says so.
Setting an anchor declares that \ccode{mem[anchor..n-1]} is not to be
overwritten by new input reads. A new input read may first relocate
(``reoffset'') \ccode{mem[anchor..n-1]} to \ccode{mem[0..n-anchor-1]}
in order to use its current allocation efficiently. Setting an anchor
may therefore cause \ccode{mem} to be reoffset and/or reallocated, and
\ccode{balloc} may grow, if the buffer is not large enough to hold
everything starting from the \ccode{anchor} position. When no anchors
are set, \ccode{mem} will not be reoffset or reallocated.
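
The reoffset step amounts to a \ccode{memmove()} plus some
bookkeeping. A minimal sketch, assuming a simplified \ccode{struct
chunk} whose field names follow the text above but which is not the
real \ccode{ESL\_BUFFER} layout; reallocation is left out.

\begin{cchunk}
#include <assert.h>
#include <string.h>

/* A toy streamed buffer: mem[0..n-1] is input[baseoffset..baseoffset+n-1]. */
struct chunk {
  unsigned char *mem;
  size_t         n;           /* valid bytes in mem           */
  size_t         pos;         /* current point, 0..n          */
  size_t         anchor;      /* first byte that must be kept */
  size_t         baseoffset;  /* input offset of mem[0]       */
};

/* Slide mem[anchor..n-1] down to mem[0..n-anchor-1], freeing the
 * space at the end of the allocation for the next read.
 */
void chunk_reoffset(struct chunk *c)
{
  size_t shift;

  assert(c != NULL && c->anchor <= c->pos && c->pos <= c->n);
  shift = c->anchor;
  memmove(c->mem, c->mem + shift, c->n - shift);  /* regions may overlap */
  c->n          -= shift;
  c->pos        -= shift;
  c->anchor      = 0;
  c->baseoffset += shift;
}
\end{cchunk}
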
If we set an anchor at offset 0 in the input, then the entire input
will be progressively slurped into a larger and larger allocation of
memory as we read sequentially. We are guaranteed to be able to
reposition the buffer anywhere from the anchor to n-1, even in a
normally nonrewindable, nonpositionable stream. If we've read enough
to determine what we need (format, alphabet...), we can release the
anchor, and the buffer's memory usage will stop growing.
The functions that get a defined chunk of memory --
\ccode{esl\_buffer\_GetLine()}, \ccode{esl\_buffer\_GetToken()}, and
\ccode{esl\_buffer\_CopyBytes()} -- set an anchor at the start of the
line, token, or chunk of bytes before they go looking for its end.
This takes advantage of the anchor mechanism to make sure that the
buffer will contain the entire line, token, or chunk of bytes, not just a
truncated part.
\subsection{Token-based parsing}
A \esldef{token} is a substring consisting of characters not in a set
of caller-defined \esldef{separator} characters. Typically, separator
characters might be whitespace (\ccode{" \t"}).
Additionally, newlines are always considered to be separators. Tokens
cannot include newlines.
In token-based parsing, we can handle newlines in two ways. Sometimes
we might know exactly how many tokens we expect on the line. Sometimes
we don't care.
If the caller knows exactly how many tokens are expected on each line
of the input, it should not include newline characters in its
separator string. Now, if the caller asks for a token but no token
remains on the line, it will see a special \ccode{eslEOL} return code
(and the parser will be positioned at the next character after that
newline). A caller can check for this deliberately with one last call
to \ccode{esl\_buffer\_GetToken()} per line, to be sure that it sees
\ccode{eslEOL} rather than an unexpected token.
If the caller doesn't care how many tokens occur on each line, it
should include newline characters (\verb+"\r\n"+) in the separator
string. Then newlines are treated (and skipped) like any other
separator.
Starting from the current buffer position, the procedure for defining
a token is:
\begin{itemize}
\item Skip characters in the separator string. (If end-of-file is
reached, return \ccode{eslEOF}.)
\item If parser is on a newline, skip past it, and return
\ccode{eslEOL}. (Note that if the caller had newline characters
in the separator string, the first step already skipped any
newline, and no \ccode{eslEOL} return is possible.)
\item Anchor at the current buffer position, \ccode{p}.
\item From the current point, count characters \emph{not} in the
separator, \ccode{n}. (Expand/refill the buffer as needed.)
\item Define the token: \ccode{p[0..n-1]}.
\item Move the current point to the character following the token.
\end{itemize}
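
The steps above can be sketched on an in-memory input as follows. This
is an illustrative toy, not the \ccode{esl\_buffer\_GetToken()}
source: the names are hypothetical, there is no anchoring or refilling
because the whole input is assumed to be in \ccode{mem[0..n-1]}, and
only \verb+\n+ and \verb+\r\n+ count as newlines.

\begin{cchunk}
#include <assert.h>
#include <string.h>

enum tok_status { TOK_OK, TOK_EOL, TOK_EOF };

/* 1 if c is one of the separator characters in sep. */
int is_sep(const char *sep, char c)
{
  return (c != '\0' && strchr(sep, c) != NULL);
}

/* Length of the newline starting at mem[i]: 1 for \n, 2 for \r\n, else 0. */
size_t newline_len(const char *mem, size_t n, size_t i)
{
  if (i < n && mem[i] == '\n')                         return 1;
  if (i + 1 < n && mem[i] == '\r' && mem[i+1] == '\n') return 2;
  return 0;
}

/* Find the next token in mem[*pos..n-1] without modifying mem.
 * On TOK_OK, *tok points at the token and *toklen is its length.
 * On TOK_EOL, *pos is left just past the newline.
 */
enum tok_status next_token(const char *mem, size_t n, size_t *pos,
                           const char *sep, const char **tok, size_t *toklen)
{
  size_t i, nl;

  assert(mem != NULL && pos != NULL && sep != NULL);
  assert(tok != NULL && toklen != NULL && *pos <= n);
  i = *pos;

  while (i < n && is_sep(sep, mem[i])) i++;       /* skip separators     */
  if (i == n) { *pos = i; return TOK_EOF; }

  nl = newline_len(mem, n, i);
  if (nl > 0) { *pos = i + nl; return TOK_EOL; }  /* skip past a newline */

  *tok = mem + i;                                 /* token starts here   */
  while (i < n && !is_sep(sep, mem[i]) && newline_len(mem, n, i) == 0) i++;
  *toklen = (size_t) ((mem + i) - *tok);          /* tok[0..toklen-1]    */
  *pos    = i;                                    /* point follows token */
  return TOK_OK;
}
\end{cchunk}

With \verb+" \t"+ as the separators, the input \verb+"ab  cd\nef"+
yields two tokens, an end-of-line, one more token, and then
end-of-file, with no end-of-line reported for the unterminated last
line.
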
\subsection{Newline handling}
Easel assumes that newlines are encoded as \verb+\n+ (UNIX, Mac OS/X)
or \verb+\r\n+ (MS Windows).
All streams are opened as binary data. This is necessary to guarantee
a one:one correspondence between data offsets in memory and data
offsets on the filesystem, which we need for file positioning
purposes. It is also necessary to guarantee that we can read text
files that have been produced on a system other than the system we're
reading them on (that we can read Windows text files on a Linux
system, for example).\footnote{That is, the usual ANSI C convention of
reading/writing in ``text mode'' does not suffice, because it
assumes the newlines of the system we're on, not necessarily the
system that produced the file.} However, it makes us responsible
for handling the system-specific definitions of ``newline'' character(s) in
ASCII text files.
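
A line reader that accepts both conventions on raw bytes might look
like the sketch below. The function \ccode{next\_line()} is
hypothetical and assumes the whole input is in memory; offsets are
plain byte offsets, so they correspond one-to-one with positions in
the file.

\begin{cchunk}
#include <assert.h>
#include <stddef.h>

/* Get the next line from mem[*pos..n-1]. On success returns 1, sets
 * *line and *len (excluding the newline), and moves *pos past the
 * newline. Returns 0 at end of input. A final line without a newline
 * is still returned.
 */
int next_line(const char *mem, size_t n, size_t *pos,
              const char **line, size_t *len)
{
  size_t i, end;

  assert(mem != NULL && pos != NULL && line != NULL && len != NULL);
  assert(*pos <= n);
  if (*pos == n) return 0;

  i = *pos;
  while (i < n && mem[i] != '\n') i++;      /* find \n or end of input */
  end = i;
  if (i < n && end > *pos && mem[end-1] == '\r')
    end--;                                  /* \r\n: drop the \r too   */

  *line = mem + *pos;
  *len  = end - *pos;
  *pos  = (i < n) ? i + 1 : n;
  return 1;
}
\end{cchunk}
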
\subsection{Implementation notes (for developers)}
\paragraph{The state guarantee.} An \ccode{ESL\_BUFFER} is exchangeable
and sharable even amongst entirely different types of parsers because
it is virtually always guaranteed to be in a well-defined
state. Specifically:
\begin{itemize}
\item \ccode{bf->mem[bf->pos]} is ALWAYS positioned at the next byte
that a parser needs to parse, unless the buffer is at EOF.
\item There are ALWAYS at least \ccode{pagesize} bytes available to
parse, provided the input stream has not reached EOF.
\end{itemize}
\paragraph{State in different input type modes.}
There are six types (``modes'') of inputs:
\begin{tabular}{ll}
Mode & Description \\ \hline
\ccode{eslBUFFER\_STDIN} & Standard input. \\
\ccode{eslBUFFER\_CMDPIPE} & Output piped from a command. \\
\ccode{eslBUFFER\_FILE} & A \ccode{FILE} being streamed. \\
\ccode{eslBUFFER\_ALLFILE} & A file entirely slurped into RAM. \\
\ccode{eslBUFFER\_MMAP} & A file that's memory mapped (\ccode{mmap()}). \\
\ccode{eslBUFFER\_STRING} & A string or memory. \\ \hline
\end{tabular}
The main difference between modes is whether the input is being read
into the buffer's memory in chunks, or whether the buffer's memory
effectively contains the entire input:
\begin{tabular}{lll}
& \ccode{STDIN, CMDPIPE, FILE} & \ccode{ALLFILE, MMAP, STRING} \\
\ccode{mem} & input chunk: \ccode{mem[0..n-1]} is \ccode{input[baseoffset..baseoffset+n-1]} & entire input: \ccode{mem[0..n-1]} is \ccode{input[0..n-1]} \\
\ccode{n} & current chunk size & entire input size (exclusive of \verb+\0+ on a \ccode{STRING}) \\
\ccode{balloc} & $>0$; \ccode{mem} is reallocatable & 0; \ccode{mem} is not reallocated \\
\ccode{fp} & open; \ccode{feof(fp) = TRUE} near EOF & \ccode{NULL} \\
\ccode{baseoffset} & offset of byte \ccode{mem[0]} in input & 0 \\
\end{tabular}
\paragraph{Behavior at end-of-input (``end-of-file'', EOF).}
The buffer can be in one of three states with respect to how near it
is to EOF, as follows.
During normal parsing, \ccode{bf->n - bf->pos >= bf->pagesize}:
\begin{cchunk}
mem-> {[. . . . . . . . . . . . . . . .] x x x x}
        ^ baseoffset  ^ pos            ^ n      ^ balloc
                      [~ ~ ~ ~ ~ ~ ~ ~]
                      n-pos >= pagesize
\end{cchunk}
As input nears EOF and we are within the last \ccode{pagesize} bytes,
\ccode{bf->n - bf->pos < bf->pagesize}:
\begin{cchunk}
mem-> {[. . . . . . . . . . . . . . . .] x x x x}
        ^ baseoffset            ^ pos  ^ n      ^ balloc
\end{cchunk}
In modes where we might be reading input in streamed chunks
(\ccode{eslBUFFER\_STDIN}, \ccode{eslBUFFER\_CMDPIPE}, and
\ccode{eslBUFFER\_FILE}), \ccode{feof(bf->fp)} becomes \ccode{TRUE}
when the buffer nears EOF.
When the input is entirely consumed (at EOF), \ccode{bf->pos == bf->n}:
\begin{cchunk}
mem-> {[. . . . . . . . . . . . . . . .] x x x x}
        ^ baseoffset                   ^ n      ^ balloc
                                       ^ pos
\end{cchunk}
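
The three cases can be told apart from \ccode{n}, \ccode{pos}, and
\ccode{pagesize} alone. A small sketch, with hypothetical names and
the fields passed as plain arguments:

\begin{cchunk}
#include <assert.h>
#include <stddef.h>

enum eof_state { STATE_NORMAL, STATE_NEAR_EOF, STATE_AT_EOF };

/* Classify a buffer by how many unparsed bytes remain after pos. */
enum eof_state classify_state(size_t n, size_t pos, size_t pagesize)
{
  assert(pos <= n);
  if (pos == n)           return STATE_AT_EOF;    /* nothing left           */
  if (n - pos < pagesize) return STATE_NEAR_EOF;  /* within the last page   */
  return STATE_NORMAL;                            /* at least a page left   */
}
\end{cchunk}
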
\paragraph{The use of \ccode{esl\_pos\_t}.}
All integer variables for a position or length in memory or in a file
are of type \ccode{esl\_pos\_t}. In POSIX, memory positions are an
unsigned integer type \ccode{size\_t}, and file positions are a signed
integer type \ccode{off\_t}. Easel wants to assure an integer type
that we can safely cast to either \ccode{size\_t} or \ccode{off\_t},
and in which we can safely store a negative number as a status flag
(such as -1 for ``currently unset''). \ccode{esl\_pos\_t} is defined
as the largest signed integer type that can be safely cast to
\ccode{size\_t} or \ccode{off\_t}.
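
As an illustration of the idea, the sketch below uses a 64-bit signed
integer as a stand-in for \ccode{esl\_pos\_t}. That choice is an
assumption made for the example; check the Easel headers for the
actual definition. \ccode{POS\_UNSET}, \ccode{pos\_fits\_size()}, and
\ccode{pos\_to\_size()} are hypothetical names.

\begin{cchunk}
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t pos_t;           /* hypothetical stand-in for esl_pos_t */
#define POS_UNSET ((pos_t) -1)   /* negative values are free for flags  */

/* Return 1 if x is a real (nonnegative) position that fits in a size_t. */
int pos_fits_size(pos_t x)
{
  return (x >= 0 && (uint64_t) x <= (uint64_t) SIZE_MAX);
}

/* Convert to size_t, checking first that the cast is safe. */
size_t pos_to_size(pos_t x)
{
  assert(pos_fits_size(x));
  return (size_t) x;
}
\end{cchunk}
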