Releases: johnkerl/miller
Interpolated percentiles, markdown-tabular output format, CSV-quote preservation
Major features:
- Interpolated percentiles are now available using `mlr stats1 -i` or `mlr merge-fields -i`. Non-interpolated percentiles are the default. The former resemble R's `type=7` quantiles and the latter resemble R's `type=1` quantiles. See also http://johnkerl.org/miller/doc/reference.html#stats1 and http://johnkerl.org/miller/doc/reference.html#merge-fields.
- Markdown-tabular output format is now available using `--omd`: please see http://johnkerl.org/miller/doc/file-formats.html#Markdown_tabular and #106.
- For files using CSV input as well as CSV output, there is now a `--quote-original` option which outputs fields with quotes if they had them on input. The was-quoted flag isn't tracked on derived fields: e.g. if fields `a` and `b` were quoted on input, then in `mlr put '$c = $a . $b'` the `c` field won't be quoted on output. As such, this option is most useful with `mlr cut`, `mlr filter`, etc. The use case from the original feature request #77 (comment) is trimming down a huge CSV file in order to facilitate subsequent in-memory processing using spreadsheet software.
- The cookbook at http://johnkerl.org/miller/doc/cookbook.html has been extended significantly.
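To make the difference between the two percentile flavors concrete, here is a minimal Python sketch of the R-style definitions the note refers to. It is not Miller's implementation (the note says only that Miller's percentiles "resemble" these), and the function names `quantile_type1` and `quantile_type7` are made up for illustration.

```python
import math

def quantile_type1(sorted_xs, p):
    """Non-interpolated quantile in the style of R's type=1 (inverse empirical CDF).

    Assumes sorted_xs is non-empty and sorted ascending, and 0 <= p <= 1.
    """
    n = len(sorted_xs)
    if p <= 0:
        return sorted_xs[0]
    idx = math.ceil(n * p) - 1          # 0-based index of the chosen observation
    return sorted_xs[min(idx, n - 1)]

def quantile_type7(sorted_xs, p):
    """Linearly interpolated quantile in the style of R's type=7.

    Same assumptions on the inputs as quantile_type1.
    """
    n = len(sorted_xs)
    h = (n - 1) * p                     # fractional position in the sorted data
    lo = math.floor(h)
    hi = min(lo + 1, n - 1)
    frac = h - lo
    return sorted_xs[lo] + frac * (sorted_xs[hi] - sorted_xs[lo])

xs = [10, 20, 30, 40]
median_plain = quantile_type1(xs, 0.5)   # an observed value: 20
median_interp = quantile_type7(xs, 0.5)  # halfway between 20 and 30: 25.0
```

The non-interpolated form always returns a value that occurs in the data, while the interpolated form can return a value between two observations.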
Minor features:
- You can now set a `MLR_CSV_DEFAULT_RS=lf` environment variable if you're tired of always putting `--rs lf` arguments for your CSV files: http://johnkerl.org/miller/doc/file-formats.html#CSV/TSV/etc.
- The `printn` and `eprintn` commands for `mlr put` are identical to `print` and `eprint` except they don't print final newlines.
- It is now an error if boundvars in the same for-loop expression have duplicate names, e.g. `for (a,a in $*) {...}` results in the error message `mlr: duplicate for-loop boundvars "a" and "a"`.
- The `strptime` function would announce an internal coding error on malformed format strings; now, it correctly points out the user-level error.
Bug fixes:
- Percentiles in `merge-fields` were not working. This was fixed; also, the missing unit-test cases which would have caught this sooner have been filled in.
- Miller's CSV output-quoting was non-RFC-compliant: double-quotes within field names were not being duplicated. This has been fixed (#104).
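For readers unfamiliar with the quoting rule behind the CSV fix, the following Python sketch shows the RFC 4180 convention of doubling embedded double-quotes. The helper name `quote_csv_field` is hypothetical and this is not Miller's code; it only illustrates the rule.

```python
def quote_csv_field(field):
    """Quote one CSV field in the RFC 4180 style.

    A field containing a comma, double-quote, CR, or LF is wrapped in
    double-quotes, and each embedded double-quote is doubled.
    """
    needs_quotes = any(c in field for c in ',"\r\n')
    if not needs_quotes:
        return field
    return '"' + field.replace('"', '""') + '"'

plain = quote_csv_field("plain")        # unchanged: plain
quoted = quote_csv_field('say "hi"')    # becomes: "say ""hi"""
```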
Brew update: Homebrew/homebrew-core#2698
Multi-emit
You can now emit multiple out-of-stream variables side-by-side.
Doc link: http://johnkerl.org/miller/doc/reference.html#Multi-emit_statements_for_put
Example:
$ mlr --from data/medium --opprint put -q '
  @x_count[$a][$b] += 1;
  @x_sum[$a][$b] += $x;
  end {
    for ((a, b), _ in @x_count) {
      @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b]
    }
    emit (@x_sum, @x_count, @x_mean), "a", "b"
  }
'
a b x_sum x_count x_mean
pan pan 219.185129 427 0.513314
pan wye 198.432931 395 0.502362
pan eks 216.075228 429 0.503672
pan hat 205.222776 417 0.492141
pan zee 205.097518 413 0.496604
eks pan 179.963030 371 0.485076
eks wye 196.945286 407 0.483895
eks zee 176.880365 357 0.495463
eks eks 215.916097 413 0.522799
eks hat 208.783171 417 0.500679
wye wye 185.295850 377 0.491501
wye pan 195.847900 392 0.499612
wye hat 212.033183 426 0.497730
wye zee 194.774048 385 0.505907
wye eks 204.812961 386 0.530604
zee pan 202.213804 389 0.519830
zee wye 233.991394 455 0.514267
zee eks 190.961778 391 0.488393
zee zee 206.640635 403 0.512756
zee hat 191.300006 409 0.467726
hat wye 208.883010 423 0.493813
hat zee 196.349450 385 0.509999
hat eks 189.006793 389 0.485879
hat hat 182.853532 381 0.479931
hat pan 168.553807 363 0.464336
Note that this example simply recapitulates the easier-to-type
mlr --from ../data/medium --opprint stats1 -a sum,count,mean -f x -g a,b
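As a rough analogue of what the multi-emit example computes, here is a Python sketch that keeps per-group sums and counts keyed by the `a` and `b` fields, then produces one row per group with all three statistics side by side. The function name `grouped_stats` and the sample records are invented for illustration; this is not how Miller is implemented.

```python
from collections import defaultdict

def grouped_stats(records):
    """Accumulate sum and count of x per (a, b) group, then derive the mean.

    Returns one dict per group, in first-seen order, with the three
    statistics side by side (loosely like emitting three variables together).
    """
    x_sum = defaultdict(float)
    x_count = defaultdict(int)
    for rec in records:
        key = (rec["a"], rec["b"])
        x_sum[key] += rec["x"]
        x_count[key] += 1
    return [
        {
            "a": a,
            "b": b,
            "x_sum": x_sum[(a, b)],
            "x_count": x_count[(a, b)],
            "x_mean": x_sum[(a, b)] / x_count[(a, b)],
        }
        for (a, b) in x_count
    ]

# Hypothetical input records, not taken from data/medium.
rows = grouped_stats([
    {"a": "pan", "b": "pan", "x": 0.5},
    {"a": "pan", "b": "pan", "x": 1.5},
    {"a": "eks", "b": "wye", "x": 2.0},
])
```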
Brew update: Homebrew/homebrew-core#2213
for/if/while and various features
While one of Miller’s strengths is its brevity, and so its domain-specific language is intentionally simple, the ability to loop over field names is a basic thing to want. Likewise for other control structures on the same complexity level as `awk`. Miller has always owed much inspiration to `awk`; 4.1.0 makes this more explicit by providing several common language idioms.
Major features:
- For-loops over key-value pairs in stream records and out-of-stream variables
- Loops using `while` and `do while`
- `break` and `continue` in `for`, `while`, and `do while` loops
- If-elif-else statements
- Nestability of all the above, as well as of existing pattern-action blocks
Additional features:
- Computable field names using square brackets, e.g. `$[$a.$b] = $a * $b`
- Type-predicate functions: `isnumeric`, `isint`, `isfloat`, `isbool`, `isstring`
- Commenting using pound signs
- The new `print` and `eprint` allow formatting of arbitrary expressions to stdout/stderr, respectively
- In addition to the existing `dump` which formats all out-of-stream variables to stdout as JSON, the new `edump` does the same to stderr
- Semicolon is no longer required after closing curly brace
- `emit @*` and `unset @*` are new synonyms for `emit all` and `unset all`
- `unset $*` now exists
- `mlr -n` is synonymous with `mlr --from /dev/null`, which is useful in dataless contexts wherein all your `put` statements are contained within `begin`/`end` blocks
- Bugfix: in 4.0.0, `mlr put -v '@a[1][2]=$b;$new=@a[1][2]' mydata.tbl` would crash with a memory-management error.
Syntax example:
% mlr --from estimates.tbl put '
  for (k,v in $*) {
    if (isnumeric(v) && k =~ "^[t-z].*$") {
      $sum += v; $count += 1
    }
  }
  $mean = $sum / $count # no assignment if count unset
'
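For readers more at home in a general-purpose language, the loop above can be loosely paraphrased in Python as follows. This is only an analogue: `is_numeric` here relies on Python's `float()` parsing, which need not agree with Miller's `isnumeric` in every case, and the function names are hypothetical.

```python
import re

def is_numeric(value):
    """Rough stand-in for a numeric type predicate, using Python's float()."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def add_mean_of_matching_fields(record):
    """Sum the numeric fields whose names start with t through z.

    Returns a copy of the record with sum, count, and mean added; when no
    field qualifies, nothing is added (mirroring "no assignment if count unset").
    """
    total = 0.0
    count = 0
    for key, value in record.items():
        if is_numeric(value) and re.search(r"^[t-z].*$", key):
            total += float(value)
            count += 1
    out = dict(record)
    if count > 0:
        out["sum"] = total
        out["count"] = count
        out["mean"] = total / count
    return out

# Hypothetical record: t1, x, and u qualify; y is non-numeric; name does not match.
result = add_mean_of_matching_fields(
    {"name": "a", "t1": "1", "x": "2.5", "y": "abc", "u": "3"}
)
```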
Document links:
- http://johnkerl.org/miller/doc/reference.html#If-statements_for_put
- http://johnkerl.org/miller/doc/reference.html#While_and_do-while_loops_for_put
- http://johnkerl.org/miller/doc/reference.html#For-loops_for_put
- http://johnkerl.org/miller/doc/reference.html#Field_names_for_filter
- http://johnkerl.org/miller/doc/reference.html#Field_names_for_put
- http://johnkerl.org/miller/doc/reference.html#Functions_for_filter_and_put
- http://johnkerl.org/miller/doc/reference.html#Semicolons,_newlines,_and_curly_braces_for_put
- http://johnkerl.org/miller/doc/cookbook.html
Brew update: Homebrew/homebrew-core#1895
Variables, begin/end blocks, pattern-action blocks
This major release dramatically expands the expressive power of Miller's `put` DSL. The TL;DR is that you can now write things like
mlr put '@x_sum += $x; end { emit @x_sum }'
For full details please see the following reference sections:
- http://johnkerl.org/miller/doc/reference.html#put
- http://johnkerl.org/miller/doc/reference.html#Out-of-stream_variables_for_put
- http://johnkerl.org/miller/doc/reference.html#Pattern-action_blocks_for_put
- http://johnkerl.org/miller/doc/reference.html#Begin/end_blocks_for_put
- http://johnkerl.org/miller/doc/reference.html#Indexed_out-of-stream_variables_for_put
- http://johnkerl.org/miller/doc/reference.html#Emit_statements_for_put
- http://johnkerl.org/miller/doc/reference.html#Unset_statements_for_put
as well as the following cookbook section:
Additional minor features in Miller 4.0.0:
- Compound assignment operators such as `+=`, `<<=`, etc. are not new but were not previously announced in a release note.
- Double-backslashing behavior for `sub` and `gsub` has been fixed: `echo 'x=a\tb' | mlr put '$x=sub($x,"\\t","TAB")'` now prints `aTABb` as desired. (The underlying issue was an unfortunate interaction between Miller's backslash-handling and the system regex library's backslash-handling.)
- As an alternative to specifying input files as the last items on the Miller command line, you can now specify a single input file before other command-line switches and verbs using `--from`: for example, `mlr --from myfile.dat put '$z = $x + $y' then stats1 -a sum -f z`. The context is simple keystroke-reduction for interactively appending then-chains by up-arrowing at the command line: it's easier to iterate when you don't have to left-arrow past the input file name.
New data-rearrangers: nest, shuffle, repeat; misc. features
Major features in this release:
- `mlr nest` is a companion to `mlr reshape` which was introduced in Miller 3.4.0: it allows unpacking key-value pairs which are nested within field values, and repacking them. Please see http://johnkerl.org/miller/doc/reference.html#nest.
- `mlr shuffle` is a simple output-record permutor: http://johnkerl.org/miller/doc/reference.html#shuffle
- `mlr repeat` can be used as a data-generator, to expand a few input records (or even a single one) into arbitrarily many. This is particularly useful in conjunction with pseudorandom-number generators. As well, it can be used to reconstruct individual samples from data which have been count-aggregated, so that statistics such as `mode`, percentiles, etc. may be computed on them. Please see http://johnkerl.org/miller/doc/reference.html#repeat.
- `mlr put` and `mlr filter` now accept a `-f {filename}` option, so that the DSL expression may be placed within a file instead of being typed out on the command line when desired. Please see http://johnkerl.org/miller/doc/reference.html#put and http://johnkerl.org/miller/doc/reference.html#filter.
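The idea of reconstructing individual samples from count-aggregated data, mentioned above for `mlr repeat`, can be sketched in Python as below. The helper `expand_counts`, its `count_field` parameter, and the sample rows are assumptions made for illustration; the sketch does not describe Miller's internals or its exact output.

```python
import statistics

def expand_counts(rows, count_field="count"):
    """Emit each row as many times as its count field says.

    Each emitted row is a copy of the original (count field included),
    so per-sample statistics can be computed on the expanded list.
    """
    expanded = []
    for row in rows:
        n = int(row[count_field])
        expanded.extend(dict(row) for _ in range(n))
    return expanded

# Hypothetical count-aggregated data: value 5 was observed three times.
aggregated = [
    {"value": 1, "count": 1},
    {"value": 5, "count": 3},
    {"value": 9, "count": 1},
]
samples = [row["value"] for row in expand_counts(aggregated)]
median_value = statistics.median(samples)
mode_value = statistics.mode(samples)
```

Once the rows are expanded, order statistics such as the median and the mode can be computed with ordinary per-sample tools.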
Minor features:
- `put`/`filter` DSL string literals now may include `\t`, `\"`, etc.: e.g. `mlr put '$out = $left . "\t" . $right'`
- There is now a `typeof` function for the `put`/`filter` DSLs: `mlr put '$xtype = typeof($x)'`. This is occasionally useful for debugging type-conversion questions.
- You may now do `mlr --nr-progress-mod 1000000 ...` to get something printed to stderr every 1000000th input record, and so on. For long-running aggregations on large input file(s), this can provide reassurance that processing is indeed proceeding apace. Example:
$ mlr --nr-progress-mod 100000 check data/big.dkvp
NR=100000 FNR=100000 FILENAME=data/big.dkvp
NR=200000 FNR=200000 FILENAME=data/big.dkvp
NR=300000 FNR=300000 FILENAME=data/big.dkvp
NR=400000 FNR=400000 FILENAME=data/big.dkvp
NR=500000 FNR=500000 FILENAME=data/big.dkvp
NR=600000 FNR=600000 FILENAME=data/big.dkvp
NR=700000 FNR=700000 FILENAME=data/big.dkvp
...
- `mlr cat -n` had a bug wherein it counted zero-up while its documentation claimed it counted one-up. Now it counts one-up as documented.
JSON, reshape, regex captures, and more
Primary features:
- JSON is now a supported format for input and output. Miller handles tabular data, and JSON supports arbitrarily deeply nested data structures, so if you want general JSON processing you should use `jq`. But if you have tabular data represented in JSON then Miller can now handle that for you. Please see the reference page and the FAQ.
- Reshape is a standard data-processing idiom, now available in Miller: http://johnkerl.org/miller/doc/reference.html#reshape
- Incidentally (not part of this release, but new since the last release) Miller is now available in FreeBSD's package manager: https://www.freshports.org/textproc/miller/. A full list of distributions containing Miller may be found here.
- Miller is not yet available from within Fedora/CentOS, but as a step toward this goal, an SRPM is included in this release (see file-list below).
DSL enhancements for `mlr put` and `mlr filter`:
- Regex captures `\0` through `\9`: http://johnkerl.org/miller/doc/reference.html#Regex_captures
- Ternary operator in expression right-hand sides: e.g. `mlr put '$y = $x < 0.5 ? 0 : 1'`
- Boolean literals `true` and `false`
- Final semicolon is now allowed: e.g. `mlr put '$x=1;$y=2;'`
- Environment variables are now accessible, where environment-variable names may be string literals or arbitrary expressions: `mlr put '$home = ENV["HOME"]'` or `mlr put '$value = ENV[$name]'`.
- While records are still string-to-string maps for input and output, and between `then` statements, types are preserved between multiple statements within a `put`. Example: `mlr put '$y = string($x); $z = $y . $y'` works as expected, without requiring `mlr put '$y = string($x); $z = string($y) . string($y)'` as before.
Bug fixes:
- Mixed-format join, e.g. CSV file joined with DKVP file, was incorrectly computing default separators (`IRS`, `IFS`, `IPS`). This resulted in records not being joined together.
- Segmentation violation on non-standard-input read of files with size an exact multiple of page size and not ending in `IRS`, e.g. newline. (This is less of a corner case than it sounds: for example, leave a long-running program running with output redirected to a file, then in a sleep-and-process loop, have Miller process that file. The former program's stdio library will likely be doing block-sized buffered I/O, where block sizes will often be multiples of system page size and the block will almost surely not end in a newline.)
Acknowledgements: Big thank-yous to @gregfr and @aaronwolen for feature requests including reshape and regex captures, and to @jungle-boogie for his work getting Miller into FreeBSD. Also, ongoing thanks to @0-wiz-0 for his past work on configure support, making it possible for Miller to be put to use in multiple operating systems.
Bootstrap sampling, EWMA, merge-fields, isnull/isnotnull functions
- Bootstrap sampling in `mlr bootstrap`: http://johnkerl.org/miller/doc/reference.html#bootstrap. Compare to reservoir sampling in `mlr sample`: http://johnkerl.org/miller/doc/reference.html#sample.
- Exponentially weighted moving averages in `mlr step -a ewma`: principally useful for smoothing of noisy time series, e.g. finely sampled system-resource utilization, to give one of many possible examples. Please see http://johnkerl.org/miller/doc/reference.html#step.
- "Horizontal" univariate statistics in `mlr merge-fields`, compared to `mlr stats` which is "vertical". Also allows collapsing multiple fields into one, such as `in_bytes` and `out_bytes` data fields summing to `bytes_sum`. This can also be done easily using `mlr put`. However, `mlr merge-fields` allows aggregation of more than just a pair of field names, and supports pattern-matching on field names. Please see http://johnkerl.org/miller/doc/reference.html#merge-fields for more information.
- `isnull` and `isnotnull` functions for `mlr filter` and `mlr put`.
- `stats1`, `stats2`, `merge-fields`, `step`, and `top` correctly handle not only missing fields (in the row-heterogeneous-data case) but also null-valued fields.
- Minor memory-management improvements.
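As background for the EWMA item above, here is a minimal Python sketch of the usual exponentially weighted moving-average recurrence. Seeding the average with the first sample is an assumption made here, and the function name `ewma` and parameter `alpha` are illustrative; Miller's exact options and conventions are described in the reference linked above.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average of a sequence.

    Each output is alpha * current + (1 - alpha) * previous output,
    with the first output seeded by the first input (an assumption).
    Larger alpha tracks the input more closely; smaller alpha smooths more.
    """
    smoothed = []
    prev = None
    for x in values:
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed

# With alpha = 0.5 each output is the midpoint of the new sample and the
# previous output: [1.0, 1.5, 2.25]
series = ewma([1.0, 2.0, 3.0], 0.5)
```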
Performance improvements, compressed I/O, and variable-name escaping
- RFC-CSV read performance is dramatically improved and is now on par with other formats; read performance for all formats is slightly improved as well.
- Variable names can now be escaped, using curly braces if there are special characters in the input-data field names. Example: `mlr put '${bytes.total} = ${bytes.in} + ${bytes.out}'`. See also #77 where this was requested.
- Compressed I/O is now supported, using built-in compatibility with local system tools: http://johnkerl.org/miller/doc/reference.html#Compression. See also #77 where this was requested.
- `mlr uniq` is now streaming (bounded memory use, functionality in `tail -f` contexts) when possible: i.e. when `-n` and `-c` are not specified.
- Thorough valgrind-driven testing has been used to tighten memory usage. This is mostly an invisible internal improvement, although it also yields a slight across-the-board performance improvement and allows Miller to handle even larger files in limited-memory contexts.
Bugfix for stats1 max
`mlr stats1 max` was reporting the same value as `mlr stats1 min`, although `p100` was unaffected. This error has been present since the 3.0.0 release. It was reported on #92.
Fix regression tests for i386
No functionality had been broken for i386: the changes are for the test framework only, to get validated builds on all available platforms.