Autodetected line-endings, in-place mode, user-defined functions, and more
This major release significantly expands the expressiveness of the DSL for `mlr put` and `mlr filter`. (The upcoming 5.1.0 release will add the ability to aggregate across all columns for non-DSL verbs such as `mlr stats1` and `mlr stats2`. As well, a Windows port is underway.)
Please also see the Miller main docs.
Simple but impactful features:
- Line endings (CRLF vs. LF, Windows-style vs. Unix-style) are now autodetected. For example, files (including CSV) with LF input will lead to LF output unless you specify otherwise.
- There is now an in-place mode using `mlr -I`.
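As a minimal sketch of in-place mode, the following sorts a small CSV file and writes the result back to the same file. The file name `example.csv` and its contents are made up for illustration.

```shell
# Create a small CSV file (hypothetical example data, LF line endings).
printf 'a,b\n3,x\n1,y\n2,z\n' > example.csv

# Sort numerically ascending by field a; -I writes the output back to example.csv.
mlr -I --csv sort -nf a example.csv

# Show the rewritten file.
cat example.csv
```

Since the input uses LF line endings, the rewritten file should too, per the autodetection described above.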
Major DSL features:
- You can now define your own functions and subroutines: e.g. `func f(x, y) { return x**2 + y**2 }`.
- New local variables are completely analogous to out-of-stream variables: `sum` retains its value for the duration of the expression it's defined in; `@sum` retains its value across all records in the record stream.
- Local variables, function parameters, and function return types may be defined untyped or typed as in `x = 1` or `int x = 1`, respectively. There are also expression-inline type-assertions available. Type-checking is up to you: omit it if you want flexibility with heterogeneous data; use it if you want to help catch misspellings in your DSL code or unexpected irregularities in your input data.
- There are now four kinds of maps. Out-of-stream variables have always been scalars, maps, or multi-level maps: `@a=1`, `@b[1]=2`, `@c[1][2]=3`. The same is now true for local variables, which are new to 5.0.0. Stream records have always been single-level maps; `$*` is a map. And as of 5.0.0 there are now map literals, e.g. `{"a":1, "b":2}`, which can be defined using JSON-like syntax (with either string or integer keys) and which can be nested arbitrarily deeply.
- You can loop over maps (`$*`, out-of-stream variables, local variables, map literals, and map-valued function return values) using `for (k, v in ...)` or the new `for (k in ...)` (discussed below). All flavors of map may also be used in `emit` and `dump` statements.
- User-defined functions and subroutines may take map-valued arguments, and may return map values.
- Some built-in functions now accept map-valued input: `typeof`, `length`, `depth`, `leafcount`, `haskey`. There are built-in functions producing map-valued output: `mapsum` and `mapdiff`. There are now string-to-map and map-to-string functions: `splitnv`, `splitkv`, `splitnvx`, `splitkvx`, `joink`, `joinv`, and `joinkv`.
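Several of the major DSL features can be tried together in one short script. This is only a sketch: the file name `demo.mlr`, the function name `sumsq`, and the variables `n` and `m` are all made up for illustration.

```shell
# Save a small DSL script; every name in it is hypothetical.
cat > demo.mlr <<'EOF'
# User-defined function with typed parameters and a typed return value.
func sumsq(num x, num y): num {
  return x**2 + y**2;
}

end {
  # Typed local variable, scoped to this block.
  int n = 3;

  # Nested map literal assigned to a map-typed local.
  map m = {"a": 1, "b": {"c": n}};

  # Two-variable for-loop: v may itself be map-valued.
  for (k, v in m) {
    print k . ": " . typeof(v);
  }

  # Key-only for-loop.
  for (k in m) {
    print "key " . k;
  }

  print sumsq(n, 4);
}
EOF

# -n tells Miller to read no input, so only the end block runs.
mlr -n put -f demo.mlr
```

Dropping the type declarations (`num`, `int`, `map`) should leave the behavior unchanged here; they only add checking.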
Minor DSL features:
- For iterating over maps (namely, local variables, out-of-stream variables, stream records, map literals, or return values from map-valued functions) there is now a key-only for-loop syntax: e.g. `for (k in $*) { ... }`. This is in addition to the already-existing `for (k, v in ...)` syntax.
- There are now triple-statement for-loops (familiar from many other languages), e.g. `for (int i = 0; i < 10; i += 1) { ... }`.
- `mlr put` and `mlr filter` now accept multiple `-f` for script files, freely intermixable with `-e` for expressions. The suggested use case is putting user-defined functions in script files and one-liners calling them using `-e`. Example: `myfuncs.mlr` defines the function `f(...)`, then `mlr put -f myfuncs.mlr -e '$o = f($i)' myfile.dat`. More information is here.
- `mlr filter` is now almost identical to `mlr put`: it can have multiple statements, it can use `begin` and/or `end` blocks, it can define and invoke functions. Its final expression must evaluate to boolean which is used as the filter criterion. More details are here.
- The min and max functions are now variadic: `$o = max($a, $b, $c)`.
- There is now a substr function.
- While `ENV` has long provided read-access to environment variables on the right-hand side of assignments (as a `getenv`), it now can be at the left-hand side of assignments (as a `putenv`). This is useful for subsidiary processes created by `tee`, `emit`, `dump`, or `print` when writing to a pipe.
- The `#` in comments is now handled in the lexer, so you can now (correctly) include `#` in strings.
- Separators are now available as read-only variables in the DSL: `IPS`, `IFS`, `IRS`, `OPS`, `OFS`, `ORS`. These are particularly useful with the split and join functions: e.g. with `mlr --ifs tab ...`, the `IFS` variable within a DSL expression will evaluate to a string containing a tab character.
- Syntax errors in DSL expressions now have a little more context.
- DSL parsing and execution are a bit more transparent. There have long been `-v` and `-t` options to `mlr put` and `mlr filter`, which print the expression's abstract syntax tree and do a low-level parser trace, respectively. There are now additionally `-a` which traces stack-variable allocation and `-T` which traces statements line by line as they execute. While `-v`, `-t`, and `-a` are most useful for development of Miller, the `-T` option gives you more visibility into what your Miller scripts are doing. See also here.
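The `-f`/`-e` combination, the variadic `max`, and a multi-statement `mlr filter` can be sketched as follows. The file names come from the example above; the body of `f`, the input data, and the local variable `t` are assumptions made for illustration.

```shell
# Hypothetical script file defining the function f.
cat > myfuncs.mlr <<'EOF'
func f(x) {
  return x * 2
}
EOF

# Hypothetical input in Miller's default key=value (DKVP) format.
printf 'i=1\ni=5\n' > myfile.dat

# Function definitions come from the script file; the one-liner calls them.
mlr put -f myfuncs.mlr -e '$o = f($i)' myfile.dat

# max (and min) accept any number of arguments.
mlr -n put 'end { print max(1, 7, 3) }'

# filter with more than one statement: the final bare boolean is the criterion.
mlr filter 'num t = $i * 10; t > 20' myfile.dat
```

Keeping reusable functions in a file and the per-invocation logic in `-e` means the one-liner stays short enough to edit on the command line.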
Verbs:
- most-frequent and least-frequent as requested in #110.
- seqgen makes it easy to generate data from within Miller: please also see here for a usage example.
- unsparsify makes it easy to rectangularize data where not all records have the same fields.
- cat -n now takes a group-by (-g) option, making it easy to number records within categories.
- count-distinct, uniq, most-frequent, least-frequent, top, and histogram now take a `-o` option for specifying their output field names, as requested in #122.
- Median is now a synonym for p50 in stats1.
- You can now start a `then` chain with an initial `then`, which is nice in backslashy/multiline-continuation contexts. This was requested in #130.
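A few of the new verbs are easiest to understand from small runs. The following is a sketch with made-up inline data in the default key=value format; the field names `a`, `b`, `c`, `k`, and `y` are hypothetical, and `i` is assumed to be seqgen's default counter field name.

```shell
# seqgen generates records without any input file; here i runs from 1 to 3.
mlr seqgen --start 1 --stop 3 then put '$y = $i ** 2'

# unsparsify fills in missing fields so every record has the same keys.
printf 'a=1,b=2\nc=3\n' | mlr unsparsify

# cat -n -g numbers records separately within each value of k.
printf 'k=x\nk=y\nk=x\n' | mlr cat -n -g k
```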
I/O options:
- The `print` statement may now be used with no arguments, which prints a newline, and a no-argument `printn` prints nothing but creates a zero-length file in redirected-output context.
- Pretty-print format now has a `--pprint --barred` option (for output only, not input). For an example, please see here.
- There are now keystroke-savers of the form `--c2p` which abbreviate `--icsvlite --opprint`, and so on.
- Miller's map literals are JSON-looking but allow integer keys which JSON doesn't. The `--jknquoteint` and `--jvquoteall` flags for `mlr` (when using JSON output) and `mlr put` (for `dump`) provide control over double-quoting behavior.
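To see the keystroke-saver and the barred pretty-print output side by side, a quick sketch with made-up CSV data on standard input:

```shell
# CSV in, pretty-printed columns out, using the --c2p shorthand.
printf 'a,b\n1,hello\n22,hi\n' | mlr --c2p cat

# The same data with the barred variant of pretty-print output.
printf 'a,b\n1,hello\n22,hi\n' | mlr --icsv --opprint --barred cat
```

The barred form draws borders around the cells, which is why it is output-only: Miller does not parse it back in.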
Documents new since the previous release:
- Miller in 10 minutes is a long-overdue addition: while Miller's detailed documentation is evident, there has been a lack of more succinct examples.
- The cookbook has likewise been expanded, and has been split out into three parts: part 1, part 2, part 3.
- A bit more background on C performance compared to other languages I experimented with, early on in the development of Miller, is here.
On-line help:
- Help for DSL built-in functions, DSL keywords, and verbs is accessible using `mlr -f`, `mlr -k`, and `mlr -l` respectively; name-only lists are available with `mlr -F`, `mlr -K`, and `mlr -L`.
Bugfixes:
- A corner-case bug causing a segmentation violation on two `sub`/`gsub` statements within a single `put`, the first one matching its pattern and the second one not matching its pattern, has been fixed.
Backward incompatibilities: This is Miller 5.0.0, not 4.6.0, due to the following (all relatively minor):
- The `v` variables bound in for-loops such as `for (k, v in some_multi_level_map) { ... }` can now be map-valued if the `v` specifies a non-terminal in the map.
- There are new keywords such as `var`, `int`, `float`, `num`, `str`, `bool`, `map`, `IPS`, `IFS`, `IRS`, `OPS`, `OFS`, `ORS` which can no longer be used as variable names. See `mlr -k` for the complete list.
- Unset of the last key in a map-valued variable's map level no longer removes the level: e.g. with `@v[1][2]=3` and `unset @v[1][2]` the `@v` variable would previously be empty. As of 5.0.0, `@v` has key 1 with an empty-map value.
- There is no longer type-inference on literals: `"3"+4` no longer gives 7. (That was never a good idea.)
- The `typeof` function used to say things like `MT_STRING`; now it says things like `string`.
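The `unset` and `typeof` changes can be checked directly. This is a sketch that assumes nothing beyond the behavior described above and reads no input.

```shell
mlr -n put 'end {
  @v[1][2] = 3;
  unset @v[1][2];
  # As of 5.0.0, @v should still have key 1, holding an empty map.
  dump;
  # typeof now reports short names such as string.
  print typeof("abc");
}'
```

Scripts that test for an empty `@v` after unsetting its leaves may need to unset the enclosing level explicitly.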
Homebrew request pending: Homebrew/homebrew-core#10426