Skip to content

Autodetected line-endings, in-place mode, user-defined functions, and more

Compare
Choose a tag to compare
@johnkerl johnkerl released this 28 Feb 21:43

This major release significantly expands the expressiveness of the DSL for mlr put and mlr filter. (The upcoming 5.1.0 release will add the ability to aggregate across all columns for non-DSL verbs such as mlr stats1 and mlr stats2. As well, a Windows port is underway.)

Please also see the Miller main docs.

Simple but impactful features:

Major DSL features:

  • You can now define your own functions and subroutines: e.g. func f(x, y) { return x**2 + y**2 }.
  • New local variables are completely analogous to out-of-stream variables: sum retains its value for the duration of the expression it's defined in; @sum retains its value across all records in the record stream.
  • Local variables, function parameters, and function return types may be defined untyped or typed as in x = 1 or int x = 1, respectively. There are also expression-inline type-assertions available. Type-checking is up to you: omit it if you want flexibility with heterogeneous data; use it if you want to help catch misspellings in your DSL code or unexpected irregularities in your input data.
  • There are now four kinds of maps. Out-of-stream variables have always been scalars, maps, or multi-level maps: @a=1, @b[1]=2, @c[1][2]=3. The same is now true for local variables, which are new to 5.0.0. Stream records have always been single-level maps; $* is a map. And as of 5.0.0 there are now map literals, e.g. {"a":1, "b":2}, which can be defined using JSON-like syntax (with either string or integer keys) and which can be nested arbitrarily deeply.
  • You can loop over maps -- $*, out-of-stream variables, local variables, map-literals, and map-valued function return values -- using for (k, v in ...) or the new for (k in ...) (discussed next). All flavors of map may also be used in emit and dump statements.
  • User-defined functions and subroutines may take map-valued arguments, and may return map values.
  • Some built-in functions now accept map-valued input: typeof, length, depth, leafcount, haskey. There are built-in functions producing map-valued output: mapsum and mapdiff. There are now string-to-map and map-to-string functions: splitnv, splitkv, splitnvx, splitkvx, joink, joinv, and joinkv.

Minor DSL features:

  • For iterating over maps (namely, local variables, out-of-stream variables, stream records, map literals, or return values from map-valued functions) there is now a key-only for-loop syntax: e.g. for (k in $*) { ... }. This is in addition to the already-existing for (k, v in ...) syntax.
  • There are now triple-statement for-loops (familiar from many other languages), e.g. for (int i = 0; i < 10; i += 1) { ... }.
  • mlr put and mlr filter now accept multiple -f for script files, freely intermixable with -e for expressions. The suggested use case is putting user-defined functions in script files and one-liners calling them using -e. Example: myfuncs.mlr defines the function f(...), then mlr put -f myfuncs.mlr -e '$o = f($i)' myfile.dat. More information is here.
  • mlr filter is now almost identical to mlr put: it can have multiple statements, it can use begin and/or end blocks, it can define and invoke functions. Its final expression must evaluate to boolean which is used as the filter criterion. More details are here.
  • The min and max functions are now variadic: $o = max($a, $b, $c).
  • There is now a substr function.
  • While ENV has long provided read-access to environment variables on the right-hand side of assignments (as a getenv), it now can be at the left-hand side of assignments (as a putenv). This is useful for subsidiary processes created by tee, emit, dump, or print when writing to a pipe.
  • Handling for the # in comments is now handled in the lexer, so you can now (correctly) include # in strings.
  • Separators are now available as read-only variables in the DSL: IPS, IFS, IRS, OPS, OFS, ORS. These are particularly useful with the split and join functions: e.g. with mlr --ifs tab ..., the IFS variable within a DSL expression will evaluate to a string containing a tab character.
  • Syntax errors in DSL expressions now have a little more context.
  • DSL parsing and execution are a bit more transparent. There have long been -v and -t options to mlr put and mlr filter, which print the expression's abstract syntax tree and do a low-level parser trace, respectively. There are now additionally -a which traces stack-variable allocation and -T which traces statements line by line as they execute. While -v, -t, and -a are most useful for development of Miller, the -T option gives you more visibility into what your Miller scripts are doing. See also here.

Verbs:

  • most-frequent and least-frequent as requested in #110.
  • seqgen makes it easy to generate data from within Miller: please also see here for a usage example.
  • unsparsify makes it easy to rectangularize data where not all records have the same fields.
  • cat -n now takes a group-by (-g) option, making it easy to number records within categories.
  • count-distinct,
    uniq,
    most-frequent,
    least-frequent,
    top, and
    histogram
    now take a -o option for specifying their output field names, as requested in #122.
  • Median is now a synonym for p50 in stats1.
  • You can now start a then chain with an initial then, which is nice in backslashy/multiline-continuation contexts.
    This was requested in #130.

I/O options:

  • The print statement may now be used with no arguments, which prints a newline, and a no-argument printn prints nothing but creates a zero-length file in redirected-output context.
  • Pretty-print format now has a --pprint --barred option (for output only, not input). For an example, please see here.
  • There are now keystroke-savers of the form --c2p which abbreviate --icsvlite --opprint, and so on.
  • Miller's map literals are JSON-looking but allow integer keys which JSON doesn't. The
    --jknquoteint and --jvquoteall flags for mlr (when using JSON output) and mlr put (for dump) provide control over double-quoting behavior.

Documents new since the previous release:

  • Miller in 10 minutes is a long-overdue addition: while Miller's detailed documentation is evident, there has been a lack of more succinct examples.
  • The cookbook has likewise been expanded, and has been split out
    into three parts: part 1, part
    2
    , part 3.
  • A bit more background on C performance compared to other languages I experimented with, early on in the development of Miller, is here.

On-line help:

  • Help for DSL built-in functions, DSL keywords, and verbs is accessible using mlr -f, mlr -k, and mlr -l respectively; name-only lists are available with mlr -F, mlr -K, and mlr -L.

Bugfixes:

  • A corner-case bug causing a segmentation violation on two sub/gsub statements within a single put, the first one matching its pattern and the second one not matching its pattern, has been fixed.

Backward incompatibilities: This is Miller 5.0.0, not 4.6.0, due to the following (all relatively minor):

  • The v variables bound in for-loops such as for (k, v in some_multi_level_map) { ... } can now be map-valued if the v specifies a non-terminal in the map.
  • There are new keywords such as var, int, float, num, str, bool, map, IPS, IFS, IRS, OPS, OFS, ORS which can no longer be used as variable names. See mlr -k for the complete list.
  • Unset of the last key in an map-valued variable's map level no longer removes the level: e.g. with @v[1][2]=3 and unset @v[1][2] the @v variable would be empty. As of 5.0.0, @v has key 1 with an empty-map value.
  • There is no longer type-inference on literals: "3"+4 no longer gives 7. (That was never a good idea.)
  • The typeof function used to say things like MT_STRING; now it says things like string.

Homebrew request pending: Homebrew/homebrew-core#10426