Releases: johnkerl/miller

Interpolated percentiles, markdown-tabular output format, CSV-quote preservation

03 Jul 17:39

Major features:

Minor features:

  • You can now set a MLR_CSV_DEFAULT_RS=lf environment variable if you're tired of always putting --rs lf arguments for your CSV files: http://johnkerl.org/miller/doc/file-formats.html#CSV/TSV/etc.
  • The printn and eprintn commands for mlr put are identical to print and eprint except they don't print final newlines.
  • It is now an error if boundvars in the same for-loop expression have duplicate names, e.g. for (a,a in $*) {...} results in the error message mlr: duplicate for-loop boundvars "a" and "a".
  • The strptime function previously reported an internal coding error on malformed format strings; it now correctly reports the user-level error.

Bug fixes:

  • Percentiles in merge-fields were not working. This has been fixed, and the missing unit-test cases that would have caught it sooner have been added.
  • Miller's CSV output-quoting was non-RFC-compliant: double-quotes within field names were not being duplicated. This has been fixed (#104).
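
The RFC 4180 rule behind the second fix is that a field containing a double quote is wrapped in double quotes, with each embedded quote doubled. As an illustration only, here is a minimal Python sketch using the standard csv module. It is not Miller's implementation, and the helper name to_csv_line and the sample field names are made up for the example.

```python
import csv
import io


def to_csv_line(fields):
    """Format one row as CSV, quoting only the fields that need it."""
    buf = io.StringIO()
    # doublequote=True (the default) writes an embedded " as "".
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(fields)
    return buf.getvalue()


header = to_csv_line(['size "large"', "plain"])
print(header, end="")  # "size ""large""",plain

# Reading the line back recovers the original field names.
assert next(csv.reader(io.StringIO(header))) == ['size "large"', "plain"]
```

A reader that follows the same rule can round-trip such headers, which is why the doubling matters for interoperability.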

Brew update: Homebrew/homebrew-core#2698

Multi-emit

21 Jun 01:16

You can now emit multiple out-of-stream variables side-by-side.

Doc link: http://johnkerl.org/miller/doc/reference.html#Multi-emit_statements_for_put

Example:

$ mlr --from data/medium --opprint put -q '
  @x_count[$a][$b] += 1;
  @x_sum[$a][$b] += $x;
  end {
      for ((a, b), _ in @x_count) {
          @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b]
      }
      emit (@x_sum, @x_count, @x_mean), "a", "b"
  }
'
a   b   x_sum      x_count x_mean
pan pan 219.185129 427     0.513314
pan wye 198.432931 395     0.502362
pan eks 216.075228 429     0.503672
pan hat 205.222776 417     0.492141
pan zee 205.097518 413     0.496604
eks pan 179.963030 371     0.485076
eks wye 196.945286 407     0.483895
eks zee 176.880365 357     0.495463
eks eks 215.916097 413     0.522799
eks hat 208.783171 417     0.500679
wye wye 185.295850 377     0.491501
wye pan 195.847900 392     0.499612
wye hat 212.033183 426     0.497730
wye zee 194.774048 385     0.505907
wye eks 204.812961 386     0.530604
zee pan 202.213804 389     0.519830
zee wye 233.991394 455     0.514267
zee eks 190.961778 391     0.488393
zee zee 206.640635 403     0.512756
zee hat 191.300006 409     0.467726
hat wye 208.883010 423     0.493813
hat zee 196.349450 385     0.509999
hat eks 189.006793 389     0.485879
hat hat 182.853532 381     0.479931
hat pan 168.553807 363     0.464336

Note that this example simply recapitulates the easier-to-type

mlr --from data/medium --opprint stats1 -a sum,count,mean -f x -g a,b
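
For readers less familiar with out-of-stream variables, the aggregation in this example amounts to keeping running sums and counts keyed by the (a, b) pair, then deriving the mean at the end. The following Python sketch shows that logic under simplifying assumptions (records are dicts with numeric x). The function name and the sample data are hypothetical and it is not how Miller is implemented.

```python
from collections import defaultdict


def sum_count_mean(records):
    """Group records by (a, b) and return sum, count, and mean of x per group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        key = (rec["a"], rec["b"])
        sums[key] += rec["x"]
        counts[key] += 1
    # One output row per group, in order of first appearance.
    return [
        {"a": a, "b": b, "x_sum": sums[a, b], "x_count": counts[a, b],
         "x_mean": sums[a, b] / counts[a, b]}
        for (a, b) in counts
    ]


rows = sum_count_mean([
    {"a": "pan", "b": "pan", "x": 0.5},
    {"a": "eks", "b": "pan", "x": 0.25},
    {"a": "pan", "b": "pan", "x": 1.0},
])
for row in rows:
    print(row)
```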

Brew update: Homebrew/homebrew-core#2213

for/if/while and various features

11 Jun 11:15

While one of Miller’s strengths is its brevity, and so its domain-specific language is intentionally simple, the ability to loop over field names is a basic thing to want. Likewise for other control structures on the same complexity level as awk. Miller has always owed much inspiration to awk; 4.1.0 makes this more explicit by providing several common language idioms.

Major features:

  • For-loops over key-value pairs in stream records and out-of-stream variables
  • Loops using while and do while
  • break and continue in for, while, and do while loops
  • If-elif-else statements
  • Nestability of all the above, as well as of existing pattern-action blocks

Additional features:

  • Computable field names using square brackets, e.g. $[$a.$b] = $a * $b
  • Type-predicate functions: isnumeric, isint, isfloat, isbool, isstring
  • Commenting using pound signs
  • The new print and eprint allow formatting of arbitrary expressions to stdout/stderr, respectively
  • In addition to the existing dump which formats all out-of-stream variables to stdout as JSON, the new edump does the same to stderr
  • Semicolon is no longer required after closing curly brace
  • emit @* and unset @* are new synonyms for emit all and unset all
  • unset $* now exists
  • mlr -n is synonymous with mlr --from /dev/null, which is useful in dataless contexts wherein all your put statements are contained within begin/end blocks
  • Bugfix: in 4.0.0, mlr put -v '@a[1][2]=$b;$new=@a[1][2]' mydata.tbl would crash with a memory-management error.

Syntax example:

% mlr --from estimates.tbl put '
  for (k,v in $*) {
    if (isnumeric(v) && k =~ "^[t-z].*$") {
      $sum += v; $count += 1
    }
  }
  $mean = $sum / $count # no assignment if count unset
'
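
To make the control flow of this example concrete, here is a rough Python equivalent. It is a sketch under stated assumptions: records are plain dicts of strings, the is_numeric helper is a stand-in that accepts anything Python's float() accepts (which may differ from Miller's isnumeric), and the function names are hypothetical.

```python
import re


def is_numeric(value):
    """Rough stand-in for a numeric check: anything float() accepts."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def add_mean(record):
    """Sum numeric fields whose names start with t..z; add sum, count, mean."""
    total, count = 0.0, 0
    # Iterate over a snapshot so new fields can be added to the record safely.
    for key, value in list(record.items()):
        if is_numeric(value) and re.search(r"^[t-z].*$", key):
            total += float(value)
            count += 1
    if count > 0:  # leave the record untouched when nothing matched
        record["sum"] = total
        record["count"] = count
        record["mean"] = total / count
    return record


print(add_mean({"a": "5", "t": "1", "x": "3", "y": "abc"}))
```

The count guard mirrors the comment in the Miller example: when no field qualifies, no mean is assigned.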

Document links:

Brew update: Homebrew/homebrew-core#1895

Variables, begin/end blocks, pattern-action blocks

09 May 03:36

This major release dramatically expands the expressive power of Miller's put DSL. The TL;DR is that you can now write things like

mlr put '@x_sum += $x; end { emit @x_sum }'

For full details please see the following reference sections:

as well as the following cookbook section:

Additional minor features in Miller 4.0.0:

  • Compound assignment operators such as +=, <<=, etc. are not new but were not previously announced in a release note.
  • Double-backslashing behavior for sub and gsub has been fixed: echo 'x=a\tb' | mlr put '$x=sub($x,"\\t","TAB")' now prints aTABb as desired. (The underlying issue was an unfortunate interaction between Miller's backslash-handling and the system regex library's backslash-handling.)
  • As an alternative to specifying input files as the last items on the Miller command line, you can now specify a single input file before other command-line switches and verbs using --from: for example, mlr --from myfile.dat put '$z = $x + $y' then stats1 -a sum -f z. The context is simple keystroke-reduction for interactively appending then-chains by up-arrowing at the command line: it's easier to iterate when you don't have to left-arrow past the input file name.

New data-rearrangers: nest, shuffle, repeat; misc. features

05 Apr 03:09

Major features in this release:

Minor features:

  • put/filter DSL string literals now may include \t, \", etc.: e.g. mlr put '$out = $left . "\t" . $right'
  • There is now a typeof function for the put/filter DSLs: mlr put '$xtype = typeof($x)'. This is occasionally useful for debugging type-conversion questions.
  • You may now do mlr --nr-progress-mod 1000000 ... to get something printed to stderr every 1000000th input record, and so on. For long-running aggregations on large input file(s), this can provide reassurance that processing is indeed proceeding apace. Example:
$ mlr --nr-progress-mod 100000 check data/big.dkvp
NR=100000 FNR=100000 FILENAME=data/big.dkvp
NR=200000 FNR=200000 FILENAME=data/big.dkvp
NR=300000 FNR=300000 FILENAME=data/big.dkvp
NR=400000 FNR=400000 FILENAME=data/big.dkvp
NR=500000 FNR=500000 FILENAME=data/big.dkvp
NR=600000 FNR=600000 FILENAME=data/big.dkvp
NR=700000 FNR=700000 FILENAME=data/big.dkvp
...
  • mlr cat -n had a bug wherein it counted zero-up while its documentation claimed it counted one-up. Now it counts one-up as documented.
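
The progress-reporting idea from the --nr-progress-mod item above, combined with one-up record counting, can be sketched in a few lines of Python. This is only an illustration: the function name with_progress and its parameters are hypothetical, and the message is a simplified version of what Miller prints.

```python
import sys


def with_progress(records, every, out=sys.stderr):
    """Yield records unchanged, reporting the record number every `every` records."""
    # Record numbers are one-up: the first record is NR=1.
    for nr, rec in enumerate(records, start=1):
        if nr % every == 0:
            print(f"NR={nr}", file=out)
        yield rec


processed = sum(1 for _ in with_progress(range(10), every=4))
print(processed)  # 10, with NR=4 and NR=8 reported on stderr
```

Writing the progress lines to stderr keeps them out of the data stream on stdout.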

JSON, reshape, regex captures, and more

14 Feb 15:57

Primary features:

  • JSON is now a supported format for input and output. Miller handles tabular data, and JSON supports arbitrarily deeply nested data structures, so if you want general JSON processing you should use jq. But if you have tabular data represented in JSON then Miller can now handle that for you. Please see the reference page and the FAQ.
  • Reshape is a standard data-processing idiom, now available in Miller: http://johnkerl.org/miller/doc/reference.html#reshape
  • Incidentally (not part of this release, but new since the last release) Miller is now available in FreeBSD's package manager: https://www.freshports.org/textproc/miller/. A full list of distributions containing Miller may be found here.
  • Miller is not yet available from within Fedora/CentOS, but as a step toward this goal, an SRPM is included in this release (see file-list below).
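
Reshaping, mentioned above, converts between a "wide" layout (one column per measurement) and a "long" layout (one row per measurement). As a rough illustration of the wide-to-long direction, here is a Python sketch. It is not Miller's implementation, and the function name, parameter names, and sample fields are invented for the example.

```python
def wide_to_long(records, value_fields, key_name, value_name):
    """Split each wide record into one long record per value field."""
    for rec in records:
        # Fields not being reshaped are carried along on every output row.
        others = {k: v for k, v in rec.items() if k not in value_fields}
        for field in value_fields:
            if field in rec:
                yield {**others, key_name: field, value_name: rec[field]}


wide = [{"id": 1, "x": 3, "y": 4}, {"id": 2, "x": 5, "y": 6}]
long_rows = list(wide_to_long(wide, ["x", "y"], "key", "value"))
for row in long_rows:
    print(row)
```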

DSL enhancements for mlr put and mlr filter:

  • Regex captures \0 through \9: http://johnkerl.org/miller/doc/reference.html#Regex_captures
  • Ternary operator in expression right-hand sides: e.g. mlr put '$y = $x < 0.5 ? 0 : 1'
  • Boolean literals true and false
  • Final semicolon is now allowed: e.g. mlr put '$x=1;$y=2;'
  • Environment variables are now accessible, where environment-variable names may be string literals or arbitrary expressions: mlr put '$home = ENV["HOME"]' or mlr put '$value = ENV[$name]'.
  • While records are still string-to-string maps for input and output, and between then statements, types are preserved between multiple statements within a put. Example: mlr put '$y = string($x); $z = $y . $y' works as expected, without requiring mlr put '$y = string($x); $z = string($y) . string($y)' as before.

Bug fixes:

  • Mixed-format join, e.g. CSV file joined with DKVP file, was incorrectly computing default separators (IRS, IFS, IPS). This resulted in records not being joined together.
  • Segmentation violation on non-standard-input read of files with size an exact multiple of page size and not ending in IRS, e.g. newline. (This is less of a corner case than it sounds: for example, leave a long-running program running with output redirected to a file, then in a sleep-and-process loop, have Miller process that file. The former program's stdio library will likely be doing block-sized buffered I/O, where block sizes will often be multiples of system page size and the block will almost surely not end in a newline.)

Acknowledgements: Big thank-yous to @gregfr and @aaronwolen for feature requests including reshape and regex captures, and to @jungle-boogie for his work getting Miller into FreeBSD. Also, ongoing thanks to @0-wiz-0 for his past work on configure support, making it possible for Miller to be put to use in multiple operating systems.

Bootstrap sampling, EWMA, merge-fields, isnull/isnotnull functions

11 Jan 03:28
  • Bootstrap sampling in mlr bootstrap: http://johnkerl.org/miller/doc/reference.html#bootstrap. Compare to reservoir sampling in mlr sample: http://johnkerl.org/miller/doc/reference.html#sample.
  • Exponentially weighted moving averages in mlr step -a ewma: principally useful for smoothing noisy time series, e.g. finely sampled system-resource utilization. Please see http://johnkerl.org/miller/doc/reference.html#step.
  • "Horizontal" univariate statistics in mlr merge-fields, compared to mlr stats1 which is "vertical". Also allows collapsing multiple fields into one, such as in_bytes and out_bytes data fields summing to bytes_sum. This can also be done easily using mlr put. However, mlr merge-fields allows aggregation of more than just a pair of field names, and supports pattern-matching on field names. Please see http://johnkerl.org/miller/doc/reference.html#merge-fields for more information.
  • isnull and isnotnull functions for mlr filter and mlr put.
  • stats1, stats2, merge-fields, step, and top correctly handle not only missing fields (in the row-heterogeneous-data case) but also null-valued fields.
  • Minor memory-management improvements.
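
The first two items above name three standard techniques: bootstrap sampling (draw as many records as the input has, with replacement), reservoir sampling (keep a fixed-size uniform sample in one pass), and exponentially weighted moving averages. The Python sketch below shows textbook versions of each. It is an illustration under assumptions, not Miller's code: the function names are hypothetical, and the EWMA is seeded with the first value, which is one common convention.

```python
import random


def bootstrap_sample(items, rng):
    """Draw len(items) items with replacement; repeats are expected."""
    return [rng.choice(items) for _ in items]


def reservoir_sample(stream, k, rng):
    """Keep k items, chosen uniformly without replacement, in one pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def ewma(values, alpha):
    """Smooth values; each output is alpha*current + (1-alpha)*previous output."""
    smoothed = []
    for value in values:
        prev = smoothed[-1] if smoothed else value  # seed with the first value
        smoothed.append(alpha * value + (1 - alpha) * prev)
    return smoothed


rng = random.Random(0)
data = list(range(10))
print(bootstrap_sample(data, rng))
print(reservoir_sample(data, 3, rng))
print(ewma([0.0, 4.0, 4.0], alpha=0.5))  # [0.0, 2.0, 3.0]
```

A smaller alpha gives heavier smoothing, since each output leans more on the previous smoothed value than on the current sample.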

Performance improvements, compressed I/O, and variable-name escaping

29 Dec 20:40
  • RFC-CSV read performance is dramatically improved and is now on par with other formats; read performance for all formats is slightly improved as well.
  • Variable names can now be escaped, using curly braces if there are special characters in the input-data field names. Example: mlr put '${bytes.total} = ${bytes.in} + ${bytes.out}'. See also #77 where this was requested.
  • Compressed I/O is now supported, using built-in compatibility with local system tools: http://johnkerl.org/miller/doc/reference.html#Compression. See also #77 where this was requested.
  • mlr uniq is now streaming (bounded memory use, and usable in tail -f contexts) when possible: i.e. when -n and -c are not specified.
  • Thorough valgrind-driven testing has been used to tighten memory usage. This is mostly an invisible internal improvement, although it also yields a slight across-the-board performance improvement and allows Miller to handle even larger files in limited-memory contexts.
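
To show what "streaming" means for a uniq-style operation as described above, here is a minimal Python sketch: each distinct combination of group-by values is emitted the first time it is seen, so output appears without waiting for end of input. The function name and record layout are assumptions for illustration, and Miller's actual behavior and output fields may differ.

```python
def streaming_uniq(records, group_by):
    """Yield each distinct combination of group_by values the first time it is seen."""
    seen = set()
    for rec in records:
        if not all(field in rec for field in group_by):
            continue  # skip records lacking any group-by field
        key = tuple(rec[field] for field in group_by)
        if key not in seen:
            seen.add(key)
            # Emit immediately, without waiting for end of input.
            yield {field: rec[field] for field in group_by}


records = [
    {"a": "pan", "b": "wye", "x": 1},
    {"a": "eks", "b": "wye", "x": 2},
    {"a": "pan", "b": "wye", "x": 3},
    {"x": 4},
]
for row in streaming_uniq(records, ["a", "b"]):
    print(row)
```

In this sketch memory grows with the number of distinct keys rather than the number of records. Counting options cannot stream the same way, because counts are only final once all input has been read.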

Bugfix for stats1 max

09 Dec 13:10

mlr stats1 max was reporting the same value as mlr stats1 min, although p100 was unaffected. This error has been present since the 3.0.0 release. It was reported on #92.

Fix regression tests for i386

07 Dec 03:10

No functionality had been broken for i386: the changes are for the test framework only, to get validated builds on all available platforms.