Releases · johnkerl/miller

03 Jul 17:39

johnkerl

v4.3.0

60d1080

Interpolated percentiles, markdown-tabular output format, CSV-quote preservation

Major features:

Interpolated percentiles are now available using mlr stats1 -i or mlr merge-fields -i. Non-interpolated percentiles are the default. The former resemble R's type=7 quantiles and the latter resemble R's type=1 quantiles. See also http://johnkerl.org/miller/doc/reference.html#stats1 and http://johnkerl.org/miller/doc/reference.html#merge-fields.
Markdown-tabular output format is now available using --omd: please see http://johnkerl.org/miller/doc/file-formats.html#Markdown_tabular and #106.
For files using CSV input as well as CSV output, there is now a --quote-original option which outputs fields with quotes if they had them on input. The was-quoted flag isn't tracked on derived fields, e.g. if fields a and b were quoted on input, then in mlr put '$c = $a . $b' the c field won't be quoted on output. As such, this option is most useful with mlr cut, mlr filter, etc. The use-case from the original feature request #77 (comment) is trimming down a huge CSV file in order to facilitate subsequent in-memory processing using spreadsheet software.
The cookbook at http://johnkerl.org/miller/doc/cookbook.html has been extended significantly.
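
To make the difference between the two percentile flavors concrete, here is a minimal Python sketch of quantiles in the style of R's type=1 (non-interpolated) and type=7 (interpolated). This is not Miller's source code: the function names are hypothetical, and since the note above only says Miller's percentiles "resemble" these R types, edge-case behavior in Miller may differ.

```python
import math

def quantile_type1(values, p):
    """Non-interpolated quantile, R type=1 style: always returns an actual data value."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("no data")
    # 1-based rank ceil(n * p), clamped into [1, n].
    k = min(max(math.ceil(n * p), 1), n)
    return xs[k - 1]

def quantile_type7(values, p):
    """Interpolated quantile, R type=7 style: linear interpolation between neighbors."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("no data")
    h = (n - 1) * p            # fractional 0-based position
    lo = math.floor(h)
    hi = min(lo + 1, n - 1)    # stay in bounds when p == 1
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

data = [10, 20, 30, 40]               # hypothetical sample
median_plain = quantile_type1(data, 0.5)   # a value present in the data
median_interp = quantile_type7(data, 0.5)  # halfway between 20 and 30
```

With an even number of samples the non-interpolated median is one of the inputs, while the interpolated one falls between the two middle values.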

Minor features:

You can now set a MLR_CSV_DEFAULT_RS=lf environment variable if you're tired of always putting --rs lf arguments for your CSV files: http://johnkerl.org/miller/doc/file-formats.html#CSV/TSV/etc.
The printn and eprintn commands for mlr put are identical to print and eprint except they don't print final newlines.
It is now an error if boundvars in the same for-loop expression have duplicate names, e.g. for (a,a in $*) {...} results in the error message mlr: duplicate for-loop boundvars "a" and "a".
The strptime function would announce an internal coding error on malformed format strings; now, it correctly points out the user-level error.

Bug fixes:

Percentiles in merge-fields were not working. This has been fixed; the missing unit-test cases which would have caught this sooner have also been filled in.
Miller's CSV output-quoting was non-RFC-compliant: double-quotes within field names were not being duplicated. This has been fixed (#104).
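
For reference, the RFC 4180 rule in question is that a double-quote inside a quoted field is written as two double-quotes. The following Python sketch illustrates the rule using the standard csv module; the helper name rfc_quote is hypothetical and the code is not taken from Miller.

```python
import csv
import io

def rfc_quote(field):
    """Quote a field RFC-4180 style: wrap in quotes and double any embedded quotes."""
    return '"' + field.replace('"', '""') + '"'

# Python's csv writer applies the same doubling rule by default (doublequote=True).
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(['say "hi"', "plain"])
line = buf.getvalue()  # '"say ""hi""",plain\n'
```

A reader that follows the RFC can then turn each doubled quote back into a single one.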

Brew update: Homebrew/homebrew-core#2698


21 Jun 01:16

johnkerl

v4.2.0

4b1bc4b

Multi-emit

You can now emit multiple out-of-stream variables side-by-side.

Doc link: http://johnkerl.org/miller/doc/reference.html#Multi-emit_statements_for_put

Example:

$ mlr --from data/medium --opprint put -q '
  @x_count[$a][$b] += 1;
  @x_sum[$a][$b] += $x;
  end {
      for ((a, b), _ in @x_count) {
          @x_mean[a][b] = @x_sum[a][b] / @x_count[a][b]
      }
      emit (@x_sum, @x_count, @x_mean), "a", "b"
  }
'
a   b   x_sum      x_count x_mean
pan pan 219.185129 427     0.513314
pan wye 198.432931 395     0.502362
pan eks 216.075228 429     0.503672
pan hat 205.222776 417     0.492141
pan zee 205.097518 413     0.496604
eks pan 179.963030 371     0.485076
eks wye 196.945286 407     0.483895
eks zee 176.880365 357     0.495463
eks eks 215.916097 413     0.522799
eks hat 208.783171 417     0.500679
wye wye 185.295850 377     0.491501
wye pan 195.847900 392     0.499612
wye hat 212.033183 426     0.497730
wye zee 194.774048 385     0.505907
wye eks 204.812961 386     0.530604
zee pan 202.213804 389     0.519830
zee wye 233.991394 455     0.514267
zee eks 190.961778 391     0.488393
zee zee 206.640635 403     0.512756
zee hat 191.300006 409     0.467726
hat wye 208.883010 423     0.493813
hat zee 196.349450 385     0.509999
hat eks 189.006793 389     0.485879
hat hat 182.853532 381     0.479931
hat pan 168.553807 363     0.464336

Note that this example simply recapitulates the easier-to-type

mlr --from ../data/medium --opprint stats1 -a sum,count,mean -f x -g a,b
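
For readers who think more easily in a general-purpose language, the multi-emit example can be sketched in Python roughly as follows. This is an illustration of the idea only (several accumulators keyed the same way, emitted as side-by-side columns), not Miller's implementation; the function name and the sample records are hypothetical.

```python
from collections import defaultdict

def multi_emit_sketch(records):
    """Accumulate count and sum keyed by (a, b), then emit them side-by-side with the mean."""
    x_count = defaultdict(int)
    x_sum = defaultdict(float)
    for rec in records:
        key = (rec["a"], rec["b"])
        x_count[key] += 1
        x_sum[key] += rec["x"]
    rows = []
    for (a, b), count in x_count.items():
        total = x_sum[(a, b)]
        # One output row per (a, b) pair, with all three variables as columns.
        rows.append({"a": a, "b": b, "x_sum": total,
                     "x_count": count, "x_mean": total / count})
    return rows

records = [  # hypothetical stand-in for data/medium
    {"a": "pan", "b": "pan", "x": 0.25},
    {"a": "pan", "b": "pan", "x": 0.75},
    {"a": "eks", "b": "wye", "x": 0.5},
]
rows = multi_emit_sketch(records)
```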

Brew update: Homebrew/homebrew-core#2213


11 Jun 11:15

johnkerl

v4.1.0

ca2b33f

for/if/while and various features

While one of Miller’s strengths is its brevity, and so its domain-specific language is intentionally simple, the ability to loop over field names is a basic thing to want. Likewise for other control structures on the same complexity level as awk. Miller has always owed much inspiration to awk; 4.1.0 makes this more explicit by providing several common language idioms.

Major features:

For-loops over key-value pairs in stream records and out-of-stream variables
Loops using while and do while
break and continue in for, while, and do while loops
If-elif-else statements
Nestability of all the above, as well as of existing pattern-action blocks

Additional features:

Computable field names using square brackets, e.g. $[$a.$b] = $a * $b
Type-predicate functions: isnumeric, isint, isfloat, isbool, isstring
Commenting using pound signs
The new print and eprint allow formatting of arbitrary expressions to stdout/stderr, respectively
In addition to the existing dump which formats all out-of-stream variables to stdout as JSON, the new edump does the same to stderr
Semicolon is no longer required after closing curly brace
emit @* and unset @* are new synonyms for emit all and unset all
unset $* now exists
mlr -n is synonymous with mlr --from /dev/null, which is useful in dataless contexts wherein all your put statements are contained within begin/end blocks
Bugfix: in 4.0.0, mlr put -v '@a[1][2]=$b;$new=@a[1][2]' mydata.tbl would crash with a memory-management error.

Syntax example:

% mlr --from estimates.tbl put '
  for (k,v in $*) {
    if (isnumeric(v) && k =~ "^[t-z].*$") {
      $sum += v; $count += 1
    }
  }
  $mean = $sum / $count # no assignment if count unset
'
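
As a rough analogue of the syntax example, the following Python sketch does the same per-record work: loop over the key-value pairs, pick out numeric values whose field names start with t through z, and assign a mean only if something matched. The names is_numeric and add_mean are hypothetical, and Python's float() parsing is only an approximation of Miller's isnumeric.

```python
import re

def is_numeric(value):
    """Approximate stand-in for a numeric type predicate, via Python's float() parsing."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def add_mean(record):
    """Sum numeric fields whose names match ^[t-z], then add sum, count, and mean."""
    total = 0.0
    count = 0
    for key, value in list(record.items()):
        if is_numeric(value) and re.search(r"^[t-z].*$", key):
            total += float(value)
            count += 1
    out = dict(record)
    if count > 0:  # nothing is assigned when no field matched
        out["sum"] = total
        out["count"] = count
        out["mean"] = total / count
    return out

example = add_mean({"id": "abc", "t1": "1.5", "x": "2.5", "y": "hello", "a": "9"})
```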

Document links:

Brew update: Homebrew/homebrew-core#1895


09 May 03:36

johnkerl

v4.0.0

62ee9c8

Variables, begin/end blocks, pattern-action blocks

This major release dramatically expands the expressive power of Miller's put DSL. The TL;DR is that you can now write things like

mlr put '@x_sum += $x; end { emit @x_sum }'
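
Conceptually, an out-of-stream variable is state that persists across records, and the end block runs once after the last record. A minimal Python sketch of that one-liner, under the assumption that records pass through unchanged and the sum is emitted as a final record, might look like this (the function name is hypothetical):

```python
def sum_then_emit(records, field="x"):
    """Pass records through while accumulating a running sum, then emit it at end of stream."""
    x_sum = None  # stands in for @x_sum; unset until some record supplies the field
    for rec in records:
        if field in rec:
            x_sum = (x_sum or 0.0) + float(rec[field])
        yield rec  # the main block does not suppress the record
    # Analogue of the end block: runs once after all input is consumed.
    if x_sum is not None:
        yield {"x_sum": x_sum}

output = list(sum_then_emit([{"x": "1"}, {"x": "2.5"}, {"y": "7"}]))
```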

For full details please see the following reference sections:

as well as the following cookbook section:

http://johnkerl.org/miller/doc/cookbook.html#Using_out-of-stream_variables

Additional minor features in Miller 4.0.0:

Compound assignment operators such as +=, <<=, etc. are not new but were not previously announced in a release note.
Double-backslashing behavior for sub and gsub has been fixed: echo 'x=a\tb' | mlr put '$x=sub($x,"\\t","TAB")' now prints aTABb as desired. (The underlying issue was an unfortunate interaction between Miller's backslash-handling and the system regex library's backslash-handling.)
As an alternative to specifying input files as the last items on the Miller command line, you can now specify a single input file before other command-line switches and verbs using --from: for example, mlr --from myfile.dat put '$z = $x + $y' then stats1 -a sum -f z. The context is simple keystroke-reduction for interactively appending then-chains by up-arrowing at the command line: it's easier to iterate when you don't have to left-arrow past the input file name.


05 Apr 03:09

johnkerl

v3.5.0

8f82d7d

New data-rearrangers: nest, shuffle, repeat; misc. features

Major features in this release:

mlr nest is a companion to mlr reshape which was introduced in Miller 3.4.0: it allows unpacking key-value pairs which are nested within field values, and repacking them. Please see http://johnkerl.org/miller/doc/reference.html#nest.
mlr shuffle is a simple output-record permutor: http://johnkerl.org/miller/doc/reference.html#shuffle
mlr repeat can be used as a data-generator, to expand a few input records (or even a single one) into arbitrarily many. This is particularly useful in conjunction with pseudorandom-number generators. As well, it can be used to reconstruct individual samples from data which have been count-aggregated, so that statistics such as mode, percentiles, etc. may be computed on them. Please see http://johnkerl.org/miller/doc/reference.html#repeat.
mlr put and mlr filter now accept a -f {filename} option, so that the DSL expression may be placed within a file instead of being typed out on the command line when desired. Please see http://johnkerl.org/miller/doc/reference.html#put and http://johnkerl.org/miller/doc/reference.html#filter.
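
The count-aggregation use case for repeat can be illustrated with a short Python sketch: each input record is repeated as many times as its count field says, after which order statistics such as the median can be computed on the reconstructed samples. The function name, field names, and data are hypothetical, and this is a sketch of the idea rather than of Miller's implementation.

```python
import statistics

def expand_counts(records, count_field):
    """Repeat each record as many times as its count field indicates."""
    for rec in records:
        for _ in range(int(rec[count_field])):
            yield dict(rec)

aggregated = [  # hypothetical count-aggregated data
    {"value": "1", "count": "3"},
    {"value": "5", "count": "1"},
    {"value": "9", "count": "1"},
]
samples = [int(rec["value"]) for rec in expand_counts(aggregated, "count")]
median = statistics.median(samples)  # computed over individual samples, not over rows
```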

Minor features:

put/filter DSL string literals now may include \t, \", etc.: e.g. mlr put '$out = $left . "\t" . $right'
There is now a typeof function for the put/filter DSLs: mlr put '$xtype = typeof($x)'. This is occasionally useful for debugging type-conversion questions.
You may now do mlr --nr-progress-mod 1000000 ... to get something printed to stderr every 1000000th input record, and so on. For long-running aggregations on large input file(s), this can provide reassurance that processing is indeed proceeding apace. Example:

$ mlr --nr-progress-mod 100000 check data/big.dkvp
NR=100000 FNR=100000 FILENAME=data/big.dkvp
NR=200000 FNR=200000 FILENAME=data/big.dkvp
NR=300000 FNR=300000 FILENAME=data/big.dkvp
NR=400000 FNR=400000 FILENAME=data/big.dkvp
NR=500000 FNR=500000 FILENAME=data/big.dkvp
NR=600000 FNR=600000 FILENAME=data/big.dkvp
NR=700000 FNR=700000 FILENAME=data/big.dkvp
...

mlr cat -n had a bug wherein it counted zero-up while its documentation claimed it counted one-up. Now it counts one-up as documented.


14 Feb 15:57

johnkerl

v3.4.0

615fd88

JSON, reshape, regex captures, and more

Primary features:

JSON is now a supported format for input and output. Miller handles tabular data, and JSON supports arbitrarily deeply nested data structures, so if you want general JSON processing you should use jq. But if you have tabular data represented in JSON then Miller can now handle that for you. Please see the reference page and the FAQ.
Reshape is a standard data-processing idiom, now available in Miller: http://johnkerl.org/miller/doc/reference.html#reshape
Incidentally (not part of this release, but new since the last release) Miller is now available in FreeBSD's package manager: https://www.freshports.org/textproc/miller/. A full list of distributions containing Miller may be found here.
Miller is not yet available from within Fedora/CentOS, but as a step toward this goal, an SRPM is included in this release (see file-list below).
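
For anyone unfamiliar with the reshape idiom mentioned above, here is a small Python sketch of the two directions, wide-to-long and long-to-wide. The function names, parameter names, and sample record are hypothetical and do not reflect Miller's option names or internals.

```python
def wide_to_long(records, input_fields, key_name, value_name):
    """Split each record into one output record per listed field."""
    for rec in records:
        others = {k: v for k, v in rec.items() if k not in input_fields}
        for field in input_fields:
            if field in rec:
                yield {**others, key_name: field, value_name: rec[field]}

def long_to_wide(records, key_name, value_name):
    """Group records by their other fields and turn key/value pairs back into columns."""
    grouped = {}
    for rec in records:
        others = tuple((k, v) for k, v in rec.items()
                       if k not in (key_name, value_name))
        grouped.setdefault(others, dict(others))[rec[key_name]] = rec[value_name]
    return list(grouped.values())

wide = [{"id": "1", "x": "3", "y": "4"}]
long_form = list(wide_to_long(wide, ["x", "y"], "key", "value"))
round_trip = long_to_wide(long_form, "key", "value")
```

Note that long-to-wide has to see all input before producing output, whereas wide-to-long can work one record at a time.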

DSL enhancements for mlr put and mlr filter:

Regex captures \0 through \9: http://johnkerl.org/miller/doc/reference.html#Regex_captures
Ternary operator in expression right-hand sides: e.g. mlr put '$y = $x < 0.5 ? 0 : 1'
Boolean literals true and false
Final semicolon is now allowed: e.g. mlr put '$x=1;$y=2;'
Environment variables are now accessible, where environment-variable names may be string literals or arbitrary expressions: mlr put '$home = ENV["HOME"]' or mlr put '$value = ENV[$name]'.
While records are still string-to-string maps for input and output, and between then statements, types are preserved between multiple statements within a put. Example: mlr put '$y = string($x); $z = $y . $y' works as expected, without requiring mlr put '$y = string($x); $z = string($y) . string($y)' as before.

Bug fixes:

Mixed-format join, e.g. CSV file joined with DKVP file, was incorrectly computing default separators (IRS, IFS, IPS). This resulted in records not being joined together.
Segmentation violation on non-standard-input read of files with size an exact multiple of page size and not ending in IRS, e.g. newline. (This is less of a corner case than it sounds: for example, leave a long-running program running with output redirected to a file, then in a sleep-and-process loop, have Miller process that file. The former program's stdio library will likely be doing block-sized buffered I/O, where block sizes will often be multiples of system page size and the block will almost surely not end in a newline.)

Acknowledgements: Big thank-yous to @gregfr and @aaronwolen for feature requests including reshape and regex captures, and to @jungle-boogie for his work getting Miller into FreeBSD. Also, ongoing thanks to @0-wiz-0 for his past work on configure support, making it possible for Miller to be put to use in multiple operating systems.


11 Jan 03:28

johnkerl

v3.3.2

4db9524

Bootstrap sampling, EWMA, merge-fields, isnull/isnotnull functions

Bootstrap sampling in mlr bootstrap: http://johnkerl.org/miller/doc/reference.html#bootstrap. Compare to reservoir sampling in mlr sample: http://johnkerl.org/miller/doc/reference.html#sample.
Exponentially weighted moving averages in mlr step -a ewma: principally useful for smoothing noisy time series, e.g. finely sampled system-resource utilization. Please see http://johnkerl.org/miller/doc/reference.html#step.
"Horizontal" univariate statistics in mlr merge-fields, compared to mlr stats1 which is "vertical". Also allows collapsing multiple fields into one, such as in_bytes and out_bytes data fields summing to bytes_sum. This can also be done easily using mlr put. However, mlr merge-fields allows aggregation of more than just a pair of field names, and supports pattern-matching on field names. Please see http://johnkerl.org/miller/doc/reference.html#merge-fields for more information.
isnull and isnotnull functions for mlr filter and mlr put.
stats1, stats2, merge-fields, step, and top correctly handle not only missing fields (in the row-heterogeneous-data case) but also null-valued fields.
Minor memory-management improvements.
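
As a reminder of what an exponentially weighted moving average computes, here is a minimal Python sketch of the textbook recurrence, seeded with the first value. The seeding choice and the function name are assumptions for illustration; please consult the step reference linked above for Miller's exact behavior.

```python
def ewma(values, alpha):
    """Smooth a series: each output blends the new value with the previous output."""
    out = []
    prev = None
    for x in values:
        # The first sample seeds the average; later samples are blended in with weight alpha.
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out

smoothed = ewma([10, 20, 20, 0], 0.5)
```

Smaller alpha values weight history more heavily and so smooth more aggressively.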


29 Dec 20:40

johnkerl

v3.2.2

44901ae

Performance improvements, compressed I/O, and variable-name escaping

RFC-CSV read performance is dramatically improved and is now on par with other formats; read performance for all formats is slightly improved as well.
Variable names can now be escaped, using curly braces if there are special characters in the input-data field names. Example: mlr put '${bytes.total} = ${bytes.in} + ${bytes.out}'. See also #77 where this was requested.
Compressed I/O is now supported, using built-in compatibility with local system tools: http://johnkerl.org/miller/doc/reference.html#Compression. See also #77 where this was requested.
mlr uniq is now streaming (bounded memory use, functionality in tail -f contexts) when possible: i.e. when -n and -c are not specified.
Thorough valgrind-driven testing has been used to tighten memory usage. This is mostly an invisible internal improvement, although it yields a slight across-the-board performance improvement and allows Miller to handle even larger files in limited-memory contexts.


09 Dec 13:10

johnkerl

v3.1.2

b1004a2

Bugfix for stats1 max

mlr stats1 max was reporting the same value as mlr stats1 min, although p100 was unaffected. This error has been present since the 3.0.0 release. It was reported on #92.


07 Dec 03:10

johnkerl

v3.1.1

650ab00

Fix regression tests for i386

No functionality had been broken for i386: the changes are for the test framework only, to get validated builds on all available platforms.
