Create file chunks #889

masgo · 2022-01-20T12:25:40Z

I am dealing with some large files and would like to split them into chunks for further processing. The straightforward solution would be to use mlr head and tail in a loop. While this works well for normal files, my files are slightly special: the lines are not independent of each other, and some lines need to "stay together" in order to produce meaningful output in the next processing steps.
Imagine the files represent n:m relations, e.g.

n;m
1;a
1;b
1;c
1;d
2;a
2;c
2;d
3;b
3;c
3;d
...

Say I wanted to split this into files with 4 records each: this would split up the 3-entries.
I want to split the file but keep all the 1-, 2- and 3-entries together, i.e.

n;m
1;a
1;b
1;c
1;d
--- split here
2;a
2;c
2;d
3;b
3;c
3;d
...

Obviously the chunks would no longer be of exactly the same size, but that's fine. I just want them to be roughly that size.
Any ideas how to do this with Miller?

I also know that there are usually not that many records with an identical n: usually between 1 and 5 per n, while the file itself has a few million entries. And I want to split it into chunks of ~10,000 entries.

Answered by johnkerl · Jan 24, 2022

johnkerl (Maintainer) · 2022-01-20T14:44:17Z

Hi @masgo, I ran into this quite a lot a few years ago and created the tee DSL statement for it: https://miller.readthedocs.io/en/latest/reference-dsl-output-statements/#tee-statements

Here that might be:

$ cat input.csv
n;m;text
1;a;the
1;b;quick
1;c;brown
1;d;fox
2;a;jumped
2;c;over
2;d;the
3;b;lazy
3;c;dogs
3;d;!

$ mlr --csv --fs semicolon put -q 'tee > $n."-".$m.".csv", $*' input.csv

$ ll *.csv
-rw-r--r--  1 kerl  staff   17 Jan 20 09:41 1-a.csv
-rw-r--r--  1 kerl  staff   19 Jan 20 09:41 1-b.csv
-rw-r--r--  1 kerl  staff   19 Jan 20 09:41 1-c.csv
-rw-r--r--  1 kerl  staff   17 Jan 20 09:41 1-d.csv
-rw-r--r--  1 kerl  staff   20 Jan 20 09:41 2-a.csv
-rw-r--r--  1 kerl  staff   18 Jan 20 09:41 2-c.csv
-rw-r--r--  1 kerl  staff   17 Jan 20 09:41 2-d.csv
-rw-r--r--  1 kerl  staff   18 Jan 20 09:41 3-b.csv
-rw-r--r--  1 kerl  staff   18 Jan 20 09:41 3-c.csv
-rw-r--r--  1 kerl  staff   15 Jan 20 09:41 3-d.csv
-rw-r--r--  1 kerl  staff   97 Jan 20 09:40 input.csv

$ cat 1-a.csv
n;m;text
1;a;the

$ cat 3-b.csv
n;m;text
3;b;lazy

or maybe

$ mlr --csv --fs semicolon put -q 'tee > $n.".csv", $*' input.csv

$ ll *.csv
-rw-r--r--  1 kerl  staff   45 Jan 20 09:47 1.csv
-rw-r--r--  1 kerl  staff   37 Jan 20 09:47 2.csv
-rw-r--r--  1 kerl  staff   33 Jan 20 09:47 3.csv
-rw-r--r--  1 kerl  staff   97 Jan 20 09:40 input.csv

$ cat 1.csv
n;m;text
1;a;the
1;b;quick
1;c;brown
1;d;fox

$ cat 3.csv
n;m;text
3;b;lazy
3;c;dogs
3;d;!

I think this is what you want; please let me know if not.

Also, this (currently) works by keeping all output files open simultaneously, so if there are thousands of different n;m pairs, mlr might crash with a too-many-open-files error. That could be coded around: if too many files are open, mlr could close the least-recently-used one and re-open it for append later in the processing stream, all transparent to the user.
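The least-recently-used idea described above can be sketched in a few lines. This is a minimal Python illustration, not Miller's actual implementation; the class name `LruAppendFiles` and the `max_open` parameter are hypothetical and exist only for this example.

```python
from collections import OrderedDict


class LruAppendFiles:
    """Write to many files while keeping at most max_open handles open.

    Hypothetical helper for illustration only; not part of Miller.
    """

    def __init__(self, max_open):
        self.max_open = max_open
        self._open = OrderedDict()  # path -> handle, least recently used first
        self._seen = set()          # paths that have been created already

    def write(self, path, text):
        handle = self._open.pop(path, None)
        if handle is None:
            if len(self._open) >= self.max_open:
                # Evict the least-recently-used handle to stay under the limit.
                _, oldest = self._open.popitem(last=False)
                oldest.close()
            # Truncate on first use, append when re-opening an evicted file.
            mode = "a" if path in self._seen else "w"
            handle = open(path, mode)
            self._seen.add(path)
        self._open[path] = handle  # re-insert as most recently used
        handle.write(text)

    def close(self):
        for handle in self._open.values():
            handle.close()
        self._open.clear()
```

The trade-off is extra open/close calls when the set of active output files is larger than `max_open`; with sorted input, each file tends to be written in one run, so evictions stay cheap.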

johnkerl (Maintainer) · 2022-01-23T04:13:39Z

@masgo reading more carefully, it seems there are only a few records for each value of n, so maybe this:

mlr --csv --from ~/tmp/big.csv cat -n -g color then put -q '
  shard = $n // 10000;
  file=$color."-".shard.".csv";
  tee > file, $*
'

where my ~/tmp/big.csv is like

color,shape,flag,k,index,quantity,rate
purple,square,false,10,10,53.6353,3.6051
red,square,false,4,17,50.6102,4.7702
red,square,false,4,25,51.2934,6.6613
red,circle,true,3,35,64.6480,3.8278
red,square,false,6,45,65.1873,7.5467
purple,triangle,false,7,55,60.4445,7.0020
purple,triangle,false,7,59,59.1077,6.4434
yellow,circle,true,9,68,97.4032,2.2115
yellow,triangle,true,1,72,61.1054,4.0279
red,square,false,4,77,68.5386,1.6783
...

for a total of a million lines. This produces:

$ ls -l *.csv | wc -l
     103

$ ls *.csv
purple-0.csv  purple-22.csv purple-9.csv  red-21.csv    red-35.csv    yellow-11.csv yellow-25.csv
purple-1.csv  purple-23.csv red-0.csv     red-22.csv    red-36.csv    yellow-12.csv yellow-26.csv
purple-10.csv purple-24.csv red-1.csv     red-23.csv    red-37.csv    yellow-13.csv yellow-27.csv
purple-11.csv purple-25.csv red-10.csv    red-24.csv    red-38.csv    yellow-14.csv yellow-28.csv
purple-12.csv purple-26.csv red-11.csv    red-25.csv    red-39.csv    yellow-15.csv yellow-29.csv
purple-13.csv purple-27.csv red-12.csv    red-26.csv    red-4.csv     yellow-16.csv yellow-3.csv
purple-14.csv purple-28.csv red-13.csv    red-27.csv    red-40.csv    yellow-17.csv yellow-30.csv
purple-15.csv purple-29.csv red-14.csv    red-28.csv    red-5.csv     yellow-18.csv yellow-4.csv
purple-16.csv purple-3.csv  red-15.csv    red-29.csv    red-6.csv     yellow-19.csv yellow-5.csv
purple-17.csv purple-30.csv red-16.csv    red-3.csv     red-7.csv     yellow-2.csv  yellow-6.csv
purple-18.csv purple-4.csv  red-17.csv    red-30.csv    red-8.csv     yellow-20.csv yellow-7.csv
purple-19.csv purple-5.csv  red-18.csv    red-31.csv    red-9.csv     yellow-21.csv yellow-8.csv
purple-2.csv  purple-6.csv  red-19.csv    red-32.csv    yellow-0.csv  yellow-22.csv yellow-9.csv
purple-20.csv purple-7.csv  red-2.csv     red-33.csv    yellow-1.csv  yellow-23.csv
purple-21.csv purple-8.csv  red-20.csv    red-34.csv    yellow-10.csv yellow-24.csv

$ head red-5.csv
n,color,shape,flag,k,index,quantity,rate
50000,red,square,true,2,750987,82.4695,3.4016
50001,red,circle,true,3,751000,79.7563,4.9204
50002,red,square,false,4,751030,70.2386,5.2236
50003,red,square,false,4,751038,85.2822,3.8586
50004,red,square,true,2,751049,65.2863,7.8226
50005,red,square,false,6,751077,99.3722,1.0465
50006,red,square,true,2,751094,54.9408,8.3964
50007,red,circle,true,3,751145,78.8417,8.9005
50008,red,square,true,2,751158,96.2544,6.3039
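For readers who want to see the arithmetic behind the file names, here is a rough Python sketch of the same idea: a 1-up counter per group (what `cat -n -g` adds) integer-divided by the shard size. The function name `shard_names` and its parameters are made up for this illustration.

```python
from collections import defaultdict


def shard_names(records, group_field, shard_size):
    """Pair each record with the output file name it would be written to."""
    counters = defaultdict(int)
    for rec in records:
        group = rec[group_field]
        counters[group] += 1                      # per-group counter, starting at 1
        shard = counters[group] // shard_size     # same as $n // 10000 in the DSL
        yield f"{group}-{shard}.csv", rec


rows = [{"color": "red"}] * 5 + [{"color": "purple"}] * 2
names = [name for name, _ in shard_names(rows, "color", shard_size=2)]
print(names)
```

Because the counter starts at 1, shard 0 of each group holds one record fewer than the others, which matches red-5.csv above starting at n=50000. Note that this shards by position within a group; it does not by itself keep arbitrary key runs together across shard boundaries.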

masgo (Author) · 2022-01-23T11:31:22Z

Thank you for your help. Maybe I did not state my problem clearly (English is not my first language). But anyway, you pointed me in the right direction and I found a solution. What I ended up with is this:

mlr --tsvlite --from bigFile.tsv \
  sort -f ID \
  then cat -N "tmpCounter" \
  then cat -N "tmpItemCounter" -g ID \
  then put -q '
  begin {
    @shard = 0;
    @newShard = false;
    @shardSize = 10000;
  };
  # Request a new shard whenever the overall record count reaches a multiple of the shard size.
  @newShard = (@newShard || (($tmpCounter % @shardSize) == 0));
  # Only switch shards on the first record of an ID, so records sharing an ID stay together.
  if (@newShard && ($tmpItemCounter == 1)) {
    @shard += 1;
    @newShard = false;
  }
  file = "shard-".@shard.".csv";
  tee > file, $*;
'

First, it sorts the items by ID, then adds two counters: tmpCounter, which runs over all records, and tmpItemCounter, which restarts for each ID.
Then, it waits for tmpCounter % shardSize to be zero and sets newShard to true.
Once newShard is set, it waits until all items with the current ID have been processed (i.e. for a new ID, tmpItemCounter starts at 1 again) and only then increments the shard variable, which starts a new file.
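The same "only cut at an ID boundary" rule can be written compactly outside Miller. The following is a minimal Python sketch under the assumption that the records are already sorted (or at least grouped) by key; `chunk_keeping_groups` and `target_size` are hypothetical names. It differs slightly from the Miller command above in that it measures the size of the current chunk rather than the overall record count.

```python
from itertools import groupby


def chunk_keeping_groups(records, key, target_size):
    """Yield lists of about target_size records without splitting a key group.

    Assumes records with the same key are adjacent (e.g. the input is sorted).
    """
    chunk = []
    for _, group in groupby(records, key=key):
        # Cut only between groups, and only once the chunk is big enough.
        if chunk and len(chunk) >= target_size:
            yield chunk
            chunk = []
        chunk.extend(group)
    if chunk:
        yield chunk


rows = [("1", "a"), ("1", "b"), ("1", "c"), ("1", "d"),
        ("2", "a"), ("2", "c"), ("2", "d"),
        ("3", "b"), ("3", "c"), ("3", "d")]
chunks = list(chunk_keeping_groups(rows, key=lambda r: r[0], target_size=4))
print([len(c) for c in chunks])  # [4, 6]
```

With the sample data from the original question and a target of 4, the cut lands after the last 1-entry, and the 2- and 3-entries end up together in the second chunk.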

The only drawback is that the new files have additional columns, which I have to remove in a second step. Is there a way around it?

johnkerl (Maintainer) · 2022-01-24T05:05:46Z

The only drawback is, that the new files have additional columns which I have to remove in a second step. Is there a way around it?

@masgo I think just

unset $tmpCounter;
unset $tmpItemCounter;

before the tee would do it.

Also, perhaps a dedicated mlr split verb would be useful to have ... at my current job I don't do much of this kind of splitting but at a previous job I did it very very much; perhaps you & I are not alone ...

masgo (Author) · 2022-01-24T20:12:18Z

unset works great. Thank you.

'mlr split' would be nice, but if I were to choose, I would prefer the join that adds only a single column (or maybe only certain columns): #888

johnkerl (Maintainer) · 2022-02-09T12:35:01Z

Here is mlr split

And #888 is on deck.
