-
I am dealing with some large files and would like to split them into chunks for further processing. The straightforward solution would be to use `mlr head` and `mlr tail` in a loop. While this works well for normal files, my files are slightly special. The lines are not independent of each other, and some lines need to "stay together" in order to produce meaningful output in the next processing steps.
Say I wanted to split this into files with 4 records each; this would separate the 3-entries.
Obviously the chunks would no longer be of exactly the same size, but that's fine. I just want roughly the same size. I also know that there are usually not that many identical n: it's usually between 1 and 5 records per n, while the file itself has a few million entries. I want to split it into chunks of ~10,000 entries.
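To make the requirement concrete, here is a minimal Python sketch of the chunking rule (not Miller code). It assumes the records are already ordered so that equal values of n are adjacent; the sample data and the name `chunk_keeping_groups` are made up for illustration.

```python
from itertools import groupby

def chunk_keeping_groups(records, key, target_size):
    """Yield lists of roughly target_size records, never splitting a run of equal keys."""
    chunk = []
    for _, group in groupby(records, key=key):
        group = list(group)
        # Close the current chunk only at a group boundary, before it would overshoot.
        if chunk and len(chunk) + len(group) > target_size:
            yield chunk
            chunk = []
        chunk.extend(group)
    if chunk:
        yield chunk

# Hypothetical data: the three records with n=3 must end up in the same chunk.
records = [{"n": n} for n in [1, 2, 3, 3, 3, 4]]
chunks = list(chunk_keeping_groups(records, key=lambda r: r["n"], target_size=4))
print([len(c) for c in chunks])
```

A single group larger than `target_size` still lands in one chunk, so chunk sizes are only approximate, which matches the "roughly the same size" requirement.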
Replies: 6 comments
-
Hi @masgo, I ran into this quite a lot a few years ago and created the […]. Here that might be […] or maybe […].
I think this is what you want; please let me know if not. Also, this (currently) works by keeping all files open simultaneously, so if you run into a situation where there are thousands of different […]
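If the number of simultaneously open files ever becomes the limiting factor, one workaround outside Miller is to cap the number of open handles and reopen files in append mode as needed. The Python sketch below only illustrates that idea; `split_by_key`, the `split_<key>.txt` naming, and the `line` field are hypothetical and do not reflect how Miller works internally.

```python
import os
import tempfile
from collections import OrderedDict

def split_by_key(records, key, out_dir, max_open=2):
    """Append each record's line to out_dir/split_<key>.txt with at most max_open files open."""
    handles = OrderedDict()  # key -> open file, least recently used first
    for rec in records:
        k = key(rec)
        fh = handles.pop(k, None)
        if fh is None:
            if len(handles) >= max_open:
                # Evict and close the least recently used handle.
                _, oldest = handles.popitem(last=False)
                oldest.close()
            # Append mode, so a file that was closed earlier is continued, not truncated.
            fh = open(os.path.join(out_dir, f"split_{k}.txt"), "a")
        handles[k] = fh  # re-insert as most recently used
        fh.write(rec["line"] + "\n")
    for fh in handles.values():
        fh.close()

# Hypothetical data: 10 rows spread over 5 key values, with only 2 files open at a time.
out_dir = tempfile.mkdtemp()
records = [{"n": i % 5, "line": f"row {i}"} for i in range(10)]
split_by_key(records, key=lambda r: r["n"], out_dir=out_dir)
print(sorted(os.listdir(out_dir)))
```

Because files are opened in append mode, the output directory should be empty before a run, otherwise old content is kept.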
-
@masgo reading more carefully, it seems like there are few small values of […] where my […] for a total of a million lines. This produces: […]
-
Thank you for your help. Maybe I did not state my problem clearly (English is not my first language). But anyway, you pointed me in the right direction and I found a solution. What I ended up with is this:
First, it sorts the items according to the ID, then adds two counters, one for each ID and another one for all records. The only drawback is that the new files have additional columns, which I have to remove in a second step. Is there a way around it?
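For readers following along, the approach described above can be sketched in Python roughly as follows. This is one possible reading of the description, not the actual Miller command: the helper field names (`_id_count`, `_total`, `_chunk`) and the rule for starting a new chunk are assumptions.

```python
def assign_chunks(records, id_field, target_size):
    """Sort by id_field, then tag every record with two counters and a chunk number."""
    tagged = []
    id_count = 0     # number of distinct IDs seen so far (assumed meaning of the first counter)
    chunk = 0        # index of the output file the record would go to
    chunk_len = 0    # records already assigned to the current chunk
    prev = object()  # sentinel that compares unequal to every real ID
    ordered = sorted(records, key=lambda r: r[id_field])
    for total, rec in enumerate(ordered, start=1):  # total counts all records
        if rec[id_field] != prev:
            id_count += 1
            prev = rec[id_field]
            # Only switch chunks at an ID boundary, once the chunk is full enough.
            if chunk_len >= target_size:
                chunk += 1
                chunk_len = 0
        chunk_len += 1
        tagged.append({**rec, "_id_count": id_count, "_total": total, "_chunk": chunk})
    return tagged

def strip_helpers(rec):
    """Drop the helper columns again before the record is written out."""
    return {k: v for k, v in rec.items() if not k.startswith("_")}

# Hypothetical data with an "id" field.
records = [{"id": i} for i in [3, 1, 3, 2, 3, 4]]
tagged = assign_chunks(records, "id", target_size=2)
print([r["_chunk"] for r in tagged])
```

Removing the helper columns at write time, as `strip_helpers` does, avoids the separate clean-up pass mentioned above.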
-
unset works great. Thank you. 'mlr split' would be nice, but if I were to choose I would prefer the join which adds only a single column (or maybe only certain columns).
-
Here is […]. And #888 is on deck.
-
@masgo I think just an `unset` of the extra fields before the `tee` would do it.
Also, perhaps a dedicated `mlr split` verb would be useful to have ... at my current job I don't do much of this kind of splitting, but at a previous job I did it very, very much; perhaps you & I are not alone ...