Miller produces no output on TSV with > 64K characters per line #1501
Comments
Hi @GiulioCentorame ! I'll try to reproduce here. Meanwhile can you please tell me what this produces?
Then replace the 10 with 100, then 1000, 10000, 100000 -- this will help us see if this is indeed size-related.
It doesn't seem to work even when running
A note for @johnkerl: with Miller 5 and @GiulioCentorame's sample file, it works (I duplicated the header line of the file to have two rows). With Miller 6 I get an empty result.
I can replicate that on my original file too; Miller 5 seems to be working just fine.
@GiulioCentorame and @aborruso -- digging into this now. I was initially confused by the input data, which is not TSV (it has spaces, not tabs) and which has trailing whitespace. However, on some more thought I was able to reproduce the problem, and moreover, to narrow in on it. Here's a data-generator script:

    #!/usr/bin/env python
    import sys

    # Defaults; override with: script [ncol [nrow]]
    nrow = 2
    ncol = 100
    if len(sys.argv) == 2:
        ncol = int(sys.argv[1])
    if len(sys.argv) == 3:
        ncol = int(sys.argv[1])
        nrow = int(sys.argv[2])

    # The first row is the header (k-prefixed); later rows are data (v-prefixed).
    prefix = "k"
    for i in range(nrow):
        for j in range(ncol):
            if j == 0:
                sys.stdout.write("%s%07d" % (prefix, j))
            else:
                sys.stdout.write("\t%s%07d" % (prefix, j))
        sys.stdout.write("\n")
        prefix = "v"

Using this script I can produce an arbitrary number of columns:
Running this with various column-counts I get
Note
i.e. the problem happens when the line length exceeds 64K (65536 == 2**16). So the bug is in the line-buffering reader. I'll dig into this and find a fix.
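To relate column counts to the 64K figure, here is a minimal sketch of the arithmetic for lines produced by the generator script above. It assumes every field is exactly 8 bytes (one prefix character plus seven zero-padded digits, which holds up to 10,000,000 columns) and that fields are joined by single tabs. The names `line_length` and `first_ncol_over` are made up for this illustration, and whether the reader counts the trailing newline is not stated in this thread, so both cases are covered.

```python
# Sketch: byte length of one line written by the generator script,
# as a function of the column count. Names here are illustrative only.
FIELD_WIDTH = 8  # e.g. "k0000000"; valid while the index fits in 7 digits
LIMIT = 65536    # 2**16, the threshold reported in this thread


def line_length(ncol, count_newline=True):
    """Bytes in one generated line: ncol fields joined by ncol - 1 tabs."""
    length = FIELD_WIDTH * ncol + (ncol - 1)
    return length + 1 if count_newline else length


def first_ncol_over(limit=LIMIT, count_newline=True):
    """Smallest column count whose line is longer than `limit` bytes."""
    ncol = 1
    while line_length(ncol, count_newline) <= limit:
        ncol += 1
    return ncol


if __name__ == "__main__":
    for ncol in (10, 100, 1000, 10000, 100000):
        print(ncol, line_length(ncol), line_length(ncol) > LIMIT)
    print("first column count over the limit:", first_ncol_over())
```

Under these assumptions the crossover column count is the same whether or not the newline is counted, so generating files just below and just above that count is a compact way to confirm a length-related failure.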
Charming 😬
See also
And of course I used
Found the issue; #1506 to follow up; closing this issue.
Resolved on #1507 -- there is a full performance-analysis write-up there.
@GiulioCentorame you can pull from head and compile from source if you like (https://miller.readthedocs.io/en/latest/build/) -- or you can use
I tried pulling from head and it works, thank you so much! The interesting behaviour now is that E.g. this is miller 5.10.3
and this is miller from aff4b9f (interrupted manually)
Hi John,
Hope you are doing well. I just wanted to open an issue wrt this problem I have encountered with one of my files.
To recap: I am working with a big file with approximately the following structure
(~5,000 columns and ~500,000 records)
Running
mlr --itsv cut -f f.ID [file]
does not seem to work as expected: the program just "hangs" without printing anything to the shell or writing anything to stdout/stderr. We tried a few things with @aborruso to make it run, but nothing seemed to work as intended. This is running on an HPC machine and I provided up to 200 GB RAM to Miller, without success. Apologies for opening an issue without a reproducible example; this seems to happen only with a specific file that I cannot share due to data sharing policies.
Cheers
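When a file cannot be shared, its longest line can still be reported without exposing any contents. The following is a minimal sketch, not part of Miller: `longest_line` is a hypothetical helper that reads the file in binary mode and returns the 1-based line number and byte length (line terminator excluded) of the longest line, which can then be compared against the 65536-byte figure discussed in this thread.

```python
import sys

LIMIT = 65536  # 2**16, the threshold discussed in this thread


def longest_line(path):
    """Return (line_number, byte_length) of the longest line in the file."""
    best_line, best_len = 0, 0
    with open(path, "rb") as handle:
        for line_no, line in enumerate(handle, start=1):
            # Measure bytes, not characters, and ignore the line terminator.
            length = len(line.rstrip(b"\r\n"))
            if length > best_len:
                best_line, best_len = line_no, length
    return best_line, best_len


if __name__ == "__main__" and len(sys.argv) == 2:
    line_no, length = longest_line(sys.argv[1])
    print("longest line: %d (%d bytes)" % (line_no, length))
    print("exceeds %d bytes: %s" % (LIMIT, length > LIMIT))
```

For a wide file the header line is often the longest, so running this on the header alone may already be informative.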