Strange random exception thrown by Treex::PML::Backend::PML #70

Open
dan-zeman opened this issue Nov 26, 2017 · 7 comments

@dan-zeman (Member)

I have a large number (hundreds) of .treex.gz files that I process in two steps. Each step is a parallelized treex run on the ÚFAL cluster. The first step generates .treex.gz files, the second step reads them. Every now and then the reader in the second step crashes. I have observed it with various corpora; it is not tied to one particular dataset.

The exception says for one or more input files that there is extra content after the PML document end. Manual inspection of the files does not reveal anything unusual.

Re-running the first step (without changing settings or sources) sometimes helps. The error disappears, but later it strikes again somewhere else.

Re-running the second step without re-running the first step did not help (I let it retry 11 times, then I killed it), so the random error seems to be connected to writing rather than reading.

I looked up the name of the file that could not be read, and I tried just reading it, locally (no cluster), without anything else in the scenario. Worked. I tried the full scenario on the cluster, but just with this one file. Crashed. Re-tried the same thing a second time. Worked. Huh. Ran the same scenario for all 874 files. Crashed. Retried 11 times, always crashed (sometimes on that file that I had tried to single out).

I gunzipped all input files (but did not re-run step 1), then re-ran the scenario on the cluster. It worked. Just one experiment is too little evidence, but I now suspect that the bug may be related to reading/writing gzipped files from within Perl. (Gunzip itself did not complain about the files, though.)

```
$ which perl
/net/work/projects/perlbrew/Ubuntu/14.04/x86_64/perls/perl-5.18.2/bin/perl
$ whichpm PerlIO::via::gzip
/net/work/projects/perlbrew/Ubuntu/14.04/x86_64/perls/perl-5.18.2/lib/site_perl/5.18.2/PerlIO/via/gzip.pm 0.021
```
dan-zeman added the bug label Nov 26, 2017
@dan-zeman (Member, Author)

Just found out that cpanm installs version 0.03 of PerlIO::via::gzip. Let's see whether it affects the error in any way.

@dan-zeman (Member, Author) commented Nov 26, 2017

The error persists with the newer PerlIO::via::gzip.

```
TREEX-INFO:     4.099:  Parallelized execution. This process is one of the worker nodes, jobindex==20
TREEX-INFO:     4.177:  Loading block Treex::Block::Util::SetGlobal language=ar (1/6)
TREEX-INFO:     4.210:  Loading block Treex::Block::Read::Treex from=!/net/work/people/zeman/hamledt-data/ar/treex/01/{train,dev,test}/*.treex.gz (2/6)
TREEX-INFO:     4.478:  Loading block Treex::Block::A2A::CopyAtree source_selector= selector=prague (3/6)
TREEX-INFO:     4.503:  Loading block Treex::Block::HamleDT::Udep  (4/6)
TREEX-INFO:     4.657:  Loading block Treex::Block::Write::CoNLLU print_zone_id=0 substitute={treex/01}{conllu} compress=1 (5/6)
TREEX-INFO:     4.710:  Loading block Treex::Block::Write::Treex substitute={conllu}{treex/02} compress=1 (6/6)
TREEX-INFO:     4.720:  ALL BLOCKS SUCCESSFULLY LOADED.
TREEX-INFO:     4.721:  Loading the scenario took 0 seconds
TREEX-INFO:     4.733:  Applying process_start
Error occured while reading '/net/work/people/zeman/hamledt-data/ar/treex/01/train/AFP_ARB_20000815.0001.treex.gz' using backend Treex::PML::Backend::PML:
file:///net/work/people/zeman/hamledt-data/ar/treex/01/train/AFP_ARB_20000815.0001.treex.gz:11237: parser error : Extra content at the end of the document
```

@martinpopel (Member)

If the error is on random files and you just need the work done without solving the problem, there is the --skip_finished option, which makes re-running the whole experiment easier.

I see the bug is non-deterministic, but it would still be nice to have a minimal test, ideally committed in a branch of this repo. One test could contain the faulty file and just do the reading (which should fail deterministically). Another test could do reading+writing in a loop over 1..100, so there is a higher chance the error shows up. Without a test I cannot work on this, but maybe it will be easier for you to solve the problem than to write a test :-)
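
A rough sketch of that second test, assuming the round trip goes through the PerlIO::via::gzip layer the way Treex does (the payload and file names are placeholders, not real Treex documents):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use PerlIO::via::gzip;

# Placeholder payload standing in for a real .treex document.
my $payload = "<pml>\n" . ("<node>test</node>\n" x 10_000) . "</pml>\n";

for my $i (1 .. 100) {
    my $file = "stress_$i.gz";

    # Write the payload through the gzip layer.
    open(my $out, '>:via(gzip)', $file) or die "write $file: $!";
    print {$out} $payload;
    close($out) or die "close $file: $!";

    # Read it back through the same layer and compare.
    open(my $in, '<:via(gzip)', $file) or die "read $file: $!";
    my $roundtrip = do { local $/; <$in> };
    close($in);

    die "Mismatch in iteration $i\n" if $roundtrip ne $payload;
    unlink $file;
}
print "100 round trips OK\n";
```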

If the problem is in the writing, you can try switching from `open(my $gzip_fh, "| gzip -c > $filename.gz")` here to IO::Compress::Gzip.
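
For illustration, a minimal sketch of the in-process variant ($filename and the payload below are placeholders, not the actual Treex code):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use IO::Compress::Gzip qw($GzipError);

my $filename = 'example';            # placeholder name
my $payload  = "<pml>...</pml>\n";   # placeholder document

# Before: open(my $gzip_fh, "| gzip -c > $filename.gz") or die $!;
# After: compress inside the Perl process, no external gzip involved.
my $gzip_fh = IO::Compress::Gzip->new("$filename.gz")
    or die "IO::Compress::Gzip failed: $GzipError\n";
print {$gzip_fh} $payload;
$gzip_fh->close() or die "close failed: $GzipError\n";
```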

If you suspect PerlIO::via::gzip, you can try switching to PerlIO::gzip, as I did in Udapi. Theoretically, it is just a matter of deleting the "via::", i.e. changing '<:via(gzip)' to '<:gzip'. Compare Treex and Udapi.
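
A minimal sketch of that change on the reading side (the file name is a placeholder):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use PerlIO::gzip;

my $filename = 'example.treex.gz';   # placeholder name

# PerlIO::via::gzip version:
#   open(my $fh, '<:via(gzip)', $filename) or die $!;
# PerlIO::gzip version:
open(my $fh, '<:gzip', $filename) or die "open $filename: $!";
while (my $line = <$fh>) {
    # ...process the decompressed line here...
}
close($fh);
```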

@dan-zeman (Member, Author)

I wanted to create a minimal test, but the failing file does not fail deterministically.

Thanks for the pointers to the other gzip options. I will try them when I have time.

dan-zeman changed the title from "Strange random exception thrown by PML::Backend" to "Strange random exception thrown by Treex::PML::Backend::PML" Nov 27, 2017
@dan-zeman (Member, Author)

I have not seen the error when processing the files sequentially outside the cluster, nor when processing uncompressed data (based on several experiments now).

I have seen it intermittently with gzipped treex data processed in parallel on several cluster machines.

@martinpopel (Member)

I see. Writing tests for parallel cluster processing is a bit trickier, but still possible.
Have you checked whether it is always the same machine that produces the faulty treex.gz files? Maybe a different version of gzip is installed there.
Anyway, if that is the case, using IO::Compress::Gzip should solve it.

@dan-zeman (Member, Author)

Not sure whether it is always the same (set of) machine(s); it takes some effort to dig this information out of the logs. But in one case I identified the machine the error message came from (lucifer6), logged into it, and processed the entire batch of presumably faulty files sequentially there. No error.
