Segfault when training CRF #10

DebinUser42 · 2019-07-31T13:39:06Z

I cannot train any CRF model using my own dataset. Variable models train fine. I first raised the issue with Nice2Predict here.

As explained in the Nice2Predict issue, I am using the latest docker image to perform the experiments.

LostBenjamin · 2019-08-04T19:45:45Z

Hi,

Could you please try adding a filename prefix to the --out_model argument? That is, change --out_model 128_kfold_10/1/crf/x86_64/ to something like --out_model 128_kfold_10/1/crf/x86_64/model.

Best,
Jingxuan

DebinUser42 · 2019-08-05T12:19:00Z

Unfortunately, the problem still exists. I recompiled N2P with an increased stackLimit to stop segfaults when using multiple workers. The server has 32 cores with hyper-threading, 256 GB of RAM, and plenty of HDD space.

root@7fd590dedd2e:/debin# /Nice2Predict/bazel-bin/n2p/training/train_json --input /debin/128_kfold_10/1/crf/feature.json --log_dir /debin/128_kfold_10/1/crf/ --valid_labels /debin/c_valid_labels --out_model /debin/128_kfold_10/1/crf/x86_64/model --num_threads 16 --training_method pl --max_labels_z 8
*** Aborted at 1565006314 (unix time) try "date -d @1565006314" if you are using GNU date ***
PC: @     0x7f742ef3e6f8 fwrite
*** SIGSEGV (@0x0) received by PID 38717 (TID 0x7f74303bd880) from PID 0; stack trace: ***
    @     0x7f742fd7b390 (unknown)
    @     0x7f742ef3e6f8 fwrite
    @           0x45b6cf GraphInference::SaveModel()
    @           0x410e25 LearningMain<>()
    @           0x40d97a main
    @     0x7f742eef0830 __libc_start_main
    @           0x40d4a9 _start
    @                0x0 (unknown)
Segmentation fault (core dumped)

root@7fd590dedd2e:/debin# gdb /Nice2Predict/bazel-bin/n2p/training/train_json ./core
(gdb) bt
#0  __GI__IO_fwrite (buf=0x7ffc327f1710, size=4, count=1, fp=0x0) at iofwrite.c:37
#1  0x000000000045b6cf in GraphInference::SaveModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#2  0x0000000000410e25 in int LearningMain<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::function<nice2protos::Query (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>) ()
#3  0x000000000040d97a in main ()

(gdb) x/s 0x7ffc327f1710 
0x7ffc327f1710: "(\036\031"
(gdb) info frame
Stack level 1, frame at 0x7ffc327f17e0:
 rip = 0x45b6cf in GraphInference::SaveModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&); saved rip = 0x410e25
 called by frame at 0x7ffc327f1c50, caller of frame at 0x7ffc327f1700
 Arglist at 0x7ffc327f17d0, args: 
 Locals at 0x7ffc327f17d0, Previous frame's sp is 0x7ffc327f17e0
 Saved registers:
  rbx at 0x7ffc327f17c8, rbp at 0x7ffc327f17d0, rip at 0x7ffc327f17d8
(gdb)
(gdb) frame 0
#0  __GI__IO_fwrite (buf=0x7ffc327f1710, size=4, count=1, fp=0x0) at iofwrite.c:37
37      in iofwrite.c
(gdb) info args
buf = 0x7ffc327f1710
size = 4
count = 1
fp = 0x0
(gdb) info locals
_IO_acquire_lock_file = <optimized out>
request = 4
written = 0
(gdb) p 0x7ffc327f1710
$2 = 140721155675920
(gdb) x/s 0x7ffc327f1710
0x7ffc327f1710: "(\036\031"

The core dump file is 1.1GB. I can upload it if you need it.

DebinUser42 · 2019-08-05T12:22:59Z

Looking at the N2P source code, the file it is trying to save to is logged at the start of the SaveModel function. Here are the log file contents:

root@7fd590dedd2e:/debin/128_kfold_10/1/crf# cat !$
cat train_json.7fd590dedd2e.invalid-user.log.INFO.20190805-115305.38717
Log file created at: 2019/08/05 11:53:05
Running on machine: 7fd590dedd2e
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0805 11:53:05.728695 38717 train_internal.h:298] Running structured training...
I0805 11:53:14.708145 38717 train_internal.h:94] Loaded 12999 training data samples.
I0805 11:53:14.708168 38717 graph_inference.cpp:1644] Loading LabelChecker...
I0805 11:53:15.263715 38717 graph_inference.cpp:1646] LabelChecker loaded
I0805 11:53:17.602084 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:53:19.874061 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:53:19.874079 38717 train_internal.h:303] Training inited...
I0805 11:53:19.874084 38717 train_internal.h:305] Running PL training...
I0805 11:53:20.124673 38717 train_internal.h:127] Starting training using pseudolikelihood as objective function with --start_learning_rate=0.100000, --regularization_const=2.000000 and --max_labels_z=8
I0805 11:53:51.277228 38717 train_internal.h:151] Training pass took 31152ms.
I0805 11:53:51.277369 38717 train_internal.h:153] Pass 0 with learning rate 0.1
I0805 11:53:54.044384 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:53:56.213568 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:54:37.292136 38717 train_internal.h:151] Training pass took 41078ms.
I0805 11:54:37.292222 38717 train_internal.h:153] Pass 1 with learning rate 0.05
I0805 11:54:39.924296 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:54:42.070261 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:55:24.225458 38717 train_internal.h:151] Training pass took 42155ms.
I0805 11:55:24.225556 38717 train_internal.h:153] Pass 2 with learning rate 0.0166667
I0805 11:55:27.046609 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:55:29.201287 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:56:11.677734 38717 train_internal.h:151] Training pass took 42476ms.
I0805 11:56:11.677822 38717 train_internal.h:153] Pass 3 with learning rate 0.00416667
I0805 11:56:14.314664 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:56:16.471534 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:56:59.356559 38717 train_internal.h:151] Training pass took 42884ms.
I0805 11:56:59.356649 38717 train_internal.h:153] Pass 4 with learning rate 0.000833333
I0805 11:57:01.894024 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:57:04.085263 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:57:44.610508 38717 train_internal.h:151] Training pass took 40525ms.
I0805 11:57:44.610594 38717 train_internal.h:153] Pass 5 with learning rate 0.000138889
I0805 11:57:47.334631 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:57:49.481114 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:58:34.793534 38717 train_internal.h:151] Training pass took 45312ms.
I0805 11:58:34.793619 38717 train_internal.h:153] Pass 6 with learning rate 1.98413e-05
I0805 11:58:34.793651 38717 graph_inference.cpp:1271] Saving model /debin/128_kfold_10/1/crf/x86_64/model...

Do I need to run touch /debin/128_kfold_10/1/crf/x86_64/model?

LostBenjamin · 2019-08-05T12:27:06Z

I think you need to run mkdir -p /debin/128_kfold_10/1/crf/x86_64/
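
For context on why a missing directory shows up as a segfault: the backtrace above has fwrite being called with fp=0x0, which is what you would see if the output file could not be opened and the returned handle was used without a check. fopen does not create parent directories, so it returns NULL when x86_64/ does not exist (touch on the same path would fail for the same reason). Below is a minimal C++17 sketch of a defensive save routine. SaveValue is a hypothetical helper written for illustration, not the actual Nice2Predict code, and the single 32-bit value is only a stand-in for the real model contents.

```cpp
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper, NOT part of Nice2Predict: writes one 32-bit value to
// `path`, creating any missing parent directories first.
bool SaveValue(const std::string& path, std::uint32_t value) {
  namespace fs = std::filesystem;

  // Create the parent directory chain, like `mkdir -p` on the command line.
  const fs::path parent = fs::path(path).parent_path();
  if (!parent.empty()) {
    std::error_code ec;
    fs::create_directories(parent, ec);
    if (ec) return false;
  }

  // fopen returns a null pointer on failure; passing that to fwrite is what
  // produces a crash like the one in the backtrace.
  std::FILE* f = std::fopen(path.c_str(), "wb");
  if (f == nullptr) {
    std::perror(path.c_str());  // report the reason instead of crashing
    return false;
  }

  const bool written = std::fwrite(&value, sizeof(value), 1, f) == 1;
  const bool closed = std::fclose(f) == 0;
  return written && closed;
}
```

Calling SaveValue("/some/new/dir/model", 42) would create /some/new/dir if needed and return true on success, or print the underlying error and return false if the path cannot be opened at all. With the existing binary, running the mkdir -p command above before training has the same effect.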

DebinUser42 · 2019-08-05T12:35:08Z

🤦‍♂️ typo in the path. Resolved.

DebinUser42 closed this as completed Aug 5, 2019
