Segfault when training CRF #10

Closed
DebinUser42 opened this issue Jul 31, 2019 · 5 comments

@DebinUser42

I cannot train any CRF model using my own dataset. Variable models train fine. I first raised the issue with Nice2Predict here.

As explained in the Nice2Predict issue, I am using the latest docker image to perform the experiments.

@LostBenjamin
Collaborator

Hi,

Could you please try adding a filename prefix to the --out_model argument? That is, change --out_model 128_kfold_10/1/crf/x86_64/ to something like --out_model 128_kfold_10/1/crf/x86_64/model.

Best,
Jingxuan

@DebinUser42
Author

Unfortunately, the problem still exists. I recompiled N2P with an increased stackLimit to stop segfaults when using multiple workers. The server has 32 cores with hyper-threading, 256 GB of RAM, and plenty of HDD space.

root@7fd590dedd2e:/debin# /Nice2Predict/bazel-bin/n2p/training/train_json --input /debin/128_kfold_10/1/crf/feature.json --log_dir /debin/128_kfold_10/1/crf/ --valid_labels /debin/c_valid_labels --out_model /debin/128_kfold_10/1/crf/x86_64/model --num_threads 16 --training_method pl --max_labels_z 8
*** Aborted at 1565006314 (unix time) try "date -d @1565006314" if you are using GNU date ***
PC: @     0x7f742ef3e6f8 fwrite
*** SIGSEGV (@0x0) received by PID 38717 (TID 0x7f74303bd880) from PID 0; stack trace: ***
    @     0x7f742fd7b390 (unknown)
    @     0x7f742ef3e6f8 fwrite
    @           0x45b6cf GraphInference::SaveModel()
    @           0x410e25 LearningMain<>()
    @           0x40d97a main
    @     0x7f742eef0830 __libc_start_main
    @           0x40d4a9 _start
    @                0x0 (unknown)
Segmentation fault (core dumped)

root@7fd590dedd2e:/debin# gdb /Nice2Predict/bazel-bin/n2p/training/train_json ./core
(gdb) bt
#0  __GI__IO_fwrite (buf=0x7ffc327f1710, size=4, count=1, fp=0x0) at iofwrite.c:37
#1  0x000000000045b6cf in GraphInference::SaveModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#2  0x0000000000410e25 in int LearningMain<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::function<nice2protos::Query (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>) ()
#3  0x000000000040d97a in main ()

(gdb) x/s 0x7ffc327f1710 
0x7ffc327f1710: "(\036\031"
(gdb) info frame
Stack level 1, frame at 0x7ffc327f17e0:
 rip = 0x45b6cf in GraphInference::SaveModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&); saved rip = 0x410e25
 called by frame at 0x7ffc327f1c50, caller of frame at 0x7ffc327f1700
 Arglist at 0x7ffc327f17d0, args: 
 Locals at 0x7ffc327f17d0, Previous frame's sp is 0x7ffc327f17e0
 Saved registers:
  rbx at 0x7ffc327f17c8, rbp at 0x7ffc327f17d0, rip at 0x7ffc327f17d8
(gdb)
(gdb) frame 0
#0  __GI__IO_fwrite (buf=0x7ffc327f1710, size=4, count=1, fp=0x0) at iofwrite.c:37
37      in iofwrite.c
(gdb) info args
buf = 0x7ffc327f1710
size = 4
count = 1
fp = 0x0
(gdb) info locals
_IO_acquire_lock_file = <optimized out>
request = 4
written = 0
(gdb) p 0x7ffc327f1710
$2 = 140721155675920
(gdb) x/s 0x7ffc327f1710
0x7ffc327f1710: "(\036\031"

The core dump file is 1.1GB. I can upload it if you need it.
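
In the backtrace above, fwrite is called with fp=0x0 (size=4, count=1), which is consistent with an output file that could not be opened and a null FILE* being used anyway. The following is a minimal sketch of a checked write, not the actual SaveModel code: the helper name WriteInt32Checked and its error-reporting style are assumptions made for illustration.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper (not taken from Nice2Predict): writes one 4-byte value
// to `path` and reports a failure instead of passing a null FILE* to fwrite.
bool WriteInt32Checked(const std::string& path, int value, std::string* error) {
  assert(error != nullptr);
  std::FILE* f = std::fopen(path.c_str(), "wb");
  if (f == nullptr) {
    // fopen returns a null pointer when the file cannot be opened, for
    // example when a parent directory of `path` does not exist.
    *error = "cannot open " + path + ": " + std::strerror(errno);
    return false;
  }
  // Always close the file, even if the write itself failed.
  const bool wrote = std::fwrite(&value, sizeof(value), 1, f) == 1;
  const bool closed = std::fclose(f) == 0;
  if (!wrote || !closed) {
    *error = "write failed for " + path;
    return false;
  }
  return true;
}
```

With a check like this, a path whose directory is missing would produce an error message naming the path rather than a SIGSEGV inside fwrite.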

@DebinUser42
Author

Looking at the N2P source code, the file it is trying to save to is logged at the start of the SaveModel function. Here are the log file contents:

root@7fd590dedd2e:/debin/128_kfold_10/1/crf# cat !$
cat train_json.7fd590dedd2e.invalid-user.log.INFO.20190805-115305.38717
Log file created at: 2019/08/05 11:53:05
Running on machine: 7fd590dedd2e
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0805 11:53:05.728695 38717 train_internal.h:298] Running structured training...
I0805 11:53:14.708145 38717 train_internal.h:94] Loaded 12999 training data samples.
I0805 11:53:14.708168 38717 graph_inference.cpp:1644] Loading LabelChecker...
I0805 11:53:15.263715 38717 graph_inference.cpp:1646] LabelChecker loaded
I0805 11:53:17.602084 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:53:19.874061 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:53:19.874079 38717 train_internal.h:303] Training inited...
I0805 11:53:19.874084 38717 train_internal.h:305] Running PL training...
I0805 11:53:20.124673 38717 train_internal.h:127] Starting training using pseudolikelihood as objective function with --start_learning_rate=0.100000, --regularization_const=2.000000 and --max_labels_z=8
I0805 11:53:51.277228 38717 train_internal.h:151] Training pass took 31152ms.
I0805 11:53:51.277369 38717 train_internal.h:153] Pass 0 with learning rate 0.1
I0805 11:53:54.044384 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:53:56.213568 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:54:37.292136 38717 train_internal.h:151] Training pass took 41078ms.
I0805 11:54:37.292222 38717 train_internal.h:153] Pass 1 with learning rate 0.05
I0805 11:54:39.924296 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:54:42.070261 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:55:24.225458 38717 train_internal.h:151] Training pass took 42155ms.
I0805 11:55:24.225556 38717 train_internal.h:153] Pass 2 with learning rate 0.0166667
I0805 11:55:27.046609 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:55:29.201287 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:56:11.677734 38717 train_internal.h:151] Training pass took 42476ms.
I0805 11:56:11.677822 38717 train_internal.h:153] Pass 3 with learning rate 0.00416667
I0805 11:56:14.314664 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:56:16.471534 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:56:59.356559 38717 train_internal.h:151] Training pass took 42884ms.
I0805 11:56:59.356649 38717 train_internal.h:153] Pass 4 with learning rate 0.000833333
I0805 11:57:01.894024 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:57:04.085263 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:57:44.610508 38717 train_internal.h:151] Training pass took 40525ms.
I0805 11:57:44.610594 38717 train_internal.h:153] Pass 5 with learning rate 0.000138889
I0805 11:57:47.334631 38717 graph_inference.cpp:1710] Preparing GraphInference for MAP inference...
I0805 11:57:49.481114 38717 graph_inference.cpp:1723] GraphInference prepared for MAP inference.
I0805 11:58:34.793534 38717 train_internal.h:151] Training pass took 45312ms.
I0805 11:58:34.793619 38717 train_internal.h:153] Pass 6 with learning rate 1.98413e-05
I0805 11:58:34.793651 38717 graph_inference.cpp:1271] Saving model /debin/128_kfold_10/1/crf/x86_64/model...

Do I need to run touch /debin/128_kfold_10/1/crf/x86_64/model?

@LostBenjamin
Collaborator

I think you need to run mkdir -p /debin/128_kfold_10/1/crf/x86_64/
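
For reference, the same effect as mkdir -p can be obtained from C++17 before a model is saved. This is only a sketch under assumptions: the helper name EnsureParentDir is hypothetical and is not part of Nice2Predict, which may handle output paths differently.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

// Hypothetical helper: creates the directory that should contain
// `model_path`, including missing intermediate directories (like mkdir -p).
bool EnsureParentDir(const std::string& model_path, std::string* error) {
  assert(error != nullptr);
  const std::filesystem::path parent =
      std::filesystem::path(model_path).parent_path();
  if (parent.empty()) {
    return true;  // Bare file name: the current directory is used.
  }
  std::error_code ec;
  // Not an error if the directory already exists; `ec` is set only on failure.
  std::filesystem::create_directories(parent, ec);
  if (ec) {
    *error = "cannot create " + parent.string() + ": " + ec.message();
    return false;
  }
  return true;
}
```

Calling EnsureParentDir("/debin/128_kfold_10/1/crf/x86_64/model", &err) would create /debin/128_kfold_10/1/crf/x86_64 if it is missing. A typo in the path would still go unnoticed, since the misspelled directory would simply be created.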

@DebinUser42
Author

🤦‍♂️ typo in the path. Resolved.
