Repeating serialization error #33

Closed
rmcclosk opened this issue May 7, 2015 · 7 comments

@rmcclosk
Collaborator

rmcclosk commented May 7, 2015

I ran the command from issue #29 overnight. In the morning, the log indicated that 6196 steps had completed. The message below was being printed to the console about every 2 seconds.

Error in serialize(data, node$con, xdr = FALSE) : 
  error writing to connection

Some googling indicates this is probably an R problem, which I had thought was resolved in #29.

Python had crashed at some point. Here are some relevant lines from the crash report. I saved the rest of the report.

ProblemType: Crash
ExecutablePath: /usr/bin/python2.7
ExecutableTimestamp: 1395532628
ProcCmdline: python kamphir.py DiffRisk examples/settings.rcolgem-DiffRisk1.json examples/rcolgem_c1-2.0_n-300_rho-0.9.nwk diffrisk-c1-2.log -kdecay 0.3 -tol0 0.005 -mintol 0.0025 -ncores 4 -nthreads 4 -nreps 20 -treenum 0 -seed 0
ProcCwd: /home/rmcclosk/Documents/kamphir

I killed the program with Control+C. Here's the traceback to show where we were.

Traceback (most recent call last):
  File "kamphir.py", line 688, in <module>

  File "kamphir.py", line 446, in abc_mcmc
    print 'ERROR: next_score (', next_score, ') outside interval [0,1], dumping proposal and EXIT'
  File "kamphir.py", line 377, in evaluate
    if len(trees) == 0:
  File "kamphir.py", line 269, in simulate_internal
    for newick in newicks:
  File "/home/rmcclosk/Documents/kamphir/rcolgem.py", line 252, in simulate_DiffRisk_trees
    robjects.r("tfgy <- make.fgy( t0, maxSampleTime, births, deaths, nonDemeDynamics, x0, migrations=migrations, "
  File "/usr/local/lib/python2.7/dist-packages/rpy2-2.5.6-py2.7-linux-x86_64.egg/rpy2/robjects/__init__.py", line 269, in __call__
    res = self.eval(p)
  File "/usr/local/lib/python2.7/dist-packages/rpy2-2.5.6-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py", line 170, in __call__
    return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/rpy2-2.5.6-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py", line 100, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)
KeyboardInterrupt

So it appears that next_score was outside the allowed interval for some reason. We tried to dump the proposal and quit, but the exit failed.

One possibility is that sys.exit() didn't cleanly shut down all the threads and the R instance (instances?) they were attached to. So we still had R trying to pass stuff back to Python when it was shut down.

@rmcclosk
Collaborator Author

rmcclosk commented May 7, 2015

Something possibly related from the sys.exit documentation: "Since exit() ultimately “only” raises an exception, it will only exit the process when called from the main thread, and the exception is not intercepted."
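The quoted behaviour can be reproduced in a small, self-contained sketch. This is not kamphir code: it is written for Python 3 (the crash report above is from Python 2.7), and `worker` and `run_demo` are hypothetical names used only for illustration. It shows that `sys.exit()` called from a worker thread only ends that thread, so the main thread (and anything attached to it) keeps running.

```python
import sys
import threading


def worker(results):
    """Hypothetical worker that tries to abort the whole program."""
    results.append("worker started")
    sys.exit(1)  # raises SystemExit in this thread only
    results.append("unreachable")  # never executed


def run_demo():
    results = []
    t = threading.Thread(target=worker, args=(results,))
    t.start()
    t.join()  # the worker has ended, but the process has not
    results.append("main thread still running")
    return results


if __name__ == "__main__":
    print(run_demo())
```

If the process really has to stop, one option is for the worker to report the failure (for example through a return value or a `threading.Event`) and let the main thread call `sys.exit()` after the workers have been joined.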

It also occurs to me that we are using separate multi-threading tools in Python and R. It's possible there are some unexpected side effects from doing that.

@ArtPoon
Owner

ArtPoon commented May 11, 2015

Oof. Regarding use of separate multi-threading tools, I suppose that we could have each Python thread spawn its own instance of R, and then distribute jobs respectively. A kernel score outside the prescribed bounds could conceivably be due to numerical overflow (see large diagonal problem in kernel matrices).
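As a rough illustration of how overflow could push a score outside [0,1], here is a minimal sketch. It assumes the score is a kernel value normalized as k(x,y) / sqrt(k(x,x) * k(y,y)); that is a common normalization, but it is an assumption here and not confirmed to be what kamphir does. All function names are hypothetical. Working with log kernel values avoids the overflow in this toy case.

```python
import math


def normalized_score_naive(kxy, kxx, kyy):
    """Assumed normalization: k(x,y) / sqrt(k(x,x) * k(y,y))."""
    return kxy / math.sqrt(kxx * kyy)


def normalized_score_log(log_kxy, log_kxx, log_kyy):
    """Same quantity, computed from log kernel values to avoid overflow."""
    return math.exp(log_kxy - 0.5 * (log_kxx + log_kyy))


def in_unit_interval(score):
    """False for NaN as well, because comparisons with NaN are always False."""
    return 0.0 <= score <= 1.0


if __name__ == "__main__":
    # Large diagonal entries overflow to inf, and inf / inf gives NaN.
    bad = normalized_score_naive(float("inf"), 1e200, 1e200)
    print(bad, in_unit_interval(bad))

    # The same kind of values handled in log space stay finite.
    good = normalized_score_log(800.0, 801.0, 801.0)
    print(good, in_unit_interval(good))
```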

@ArtPoon
Owner

ArtPoon commented Jun 5, 2015

I just ran into this problem on my workstation. Delving into my system logs, I found this report:

Crashed Thread:  0  Dispatch queue: com.apple.main-thread

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x00007fa80c66f948
[...]
Application Specific Information:
crashed on child side of fork pre-exec

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   rcolgem.so                      0x0000000110e45050 dQAL + 320
1   deSolve.so                      0x0000000110d1e3d8 derivs + 136
2   deSolve.so                      0x0000000110d1cbec rk_fixed + 1244
3   deSolve.so                      0x0000000110cc492d call_rkFixed + 3293
4   libR.dylib                      0x000000010e26f0fc do_dotcall + 1916 (dotcode.c:652)
5   libR.dylib                      0x000000010e2a036b Rf_eval + 1355 (eval.c:657)

I suspect there is some exception being thrown at the level of deSolve that is not being handled correctly by rcolgem/Python. Will try to find the offending line in dQAL.

ArtPoon added a commit that referenced this issue Jul 22, 2015
…hing

 time units in tree.
Fixed typo in rcolgem.py, PANGEA model specification.
Successfully running PANGEA model, but encountered issue #33.
@ArtPoon ArtPoon added the bug label Jul 22, 2015
@ArtPoon
Owner

ArtPoon commented Jul 22, 2015

This might be fixed by rcolgem svn 126; I integrated these changes as of commit 27ec24c.

@ArtPoon
Owner

ArtPoon commented Jul 22, 2015

I think this one's finally squished by commit 9fe8f34

@ArtPoon
Owner

ArtPoon commented Jul 23, 2015

ARGH. Still no good.
To reproduce, do a fresh pull, recompile and install rcolgem, and use this invocation:

python kamphir.py PANGEA settings/pangea.json /Users/art/git/pangea/data/February2015/Regional/hyphy/root2tip/150129_PANGEAsim_Regional_FirstObj_scA_SIMULATED_all.nwk scA.log -delimiter _ -ncores 5 -nthreads 5 -nreps 5 -tscale 0.142857

@ArtPoon ArtPoon reopened this Jul 23, 2015
@rmcclosk
Collaborator Author

I believe this is fixed with 9509795. The problem was negative node heights, which would lead to an attempt to access a negative array index in dQAL (before the first time point at which the population sizes are calculated). I put in a check for this, so it now uses the values from time zero if the time is negative. 30 steps so far with no problems.

This is not a coherent fix but more of a stopgap just to get it to run. If the node heights are negative, that means the nodes exist before time zero. Since the first infected individual appears at time zero, this scenario is impossible, and those trees should probably be thrown away instead.
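The two options described above (clamping versus discarding) can be sketched in Python. This is only an illustration under stated assumptions: the actual check lives in the compiled dQAL routine, and the names `time_to_index`, `population_at` and `has_valid_heights`, as well as the evenly spaced time grid, are hypothetical. Note that in Python a negative index silently wraps around to the end of the list, whereas in C it reads invalid memory, which is consistent with the segfault reported earlier.

```python
def time_to_index(t, t0, dt, n_points):
    """Map time t onto a grid starting at t0 with spacing dt.

    The index is clamped to [0, n_points - 1], so a time before t0
    falls back to the first grid point (the stopgap behaviour).
    """
    idx = int((t - t0) // dt)
    return max(0, min(idx, n_points - 1))


def population_at(sizes, t, t0=0.0, dt=1.0):
    """Look up the population size at time t without leaving the array."""
    return sizes[time_to_index(t, t0, dt, len(sizes))]


def has_valid_heights(node_heights):
    """Alternative to clamping: reject trees with any negative node height."""
    return all(h >= 0 for h in node_heights)


if __name__ == "__main__":
    sizes = [10, 20, 30]
    print(population_at(sizes, -0.5))       # negative time uses the time-zero value
    print(has_valid_heights([0.0, 1.5]))    # tree would be kept
    print(has_valid_heights([-0.2, 1.5]))   # tree would be discarded
```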

@ArtPoon ArtPoon closed this as completed Jul 24, 2015