Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JVM installs signal handlers by default #150

Open
ahnlabb opened this issue Sep 7, 2021 · 11 comments
Open

JVM installs signal handlers by default #150

ahnlabb opened this issue Sep 7, 2021 · 11 comments

Comments

@ahnlabb
Copy link
Contributor

ahnlabb commented Sep 7, 2021

In ahnlabb/BioformatsLoader.jl#18 @timholy noted that after the JVM was initialized sending the interrupt signal (e.g. through Ctrl-C) would exit Julia. This happens because the JVM installs signal handlers by default. On a POSIX system this can be demonstrated as follows:

Normally when julia receives sigint, an exception is thrown:

julia> (() -> ccall("kill", Int, (Int, Int), 0, 2))()
Error showing value of type Int64:
ERROR: InterruptException:
Stacktrace:
  [...]

However after calling JavaCall.init() julia will instead exit:

julia> using JavaCall

julia> JavaCall.init()

julia> (() -> ccall("kill", Int, (Int, Int), 0, 2))()
0

julia> % [process exited]

The intrrupt signal (() -> ccall("kill", Int, (Int, Int), 0, 2))() can be replaced by something else that is manually interrupted by typing Ctrl-C (e.g. sleep(10)).

In ahnlabb/BioformatsLoader.jl@084efe7 I fixed this by adding -Xrs to the arguments passed to JavaCall.init and adding JavaCall.destroy to the julia exit hooks.

This raises the following questions:

  • Should JavaCall pass -Xrs by default?
  • Does JavaCall.destroy ensure that everything is cleaned up properly?
  • Do we need to use Base.disable_sigint to block interruptions and ensure JVM resources are freed when calls are interrupted?
@mkitti
Copy link
Member

mkitti commented Sep 7, 2021

  • Should JavaCall pass -Xrs by default?

JVM options can be specific to a certain VM implementation or can be vendor specific, so -Xrs may not be supported by all JVMs. Also, there may be other consequences such as performance degradation on certain platforms when that option is used.

I think the correct way to ultimately handle this would be support for signal chaining in Julia:
https://docs.oracle.com/en/java/javase/16/vm/signal-chaining.html

We either need to implement signal chaining for Julia itself or make Julia's signaling compatible with the JVM's signal chaining facility.

  • Does JavaCall.destroy ensure that everything is cleaned up properly?

JavaCall.destroy() invokes JNI.destroy(). That in turn invokes the JNI function DestroyJavaVM

JavaCall.jl/src/JNI.jl

Lines 207 to 216 in 4846e4e

function destroy()
if !is_env_loaded()
throw(JNIError("Called destroy without initialising Java VM"))
end
res = ccall(jvmfunc[].DestroyJavaVM, Cint, (Ptr{Nothing},), ppjvm[])
res < 0 && throw(JavaCallError("Unable to destroy Java VM"))
ppenv[1] = C_NULL
ppjvm[] = C_NULL
nothing
end

  • Do we need to use Base.disable_sigint to block interruptions and ensure JVM resources are freed when calls are interrupted?

I think signal chaining above is the proper solution to these issues.

@mkitti
Copy link
Member

mkitti commented Sep 7, 2021

For now, the easiest way to address this is documentation.

@ahnlabb
Copy link
Contributor Author

ahnlabb commented Sep 8, 2021

JVM options can be specific to a certain VM implementation or can be vendor specific, so -Xrs may not be supported by all JVMs. Also, there may be other consequences such as performance degradation on certain platforms when that option is used.

I find it highly unlikely that anyone using JavaCall won't have access to this option since:

  • it has been in J2SE since 1.3.1 (realeased in 2000) and all Sun/Oracle versions since then
  • it is also available in OpenJDK and J9

However, I agree that we need to be careful and clear about compatibility. Which implementations do we currently support? Like I said, I don't believe we support (or even could support) any implementation that does not have this option, but it would be great to have the compatibility documented in general. As for the performance degradation, this should really be understood as a specific optimization that e.g. HotSpot can perform on certain platforms when it has complete control over the signals, that these optimizations are unavailable when running as a service is completely expected (SO discussion). The differences in performance are minor and I think that the default should be safety and correctness, perhaps with an opt-in to JVM signal handling (though I'm not sure that is ever advisable).

I think the correct way to ultimately handle this would be support for signal chaining in Julia:
https://docs.oracle.com/en/java/javase/16/vm/signal-chaining.html

I think signal chaining above is the proper solution to these issues.

Unfortunately no, for various reasons, but explicitly noted on the page you linked:

Note: The SIGQUIT, SIGTERM, SIGINT, and SIGHUP signals cannot be chained. If the application must handle these signals, then consider using the —Xrs option.

@ahnlabb
Copy link
Contributor Author

ahnlabb commented Sep 8, 2021

  • Does JavaCall.destroy ensure that everything is cleaned up properly?

JavaCall.destroy() invokes JNI.destroy(). That in turn invokes the JNI function DestroyJavaVM

My question was a bit unclear. Is the code you linked (calling DestroyJavaVM then throwing on non-zero return) enough to ensure that everything is cleaned up after destroy returns? I'm aware that DestroyJavaVM has been problematic since forever and there is a lot of confusion online about what it does and doesn't do . The spec states:

DestroyJavaVM
Unloads a Java VM and reclaims its resources.

But at the end:

The JDK/JRE still does not support VM unloading, however.

This seems to be unnecessarily confusing in the documentation and a bug was reported and supposedly fixed for Java SE 8 (bug report), but the Java SE 8 docs still state:

DestroyJavaVM
Unloads a Java VM and reclaims its resources.

RETURNS:
Returns JNI_OK on success; returns a suitable JNI error code (a negative number) on failure.
Unloading of the VM is not supported.

This probably just means "in the sense that a new VM can't then be loaded" so that is not an issue.

In summary, I think destroy should guarantee the right things but if anyone knows of possible caveats, JNI error codes that could be handled better, etc. such things may become more important if we add destroy to Julia's atexit callbacks.

@mkitti
Copy link
Member

mkitti commented Sep 8, 2021

Basically it's not clear to me that calling destroy is actually very beneficial in most usage from Julia. If you are going to call destroy you might as well just exit the Julia process in most cases.

The main scenario I could imagine if you have limited memory resources and you only needed to load the JVM for a limited time while continuing to process the data in Julia. Perhaps that applies to BioformatsLoader where a user might only want to load the JVM to load the initial data?

@timholy
Copy link

timholy commented Sep 8, 2021

If you run destroy under atexit, then it only runs when you quit Julia. IIUC, a concern is whether that will do everything it should, or does it risk leaving some zombie Java process running on the machine? If that would ultimately be killed by a parent Julia process, would a failed cleanup lead to a noisy & messy Julia exit? (It's not pretty if the terminal scrolls with dozens of error messages when the user quits Julia, even if there's no actual harm caused.)

@ahnlabb
Copy link
Contributor Author

ahnlabb commented Sep 8, 2021

As @timholy is saying the idea is to give the JVM a chance to clean up as Julia is exiting and explicitly handle any issues that might arise during shutdown of the JVM. If this is not done properly there may be even more severe consequences than noise during exit, e.g. data corruption and confusing bugs, more on this later.

As outlined above: if -Xrs is not passed, Java takes all control of signals away from Julia invalidating guarantees that Julia programmers rely on. For example:

julia> try; sleep(1000); println("Done!"); finally; println("This is always printed!"); end

This code should always execute the finally clause. However, if you interrupt the above code after calling JavaCall.init() the following happens:

  • The JVM signal handlers catch the signal
  • The Java shutdown hooks are called
  • The process is terminated

Julia doesn't get back control and the finally clause is not executed, neither are any of Julia's exit hooks or internal signal handlers.

As explained in the prior discussion the reasonable solution is to pass -Xrs. However, now the situation is reversed, when exiting Julia the running JVM is not taken into account, JVM shutdown hooks are not called, threads are killed etc. While this is surely better (since Julia is the host) we can try to play nice by destroying on exit.

As an example of what could go wrong if we don't do our best to clean up you can imagine a logging library that uses BufferedWriters and closes them in a shutdown hook. Sometimes the writers will not flush and the logs will be incomplete potentially making it difficult to figure out the reason for a crash. Another scenario could be asynchronous writes to a socket or file that don't finish and silently corrupt data in ways that may be difficult to detect or reproduce.

@mkitti
Copy link
Member

mkitti commented Sep 8, 2021

To clarify, with JavaCall, there is no separate Java process. There are other packages that take that approach and use some kind of interprocess communication, but JavaCall runs the JVM in process. Since Java is running within the same process as Julia, there are has some implications such as shared signaling and need to copy stacks when switching Tasks.

The JVM has a DestroyJavaVM thread running in the background whose main purpose is to shutdown any remain threads on process exit. The potential problem here is if any of the Java threads become stalled, then trying to invoke DestroyJavaVM manually may end up stalling when what you ultimately want to do is just exit the process.

Additionally, note that if you try to exit from the Java side, Julia will also exit.

~$ JULIA_COPY_STACKS=1 julia --banner=no
julia> using JavaCall

julia> JavaCall.init()

julia> jls = @jimport java.lang.System
JavaObject{Symbol("java.lang.System")}

julia> jcall(jls, "exit", Nothing, (jint,), 0)

~$

Regarding my prior comment on signal chaining, what I mean is that we probably need to build a similar facility for Julia. The libjsig facility is meant for when the JVM is primary handler of signals. What libjsig does is allows other software to install other signal handlers downstream of the JVM signal handler. When the JVM signal handler decides that it does not need to handle a signal, it will pass it to other signal handlers. We need a libjuliasig that would act similarly but would let Julia intercept the signals first before handing them off. libjsig is licensed under GPL, so we probably would need a clean room implementation.

@mkitti
Copy link
Member

mkitti commented Sep 8, 2021

As explained in the prior discussion the reasonable solution is to pass -Xrs.

I think this is reasonable option for the end user, but it has implications that the end user needs to consider including significant performance impacts. What I find problematic is when a package takes the choice whether to use that option away from the end user.

What JavaCall.jl does is provides a central facility to separate the configuration and initialization stages so the user (or other packages) has the chance modify the configuration before initialization.

@mkitti
Copy link
Member

mkitti commented Sep 9, 2021

Here's an example of where atexit(JavaCall.destroy) would hang based on
https://support.hpe.com/hpesc/public/docDisplay?docId=mmr_kc-0117683

julia> using JavaCall

julia> begin
           JavaCall.init()
           jTimer = @jimport java.util.Timer
           jTimer()
           atexit(JavaCall.destroy)
       end

julia> exit() # hangs, Ctrl-C will terminate the process

This hangs for me on both Windows and Linux. It is supposed to. This is the documented behavior of DestroyJavaVM. It will wait for all non-daemon threads to exit.

If we make the java.util.Timer use a daemon thread, then the process will exit fine:

julia> using JavaCall

julia> begin
           JavaCall.init()
           jTimer = @jimport java.util.Timer
           jTimer((jboolean,), true)
           atexit(JavaCall.destroy)
       end

julia> exit() # actually exits

~$

If we run with -Xrs and put the timer on a non-daemon process as in the first example above, then we will not even be able to exit using Control-C.

julia> using JavaCall

julia> JavaCall.init("-Xrs")

julia> begin
           jTimer = @jimport java.util.Timer
           jTimer()
           atexit(JavaCall.destroy)
       end

julia> exit() # hangs, Ctrl-C does not terminate the process

My point here is that "-Xrs" and "JavaCall.destroy()" require some intention to be used properly and effectively. Without intention, the user can easily end up with a stuck process. At the JavaCall.jl level, I would prefer not to tie the hands of end applications or end users more than needed. A downstream package, JavaCallEasy.jl, could maybe do those things. Here the intention is to provide a basic application programming interface.

@ahnlabb
Copy link
Contributor Author

ahnlabb commented Sep 13, 2021

Additionally, note that if you try to exit from the Java side, Julia will also exit.

Yes, ideally I'd like for JavaCall to handle this as well (in the sense of allowing Julia exit hooks to run even if Java is exiting), however, it is generally a more conscious decision by the application/library developer to invoke System.exit so I'm less worried about this.

Regarding my prior comment on signal chaining, what I mean is that we probably need to build a similar facility for Julia.

I see! That could indeed be a solution. It would require diving pretty deep into the specifics of JVM signal handling though, to ensure that we don't invalidate the JIT's assumptions about signals.

What I find problematic is when a package takes the choice whether to use that option away from the end user.

I completely agree and maybe this is the central issue, any solution has to include an escape hatch. I'm not arguing that we should take options away from power users but that the defaults should nudge users to make sensible decisions on subtle issues. One possible solution would be a kwarg for init: install_signals=false that adds -Xrs if false or a corresponding global in JavaCall.

including significant performance impacts

Although I understand that you mean significant in the sense that JVM vendors have found it worthwhile to implement these optimizations, I think it's important to state for the record that the performance difference is neither large nor guaranteed to be positive.

The Openj9 docs state:

Disabling signal handling in the OpenJ9 VM reduces performance by approximately 2-4%, depending on the application.

Like with most JVM options, YMMV and you should benchmark your own workload. In my benchmarks, running parts of the renaissance suite using OpenJDK 11 on Linux, I get lower median duration for many benchmarks when running with -Xrs. I would not argue that this means that "-Xrs" leads to better performance as a general rule but anyone reading this in the future should not assume that running without "-Xrs" will boost performance.

This hangs for me on both Windows and Linux. It is supposed to.

Exactly, and as noted in the thread you linked:

The issue needs to be resolved by the application developers. Either they need to make sure that every non-daemon thread has exited before the main thread finishes, or all other threads need to be made daemon threads.

If we don't try to call DestroyJavaVM, we are hiding this fact from the developers, potentially corrupting their data without ever telling them that they have dangling threads. We should definitely try to do even better though, which is what I hope to discuss further in this issue. Could we try to exit with a timeout and then throw an exception for example?

Without intention, the user can easily end up with a stuck process.

If they have issues in their code they may become visible through a stuck process. The issues remain either way but are currently invisible to the developer. I'd prefer an error to a stuck process, but I'd prefer a stuck process to silently corrupt data. What should the default be? Can we improve on the situation with -Xrs and destroy?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants