Gateway sigsegv's when cleaning up channels using ca_clear_channel #1
Comments
Murali Shankar (mshankar) wrote on 2014-02-12: Results of "thread apply all bt" on a core.
Is there an update on this issue?
I'm afraid not. I was never able to reproduce the issue, and they never got back to me on it.
We had 3 events like this last month at SLAC. I'll try to narrow down the possible causes. So far I have identified that when the Gateway calls ca_clear_channel, the code in tcpiiu.cpp tries to remove an item from the ncui linked list, but at that point the list is already empty. So it looks like some other function hits a condition that empties this list first. We are currently using Gateway R2.1.2.0 and EPICS R7.0.3.1.
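To illustrate the failure mode described above, here is a minimal sketch of an intrusive doubly linked list in which a second code path empties the list before a removal runs. This is not the EPICS tsDLList code or the tcpiiu.cpp logic; the names Node, List, onList, removeUnchecked and removeChecked are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical types for illustration only; NOT the EPICS tsDLList source.
struct Node {
    Node* prev = nullptr;
    Node* next = nullptr;
    bool onList = false;  // membership flag, an assumption of this sketch
};

struct List {
    Node* first = nullptr;
    Node* last = nullptr;
    std::size_t count = 0;

    // Append n to the tail. Precondition: n is not already on a list.
    void add(Node& n) {
        assert(!n.onList);
        n.prev = last;
        n.next = nullptr;
        if (last) last->next = &n; else first = &n;
        last = &n;
        n.onList = true;
        ++count;
    }

    // Unchecked removal: trusts that n is currently a member.
    // If the list was emptied elsewhere, count underflows here; in real code
    // with stale or freed neighbours, the pointer writes themselves can fault.
    void removeUnchecked(Node& n) {
        if (n.next) n.next->prev = n.prev; else last = n.prev;
        if (n.prev) n.prev->next = n.next; else first = n.next;
        --count;
    }

    // Checked removal: leaves the list untouched if n is not a member.
    bool removeChecked(Node& n) {
        if (!n.onList || count == 0) return false;
        removeUnchecked(n);
        n.prev = n.next = nullptr;
        n.onList = false;
        return true;
    }

    // Empties the list, standing in for the "other" code path.
    void clear() {
        for (Node* p = first; p != nullptr; ) {
            Node* nx = p->next;
            p->prev = p->next = nullptr;
            p->onList = false;
            p = nx;
        }
        first = last = nullptr;
        count = 0;
    }
};
```

In this sketch, calling removeChecked on a node after clear() returns false and leaves the list consistent, whereas removeUnchecked would wrap count around. Whether a membership check like this is the right fix for the Gateway crash is an open question; finding the path that empties the list first would still be needed.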
Hi Márcio, Murali's original report from 2012 said the IOC involved was running on RTEMS 4.9.4. Does this crash happen with other IOCs running on other OSs? There are over 1000 threads in the full back-trace attached above, so the gateway was connected to over 500 IOCs at the time. I don't see any obvious smoking guns in it, but I wasn't really expecting to.
We've just seen this issue on our most heavily loaded gateway, although I believe it was running an older version of the GW code (2.0.something) and probably Base 3.15.5. I may still have access to the core file, but I don't have time to investigate it in detail right now, so I'm leaving this comment as a marker.
Apparently it was running Gateway version 2.0 built against base-3.14.12.5-static (on RHEL-6 or RHEL-7). We've bumped it to a newer version since.
Original LaunchPad Bug #1279147 reported by Murali Shankar on 2014-02-12:
At LCLS, the archiver appliances connect to the IOCs through a CA gateway. The gateway crashes once in a while. This does not seem to be related to an “out-of-memory” issue or a “Gateway has been running for a long time” issue. Instead, it seems to be related to the gateway cleaning up PVs (Feb 07 04:42) from an IOC that is CPU overloaded and keeps disconnecting (Feb 07 02:41).
From the gateway logs...
I have core dumps and am able to examine the variables etc., and the gateway is indeed trying to clean up the PVs from this IOC using ca_clear_channel. However, the crash occurs in a fundamental part of EPICS base (tsDLList.h:238). I can provide more details or the core if needed.
Regards,
Murali
More information
This is PV Gateway Version 2.0.3.0 [Mar 2 2012 09:46:57]
Gateway is built against base-R3-14-12 with a few patches applied (I can provide a full list if needed).
IOC eioc-und1-mp01 runs on RTEMS-4.9.4-slac_p0 on top of EPICS R3.14.12-SLAC_1 $Date 2010/11/27