ref PULSEDEV-36792 openml-lightgbm: Identify and fix non-closed SWIG objects #123
base: master
Conversation
Codecov Report
@@ Coverage Diff @@
## master #123 +/- ##
============================================
+ Coverage 80.14% 80.37% +0.22%
- Complexity 428 430 +2
============================================
Files 43 43
Lines 1516 1513 -3
Branches 138 138
============================================
+ Hits 1215 1216 +1
+ Misses 224 222 -2
+ Partials 77 75 -2
Continue to review full report at Codecov.
@@ -313,7 +313,6 @@ private static void setLightGBMDatasetLabelData(final SWIGTrainData swigTrainDat
         throw new LightGBMException();
     }

-    swigTrainData.destroySwigTrainLabelDataArray();
I think we shouldn't remove this line; instead we should do:
swigTrainData.initSwigTrainLabelDataArray(); // Init and copy from chunked data.
try {
(...)
} finally {
swigTrainData.destroySwigTrainLabelDataArray();
}
The reason is that initSwigTrainLabelDataArray()
actually creates a new array and clones the chunks and, if I understood the code correctly, calling this method twice without calling destroySwigTrainLabelDataArray()
in between will leak an array of data.
As an improvement, we could have a closeable class for this case too.
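To make the "closeable class" idea concrete, here is a minimal sketch of wrapping the init/destroy pair in an AutoCloseable so the destroy call always runs, even on exceptions. SwigTrainData below is a simplified stand-in for the real SWIG-generated class, not its actual implementation:

```java
// Hypothetical stand-in for the SWIG-generated class (the real one
// manages native memory; here a flag tracks the array's lifetime).
class SwigTrainData {
    boolean labelArrayAlive = false;
    void initSwigTrainLabelDataArray() { labelArrayAlive = true; }     // alloc + copy from chunks
    void destroySwigTrainLabelDataArray() { labelArrayAlive = false; } // free the native copy
}

// AutoCloseable wrapper: init in the constructor, destroy in close(),
// so try-with-resources guarantees the pair is balanced.
final class SwigTrainLabelArray implements AutoCloseable {
    private final SwigTrainData data;

    SwigTrainLabelArray(SwigTrainData data) {
        this.data = data;
        data.initSwigTrainLabelDataArray();
    }

    @Override
    public void close() {
        data.destroySwigTrainLabelDataArray();
    }
}
```

Usage would then be `try (SwigTrainLabelArray labels = new SwigTrainLabelArray(swigTrainData)) { ... }`, which behaves like the try/finally suggestion above but can't be forgotten at call sites.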
Yes, this is the same situation as above, @mlobofeedzai, and the line should not be removed, for the same reasons. I'd say put a comment in the same manner as in the other case so there are no doubts as to why it's there.
I agree with @gandola: changing this to a try/finally is more idiomatic and better behaved in case of prior exceptions 👌.
    }

    swigTrainData.initSwigDatasetHandle();
    swigTrainData.destroySwigTrainFeaturesChunkedDataArray();
@AlbertoEAF is there any reason for calling destroySwigTrainFeaturesChunkedDataArray()
here? It looks to be an array that lives throughout the entire lifecycle of a SWIGTrainData
instance, and it is released in the close
method.
If that is the case, then to simplify and be safer I would remove this destroySwigTrainFeaturesChunkedDataArray()
call, because it may not be necessary and can cause confusion.
Yes, it's wrong to delete this line. It would unnecessarily duplicate RAM usage on the heaviest part of the job. Maybe add an inline comment ("// Critical to save RAM. Do not remove!") along with a comment above explaining why it's here in the first place.
Transferring train data to a LightGBM Dataset has several stages:
- Transfer data from a Pulse (OpenML) Dataset iterator (which doesn't know how large the dataset is) instance by instance to our ChunkedArrays. A ChunkedArray works as a dynamically growing array of buffers, automatically managed on the C++ side.
- Create the LGBM Dataset data structure from those ChunkedArrays (buffers). In this stage we reach peak RAM usage (2 copies of the dataset => ~2x memory usage).
- Delete the buffers as soon as possible (destroySwigTrainFeaturesChunkedDataArray()). Now we only have the train data in a LightGBM Dataset (we're back to only one copy of the train dataset 👍).

Finally we can train the LGBM model from the LGBM Dataset, which can still use ample memory during train, but won't ever reach the 2x memory usage factor.
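The staged flow above can be sketched in plain Java. This is only an illustration of the memory behavior: the real ChunkedArray lives on the C++ side behind SWIG, and the class below (chunk size, method names) is an assumption, not the actual wrapper:

```java
import java.util.ArrayList;
import java.util.List;

// Toy ChunkedArray: a dynamically growing list of fixed-size buffers.
// Stage 1 fills it instance by instance; stage 2 coalesces it into one
// flat array (peak memory: ~2x); stage 3 releases the chunks ASAP.
class ChunkedArray {
    private final List<double[]> chunks = new ArrayList<>();
    private final int chunkSize;
    private double[] current;
    private int pos;

    ChunkedArray(int chunkSize) {
        this.chunkSize = chunkSize;
        grow();
    }

    private void grow() {
        current = new double[chunkSize];
        chunks.add(current);
        pos = 0;
    }

    // Stage 1: append one value, growing a new chunk when the current one is full.
    void add(double v) {
        if (pos == chunkSize) grow();
        current[pos++] = v;
    }

    // Stage 2: build the flat copy (this is where both copies coexist).
    double[] coalesce() {
        double[] out = new double[(chunks.size() - 1) * chunkSize + pos];
        int i = 0;
        for (double[] c : chunks) {
            int n = (c == current) ? pos : chunkSize;
            System.arraycopy(c, 0, out, i, n);
            i += n;
        }
        return out;
    }

    // Stage 3: drop the buffers as soon as the flat copy exists,
    // analogous to destroySwigTrainFeaturesChunkedDataArray().
    void release() {
        chunks.clear();
        current = null;
    }
}
```

In the real code the "flat copy" is the LightGBM Dataset built on the native side; the point of the sketch is only that releasing the chunks right after stage 2 is what brings memory back down to one copy.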
        logger.error("Could not create LightGBM dataset.");
        throw new LightGBMException();
    }
    // FIXME is this init necessary?
Yes, this is necessary. It's here that we convert from a void **
to an LGBM DatasetHandle
(https://lightgbm.readthedocs.io/en/latest/C-API.html#c.DatasetHandle) so we can pass it around with SWIG.
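For context, the pattern is that the C API writes the new DatasetHandle through a void** out-parameter, and the wrapper must read that pointer back into a handle it can pass around. A hedged illustration, where a one-element array stands in for SWIG's void** pointer-to-pointer type and all names are assumptions, not the real SWIG helper calls:

```java
// Illustration only: Object[1] plays the role of SWIG's void**
// (SWIGTYPE_p_p_void); the real wrapper uses SWIG-generated helpers.
class DatasetHandleDemo {
    // Allocate the out-parameter slot (like a voidpp-handle helper would).
    static Object[] allocHandleOut() {
        return new Object[1];
    }

    // Stand-in for an LGBM_DatasetCreateFromXxx call that writes the new
    // dataset pointer through the void** and returns 0 on success
    // (LightGBM's C API convention).
    static int createDataset(Object[] out) {
        out[0] = new Object();
        return 0;
    }

    // Dereference the out-parameter to get the usable DatasetHandle --
    // this read-back is the "init" step the FIXME asks about.
    static Object deref(Object[] out) {
        return out[0];
    }
}
```

Without the dereference step the wrapper would only hold the pointer-to-pointer, not the handle itself, which is why the init can't be removed.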
Force-pushed from 48f9bdf to 80ba4ca, then from 80ba4ca to 3b96f41.
The last commit broke the build. Address later.
Because we believe these non-closed/non-destroyed C++ objects were causing the segmentation faults.
Mostly authored by @gandola