Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ref PULSEDEV-36792 openml-lightgbm: Identify and fix non-closed SWIG objects #123

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ghost
Copy link

@ghost ghost commented Jun 1, 2022

Because we believe this non-closed/destroyed C++ objects were causing the segmentation faults

Mostly authored by @gandola

@ghost ghost requested a review from AlbertoEAF June 1, 2022 16:02
@codecov
Copy link

codecov bot commented Jun 1, 2022

Codecov Report

Merging #123 (80ba4ca) into master (76c4b30) will increase coverage by 0.22%.
The diff coverage is 81.25%.

❗ Current head 80ba4ca differs from pull request most recent head 3b96f41. Consider uploading reports for the commit 3b96f41 to get more accurate results

@@             Coverage Diff              @@
##             master     #123      +/-   ##
============================================
+ Coverage     80.14%   80.37%   +0.22%     
- Complexity      428      430       +2     
============================================
  Files            43       43              
  Lines          1516     1513       -3     
  Branches        138      138              
============================================
+ Hits           1215     1216       +1     
+ Misses          224      222       -2     
+ Partials         77       75       -2     
Impacted Files Coverage Δ
...tgbm/LightGBMBinaryClassificationModelTrainer.java 86.56% <80.00%> (-0.30%) ⬇️
...eedzai/openml/provider/lightgbm/SWIGResources.java 84.40% <100.00%> (ø)
...eedzai/openml/provider/lightgbm/SWIGTrainData.java 100.00% <0.00%> (+9.52%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72801a1...3b96f41. Read the comment docs.

@@ -313,7 +313,6 @@ private static void setLightGBMDatasetLabelData(final SWIGTrainData swigTrainDat
throw new LightGBMException();
}

swigTrainData.destroySwigTrainLabelDataArray();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't remove this line, instead we should:

 swigTrainData.initSwigTrainLabelDataArray(); // Init and copy from chunked data.
 try {
    (...)
 } finally {
    swigTrainData.destroySwigTrainLabelDataArray();
 }

The reason is because initSwigTrainLabelDataArray() actually creates a new array and clones the chuncks, and if I understood well the code, if we call this method twice without calling destroySwigTrainLabelDataArray we will leak an array of data.

We can think of an improvement by having a closeable class for this case too.

Copy link
Contributor

@AlbertoEAF AlbertoEAF Jun 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is also the same situation as above @mlobofeedzai and should not be removed for the same reasons. I'd say, put a comment in the same manner as in the other case so there are no doubts as to why it's there.

I agree with @gandola , changing this to a try/finally is more idiomatic and better behaved in case of prior exceptions 👌.

}

swigTrainData.initSwigDatasetHandle();
swigTrainData.destroySwigTrainFeaturesChunkedDataArray();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlbertoEAF is there any reason for calling destroySwigTrainFeaturesChunkedDataArray() here? it looks to be an array that lives within the entire lifecycle of a SWIGTrainData instance and it is released in the close method.

If that is the case and to simplify and be more safe I would remove this destroySwigTrainFeaturesChunkedDataArray() because it may not be necessary and can cause confusion.

Copy link
Contributor

@AlbertoEAF AlbertoEAF Jun 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's wrong to delete this line. It'll unnecessarily duplicate RAM usage on the heaviest part of the job. Maybe add an inline comment with ("// Critical to save RAM. Do not remove!") along with a comment above explaining why it's here in the first place.

Transferring train data to a LightGBM Dataset has several stages:

  1. Transfer data from a Pulse (OpenML) Dataset iterator (which doesn't know how large the dataset is) instance by instance to our ChunkedArrays. A ChunkedArray works as dynamically growing array of buffers automatically managed on the C++ side.
  2. Create the LGBM Dataset data structure from such ChunkedArrays (buffers). // In this stage we achieve peak RAM usage (2x copies of the dataset => ~2x memory usage)
  3. Delete the buffers as soon as possible (destroySwigTrainFeaturesChunkedDataArray()). Now we only have the train data on a LightGBM Dataset. (we're back to only one copy of the train dataset 👍 )

Finally we can train the LGBM model from the LGBM Dataset, which can still use ample memory during train, but won't ever reach the 2x memory usage factor.

logger.error("Could not create LightGBM dataset.");
throw new LightGBMException();
}
// FIXME is this init necessary?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is necessary. It's in here we pass from a void ** to a LGBM DatasetHandle (https://lightgbm.readthedocs.io/en/latest/C-API.html#c.DatasetHandle) so we can pass it around with SWIG.

@ghost ghost force-pushed the ft-ml-PULSEDEV-36792-improve-lgbm-trainer branch 2 times, most recently from 48f9bdf to 80ba4ca Compare June 3, 2022 10:34
@ghost ghost force-pushed the ft-ml-PULSEDEV-36792-improve-lgbm-trainer branch from 80ba4ca to 3b96f41 Compare June 3, 2022 12:54
@ghost
Copy link
Author

ghost commented Jun 3, 2022

The last commit broke the build. Address later

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants