ref PULSEDEV-36792 openml-lightgbm: Identify and fix non-closed SWIG objects #123

ghost · 2022-06-01T16:02:10Z

Because we believe this non-closed/destroyed C++ objects were causing the segmentation faults

Mostly authored by @gandola

codecov · 2022-06-01T16:18:12Z

Codecov Report

Merging #123 (80ba4ca) into master (76c4b30) will increase coverage by 0.22%.
The diff coverage is 81.25%.

❗ Current head 80ba4ca differs from pull request most recent head 3b96f41. Consider uploading reports for the commit 3b96f41 to get more accurate results

@@             Coverage Diff              @@
##             master     #123      +/-   ##
============================================
+ Coverage     80.14%   80.37%   +0.22%     
- Complexity      428      430       +2     
============================================
  Files            43       43              
  Lines          1516     1513       -3     
  Branches        138      138              
============================================
+ Hits           1215     1216       +1     
+ Misses          224      222       -2     
+ Partials         77       75       -2

Impacted Files	Coverage Δ
...tgbm/LightGBMBinaryClassificationModelTrainer.java	`86.56% <80.00%> (-0.30%)`	⬇️
...eedzai/openml/provider/lightgbm/SWIGResources.java	`84.40% <100.00%> (ø)`
...eedzai/openml/provider/lightgbm/SWIGTrainData.java	`100.00% <0.00%> (+9.52%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72801a1...3b96f41. Read the comment docs.

gandola · 2022-06-01T20:11:08Z

...main/java/com/feedzai/openml/provider/lightgbm/LightGBMBinaryClassificationModelTrainer.java

@@ -313,7 +313,6 @@ private static void setLightGBMDatasetLabelData(final SWIGTrainData swigTrainDat
            throw new LightGBMException();
        }

-        swigTrainData.destroySwigTrainLabelDataArray();


I think we shouldn't remove this line, instead we should:

swigTrainData.initSwigTrainLabelDataArray(); // Init and copy from chunked data. try { (...) } finally { swigTrainData.destroySwigTrainLabelDataArray(); }

The reason is because initSwigTrainLabelDataArray() actually creates a new array and clones the chuncks, and if I understood well the code, if we call this method twice without calling destroySwigTrainLabelDataArray we will leak an array of data.

We can think of an improvement by having a closeable class for this case too.

Yes this is also the same situation as above @mlobofeedzai and should not be removed for the same reasons. I'd say, put a comment in the same manner as in the other case so there are no doubts as to why it's there.

I agree with @gandola , changing this to a try/finally is more idiomatic and better behaved in case of prior exceptions 👌.

gandola · 2022-06-01T20:27:06Z

...main/java/com/feedzai/openml/provider/lightgbm/LightGBMBinaryClassificationModelTrainer.java

        }

-        swigTrainData.initSwigDatasetHandle();
-        swigTrainData.destroySwigTrainFeaturesChunkedDataArray();


@AlbertoEAF is there any reason for calling destroySwigTrainFeaturesChunkedDataArray() here? it looks to be an array that lives within the entire lifecycle of a SWIGTrainData instance and it is released in the close method.

If that is the case and to simplify and be more safe I would remove this destroySwigTrainFeaturesChunkedDataArray() because it may not be necessary and can cause confusion.

Yes, it's wrong to delete this line. It'll unnecessarily duplicate RAM usage on the heaviest part of the job. Maybe add an inline comment with ("// Critical to save RAM. Do not remove!") along with a comment above explaining why it's here in the first place.

Transferring train data to a LightGBM Dataset has several stages:

Transfer data from a Pulse (OpenML) Dataset iterator (which doesn't know how large the dataset is) instance by instance to our ChunkedArrays. A ChunkedArray works as dynamically growing array of buffers automatically managed on the C++ side.

Create the LGBM Dataset data structure from such ChunkedArrays (buffers). // In this stage we achieve peak RAM usage (2x copies of the dataset => ~2x memory usage)

Delete the buffers as soon as possible (destroySwigTrainFeaturesChunkedDataArray()). Now we only have the train data on a LightGBM Dataset. (we're back to only one copy of the train dataset 👍 )

Finally we can train the LGBM model from the LGBM Dataset, which can still use ample memory during train, but won't ever reach the 2x memory usage factor.

AlbertoEAF · 2022-06-02T13:57:02Z

...main/java/com/feedzai/openml/provider/lightgbm/LightGBMBinaryClassificationModelTrainer.java

+                logger.error("Could not create LightGBM dataset.");
+                throw new LightGBMException();
+            }
+            // FIXME is this init necessary?


Yes this is necessary. It's in here we pass from a void ** to a LGBM DatasetHandle (https://lightgbm.readthedocs.io/en/latest/C-API.html#c.DatasetHandle) so we can pass it around with SWIG.

…objects

ghost · 2022-06-03T12:56:02Z

The last commit broke the build. Address later

ghost requested a review from AlbertoEAF June 1, 2022 16:02

gandola reviewed Jun 1, 2022

View reviewed changes

AlbertoEAF reviewed Jun 2, 2022

View reviewed changes

ghost force-pushed the ft-ml-PULSEDEV-36792-improve-lgbm-trainer branch 2 times, most recently from 48f9bdf to 80ba4ca Compare June 3, 2022 10:34

MLobo1997 added 4 commits June 3, 2022 12:21

ref PULSEDEV-36792 openml-lightgbm: Fix segmentation fault

a48323b

ref PULSEDEV-36792 openml-lightgbm: Identify and fix non-closed SWIG …

71fd8e5

…objects

ref PULSEDEV-36792 openml-lightgbm: Fix tests with unreleased resources

bd3fe9f

ref PULSEDEV-36792 openml-lightgbm: Improvement

3b96f41

ghost force-pushed the ft-ml-PULSEDEV-36792-improve-lgbm-trainer branch from 80ba4ca to 3b96f41 Compare June 3, 2022 12:54

AndreFCruz mentioned this pull request Jul 7, 2022

FairGBM openml API implementation #116

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ref PULSEDEV-36792 openml-lightgbm: Identify and fix non-closed SWIG objects #123

ref PULSEDEV-36792 openml-lightgbm: Identify and fix non-closed SWIG objects #123

ghost commented Jun 1, 2022 •

edited by ghost

Loading

codecov bot commented Jun 1, 2022 •

edited

Loading

gandola Jun 1, 2022

AlbertoEAF Jun 2, 2022 •

edited

Loading

gandola Jun 1, 2022

AlbertoEAF Jun 2, 2022 •

edited

Loading

AlbertoEAF Jun 2, 2022

ghost commented Jun 3, 2022

ref PULSEDEV-36792 openml-lightgbm: Identify and fix non-closed SWIG objects #123

Are you sure you want to change the base?

ref PULSEDEV-36792 openml-lightgbm: Identify and fix non-closed SWIG objects #123

Conversation

ghost commented Jun 1, 2022 • edited by ghost Loading

codecov bot commented Jun 1, 2022 • edited Loading

Codecov Report

gandola Jun 1, 2022

Choose a reason for hiding this comment

AlbertoEAF Jun 2, 2022 • edited Loading

Choose a reason for hiding this comment

gandola Jun 1, 2022

Choose a reason for hiding this comment

AlbertoEAF Jun 2, 2022 • edited Loading

Choose a reason for hiding this comment

AlbertoEAF Jun 2, 2022

Choose a reason for hiding this comment

ghost commented Jun 3, 2022

ghost commented Jun 1, 2022 •

edited by ghost

Loading

codecov bot commented Jun 1, 2022 •

edited

Loading

AlbertoEAF Jun 2, 2022 •

edited

Loading

AlbertoEAF Jun 2, 2022 •

edited

Loading