Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Java SWIG - how to extract featureNames? #2814

Closed
AlbertoEAF opened this issue Feb 24, 2020 · 10 comments
Closed

Java SWIG - how to extract featureNames? #2814

AlbertoEAF opened this issue Feb 24, 2020 · 10 comments

Comments

@AlbertoEAF
Copy link
Contributor

Hello,

I'm having issues calling the SWIG wrappers in Java to extract the model feature names.
I get a segmentation fault when calling LGBM_BoosterGetFeatureNames.

In the Java-exposed lightgbmJNI.LGBM_BoosterGetFeatureNames method, the parameters are:

LGBM_BoosterGetFeatureNames(long boosterHandle, long outNumFeaturesPtr, String[] strings)
(https://lightgbm.readthedocs.io/en/latest/C-API.html#c.LGBM_BoosterGetFeatureNames)

The issue is that LGBM_BoosterGetFeatureNames requires an already allocated array of strings on the C side, which I cannot do if I pass a java String[] object. This then results in a memory segfault.

I've seen this comment related to implementing typemaps in SWIG for trickier cases #909 (comment).

I also saw some notices in the swig interface file:
/* Note: there is a bug in the SWIG generated string arrays when creating a new array with strings where the strings are prematurely deallocated */ %array_functions(char *, stringArray)

Am I using the API incorrectly or currently this function cannot be used?

Thank you! :D

@AlbertoEAF
Copy link
Contributor Author

I really need to use that function, so I made an extension to the lightgbmlib.i SWIG interface as an MVP just to make sure I can extract the feature names in my project until I get a reply from you on how to do it.

Disclaimer
I didn't care at all for building clean code, just implementing an MVP that works correctly so I can use LGBM_BoosterGetFeatureNames reliably. This is not production quality code, I could clean it up quite a bit if you find it useful.
A properly working String[] typemap would also be much better, but I have little to no experience with SWIG yet to do one.

To avoid touching any existing code at all, I added a single line at the end of lightgbmlib.i %include "char_array_helpers.i" that adds a series of custom_... functions with to the exported lightgbmlib and a variant of LGBM_BoosterGetFeatureNames that works for me without crashes. You can find the code in this branch: https://github.com/AlbertoEAF/LightGBM/tree/feat/swig-strings

In that interface file I create a class FeatureNames + typedef FeatureNamesHandle that manages an array of fixed length strings.

I then build a wrapper to the LGBM_BoosterGetFeatureNames that creates a new FeatureNames instance from which the user can extract the feature names:

This is how you use it:

`

public String[] getBoosterFeatureNames() {
    final int MAX_FEATURE_NAME_SIZE = 512; // Will segfault if a string has a name larger than this.
    long swigFeatureNamesHandleOutPtr = lightgbmlibJNI.voidpp_handle();
    long swigFeatureNamesHandle;

    try {
        int retcodeLGBM = lightgbmlibJNI.custom_new_LGBM_BoosterGetFeatureNames(
                swig.swigBoosterHandle,
                getBoosterNumFeatures(),
                MAX_FEATURE_NAME_SIZE,
                swigFeatureNamesHandleOutPtr // Returns a newly allocated FeatureNames handle
        );
        if (retcodeLGBM == -1) {
            logger.error("Couldn't read feature names");
            throw new LightGBMException();
        }
        swigFeatureNamesHandle = lightgbmlibJNI.voidpp_value(swigFeatureNamesHandleOutPtr); // Get the handle
    }
    finally {
        lightgbmlibJNI.delete_voidpp(swigFeatureNamesHandleOutPtr);
    }

    String[] featureNames = new String[getBoosterNumFeatures()];
    for(int i = 0; i < getBoosterNumFeatures(); ++i) {
        featureNames[i] = lightgbmlibJNI.custom_get_feature_name(swigFeatureNamesHandle, i);
        logger.debug("LightGBM model feature " + i + ": " + featureNames[i]);
    }
    logger.trace("Deallocating feature names.");
    lightgbmlibJNI.custom_FeatureNamesFree(swigFeatureNamesHandle);

    return featureNames;
}`

@StrikerRUS
Copy link
Collaborator

ping @imatiach-msft

@imatiach-msft
Copy link
Contributor

@AlbertoEAF yes, some of the SWIG wrappers don't work because SWIG can't handle string arrays automatically, I only ensured the ones that are used by mmlspark work. For this method, we will need to add another wrapper similar to this custom one that I added in lightgbmlib.i:

https://github.com/microsoft/LightGBM/blob/master/swig/lightgbmlib.i#L76

  char ** LGBM_BoosterGetEvalNamesSWIG(BoosterHandle handle,
                                       int eval_counts) {
    char** dst = new char*[eval_counts];
    for (int i = 0; i < eval_counts; ++i) {
      dst[i] = new char[128];
    }
    int result = LGBM_BoosterGetEvalNames(handle, &eval_counts, dst);
    if (result != 0) {
      return nullptr;
    }
    return dst;
  }

@imatiach-msft
Copy link
Contributor

@AlbertoEAF just read your other comment. I like this style too, I'm fine either way, but we should make sure that LGBM_BoosterGetEvalNamesSWIG is written in a consistent way too then. If you could either re-write the above to be in the same style or rewrite LGBM_BoosterGetEvalNamesSWIG to be in the style above that would be great, I just want to make sure the code is consistent.

@imatiach-msft
Copy link
Contributor

note to self - it looks like this array is not freed in mmlspark after we allocate it in mmlspark when calling LGBM_BoosterGetEvalNamesSWIG:

https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/TrainUtils.scala#L166

@AlbertoEAF
Copy link
Contributor Author

Nice!

Yeah, I've seen other cases where MMLSpark does not delete every resource it creates, it's not just that one, there are other memory leaks.

If you're fine with it I'll clean my code up, address LGBM_BoosterGetEvalNamesSWIG as well and I'll send you a pull request where you're free to tell me if you want me to change anything and I'll follow your suggestions.

I think cleaning up the FeatureNames (FixedLengthStringArray class) and passing it around as a first class citizen whenever we need arrays of strings without SWIG messing it up is a good approach, so we can manage memory properly.
I'll rename FeatureNames -> FixedLengthStringArray or some other nicer name which is more generic too.

I'm also going to send the SWIG guys an issue regarding what's the proper wrapping for arrays of strings so we follow their recommended best practices and possibly simplify our wrapping of string arrays with maybe a fancy typemap.

I have my main work to do but should have something next week or the other after @imatiach-msft if that's ok for you, unless you plan to address that sooner.

@imatiach-msft
Copy link
Contributor

@AlbertoEAF that sounds terrific, thanks, I'm looking forward to your PR

@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented Mar 1, 2020

Hello @imatiach-msft,

If you look at their answer, there's a module to handle char**<->String[] conversion (various.i) which is already in use, and works as long as the last entry in char** is a NULL pointer.

However, unless we want to rewrite our C API (I don't), we have the extra constraint of needing to pass previously-allocated memory on the C side.

Proposed solution summary

Thus, I propose that we stick to the custom FixedLenStringArray class (previously called FeatureNames) which will manage memory resources for arrays of strings and we can pass it around in C++ and Java, but use the various.i extensions on top of it to extract String[] objects easily.

Proposed solution specifics

We can create wrappers to the original functions by replacing the char ** out parameters with pointers to FixedLenStringArray objects. We pass those by pointer, allocated from outside or initialized inside the wrappers and returned by pointer (as in the first comment).

We'll then add a method to the class which returns the raw char ** ptr:

char ** FixedLenStringArray::get_cstring_array_ptr()

as long as that array is null-terminated (OK, because the class can handle that logic), and thus, getting the full String[] in Java will be one call, no need to do a for loop to extract the parameters like in the opening comment for this issue.

What do you think?

Extra questions

  1. In your code snippet above should we handle errors with the API_BEGIN/END macros and return the success code 0/-1 as in the rest of the C API? I still don't understand when a LGBM_* method is impossible to fail and thus you can use the return value for what you want instead of the error code.

  2. I found 4 functions (beside the wrapped one) in the C API dealing with char**, 2 of which are setters, and other 2 which are getters. The setters should work with String[] directly (as you already use the various.i extension), thus you want only the 2 getters wrapped right?

  3. What should be the behaviour if allocating the array of strings is not possible for lack of memory? (std::bad_alloc), should we let it crash? Or handle and set a message error and return -1?

@AlbertoEAF
Copy link
Contributor Author

Ok, I'm submitting a pull request, I'll move the discussion there.

@imatiach-msft
Copy link
Contributor

@AlbertoEAF this looks great, terrific work

"unless we want to rewrite our C API (I don't)"
agreed, we need to be careful to touch the C API as little as possible, especially those methods used by python/R and which may be used by those who use the C API directly too

"we stick to the custom FixedLenStringArray class..."
sounds great

Extra questions

1.) not sure yet
2.) sure, we can start with the two getters for now
3.) I thought this behavior was already defined in the comments of the C API, return -1 on failure and set an error message
https://github.com/microsoft/LightGBM/blob/master/include/LightGBM/c_api.h#L277

@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants