[R] On-demand serialization + standardization of attributes #9924

david-cortes · 2023-12-26T12:03:30Z

This PR changes the serialization logic of xgb.Booster objects so that it triggers only on demand. It does so through an R ALTREP list class on which serialization methods are implemented. Not all of the expected ALTREP methods are implemented, so as to avoid unwanted conversions that might lose the serializers.

Since the serialization logic changes with this PR, the way in which attributes are kept in model objects also needed to be changed:

- Now there is a clear division between R-specific attributes, which can be arbitrary objects and are accessible and settable through attributes(model); and C-level attributes, which are kept in the model JSON and can be accessed and set through xgb.attributes(model).
- Some attributes which were previously part of the R class have now been moved to C-level attributes.
- As serialization is now on-demand and R attributes are optional, there's no further need for an xgb.Booster.handle class. I removed it as part of this PR, since there is no use case in which it would have an advantage over an xgb.Booster without optional attributes.

This separation means that now accessing attributes is not as simple as calling model$<attribute> - now one needs to explicitly use either attributes or xgb.attributes.
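As a language-neutral illustration of this division, here is a minimal Python sketch (it is not the R implementation). The class `ToyBooster` and both of its stores are hypothetical: `lang_attrs` stands in for R-level attributes holding arbitrary objects, while the core-level store stands in for C-level attributes, which are assumed here to be strings that travel with the model JSON.

```python
import json


class ToyBooster:
    """Hypothetical stand-in for a booster with two kinds of attributes."""

    def __init__(self):
        self.lang_attrs = {}   # arbitrary objects, like attributes(model) in R
        self._core_attrs = {}  # strings only, like xgb.attributes(model)

    def set_core_attr(self, key, value):
        # Assumption: core-level attributes are stored as strings.
        self._core_attrs[key] = str(value)

    def get_core_attr(self, key):
        return self._core_attrs.get(key)

    def to_json(self):
        # Only the core-level attributes are part of the exported JSON.
        return json.dumps({"attributes": self._core_attrs})


booster = ToyBooster()
booster.lang_attrs["evaluation_log"] = [{"iter": 1, "rmse": 0.5}]
booster.set_core_attr("best_iteration", 10)
exported = json.loads(booster.to_json())
```

In this toy, the evaluation log never reaches the exported JSON, which mirrors why the two kinds of attributes need separate accessors.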

Since the logic for keeping track of attributes was changed here, this PR required doing changes throughout pretty much all the R code. As it is right now, all of the tests are passing, but I am not entirely confident that this PR won't break something not covered in the tests, and I am not sure if I have updated all the docs that became outdated after these changes.

The changes here also now make xgboost incompatible with {caret}, so I've removed all the references to it. I also removed suggestions around it since this package is not in active development anymore as it was superseded by {tidymodels}. I haven't tested {mlr}, but I have a feeling it might also break.

A couple notes about the PR and about many things I noticed - would be ideal if maintainers could open independent issues about some of these if needed:

Prioritization

As this PR creates merge conflicts with all others and it's the hardest to keep track of, would be very helpful to merge it before others like inplace predict or quantile dmatrices or roxygen updates.

Shallow and deep copies

Before this PR, serialization and de-serialization were triggered multiple times throughout the runs of different functions. Doing this was very inefficient, and had the potential to create inconsistencies between the R attributes and the C booster.

After this PR, there are no such duplications - objects are updated in-place, at the C level. It's now also possible to incrementally update a booster in-place through xgb.train, which is controlled by an optional parameter training_continuation, but I wasn't sure how this parameter should play along with xgboost parameter process_type.

Note that there's one edge case that I didn't know how to cover: if the user sends an interrupt signal, a booster that's being modified in-place will be left in an inconsistent state between the R attributes (like evaluation log) and the C booster. The R attributes are nevertheless not used in predict or similar methods.

Since one might now need to make copies of the booster, a helper function xgb.copy.Booster was also implemented in the public interface. I could not find any C-level function to duplicate a booster so I used the ubj to-bytes serializer for it, which I am guessing is not the most efficient way.
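The difference between an alias and a copy made through a to-bytes round trip can be sketched as follows. This is a Python toy under stated assumptions, not the code of xgb.copy.Booster: `ToyBooster`, `copy_booster`, and the JSON byte format are hypothetical stand-ins for the booster and its ubj serializer.

```python
import json


class ToyBooster:
    """Hypothetical booster whose state is updated in place."""

    def __init__(self, trees=None):
        self.trees = list(trees or [])

    def update(self):
        # In-place update: adds one boosting round to this object.
        self.trees.append({"round": len(self.trees) + 1})

    def to_bytes(self):
        return json.dumps(self.trees).encode("utf-8")

    @classmethod
    def from_bytes(cls, raw):
        return cls(json.loads(raw.decode("utf-8")))


def copy_booster(booster):
    # Deep copy through a serialize/deserialize round trip.
    return ToyBooster.from_bytes(booster.to_bytes())


original = ToyBooster()
original.update()
alias = original                # same underlying object
clone = copy_booster(original)  # independent state
original.update()               # visible through alias, not through clone
```

The round trip costs a full serialization, which is the inefficiency mentioned above; a dedicated duplication function in the core library would avoid it.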

Information lost during serialization

It seems that the serialization functions that save models to disk (e.g. XGBoosterSaveModel), regardless of the format that they end up using (json/ubj), will never save custom user attributes that one sets to the booster outside of training, except for some particular attributes like 'niter'; unlike the serialization functions that save them to bytes like XGBoosterSaveModelToRaw which keep all of the custom attributes.

Additionally, these serializers also seem to lose feature names and types if they were set in the booster through XGBoosterSetStrFeatureInfo. As such, I modified some of the tests to manually remove the feature names for the comparisons, but I have a feeling that something here could be improved.

Serialization compatibility

After this PR, it will not be possible to load models that were saved with R serializers like saveRDS with a previous xgboost version, and it will not be possible to load models saved with saveRDS after this PR in previous xgboost versions. I updated the compatibility note to mention the breakpoint being xgboost version 2.1.0, which I suppose will be the next release.

There's a test which downloads serialized files from the internet and which will need to be updated. For now, I simply modified the test to skip the Rds files, but in the future would be more logical to update the files there.

I'm also thinking that it might be better to commit those model files here and bundle them in the R package instead of downloading them when the tests are run - if the files are updated now, unless the download link is changed, then the same test running on older xgboost versions will fail - will also make testing faster as there won't be a need to download files from the internet.

Serialization advice

XGBoost's own docs advise users to not use R-specific serializers, but lots of the functionalities from the public interface actually require using R serializers, such as function xgb.gblinear.history, which uses attributes from callbacks that aren't part of the standard C booster and thus do not get saved in functions like XGBoosterSaveModel or XGBoosterSaveModelToRaw. Moreover, from the point above, even C-level information is lost when using to-disk serializers, such as the feature names in the booster, which can also lead to very unexpected behaviors (e.g. they are saved with xgb.save.raw, but not with xgb.save).

I was thinking that perhaps the booster could add serialized versions of the R attributes (as obtained by base::serialize), but those raw bytes would not conform to a JSON-loadable string, would add latency and memory usage, are not usable in other interfaces, among other minuses.

It leaves me wondering what the actual suggestion in the docs should be, given that a user following that advice might find that things actually break after the fact, and there are advantages to using xgb.save.raw + R's writeBin compared to directly using xgb.save.

Plotting multi-valued trees

I am not sure if the plotting functionalities need any kind of adjustment for models like multi-quantile regression, multi-target objectives, and so on. I didn't add any modification here.

In the case of multi-quantile regression, functions seem to produce some output, but I am not sure about the correctness of such output, since there's one value per leaf visible there. Should this perhaps produce an error at the C++ level?

For multi-output regression, functions like xgb.dump will error out from the C side - I would guess that multi-quantile and multi-class might share some structural similarities but am not familiar with tree plotting.

Retrieving booster attributes

As part of this PR, I added internal functions to access fields that are part of the C booster, such as the booster type.

I implemented them by producing full dumps of the booster JSON through XGBoosterSaveJsonConfig, then parsing them with R's jsonlite, and accessing the field in that parsed JSON.

This is rather inefficient, but I couldn't find any C-level function to extract only one particular field from the booster JSON. Would be quite useful to add more functions to retrieve and set commonly used attributes like the booster type and the number of threads, or to retrieve only one particular attribute from the JSON if given a path like field1.subfield2.<etc>.
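The "parse the whole config, then pick one field" approach described above can be sketched in Python as follows. The helper `get_by_path` and the trimmed `toy_config` are hypothetical; a real config produced by XGBoosterSaveJsonConfig contains many more fields, and storing values as strings is an assumption of this toy.

```python
import json


def get_by_path(config_json, path):
    """Return the value at a dotted path such as 'a.b.c' in a JSON string."""
    node = json.loads(config_json)
    for key in path.split("."):
        node = node[key]  # raises KeyError if the path does not exist
    return node


# Hypothetical, heavily trimmed config used only for illustration.
toy_config = json.dumps(
    {
        "learner": {
            "gradient_booster": {
                "name": "gbtree",
                "gbtree_model_param": {"num_trees": "10"},
            }
        }
    }
)

booster_type = get_by_path(toy_config, "learner.gradient_booster.name")
num_trees = get_by_path(
    toy_config, "learner.gradient_booster.gbtree_model_param.num_trees"
)
```

Every lookup re-parses the full document, which is why a C-level accessor that takes such a path would be cheaper.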

Missing functions in core library

Both the R and Python interfaces now resort to parsing text dumps in order to extract information such as model coefficients in a linear booster. If such functionalities are going to be required throughout different interfaces, would be ideal to create C-level accessors for them that would for example avoid the loss of precision from the conversion from float to string and back.
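The precision concern can be reproduced with a small Python sketch. It assumes coefficients are stored as 32-bit floats and that a text dump prints a limited number of significant digits; the digit counts used here are illustrative and not taken from the actual dump format.

```python
import struct


def to_float32(x):
    """Round a Python float to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def text_round_trip(value, fmt):
    """Format a float32 value as text, then parse it back to float32."""
    return to_float32(float(fmt % value))


coef = to_float32(1.2345678)           # a coefficient as stored in float32
short = text_round_trip(coef, "%.6g")  # 6 significant digits: lossy here
full = text_round_trip(coef, "%.9g")   # 9 significant digits: exact
```

With six significant digits the parsed value no longer equals the stored coefficient, whereas nine significant digits are enough to round-trip a float32. A binary accessor would sidestep the question entirely.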

Furthermore, some of these parsings broke after adding feature names, and the regexes needed to be updated. I am not sure that I've pushed updates everywhere necessary.

Attribute 'niter'

Handling of this attribute was rather strange before this PR. This attribute was kept both in the booster attributes and in the R attributes, with the caveat that the C one used base-0 indexing, while the R one used base-1. Further, they were not updated in sync everywhere.

After this PR, I preferred to remove this attribute, sticking instead to 'nrounds' as returned by function XGBoosterBoostedRounds.

There is one particular issue with this function however: after calling XGBoosterSetParam on a fitted booster, depending on what the parameters there contain, function XGBoosterBoostedRounds will afterwards return zero, but the trees will still be there when looking at e.g. learner.gradient_booster.gbtree_model_param.num_trees in the JSON config.

Linear intercepts

When creating a gblinear booster, it will have two intercepts:

- One given by base_score.
- Another given by the last coefficient in the JSON.

Both of those intercepts are added together in the prediction, which is quite confusing. Would be ideal if this booster type could auto-adjust itself after the fact to contain only one intercept - for example by forcibly setting base_score to zero and adding it to the last coefficient.

It would also improve things for some objectives like multi:softmax, which gets an automatically added base_score of 0.5 that, logically speaking, does not help with convergence / loss decrease.
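The suggested adjustment (moving base_score into the last coefficient) can be sketched in Python as follows. This toy assumes an identity link, so that base_score is added directly on the prediction scale; the function names are hypothetical and this is not how the library computes predictions internally.

```python
def predict_linear(features, coefs, intercept, base_score):
    """Raw prediction of a linear model with two additive intercepts."""
    margin = sum(w * x for w, x in zip(coefs, features))
    return margin + intercept + base_score


def fold_base_score(intercept, base_score):
    """Move base_score into the intercept, leaving base_score at zero."""
    return intercept + base_score, 0.0


coefs = [0.5, -2.0]
row = [4.0, 1.0]
intercept, base_score = 0.25, 0.5

before = predict_linear(row, coefs, intercept, base_score)
new_intercept, new_base = fold_base_score(intercept, base_score)
after = predict_linear(row, coefs, new_intercept, new_base)
```

Predictions are unchanged, but the model now carries a single intercept that can be read from one place.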

Legacy R code

Changing things throughout multiple places made me notice that the logic for callbacks is very hard to follow and rather unidiomatic - for example, it requires defining variables inside the function call that do not seem to be used and one cannot tell from a look at that function if they could be removed or not. I am guessing it will also be the hardest thing to review in this PR, and the one that's most prone to errors.

Since these callbacks are part of the public interface (meaning: the user can pass custom callbacks), it'd be ideal to:

- Document them (current docs do not cover key aspects like the contents of the environments).
- Change the logic from using environments where variables are assigned to arbitrary names (e.g. the user might not know that there's a bst variable in the environment that needs to be accessed), towards using function keyword arguments.
  - Ideally, there should be a factory function like xgb.Callback that would take arguments like before_training, after_training, before_iteration, after_iteration, each taking functions expecting a defined function signature.
  - It should also standardize the logic for keeping things from a callback in booster attributes or not.
- Avoid usage of global variables and <<- setters - better to keep a shared R environment-class variable for such purposes.
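A keyword-argument design of the kind proposed above might look like the following Python sketch. Everything in it is hypothetical: the `Callback` container, the hook signatures, and the `toy_train` loop are illustrations of the idea, not an existing or planned xgboost API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Callback:
    """Hypothetical container; every hook receives explicit arguments."""

    before_training: Optional[Callable[[dict], None]] = None
    before_iteration: Optional[Callable[[dict, int], None]] = None
    after_iteration: Optional[Callable[[dict, int], None]] = None
    after_training: Optional[Callable[[dict], None]] = None


def toy_train(model, nrounds, callbacks):
    """Toy loop showing where each hook would be invoked."""
    for cb in callbacks:
        if cb.before_training:
            cb.before_training(model)
    for iteration in range(1, nrounds + 1):
        for cb in callbacks:
            if cb.before_iteration:
                cb.before_iteration(model, iteration)
        model["rounds"] = iteration  # stand-in for one boosting step
        for cb in callbacks:
            if cb.after_iteration:
                cb.after_iteration(model, iteration)
    for cb in callbacks:
        if cb.after_training:
            cb.after_training(model)
    return model


events = []
logger = Callback(
    before_training=lambda model: events.append("start"),
    after_iteration=lambda model, i: events.append(f"iter {i}"),
    after_training=lambda model: events.append("end"),
)
result = toy_train({"rounds": 0}, nrounds=2, callbacks=[logger])
```

Because each hook receives the model and the iteration explicitly, a user does not need to know which variable names exist in a shared environment.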

R-specific functionality

I also noticed that there are some functionalities in the R interface that the Python one lacks, such as function xgb.gblinear.history. I think this function could also be moved into a C-level functionality to keep that coefficient history as an internal booster property.

trivialfis · 2023-12-26T15:41:48Z

Thank you for introducing the new serialization! This is a major upgrade to the existing interface and I'm excited about it.

I will look into the PR deeper in the coming days and learn more about the changes.

Since this is related to serialization, I will assist in writing some testing guidelines. As mentioned earlier, serialization has been a footgun and we still have issues with it today. I would like to be extra careful from the beginning.

trivialfis · 2023-12-26T15:43:04Z

Related #9908

david-cortes · 2023-12-26T21:51:55Z

Looks like the windows build from buildkite is still using R 3.6:

-- Found LibR: C:/Program Files/R/R-3.6.3
...
C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0667cc80194b35b5c-1\xgboost\xgboost-ci-windows\R-package\src\xgboost_R.cc(679,27): error C3861: 'R_make_altlist_class': identifier not found [C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0667cc80194b35b5c-1\xgboost\xgboost-ci-windows\build\R-package\xgboost-r.vcxproj]
C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0667cc80194b35b5c-1\xgboost\xgboost-ci-windows\R-package\src\xgboost_R.cc(681,3): error C3861: 'R_set_altlist_Elt_method': identifier not found [C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0667cc80194b35b5c-1\xgboost\xgboost-ci-windows\build\R-package\xgboost-r.vcxproj]

hcho3 · 2023-12-26T22:02:09Z

What's the minimum R version we need for ALTREP? I can update the R version on the Windows side.

R-package/R/utils.R

R-package/src/xgboost_R.cc

david-cortes · 2023-12-26T22:13:11Z

> What's the minimum R version we need for ALTREP? I can update the R version on the Windows side.

It's R 4.3.

hcho3 · 2023-12-26T22:17:14Z

> I'm also thinking that it might be better to commit those model files here and bundle them in the R package instead of downloading them when the tests are run - if the files are updated now, unless the download link is changed, then the same test running on older xgboost versions will fail - will also make testing faster as there won't be a need to download files from the internet.

Currently, we have a policy of not including binary files in the git repository, since git isn't suited for handling diffs in binary (non-text) files. Some alternatives:

- Use Git LFS. This can get expensive pretty fast, as GitHub charges per storage and per bandwidth.
- Keep hosting model files externally, but version the files (by adding a suffix), so that old tests don't break when we update the model files later.

hcho3 · 2023-12-26T22:20:23Z

Does it mean that the new XGBoost R package will only be compatible with R 4.3+ ? We probably should document the requirement.

david-cortes · 2023-12-26T22:21:53Z

> Does it mean that the new XGBoost R package will only be compatible with R 4.3 ? We probably should document the requirement.

It's already stated in the DESCRIPTION file, so a user installing it with install.packages will not be able to install the new xgboost version in an older R version.

hcho3 · 2023-12-26T22:28:27Z

Sorry, I missed the conversation in #9847, as I was gone for vacation. I will go ahead and update the R version in the Windows CI pipeline.

mayer79 · 2023-12-27T16:13:22Z

Awesome work!

I noted that we will have a mix of non-Markdown and Markdown help files in "xgb.Booster.R" and "xgb.model.dt.tree.R".

E.g., backticks instead of \code, two asterisks instead of \bold.

Furthermore, I am avoiding long titles, @title tags, @description tags (except when the description has multiple sections).

Of course, I can go over these files after merging. Currently, I am not working on the help files until most of the PRs are merged, to avoid such conflicts.

david-cortes · 2023-12-27T18:24:09Z

> > I'm also thinking that it might be better to commit those model files here and bundle them in the R package instead of downloading them when the tests are run - if the files are updated now, unless the download link is changed, then the same test running on older xgboost versions will fail - will also make testing faster as there won't be a need to download files from the internet.
>
> Currently, we have a policy of not including binary files in the git repository, since git isn't suited for handling diffs in binary (non-text) files. Some alternatives:
>
> - Use Git LFS. This can get expensive pretty fast, as GitHub charges per storage and per bandwidth.
> - Keep hosting model files externally, but version the files (by adding suffix), so that old tests don't break when we update the model files later.

There's also the option of pre-downloading it as part of the R package build script; and/or to look for the file under somewhere like inst when running tests without creating a package plus adding it to .gitignore, and save the file there instead of tmp if it detects such a folder.

hcho3

No objections from me

trivialfis

Huge thanks for the work on serialization! It will take some time to process it (I'm still on PTO). Can we extract some of the unrelated changes and merge them first?

In addition, a brief introduction to how it works, as a code comment, would be nice. Something that can get newcomers up to speed.

R-package/src/xgboost_R.cc

david-cortes · 2024-01-10T18:03:29Z

But that's also the case for UBJ serialization:

We have two distinguished cases as mentioned previously:

saving the model.

saving everything.

The xgb.save saves the model, and nothing else. It's used to "export" a model for other tasks like inference, explanation. xgb.serialize is used to save everything, that the user can continue training without any intervention, including things like whether to use GPU.

Thanks for pointing this out.

What would be the python equivalent of this function?

trivialfis · 2024-01-10T18:03:50Z

pickle.

trivialfis · 2024-01-10T18:05:24Z

specifically, __setstate__ and __getstate__.
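For readers unfamiliar with these hooks, the following Python sketch shows how they make serialization happen only on demand. `LazyBooster` is a hypothetical toy, not the xgboost `Booster` class. The sketch triggers the hooks through `copy.deepcopy`, which uses the same `__getstate__` / `__setstate__` protocol as pickle.

```python
import copy
import json


class LazyBooster:
    """Hypothetical wrapper: state is serialized only when requested."""

    serialize_calls = 0  # counts how often serialization actually ran

    def __init__(self, weights):
        self._weights = list(weights)  # stand-in for the native handle

    def update(self, delta):
        # In-place update; note that nothing is serialized here.
        self._weights = [w + delta for w in self._weights]

    def __getstate__(self):
        # Invoked by pickle/copy: convert the live state to bytes on demand.
        LazyBooster.serialize_calls += 1
        return {"raw": json.dumps(self._weights).encode("utf-8")}

    def __setstate__(self, state):
        # Invoked on load: rebuild the live state from the bytes.
        self._weights = json.loads(state["raw"].decode("utf-8"))


booster = LazyBooster([1.0, 2.0])
booster.update(0.5)
calls_before = LazyBooster.serialize_calls  # still zero: no copy or pickle yet
restored = copy.deepcopy(booster)
calls_after = LazyBooster.serialize_calls
```

Updates never touch the serializer; it runs once, at the moment a copy or a pickle is requested, which is the behavior the ALTREP class aims for on the R side.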

trivialfis · 2024-01-10T18:08:41Z

I'm quite excited about this PR, and am comfortable with merging it as far as the interface change is concerned; this way we can unblock other PRs and avoid too much rebase work. Do you want to merge it and leave the RDS issue in a follow-up PR or do you want to have it solved in this one (I can help with the serialization)?

david-cortes · 2024-01-10T18:16:57Z

I've added a small note about the R attribute params.

I am wondering however if there's some C-level function that would automatically gather and serialize all C attributes that aren't handled by XGBoosterSaveModelToBuffer.

This step could be done in the same ALTREP serializer if we know what else needs to be there, but I see that the python class has many things that it keeps in attributes and I'm not sure which ones are C things that need to be re-assigned. Does it need to also gather xgb.parameters, for example? (there's currently no getter for that in R).

> Do you want to merge it and leave the RDS issue in a follow-up PR

Let's better do it in this PR to keep all serialization-related things together.

trivialfis · 2024-01-10T18:21:44Z

@david-cortes See #9924 (comment) .

XGBoosterSerializeToBuffer
XGBoosterUnserializeFromBuffer

It saves everything that's defined in C, including parameters and attributes. It's used by the now removed xgb.serialize R function, and:

xgboost/python-package/xgboost/core.py

Line 1811 in 01c4711

_check_call(_LIB.XGBoosterUnserializeFromBuffer(handle, ptr, length))

trivialfis · 2024-01-10T18:22:13Z

I think we can use it here: https://github.com/dmlc/xgboost/pull/9924/files#r1445141247 .

david-cortes · 2024-01-10T18:31:08Z

Changed it to use XGBoosterSerializeToBuffer and XGBoosterUnserializeFromBuffer instead of XGBoosterSaveModelToBuffer.

Seems to pass the tests and work as expected.

hcho3 · 2024-01-10T18:47:26Z

We should probably retain the note about backward compatibility: if saveRDS is used, the model may not be fully usable in a future version of XGBoost.

david-cortes · 2024-01-10T18:52:52Z

> We should probably retain the note about backward compatibility: if saveRDS is used, the model may not be fully usable in a future version of XGBoost.

Is that because XGBoosterSerializeToBuffer doesn't have such guarantee? If so, what happens when a user tries to load an incompatible model? Is there such an incompatible serialized booster that I could download from somewhere?

trivialfis · 2024-01-10T18:56:41Z

When loading from a different version of XGBoost, the internal parameters are automatically discarded, in which case, it's the same as xgb.save.
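The behavior described here can be pictured with a toy Python sketch. It is only an illustration under assumptions: the payload layout, the function names, and the exact-match version check are all hypothetical, and the real rule applied by XGBoosterUnserializeFromBuffer may differ.

```python
CURRENT_VERSION = (2, 1, 0)  # hypothetical version of the running library


def serialize_all(model, config, version=CURRENT_VERSION):
    """Toy 'save everything' payload: model plus internal parameters."""
    return {"version": list(version), "model": model, "config": config}


def unserialize(payload, running_version=CURRENT_VERSION):
    """Load a payload, dropping the config if the versions differ."""
    model = payload["model"]
    if tuple(payload["version"]) == tuple(running_version):
        return model, payload["config"]
    # Different version: keep the model, discard internal parameters.
    return model, {}


payload = serialize_all({"trees": 3}, {"nthread": "4"}, version=(2, 0, 0))
model_same, config_same = unserialize(payload, running_version=(2, 0, 0))
model_other, config_other = unserialize(payload, running_version=(2, 1, 0))
```

In the mismatched case only the model survives, which is what makes the result equivalent to a plain model export.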

david-cortes · 2024-01-10T18:57:42Z

> When loading from a different version of XGBoost, the internal parameters are automatically discarded, in which case, it's the same as xgb.save.

Ok, then I guess it's no big deal, and the compatibility note makes sense to keep.

trivialfis · 2024-01-10T18:57:44Z

See #9734 (comment)

> Ok, then I guess it's no big deal, and the compatibility note makes sense to keep.

sounds good!

david-cortes · 2024-01-10T19:21:52Z

> Then let's drop the warning about saveRDS ?

I've rewritten the compatibility note to reflect the current situation per my understanding. Would be ideal if you and @trivialfis could take a look at it. It doesn't have a warning to not use R's serializers, but describes a bit the differences.

I think it's still important to have it, since now we have an incompatibility with models that were created before this PR, and an incompatibility with older versions of qs which is also used for serialization.

R-package/R/utils.R

hcho3 · 2024-01-10T19:27:39Z

Thanks for the new wording. I recognize that some use cases do call for the use of saveRDS.

R-package/R/utils.R

Co-authored-by: Jiaming Yuan <[email protected]>

trivialfis

Looks good!

david-cortes added 13 commits December 26, 2023 13:00

on-demand serialization, refactor of attributes

09694ac

solve merge conflicts

ad6490b

export function for getting booster rounds

27bbdbc

linter

88dd947

fix incorrect qualifiers

4a3b5e2

Merge branch 'master' into altrep

e2331c3

remove all references to caret package

147e1cd

fix example

e012cce

misc fixes

f444812

allow unsetting booster info

2f30031

remove unused argument

6d4ad8b

more fixes

b0054be

missing import

4050b6f

david-cortes mentioned this pull request Dec 26, 2023

Dumps from multi-quantile models include only first quantile #9926

Closed

hcho3 requested changes Dec 26, 2023

david-cortes added 2 commits December 27, 2023 19:12

swap 'static' with 'namespace'

2e16f73

improve wording on compatibility note

70affd5

hcho3 approved these changes Dec 27, 2023

trivialfis reviewed Dec 28, 2023

add note about booster's R parameters

692e5a5

user SerializeToBuffer for internal serialization

c161999

david-cortes added 2 commits January 10, 2024 19:33

add test for serialization of config

feedce5

check more attributes

a02abfc

rewrite compatibility note for serialization

3285ed6

improve wording

8082256

hcho3 reviewed Jan 10, 2024

update note about attributes in xgb.save

6fa7937

trivialfis reviewed Jan 10, 2024

david-cortes and others added 3 commits January 10, 2024 20:44

Update R-package/R/utils.R

e02ed8f

Co-authored-by: Jiaming Yuan <[email protected]>

Update R-package/R/utils.R

ff70221

Co-authored-by: Jiaming Yuan <[email protected]>

rebuild docs

d133258

trivialfis approved these changes Jan 10, 2024

trivialfis merged commit d3a8d28 into dmlc:master Jan 10, 2024
28 checks passed
