Support >2GB of Tensor data in training checkpoint #20077
Conversation
Update the training checkpoint code to use an external file if data > ~2GB. I don't see a way for the flatbuffers 64-bit offsets to be used, as they don't support storing 'table' types with 64-bit offsets (and our Tensor is a 'table' type, not a simple struct): https://github.com/google/flatbuffers/blob/0cfb7eb80b05c058e19e50fb575263908e601469/tests/64bit/test_64bit.fbs#L38-L39

Allowing a Tensor to have its raw_data in an external file should hopefully work with the least friction. As it's an extra field, it's backwards compatible. The code was quickly hacked together this afternoon and is mainly intended as a starting point/rough example of how things _might_ be. 100% untested. Please feel free to suggest alternative approaches.

Side note: the diffs in the generated *.fbs.h files are unexpectedly large. Maybe they weren't re-generated when the new flatbuffers version was checked in. I updated by running `python .\compile_schema.py -f <build output dir>\_deps\flatbuffers-build\Debug\flatc.exe` from onnxruntime\core\flatbuffers\schema.
Add unit tests for external write/read in core code. These do not validate the output yet; the unit tests need to do more validation of the data read.
… + started writing unit test
partial review, will continue later.
### Description
Add a platform-aware helper to fetch the errno message string.
### Motivation and Context
For usage in #20077
Co-authored-by: Edward Chen <[email protected]>
… state, added checkpoint unit test
…ernal data + moved lambdas to helpers
Fix build errors. Fix unit tests.
Exclude tests that write ORT format data from minimal build. Minor cleanups.
…r consistency with the other generated filenames. Add generated file to lint exclusions.
It would be nice to also have a Python generate_artifacts test that can create the checkpoint with external data.
Having a Python test that loads the checkpoint might also be nice.
I think if this PR is getting too big, we can do that in a follow-up PR.
Let's do this in a separate PR next week to ensure these changes are in the release. If the additional testing reveals any issues, we can cherry-pick fixes for them.
I agree with a follow-up PR for the Python unit tests.
Notes for future work:
- Add unit tests for the Python generate_artifacts and checkpoint APIs for models with external data
- Look into changing spans of uint8_t to spans of std::byte for the ExternalDataReader and ExternalDataWriter
- Make the optional/nullable references to ExternalDataReader and ExternalDataWriter consistent among function flows
- Change the checkpoint test's expected parameter names into a `constexpr std::array<std::string_view>`
- Add the new generated flatbuffer schema files to the exclude patterns in the lintrunner config file, and fix the lines in checkpoint.cc and graph_flatbuffers_utils.cc that are over 120 characters
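Two of the follow-up items above can be sketched in a few lines. All names below are illustrative placeholders, not the actual onnxruntime declarations or real checkpoint parameter names:

```cpp
// Hedged sketch of two follow-up work items; names and values are hypothetical.
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

// Item: replace repeated string literals in the checkpoint test with a
// constexpr array of expected parameter names (these values are made up).
constexpr std::array<std::string_view, 4> kExpectedParamNames{
    "module.weight_1", "module.bias_1", "module.weight_2", "module.bias_2"};

// Item: prefer std::byte over uint8_t for raw byte buffers. With C++20 the
// reader/writer interfaces could take std::span<const std::byte>; pre-C++20,
// a pointer/length pair conveys the same intent, as in this toy helper:
inline std::size_t CountNonZeroBytes(const std::byte* data, std::size_t len) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < len; ++i) {
    if (data[i] != std::byte{0}) ++n;
  }
  return n;
}
```

Using std::byte makes it explicit that the buffer holds opaque storage rather than small integers, which matches how the external tensor data is treated.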
This reverts commit 9372e9a.
Description
Add ability to store initializer data in an external file.
Update training checkpoint code to use external file if data > ~2GB.
I don't see a way for the flatbuffers 64-bit offsets to be used, as they don't support storing 'table' types with 64-bit offsets (and our Tensor is a 'table' type not a simple struct).
https://github.com/google/flatbuffers/blob/0cfb7eb80b05c058e19e50fb575263908e601469/tests/64bit/test_64bit.fbs#L38-L39
Allowing a Tensor to have its raw_data in an external file should hopefully work with the least friction. As it's an extra field it's backwards compatible.
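The external raw_data idea above can be sketched as a writer that appends large tensor bytes to a side stream and records only an offset/length pair in the serialized table, keeping the main flatbuffer under the 2GB offset limit. The types and function names here (ExternalDataInfo, WriteExternalData, ReadExternalData) are hypothetical stand-ins, not the actual onnxruntime API:

```cpp
// Minimal sketch, not the real onnxruntime ExternalDataWriter/Reader API:
// spill raw_data to an external stream and keep only its location inline.
#include <cassert>
#include <cstdint>
#include <sstream>
#include <vector>

// Hypothetical record of where a tensor's bytes live in the external file.
struct ExternalDataInfo {
  int64_t offset;  // byte offset into the external file
  int64_t length;  // number of bytes
};

// Hypothetical writer: appends raw_data to `stream` and returns its location,
// which would be stored in the Tensor table instead of the inline bytes.
ExternalDataInfo WriteExternalData(std::ostream& stream,
                                   const std::vector<uint8_t>& raw_data) {
  const int64_t offset = static_cast<int64_t>(stream.tellp());
  stream.write(reinterpret_cast<const char*>(raw_data.data()),
               static_cast<std::streamsize>(raw_data.size()));
  return {offset, static_cast<int64_t>(raw_data.size())};
}

// Hypothetical reader: the inverse, used when loading the checkpoint.
std::vector<uint8_t> ReadExternalData(std::istream& stream,
                                      const ExternalDataInfo& info) {
  std::vector<uint8_t> bytes(static_cast<size_t>(info.length));
  stream.seekg(info.offset);
  stream.read(reinterpret_cast<char*>(bytes.data()),
              static_cast<std::streamsize>(info.length));
  return bytes;
}
```

Because the location record is an extra optional field on the Tensor table, old readers that never see it continue to work, which is what makes the change backwards compatible.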
Please feel free to suggest alternative approaches.
Side note: the diffs in the generated *.fbs.h files are unexpectedly large. Maybe they weren't re-generated when the new flatbuffers version was checked in. I updated by running `python .\compile_schema.py -f <build output dir>\_deps\flatbuffers-build\Debug\flatc.exe` from onnxruntime\core\flatbuffers\schema, which I thought was the correct way, but maybe that's out of date.
I think you can ignore all the diffs in the generated files and just worry about the changes to the .fbs files in onnxruntime/core/flatbuffers/schema. Basically, start at the bottom of the files changed and work up, as all the 'real' diffs are there.
Motivation and Context