Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support >2GB of Tensor data in training checkpoint #20077

Merged
merged 27 commits into from
Apr 22, 2024

Conversation

skottmckay
Copy link
Contributor

@skottmckay skottmckay commented Mar 26, 2024

Description

Add ability to store initializer data in an external file.
Update training checkpoint code to use external file if data > ~2GB.

I don't see a way for the flatbuffers 64-bit offsets to be used, as they don't support storing 'table' types with 64-bit offsets (and our Tensor is a 'table' type not a simple struct).

https://github.com/google/flatbuffers/blob/0cfb7eb80b05c058e19e50fb575263908e601469/tests/64bit/test_64bit.fbs#L38-L39

Allowing a Tensor to have its raw_data in an external file should hopefully work with the least friction. As it's an extra field it's backwards compatible.

Please feel free to suggest alternative approaches.

Side note: the diffs in the generated *.fbs.h files are unexpectedly large. Maybe they weren't re-generated when the new flatbuffers version was checked in. I updated by running:
python .\compile_schema.py -f <build output dir>\_deps\flatbuffers-build\Debug\flatc.exe
from onnxruntime\core\flatbuffers\schema which I thought was the correct way but maybe that's out of date.

I think you can ignore all the diffs in the generated files and just worry about the changes to the .fbs files in onnxruntime/core/flatbuffers/schema. Basically start at the bottom of the files changed and work up as all the 'real' diffs are there.

Motivation and Context

Update training checkpoint code to use external file if data > ~2GB.

I don't see a way for the flatbuffers 64-bit offsets to be used as they don't support storing 'table' types with 64-bit offsets (and our Tensor is a 'table' type not a simple struct).

https://github.com/google/flatbuffers/blob/0cfb7eb80b05c058e19e50fb575263908e601469/tests/64bit/test_64bit.fbs#L38-L39

Allowing a Tensor to have its raw_data in an external file should hopefully work with the least friction. As it's an extra field it's backwards compatible.

Code is quickly hacked this afternoon and is mainly intended as a starting point/rough example of how things _might_ be. 100% untested.

Please feel free to suggest alternative approaches.

Side note: the diffs in the generated *.fbs.h files are unexpectedly large. Maybe they weren't re-generated when the new flatbuffers version was checked in. I updated by running

`python .\compile_schema.py -f <build output dir>\_deps\flatbuffers-build\Debug\flatc.exe` from onnxruntime\core\flatbuffers\schema.
onnxruntime/core/flatbuffers/schema/ort.fbs Show resolved Hide resolved
onnxruntime/core/graph/graph_flatbuffers_utils.h Outdated Show resolved Hide resolved
onnxruntime/core/graph/graph_flatbuffers_utils.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
skottmckay and others added 3 commits March 27, 2024 18:48
Add unit tests for external write/read in core code. THESE DO NOT VALIDATE THE OUTPUT YET.
  - unit tests need to do more validation of data read
@carzh carzh changed the title WIP/RFC: Support >2GB of Tensor data in training checkpoint Support >2GB of Tensor data in training checkpoint Apr 10, 2024
@carzh carzh marked this pull request as ready for review April 10, 2024 17:11
@carzh carzh requested a review from a team as a code owner April 10, 2024 17:11
Copy link
Contributor

@edgchen1 edgchen1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial review, will continue later.

onnxruntime/core/graph/graph_flatbuffers_utils.cc Outdated Show resolved Hide resolved
onnxruntime/core/graph/graph_flatbuffers_utils.cc Outdated Show resolved Hide resolved
onnxruntime/core/graph/graph_flatbuffers_utils.cc Outdated Show resolved Hide resolved
onnxruntime/core/graph/graph_flatbuffers_utils.cc Outdated Show resolved Hide resolved
onnxruntime/core/graph/graph_flatbuffers_utils.cc Outdated Show resolved Hide resolved
onnxruntime/test/ir/flatbuffer_utils_test.cc Outdated Show resolved Hide resolved
onnxruntime/test/ir/flatbuffer_utils_test.cc Outdated Show resolved Hide resolved
onnxruntime/test/ir/flatbuffer_utils_test.cc Outdated Show resolved Hide resolved
onnxruntime/test/ir/flatbuffers_utils_test.fbs Outdated Show resolved Hide resolved
onnxruntime/test/ir/flatbuffer_utils_test.cc Outdated Show resolved Hide resolved
skottmckay added a commit that referenced this pull request Apr 17, 2024
### Description
<!-- Describe your changes. -->
Add platform aware helper to fetch errno message string.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
For usage in #20077

---------

Co-authored-by: Edward Chen <[email protected]>
onnxruntime/core/graph/graph_flatbuffers_utils.cc Outdated Show resolved Hide resolved
onnxruntime/core/graph/graph_flatbuffers_utils.cc Outdated Show resolved Hide resolved
onnxruntime/core/graph/graph_flatbuffers_utils.cc Outdated Show resolved Hide resolved
onnxruntime/test/ir/flatbuffers_utils_test.fbs Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
onnxruntime/test/ir/flatbuffer_utils_test.cc Outdated Show resolved Hide resolved
onnxruntime/test/ir/flatbuffer_utils_test.cc Outdated Show resolved Hide resolved
onnxruntime/test/ir/flatbuffer_utils_test.cc Outdated Show resolved Hide resolved
onnxruntime/test/ir/flatbuffer_utils_test.cc Outdated Show resolved Hide resolved
Exclude tests that write ORT format data from minimal build.
Minor cleanups.
…r consistency with the other generated filenames.

Add generated file to lint exclusions.
onnxruntime/core/flatbuffers/schema/compile_schema.py Outdated Show resolved Hide resolved
onnxruntime/core/flatbuffers/schema/ort.fbs Outdated Show resolved Hide resolved
onnxruntime/core/graph/graph_flatbuffers_utils.cc Outdated Show resolved Hide resolved
onnxruntime/core/graph/graph_flatbuffers_utils.cc Outdated Show resolved Hide resolved
onnxruntime/core/graph/graph_flatbuffers_utils.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
orttraining/orttraining/training_api/checkpoint.cc Outdated Show resolved Hide resolved
Copy link
Contributor

@baijumeswani baijumeswani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have a python generate artifact test as well that can create the checkpoint with external data.
Also having the python test to load the checkpoint might be nice to have.
I think if this PR is getting too big, we can get it in follow-up PR.

@skottmckay
Copy link
Contributor Author

It would be nice to have a python generate artifact test as well that can create the checkpoint with external data.

Let's do this in a separate PR next week to ensure these changes are in the release. If the additional testing reveals any issues we can cherry pick fixes for them.

carzh
carzh previously approved these changes Apr 20, 2024
Copy link
Contributor

@carzh carzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with follow-up PR for the Python unit tests.

Notes for future work:

  • Add unit tests for the Python generate_artifacts and checkpoint API's for models with external data
  • Look into changing spans of uint8_t to spans of std::byte for the ExternalDataReader and ExternalDataWriter
  • Make the optional / nullable references to ExternalDataReader and ExternalDataWriter consistent among function flows
  • Change checkpoint test expected parameter names into a constexpr std::array<std::string_view>
  • Add the new generated flatbuffer schema files to the exclude patterns in the lintrunner config file + fix the lines in checkpoint.cc and graph_flatbuffers_utils.cc that are > 120 characters

@carzh carzh merged commit 9372e9a into main Apr 22, 2024
91 of 94 checks passed
@carzh carzh deleted the skottmckay/AddExternalDataOffsetToFlatbufferTensor branch April 22, 2024 22:17
mszhanyi pushed a commit that referenced this pull request Apr 23, 2024
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request May 7, 2024
### Description
<!-- Describe your changes. -->
Add platform aware helper to fetch errno message string.


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
For usage in microsoft#20077

---------

Co-authored-by: Edward Chen <[email protected]>
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request May 7, 2024
### Description
<!-- Describe your changes. -->
Add ability to store initializer data in an external file.
Update training checkpoint code to use external file if data > ~2GB.

I don't see a way for the flatbuffers 64-bit offsets to be used, as they
don't support storing 'table' types with 64-bit offsets (and our Tensor
is a 'table' type not a simple struct).


https://github.com/google/flatbuffers/blob/0cfb7eb80b05c058e19e50fb575263908e601469/tests/64bit/test_64bit.fbs#L38-L39

Allowing a Tensor to have its raw_data in an external file should
hopefully work with the least friction. As it's an extra field it's
backwards compatible.

Please feel free to suggest alternative approaches. 

Side note: the diffs in the generated *.fbs.h files are unexpectedly
large. Maybe they weren't re-generated when the new flatbuffers version
was checked in. I updated by running:
`python .\compile_schema.py -f <build output
dir>\_deps\flatbuffers-build\Debug\flatc.exe`
from onnxruntime\core\flatbuffers\schema which I thought was the correct
way but maybe that's out of date.

I think you can ignore all the diffs in the generated files and just
worry about the changes to the .fbs files in
onnxruntime/core/flatbuffers/schema. Basically start at the bottom of
the files changed and work up as all the 'real' diffs are there.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

---------

Co-authored-by: carzh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants