Support >2GB of Tensor data in training checkpoint #20077
Conversation
Update the training checkpoint code to use an external file if data > ~2GB. I don't see a way for the flatbuffers 64-bit offsets to be used, as they don't support storing 'table' types with 64-bit offsets (and our Tensor is a 'table' type, not a simple struct): https://github.com/google/flatbuffers/blob/0cfb7eb80b05c058e19e50fb575263908e601469/tests/64bit/test_64bit.fbs#L38-L39

Allowing a Tensor to have its raw_data in an external file should hopefully work with the least friction. As it's an extra field, it's backwards compatible. The code was quickly hacked together this afternoon and is mainly intended as a starting point/rough example of how things _might_ be. 100% untested. Please feel free to suggest alternative approaches.

Side note: the diffs in the generated *.fbs.h files are unexpectedly large. Maybe they weren't re-generated when the new flatbuffers version was checked in. I updated by running `python .\compile_schema.py -f <build output dir>\_deps\flatbuffers-build\Debug\flatc.exe` from onnxruntime\core\flatbuffers\schema.
Add unit tests for external write/read in core code. These do not validate the output yet; the unit tests need to do more validation of the data read.
… + started writing unit test
partial review, will continue later.
### Description
Add a platform-aware helper to fetch the errno message string.
### Motivation and Context
For usage in #20077
Co-authored-by: Edward Chen <[email protected]>
… state, added checkpoint unit test
…ernal data + moved lambdas to helpers
Fix build errors. Fix unit tests.
Exclude tests that write ORT format data from minimal build. Minor cleanups.
…r consistency with the other generated filenames. Add generated file to lint exclusions.
It would be nice to also have a Python generate_artifacts test that can create the checkpoint with external data.
Having a Python test that loads the checkpoint might also be nice.
I think if this PR is getting too big, we can do that in a follow-up PR.
Let's do this in a separate PR next week to ensure these changes are in the release. If the additional testing reveals any issues, we can cherry-pick fixes for them.
I agree with a follow-up PR for the Python unit tests.
Notes for future work:
- Add unit tests for the Python generate_artifacts and checkpoint APIs for models with external data
- Look into changing spans of uint8_t to spans of std::byte for the ExternalDataReader and ExternalDataWriter
- Make the optional/nullable references to ExternalDataReader and ExternalDataWriter consistent among function flows
- Change the checkpoint test's expected parameter names into a `constexpr std::array<std::string_view>`
- Add the new generated flatbuffer schema files to the exclude patterns in the lintrunner config file, and fix the lines in checkpoint.cc and graph_flatbuffers_utils.cc that are over 120 characters
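Two of the follow-up items above can be sketched in a few lines. All names below are illustrative placeholders, not the actual onnxruntime declarations or real checkpoint parameter names:

```cpp
// Hedged sketch of two follow-up work items; names and values are hypothetical.
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

// Item: replace repeated string literals in the checkpoint test with a
// constexpr array of expected parameter names (these values are made up).
constexpr std::array<std::string_view, 4> kExpectedParamNames{
    "module.weight_1", "module.bias_1", "module.weight_2", "module.bias_2"};

// Item: prefer std::byte over uint8_t for raw byte buffers. With C++20 the
// reader/writer interfaces could take std::span<const std::byte>; pre-C++20,
// a pointer/length pair conveys the same intent, as in this toy helper:
inline std::size_t CountNonZeroBytes(const std::byte* data, std::size_t len) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < len; ++i) {
    if (data[i] != std::byte{0}) ++n;
  }
  return n;
}
```

Using std::byte makes it explicit that the buffer holds opaque storage rather than small integers, which matches how the external tensor data is treated.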
This reverts commit 9372e9a.
Description
Add ability to store initializer data in an external file.
Update training checkpoint code to use external file if data > ~2GB.
I don't see a way for the flatbuffers 64-bit offsets to be used, as they don't support storing 'table' types with 64-bit offsets (and our Tensor is a 'table' type not a simple struct).
https://github.com/google/flatbuffers/blob/0cfb7eb80b05c058e19e50fb575263908e601469/tests/64bit/test_64bit.fbs#L38-L39
Allowing a Tensor to have its raw_data in an external file should hopefully work with the least friction. As it's an extra field it's backwards compatible.
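The external raw_data idea above can be sketched as a writer that appends large tensor bytes to a side stream and records only an offset/length pair in the serialized table, keeping the main flatbuffer under the 2GB offset limit. The types and function names here (ExternalDataInfo, WriteExternalData, ReadExternalData) are hypothetical stand-ins, not the actual onnxruntime API:

```cpp
// Minimal sketch, not the real onnxruntime ExternalDataWriter/Reader API:
// spill raw_data to an external stream and keep only its location inline.
#include <cassert>
#include <cstdint>
#include <sstream>
#include <vector>

// Hypothetical record of where a tensor's bytes live in the external file.
struct ExternalDataInfo {
  int64_t offset;  // byte offset into the external file
  int64_t length;  // number of bytes
};

// Hypothetical writer: appends raw_data to `stream` and returns its location,
// which would be stored in the Tensor table instead of the inline bytes.
ExternalDataInfo WriteExternalData(std::ostream& stream,
                                   const std::vector<uint8_t>& raw_data) {
  const int64_t offset = static_cast<int64_t>(stream.tellp());
  stream.write(reinterpret_cast<const char*>(raw_data.data()),
               static_cast<std::streamsize>(raw_data.size()));
  return {offset, static_cast<int64_t>(raw_data.size())};
}

// Hypothetical reader: the inverse, used when loading the checkpoint.
std::vector<uint8_t> ReadExternalData(std::istream& stream,
                                      const ExternalDataInfo& info) {
  std::vector<uint8_t> bytes(static_cast<size_t>(info.length));
  stream.seekg(info.offset);
  stream.read(reinterpret_cast<char*>(bytes.data()),
              static_cast<std::streamsize>(info.length));
  return bytes;
}
```

Because the location record is an extra optional field on the Tensor table, old readers that never see it continue to work, which is what makes the change backwards compatible.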
Please feel free to suggest alternative approaches.
Side note: the diffs in the generated *.fbs.h files are unexpectedly large. Maybe they weren't re-generated when the new flatbuffers version was checked in. I updated by running `python .\compile_schema.py -f <build output dir>\_deps\flatbuffers-build\Debug\flatc.exe` from onnxruntime\core\flatbuffers\schema, which I thought was the correct way, but maybe that's out of date.
I think you can ignore all the diffs in the generated files and just worry about the changes to the .fbs files in onnxruntime/core/flatbuffers/schema. Basically, start at the bottom of the files changed and work up, as all the 'real' diffs are there.
Motivation and Context