[Draft] Implement Checkpoint APIs #13561

Closed
jyoungyun wants to merge 17 commits from draft/implement_checkpoint

Conversation

jyoungyun
Contributor

No description provided.

@jyoungyun jyoungyun force-pushed the draft/implement_checkpoint branch from cc026ce to 63b876f Compare August 2, 2024 06:08
@jyoungyun
Contributor Author

jyoungyun commented Aug 5, 2024

Export test

  • Command
    $ ./Product/out/bin/onert_train --modelfile ../Personal/models/mobilenet_v2/keras_model/random_init.circle \
    --load_input:raw ../Personal/models/data/imagenet_a/test.input.100.bin \
    --load_expected:raw ../Personal/models/data/imagenet_a/test.output.100.bin \
    --epoch 1 --batch_size 10 --loss 1 --optimizer 1 --loss_reduction_type 1 \
    --learning_rate 0.01 --num_of_trainable_ops -1 \
    --export_checkpoint test.ckpt
    
  • Result
    • Header
      $ xxd -u -l 16 test.ckpt
      00000000: AD01 0100 D4D3 D400 94A7 A901 547B 7E02  ............T{~.
      
    • n_buffers
      $ xxd -u -s 16 -l 4 test.ckpt
      00000010: 6E00 0000                                n... // 110
      
    • n_buffers offsets (Little endian)
      $ xxd -u -s 20 -l 440 test.ckpt
      00000014: CC01 0000 CC37 0000 CC3A 0000 CC82 0000  .....7...:...... // 460 14284 15052
      00000024: 4C83 0000 4CE3 0000 CCE3 0000 CC43 0100  L...L........C..
      00000034: 4C44 0100 4CA4 0100 CCA4 0100 CC04 0200  LD..L...........
      00000044: 4C07 0200 4C67 0200 CC69 0200 CCC9 0500  L...Lg...i......
      00000054: CC89 0600 CC98 0600 CCF8 0900 CC58 0D00  .............X..
      ...
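
The decimal annotations on the offset table can be double-checked with a few lines of Python. This is only a sketch: the bytes are copied from the `xxd` dump above, and the idea that the first buffer begins right after the offset table is an observation from this one file, not a confirmed property of the checkpoint format.

```python
import struct

# First four entries of the offset table, copied from the xxd dump above.
raw = bytes.fromhex("CC010000 CC370000 CC3A0000 CC820000")

# "<4I" = four little-endian unsigned 32-bit integers.
offsets = struct.unpack("<4I", raw)
print(offsets)  # (460, 14284, 15052, 33484)

# The header dump is 16 bytes, n_buffers takes 4 bytes, and the table holds
# n_buffers (110) entries of 4 bytes each, so the table ends at byte 460.
table_end = 16 + 4 + 4 * 110
print(table_end == offsets[0])  # True
```

In this dump the first offset (460) is exactly where the table ends, which suggests the tensor data follows the table directly.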
      

ONE-DCO-1.0-Signed-off-by: Jiyoung Yun <[email protected]>
@jyoungyun
Contributor Author

Load test

  • Command
    $ ./Product/out/bin/onert_train --modelfile ../Personal/models/mobilenet_v2/keras_model/random_init.circle \
    --load_input:raw ../Personal/models/data/imagenet_a/test.input.100.bin \
    --load_expected:raw ../Personal/models/data/imagenet_a/test.output.100.bin \
    --epoch 1 --batch_size 10 --loss 1 --optimizer 1 --loss_reduction_type 1 \
    --learning_rate 0.01 --num_of_trainable_ops -1 \
    --checkpoint test.ckpt
  • Result
    • Header
      • Tensor
        (gdb) p _header
        $4 = {magic = 429, schema = 1 '\001', reserved = 0 '\000', opt1_offset = 13947860, opt2_offset = 27895700, other_offset = 41843540, 
          length = 110}
        
      • Checkpoint file
        $ xxd -u -l 20 test.ckpt
        00000000: AD01 0100 D4D3 D400 94A7 A901 547B 7E02  ............T{~.
        00000010: 6E00 0000                                n...
        
    • Tensor data
      • Tensor
        (gdb) x/16x tensor->buffer()
        0x7fffddf06920: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
        0x7fffddf06928: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
        __after reading file__
        (gdb) x/16x tensor->buffer()
        0x7fffddf06920: 0xd6    0x4f    0xd9    0x2a    0xea    0xaa    0x58    0x2a
        0x7fffddf06928: 0xce    0x84    0x59    0xaa    0x5e    0x82    0x4d    0x2a
        (gdb) p _tensor_data[vindex]
        $6 = {offset = 14284, size = 768}
        
      • Checkpoint file
        $ xxd -u -l 16 -s 14284 test.ckpt
        000037cc: D64F D92A EAAA 582A CE84 59AA 5E82 4D2A  .O.*..X*..Y.^.M*
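
For reference, the 20 header bytes can be decoded outside of gdb as well. The sketch below assumes a layout inferred from the gdb output and the dump size (a 16-bit `magic`, 8-bit `schema` and `reserved`, then four 32-bit fields, all little endian). The field widths and the `HEADER_FORMAT` and `Header` names are assumptions for illustration, not taken from the onert source.

```python
import struct
from collections import namedtuple

# Assumed layout: uint16 magic, uint8 schema, uint8 reserved, four uint32
# fields. "<" selects little endian with no padding (20 bytes in total).
HEADER_FORMAT = "<HBBIIII"
Header = namedtuple(
    "Header",
    "magic schema reserved opt1_offset opt2_offset other_offset length",
)

# Bytes copied from `xxd -u -l 20 test.ckpt` above.
raw = bytes.fromhex("AD01 0100 D4D3 D400 94A7 A901 547B 7E02 6E00 0000")
header = Header._make(struct.unpack(HEADER_FORMAT, raw))
print(header)

# Consecutive entries of the offset table (14284 and 15052 in the export
# test) differ by 768, the tensor size that gdb reports.
print(15052 - 14284)  # 768
```

Under this assumed layout the decoded values agree with the `_header` fields printed by gdb (magic 429, schema 1, length 110).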
        

@jyoungyun jyoungyun force-pushed the draft/implement_checkpoint branch from bbc757d to 6f1d0e8 Compare August 12, 2024 04:42
@jyoungyun
Contributor Author

Test

  • Test (SGD optimizer)
    • Original loss
      Epoch 1/5 - time: 411.496ms/step - loss: [0] 0.1560
      Epoch 2/5 - time: 403.569ms/step - loss: [0] 0.0648
      Epoch 3/5 - time: 410.286ms/step - loss: [0] 0.0042
      Epoch 4/5 - time: 423.164ms/step - loss: [0] 0.0020
      Epoch 5/5 - time: 428.583ms/step - loss: [0] 0.0016
      
    • Export checkpoint after 2 epochs
      $ ./Product/out/bin/onert_train --modelfile ../Personal/models/vgg16/transfer.circle  \
      --load_input:raw ../Personal/models/data/imagenet_a/test.input.10.bin \
      --load_expected:raw ../Personal/models/data/imagenet_a/test.output.10.bin \
      --batch_size 1 --loss 1 --optimizer 1 --loss_reduction_type 1 --learning_rate 0.01 \
      --num_of_trainable_ops -1 --epoch 2 --export_checkpoint vgg1.ckpt  <--- export
      ========================
      Epoch 1/2 - time: 450.970ms/step - loss: [0] 0.1560
      Epoch 2/2 - time: 449.429ms/step - loss: [0] 0.0648
      
    • Load checkpoint and train model
      $ ./Product/out/bin/onert_train --modelfile ../Personal/models/vgg16/transfer.circle  \
      --load_input:raw ../Personal/models/data/imagenet_a/test.input.10.bin \
      --load_expected:raw ../Personal/models/data/imagenet_a/test.output.10.bin \
      --batch_size 1 --loss 1 --optimizer 1 --loss_reduction_type 1 --learning_rate 0.01 \
      --num_of_trainable_ops -1 --epoch 3 --checkpoint vgg1.ckpt <--- load
      ========================
      Epoch 1/3 - time: 419.234ms/step - loss: [0] 0.0042
      Epoch 2/3 - time: 412.397ms/step - loss: [0] 0.0020
      Epoch 3/3 - time: 411.636ms/step - loss: [0] 0.0016
      
  • Test (Adam optimizer)
    • Original loss
      Epoch 1/5 - time: 430.202ms/step - loss: [0] 0.0944
      Epoch 2/5 - time: 422.456ms/step - loss: [0] 0.0012
      Epoch 3/5 - time: 427.005ms/step - loss: [0] 0.0010
      Epoch 4/5 - time: 429.322ms/step - loss: [0] 0.0010
      Epoch 5/5 - time: 424.069ms/step - loss: [0] 0.0010
      
    • Export checkpoint after 1 epoch
      $ ./Product/out/bin/onert_train --modelfile ../Personal/models/vgg16/transfer.circle  \
      --load_input:raw ../Personal/models/data/imagenet_a/test.input.10.bin \
      --load_expected:raw ../Personal/models/data/imagenet_a/test.output.10.bin \
      --batch_size 1 --loss 1 --optimizer 2 --loss_reduction_type 1 --learning_rate 0.001 \
      --num_of_trainable_ops -1 --epoch 1 --export_checkpoint vgg2.ckpt <--- export
      Model Expected Filename ../Personal/models/data/imagenet_a/test.output.10.bin
      ========================
      Epoch 1/1 - time: 462.832ms/step - loss: [0] 0.0944
      
    • Load checkpoint and train model
      $ ./Product/out/bin/onert_train --modelfile ../Personal/models/vgg16/transfer.circle  \
      --load_input:raw ../Personal/models/data/imagenet_a/test.input.10.bin \
      --load_expected:raw ../Personal/models/data/imagenet_a/test.output.10.bin \
      --batch_size 1 --loss 1 --optimizer 2 --loss_reduction_type 1 --learning_rate 0.001 \
      --num_of_trainable_ops -1 --epoch 4 --checkpoint vgg2.ckpt <--- load
      ========================
      Epoch 1/4 - time: 430.199ms/step - loss: [0] 0.0012
      Epoch 2/4 - time: 428.794ms/step - loss: [0] 0.0010
      Epoch 3/4 - time: 424.921ms/step - loss: [0] 0.0010
      Epoch 4/4 - time: 426.044ms/step - loss: [0] 0.0010
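
The pass criterion in both tests is that the losses of the export run followed by the losses of the load run reproduce the uninterrupted run. A small sketch of that comparison, using the SGD logs above (the `losses` helper and the regular expression are illustrative only, not part of `onert_train`):

```python
import re

# Captures the loss value printed at the end of each epoch line.
LOSS_RE = re.compile(r"loss: \[0\] ([\d.]+)")

def losses(log: str) -> list[float]:
    """Return the per-epoch loss values found in a training log."""
    return [float(value) for value in LOSS_RE.findall(log)]

original = """
Epoch 1/5 - time: 411.496ms/step - loss: [0] 0.1560
Epoch 2/5 - time: 403.569ms/step - loss: [0] 0.0648
Epoch 3/5 - time: 410.286ms/step - loss: [0] 0.0042
Epoch 4/5 - time: 423.164ms/step - loss: [0] 0.0020
Epoch 5/5 - time: 428.583ms/step - loss: [0] 0.0016
"""
exported = """
Epoch 1/2 - time: 450.970ms/step - loss: [0] 0.1560
Epoch 2/2 - time: 449.429ms/step - loss: [0] 0.0648
"""
resumed = """
Epoch 1/3 - time: 419.234ms/step - loss: [0] 0.0042
Epoch 2/3 - time: 412.397ms/step - loss: [0] 0.0020
Epoch 3/3 - time: 411.636ms/step - loss: [0] 0.0016
"""

# Export run + load run should match the uninterrupted 5-epoch run.
print(losses(exported) + losses(resumed) == losses(original))  # True
```

The same check holds for the Adam logs (1 exported epoch followed by 4 resumed epochs).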
      

@jyoungyun jyoungyun closed this Sep 9, 2024
@jyoungyun jyoungyun deleted the draft/implement_checkpoint branch September 9, 2024 09:41