Commit
training script
Kye committed Oct 2, 2023
1 parent b933890 commit 957219f
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -106,6 +106,7 @@ Eprint = {arXiv:2302.05442},

# Todo
- [ ] Add flash attention, with layernorm before attention and then layernorm for the q/k values (see the attention sketch below),
- [ ] Basic training script on CIFAR,
- [ ] When using ViT-22B, as with any large-scale model, it is difficult to understand how the model arrived at a specific decision, which could lead to a lack of trust and accountability. Add a mechanism to backtrack the model's decisions,
- [ ] Create logic to train the decoder for 300k steps with a batch size of 64 using Adam (Kingma and Ba, 2015), clipping gradients to a global norm of 0.05 to stabilize training. Linearly increase the learning rate from 0 to 0.0002 over the first 2,500 steps, then decay it back to 0 with a cosine schedule (Loshchilov and Hutter, 2017); see the training sketch below.
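
A minimal sketch of what the first item could look like in PyTorch: pre-attention layernorm plus layernorm on the query/key projections, with `F.scaled_dot_product_attention` dispatching to a flash kernel when one is available. `QKNormAttention` and its defaults are illustrative assumptions, not the repo's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Multi-head attention with layernorm before attention and
    layernorm applied to the q/k values (hypothetical sketch)."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.head_dim = dim // heads
        self.pre_norm = nn.LayerNorm(dim)          # layernorm before attention
        self.q_norm = nn.LayerNorm(self.head_dim)  # layernorm on q values
        self.k_norm = nn.LayerNorm(self.head_dim)  # layernorm on k values
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, d = x.shape
        qkv = self.to_qkv(self.pre_norm(x)).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, self.head_dim).transpose(1, 2)
                   for t in qkv)
        q, k = self.q_norm(q), self.k_norm(k)
        # Uses a fused flash-attention kernel when the backend supports it.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return x + self.to_out(out)  # residual connection
```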
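
And a minimal sketch of the training recipe from the last item: Adam, gradient clipping at a global norm of 0.05, linear warmup to 0.0002 over 2,500 steps, and cosine decay back to 0 over 300k steps. It assumes a model that returns its own loss and a DataLoader yielding `(image, label)` batches; `lr_at`, `train`, and the constants are illustrative names, not the repo's API.

```python
import math
import torch

TOTAL_STEPS, WARMUP_STEPS = 300_000, 2_500
PEAK_LR, BATCH_SIZE, CLIP_NORM = 2e-4, 64, 0.05

def lr_at(step):
    # Linear warmup from 0 to the peak LR, then cosine decay back to 0.
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

def train(model, loader, device="cuda"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR)
    # LambdaLR scales the base LR, so divide out PEAK_LR here.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda s: lr_at(s) / PEAK_LR)
    step = 0
    while step < TOTAL_STEPS:
        for images, _ in loader:  # batches of BATCH_SIZE
            loss = model(images.to(device))  # assumes model returns its loss
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
            optimizer.step()
            scheduler.step()
            step += 1
            if step >= TOTAL_STEPS:
                break
```

For the CIFAR item above, a `torchvision.datasets.CIFAR10` dataset wrapped in a `DataLoader` with `batch_size=64` would slot in as `loader`.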
