All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
## Unreleased

- A bunch of annealing configs
- `constant_with_warmup` learning rate schedule
- `one_in_eight` configuration for activation checkpointing
- New tokenizer in the source instead of from huggingface
- Improved support for GCS
- `torch.compile()` now only compiles each block, not the whole model (see the sketch after this list).
- Support for `torch.compile()` with `dynamic=True`
- Resetting the `torch.compile()` after every evaluation, because evaluation messes with the compiled versions
- Added more in-loop evaluation tasks to pick from, mostly for scaling law.
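A minimal sketch of the per-block compilation and compile reset mentioned above, under stated assumptions: the attribute names (`model.transformer.blocks`, assumed to be an `nn.ModuleList`) are illustrative and not the actual OLMo module layout, and `torch._dynamo` is an internal namespace that may change between releases.

```python
import torch
import torch._dynamo
import torch.nn as nn


def compile_blocks(model: nn.Module, dynamic: bool = True) -> None:
    # Illustrative only: compile each transformer block rather than the whole model.
    # The attribute path model.transformer.blocks is assumed for this sketch.
    blocks = model.transformer.blocks
    for i, block in enumerate(blocks):
        blocks[i] = torch.compile(block, dynamic=dynamic)


def reset_compilation() -> None:
    # One way to drop cached compiled graphs (e.g. after an evaluation pass);
    # torch._dynamo is internal and its behavior may differ across releases.
    torch._dynamo.reset()
```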
## v0.5.1 - 2024-10-17
- Added ability to try loading latest checkpoint from save folder using `--try_load_latest_save` (conceptual sketch after this list).
- Added support for flash attention and gradient checkpointing to `hf_olmo`.
- Added to `scripts.compare_wandb_configs.py` the ability to more easily compare differences in data mixes and evaluation tasks.
- Added `effective_n_kv_heads` to `OLMoConfig` for hacky VLLM support.
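Conceptually, `--try_load_latest_save` amounts to picking the newest checkpoint under the save folder before resuming. The helper below is a hypothetical sketch (the `step*` directory naming is an assumption), not the trainer's actual resume logic.

```python
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(save_folder: str) -> Optional[Path]:
    # Hypothetical: assume checkpoints live in directories named like "step1000".
    candidates = [p for p in Path(save_folder).glob("step*") if p.is_dir()]
    if not candidates:
        return None

    def step_number(path: Path) -> int:
        digits = "".join(ch for ch in path.name if ch.isdigit())
        return int(digits) if digits else -1

    # Resume from the checkpoint with the highest step number.
    return max(candidates, key=step_number)
```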
## v0.5.0 - 2024-08-26
- Fixed conversion to HuggingFace model for DDP-trained models.
- Added support for remote source and destination for HuggingFace model conversion.
- Added support for document masking via flash-attn during training with `--data.generate_doc_lengths`.
- Added config options for `model.norm_after`, `model.scale_emb_init`, and `auxiliary_loss_multiplier` (used with zloss).
- Added scripts for running experiments on qk_norm, norm reordering, and zloss.
- Added `model.rope_theta` configuration option.
- Added `model.embedding_layer_norm` configuration option for adding a LN to the embeddings.
- Added `model.emb_init_std` configuration option to override the standard deviation used to initialize the embeddings.
- Added downstream eval task for requests dumped from oe-eval tasks
- Added `CosLinearEnvelope` scheduler, which is a pointwise product of a cosine schedule and a linear decay (see the sketch at the end of this section).
- Added ability to save outputs of submodules for debugging purposes.
- Added a number of tasks from oe-eval to the downstream eval tasks.
- Version dolma flan change in `named_data_mix.py`
- Changed default distributed training strategy from single-GPU to FSDP
- Fixed behavior of `effective_memmap_dtype` to prevent unrecognized dtypes from being parsed as `uint16`.
- Fixed restarting a training run in later epochs so that we no longer need to set the flag `--epoch=INT`.
- Swapped in correct flan data mix.
- Fixed a bug where the attention norm, when applied before the attention block, was modifying the residual stream.
- Fixed `OLMo.from_checkpoint()` so that it correctly loads `olmo_core` and `torch_new` style checkpoints.
- Fixed `preserve_rng_state` being incorrectly set to `False` when doing gradient checkpointing with dropout
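As a rough illustration of the `CosLinearEnvelope` idea (a cosine schedule multiplied pointwise by a linear decay), here is a minimal sketch; the function name, signature, and omission of warmup are assumptions and may differ from the actual scheduler.

```python
import math


def cos_linear_envelope(step: int, t_max: int, lr_max: float, lr_min: float = 0.0) -> float:
    # Illustrative only: pointwise product of a cosine schedule and a linear decay,
    # both going from 1 at step 0 to 0 at step t_max. Warmup is omitted.
    frac = min(step, t_max) / t_max
    cosine = 0.5 * (1.0 + math.cos(math.pi * frac))  # 1 -> 0
    linear = 1.0 - frac                              # 1 -> 0
    return lr_min + (lr_max - lr_min) * cosine * linear
```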
## v0.4.0 - 2024-07-11
- Added clipping fix to `Optimizer` class to make it work with FSDP `no_shard` and DDP.
- Added tests to compare grad norm differences between the torch optimizer with clipping and the OLMo optimizer with clipping, on both CPU and GPU.
- Expose memmap dtype in data config
- Added support for DDP training.
- Added caching to disk of HF datasets used in downstream evals
- Added FLOPs logging
- Added configs for OLMo tiny set of models
- Added configuration field `optimizer.record_update_metrics`, which defaults to `False`, but when set to `True` will trigger AdamW to collect the step size norm and absolute max for each parameter.
- Added configuration field `optimizer.selective_updates`, which defaults to `False`, but when set to `True` will tell the optimizer to skip updating the parameter and state when the corresponding gradient is 0 (see the sketch at the end of this section).
- Added `olmo_data`, a package holding data files like tokenizers.
- Added ability to load tokenizers from `olmo_data` package data.
- Added a script that can run a series of models with predictable scaling properties.
- Added original legacy unsharding implementation back, as the default. The new shared memory implementation can be used by passing `use_legacy_shared_mem_impl` to `unshard.py`.
- Refactor weight initialization. IMPORTANT: this does not maintain backwards-compatibility with older configs; the jobs will still run, but may produce different outputs.
- Changed the behavior of the Lion optimizer to only record the update cosine similarity when `optimizer.record_update_metrics` is `True` in order to be consistent with the API.
- Added HF datasets into `olmo_data`, and changed downstream eval to load from the package.
- Changed from `ignored_index` to `ignore_index` for `cross_entropy_loss` when `flash-attn>=2.5.8`.
- Make `hf_olmo` support `AutoModelForCausalLM` and similar HF methods again.
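To make the `optimizer.selective_updates` behavior concrete, here is a hedged, element-wise sketch of the idea applied to a single Adam-style first-moment buffer; the helper and its handling of state are illustrative, not the actual OLMo optimizer code.

```python
import torch


@torch.no_grad()
def masked_moment_update(exp_avg: torch.Tensor, grad: torch.Tensor, beta1: float = 0.9) -> None:
    # Illustrative only: where the gradient is exactly zero, leave the optimizer
    # state (and hence the parameter update) untouched instead of decaying it.
    mask = (grad != 0).to(grad.dtype)        # 1 where a gradient exists, 0 elsewhere
    decay = beta1 * mask + (1.0 - mask)      # beta1 where updating, 1 where skipping
    exp_avg.mul_(decay).add_(grad * mask, alpha=1.0 - beta1)
```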
## v0.3.0 - 2024-04-25
- Added support for Grouped Query Attention (illustrative sketch at the end of this section).
- Added commonsense_qa and social_iqa downstream evaluation tasks
- Added ce_loss metric, with TriviaQA and NaturalQuestions tasks
- Makes it possible to read from http/https the same way we read from s3/r2.
- Added MMLU multiple choice (A/B/C/D) 5-shot variant downstream tasks
- Tokenizer patch
- Added option to specify number of model replicas when using hybrid sharding.
- Rename `Olmo` to `OLMo` everywhere in the codebase
- Disabled automatic garbage collection during training; instead we run it manually at regular intervals to avoid ranks getting out of sync with their own gc.
- Removed `AMDLayerNorm`, since the original layer norm bug has been fixed and we don't need this workaround anymore.
- Removed `OLMoParallelBlock`.
- Don't log garbage on nodes that aren't rank 0
- Don't crash in the HF code when we are referring to a tokenizer in a local file
- Point official training scripts to publicly available URLs
- Corrected the `resize_token_embeddings` method in the `OLMoForCausalLM` class to properly update the token embeddings when resizing the vocabulary.
- Changed `tie_weights` method to a no-op as weight tying is handled in `olmo/model.py`.
- Fixed the size calculation for qk layer norm.
- Fixed pipeline test failure that occurs due to a bug in transformers version 4.39.1
- Make `hf_olmo` compatible with transformers versions >=4.40.0
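As a rough picture of what Grouped Query Attention means for the key/value heads (see the first item in this section), here is a minimal sketch; the helper name and tensor shapes are assumptions for illustration, not the OLMo implementation.

```python
import torch


def repeat_kv(kv: torch.Tensor, n_q_heads: int, n_kv_heads: int) -> torch.Tensor:
    # Illustrative only: in GQA, each of the n_kv_heads key/value heads is shared by
    # n_q_heads // n_kv_heads query heads. kv: (batch, n_kv_heads, seq_len, head_dim).
    group_size = n_q_heads // n_kv_heads
    return kv.repeat_interleave(group_size, dim=1)  # (batch, n_q_heads, seq_len, head_dim)
```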
## v0.2.5 - 2024-03-06
- Fixed default value of `--tokenizer` argument to `scripts/prepare_tulu_data.py` to be an absolute path, not a relative path, so the script can be run from other directories.
- Added the option to directly pass input embeddings to `OLMo` and `OLMoForCausalLM`.
- Added support for Python 3.8.
- Added code to throw an error if `output_attentions` is set to `True` in a forward call to `OLMoForCausalLM`. This functionality hasn't been implemented yet.
- Corrected the scheme displayed in error messages that come from R2.
- Fixed running with multiple data loading workers in LUMI
- Minor bug fix: uninitialized prompts variable
- Added `output_hidden_states` argument and associated functionality to `OLMo` and `OLMoForCausalLM` to return model intermediate hidden states (usage sketch at the end of this section).
- Ability to read from R2 like we read from S3.
- Added MMLU downstream evaluation tasks, with prompt variations.
- Added support for PyTorch v2.2.
- Added ability to show logs from all ranks
- Added option for QKV clipping.
- Added basic_arithmetic downstream evaluation task
- Changed legacy checkpoint unsharding to use processes and shared memory instead of threads
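A hedged usage sketch of the `output_hidden_states` argument through the Hugging Face wrapper; the model name, the `trust_remote_code` flag, and the exact output structure are assumptions for illustration, not a verified recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only; "allenai/OLMo-1B" and trust_remote_code are assumptions here.
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B", trust_remote_code=True)

inputs = tokenizer("OLMo is", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
print(len(outputs.hidden_states))  # number of returned hidden-state tensors, if supported
```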
## v0.2.4 - 2024-02-02
- Fixed an issue with the HuggingFace integration where we were inadvertently using a feature that was introduced in Python 3.10, causing an error for older Python versions.
## v0.2.3 - 2024-01-31

## v0.2.2 - 2023-12-10

## v0.2.1 - 2023-12-10

## v0.2.0 - 2023-12-08
- GPT-based model.
- Tokenizer and data pre-processing pipeline.
- Training script.
- Triton-based FlashAttention.