Flor (for "fast low-overhead recovery") is a record-replay system for deep learning and other forms of machine learning that train models on GPUs. Flor was developed to speed up hindsight logging: a cyclic-debugging practice in which you add logging statements after encountering a surprise and then efficiently re-train with more logging. Flor takes low-overhead checkpoints during training (the record phase) and uses those checkpoints for replay speedups based on memoization and parallelism.
FlorDB integrates Flor, git, and sqlite3 to manage model developers' logs, execution data, versions of code, and training checkpoints. In addition to serving as an experiment management solution for ML engineers, FlorDB extends hindsight logging across model training versions for the retroactive evaluation of iterative ML. FlorDB has been extended to support Dataflow operations.
Flor and its evolutions are software developed at UC Berkeley's RISE Lab.
pip install flordb
We start by selecting (or creating) a git repository to save our model training code as we iterate and experiment. Flor automatically commits your changes on every run, so no change is lost. Below we provide a sample repository you can use to follow along:
$ git clone git@github.com:ucbepic/ml_tutorial
$ cd ml_tutorial/
Run the train.py script to train a small linear model and test your flordb installation.
$ python train.py
Flor will manage checkpoints, logs, command-line arguments, code changes, and other experiment metadata on each run (more details below). All of this data is then exposed to the user via SQL or Pandas queries.
To view the experiment history you logged, run the following command from the same directory in which you ran the examples above:
$ python -m flor dataframe
        projid               tstamp  filename  device  seed  hidden  epochs  batch_size     lr  print_every  accuracy  correct
0  ml_tutorial  2023-08-28T15:04:07  train.py     cpu    78     500       5          32  0.001          500     97.71     9771
1  ml_tutorial  2023-08-28T15:04:35  train.py     cpu     8     500       5          32  0.001          500     98.01     9801
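To illustrate the kind of SQL query this history supports, here is a minimal sketch using Python's standard sqlite3 module. The `runs` table, its schema, and the in-memory database are assumptions made for illustration; they mirror a subset of the columns printed above and are not FlorDB's actual storage layout.

```python
import sqlite3

# Hypothetical table mirroring some of the columns printed above.
# FlorDB's real schema may differ; this only illustrates the style of query.
rows = [
    ("ml_tutorial", "2023-08-28T15:04:07", "train.py", 78, 500, 5, 32, 0.001, 97.71),
    ("ml_tutorial", "2023-08-28T15:04:35", "train.py", 8, 500, 5, 32, 0.001, 98.01),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (projid TEXT, tstamp TEXT, filename TEXT, seed INTEGER, "
    "hidden INTEGER, epochs INTEGER, batch_size INTEGER, lr REAL, accuracy REAL)"
)
conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Find the run with the highest accuracy.
best = conn.execute(
    "SELECT tstamp, seed, accuracy FROM runs ORDER BY accuracy DESC LIMIT 1"
).fetchone()
print(best)
```

With the two runs shown, the query selects the second run (seed 8), which has the higher accuracy.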
The train.py script has been prepared in advance to define and manage four different hyper-parameters:
$ cat train.py | grep flor.arg
hidden_size = flor.arg("hidden", default=500)
num_epochs = flor.arg("epochs", 5)
batch_size = flor.arg("batch_size", 32)
learning_rate = flor.arg("lr", 1e-3)
You can control any of the hyper-parameters (e.g. hidden) using Flor's command-line interface:
$ python train.py --kwargs hidden=75
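The following is a minimal sketch of how a `name=value` override could be resolved against a registered default. The helpers `parse_kwargs` and `resolve_arg` are hypothetical names introduced here for illustration and are not part of Flor's API; the cast to the default's type is likewise an assumption about reasonable behavior, not a description of Flor's internals.

```python
def parse_kwargs(tokens):
    """Turn ['hidden=75', 'lr=0.01'] into {'hidden': '75', 'lr': '0.01'}."""
    return dict(token.split("=", 1) for token in tokens)


def resolve_arg(name, default, overrides):
    """Return the override for `name`, cast to the type of `default`.

    Falls back to `default` when no override was given. Hypothetical helper,
    not Flor's implementation.
    """
    if name not in overrides:
        return default
    raw = overrides[name]
    if default is None:
        return raw  # no default to infer a type from; keep the raw string
    return type(default)(raw)


# Simulates: python train.py --kwargs hidden=75
overrides = parse_kwargs(["hidden=75"])
hidden_size = resolve_arg("hidden", 500, overrides)    # overridden on the CLI
learning_rate = resolve_arg("lr", 1e-3, overrides)     # falls back to the default
print(hidden_size, learning_rate)
```

Inferring the type from the default keeps the command line free of type annotations, at the cost of needing special handling for types such as bool, which this sketch does not attempt.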
Flor is shipped with utilities for serializing and checkpointing PyTorch state, and utilities for resuming, auto-parallelizing, and memoizing executions from checkpoint.
The model developer passes objects for checkpointing to flor.checkpointing(**kwargs), and gives it control over loop iterators by calling flor.loop(name, iterator) as follows:
import flor
import torch

# Hyper-parameters registered with Flor; each can be overridden from the command line.
hidden_size = flor.arg("hidden", default=500)
num_epochs = flor.arg("epochs", 5)
batch_size = flor.arg("batch_size", 32)
learning_rate = flor.arg("lr", 1e-3)

# These objects are assumed to be constructed elsewhere in the script.
trainloader: torch.utils.data.DataLoader
testloader: torch.utils.data.DataLoader
optimizer: torch.optim.Optimizer
net: torch.nn.Module
criterion: torch.nn.modules.loss._Loss

# Objects passed to flor.checkpointing are the ones Flor checkpoints.
with flor.checkpointing(model=net, optimizer=optimizer):
    for epoch in flor.loop("epoch", range(num_epochs)):
        for data in flor.loop("step", trainloader):
            inputs, labels = data
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            flor.log("loss", loss.item())
            optimizer.step()
        # eval is assumed to be the script's own evaluation function, not the Python builtin.
        eval(net, testloader)
As shown, we wrap both the main loop and the nested training loop with flor.loop so Flor can manage their state. Flor uses loop iteration boundaries to store selected checkpoints adaptively, and at replay time it uses those same checkpoints to resume training from the appropriate epoch.
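The idea of checkpointing at iteration boundaries and resuming from the nearest checkpoint can be sketched in plain Python. This is a toy model under stated assumptions, not Flor's implementation: the `train` and `resume_epoch` functions are hypothetical, the "model" is a single running sum, and checkpoints are taken at a fixed period (`every`) rather than adaptively.

```python
import copy


def train(num_epochs, checkpoints, every=2, start_state=None, start_epoch=0):
    """Toy training loop; `state` stands in for model and optimizer state."""
    state = copy.deepcopy(start_state) if start_state is not None else {"weight": 0}
    for epoch in range(start_epoch, num_epochs):
        state["weight"] += epoch + 1           # stand-in for one epoch of training
        if (epoch + 1) % every == 0:           # checkpoint at an epoch boundary
            # Key: number of completed epochs at the time of the checkpoint.
            checkpoints[epoch + 1] = copy.deepcopy(state)
    return state


def resume_epoch(checkpoints, target_epoch):
    """Largest checkpointed epoch count that does not exceed target_epoch."""
    usable = [done for done in checkpoints if done <= target_epoch]
    return max(usable, default=0)


# Record phase: train from scratch and store periodic checkpoints.
checkpoints = {}
final = train(5, checkpoints)

# Replay phase: to re-execute epoch index 4, restore the state after 4 epochs
# and run only the remaining work instead of re-training from the start.
start = resume_epoch(checkpoints, 4)
replayed = train(5, {}, start_state=checkpoints.get(start), start_epoch=start)
print(start, replayed == final)
```

Because the replayed run starts from a restored state, it performs one epoch of work rather than five while reaching the same final state. Independent checkpoints also make it possible to replay different epoch ranges in parallel, which is the kind of speedup the text describes.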
You call flor.log(name, value) and flor.arg(name, default=None) to log metrics and register tunable hyper-parameters, respectively.
$ cat train.py | grep flor.arg
hidden_size = flor.arg("hidden", default=500)
num_epochs = flor.arg("epochs", 5)
batch_size = flor.arg("batch_size", 32)
learning_rate = flor.arg("lr", 1e-3)
$ cat train.py | grep flor.log
flor.log("loss", loss.item()),
The name(s) you use for the variables you intercept with flor.log and flor.arg will each become a column (measure) in the full pivoted view.
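As a rough illustration of what a pivoted view means, the sketch below turns long-format (run, name, value) records into one row per run with one column per name. The `pivot` function and the record layout are assumptions for illustration only; they do not describe how FlorDB stores or pivots its logs.

```python
# Long-format records: one (run timestamp, name, value) triple per logged item.
records = [
    ("2023-08-28T15:04:07", "hidden", 500),
    ("2023-08-28T15:04:07", "lr", 0.001),
    ("2023-08-28T15:04:07", "accuracy", 97.71),
    ("2023-08-28T15:04:35", "hidden", 500),
    ("2023-08-28T15:04:35", "lr", 0.001),
    ("2023-08-28T15:04:35", "accuracy", 98.01),
]


def pivot(records):
    """Return (column names, rows), with one row per run and one column per name."""
    columns = []   # logged names, in order of first appearance
    by_run = {}    # tstamp -> {name: value}
    for tstamp, name, value in records:
        if name not in columns:
            columns.append(name)
        by_run.setdefault(tstamp, {})[name] = value
    rows = [
        {"tstamp": tstamp, **{col: values.get(col) for col in columns}}
        for tstamp, values in by_run.items()
    ]
    return columns, rows


columns, table = pivot(records)
for row in table:
    print(row)
```

A name that was never logged in a given run shows up as None in that run's row, which is how a newly added logging statement would appear for older runs before any replay fills it in.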
To cite this work, please refer to the Multiversion Hindsight Logging paper (pre-print '23).
FlorDB is open source software developed at UC Berkeley. Joe Hellerstein (databases), Joey Gonzalez (machine learning), and Koushik Sen (programming languages) are the primary faculty members leading this work.
This work is released as part of Rolando Garcia's doctoral dissertation at UC Berkeley, and has been the subject of study by Eric Liu and Anusha Dandamudi, both of whom completed their master's theses on FLOR. Our list of publications is reproduced below. Finally, we thank Vikram Sreekanti, Dan Crankshaw, and Neeraja Yadwadkar for guidance, comments, and advice. Bobby Yan was instrumental in the development of FLOR and its corresponding experimental evaluation.
- The Management of Context in the Machine Learning Lifecycle. R Garcia. EECS Department, University of California, Berkeley, 2024. UCB/EECS-2024-142.
- Multiversion Hindsight Logging for Continuous Training. R Garcia, A Dandamudi, G Matute, L Wan, JE Gonzalez, JM Hellerstein, K Sen. pre-print on ArXiv, 2023.
- Hindsight Logging for Model Training. R Garcia, E Liu, V Sreekanti, B Yan, A Dandamudi, JE Gonzalez, JM Hellerstein, K Sen. The VLDB Journal, 2021.
- Fast Low-Overhead Logging Extending Time. A Dandamudi. EECS Department, UC Berkeley Technical Report, 2021.
- Low Overhead Materialization with FLOR. E Liu. EECS Department, UC Berkeley Technical Report, 2020.
- Context: The Missing Piece in the Machine Learning Lifecycle. R Garcia, V Sreekanti, N Yadwadkar, D Crankshaw, JE Gonzalez, JM Hellerstein. CMI, 2018.
FlorDB is licensed under the Apache v2 License.