Proposed synchronisation of RL and LM approaches #4

Open · wants to merge 24 commits into main
Conversation

@thomfoster (Collaborator) commented Jan 29, 2023

A reinforcement learning algorithm is characterised by the trajectories it generates during training. We are interested in "algorithm distillation": whether these trajectories can be modelled by transformers, as studied in the original DeepMind algorithm distillation paper.

My particular interest in this field is the case where:

  1. the trajectories have been generated by the TRLx library during RLHF training of language models
  2. the transformer modelling the trajectories is itself a standard language model

The current repo doesn't yet account for this case:

  1. the trajectories are generated for traditional RL environments using OpenAI Gym. In the current repo, these are collected online during AD training, which is infeasible for TRLx.
  2. the transformer modelling the trajectories is a GPT-2 model with state, action and reward heads attached.

This pull request is designed to synchronise these approaches, allowing exploration of both RL and LM tasks, and distillation of trajectories into both RL and LM transformers.

There's still a bit more to do, but I wanted to open this PR to show the direction I'm working in.

More detail on data formats

A trajectory is typically defined as a list of (state, action, reward) triples. For training purposes, it is sometimes useful to augment this with logprobs: for each triple $(s, a, r)$, the log-probability of taking action $a$ in state $s$ as determined by the policy generating the trajectory.

We therefore define an RL Format Trajectory as a sequence of (state, action, reward, logprobs) tuples.
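
As a concrete illustration, an RL Format Trajectory could be represented with a structure like the following. This is a minimal sketch; the class and field names are hypothetical, not the repo's actual types.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Step:
    state: np.ndarray     # environment observation s
    action: np.ndarray    # action a taken by the policy
    reward: float         # scalar reward r
    logprobs: np.ndarray  # log-probability of a at s under the generating policy


# An RL Format Trajectory is then just an ordered sequence of such steps.
RLTrajectory = List[Step]
```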

The typical way to learn to model these trajectories with a transformer is to separately map the final hidden state through dedicated heads. That is, for a given tuple $(s, a, r, l)$, a transformer $f$ maps to $(\hat{s}, \hat{a}, \hat{r}, \hat{l})$.

In this repo, this is done via the models in /models/rl.
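
As a rough sketch of the idea (not the actual implementation in /models/rl: the class name, embedding scheme and head set are illustrative, and it assumes continuous state/action vectors with scalar reward and logprob):

```python
import torch
import torch.nn as nn
from transformers import GPT2Model


class RLFormatTransformer(nn.Module):
    """GPT-2 backbone with separate prediction heads over the final hidden state."""

    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")
        hidden = self.backbone.config.n_embd  # 768 for "gpt2"
        # Embed each (s, a, r, l) tuple (reward and logprob treated as scalars here)
        self.embed = nn.Linear(state_dim + action_dim + 2, hidden)
        self.state_head = nn.Linear(hidden, state_dim)
        self.action_head = nn.Linear(hidden, action_dim)
        self.reward_head = nn.Linear(hidden, 1)

    def forward(self, tuples: torch.Tensor) -> dict:
        # tuples: (batch, seq_len, state_dim + action_dim + 2)
        h = self.backbone(inputs_embeds=self.embed(tuples)).last_hidden_state
        return {
            "state": self.state_head(h),
            "action": self.action_head(h),
            "reward": self.reward_head(h),
        }
```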

We are also interested in the ability of standard language models (with language modeling heads) to learn trajectories. To this end we define a Language Format Trajectory as a trajectory serialised into a string. There are many possible ways to do this, and the optimal one requires investigation. For example, for trajectories generated using TRLx when finetuning a language model on positive sentiment, we can format the trajectory as the string:

prompt: Dan went down to the shops.
completion: He smiled as he walked - the sun was shining.
reward: 0.9975
###

It's less obvious how to do this when the task is not a language task, such as moonlander. Enumerating the states as coordinates might work, but requires experimentation.

Trajectories in Language Format are learnt by the models in /models/lm.
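
For the sentiment example above, the serialisation could be as simple as the following sketch (the template, function name and float formatting are placeholders; the optimal format is still an open question):

```python
def to_language_format(prompt: str, completion: str, reward: float) -> str:
    """Serialise one (prompt, completion, reward) rollout into the string template above."""
    return (
        f"prompt: {prompt}\n"
        f"completion: {completion}\n"
        f"reward: {reward:.4f}\n"
        "###\n"
    )


# e.g. to_language_format("Dan went down to the shops.",
#                         "He smiled as he walked - the sun was shining.", 0.9975)
```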

To summarise:

/models contains the "algorithm distillation models": transformers trained in a supervised fashion to learn RL trajectories. We distinguish between models that operate on RL Format trajectories and those that operate on Language Format trajectories.

/tasks contains code to produce the RL trajectories that the models learn. It can store this data however it likes, but each task should expose a torch.utils.data.Dataset that can return trajectory data in either RL Format or Language Format.
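
A minimal sketch of that Dataset contract, assuming trajectories are stored as dicts of per-timestep arrays (the class name and `fmt` switch are hypothetical, not the repo's actual API):

```python
import torch
from torch.utils.data import Dataset


class TrajectoryDataset(Dataset):
    """Returns trajectories in either RL Format or Language Format."""

    def __init__(self, trajectories, fmt: str = "rl"):
        # `trajectories`: list of dicts with "states", "actions", "rewards", "logprobs"
        assert fmt in ("rl", "language")
        self.trajectories = trajectories
        self.fmt = fmt

    def __len__(self) -> int:
        return len(self.trajectories)

    def __getitem__(self, idx):
        traj = self.trajectories[idx]
        if self.fmt == "rl":
            # RL Format: per-timestep tensors for each field
            return {key: torch.as_tensor(val) for key, val in traj.items()}
        # Language Format: one serialised string per trajectory
        return "\n".join(f"{key}: {val}" for key, val in traj.items()) + "\n###"
```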

Generating trajectory data

I am using my own fork of TRLx that has rollout logging.
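
Purely for illustration, consuming those logged rollouts might look like this, assuming a hypothetical JSON-lines log format with prompt/completion/reward fields (the fork's actual log format may differ):

```python
import json
from pathlib import Path


def load_rollouts(log_dir: str) -> list:
    """Collect logged rollouts from every *.jsonl file under `log_dir`."""
    rollouts = []
    for path in sorted(Path(log_dir).glob("*.jsonl")):
        with path.open() as f:
            rollouts.extend(json.loads(line) for line in f if line.strip())
    return rollouts
```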

ToDo:

- [x] Set up repo structure (just for your language stuff, @h can add in his)
- [x] Add train script for models/lm/causallm
- [x] Clone H's work and merge with @h (/models/rl) and (/tasks/rl)
- [ ] Write a train script that demonstrates how to use with env tasks
- [ ] Switch to using official branch of TRLx (get rollout logging PR approved)
- [ ] Add online evaluation script for models/lm/causallm
- [ ] Improve train script logging to include reward accuracy

Potential future tasks:

- [ ] Add more elegant metaclass switching between ...LanguageTrajectories and ...RlTrajectories
- [ ] Add main file with click CLI interface for running experiments

@thomfoster changed the title from "Synchronise RL and LM approaches" to "Proposed synchronisation of RL and LM approaches" on Jan 29, 2023