
[Proposal] Allow users to shuffle ActivationCache dataset, rather than shuffling the pieces of the activation cache #277

naterush opened this issue Sep 3, 2024 · 2 comments

naterush (Contributor) commented Sep 3, 2024

Proposal

Add shuffle_dataset_upfront as a config option to CacheActivationsRunnerConfig.

Motivation

Currently, users can cache activations using the CacheActivationsRunner class. However, when caching these activations, an enormous portion of runtime is spent shuffling data pairwise within buffers. In my (highly-unscientific) experiments, shuffling (with default values) was >50% of the runtime, and ended up consistently triggering OOM errors on my GPU.

While it's currently possible to configure the CacheActivationsRunner to avoid shuffling altogether, doing so might hurt training of the resulting SAE, depending on how the initial dataset is ordered.

As such, it would be ideal to allow users to:

  1. Disable all pairwise shuffling between different saved activation tensors.
  2. Shuffle input token sequences upfront. Since token sequences are far smaller than the cached activation tensors, this moves less data around, and we only need to shuffle once.

Both of these changes could be enabled with backward-compatible extensions to CacheActivationsRunnerConfig: simply add a new param, shuffle_dataset_upfront (or similar), which would require streaming=False.
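As a sketch of the intended behavior (shuffle_dataset_upfront does not exist yet, and the function name below is hypothetical), the upfront shuffle amounts to permuting the small list of token sequences once, instead of repeatedly pairwise-shuffling large activation buffers:

```python
import random

def shuffle_sequences_upfront(token_sequences, seed=0):
    """Sketch of the proposed behavior: permute the input token sequences
    once, before any activations are computed. The sequences are just
    token-id lists, so this moves far less data than pairwise-shuffling
    the large activation tensors the runner saves later."""
    rng = random.Random(seed)
    shuffled = list(token_sequences)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled

sequences = [[1, 2], [3, 4], [5, 6], [7, 8]]
shuffled = shuffle_sequences_upfront(sequences, seed=42)
assert sorted(shuffled) == sorted(sequences)  # same sequences, new order
```

Because this happens before caching, the dataset must be fully materialized, which is why the option would require streaming=False.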

Pitch

I made this change - and it resulted in me being able to cache activations! Previously, I would just get OOM errors on my GPU during shuffling (which might be a related bug where old buffers are not cleaned up).

Alternatives

Not sure there are many: if you're aiming for a random order of activations, you need to shuffle either before or during caching. This proposal adds shuffling before.

Alternatively, the user could be responsible for shuffling the dataset and re-uploading it to Hugging Face before re-downloading it - but this is a lot of extra work that we could avoid entirely with a single extra param.

Checklist

  • I have checked that there is no similar issue in the repo (required)
naterush (Contributor, Author) commented Sep 3, 2024

Happy to take a shot at adding this, btw (would be a good early contribution for me) -- but let me know if there's an appetite for it, before I go for it.

Thanks!

jbloomAus (Owner) commented

I don't think shuffling tokens up front would help. We need activations from different contexts to get mixed. I'd be open to a PR that makes the shuffling less frequent or turns it off so people can move more quickly sometimes (though the shuffling is supposed to be important, according to Anthropic).
