Feature/acces single heads #40
Conversation
…-style transformers
…_pass with single head access
…ss single-head activations
# currently, accessing single head activations is only supported for GPT2LMHead models
if (self.source.head is not None and 'gpt2' not in self.source.model_name or
        self.target.head is not None and 'gpt2' not in self.target.model_name):
Does this work with other GPT models (e.g., GPT-J)?
Unfortunately not.
GPT-J, despite being similar, uses a different attention implementation (GPTJAttention: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gptj/modeling_gptj.py#L100)
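For reference, the attention submodules are laid out differently in the two implementations, which is why the GPT-2 hook points don't carry over. A quick check (only GPT-2 is loaded here; the GPT-J module names are taken from the linked modeling_gptj.py):

from transformers import GPT2LMHeadModel

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT2Attention fuses Q/K/V into a single Conv1D (c_attn) and projects the
# merged heads with c_proj, which is what the current code relies on.
print(dict(gpt2.transformer.h[0].attn.named_children()).keys())
# e.g. dict_keys(['c_attn', 'c_proj', 'attn_dropout', 'resid_dropout'])

# GPTJAttention (see the link above) instead defines separate q_proj, k_proj,
# v_proj and out_proj Linears, so there is no c_attn/c_proj to target.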
We need to implement a mechanism that works across a range of model architectures.
Yeah, good news is that we're starting to see a pattern emerge.
I'm thinking we want to have a base ModelAccessor class that looks like:

from abc import ABC
from torch import Tensor

class ModelAccessor(ABC):
    def get_block_output(self, position: list[int], layer: int) -> Tensor:
        raise NotImplementedError(...)

    def set_block_output(self, position: list[int], layer: int, value: Tensor) -> None:
        raise NotImplementedError(...)

    def get_head_attn(self, position: list[int], layer: int, head: list[int]) -> Tensor:
        raise NotImplementedError(...)

    def set_head_attn(self, position: list[int], layer: int, head: list[int], value: Tensor) -> None:
        raise NotImplementedError(...)
and each model can implement this class.
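For instance, a GPT-2-specific implementation might look roughly like the sketch below (GPT2Accessor, its constructor arguments and the hook bookkeeping are hypothetical illustrations rather than code from this PR; only the two getters are shown):

import torch
from torch import Tensor
from transformers import GPT2LMHeadModel

class GPT2Accessor(ModelAccessor):
    """Sketch of a GPT-2 implementation of the proposed interface (getters only)."""

    def __init__(self, model: GPT2LMHeadModel, input_ids: torch.Tensor):
        self.model = model
        self.input_ids = input_ids

    def get_block_output(self, position: list[int], layer: int) -> Tensor:
        captured = {}
        handle = self.model.transformer.h[layer].register_forward_hook(
            lambda mod, inp, out: captured.__setitem__("hidden", out[0])
        )
        with torch.no_grad():
            self.model(self.input_ids)
        handle.remove()
        return captured["hidden"][:, position, :]

    def get_head_attn(self, position: list[int], layer: int, head: list[int]) -> Tensor:
        attn = self.model.transformer.h[layer].attn
        captured = {}
        # the per-head attention outputs ("z") are the *input* of c_proj
        handle = attn.c_proj.register_forward_pre_hook(
            lambda mod, inp: captured.__setitem__("z", inp[0])
        )
        with torch.no_grad():
            self.model(self.input_ids)
        handle.remove()
        z = captured["z"][:, position, :]
        # un-merge the hidden dimension into (num_heads, head_dim) and select heads
        z = z.view(*z.shape[:-1], attn.num_heads, attn.head_dim)
        return z[..., head, :]

A GPT-J subclass would presumably follow the same pattern with out_proj as the hook target, which is what makes one accessor per attention implementation feel like the right granularity.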
obvs/patchscope.py
if self.source.head is not None:
    attn = getattr(layer, self.ATTN_SOURCE)
    # TODO may not be .input for other models
    head_act = getattr(attn, self.HEAD_SOURCE).input[0][0]
Why are we using input instead of output? My understanding is that patchscope always uses output, and if a researcher needs an input from layer i, they can access the output from layer i-1.
The problem is that the output of the c_attn layer in GPT2Attention is not the same as the input of c_proj.
c_attn.output gives us the Q, K and V projections concatenated into one tensor. What we want are the attention outputs (sometimes referred to as z-values), which are computed in between the c_attn and c_proj forward calls inside the GPT2Attention object. So they are the input of c_proj, but not the output of c_attn.
See GPT2Attention.forward for reference (https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py#L306)
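To make this concrete, here is a minimal sketch using plain PyTorch hooks (the capture dict and the example prompt are ad hoc for illustration, not code from this PR):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
capture = {}

attn = model.transformer.h[0].attn
# c_attn.output is the fused projection: Q, K and V concatenated along the last dim
attn.c_attn.register_forward_hook(
    lambda mod, inp, out: capture.__setitem__("c_attn_out", out)
)
# c_proj.input is the per-head attention output ("z"), with heads merged back to hidden_size
attn.c_proj.register_forward_pre_hook(
    lambda mod, inp: capture.__setitem__("c_proj_in", inp[0])
)

ids = tok("The quick brown fox", return_tensors="pt").input_ids
with torch.no_grad():
    model(ids)

print(capture["c_attn_out"].shape)  # (1, seq_len, 3 * 768): Q/K/V, not what we want
print(capture["c_proj_in"].shape)   # (1, seq_len, 768): the z-values
# a single head's activation is one head_dim-sized slice of the merged z tensor
head, head_dim = 5, attn.head_dim
z_head = capture["c_proj_in"][..., head * head_dim : (head + 1) * head_dim]
print(z_head.shape)  # (1, seq_len, 64)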
I... see. So c_proj is the equivalent of W^O in the original transformer paper? If so, I agree that the concatenated head outputs would be at .attn.c_proj.input.
I think this PR is logically sound. But do you want to merge #41 first to fix CI before merging this one?
You know what, let's merge this. I'll rebase the other one.
Add support for accessing single-head activations in the source_forward_pass and target_forward_pass methods of Patchscope.
Warning: this currently only works for models with the GPT2Attention implementation (https://github.com/huggingface/transformers/blob/v4.39.2/src/transformers/models/gpt2/modeling_gpt2.py#L123). We probably need one implementation per attention architecture.
Modify activation_patching_ioi.py to create a plot by layer & head.
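For reference, a minimal sketch of the kind of layer-by-head plot meant here (the results array, dimensions and file name are placeholders, not the actual script):

import numpy as np
import matplotlib.pyplot as plt

# results[layer, head] = patching effect for that (layer, head) pair; placeholder data
n_layers, n_heads = 12, 12
results = np.random.rand(n_layers, n_heads)

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(results, aspect="auto", cmap="RdBu", origin="lower")
ax.set_xlabel("Head")
ax.set_ylabel("Layer")
ax.set_title("Activation patching effect by layer and head")
fig.colorbar(im, ax=ax, label="Patching effect")
plt.savefig("activation_patching_by_layer_head.png")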