[computer use] Adjusting 'Only send N most recent images' does not seem to work #118

Open
bmacer opened this issue Oct 25, 2024 · 5 comments

Comments

@bmacer
Contributor

bmacer commented Oct 25, 2024

When I set "Only send N most recent images" to 1, the requests continue to include all previous images.

A simple example is "open terminal":

  1. The first request sends the user instruction and receives the instruction to take a screenshot. All good here.
  2. The second request returns the screenshot data, and the response says to move to the terminal icon and click. All good here.
  3. The third request has the problem. Here is the request object:
{
  "max_tokens": 4096,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "open terminal",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "I'll help you open a terminal. Since we're in a GUI environment, we need to find and click on the terminal icon. Let me take a screenshot first to locate it."
        },
        {
          "id": "toolu_01UVk5YSDhjm2QWqjDZ1PLiB",
          "input": { "action": "screenshot" },
          "name": "computer",
          "type": "tool_use"
        }
      ]
    },
    {
      "content": [
        {
          "type": "tool_result",
          "content": [
            {
              "type": "image",
              "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": "iVWnqNX <snip> zAwOjAw/oHREAAAAABJRU5ErkJggg=="
              }
            }
          ],
          "tool_use_id": "toolu_01UVk5YSDhjm2QWqjDZ1PLiB",
          "is_error": false,
          "cache_control": { "type": "ephemeral" }
        }
      ],
      "role": "user"
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "I can see the terminal icon in the taskbar (it looks like a black box with \">_\" in it). I'll move the cursor to it and click:"
        },
        {
          "id": "toolu_01KUHFhE4vc63WdYkx9ixb81",
          "input": { "action": "mouse_move", "coordinate": [750, 738] },
          "name": "computer",
          "type": "tool_use"
        },
        {
          "id": "toolu_018kAVgzSun9izofvrWozHmS",
          "input": { "action": "left_click" },
          "name": "computer",
          "type": "tool_use"
        }
      ]
    },
    {
      "content": [
        {
          "type": "tool_result",
          "content": [
            {
              "type": "image",
              "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": "iVBORf <snip> RU5ErkJggg=="
              }
            }
          ],
          "tool_use_id": "toolu_01KUHFhE4vc63WdYkx9ixb81",
          "is_error": false
        },
        {
          "type": "tool_result",
          "content": [
            {
              "type": "image",
              "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": "iVBORw0hdGU6bW9 <snip> A6MDBB5sxuAAAAAElFTkSuQmCC"
              }
            }
          ],
          "tool_use_id": "toolu_018kAVgzSun9izofvrWozHmS",
          "is_error": false,
          "cache_control": { "type": "ephemeral" }
        }
      ],
      "role": "user"
    }
  ],
  "model": "claude-3-5-sonnet-20241022",
  "system": [
    {
      "type": "text",
      "text": "<snipped-for-size>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "tools": [
    {
      "name": "computer",
      "type": "computer_20241022",
      "display_width_px": 1024,
      "display_height_px": 768,
      "display_number": 1
    },
    { "type": "bash_20241022", "name": "bash" },
    { "name": "str_replace_editor", "type": "text_editor_20241022" }
  ]
}

The issue is in the three data values, which are for:

  1. The initial screenshot.
  2. The screenshot after the mouse move.
  3. The screenshot after the terminal click.

The expectation is that only the last image is included, while the others are redacted in some way.
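A quick way to check this expectation against a captured request is to count the image blocks nested inside tool_result blocks. The sketch below is only illustrative: the helper name count_tool_result_images and the shortened example_messages (placeholder tool_use_id values, image data elided) are assumptions, not part of the project.

```python
def count_tool_result_images(messages):
    """Count image blocks nested inside tool_result blocks of a messages list."""
    count = 0
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no image blocks
        for block in content:
            if not isinstance(block, dict) or block.get("type") != "tool_result":
                continue
            inner = block.get("content")
            if not isinstance(inner, list):
                continue
            count += sum(
                1
                for item in inner
                if isinstance(item, dict) and item.get("type") == "image"
            )
    return count


# Hypothetical, trimmed-down stand-in for the third request above.
image = {
    "type": "image",
    "source": {"type": "base64", "media_type": "image/png", "data": "<snip>"},
}
example_messages = [
    {"role": "user", "content": [{"type": "text", "text": "open terminal"}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "a", "content": [image]},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "b", "content": [image]},
        {"type": "tool_result", "tool_use_id": "c", "content": [image]},
    ]},
]

print(count_tool_result_images(example_messages))  # 3
```

With "Only send N most recent images" set to 1, the expectation described above would be a count of 1 rather than 3.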

Short video:
https://github.com/user-attachments/assets/cc47139e-7bc4-450a-817f-4f74cf1161f8

@bmacer
Contributor Author

bmacer commented Oct 25, 2024

It seems that min_removal_threshold is causing this:

images_to_remove -= images_to_remove % min_removal_threshold

Images are removed only when images_to_remove is at least min_removal_threshold. So there is no use in setting this value lower than min_removal_threshold, unless my logic is misguided.

For example, in the case of "open terminal", the third request has 3 images in total (screenshot, move cursor, and click). If I set images_to_keep to 2, then images_to_remove is 1. images_to_remove % min_removal_threshold is then 1 % 10, which is 1, so the statement computes 1 - 1 and images_to_remove ends up as 0.
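The arithmetic above can be reproduced in isolation. This is a minimal sketch that wraps the quoted line in a hypothetical helper, chunked_removal_count; the function name and the assumption that images_to_remove starts as total minus images_to_keep are mine, not confirmed from the project's code.

```python
def chunked_removal_count(total_images, images_to_keep, min_removal_threshold):
    """Round the number of images to remove down to a multiple of the threshold."""
    # Assumed starting point: everything beyond the images we want to keep.
    images_to_remove = total_images - images_to_keep
    # The line quoted above: drop the remainder so removal happens in chunks.
    images_to_remove -= images_to_remove % min_removal_threshold
    return images_to_remove


# "open terminal", third request: 3 images, keep 2, threshold 10.
print(chunked_removal_count(3, 2, 10))   # 0, nothing is removed
# Same request with the threshold set to 1.
print(chunked_removal_count(3, 2, 1))    # 1, the oldest image is removed
# With threshold 10, removal only starts once 10 surplus images exist.
print(chunked_removal_count(12, 1, 10))  # 10
```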

A few options for routes to resolution:

  1. Drop min_removal_threshold altogether. What is its intended purpose? What does the comment "for better cache behavior, we want to remove in chunks" intend to mean?
  2. Set min_removal_threshold to 1. This probably has the same result as dropping it altogether, but it changes 1 character instead of a few lines of code.
  3. Don't let the user set a value < 10 if it won't have an effect. Images get expensive quickly, especially for developers who are just poking around. Also, at least for simple "open terminal" kinds of operations, 1 image seems to be sufficient. Setting a value and then finding that it has no effect is unpleasant.

I'd vote for option 2.

@bmacer
Contributor Author

bmacer commented Oct 25, 2024

Testing with image_truncation_threshold = 1 and "Only send N most recent images" set to 2 did indeed work. The first of the 3 images was purged:

image
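The behavior observed in this test can be sketched as a small filter over the messages list. This is an illustration under assumptions, not the project's implementation: strip_oldest_images and remaining_image_data are hypothetical names, the example payload is shortened, and removed images are simply dropped rather than replaced by a placeholder.

```python
import copy


def _is_image(item):
    return isinstance(item, dict) and item.get("type") == "image"


def strip_oldest_images(messages, images_to_keep, min_removal_threshold=1):
    """Return a copy of messages with the oldest tool_result images dropped."""
    messages = copy.deepcopy(messages)  # leave the caller's payload untouched
    tool_results = [
        block
        for message in messages
        if isinstance(message.get("content"), list)
        for block in message["content"]
        if isinstance(block, dict) and block.get("type") == "tool_result"
        and isinstance(block.get("content"), list)
    ]
    total = sum(1 for tr in tool_results for item in tr["content"] if _is_image(item))
    to_remove = max(total - images_to_keep, 0)
    to_remove -= to_remove % min_removal_threshold  # remove in chunks
    for tr in tool_results:
        kept = []
        for item in tr["content"]:
            if _is_image(item) and to_remove > 0:
                to_remove -= 1  # oldest images come first, so drop them first
                continue
            kept.append(item)
        tr["content"] = kept
    return messages


def remaining_image_data(messages):
    """List the data field of every image still present, oldest first."""
    return [
        item["source"]["data"]
        for message in messages
        for block in message["content"]
        if block.get("type") == "tool_result"
        for item in block["content"]
        if _is_image(item)
    ]


def _img(label):
    return {"type": "image", "source": {"type": "base64", "data": label}}


three_images = [
    {"role": "user", "content": [{"type": "tool_result", "content": [_img("first")]}]},
    {"role": "user", "content": [
        {"type": "tool_result", "content": [_img("second")]},
        {"type": "tool_result", "content": [_img("third")]},
    ]},
]

print(remaining_image_data(strip_oldest_images(three_images, 2, 1)))
# ['second', 'third']
print(remaining_image_data(strip_oldest_images(three_images, 2, 10)))
# ['first', 'second', 'third']
```

With a threshold of 1 the first image is purged, as in the test; with a threshold of 10 all three images survive, as in the original report.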

@p-i-

p-i- commented Nov 3, 2024

> What does the comment "for better cache behavior, we want to remove in chunks" intend to mean?

I think I can hazard an answer to this one. If the new HTTP JSON payload (the conversation) is just the previous one with a message added to the end, Claude can cache the previous one, load that into the LLM state, and just run the new tokens through.

Therefore, if you are constantly keeping only the last 5 images, you're defeating the cache mechanism, as EVERY new conversation you send to Claude fails this criterion (there's a change somewhere in the middle, where the 6th-to-last image used to be and has now been removed).

So removing images invalidates the cache, ergo you gotta strike a balance.
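To make the trade-off concrete, here is a rough simulation that counts how often the earlier part of the payload changes under different thresholds. It assumes exactly one new screenshot per turn and treats any change in the number of stripped images as a rewrite of the cached prefix; the function names are hypothetical and the model is deliberately simplified.

```python
def removed_count(total_images, images_to_keep, min_removal_threshold):
    """Number of oldest images stripped, rounded down to a multiple of the threshold."""
    to_remove = max(total_images - images_to_keep, 0)
    return to_remove - to_remove % min_removal_threshold


def prefix_rewrites(turns, images_to_keep, min_removal_threshold):
    """Count turns at which the set of stripped images changes.

    Each such turn alters earlier messages, so a prefix cache built from the
    previous request can no longer be reused in full.
    """
    rewrites = 0
    previous = 0
    for total_images in range(1, turns + 1):  # one new screenshot per turn
        current = removed_count(total_images, images_to_keep, min_removal_threshold)
        if current != previous:
            rewrites += 1
        previous = current
    return rewrites


# 30 turns, keeping the 5 most recent images.
print(prefix_rewrites(30, 5, 1))   # 25: the prefix changes on almost every turn
print(prefix_rewrites(30, 5, 10))  # 2: the prefix changes only when a chunk of 10 is dropped
```

Under these assumptions, a threshold of 1 keeps the payload small but changes the prefix on nearly every turn, while a threshold of 10 sends more images and leaves the prefix stable for longer.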

@bmacer
Contributor Author

bmacer commented Nov 3, 2024

@p-i- not sure I understand...

In my (admittedly limited, MVP-level) usage, I really only need to send the latest image. I certainly don't need to send the first image (the same generic screen everyone sends on their first request) after the first request. I imagine I would only ever need to send the latest image, or maybe two if the request is checking for a change.

@p-i-

p-i- commented Nov 3, 2024

@bmacer Hmm, now that's an interesting idea: calculate the diff between the current and last image, and send the diff together with the new image. Interesting.
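One possible reading of "the diff" is the bounding box of the pixels that changed between two screenshots. The sketch below works on frames already decoded into 2D lists of pixel values; decoding the base64 PNG data is not shown, and changed_region is a hypothetical helper, not something the project provides.

```python
def changed_region(previous, current):
    """Return (left, top, right, bottom), inclusive, of changed pixels, or None."""
    if len(previous) != len(current) or any(
        len(a) != len(b) for a, b in zip(previous, current)
    ):
        raise ValueError("frames must have the same dimensions")
    changed = [
        (x, y)
        for y, (row_prev, row_cur) in enumerate(zip(previous, current))
        for x, (p, c) in enumerate(zip(row_prev, row_cur))
        if p != c
    ]
    if not changed:
        return None  # identical frames, nothing new to describe
    xs = [x for x, _ in changed]
    ys = [y for _, y in changed]
    return (min(xs), min(ys), max(xs), max(ys))


# Two tiny 6x4 "screenshots" that differ in two pixels.
before = [[0] * 6 for _ in range(4)]
after = [row[:] for row in before]
after[1][2] = 1  # row 1, column 2
after[2][4] = 1  # row 2, column 4

print(changed_region(before, after))   # (2, 1, 4, 2)
print(changed_region(before, before))  # None
```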
