I would like to ask how to do a visual grounding (REC) task directly using GPT4v? #41

xiang-xiang-zhu · 2024-06-05T03:08:58Z

Thank you for your work!
Now I would like to give GPT4v an image directly, together with a prompt like “This is an image. Now I need to do the visual grounding task, where you generate the coordinates [x,y,h,w] of a bounding box based on a query.”
But I found that the output is poor; the model even seems to output the coordinates at random. Do I need to preprocess the image first? How should I go about this? Thank you!

abrichr · 2024-06-05T15:16:42Z

Current frontier multimodal models (e.g. GPT4) do not appear to be good at segmenting images.

At https://github.com/OpenAdaptAI/OpenAdapt we use Ultralytics FastSAM to run segmentation first with good results. See e.g. OpenAdaptAI/OpenAdapt#610 (scroll down to see images).

edit: https://x.com/openadaptai/status/1798502003045548480
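
A minimal sketch of the "segment first" idea, under stated assumptions: a segmenter (FastSAM or any other) has already produced binary masks, and the multimodal model is then asked to pick one of the resulting candidate boxes by index rather than to produce coordinates itself. The helper names `mask_to_box` and `build_choice_prompt`, the nested-list mask representation, and the prompt wording are all hypothetical and are not taken from OpenAdapt or Ultralytics. Boxes use the [x,y,h,w] order from the question.

```python
from typing import List, Tuple

Mask = List[List[int]]           # binary mask rows; nonzero = pixel in segment
Box = Tuple[int, int, int, int]  # (x, y, h, w) in pixels, order as in the question


def mask_to_box(mask: Mask) -> Box:
    """Return the tight bounding box (x, y, h, w) around a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        raise ValueError("mask contains no foreground pixels")
    x0, x1 = min(cols), max(cols)
    y0, y1 = rows[0], rows[-1]
    return (x0, y0, y1 - y0 + 1, x1 - x0 + 1)


def build_choice_prompt(query: str, boxes: List[Box]) -> str:
    """Build a prompt asking the model to choose a candidate box by index."""
    lines = [
        f"{i}: x={x}, y={y}, h={h}, w={w}"
        for i, (x, y, h, w) in enumerate(boxes)
    ]
    return (
        "Candidate regions found by a segmentation model:\n"
        + "\n".join(lines)
        + f"\nQuery: {query}\n"
        + "Answer with the index of the region that best matches the query."
    )


# Toy masks standing in for real segmenter output.
masks = [
    [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]],
    [[1, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
]
boxes = [mask_to_box(m) for m in masks]
prompt = build_choice_prompt("the wide bar at the top", boxes)
```

The resulting `prompt` would be sent along with the image; the chosen index maps back to a box computed from the mask, so the coordinates never depend on the language model's numeric output.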
