Hi, I notice in configs/clm_models/llm_seed_x_lora.yaml that special patch tokens are used to tag the beginning and end of patches, which represent the split original image in raster-scan order. According to the paper, image patches are added to the multi-modal inputs to enable any-resolution image generation, which is also a novelty of SEED-X.
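A minimal sketch of how I picture the layout (the helper name, the 2x2 patching, and the ordering of patch views vs. the global view are my assumptions, not taken from the repo):

```python
# Hypothetical illustration: an any-resolution image split into 2x2 patches
# (raster-scan order) plus a resized global view, each tagged with the
# special tokens from the vocabulary. Whether <patch>/</patch> wraps each
# patch individually, as shown here, is an assumption.
IMG_SLOTS = "".join(f"<img_{i:05d}>" for i in range(64))  # 64 visual token slots

def tag_image(num_patches: int = 4) -> str:
    parts = []
    for _ in range(num_patches):  # top-left, top-right, bottom-left, bottom-right
        parts.append(f"<patch>{IMG_SLOTS}</patch>")
    parts.append(f"<img>{IMG_SLOTS}</img>")  # global (resized) view of the whole image
    return "".join(parts)
```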
However, I have the same question about the 100 visual tokens. I found that 64 tokens are used to represent an image, but vocab_size is actually 32330. Maybe the extra tokens were reserved to explore whether adding more visual tokens improves performance.
Looking forward to your reply!
Yes, patch tokens are used to tag the beginning and end of image patches. @AsteriaCao's understanding is correct.
We actually use N=64 visual tokens to represent an image; the remaining 36 visual tokens are not used during training.
For the ablation study on the number of visual tokens used to represent an image, you can refer to https://github.com/geyuying/SEED-X. We find that more visual tokens lead to better image reconstruction by the image de-tokenizer, but worse regression by the MLLM.
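A minimal sketch of the 64-of-100 split, assuming (for illustration only) that the learnable visual queries live in a single embedding tensor; `query_embeds` and the 4096 hidden size are made-up names/values:

```python
import torch

NUM_RESERVED = 100  # <img_00000> ... <img_00099> ids reserved in the vocabulary
NUM_USED = 64       # N=64 tokens actually used to represent an image

# Stand-in for the learnable visual query embeddings (name is illustrative).
query_embeds = torch.randn(NUM_RESERVED, 4096)

# Only the first 64 slots are trained/regressed; the remaining 36 ids
# simply sit unused in the vocabulary.
used = query_embeds[:NUM_USED]
assert used.shape == (64, 4096)
```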
Hi, congrats on your interesting work!
I have a question about the mismatch in the number of tokens between the paper and the implementation.
As far as I understand while reading the paper:
LLaMA has 32,000 tokens, and SEED-X adds 290 new tokens for LLaMA as described in the paper (224 <loc-*> tokens, the <img> and </img> boundary tokens, and 64 visual tokens):
(Total: 290 tokens)
But as I run the code, I found that there are actually 330 added tokens:
- <loc-0> ... <loc-223> (224 tokens)
- <img> and </img> (2 tokens)
- <img_00000> ... <img_00099> (100 tokens)
- <box_start> and <box_end> (2 tokens)
- <patch> and </patch> (2 tokens)
(Total: 330 tokens!)
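For what it's worth, the 330 count can be reproduced with a quick tokenizer sketch (the checkpoint path is a placeholder; this only mirrors the list above, not the repo's actual setup code):

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama")  # vocab_size == 32000

new_tokens = (
    [f"<loc-{i}>" for i in range(224)]        # 224 location tokens
    + ["<img>", "</img>"]                      # 2 image boundary tokens
    + [f"<img_{i:05d}>" for i in range(100)]   # 100 visual token ids
    + ["<box_start>", "<box_end>"]             # 2 bounding-box tokens
    + ["<patch>", "</patch>"]                  # 2 patch boundary tokens
)
print(tokenizer.add_tokens(new_tokens, special_tokens=True))  # 330
print(len(tokenizer))                                         # 32330
```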
My questions are:
1. Why does the implementation add 330 tokens rather than the 290 described in the paper?
2. What are the <patch> and </patch> tokens for, and why are 100 <img_000xx> tokens reserved when only 64 are used to represent an image?
Thank you so much!