
Questions about number of tokens in tokenizer~ #18

Open
thaoshibe opened this issue Jul 8, 2024 · 2 comments

Comments

thaoshibe commented Jul 8, 2024

Hi, congrats on your interesting work!
I have a question about the mismatch in the number of tokens between the paper and the implementation.

As far as I understand from reading the paper:
LLaMA has 32,000 tokens, and SEED-X adds 290 new tokens for LLaMA, as detailed below:

  • 224 bbox tokens
  • 2 special tokens and
  • 64 visual tokens
    (Total: 290 tokens)

But when I ran the code, I found that there are actually 330 added tokens (see the counting sketch after this list):

  • 224 bbox tokens (<loc-0> ... <loc-223>)
  • 2 special tokens for image generation: <img> and </img>
  • 100 visual tokens (<img_00000> ... <img_00099>)
  • 2 special tokens for bbox: <box_start> and <box_end>
  • 2 patch tokens: <patch> and </patch>
    (Total: 330 tokens!)
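
For reference, here is a minimal counting sketch of how the 330 breaks down (token names are taken from the list above; this is illustrative only, not code from the repo):

```python
# Illustrative counting only; token names follow the breakdown above.
bbox_locs     = [f"<loc-{i}>" for i in range(224)]       # 224 bbox location tokens
img_delims    = ["<img>", "</img>"]                       # 2 image-generation delimiters
visual_tokens = [f"<img_{i:05d}>" for i in range(100)]    # 100 visual tokens
box_delims    = ["<box_start>", "<box_end>"]              # 2 bbox delimiters
patch_delims  = ["<patch>", "</patch>"]                   # 2 patch delimiters

added = bbox_locs + img_delims + visual_tokens + box_delims + patch_delims
print(len(added))  # 330
```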

My questions are:

  1. What's the use of the patch tokens?
  2. How many visual tokens (in other words, image tokens) are actually added to the final model? (N = 100 or N = 64?)
  3. Do you have any comment on the number of image tokens? (e.g., do you expect more image tokens to be better, say N = 100 rather than N = 64?)

Thank you so much!


AsteriaCao commented Jul 9, 2024

Hi, I notice in configs/clm_models/llm_seed_x_lora.yaml that the special patch tokens are used to tag the beginning and end of image patches, which represent the original image split in raster-scan order. According to the paper, image patches are added to the multimodal inputs to enable any-resolution image generation, which is also a novelty of SEED-X.

However, I have the same question about the 100 visual tokens. I found that 64 tokens are used to represent an image, but the vocab_size is actually 32330. Maybe the extra tokens were reserved to explore whether adding more visual tokens improves performance.
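
For what it's worth, here is how I checked it (a sketch assuming the tokenizer loads via Hugging Face transformers; the checkpoint path below is a placeholder):

```python
from transformers import LlamaTokenizer

# Placeholder path; substitute the actual SEED-X tokenizer directory.
tokenizer = LlamaTokenizer.from_pretrained("path/to/seed-x-tokenizer")
print(len(tokenizer))  # 32330 = 32000 base LLaMA tokens + 330 added tokens

# All 100 <img_xxxxx> tokens exist in the vocabulary ...
visual_ids = tokenizer.convert_tokens_to_ids([f"<img_{i:05d}>" for i in range(100)])
print(sum(i != tokenizer.unk_token_id for i in visual_ids))  # 100
# ... even though only 64 of them seem to be used per image.
```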
Looking forward to your reply!

geyuying (Collaborator) commented

  1. Yes, patch tokens are used to tag the beginning and end of image patches. @AsteriaCao's understanding is correct.

  2. We actually use N = 64 visual tokens to represent an image; the remaining 36 visual tokens are not used during training (see the sketch after this list).

  3. For the ablation study on the number of visual tokens used to represent an image, you can refer to https://github.com/geyuying/SEED-X. We find that more visual tokens lead to better image reconstruction for the image de-tokenizer, but to worse regression for the MLLM.
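
To make point 2 above concrete, a rough sketch of what an image span would look like with N = 64 placeholder visual tokens between the <img> and </img> delimiters (illustrative layout only, not the exact code from the repo):

```python
# Illustrative only: an image is represented by 64 visual placeholder tokens
# wrapped in <img> ... </img>; <img_00064> through <img_00099> go unused.
N_QUERY = 64
image_span = "<img>" + "".join(f"<img_{i:05d}>" for i in range(N_QUERY)) + "</img>"
print(image_span.count("<img_"))  # 64
```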
