Improve Tokenizer Class: Error Handling, Flexibility #640

Open · wants to merge 3 commits into base: main
Conversation

zeelsheladiya

Input Validation and Error Handling:

  • Added input validation checks for the bos and eos parameters in the encode method to ensure they are boolean values.
  • Enhanced error messages to give better context for debugging (see the sketch below).
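
A minimal sketch of that validation, assuming the encode signature used in this PR; the exact error message wording is illustrative, not the PR's exact code:

```python
from typing import List


# Method of the Tokenizer class (sketch).
def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
    # Reject non-boolean flags early with a descriptive error.
    if not isinstance(bos, bool) or not isinstance(eos, bool):
        raise TypeError(
            f"bos and eos must be booleans, got "
            f"bos={type(bos).__name__}, eos={type(eos).__name__}"
        )
    ...
```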

Flexible Model Loading:

  • Modified the constructor of the Tokenizer class to optionally accept a model path or URL.
  • Users can now load models from URLs or Hugging Face model identifiers, making the class more versatile across deployment scenarios (a sketch follows below).
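
A minimal sketch of how URL loading might look; the download-to-a-temp-file approach is an assumption for illustration, not necessarily the PR's exact code:

```python
import os
import tempfile
import urllib.request
from typing import Optional

from sentencepiece import SentencePieceProcessor


class Tokenizer:
    """tokenizing and encoding/decoding text using SentencePiece."""

    def __init__(self, model_path: Optional[str] = None):
        if model_path is None:
            raise ValueError("model_path (a local file or URL) is required")
        if model_path.startswith(("http://", "https://")):
            # Download remote models to a temporary file before loading.
            fd, local_path = tempfile.mkstemp(suffix=".model")
            os.close(fd)
            urllib.request.urlretrieve(model_path, local_path)
            model_path = local_path
        assert os.path.isfile(model_path), model_path
        self.sp_model = SentencePieceProcessor(model_file=model_path)
        # Vocabulary bookkeeping, as in the original tokenizer.
        self.n_words: int = self.sp_model.vocab_size()
        self.bos_id: int = self.sp_model.bos_id()
        self.eos_id: int = self.sp_model.eos_id()
        self.unk_id: int = self.sp_model.unk_id()
        self.pad_id: int = self.sp_model.pad_id()
```

Under these assumptions, both Tokenizer("tokenizer.model") and Tokenizer("https://example.com/tokenizer.model") would work; resolving a bare Hugging Face identifier would need additional logic (e.g. via huggingface_hub), which is not shown here.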

Handling Unknown Tokens:

  • Improved tokenization by handling unknown tokens using SentencePiece's unk_id. Tokens outside the vocabulary range are replaced with the <UNK> token.

These changes contribute to the overall reliability and usability of the Tokenizer class, enabling smoother integration into various projects.

@facebook-github-bot

Hi @zeelsheladiya!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@facebook-github-bot added the CLA Signed label on Aug 8, 2023.

Thakhun referenced this pull request Aug 31, 2023
meta-llama#640: In this commit, I have addressed the feedback from the code review by replacing instances of `Union[str, None]` with `Optional[str]` in the Tokenizer class. This change follows the Python typing documentation's recommendation for clearer type hints.
@zeelsheladiya
Author

Hi @DamienAllonsius ,

I hope you're doing well! I wanted to let you know that I've made the requested change to the code. I replaced the usage of Union[str, None] with Optional[str] as per your feedback.

You can review the updated code in the latest commit. I'd greatly appreciate it if you could take a look and let me know if everything looks good now. If you have any further suggestions or feedback, please feel free to share them.

Thank you for your time and guidance throughout this process. I look forward to hearing your thoughts on the changes.

@DamienAllonsius

Hi @zeelsheladiya, I resolved the conflicts with main.
Any ideas on how we could make SentencePiece.encode return an unk_id token?

@zeelsheladiya
Author

Hi @DamienAllonsius,

As I understand it, the SentencePieceProcessor.encode method returns an out-of-vocabulary piece as a separate token rather than mapping it to unk_id. If we want SentencePiece.encode to return unk_id when it encounters an unknown token, we can modify the encode method.

We can use a list comprehension to check each token ID returned by self.sp_model.encode(s): if the ID is within the valid vocabulary range (i.e., a known token), we keep it as is; otherwise we replace it with self.unk_id, effectively mapping unknown tokens to unk_id (see the sketch below).

Please let me know your thoughts. :)
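
A minimal sketch of that change, assuming the encode method and the n_words, unk_id, bos_id, and eos_id attributes of the existing Tokenizer class:

```python
from typing import List


# Method of the Tokenizer class (sketch).
def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
    assert isinstance(s, str)
    t = self.sp_model.encode(s)
    # Replace any id outside [0, n_words) with unk_id.
    t = [tid if 0 <= tid < self.n_words else self.unk_id for tid in t]
    if bos:
        t = [self.bos_id] + t
    if eos:
        t = t + [self.eos_id]
    return t
```

This matches the comprehension shown in the diff below.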

@DamienAllonsius left a comment

LGTM. I tested the code but could not generate any character with unk_id encoding.

llama/tokenizer.py
@@ -13,29 +13,32 @@
 class Tokenizer:
     """tokenizing and encoding/decoding text using SentencePiece."""
-    def __init__(self, model_path: str):
+    def __init__(self, model_path: Optional[str] = None):
Contributor

Why make this optional? How would you initialize the tokenizer without the model file?

Comment on lines +59 to +60:

+        # Handle unknown tokens
+        t = [token_id if token_id in range(self.n_words) else self.unk_id for token_id in t]
Contributor

Would you have an example of an input string that requires this?

Comment on lines +54 to +57:

+        try:
+            t = self.sp_model.encode(s)
+        except Exception as e:
+            raise ValueError(f"Error during tokenization: {e}")
Contributor

I don't think we need this; the exception itself should be clear enough?
