Improve Tokenizer Class: Error Handling, Flexibility #640
base: main
Conversation
Hi @zeelsheladiya! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with the CLA label.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
meta-llama#640 In this commit, I have addressed the feedback from the code review by replacing instances of `Union[str, None]` with `Optional[str]` in the Tokenizer class. This change aligns with the Python Typing documentation's recommendation for better type hinting.
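For reference, a minimal illustration of the type-hint change described in that commit; the class body here is a stub for illustration, not the actual code from this PR:

```python
from typing import Optional, Union


class TokenizerStub:
    # Optional[str] is shorthand for Union[str, None] and is the form
    # recommended by the typing docs for parameters that may be None.
    def __init__(self, model_path: Optional[str] = None):
        self.model_path = model_path


# The two annotations are interchangeable at runtime:
assert Optional[str] == Union[str, None]
```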
Hi @DamienAllonsius, I hope you're doing well! I wanted to let you know that I've made the requested change to the code: I replaced the usage of Union[str, None] with Optional[str] as per your feedback. I'd greatly appreciate it if you could review the updated code and let me know if everything looks good to you now. If you have any further suggestions or feedback, please feel free to share them. Thank you for your time and guidance throughout this process. I look forward to hearing your thoughts on the changes.
Hi @zeelsheladiya, I resolved the conflicts with main.
Hi @DamienAllonsius, as per my understanding, `self.sp_model.encode(s)` may return token IDs that fall outside the vocabulary range. We can use a list comprehension to check each token ID returned by `encode` and replace any out-of-range ID with `unk_id`, as in the sketch below. Please let me know your thoughts. :)
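A minimal sketch of that list-comprehension approach, assuming `t` is the list of token IDs from `sp_model.encode(s)` and that `n_words` and `unk_id` are defined as in the diff below; the helper name and the example values are hypothetical:

```python
from typing import List


def replace_unknown_tokens(t: List[int], n_words: int, unk_id: int) -> List[int]:
    """Map any token ID outside the vocabulary range [0, n_words) to unk_id."""
    return [token_id if 0 <= token_id < n_words else unk_id for token_id in t]


# Example: with a 32_000-token vocabulary and unk_id = 0,
# an out-of-range ID such as 99_999 is replaced by 0.
assert replace_unknown_tokens([1, 99_999, 5], n_words=32_000, unk_id=0) == [1, 0, 5]
```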
LGTM, tested the code but could not generate any character with unk_id encoding
@@ -13,29 +13,32 @@
 class Tokenizer:
     """tokenizing and encoding/decoding text using SentencePiece."""
-    def __init__(self, model_path: str):
+    def __init__(self, model_path: Optional[str] = None):
Why make this optional? How would you initialize the tokenizer without the model file?
+        # Handle unknown tokens
+        t = [token_id if token_id in range(self.n_words) else self.unk_id for token_id in t]
Would you have an example of an input string that requires this?
+        try:
+            t = self.sp_model.encode(s)
+        except Exception as e:
+            raise ValueError(f"Error during tokenization: {e}")
I don't think we need this, the exception itself should be clear enough?
- Input Validation and Error Handling: The `bos` and `eos` parameters in the `encode` method are validated to ensure they are boolean values.
- Flexible Model Loading: The `model_path` argument is now `Optional`, so a `Tokenizer` can be constructed without pointing it at a model file up front.
- Handling Unknown Tokens: Token IDs are checked against `unk_id`. Tokens outside the vocabulary range are replaced with the `<UNK>` token.

These changes contribute to the overall reliability and usability of the Tokenizer class, enabling smoother integration into various projects; a sketch of the described behaviour follows below.