first step of modular tokenizer refactor #141

Open · wants to merge 18 commits into base: main

Conversation

@mosheraboh (Collaborator) commented Sep 1, 2024

  • Fixed CI/CD together with Sagi
  • Detached the pretrained tokenizers from the modular tokenizer code
  • Renamed the main objects, while keeping the old names for backward compatibility

The next step is to move the code to fuse-med-ml (the pretrained tokenizers will stay in fuse-drug, or perhaps move to bmfm-core). We also need to re-add some tests that were removed in this PR.

all_special_tokens = (
    list(special_tokens_list) if special_tokens_list is not None else []
)
if additional_tokens_list is not None:
    all_special_tokens += additional_tokens_list
Collaborator

Don't we lose the scenario in which all_special_tokens was list(self.special_tokens_dict.values()) and then += additional_tokens_list?

Collaborator

oh, I see it's an argument that was removed from the signature
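
For reference, a minimal sketch (helper name hypothetical) of the new combination logic shown above, which starts from the explicit special_tokens_list argument rather than self.special_tokens_dict:

from typing import List, Optional

def combine_special_tokens(
    special_tokens_list: Optional[List[str]] = None,
    additional_tokens_list: Optional[List[str]] = None,
) -> List[str]:
    # Start from the explicit special-token list (self.special_tokens_dict is no
    # longer part of the signature), then append any additional tokens.
    all_special_tokens = (
        list(special_tokens_list) if special_tokens_list is not None else []
    )
    if additional_tokens_list is not None:
        all_special_tokens += additional_tokens_list
    return all_special_tokens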

@@ -992,7 +975,7 @@ def count_unknowns(
            f"Unexpected type of encoding {type(encoding)}, should be list or Encoding"
        )
    if unk_token is None:
        unk_token = special_wrap_input(special_tokens["unk_token"])
Collaborator

why do you prefer to hardcode it? can't it be different for different (sub) tokenizers?

Collaborator Author

The special_tokens dict was also hardcoded.
We (I 😅) need to add such an argument to the encode method (like we do with padding).
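
A hypothetical sketch of that follow-up, mirroring how padding is passed (simplified to a free function; the default token value is a placeholder): callers could pass the unknown token used by a specific sub-tokenizer instead of relying on the hardcoded value.

from typing import Iterable, Optional

DEFAULT_UNK_TOKEN = "<UNK>"  # placeholder for the currently hardcoded default

def count_unknowns(tokens: Iterable[str], unk_token: Optional[str] = None) -> int:
    # Hypothetical signature: if the caller does not pass unk_token, fall back to
    # the default; otherwise count unknowns using the per-sub-tokenizer value.
    if unk_token is None:
        unk_token = DEFAULT_UNK_TOKEN
    return sum(1 for token in tokens if token == unk_token)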

@@ -34,7 +34,7 @@ model:
   target_featurizer: ProtBertFeaturizer
   # possible values: SimpleCoembedding, others (not tested): GoldmanCPI, SimpleCosine, AffinityCoembedInner, CosineBatchNorm, LSTMCosine, DeepCosine,
   # SimpleConcat, SeparateConcat, AffinityEmbedConcat, AffinityConcatLinear
-  model_architecture: SimpleCoembedding
+  model_architecture: deepcoembedding
Collaborator

Is this intentional in this PR?

Collaborator Author

I'll revert it; we ended up disabling this example because it failed.

Collaborator

@YoelShoshan left a comment

LGTM

Ideally @floccinauc should also review

@@ -1,5 +1,5 @@
 paths:
-  tokenizers_path: "${oc.env:MY_GIT_REPOS}/fuse-drug/fusedrug/data/tokenizer/modulartokenizer/pretrained_tokenizers/" # tokenizer base work path
+  tokenizers_path: # tokenizer base work path
Collaborator

Is this legit YAML? Is this correct?

Collaborator Author

I was expecting that someone would set it before running the script.
But maybe it's better to set it to null and assert that it's not None at the beginning of the script. Setting it to null would still allow overriding it from the command line.
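
A minimal sketch of that suggestion (config file and script names assumed; the key name is taken from the snippet above): set the value to null in the YAML and fail fast in the script if it was not overridden from the command line.

# config.yaml (sketch)
# paths:
#   tokenizers_path: null  # tokenizer base work path; override via CLI

import hydra
from omegaconf import DictConfig

@hydra.main(config_path=".", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    assert cfg.paths.tokenizers_path is not None, (
        "paths.tokenizers_path must be set, e.g. "
        "python tokenizer_script.py paths.tokenizers_path=/path/to/pretrained_tokenizers"
    )
    ...

if __name__ == "__main__":
    main()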

@@ -38,16 +28,7 @@ def main(cfg: DictConfig) -> None:

     t_mult = ModularTokenizer.load(path=cfg_raw["data"]["tokenizer"]["in_path"])

-    test_tokenizer(t_mult, cfg_raw=cfg_raw, mode="loaded_path")
Collaborator

This is a sanity check for the tokenizer, intended to catch problems early, before they hit hard downstream. Why remove this?

Collaborator Author

Because what belongs here is a generic test, and this specific test is not in the right place.
For example, the script will not necessarily add a sub-tokenizer to a modular tokenizer that already contains the AA sub-tokenizer.
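
A hypothetical sketch (import path assumed, pytest-style cfg_raw fixture assumed) of where a generic check could live instead, decoupled from the add-subtokenizer script:

from fusedrug.data.tokenizer.modulartokenizer.modular_tokenizer import ModularTokenizer  # import path assumed

def test_loaded_modular_tokenizer(cfg_raw: dict) -> None:
    # Load the pretrained modular tokenizer and make a minimal generic assertion;
    # the fuller test_tokenizer(..., mode="loaded_path") check from the script
    # could be moved here as well.
    t_mult = ModularTokenizer.load(path=cfg_raw["data"]["tokenizer"]["in_path"])
    assert t_mult is not None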

Collaborator

@matanninio left a comment

Not very comfortable with dropping the tests.
If I got this right, the new way of adding special tokens does not allow recreating the tokenizer from the original tokenizers; it requires that the extra tokens be added in just the right order, or one will need to inject the special-token list into the first tokenizer.

Collaborator

@floccinauc left a comment

Looks great!

@@ -1434,15 +1418,7 @@ def update_vocab(
         # operations on the tokenizer instance (if possible, operations should be done here, using built-in tokenizer methods)
         json_str = json.dumps(t_json)
         tokenizer_inst = Tokenizer.from_str(json_str)
-        if self.special_tokens_dict is not None:
Collaborator

This is a hopefully useless sanity check, but I still think we should keep it in case something goes horribly wrong (e.g. incompatible sub-tokenizers were somehow loaded).

Collaborator Author

Got it, will add it back
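
A hedged sketch of one way the check could come back (the original body of the if-block is not shown in this hunk, so the verification below is illustrative): after rebuilding the sub-tokenizer with Tokenizer.from_str, confirm that every expected special token resolves to an id in its vocabulary.

if self.special_tokens_dict is not None:
    # Illustrative sanity check: every special token the modular tokenizer expects
    # should exist in the rebuilt sub-tokenizer's vocabulary.
    missing = [
        tok
        for tok in self.special_tokens_dict.values()
        if tokenizer_inst.token_to_id(tok) is None
    ]
    if missing:
        raise ValueError(f"special tokens missing from sub-tokenizer vocab: {missing}")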

@@ -20,8 +20,8 @@
     OpToUpperCase,
     OpKeepOnlyUpperCase,
 )
-from fusedrug.data.tokenizer.ops import (
-    FastTokenizer,
+from fusedrug.data.tokenizer.ops.tokenizer_op import TokenizerOp
Collaborator

Maybe it's worth going over all files that import FastTokenizer and replacing the import with
from fusedrug.data.tokenizer.ops.tokenizer_op import TokenizerOp as FastTokenizer?
This way the PR will not affect others' work.

Collaborator Author

I did, but I also kept the previous name, so even if we keep using the original import it should still work.
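
A minimal sketch (module layout assumed) of the backward-compatibility alias described here: the renamed class is also exposed under its old name, so both the old and the new imports keep working.

# fusedrug/data/tokenizer/ops/tokenizer_op.py (sketch)
class TokenizerOp:
    """Tokenizer op, previously named FastTokenizer."""
    ...

# Keep the old name as an alias so existing imports of FastTokenizer still work:
FastTokenizer = TokenizerOp

# fusedrug/data/tokenizer/ops/__init__.py (sketch)
# from .tokenizer_op import TokenizerOp, FastTokenizer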
