first step of modular tokenizer refactor #141
base: main
Conversation
…tokenizer_refactor
if additional_tokens_list is not None:
    all_special_tokens += additional_tokens_list
all_special_tokens = (
    list(special_tokens_list) if special_tokens_list is not None else []
Don't we lose the scenario in which all_special_tokens was list(self.special_tokens_dict.values()) and then += additional_tokens_list?
oh, I see it's an argument that was removed from the signature
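A minimal sketch of the merge logic under discussion (the function name and defaults are hypothetical, not the actual signature in the PR), showing how the base list and the optional extra tokens combine:

```python
def build_all_special_tokens(special_tokens_list=None, additional_tokens_list=None):
    # Base list comes from the explicit argument kept in the new signature.
    all_special_tokens = (
        list(special_tokens_list) if special_tokens_list is not None else []
    )
    # The branch from the diff appends the extra tokens on top of the base list.
    if additional_tokens_list is not None:
        all_special_tokens += additional_tokens_list
    return all_special_tokens
```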
@@ -992,7 +975,7 @@ def count_unknowns(
            f"Unexpected type of encoding {type(encoding)}, should be list or Encoding"
        )
    if unk_token is None:
        unk_token = special_wrap_input(special_tokens["unk_token"])
Why do you prefer to hardcode it? Can't it be different for different (sub)tokenizers?
The special_tokens was also hardcoded.
We (I 😅 ) need to add such an argument to the encode method (like we do with padding)
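A sketch of the per-call argument being suggested (hypothetical names and defaults, not the real API): the unknown token is resolved per call, with the hardcoded value only as a fallback, mirroring how padding options are passed to encode:

```python
def count_unknowns(token_ids, unk_id=None, default_unk_id=0):
    # Resolve the unknown-token id per call; fall back to the hardcoded
    # default only when the caller did not pass one.
    if unk_id is None:
        unk_id = default_unk_id
    return sum(1 for t in token_ids if t == unk_id)
```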
@@ -34,7 +34,7 @@ model:
  target_featurizer: ProtBertFeaturizer
  # possible values: SimpleCoembedding, others (not tested): GoldmanCPI, SimpleCosine, AffinityCoembedInner, CosineBatchNorm, LSTMCosine, DeepCosine,
  # SimpleConcat, SeparateConcat, AffinityEmbedConcat, AffinityConcatLinear
- model_architecture: SimpleCoembedding
+ model_architecture: deepcoembedding
Is this intentional in this PR?
I'll revert it; we ended up disabling this example because it failed.
LGTM
Ideally @floccinauc should also review
@@ -1,5 +1,5 @@
paths:
- tokenizers_path: "${oc.env:MY_GIT_REPOS}/fuse-drug/fusedrug/data/tokenizer/modulartokenizer/pretrained_tokenizers/" #tokenizer base work path
+ tokenizers_path: #tokenizer base work path
Is this legit YAML? Is this correct?
I was expecting someone to set it before running the script.
But maybe it's better to set it to null and assert that it's not None at the beginning of the script. Setting it to null will allow overriding it from the command line.
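For what it's worth, a key with no value ("tokenizers_path:") is legal YAML and parses to null, which Hydra/OmegaConf loads as None. A minimal sketch of the fail-fast check being proposed (function name and error message are hypothetical; shown on a plain dict rather than a DictConfig to keep it self-contained):

```python
def validate_tokenizers_path(cfg: dict) -> str:
    # "tokenizers_path:" with no value in the YAML parses to None,
    # so fail fast unless it was overridden (e.g. on the command line).
    path = cfg.get("paths", {}).get("tokenizers_path")
    if path is None:
        raise ValueError("paths.tokenizers_path must be set before running the script")
    return path
```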
Co-authored-by: Matan Ninio <[email protected]>
@@ -38,16 +28,7 @@ def main(cfg: DictConfig) -> None:
    t_mult = ModularTokenizer.load(path=cfg_raw["data"]["tokenizer"]["in_path"])

    test_tokenizer(t_mult, cfg_raw=cfg_raw, mode="loaded_path")
This is a sanity check for the tokenizer, intended to catch problems before they hit hard. Why remove this?
Because it should be a generic test here, and this specific test is not in the right place.
For example, the script will not necessarily add a subtokenizer to a modular tokenizer that already contains an AA subtokenizer.
I'm not very comfortable with dropping the tests.
If I got this right, the new way of adding special tokens does not allow recreating the tokenizer from the original tokenizers; it requires that the extra tokens be added in just the right order, or one will need to inject the special token list into the first tokenizer.
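A sketch of the check that would catch this ordering concern (hypothetical helper, not code from the PR): every special token should map to the same id in every subtokenizer, and adding extra tokens to subtokenizers in different orders would break that.

```python
def check_special_token_ids_aligned(subtokenizer_vocabs, special_tokens):
    # subtokenizer_vocabs: mapping of subtokenizer name -> {token: id}.
    # Raise if any special token maps to different ids across subtokenizers,
    # which is what a wrong insertion order would produce.
    for tok in special_tokens:
        ids = {name: vocab.get(tok) for name, vocab in subtokenizer_vocabs.items()}
        if len(set(ids.values())) > 1:
            raise ValueError(f"special token {tok!r} has inconsistent ids: {ids}")
```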
Looks great!
@@ -1434,15 +1418,7 @@ def update_vocab(
        # operations on the tokenizer instance (if possible, operations should be done here, using built-in tokenizer methods)
        json_str = json.dumps(t_json)
        tokenizer_inst = Tokenizer.from_str(json_str)
    if self.special_tokens_dict is not None:
This is a hopefully useless sanity check, but I still think we should keep it in case something went horribly wrong (e.g. incompatible subtokenizers were somehow loaded)
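A minimal sketch in the spirit of that sanity check (hypothetical function name, not the removed code itself): verify every expected special token is present in the vocab, so incompatible subtokenizers fail loudly instead of silently.

```python
def assert_special_tokens_in_vocab(special_tokens_dict, vocab):
    # special_tokens_dict: e.g. {"unk_token": "<UNK>", "pad_token": "<PAD>"}.
    # Raise if any special token is absent from the tokenizer's vocab.
    missing = [t for t in special_tokens_dict.values() if t not in vocab]
    if missing:
        raise ValueError(f"special tokens missing from vocab: {missing}")
```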
Got it, will add it back
@@ -20,8 +20,8 @@
    OpToUpperCase,
    OpKeepOnlyUpperCase,
)
- from fusedrug.data.tokenizer.ops import (
-     FastTokenizer,
+ from fusedrug.data.tokenizer.ops.tokenizer_op import TokenizerOp
Maybe it's worth going over all files that import FastTokenizer and replacing the import with
from fusedrug.data.tokenizer.ops.tokenizer_op import TokenizerOp as FastTokenizer?
This way the PR will not affect others' work.
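A self-contained illustration of the suggested compatibility alias (the class body here is a stand-in; only the aliasing pattern matters):

```python
class TokenizerOp:
    """Stand-in for fusedrug.data.tokenizer.ops.tokenizer_op.TokenizerOp."""
    pass

# Equivalent in effect to:
#   from fusedrug.data.tokenizer.ops.tokenizer_op import TokenizerOp as FastTokenizer
# Old call sites importing FastTokenizer keep working unchanged.
FastTokenizer = TokenizerOp
```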
I did, but I also kept the previous name so even if we keep using the original import it should work.
Next step is to move the code to fuse-med-ml (the pre-trained tokenizers will stay in fuse-drug, or maybe should move to bmfm-core). We also need to re-add some tests that were removed in this PR.