first step of modular tokenizer refactor #141

Open · wants to merge 18 commits into base: main

Conversation

@mosheraboh (Collaborator) commented Sep 1, 2024

  • Fixed CI/CD together with Sagi
  • Detached the pretrained tokenizers from the modular tokenizer code
  • Renamed the main objects, while keeping the old names for backward compatibility

The next step is to move the code to fuse-med-ml (the pretrained tokenizers will stay in fuse-drug, or perhaps move to bmfm-core). We also need to re-add some tests that were removed in this PR.

all_special_tokens = (
    list(special_tokens_list) if special_tokens_list is not None else []
)
if additional_tokens_list is not None:
    all_special_tokens += additional_tokens_list
Collaborator

Don't we lose the scenario in which all_special_tokens was list(self.special_tokens_dict.values()) and then += additional_tokens_list?

Collaborator

oh, I see it's an argument that was removed from the signature
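
For reference, a minimal sketch (helper name hypothetical) of the new combination logic shown above, which starts from the explicit special_tokens_list argument rather than self.special_tokens_dict:

from typing import List, Optional

def combine_special_tokens(
    special_tokens_list: Optional[List[str]] = None,
    additional_tokens_list: Optional[List[str]] = None,
) -> List[str]:
    # Start from the explicit special-token list (self.special_tokens_dict is no
    # longer part of the signature), then append any additional tokens.
    all_special_tokens = (
        list(special_tokens_list) if special_tokens_list is not None else []
    )
    if additional_tokens_list is not None:
        all_special_tokens += additional_tokens_list
    return all_special_tokens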

@@ -992,7 +975,7 @@ def count_unknowns(
            f"Unexpected type of encoding {type(encoding)}, should be list or Encoding"
        )
    if unk_token is None:
        unk_token = special_wrap_input(special_tokens["unk_token"])
Collaborator

why do you prefer to hardcode it? can't it be different for different (sub) tokenizers?

Collaborator Author

The special_tokens dict was also hardcoded.
We (I 😅) need to add such an argument to the encode method (like we do with padding).
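
A hypothetical sketch of that follow-up, mirroring how padding is passed (simplified to a free function; the default token value is a placeholder): callers could pass the unknown token used by a specific sub-tokenizer instead of relying on the hardcoded value.

from typing import Iterable, Optional

DEFAULT_UNK_TOKEN = "<UNK>"  # placeholder for the currently hardcoded default

def count_unknowns(tokens: Iterable[str], unk_token: Optional[str] = None) -> int:
    # Hypothetical signature: if the caller does not pass unk_token, fall back to
    # the default; otherwise count unknowns using the per-sub-tokenizer value.
    if unk_token is None:
        unk_token = DEFAULT_UNK_TOKEN
    return sum(1 for token in tokens if token == unk_token)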

@@ -34,7 +34,7 @@ model:
   target_featurizer: ProtBertFeaturizer
   # possible values: SimpleCoembedding, others (not tested): GoldmanCPI, SimpleCosine, AffinityCoembedInner, CosineBatchNorm, LSTMCosine, DeepCosine,
   # SimpleConcat, SeparateConcat, AffinityEmbedConcat, AffinityConcatLinear
-  model_architecture: SimpleCoembedding
+  model_architecture: deepcoembedding
Collaborator

Is this intentional in this PR?

Collaborator Author

I'll revert it; we ended up disabling this example because it failed.

Collaborator

@YoelShoshan left a comment

LGTM

Ideally @floccinauc should also review

@@ -1,5 +1,5 @@
 paths:
-  tokenizers_path: "${oc.env:MY_GIT_REPOS}/fuse-drug/fusedrug/data/tokenizer/modulartokenizer/pretrained_tokenizers/" # tokenizer base work path
+  tokenizers_path: # tokenizer base work path
Collaborator

Is this legit YAML? Is this correct?

Collaborator Author

I was expecting that someone would set it before running the script.
But maybe it's better to set it to null and assert that it's not None at the beginning of the script. Setting it to null would still allow overriding it from the command line.
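
A minimal sketch of that suggestion (config file and script names assumed; the key name is taken from the snippet above): set the value to null in the YAML and fail fast in the script if it was not overridden from the command line.

# config.yaml (sketch)
# paths:
#   tokenizers_path: null  # tokenizer base work path; override via CLI

import hydra
from omegaconf import DictConfig

@hydra.main(config_path=".", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    assert cfg.paths.tokenizers_path is not None, (
        "paths.tokenizers_path must be set, e.g. "
        "python tokenizer_script.py paths.tokenizers_path=/path/to/pretrained_tokenizers"
    )
    ...

if __name__ == "__main__":
    main()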

@@ -38,16 +28,7 @@ def main(cfg: DictConfig) -> None:

     t_mult = ModularTokenizer.load(path=cfg_raw["data"]["tokenizer"]["in_path"])

-    test_tokenizer(t_mult, cfg_raw=cfg_raw, mode="loaded_path")
Collaborator

This is a sanity check for the tokenizer, intended to catch problems early, before they hit hard downstream. Why remove this?

Collaborator Author

Because what belongs here is a generic test, and this specific test is not in the right place.
For example, the script will not necessarily add a sub-tokenizer to a modular tokenizer that already contains the AA sub-tokenizer.
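
A hypothetical sketch (import path assumed, pytest-style cfg_raw fixture assumed) of where a generic check could live instead, decoupled from the add-subtokenizer script:

from fusedrug.data.tokenizer.modulartokenizer.modular_tokenizer import ModularTokenizer  # import path assumed

def test_loaded_modular_tokenizer(cfg_raw: dict) -> None:
    # Load the pretrained modular tokenizer and make a minimal generic assertion;
    # the fuller test_tokenizer(..., mode="loaded_path") check from the script
    # could be moved here as well.
    t_mult = ModularTokenizer.load(path=cfg_raw["data"]["tokenizer"]["in_path"])
    assert t_mult is not None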

Collaborator

@matanninio left a comment

Not very comfortable with dropping the tests.
If I got this right, the new way of adding special tokens does not allow recreating the tokenizer from the original tokenizers; it requires that the extra tokens be added in just the right order, or one will need to inject the special-token list into the first tokenizer.

Collaborator

@floccinauc left a comment

Looks great!

@@ -1434,15 +1418,7 @@ def update_vocab(
         # operations on the tokenizer instance (if possible, operations should be done here, using built-in tokenizer methods)
         json_str = json.dumps(t_json)
         tokenizer_inst = Tokenizer.from_str(json_str)
-        if self.special_tokens_dict is not None:
Collaborator

This is a hopefully useless sanity check, but I still think we should keep it in case something goes horribly wrong (e.g. incompatible sub-tokenizers were somehow loaded).

Collaborator Author

Got it, will add it back
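
A hedged sketch of one way the check could come back (the original body of the if-block is not shown in this hunk, so the verification below is illustrative): after rebuilding the sub-tokenizer with Tokenizer.from_str, confirm that every expected special token resolves to an id in its vocabulary.

if self.special_tokens_dict is not None:
    # Illustrative sanity check: every special token the modular tokenizer expects
    # should exist in the rebuilt sub-tokenizer's vocabulary.
    missing = [
        tok
        for tok in self.special_tokens_dict.values()
        if tokenizer_inst.token_to_id(tok) is None
    ]
    if missing:
        raise ValueError(f"special tokens missing from sub-tokenizer vocab: {missing}")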

@@ -20,8 +20,8 @@
     OpToUpperCase,
     OpKeepOnlyUpperCase,
 )
-from fusedrug.data.tokenizer.ops import (
-    FastTokenizer,
+from fusedrug.data.tokenizer.ops.tokenizer_op import TokenizerOp
Collaborator

Maybe it's worth going over all files that import FastTokenizer and replacing the import with
from fusedrug.data.tokenizer.ops.tokenizer_op import TokenizerOp as FastTokenizer?
This way the PR will not affect others' work.

Collaborator Author

I did, but I also kept the previous name, so even if we keep using the original import it should still work.
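
A minimal sketch (module layout assumed) of the backward-compatibility alias described here: the renamed class is also exposed under its old name, so both the old and the new imports keep working.

# fusedrug/data/tokenizer/ops/tokenizer_op.py (sketch)
class TokenizerOp:
    """Tokenizer op, previously named FastTokenizer."""
    ...

# Keep the old name as an alias so existing imports of FastTokenizer still work:
FastTokenizer = TokenizerOp

# fusedrug/data/tokenizer/ops/__init__.py (sketch)
# from .tokenizer_op import TokenizerOp, FastTokenizer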
