diff --git a/docs/source/nlp/dialogue.rst b/docs/source/nlp/dialogue.rst deleted file mode 100644 index 157aaa714b16..000000000000 --- a/docs/source/nlp/dialogue.rst +++ /dev/null @@ -1,143 +0,0 @@ -.. _dialogue: - -Dialogue tasks -====================================== - -This module consists of various tasks that are related to dialogue. - -**Module Design** - -We decided to group dialogue tasks into a common module instead of having a module for each because they share many things in common, meaning that there can be more re-use of code. -This design can also support easier extension of this module, as developers can work on components of their interest while utilizing other components of dialogue pipeline. -In particular, we wanted to decouple the task-dependent, model-independent components of DataProcessor and InputExample from the model-dependent, task-independent components of Model and Dataset. - -.. image:: dialogue_UML.png - :alt: Dialogue-UML - :width: 800px - -**Supported Tasks** - -Supported tasks fall into broad categories of intent / domain classification with slot filling, intent classification as well as sequence generation. - -For each category of tasks, there exists several Data Processors to convert raw data from various sources into a common format as well as Dialogue Models that approachs the task in various ways. 
- -Currently, the supported task categories are: - -+----------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ -| **Task Category** | **Tasks** | **Models** | **Supported Options for model.language_model.pretrained_model_name** | **Supported options for model.library** | -+----------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ -| Domain / Intent Classification | Schema Guided Dialogue | Dialogue GPT Classification Model | gpt2, gpt2-{medium, large, xl}, microsoft/DialoGPT-{small, medium} | Huggingface, Megatron | -+ with slot filling +----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ -| | Assistant | SGDQA (BERT-Based Schema Guided Dialogue Question Answering model) | bert-base-cased | Megatron | -+ +----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ -| | | Intent Slot Classification Model | bert-base-uncased | Megatron | -+----------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ -| Intent Classification | Zero 
Shot Food Ordering | Dialogue GPT Classification Model | gpt2, gpt2-{medium, large, xl}, microsoft/DialoGPT-{small, medium} | Huggingface, Megatron | -+ +----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ -| | Omniverse Design | Dialogue Nearest Neighbour Model | sentence-transformers/* | Huggingface | -+ +----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ -| | | Dialogue Zero Shot Intent Model (Based on MNLI pretraining) | bert-base-uncased | Huggingface, Megatron | -+----------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ -| Sequence Generation | Schema Guided Dialogue Generation| Dialogue GPT Generation Model | gpt2, gpt2-{medium, large, xl}, microsoft/DialoGPT-{small, medium} | Huggingface, Megatron | -+ +----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ -| | MS Marco NLGen | Dialogue S2S Generation Model | facebook/bart-{base, large}, t5-{small, base, large, 3b, 11b} | Huggingface, Megatron | -+----------------------------------------------------------+----------------------------------+----------------------------------------------------------------------------------+----------------------------------------------------------------------+------------------------------------------+ - 
-**Configuration** - -Example of model configuration file for training the model can be found at: `NeMo/examples/nlp/dialogue/conf/dialogue_config.yaml `__. - -Because the Dialogue module contains a wide variety of models and tasks, there are a large number of configuration parameters to adjust (some of which only applies to some models/some tasks) - -In the configuration file, define the parameters of the training and the model, although most of the default values will work well. -For various task-model combination, only a restricted set of config args will apply. Please read the configuration file for comments on which config args you would need for each model and task. - -The configuration can be roughly grouped into a few categories: - -- Parameters that describe the training process, such as how many gpus to use: **trainer** -- Parameters that describe the model: **model** -- Parameters that describe optimization: **model.optim** -- Parameters that describe the task: **model.dataset** -- Parameters that describe the dataloaders: **model.train_ds**, **model.validation_ds**, **model.test_ds**, -- Parameters that describe the training experiment manager that log training process: **exp_manager** - - -Arguments that very commonly need to be edited for all models and tasks - -- :code:`do_training`: perform training or only testing -- :code:`trainer.devices`: number of GPUs (int) or list of GPUs e.g. 
[0, 1, 3] -- :code:`model.dataset.task`: Task to work on [sgd, assistant, zero_shot, ms_marco, sgd_generation, design, mellon_qa] -- :code:`model.dataset.data_dir`: the dataset directory -- :code:`model.dataset.dialogues_example_dir`: the directory to store prediction files -- :code:`model.dataset.debug_mode`: whether to run in debug mode with a very small number of samples [True, False] -- :code:`model.language_model.pretrained_model_name`: language model to use, which causes different Dialogue Models to be loaded (see table above for options in each model class) -- :code:`model.library`: library to load language model from [huggingface or megatron] -- :code:`model.language_model.lm_checkpoint`: specifying a trained checkpoint (.bin / .ckpt / .nemo). The only exception is for DialogueZeroShotIntentModel, which can be configured at :code:`model.original_nemo_checkpoint`` instead For trained checkpoints, see :code:`list_available_models()`` for each model class and then downloading the file to a local directory - -**Obtaining data** - -Task: Schema Guided Dialogue (SGD) / SGD Generation - -:code: `git clone https://github.com/google-research-datasets/dstc8-schema-guided-dialogue.git` - -Task: MS Marco - -Please download the files below and unzip them into a common folder (for model.dataset.data_dir) - -https://msmarco.blob.core.windows.net/msmarco/train_v2.1.json.gz -https://msmarco.blob.core.windows.net/msmarco/dev_v2.1.json.gz -https://msmarco.blob.core.windows.net/msmarco/eval_v2.1_public.json.gz - -Then remove unused samples (optional, but otherwise, this would require significantly more CPU RAM ~25GB) - -:code: `python ../NeMo/examples/nlp/dialogue/remove_ms_marco_samples_without_wellFormedAnswers.py --filename train_v2.1.json` -:code: `python ../NeMo/examples/nlp/dialogue/remove_ms_marco_samples_without_wellFormedAnswers.py --filename dev_v2.1.json` - -Task: Assistant - -:code: `git clone https://github.com/xliuhw/NLU-Evaluation-Data` - -Then unzip it - 
-Finally, convert the dataset into the required format - -.. code:: - - python examples/nlp/intent_slot_classification/data/import_datasets.py - --source_data_dir=`source_data_dir` \ - --target_data_dir=`target_data_dir` \ - --dataset_name='assistant' - -- :code:`source_data_dir`: the directory location of the your dataset -- :code:`target_data_dir`: the directory location where the converted dataset should be saved - - -Unfortunately other datasets are currently not available publically - -**Training/Testing a model** - - -Please try the example Dialogue model in a Jupyter notebook (can run on `Google's Colab `__). - - -Connect to an instance with a GPU (**Runtime** -> **Change runtime type** -> select **GPU** for the hardware accelerator). - -An example script on how to train the model can be found here: `NeMo/examples/nlp/dialogue/dialogue.py `__. - -The following is an example of the command for training the model: - - -Code for training a model with three public datasets (from above) are available in the Jupyter/Colab notebook `Google's Colab `__) - - -.. code:: - - python examples/nlp/dialogue/dialogue.py \ - do_training=True \ - model.dataset.task=sgd \ - model.dataset.debug_mode=True \ - model.language_model.pretrained_model_name=gpt2 \ - model.data_dir= \ - model.dataset.dialogues_example_dir= \ - trainer.devices=[0] \ - trainer.accelerator='gpu' diff --git a/docs/source/nlp/dialogue_UML.png b/docs/source/nlp/dialogue_UML.png deleted file mode 100644 index 5bcc4c01c9da..000000000000 Binary files a/docs/source/nlp/dialogue_UML.png and /dev/null differ diff --git a/examples/nlp/dialogue/analyse_prediction_results.py b/examples/nlp/dialogue/analyse_prediction_results.py deleted file mode 100644 index b97e886d6215..000000000000 --- a/examples/nlp/dialogue/analyse_prediction_results.py +++ /dev/null @@ -1,112 +0,0 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse -import json -import re - -import numpy as np - -from nemo.collections.nlp.metrics.dialogue_metrics import DialogueGenerationMetrics - - -def read_jsonl(filename): - with open(filename, 'r', encoding="UTF-8") as f: - docs = [json.loads(line) for line in f.readlines()] - return docs - - -def get_incorrect_labels(docs): - incorrect_labels_docs = [] - for doc in docs: - if doc["ground_truth_labels"] != doc["generated_labels"]: - incorrect_labels_docs.append( - { - "input": doc["input"], - "ground_truth_labels": doc["ground_truth_labels"], - "generated_labels": doc["generated_labels"], - } - ) - return incorrect_labels_docs - - -def get_incorrect_slots(docs): - incorrect_slots_docs = [] - for doc in docs: - if doc["ground_truth_slots"] != doc["generated_slots"]: - incorrect_slots_docs.append( - { - "input": doc["input"], - "ground_truth_slots": doc["ground_truth_slots"], - "generated_slots": doc["generated_slots"], - } - ) - return incorrect_slots_docs - - -def sort_by_f1(docs): - for i in range(len(docs)): - doc = docs[i] - generated_field = doc["generated"] - ground_truth_field = doc["ground_truth"] - generated_field = remove_punctation(generated_field.lower()) - ground_truth_field = remove_punctation(ground_truth_field.lower()) - p, r, f1 = DialogueGenerationMetrics._get_one_f1(generated_field, ground_truth_field) - docs[i]["f1"] = f1 - docs[i]["generated"] = generated_field - docs[i]["ground_truth"] = 
ground_truth_field - docs.sort(key=lambda x: x["f1"]) - return docs - - -def remove_punctation(sentence): - return re.sub(r'[^\w\s]', '', sentence) - - -def generation_main(filename): - docs = read_jsonl(filename) - docs = sort_by_f1(docs) - bleu = DialogueGenerationMetrics.get_bleu( - [doc["generated"] for doc in docs], [doc["ground_truth"] for doc in docs] - ) - acc = np.mean([int(doc["generated"] == doc["ground_truth"]) for doc in docs]) * 100 - f1 = np.mean([doc["f1"] for doc in docs]) - print("Token level F1 is {:.3}".format(f1)) - print("BLEU is {:.3}".format(bleu)) - print("Exact match accuracy is {:.3}".format(acc)) - for i in range(0): - print(docs[i]) - - -def classification_main(filename): - docs = read_jsonl(filename) - incorrect_labels_docs = get_incorrect_labels(docs) - incorrect_slots_docs = get_incorrect_slots(docs) - - print("{} / {} have incorrect labels".format(len(incorrect_labels_docs), len(docs))) - print("{} / {} have incorrect slots".format(len(incorrect_slots_docs), len(docs))) - - for doc in incorrect_labels_docs: - print(doc) - - -if __name__ == '__main__': - parser = argparse.ArgumentParser() - parser.add_argument("--prediction_filename") - parser.add_argument("--mode", choices=['generation', 'classification'], default='classification') - args = parser.parse_args() - if args.mode == 'classification': - classification_main(args.prediction_filename) - else: - generation_main(args.prediction_filename) diff --git a/examples/nlp/dialogue/conf/dialogue_config.yaml b/examples/nlp/dialogue/conf/dialogue_config.yaml deleted file mode 100644 index 6af9b5d50cf7..000000000000 --- a/examples/nlp/dialogue/conf/dialogue_config.yaml +++ /dev/null @@ -1,205 +0,0 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -pretrained_model: null # pretrained model from list_available_models() -do_training: true # true for training mode, false for testing -trainer: - devices: 1 # number of GPUs (0 for CPU), or list of the GPUs to use e.g. [0, 1] - num_nodes: 1 - max_epochs: 3 - max_steps: -1 # precedence over max_epochs - accumulate_grad_batches: 1 # accumulates grads every k batches - gradient_clip_val: 1.0 - precision: 16 # Should be set to 16 for O1 and O2 to enable the AMP. - accelerator: gpu - log_every_n_steps: 5 # Interval of logging. - val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations - num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it - enable_checkpointing: False # Provided by exp_manager - logger: False # Provided by exp_manager - -model: - # all models - tensor_model_parallel_size: 1 - nemo_path: null # filename to save the model and associated artifacts to .nemo file - library: huggingface # [huggingface, megatron] - save_model: False # save validation model checkpoints - - language_model: - pretrained_model_name: gpt2 # main config to select model (between bert, gpt2, t5/bart based models) see docs/source/nlp/dialogue.rst for full list of options - lm_checkpoint: null - config_file: null # json file, precedence over config - config: null - - tokenizer: - tokenizer_name: ${model.language_model.pretrained_model_name} # or sentencepiece - vocab_file: null # path to vocab file - tokenizer_model: null 
# only used if tokenizer is sentencepiece - special_tokens: null - - # Dialogue GPT Classification/Generation and Dialogue S2S Generation Model args - tokens_to_generate: 32 # for generation mode only - - # Intent Slot Classification model args - class_balancing: ${model.dataset.class_balancing} - intent_loss_weight: 0.6 # relation of intent to slot loss in total loss (between 0 to 1) - data_dir: ${model.dataset.data_dir} - classifier_head: - num_output_layers: 2 - fc_dropout: 0.1 - - # Dialogue GPT Classification Megatron Prompt Learning model args - prompt_learning: false # please change to true to activate prompt learning - language_model_path: ${model.language_model.lm_checkpoint} - new_tasks: ['intent_and_slot'] - prompt_tuning: - new_prompt_init_methods: ['text'] - new_prompt_init_text: ['intent_and_slot'] - p_tuning: # P-tuning specific params - dropout: 0.0 - num_layers: 2 - encoder_type: mlp # lstm or tpmlp or embedding - prompt_learning_nemo_path: prompt_learning.nemo - data: {} - virtual_prompt_style: 'p-tuning' # 'prompt-tuning' - encoder_seq_length: 2048 - pipeline_model_parallel_size: 1 - data_parallel_size: 1 - global_batch_size: 8 - micro_batch_size: 8 - - task_templates: - - taskname: "intent_and_slot" - prompt_template: "<|VIRTUAL_PROMPT_0|> {utterance} \nintent: {intent} \nslot: {slot}" - total_virtual_tokens: 10 - answer_only_loss: True - virtual_token_splits: [10] - truncate_field: null - - # SGDQA args - encoder: - dropout: 0.1 - - # Zero Shot Intent Model args - original_nemo_checkpoint: null ## cannot directly load as .nemo uses the pre-refactor model, therefore transfer its attributes over - - dataset: - - ## All tasks/models - data_dir: ??? # location to load data from - dialogues_example_dir: ??? 
# store prediction files - task: sgd # [sgd, assistant, zero_shot, ms_marco, sgd_generation, design, mellon_qa] - debug_mode: false # small number of examples for debugging - max_seq_length: 128 # the maximum number of tokens per sample - - ## Dialogue S2S and GPT Generation Model params - input_field: utterance+response # passage+utterance, utterance, response, utterance+response, system_actions - output_field: fluent_response # response, fluent_response, system_utterance - - ## Dialogue GPT Classification Model params - field: intent # [intent, slots, service] - few_shot: 0 # int ; 0 to 10, for number of examples in prompt - eval_mode: ranking # ranking or generation or binary_score - binary_score_subsample: false # subsample negative examples for binary score training - binary_score_subsample_ratio: 2 # number of negative examples per postive example - prompt_template: default # default, prompt_tuning, i_want_to # "This example is" for zeroshotintentmodel #acts_slots_values, slots_values, values for DialogueS2SGenerationDataset - target_template: default # default, with_description, with_slots - - ## SGD task specific params - system_utterance: prev_turn # prev_turn, next_turn: prev_turn (default for sgdqa) takes the system utterance that precede the user utterance; next_turn (for sgd_generation) takes the system utterance that follows the user utterance - num_tasks: 1 # number of task heads 1 for DialogGPTClassification and 6 for SGDQA - - ## SGD and Zero Shot task specific params - preprocess_intent_function: default # default, lowercase, description # remove_domain for zero_shot task - - ## SGDQA model specific params - subsample: false # balances negative and positive training examples for improved performance - task_name: sgd_single_domain # or from [sgd_all, sgd_all_single, sgd_multi_domain, debug_sample] - state_tracker: nemotracker # or baseline - use_cache: false # uses a cache to store the processed dataset, you may use it for large datasets for speed 
up - use_fuzzy_match: true # Whether to use fuzzy string matching when comparing non-categorical slot values. Should be set to False when conducting multiwoz style evaluation. - joint_acc_across_turn: false # Whether to compute joint goal accuracy across turn instead of across service. Should be set to True when conducting multiwoz style evaluation. - max_num_cat_slot: 6 # maximum number of different categorical slots per service in dataset - max_num_noncat_slot: 12 # maximum number of different non-categorical slots per service in dataset - max_value_per_cat_slot: 12 # maximum number of different categorical slot values per service in dataset - max_num_intent: 4 # maximum number of different intents per service in dataset - num_samples: -1 # restrict num_samples to an int value, if -1 all samples will be used - pad_label: -1 # if -1 not slot token will be used - ignore_extra_tokens: false - ignore_start_end: true # do not use first and last token for slot training - do_lowercase: false - - #Zero Shot Intent Model args - class_balancing: null # or weighted_loss - num_classes: 3 - - # Mellon QA, MS Marco and Design task - dev_proportion: 10 # These datasets do not have a dedicated dev set, therefore need to split train into a new train and dev. 
Indicate an integer (5-90) for the proporton for dev set - - train_ds: - ds_item: "train" - prefix: train - batch_size: 16 - shuffle: true - num_workers: 3 - drop_last: false - pin_memory: false - - validation_ds: - prefix: test - ds_item: ["dev"] - batch_size: 8 - shuffle: false - num_workers: 3 - drop_last: false - pin_memory: false - - test_ds: - prefix: test - ds_item: ["test"] - batch_size: 8 - shuffle: false - num_workers: 3 - drop_last: false - pin_memory: false - - optim: - name: adamw - lr: 1e-4 - # optimizer arguments - betas: [0.9, 0.999] - weight_decay: 0.01 - - # scheduler setup - sched: - name: PolynomialDecayAnnealing - # Scheduler params - warmup_steps: null - warmup_ratio: 0.02 - last_epoch: -1 - # pytorch lightning args - monitor: val_loss - reduce_on_plateau: false - -exp_manager: - exp_dir: null # exp_dir for your experiment, if None, defaults to "./nemo_experiments" - name: "SGDGEN" # The name of your model - create_wandb_logger: True - wandb_logger_kwargs: - name: ??? - project: SGDGEN - create_tensorboard_logger: True # Whether you want exp_manger to create a tb logger - create_checkpoint_callback: True # Whether you want exp_manager to create a modelcheckpoint callback - resume_if_exists: false - resume_ignore_no_checkpoint: false \ No newline at end of file diff --git a/examples/nlp/dialogue/dialogue.py b/examples/nlp/dialogue/dialogue.py deleted file mode 100644 index 3f4c5581eb5a..000000000000 --- a/examples/nlp/dialogue/dialogue.py +++ /dev/null @@ -1,158 +0,0 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -""" -This script contains an example of how to train and test dialogue models in NeMo. - -***Setting the configs*** -The model and the PT trainer are defined in a config file that declares multiple important sections. -The most important ones are: - model: All arguments that are related to the Model - model, loss, optimizer, - schedulers, and datasets/data loaders. - trainer: Any argument to be passed to PyTorch Lightning including number of epochs, number of GPUs, - precision level, etc. - -This script uses the `/examples/nlp/dialogue_state_tracking/conf/dialog_config.yaml` config file -by default. You may update the config file from the file directly. The other option is to set another config file via command-line arguments by `--config-name=CONFIG_FILE_PATH'. - - -***Model Training*** - python dialogue.py - do_training=True - model.dataset.data_dir= - model.dataset.dialogues_example_dir= - model.dataset.task= e.g. sgd - model.language_model.pretrained_model_name= e.g. 
gpt2 - trainer.devices=[] - -***Model Evaluation*** - command as above, change do_training=False -""" - -import os - -import lightning.pytorch as pl -from omegaconf import DictConfig, OmegaConf - -from nemo.collections.nlp.models.dialogue.dialogue_gpt_classification_model import DialogueGPTClassificationModel -from nemo.collections.nlp.models.dialogue.dialogue_gpt_generation_model import DialogueGPTGenerationModel -from nemo.collections.nlp.models.dialogue.dialogue_nearest_neighbour_model import DialogueNearestNeighbourModel -from nemo.collections.nlp.models.dialogue.dialogue_s2s_generation_model import DialogueS2SGenerationModel -from nemo.collections.nlp.models.dialogue.dialogue_zero_shot_intent_model import DialogueZeroShotIntentModel -from nemo.collections.nlp.models.dialogue.intent_slot_classification_model import IntentSlotClassificationModel -from nemo.collections.nlp.models.dialogue.sgdqa_model import SGDQAModel -from nemo.collections.nlp.modules.common.megatron.megatron_utils import compute_model_parallel_rank -from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy -from nemo.core.config import hydra_runner -from nemo.utils import logging -from nemo.utils.app_state import AppState -from nemo.utils.exp_manager import exp_manager - - -@hydra_runner(config_path="conf", config_name="dialogue_config") -def main(cfg: DictConfig) -> None: - pl.seed_everything(42) - logging.warning('This script is no longer supported in NeMo and is scheduled for removal in the 24.11 release.') - logging.info(f'Config: {OmegaConf.to_yaml(cfg)}') - - try: - strategy = NLPDDPStrategy( - no_ddp_communication_hook=True, - find_unused_parameters=True, - ) - except (ImportError, ModuleNotFoundError): - strategy = 'auto' - - trainer = pl.Trainer(**cfg.trainer, strategy=strategy) - - exp_manager(trainer, cfg.get("exp_manager", None)) - - app_state = AppState() - app_state.data_parallel_size = cfg.model.data_parallel_size - if cfg.model.tensor_model_parallel_size > 1: - 
app_state.model_parallel_size = cfg.model.tensor_model_parallel_size - app_state.tensor_model_parallel_rank = compute_model_parallel_rank( - trainer.local_rank, app_state.model_parallel_size - ) - - if 'bert' in cfg.model.language_model.pretrained_model_name: - if cfg.model.dataset.task == 'sgd': - if cfg.model.original_nemo_checkpoint is not None: - model_class = DialogueZeroShotIntentModel - else: - model_class = SGDQAModel - elif cfg.model.dataset.task in ['zero_shot', 'design']: - model_class = DialogueZeroShotIntentModel - else: - model_class = IntentSlotClassificationModel - elif 'gpt' in cfg.model.language_model.pretrained_model_name.lower(): - if cfg.model.dataset.task in ['ms_marco', 'mellon_qa']: - model_class = DialogueGPTGenerationModel - else: - model_class = DialogueGPTClassificationModel - elif ( - 'bart' in cfg.model.language_model.pretrained_model_name.lower() - or 't5' in cfg.model.language_model.pretrained_model_name.lower() - ): - # please use bf16/32 with t5-large and above - # see https://github.com/huggingface/transformers/pull/10956 - model_class = DialogueS2SGenerationModel - elif 'sentence-transformers' in cfg.model.language_model.pretrained_model_name.lower(): - model_class = DialogueNearestNeighbourModel - - if cfg.pretrained_model or (cfg.model.nemo_path and os.path.exists(cfg.model.nemo_path)): - if cfg.pretrained_model: - logging.info(f'Loading pretrained model {cfg.pretrained_model}') - model = model_class.from_pretrained(cfg.pretrained_model) - else: - logging.info(f'Restoring model from {cfg.model.nemo_path}') - model = model_class.restore_from(cfg.model.nemo_path, trainer=trainer) - - if cfg.do_training: - model.setup_training_data(train_data_config=cfg.model.train_ds) - model.setup_multiple_validation_data(val_data_config=cfg.model.validation_ds) - else: - logging.info(f'Config: {OmegaConf.to_yaml(cfg)}') - model = model_class(cfg.model, trainer=trainer) - - if cfg.do_training: - trainer.fit(model) - if cfg.model.nemo_path: - if 
not os.path.exists(cfg.model.nemo_path): - model.save_to(cfg.model.nemo_path) - else: - updated_nemo_path = cfg.model.nemo_path.replace(".nemo", "_new.nemo") - logging.warning("nemo path exists, saving at {} instead".format(updated_nemo_path)) - model.save_to(updated_nemo_path) - - else: - data_dir = cfg.model.dataset.get('data_dir', None) - dialogues_example_dir = cfg.model.dataset.get('dialogues_example_dir', None) - - if data_dir is None or dialogues_example_dir is None: - raise ValueError('No dataset directory provided. Skipping evaluation. ') - elif not os.path.exists(data_dir): - raise ValueError(f'{data_dir} is not found, skipping evaluation on the test set.') - else: - if hasattr(model, "update_data_dirs"): - model.update_data_dirs(data_dir=data_dir, dialogues_example_dir=dialogues_example_dir) - model._cfg.dataset = cfg.model.dataset - - if hasattr(cfg.model, 'test_ds') and cfg.model.test_ds.ds_item is not None: - model.setup_multiple_test_data(test_data_config=cfg.model.test_ds) - trainer.test(model) - - -if __name__ == '__main__': - main() diff --git a/examples/nlp/dialogue/remove_ms_marco_samples_without_wellFormedAnswers.py b/examples/nlp/dialogue/remove_ms_marco_samples_without_wellFormedAnswers.py deleted file mode 100644 index 53a7ecfed2ef..000000000000 --- a/examples/nlp/dialogue/remove_ms_marco_samples_without_wellFormedAnswers.py +++ /dev/null @@ -1,54 +0,0 @@ -# Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. - - -import argparse -import json -from ast import literal_eval - -import ijson - - -def main(filename): - with open(filename, 'r') as file: - objects = ijson.kvitems(file, 'wellFormedAnswers') - valid_old_key_to_new_key = {} - new_key = 0 - for key, well_formed_answer in objects: - value = well_formed_answer if isinstance(well_formed_answer, list) else literal_eval(well_formed_answer) - if len(value) > 0: - valid_old_key_to_new_key[key] = str(new_key) - new_key += 1 - filtered_data = {} - fieldnames = ['query', 'query_type', 'answers', 'wellFormedAnswers', 'passages'] - for fieldname in fieldnames: - add_data(filename, filtered_data, fieldname, valid_old_key_to_new_key) - - with open(filename, 'w') as fw: - json.dump(filtered_data, fw) - - -def add_data(filename, filtered_data, fieldname, valid_old_key_to_new_key): - with open(filename, 'r') as f: - objects = ijson.kvitems(f, fieldname) - filtered_data[fieldname] = { - valid_old_key_to_new_key[key]: query for key, query in objects if key in valid_old_key_to_new_key - } - - -if __name__ == '__main__': - parser = argparse.ArgumentParser() - parser.add_argument("--filename") - args = parser.parse_args() - main(args.filename)
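For readers of this diff, the filtering done by the removed `remove_ms_marco_samples_without_wellFormedAnswers.py` can be sketched in memory as below. This is a minimal sketch, not the removed script: it assumes the whole file fits in RAM and works on an already-loaded dict, whereas the original streamed each field with `ijson.kvitems` to keep memory low. The function name `filter_well_formed` and the toy data are hypothetical and only illustrate the key remapping.

```python
import json
from ast import literal_eval

# Same field list as in the removed script.
FIELDNAMES = ['query', 'query_type', 'answers', 'wellFormedAnswers', 'passages']


def filter_well_formed(data):
    """Keep samples with a non-empty wellFormedAnswers list and renumber their keys from 0.

    Hypothetical in-memory helper; the removed script did the same remapping
    while streaming from disk.
    """
    old_to_new = {}
    for key, answer in data['wellFormedAnswers'].items():
        # Some entries hold the list as a string such as "[]", so parse those first.
        value = answer if isinstance(answer, list) else literal_eval(answer)
        if len(value) > 0:
            old_to_new[key] = str(len(old_to_new))
    # Apply the same old-key -> new-key mapping to every field.
    return {
        field: {old_to_new[k]: v for k, v in data[field].items() if k in old_to_new}
        for field in FIELDNAMES
    }


# Hypothetical toy data, not taken from the real dataset.
toy = {
    'query': {'0': 'q0', '1': 'q1', '2': 'q2'},
    'query_type': {'0': 'DESCRIPTION', '1': 'NUMERIC', '2': 'ENTITY'},
    'answers': {'0': ['a0'], '1': ['a1'], '2': ['a2']},
    'wellFormedAnswers': {'0': '[]', '1': ['Answer one.'], '2': '[]'},
    'passages': {'0': [], '1': [], '2': []},
}
filtered = filter_well_formed(toy)
print(json.dumps(filtered['query']))
```

Only the sample whose `wellFormedAnswers` entry is non-empty survives, and it is re-keyed to `"0"` so that keys stay consecutive across all five fields. For the full MS Marco files, the streaming approach of the removed script is the safer choice, since the documentation above notes a memory cost of roughly 25 GB without filtering.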