fix: Changes in function process_dataargs to support the current implementation #2

Abhishek-TAMU · 2024-11-21T20:34:56Z

Description of the change

1. Add a unit test case for the function process_dataargs that exercises the current way of tuning.
2. Make datasets with input and output keys work with the current way of tuning, and make the test_process_dataargs_pretokenized handler work.

Related issue number

https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1428

How to verify the PR

Run the current way of tuning for each of the following cases:

Case with a dataset that has text and label as a single sequence (tests/data/twitter_complaints_small.jsonl):

Command used:

python tuning/sft_trainer.py  \
--model_name_or_path Maykeye/TinyLLama-v0  \
--training_data_path tests/data/twitter_complaints_small.jsonl  \
--output_dir outputs/full-tuning  \
--num_train_epochs 5  \
--per_device_train_batch_size 2  \
--gradient_accumulation_steps 1  \
--learning_rate 1e-5  \
--response_template "\n### Label:"  \
--dataset_text_field "output" \
--use_flash_attn false \
--torch_dtype "float32"

Case with a pre-tokenized dataset (tests/data/twitter_complaints_tokenized_with_maykeye_tinyllama_v0.jsonl):

Command used:

python tuning/sft_trainer.py  \
--model_name_or_path Maykeye/TinyLLama-v0  \
--training_data_path tests/data/twitter_complaints_tokenized_with_maykeye_tinyllama_v0.jsonl \
--output_dir outputs/full-tuning  \
--num_train_epochs 5  \
--per_device_train_batch_size 2  \
--gradient_accumulation_steps 1  \
--learning_rate 1e-5  \
--use_flash_attn false \
--torch_dtype "float32"

Case with a dataset that has input and output keys (tests/data/twitter_complaints_input_output.jsonl):

Command used:

python tuning/sft_trainer.py  \
--model_name_or_path Maykeye/TinyLLama-v0  \
--training_data_path tests/data/twitter_complaints_input_output.jsonl  \
--output_dir outputs/full-tuning  \
--num_train_epochs 5  \
--per_device_train_batch_size 2  \
--gradient_accumulation_steps 1  \
--learning_rate 1e-5  \
--use_flash_attn false \
--torch_dtype "float32"
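
For orientation, the three cases above differ mainly in which columns each record carries. The following is a minimal sketch, not the repository's detection logic: `classify_record` is a hypothetical helper, the column names `input` and `output` are assumed from the dataset file name, and the sample records are made up for illustration.

```python
def classify_record(record, dataset_text_field=None):
    """Return which of the three verification cases a record resembles."""
    keys = set(record)
    # Pre-tokenized data already carries token ids and labels.
    if {"input_ids", "labels"} <= keys:
        return "pretokenized"
    # A single text field holds prompt and label as one sequence.
    if dataset_text_field is not None and dataset_text_field in keys:
        return "single_sequence"
    # Separate prompt and completion columns (assumed names).
    if {"input", "output"} <= keys:
        return "input_output"
    return "unknown"


# Made-up records for illustration only; not taken from the test datasets.
examples = [
    ({"output": "Tweet text: hello\n### Label: no complaint"}, "output"),
    ({"input_ids": [1, 2, 3], "labels": [1, 2, 3], "attention_mask": [1, 1, 1]}, None),
    ({"input": "Tweet text: hello", "output": "no complaint"}, None),
]
cases = [classify_record(rec, field) for rec, field in examples]
print(cases)
```

Only the first command passes --dataset_text_field and --response_template, which is why the sketch treats the text field as an explicit argument.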

Was the PR tested

I have added >=1 unit test(s) for every new method I have added.
I have ensured all unit tests pass

Signed-off-by: Abhishek <[email protected]>

github-actions · 2024-11-21T20:35:09Z

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

Abhishek-TAMU · 2024-11-21T20:43:47Z

tuning/data/data_handlers.py

+    fn_kwargs = tokenizer_kwargs.get("fn_kwargs", {})
+    tokenizer_inner_kwargs = fn_kwargs.get("tokenizer_kwargs", {})
+
+    tokenized_comb_seqs = tokenizer(combined, **tokenizer_inner_kwargs)
+    tokenized_input = tokenizer(input, **tokenizer_inner_kwargs)


Reference discussion for this change: foundation-model-stack#381 (comment)

Thanks @Abhishek-TAMU
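
The nested lookup in the diff above can be illustrated in isolation. This is a minimal sketch under stated assumptions: `fake_tokenizer` is a hypothetical whitespace-splitting stand-in for a real tokenizer, `extract_inner_kwargs` is a hypothetical helper name, and the `truncation` and `max_length` values are example settings only.

```python
def fake_tokenizer(text, **kwargs):
    """Hypothetical stand-in for a tokenizer: one token per whitespace-separated word."""
    tokens = text.split()
    max_length = kwargs.get("max_length")
    if kwargs.get("truncation") and max_length is not None:
        tokens = tokens[:max_length]
    ids = list(range(len(tokens)))
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def extract_inner_kwargs(tokenizer_kwargs):
    """Mirror the two-level .get() from the diff; missing keys fall back to {}."""
    fn_kwargs = tokenizer_kwargs.get("fn_kwargs", {})
    return fn_kwargs.get("tokenizer_kwargs", {})


# Example handler configuration (assumed shape, matching the keys used in the diff).
handler_kwargs = {
    "fn_kwargs": {"tokenizer_kwargs": {"truncation": True, "max_length": 3}}
}
inner = extract_inner_kwargs(handler_kwargs)
tokenized = fake_tokenizer("one two three four five", **inner)
print(inner, tokenized["input_ids"])
```

Because both lookups default to an empty dict, a configuration without `fn_kwargs` or without `tokenizer_kwargs` results in the tokenizer being called with no extra arguments instead of raising a KeyError.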

Abhishek-TAMU · 2024-11-21T20:46:04Z

tuning/data/setup_dataprocessor.py

@@ -118,7 +118,7 @@ def process_dataargs(
        kwargs = {
            "fn_kwargs": fn_kwargs,
            "batched": False,
-            "remove_columns": [JSON_INPUT_KEY, JSON_OUTPUT_KEY],
+            "remove_columns": "all",


In the current implementation, the dataset has only the columns input_ids, labels, and attention_mask, so every other column in the dataset needs to be removed.
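
The effect of this change can be sketched in plain Python. This is an illustration under assumptions, not the project's code: `resolve_remove_columns` and `map_rows` are hypothetical helpers, the `"all"` sentinel is assumed to expand to every existing column name, and the extra `id` column and the handler output are made up.

```python
def resolve_remove_columns(remove_columns, column_names):
    """Expand the assumed "all" sentinel to every existing column name."""
    if remove_columns == "all":
        return list(column_names)
    return list(remove_columns)


def map_rows(rows, fn, remove_columns):
    """Apply fn to each row, dropping the listed input columns before adding fn's output."""
    column_names = list(rows[0].keys()) if rows else []
    to_remove = set(resolve_remove_columns(remove_columns, column_names))
    result = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in to_remove}
        kept.update(fn(row))
        result.append(kept)
    return result


def fake_handler(row):
    """Hypothetical handler that emits the three tokenized columns."""
    n = len((row["input"] + " " + row["output"]).split())
    return {"input_ids": list(range(n)), "labels": list(range(n)), "attention_mask": [1] * n}


rows = [{"id": 7, "input": "some tweet", "output": "complaint"}]
before = map_rows(rows, fake_handler, ["input", "output"])  # old behavior
after = map_rows(rows, fake_handler, "all")                 # new behavior
print(sorted(before[0]), sorted(after[0]))
```

With the explicit list, any column other than input and output (here `id`) survives the mapping; with `"all"`, only the columns produced by the handler remain.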

fix: Changes to support current implementation

163fe34

Signed-off-by: Abhishek <[email protected]>

github-actions bot added the fix label Nov 21, 2024

Abhishek-TAMU mentioned this pull request Nov 21, 2024

feat: DataProcessor v1 foundation-model-stack/fms-hf-tuning#381

Open

2 tasks


dushyantbehl merged commit 10d7d66 into dushyantbehl:dataloader-v2-impl Nov 22, 2024
1 check passed
