Unable to make predictions due to missing script: graph_prediction_with_flag.py #2

Open
osession opened this issue Jul 6, 2023 · 7 comments

osession commented Jul 6, 2023

Hi, I have been trying to run the run_evaluation.sh with the provided checkpoints downloaded and unzipped to the checkpoints directory. I am running into this error:

evaluate.py: error: argument --task: invalid choice: 'graph_prediction_with_flag' (choose from 'hubert_pretraining', 'denoising', 'multilingual_denoising', 'translation', 'multilingual_translation', 'translation_from_pretrained_bart', 'translation_lev', 'language_modeling', 'speech_to_text', 'legacy_masked_lm', 'text_to_speech', 'speech_to_speech', 'online_backtranslation', 'simul_speech_to_text', 'simul_text_to_text', 'audio_pretraining', 'semisupervised_translation', 'frm_text_to_speech', 'sentence_prediction', 'cross_lingual_lm', 'translation_from_pretrained_xlm', 'multilingual_language_modeling', 'audio_finetuning', 'masked_lm', 'sentence_ranking', 'translation_multi_simple_epoch', 'multilingual_masked_lm', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt')

I can't find the graph_prediction_with_flag.py script anywhere else and was curious if it has just been removed permanently or if there is another way to run predictions?

Thanks!

Owner

Minys233 commented Jul 6, 2023

Hi, 'graph_prediction_with_flag' is actually a custom task registered with the fairseq framework. It is defined here:

https://github.com/Minys233/dynaformer_model/blob/c9942c389e545a5f43f0834031ce36034cb9b343/dynaformer/tasks/graph_prediction.py#L277-L282

This custom task has to be imported at runtime so that it gets registered. The default parameter in the evaluate.sh script defines the location of the 'graph_prediction_with_flag' task:

https://github.com/Minys233/dynaformer_model/blob/c9942c389e545a5f43f0834031ce36034cb9b343/examples/evaluate/evaluate.sh#L27

So the problem is that evaluate.py can't find this custom task. Maybe you are not running the ./run_custom_input.sh command from the Dynaformer (project root) directory, which makes the relative path invalid. Or maybe you're running evaluate.py with fewer parameters.

Please follow the steps in README.md. If the problem still exists, please post the detailed steps here, and I will be happy to look into what happened :D
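To illustrate why an unregistered task shows up as an "invalid choice", here is a minimal sketch of a decorator-based registry feeding an argparse `choices` list. It is not fairseq's code: `TASK_REGISTRY`, `register_task`, `import_user_dir`, `parse_task`, and `try_parse_task` are hypothetical names used only for this illustration, and the import of the user directory is simulated by a plain function call.

```python
import argparse

# Hypothetical stand-in for a framework's task registry (not fairseq's code).
TASK_REGISTRY = {}


def register_task(name):
    """Return a decorator that records a task class under `name`."""
    def decorator(cls):
        TASK_REGISTRY[name] = cls
        return cls
    return decorator


@register_task("translation")
class TranslationTask:
    """Stands in for a built-in task, registered as soon as this file loads."""


def import_user_dir():
    """Stands in for importing the --user-dir package.

    In a real project the decorated class lives in its own module, and the
    registration happens as a side effect of importing that module.
    """
    @register_task("graph_prediction_with_flag")
    class GraphPredictionWithFlagTask:
        pass


def parse_task(argv):
    """Parse --task, accepting only names that are registered right now."""
    parser = argparse.ArgumentParser(prog="evaluate.py")
    parser.add_argument("--task", choices=sorted(TASK_REGISTRY))
    return parser.parse_args(argv).task


def try_parse_task(argv):
    """Like parse_task, but return None instead of exiting on a bad choice."""
    try:
        return parse_task(argv)
    except SystemExit:
        return None
```

In this sketch, `try_parse_task(["--task", "graph_prediction_with_flag"])` returns `None` until `import_user_dir()` has run, because the name is not yet among the parser's choices. argparse itself exits with status 2 on an invalid choice, so a wrong or empty user directory would be expected to surface as an "invalid choice" error rather than as a missing-file error.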

Author

osession commented Jul 6, 2023

I am running ./run_evaluate.sh in the home directory. I don't think I am running evaluate.py with fewer parameters, since I have not modified the evaluate.sh file at all. Instead of looking in the file path that you showed (https://github.com/Minys233/dynaformer_model/blob/c9942c389e545a5f43f0834031ce36034cb9b343/examples/evaluate/evaluate.sh#L27), it seems to be looking here:
https://github.com/facebookresearch/fairseq/tree/98ebe4f1ada75d006717d84f9d603519d8ff5579/fairseq/tasks

At least those are all the other names of the tasks that are being listed in the error that I'm still getting.

Author

osession commented Jul 6, 2023

I think I figured out the issue. I was getting this error: Dynaformer/examples/evaluate/evaluate.sh: line 25: realpath: command not found. So when I removed the realpath command and just replaced those lines with simply the string of the filepath, it was able to find the graph_prediction.py script. Thank you for your help!!
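When `realpath` is missing, a shell substitution such as `$(realpath ./some/dir)` expands to an empty string, so the intended path never reaches the script. As a rough sketch of one way to avoid depending on that binary, the path can be resolved with Python's standard library instead. The helper names `have_realpath` and `resolve_dir` are made up for this example and are not part of the Dynaformer scripts.

```python
import os
import shutil


def have_realpath():
    """Report whether a `realpath` executable is on PATH."""
    return shutil.which("realpath") is not None


def resolve_dir(path):
    """Return the canonical absolute form of `path` without calling `realpath`.

    Raises NotADirectoryError if the result is not an existing directory,
    so a wrong relative path fails loudly instead of being passed along.
    """
    resolved = os.path.realpath(os.path.expanduser(path))
    if not os.path.isdir(resolved):
        raise NotADirectoryError(
            f"{path!r} resolves to {resolved!r}, which is not a directory"
        )
    return resolved
```

The same resolution can be done from a shell script with `python -c 'import os, sys; print(os.path.realpath(sys.argv[1]))' ./some/dir`, which only requires the Python interpreter that the project already needs.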

Owner

Minys233 commented Jul 7, 2023

> I think I figured out the issue. I was getting this error: Dynaformer/examples/evaluate/evaluate.sh: line 25: realpath: command not found. So when I removed the realpath command and just replaced those lines with simply the string of the filepath, it was able to find the graph_prediction.py script. Thank you for your help!!

Glad to hear that, and thank you for pointing this out! After some googling and testing, I found that realpath is part of GNU coreutils, but it was only added in relatively recent releases, so older or non-GNU systems may not provide it. The readlink -f command is more widely available on Linux and can be used for the same purpose. I will update README.md and the corresponding scripts soon.

ref: Discussions on Unix & Linux Stack Exchange

@osession osession closed this as completed Jul 7, 2023
@osession
Author

Hello again, I've been trying to run bash Dynaformer/examples/md_pretrain/md_train.sh, but I am running into an issue similar to the one I had before: the 'invalid choice: graph_prediction_with_flag' error. This time it still isn't working even after adjusting the realpath command. Sorry to bring this up again!

fairseq-train: error: argument --task: invalid choice: 'graph_prediction_with_flag' (choose from 'translation', 'translation_from_pretrained_xlm', 'denoising', 'multilingual_denoising', 'speech_to_text', 'text_to_speech', 'hubert_pretraining', 'online_backtranslation', 'sentence_prediction', 'speech_to_speech', 'simul_speech_to_text', 'simul_text_to_text', 'audio_pretraining', 'audio_finetuning', 'cross_lingual_lm', 'frm_text_to_speech', 'multilingual_translation', 'translation_from_pretrained_bart', 'semisupervised_translation', 'multilingual_masked_lm', 'translation_multi_simple_epoch', 'language_modeling', 'multilingual_language_modeling', 'translation_lev', 'masked_lm', 'sentence_ranking', 'legacy_masked_lm', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt')
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 96930) of binary: /home/ray/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ray/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
************************************************
  /home/ray/anaconda3/bin/fairseq-train FAILED  
================================================
Root Cause:
[0]:
  time: 2023-07-11_09:24:00
  rank: 0 (local_rank: 0)
  exitcode: 2 (pid: 96930)
  error_file: <N/A>
  msg: "Process failed with exitcode 2"
================================================
Other Failures:
  <NO_OTHER_FAILURES>
************************************************

@osession osession reopened this Jul 11, 2023
Author

osession commented Jul 12, 2023

I figured out that the user directory was incorrect, which is why it was unable to find the 'graph_prediction_with_flag' custom task. So I changed line 157 in md_train.sh from --user-dir "$(realpath ./dynaformer)" \ to --user-dir "$(realpath ./Dynaformer/dynaformer)" \.
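Since a relative `--user-dir` resolves against whatever directory the script is launched from, a small pre-flight check can show which candidate actually contains the task module. The following is only a sketch: `find_user_dir` is a hypothetical helper, and the `tasks/graph_prediction.py` location is taken from the repository link earlier in this thread.

```python
from pathlib import Path

# Location of the task module inside the user dir, per the link in this thread.
TASK_FILE = Path("tasks") / "graph_prediction.py"


def find_user_dir(candidates, cwd=None):
    """Return the first candidate that contains TASK_FILE, or None.

    Each candidate is resolved against `cwd` (default: the current working
    directory), mirroring how a relative path in a shell script behaves.
    """
    base = Path(cwd) if cwd is not None else Path.cwd()
    for candidate in candidates:
        resolved = (base / candidate).resolve()
        if (resolved / TASK_FILE).is_file():
            return resolved
    return None
```

For example, `find_user_dir(["./dynaformer", "./Dynaformer/dynaformer"])` picks the second entry when launched from the directory that contains the `Dynaformer` checkout, and the first entry when launched from inside the checkout itself.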

However, the training is still stopping at this error:

Root at /home/ray/dataset
Loading hybrid data from md-refined2019-5-5-5, general-set-2019-coreset-2016
Downloading https://scientificdata.blob.core.windows.net/dynaformer/dataset/mddata/md-refined2019-5-5-5.zip
Extracting /home/ray/dataset/md-refined2019-5-5-5.zip
Processing...
Loading file: /home/ray/dataset/md-refined2019-5-5-5_train_val.pkl, exists? True
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 54483) of binary: /home/ray/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ray/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
*************************************************
   /home/ray/anaconda3/bin/fairseq-train FAILED  
=================================================
Root Cause:
[0]:
  time: 2023-07-13_09:03:04
  rank: 0 (local_rank: 0)
  exitcode: -9 (pid: 54483)
  error_file: <N/A>
  msg: "Signal 9 (SIGKILL) received by PID 54483"
=================================================
Other Failures:
  <NO_OTHER_FAILURES>
*************************************************

@osession
Author

I figured out the solution to the above error was to switch my head node to a type that had 122 GB instead of 30 GB of storage, and it seems to be working now.
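For reading exit codes like the ones in the logs above, a negative value reported for a child process conventionally means it was terminated by that signal number, so -9 corresponds to SIGKILL. A process killed this way cannot print a traceback of its own. SIGKILL is often sent when a machine runs out of resources, though this thread only establishes that a larger node resolved it. The helpers below are a hypothetical sketch for decoding the exit code and checking basic resources before a large load; the memory query relies on `sysconf` names that are available on Linux and may be absent elsewhere.

```python
import os
import shutil
import signal


def describe_exitcode(code):
    """Turn a child-process exit code into a short human-readable string."""
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = f"signal {-code}"
        return f"killed by {name}"
    return f"exited with status {code}"


def free_disk_gb(path="."):
    """Free space, in GB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def total_ram_gb():
    """Total physical memory in GB, or None if the platform does not expose it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return pages * page_size / 1e9
```

Printing `free_disk_gb("/home/ray/dataset")` and `total_ram_gb()` before training starts would make it easier to tell whether a node type is too small for the dataset.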
