
ModelZoo-PyTorch example: MobileNetV3_large_100_for_PyTorch trains fine on a single card, but partway through 8-card training the main process died and the run hung; 3 NPUs stayed occupied, while the other 5 NPUs errored out because the main process was gone and were then released #5

Open · 535205856 opened this issue Aug 2, 2023 · 0 comments

ModelZoo-PyTorch/PyTorch/contrib/cv/classification/MobileNetV3_large_100_for_PyTorch# bash ./test/train_full_8p.sh --data_path=./tiny-imagenet-200

Using NVIDIA APEX AMP. Training in mixed precision.

Using NVIDIA APEX DistributedDataParallel.

Scheduled epochs: 12

./tiny-imagenet-200/train

./tiny-imagenet-200/val

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)

Train: 1 [ 23/24 (100%)] Loss: 317.145294 (162.0302) Time: 0.203s, 20132.34/s (0.204s, 20120.48/s) LR: 1.875e-01 Data: 0.000 (0.000) FPS: 20120.483 Batch_Size:512.0

Current checkpoints:

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)

Train: 2 [ 23/24 (100%)] Loss: 12993.997070 (6542.2793) Time: 0.201s, 20416.47/s (0.201s, 20367.03/s) LR: 1.067e+00 Data: 0.000 (0.000) FPS: 20367.028 Batch_Size:512.0

Current checkpoints:

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)

Train: 3 [ 23/24 (100%)] Loss: 13072.549805 (13072.2168) Time: 0.200s, 20458.09/s (0.201s, 20406.06/s) LR: 1.000e-05 Data: 0.000 (0.000) FPS: 20406.057 Batch_Size:512.0

Current checkpoints:

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-3.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)

Train: 4 [ 23/24 (100%)] Loss: 13074.948242 (13074.1235) Time: 0.200s, 20464.24/s (0.201s, 20388.30/s) LR: 1.000e-05 Data: 0.000 (0.000) FPS: 20388.300 Batch_Size:512.0

Current checkpoints:

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-3.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-4.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)

Train: 5 [ 23/24 (100%)] Loss: 13074.291016 (13074.4219) Time: 0.200s, 20436.34/s (0.200s, 20444.53/s) LR: 1.000e-05 Data: 0.000 (0.000) FPS: 20444.529 Batch_Size:512.0

Current checkpoints:

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-3.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-4.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-5.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)

Train: 6 [ 23/24 (100%)] Loss: 13075.957031 (13075.1313) Time: 0.201s, 20424.17/s (0.201s, 20404.91/s) LR: 1.000e-05 Data: 0.000 (0.000) FPS: 20404.910 Batch_Size:512.0

Current checkpoints:

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-3.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-4.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-5.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-6.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)

Train: 7 [ 23/24 (100%)] Loss: 13076.270508 (13076.5000) Time: 0.201s, 20398.41/s (0.200s, 20433.25/s) LR: 1.000e-05 Data: 0.000 (0.000) FPS: 20433.251 Batch_Size:512.0

Current checkpoints:

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-3.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-4.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-5.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-6.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-7.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)

Train: 8 [ 23/24 (100%)] Loss: 13076.963867 (13077.1807) Time: 0.201s, 20409.10/s (0.201s, 20424.42/s) LR: 1.000e-05 Data: 0.000 (0.000) FPS: 20424.422 Batch_Size:512.0

Current checkpoints:

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-1.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-2.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-3.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-4.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-5.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-6.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-7.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-8.pth.tar', 100.0)

('./output/train/20230727-025210-mobilenetv3_large_100-224/checkpoint-0.pth.tar', 0.0)

----------------------------------- 8-card train_1.log has an error

Traceback (most recent call last):
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 61, in wrapper
    raise exp
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 58, in wrapper
    func(*args, **kwargs)
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/route.py", line 275, in task_distribute
    key, func_name, detail = resource_proxy[TASK_QUEUE].get()
  File "<string>", line 2, in get
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
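
This EOFError is not TBE-specific: it is what a multiprocessing manager proxy raises when the process hosting the shared object dies while a client is blocked on it. A minimal sketch that reproduces the same traceback shape (hypothetical demo code, not the actual TBE repository_manager):

```python
# Minimal repro of the EOFError pattern above (hypothetical, not TBE code):
# a client blocks on a Manager-hosted queue, the manager process is killed,
# and the pending conn.recv() inside the proxy raises EOFError.
import multiprocessing as mp
import os
import signal
import threading

if __name__ == "__main__":
    manager = mp.Manager()        # spawns a separate manager process
    task_queue = manager.Queue()  # proxy object backed by that process

    # Kill the manager shortly after we block on get(), the way the dead
    # main process takes the TBE repository manager down with it.
    # (manager._process is a private attribute; fine for a demo.)
    threading.Timer(
        1.0, lambda: os.kill(manager._process.pid, signal.SIGKILL)
    ).start()

    try:
        task_queue.get()  # blocks in conn.recv() inside _callmethod
    except EOFError:
        print("EOFError: manager process died while we were waiting")
```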

----------------------------------- 8-card train_2.log has an error

(Same EOFError traceback as in train_1.log.)

/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown
  len(cache))
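
The leaked-semaphore warning looks like a symptom rather than a cause: it shows up whenever processes die abruptly without releasing the multiprocessing primitives they created. A tiny hypothetical repro:

```python
# Hypothetical repro of the semaphore_tracker warning: a child creates
# named POSIX semaphores and hard-exits without releasing them, so the
# tracker reports them as leaked at interpreter shutdown (Python 3.7).
import multiprocessing as mp
import os

def worker():
    sems = [mp.Semaphore() for _ in range(3)]  # registered with the tracker
    os._exit(1)  # abrupt exit, no cleanup: mirrors a crashed worker

if __name__ == "__main__":
    p = mp.Process(target=worker)
    p.start()
    p.join()
    # At shutdown, semaphore_tracker warns:
    # "There appear to be 3 leaked semaphores to clean up at shutdown"
```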

----------------------------------- 8-card train_3.log is still waiting

----------------------------------- 8-card train_4.log has an error

(Same EOFError traceback as in train_1.log.)

/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown
  len(cache))

----------------------------------- 8-card train_5.log has an error

(Same EOFError traceback as in train_1.log.)

----------------------------------- 8-card train_6.log has an error

(Same EOFError traceback as in train_1.log.)

/root/miniconda3/envs/torch-1.11.0/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 91 leaked semaphores to clean up at shutdown
  len(cache))

----------------------------------- 8-card train_7.log is still waiting

more /root/ascend/log/debug/plog/plog-83032_20230727024927917.log

[TRACE] GE(83032,python3):2023-07-27-02:49:27.848.095 [status:INIT] [ge_api.cc:200]83032 GEInitializeImpl:GEInitialize start
[TRACE] GE(83032,python3):2023-07-27-02:49:28.073.557 [status:RUNNING] [ge_api.cc:266]83032 GEInitializeImpl:Initializing environment
[TRACE] GE(83032,python3):2023-07-27-02:49:36.094.724 [status:STOP] [ge_api.cc:309]83032 GEInitializeImpl:GEInitialize finished
[TRACE] GE(83032,python3):2023-07-27-02:49:36.095.523 [status:INIT] [ge_api.cc:200]83032 GEInitializeImpl:GEInitialize start
[TRACE] HCCL(83032,python3):2023-07-27-02:49:57.407.898 [status:init] [op_base.cc:267][hccl-83032-0-1690426197-hccl_world_group][7]HcclCommInitRootInfo success,take time [2890202]us, rankNum[8], rank[7],rootInfo identifier[10.0.48.200%enp61s0f3_60000_0_1690426193976808], server[10.0.48.200%enp61s0f3], device[7]

Several of these plogs look normal. The three with the latest timestamps:

(base) root@hw:/media/sda/datastore/dataset/detect_dataset# more /root/ascend/log/debug/plog/plog-84469_20230727030241041.log

[ERROR] TBE(84469,python3):2023-07-27-03:02:41.035.597 [../../../../../../latest/python/site-packages/tbe/common/repository_manager/utils/repository_manager_log.py:30][log] [../../../../../../latest/python/site-packages/tbe/common/repository_manager/utils/common.py:100][repository_manager] The main process does not exist. We would kill multiprocess manager process: 84068.
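
So the sequence appears to be: the main training process died first; the TBE repository-manager watchdog noticed and killed the shared manager process (PID 84068 here); the workers blocked on the manager's queue then got the EOFError above, while two workers (train_3/train_7) never woke up and kept their NPUs occupied. The error message suggests a standard parent-liveness check; a rough sketch of that pattern (hypothetical, not the actual common.py code):

```python
# Hypothetical sketch of the parent-liveness check behind the message
# "The main process does not exist. We would kill multiprocess manager
# process: <pid>". Not the actual tbe/common/repository_manager code.
import os
import signal
import time

def pid_alive(pid: int) -> bool:
    """Signal 0 probes for process existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True

def watchdog(main_pid: int, manager_pid: int, interval: float = 5.0) -> None:
    while True:
        if not pid_alive(main_pid):
            # Main process is gone: take the manager down so workers
            # blocked on its queues fail fast instead of hanging forever.
            os.kill(manager_pid, signal.SIGKILL)
            return
        time.sleep(interval)
```

After a hang like this, `npu-smi info` should show which processes are still holding the three occupied devices; killing those PIDs frees the remaining NPUs.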
