train issues #70

Open
FanYans opened this issue Mar 29, 2024 · 0 comments

When I train on my own data, the loss becomes NaN:

Epoch: 0 Step: 200 / 218 time: 1.800146 s init_v_loss: 0.11003795 mean_v_loss: 0.11003795
Epoch: 0 Step: 201 / 218 time: 1.800304 s init_v_loss: 0.00626686 mean_v_loss: 0.05815240
Epoch: 0 Step: 202 / 218 time: 1.804186 s init_v_loss: 0.10782523 mean_v_loss: 0.07471001
Epoch: 0 Step: 203 / 218 time: 1.807079 s init_v_loss: 0.02169361 mean_v_loss: 0.06145591
Epoch: 0 Step: 204 / 218 time: 1.794902 s init_v_loss: nan mean_v_loss: nan
Epoch: 0 Step: 205 / 218 time: 1.804291 s init_v_loss: 0.05617625 mean_v_loss: nan
Epoch: 0 Step: 206 / 218 time: 1.793242 s init_v_loss: nan mean_v_loss: nan
Epoch: 0 Step: 207 / 218 time: 1.798064 s init_v_loss: nan mean_v_loss: nan
Epoch: 0 Step: 208 / 218 time: 1.798309 s init_v_loss: 0.02277363 mean_v_loss: nan
Epoch: 0 Step: 209 / 218 time: 1.797415 s init_v_loss: 0.10808768 mean_v_loss: nan
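
One observation on the log above: mean_v_loss looks like a running mean of init_v_loss starting at step 200 (for example, (0.11003795 + 0.00626686) / 2 ≈ 0.05815240). If so, a single NaN at step 204 would keep the mean at NaN even when later steps report finite losses. Below is a minimal Python sketch, under that assumption, for finding the first NaN step in a saved log and reproducing the effect. The helper names `first_nan_step` and `running_means` are hypothetical and are not part of AnimeGAN.

```python
import math
import re

# Matches lines such as:
# "Epoch: 0 Step: 204 / 218 time: 1.79 s init_v_loss: nan mean_v_loss: nan"
LOG_LINE = re.compile(r"Step:\s*(\d+)\s*/\s*\d+.*?init_v_loss:\s*(\S+)")


def first_nan_step(lines):
    """Return the first step whose init_v_loss is NaN, or None if there is none."""
    for line in lines:
        match = LOG_LINE.search(line)
        if match and math.isnan(float(match.group(2))):
            return int(match.group(1))
    return None


def running_means(losses):
    """Plain running mean; one NaN makes every later mean NaN as well."""
    total = 0.0
    means = []
    for count, loss in enumerate(losses, start=1):
        total += loss
        means.append(total / count)
    return means
```

Running `first_nan_step` over the training log gives the step whose input batch is worth inspecting first (for instance, for a corrupt or unusual image).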

Can anyone help me?

Also, training was interrupted after the first epoch finished. When I start training again, I get errors like the following:
WARNING:tensorflow:From /home/luda403/AnimeGAN-master/tools/data_loader.py:76: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use for ... in dataset: to iterate over a dataset. If using tf.estimator, return the Dataset object directly from your input function. As a last resort, you can use tf.compat.v1.data.make_one_shot_iterator(dataset).
[*] Reading checkpoints...
[*] Success to read AnimeGAN.model-0
[*] Load SUCCESS
2024-03-28 12:13:54.255522: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2024-03-28 12:13:54.573781: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2024-03-28 12:16:58.252940: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Resource exhausted: /tmp/tempfile-dcxx02-8ffd700-200527-614b77fefbdf8; No space left on device
Relying on driver to perform ptx compilation. This message will be only logged once.
2024-03-28 12:17:10.050582: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2024-03-28 12:17:10.050867: I tensorflow/stream_executor/stream.cc:4976] [stream=0x563a23f10970,impl=0x563a23f10400] did not memset GPU location; source: 0x7f4909ffcb10; size: 8388608; pattern: ffffffff
2024-03-28 12:17:10.050880: I tensorflow/stream_executor/stream.cc:4976] [stream=0x563a23f10970,impl=0x563a23f10400] did not memset GPU location; source: 0x7f4909ffcb30; size: 8388608; pattern: ffffffff
2024-03-28 12:17:10.050927: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at conv_ops.cc:1006 : Not found: No algorithm worked!
Traceback (most recent call last):
File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(196608, 3), b.shape=(3, 3), m=196608, n=3, k=3
[[{{node Tensordot/MatMul}}]]
[[generator/G_MODEL/Tanh/_1417]]
(1) Internal: Blas GEMM launch failed : a.shape=(196608, 3), b.shape=(3, 3), m=196608, n=3, k=3
[[{{node Tensordot/MatMul}}]]
0 successful operations.
0 derived errors ignored.

  During handling of the above exception, another exception occurred:
  
  Traceback (most recent call last):
    File "train.py", line 100, in <module>
      main()
    File "train.py", line 94, in main
      gan.train()
    File "/home/luda403/AnimeGAN-master/AnimeGAN.py", line 258, in train
      self.Generator_loss, self.G_loss_merge], feed_dict = train_feed_dict)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
      run_metadata_ptr)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
      feed_dict_tensor, options, run_metadata)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
      run_metadata)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
      raise type(e)(node_def, op, message)
  tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
    (0) Internal: Blas GEMM launch failed : a.shape=(196608, 3), b.shape=(3, 3), m=196608, n=3, k=3
           [[node Tensordot/MatMul (defined at /home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
           [[generator/G_MODEL/Tanh/_1417]]
    (1) Internal: Blas GEMM launch failed : a.shape=(196608, 3), b.shape=(3, 3), m=196608, n=3, k=3
           [[node Tensordot/MatMul (defined at /home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  0 successful operations.
  0 derived errors ignored.
  
  Original stack trace for 'Tensordot/MatMul':
    File "train.py", line 100, in <module>
      main()
    File "train.py", line 89, in main
      gan.build_model()
    File "/home/luda403/AnimeGAN-master/AnimeGAN.py", line 160, in build_model
      t_loss = self.con_weight * c_loss + self.sty_weight * s_loss + color_loss(self.real,self.generated) * self.color_weight
    File "/home/luda403/AnimeGAN-master/tools/ops.py", line 278, in color_loss
      con = rgb2yuv(con)
    File "/home/luda403/AnimeGAN-master/tools/ops.py", line 295, in rgb2yuv
      return tf.image.rgb_to_yuv(rgb)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/ops/image_ops_impl.py", line 2930, in rgb_to_yuv
      return math_ops.tensordot(images, kernel, axes=[[ndims - 1], [0]])
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py", line 4071, in tensordot
      ab_matmul = matmul(a_reshape, b_reshape)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
      return target(*args, **kwargs)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/ops/math_ops.py", line 2754, in matmul
      a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 6136, in mat_mul
      name=name)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
      op_def=op_def)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
      return func(*args, **kwargs)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
      attrs, op_def, compute_device)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
      op_def=op_def)
    File "/home/dcxx/local/miniconda3/envs/AnimeGANv2/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
      self._traceback = tf_stack.extract_stack()
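
The first warning in the log above is `No space left on device` for a temp file under /tmp, reported about 12 seconds before the cuBLAS failure. Whether that is the actual cause of `Blas GEMM launch failed` is an assumption, but it is cheap to rule out. Below is a minimal Python sketch that reports free space in the system temp directory. The helper name `free_gib` is hypothetical.

```python
import shutil
import tempfile


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    tmp_dir = tempfile.gettempdir()  # usually /tmp on Linux
    print(f"{tmp_dir}: {free_gib(tmp_dir):.2f} GiB free")
```

If the reported value is close to zero, freeing space on that filesystem before restarting training would be a reasonable first step.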