Error when running yelp/train.py #16

Open
jiwoongim opened this issue Jul 24, 2018 · 9 comments

Comments

@jiwoongim

I followed README.md and ran
python train.py --data_path ./data

But then I got the following errors:

{'dropout': 0.0, 'lr_ae': 1, 'load_vocab': '', 'nlayers': 1, 'batch_size': 64, 'beta1': 0.5, 'gan_gp_lambda': 0.1, 'nhidden': 128, 'vocab_size': 30000, 'niters_gan_schedule': '', 'niters_gan_d': 5, 'lr_gan_d': 0.0001, 'grad_lambda': 0.01, 'sample': False, 'arch_classify': '128-128', 'clip': 1, 'hidden_init': False, 'cuda': True, 'log_interval': 200, 'device_id': '0', 'temp': 1, 'seed': 1111, 'maxlen': 25, 'lowercase': True, 'data_path': './data', 'lambda_class': 1, 'lr_classify': 0.0001, 'outf': 'yelp_example', 'noise_r': 0.1, 'noise_anneal': 0.9995, 'lr_gan_g': 0.0001, 'niters_gan_g': 1, 'arch_g': '128-128', 'z_size': 32, 'epochs': 25, 'niters_ae': 1, 'arch_d': '128-128', 'emsize': 128, 'niters_gan_ae': 1}
Original vocab 9599; Pruned to 9603
Number of sentences dropped from ./data/valid1.txt: 0 out of 38205 total
Number of sentences dropped from ./data/valid2.txt: 0 out of 25278 total
Number of sentences dropped from ./data/train1.txt: 0 out of 267314 total
Number of sentences dropped from ./data/train2.txt: 0 out of 176787 total
Vocabulary Size: 9603
382 batches
252 batches
4176 batches
2762 batches
Loaded data!
Seq2Seq2Decoder(
  (embedding): Embedding(9603, 128)
  (embedding_decoder1): Embedding(9603, 128)
  (embedding_decoder2): Embedding(9603, 128)
  (encoder): LSTM(128, 128, batch_first=True)
  (decoder1): LSTM(256, 128, batch_first=True)
  (decoder2): LSTM(256, 128, batch_first=True)
  (linear): Linear(in_features=128, out_features=9603, bias=True)
)
MLP_G(
  (layer1): Linear(in_features=32, out_features=128, bias=True)
  (bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (activation1): ReLU()
  (layer2): Linear(in_features=128, out_features=128, bias=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (activation2): ReLU()
  (layer7): Linear(in_features=128, out_features=128, bias=True)
)
MLP_D(
  (layer1): Linear(in_features=128, out_features=128, bias=True)
  (activation1): LeakyReLU(negative_slope=0.2)
  (layer2): Linear(in_features=128, out_features=128, bias=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (activation2): LeakyReLU(negative_slope=0.2)
  (layer6): Linear(in_features=128, out_features=1, bias=True)
)
MLP_Classify(
  (layer1): Linear(in_features=128, out_features=128, bias=True)
  (activation1): ReLU()
  (layer2): Linear(in_features=128, out_features=128, bias=True)
  (bn2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (activation2): ReLU()
  (layer6): Linear(in_features=128, out_features=1, bias=True)
)
Training...
Traceback (most recent call last):
  File "train.py", line 574, in <module>
    train_ae(1, train1_data[niter], total_loss_ae1, start_time, niter)
  File "train.py", line 400, in train_ae
    output = autoencoder(whichdecoder, source, lengths, noise=True)
  File "/localhome/imd/anaconda2/envs/Pytorch/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/groups/branson/home/imd/Documents/project/ARAE/yelp/models.py", line 143, in forward
    hidden = self.encode(indices, lengths, noise)
  File "/groups/branson/home/imd/Documents/project/ARAE/yelp/models.py", line 160, in encode
    batch_first=True)
  File "/localhome/imd/anaconda2/envs/Pytorch/lib/python3.5/site-packages/torch/onnx/__init__.py", line 56, in wrapper
    if not might_trace(args):
  File "/localhome/imd/anaconda2/envs/Pytorch/lib/python3.5/site-packages/torch/onnx/__init__.py", line 130, in might_trace
    first_arg = args[0]
IndexError: tuple index out of range
@jakezhaojb
Owner

Hmm, could you maybe try running it with Python 3?

@vineetjohn

I've run into the same issue.
Python 3.5.2
torch==0.4.1

Training...     
Traceback (most recent call last):
  File "train.py", line 574, in <module>
    train_ae(1, train1_data[niter], total_loss_ae1, start_time, niter)                                                       
  File "train.py", line 400, in train_ae
    output = autoencoder(whichdecoder, source, lengths, noise=True)
  File "/home/v2john/.pyenv/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__                   
    result = self.forward(*input, **kwargs)
  File "/home/v2john/ARAE/yelp/models.py", line 143, in forward
    hidden = self.encode(indices, lengths, noise)                                                                            
  File "/home/v2john/ARAE/yelp/models.py", line 160, in encode
    batch_first=True)
  File "/home/v2john/.pyenv/lib/python3.5/site-packages/torch/onnx/__init__.py", line 67, in wrapper                         
    if not might_trace(args):
  File "/home/v2john/.pyenv/lib/python3.5/site-packages/torch/onnx/__init__.py", line 141, in might_trace
    first_arg = args[0]                                                                                                      
IndexError: tuple index out of range

Python 3 clearly isn't the fix. It seems like something about the PyTorch + ONNX interop is broken.
Is there a specific version of PyTorch that's needed to run this?

@vineetjohn

@jiwoongim

You can try using my forked version of the repository to see if it fixes the issue for you.
I've verified it to be working for Python 3.5.2 and PyTorch 0.4.1
https://github.com/vineetjohn/arae

I've not identified the actual problem yet, but I've added a workaround that avoids having to deal with ONNX altogether. The pack_padded_sequence method in torch.nn.utils.rnn seems to be buggy.
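
For anyone hitting this: on torch 0.4.x the ONNX tracing wrapper around pack_padded_sequence inspects args[0], so a call made entirely with keyword arguments (which is what the traceback suggests models.py does) leaves args empty and raises the IndexError above. Here is a minimal, illustrative sketch of the difference; the tensor shapes are made up and this is not the actual ARAE code:

import torch
from torch.nn.utils.rnn import pack_padded_sequence

embeddings = torch.randn(2, 5, 128)   # (batch, seq_len, emsize), illustrative values
lengths = [5, 3]                      # sequence lengths, sorted in descending order

# Fails on torch 0.4.x with "IndexError: tuple index out of range",
# because the ONNX wrapper only inspects positional arguments:
# packed = pack_padded_sequence(input=embeddings, lengths=lengths, batch_first=True)

# Works: pass the tensors positionally.
packed = pack_padded_sequence(embeddings, lengths, batch_first=True)
print(packed.data.shape)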

@jakezhaojb
Owner

Guys, can you try Python 3.6? @jiwoongim @vineetjohn

@rainyrainyguo

@jiwoongim
You can try using my forked version of the repository; I have resolved the issue by making several changes to the original code.
I have verified it to be working with Python 3.6.5 and PyTorch 0.4.1
https://github.com/rainyrainyguo/ARAE

@vineetjohn

@jakezhaojb

This doesn't look like a Python version issue.
The named arguments used in this project are inconsistent with those accepted by PyTorch 0.4.1.

You should consider adding the version of PyTorch used for your experiments to the project README.

@jakezhaojb
Owner

@vineetjohn Good point! I used PyTorch 0.3.1. I'm adding this to the README.

@dangvanthin

@rainyrainyguo
I have run your forked version on Python 3.6.5 with PyTorch 0.4.1 (cuDNN 7.1.3, CUDA toolkit 8.0) and I get the following error:
Training ....
run_oneb.py:256: UserWarning: torch.nn.utils.clip_grad_norm is now deprecated in favor of torch.nn.utils.clip_grad_norm_.
run_oneb.py:259: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
run_oneb.py:263: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
| epoch 1 | 0/ 765 batches | ms/batch 0.61 | loss 0.05 | ppl 1.05 | acc 0.00
Traceback (most recent call last):
  File "run_oneb.py", line 102, in <module>
    exec(open("train.py").read())
  File "<string>", line 434, in <module>
  File "<string>", line 395, in train
  File "<string>", line 324, in train_gan_d
  File "/home/thindv/anaconda3/envs/ARAE/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/thindv/anaconda3/envs/ARAE/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: invalid gradient at index 0 - expected shape [] but got [1]

Can you give me some advice?
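
Not a confirmed fix, but this RuntimeError usually means backward() is being handed a gradient tensor of shape [1] while the loss is a 0-dim scalar under PyTorch 0.4; the 0.3-era WGAN-style one/mone tensors in train_gan_d are the likely culprit. A minimal sketch of the mismatch and the adjustment, with illustrative names only (the actual variables in the code may differ):

import torch

# Under PyTorch 0.3 a scalar loss had shape [1]; under 0.4 it is 0-dim,
# so the gradient passed to backward() must be 0-dim as well.
fake_score = torch.randn(4, 1, requires_grad=True)   # stand-in for the critic output
errD = fake_score.mean()                              # 0-dim scalar on torch >= 0.4

one = torch.FloatTensor([1])   # shape [1]: triggers "expected shape [] but got [1]"
# errD.backward(one)

one = torch.ones_like(errD)    # 0-dim, matches the loss
errD.backward(one)             # equivalent to errD.backward() for a scalar loss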

@V-Enzo

V-Enzo commented Mar 12, 2020

@dangvanthin Hi, I ran into the same problem. Have you found a solution yet? Thank you.
