TypeError: object of type 'int' has no len() #53

Open
masakuri opened this issue Dec 11, 2017 · 15 comments

@masakuri

Training worked with English train/dev files.
But when I trained with Japanese train/dev files (and a pre-trained Japanese word embeddings file), I got the following error.

  File "build/bdist.linux-x86_64/egg/deepcrf/__init__.py", line 66, in train
  File "build/bdist.linux-x86_64/egg/deepcrf/main.py", line 98, in run
  File "build/bdist.linux-x86_64/egg/deepcrf/util.py", line 102, in read_conll_file
TypeError: object of type 'int' has no len()

I want to use a pre-trained Japanese char embeddings file, but there does not seem to be a --char_emb_file option.
I am wondering if this is the cause of the error.
Does it support Japanese train/dev files (or a --char_emb_file option)?
Thank you.

@masakuri
Author

masakuri commented Dec 11, 2017

I'm sorry, I typed an incorrect command.
The error was solved.
Update: I still have the same error...

@aonotas
Owner

aonotas commented Dec 11, 2017

Ok, please let me know your command.

@masakuri
Author

$ deep-crf train input_train_jp.txt --delimiter=" " --dev_file input_dev_jp.txt --save_dir save_jpmodel_dir --save_name bilstm-cnn-crf_adam_jp --optimizer adam --word_emb_file jp_word_emb300.txt --word_emb_vocab_type replace_only --gpu 0

Thank you.

@aonotas
Owner

aonotas commented Dec 11, 2017

I think this error occurs because the format of your training file input_train_jp.txt is wrong (invalid input feature sizes).

I just fixed the code, so please use the latest version and let me know the result.
I think input_train_jp.txt should look like this:

彼 O
は O
オバマ大統領 S-PERSON
です O

彼 O
は O

@masakuri
Author

I got the following error.
ValueError: Invalid input feature sizes: "3". Please check at line [1298]

I checked line 1298 of input_train_jp.txt and found that the "word" contains a space, like this:

ほげ[space]ほげ[space]O

"ほげ[space]ほげ" is a proper noun.

Thank you for helping me find the cause of this error.
Is it OK to solve this problem by using --delimiter="\t" and formatting input_train_jp.txt like ほげ[space]ほげ[tab]O ?
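
For readers hitting the same "Invalid input feature sizes" message, the check can be reproduced outside deep-crf. The following is a minimal sketch, not deep-crf's own code: `find_bad_lines` is a hypothetical helper that assumes every non-blank line should split into exactly two columns (word and label) on the chosen delimiter.

```python
def find_bad_lines(lines, delimiter=" ", expected_columns=2):
    """Return (line_number, column_count) for non-blank lines with an unexpected column count.

    Hypothetical helper for illustration; it is not part of deep-crf.
    """
    bad = []
    for line_number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text == "":
            continue  # a blank line is a sentence boundary, not a data line
        columns = text.split(delimiter)
        if len(columns) != expected_columns:
            bad.append((line_number, len(columns)))
    return bad


# A word that itself contains a space yields three columns with a space delimiter.
sample = ["彼 O\n", "は O\n", "ほげ ほげ O\n", "\n", "彼 O\n"]
print(find_bad_lines(sample, delimiter=" "))  # [(3, 3)]

# The same word is fine when the columns are separated by a real tab character.
print(find_bad_lines(["ほげ ほげ\tO\n"], delimiter="\t"))  # []
```

To scan a real file, pass an open file object, for example `find_bad_lines(open("input_train_jp.txt", encoding="utf-8"))`.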

@masakuri
Author

masakuri commented Dec 11, 2017

I fixed the format of input_train_jp.txt and ran the command ($ deep-crf train input_train_jp.txt --delimiter="\t" --dev_file input_dev_jp.txt --save_dir save_jpmodel_dir --save_name bilstm-cnn-crf_adam_jp --optimizer adam --word_emb_file jp_word_emb300.txt --word_emb_vocab_type replace_only --gpu 0), but I got the following error:

  File "build/bdist.linux-x86_64/egg/deepcrf/__init__.py", line 66, in train
  File "build/bdist.linux-x86_64/egg/deepcrf/main.py", line 102, in run
ValueError: Invalid training sizes: 0 sentences.

Any ideas?

@aonotas
Owner

aonotas commented Dec 11, 2017

Is it OK to solve this problem by using --delimiter="\t" and input_train_jp.txt format is like ほげ[space]ほげ[tab]O ?

Yes! I think it is a good solution.

Each sentence in input_train_jp.txt must be separated from the next by a blank line (an empty line, \n). This format is called CoNLL format.

For example, if you have two sentences:

$ cat input_file.txt
Barack  B-PERSON
Hussein I-PERSON
Obama   E-PERSON
is      O
a       O
man     O
.       O

Yuji      B-PERSON
Matsumoto E-PERSON
is        O
a         O
man       O
.         O
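
To check how many sentences a file contains under this convention, a small reader can group lines at blank-line boundaries. This is a minimal sketch under stated assumptions, not deep-crf's `read_conll_file`: `read_sentences` is a hypothetical helper, and it assumes columns are separated by exactly one delimiter character (the aligned listing above uses extra spaces for display only).

```python
def read_sentences(lines, delimiter=" "):
    """Group CoNLL-style lines into sentences, splitting at blank lines.

    Hypothetical helper for illustration; it is not deep-crf's reader.
    """
    sentences, current = [], []
    for line in lines:
        text = line.rstrip("\r\n")
        if text.strip() == "":
            if current:  # close the sentence collected so far
                sentences.append(current)
                current = []
            continue
        current.append(tuple(text.split(delimiter)))
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences


sample = [
    "Yuji B-PERSON\n",
    "Matsumoto E-PERSON\n",
    "is O\n",
    "\n",
    "He O\n",
    "is O\n",
]
sentences = read_sentences(sample)
print(len(sentences))   # 2
print(sentences[0][0])  # ('Yuji', 'B-PERSON')
```

If this count is greater than zero for a file that deep-crf reports as "0 sentences", the blank lines are probably not the problem and the delimiter is the next thing to inspect.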

@masakuri
Author

My input_train_jp.txt file has a blank line ("\n") between sentences (more precisely, between tweets), but I still got the error...

@aonotas
Owner

aonotas commented Dec 11, 2017

Does your input_train_jp.txt now look like the following?

あああ[tab]O

あ[tab]O
い[tab]O
う[tab]O

お[space]お[tab]O
お[tab]O

@masakuri
Author

Now your input_train_jp.txt seems following?

あああ[tab]O

あ[tab]O
い[tab]O
う[tab]O

お[space]お[tab]O
お[tab]O

Yes.

@aonotas
Owner

aonotas commented Dec 11, 2017

OK. If you don't mind, can you send me your input file via e-mail?
nanigashi03[at] gmail.com

@aonotas
Owner

aonotas commented Dec 11, 2017

Or, please try the following replacements, so that [space] becomes the column separator instead of [tab]:

お[space]お   =>    お_お

[tab]   => [space]

and please use --delimiter=" ".

Maybe the [tab] character (or how it is handled as Unicode) causes this error?
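
The two replacements can be scripted rather than done by hand. This is a minimal sketch assuming the file currently uses a real tab between word and label: `tab_to_space_format` is a hypothetical helper, and the file names in the usage note are placeholders.

```python
def tab_to_space_format(line):
    """Convert one tab-separated line to the space-separated form.

    Spaces inside a column become underscores first, then the tab that
    separated the columns becomes a single space. Blank lines stay blank.
    Hypothetical helper for illustration only.
    """
    text = line.rstrip("\r\n")
    if text == "":
        return ""
    columns = text.split("\t")
    return " ".join(column.replace(" ", "_") for column in columns)


print(tab_to_space_format("お お\tO\n"))  # お_お O
print(tab_to_space_format("お\tO\n"))     # お O
```

A whole file could then be converted with something like `open("converted.txt", "w", encoding="utf-8").write("\n".join(tab_to_space_format(l) for l in open("input_train_jp.txt", encoding="utf-8")) + "\n")`, after which `--delimiter=" "` applies.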

@masakuri
Author

replace [tab] to [space]:

お[space]お => お_お

[tab] => [space]
use --delimiter=" "

It worked!!!
Thank you very much for your help!!!

@aonotas
Owner

aonotas commented Dec 11, 2017

OK.
It seems that either our code or the input format with [tab] causes that error.
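
One possible explanation, which this thread does not confirm, is how the shell passes the argument: in a POSIX shell, `--delimiter="\t"` delivers the two characters backslash and `t`, not a tab. Whether deep-crf converts that escape sequence internally is an assumption left open here. The sketch below only shows what splitting on each string does; `split_columns` is a hypothetical helper.

```python
def split_columns(line, delimiter):
    """Split one data line on the given delimiter (illustration only)."""
    return line.split(delimiter)


literal_backslash_t = "\\t"  # two characters, as passed by --delimiter="\t"
real_tab = "\t"              # one character, an actual tab

line = "ほげ ほげ\tO"

print(len(literal_backslash_t), len(real_tab))   # 2 1
print(split_columns(line, literal_backslash_t))  # ['ほげ ほげ\tO'] (no split happens)
print(split_columns(line, real_tab))             # ['ほげ ほげ', 'O']
```

If this is what happened, every line would have been read as a single column. In bash, a real tab can be passed with `$'\t'`.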

@masakuri masakuri changed the title from "Does it support Japanese train/dev file (or --char_emb_file option) ?" to "TypeError: object of type 'int' has no len()" Dec 11, 2017
@masakuri
Author

I see. Thank you very much.
I changed the issue title so that it reflects the content.
