NLP Problems #8

nbogda · 2020-04-24T15:35:19Z

I would like to label some text strings with a binary label. My data is a CSV file with two columns, one with the string of text, and another with human labels (for ground-truth purposes). The text fields are pretty long, with some reaching 4,000 characters. The file has 6,281 rows. Whenever I try to upload the CSV, I get the following error:

I figured it might have been an encoding problem, so I changed all string encoding in the file to UTF-8 and uploaded that version instead. Whenever I upload the UTF-8 version it hangs on "processing" for a long time, and opening the file reveals the image below. This is the first row of the data, truncated at 83 characters and repeated 12 times. However, this particular string only appears in the data set twice.

I tried shortening the data set to only 50 rows and got the same behavior as above. Then I tried shortening the actual text strings to 50 characters, because I figured the issue might be the string length. Uploading the full file with all fields truncated at 50 characters results in the behavior below:

Then I tried shortening the text even more, to 10 characters, and found that it managed to upload the file! However, it is still stuck in "processing". I also discovered that the upload with only 10 characters works for both the original data and the UTF-8 encoded data, but longer text strings will throw the error shown in the first image in this issue.

My question is, is there a way to do NLP with the long text strings? Is there a limit on how long the text strings can be? Thanks in advance.

slrbl · 2020-04-28T18:08:30Z

You need to upgrade Tornado to the latest version by doing a git pull. NLP data should be a one-column CSV that contains only the text. You can have a look at https://www.youtube.com/watch?v=xcX-95iGKxY for more details.
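
For the two-column file described in the question, one way to get a one-column CSV is to copy only the text column into a new file. The following is a minimal sketch using Python's standard `csv` module, not part of Tornado itself. The function name `extract_text_column`, the file names, and the assumption that the text is in the first column are all hypothetical. Whether Tornado expects a header row is not stated in this thread, so the sketch copies every row as-is.

```python
import csv


def extract_text_column(src_path, dst_path, text_column=0):
    """Copy one column of a CSV file into a new one-column CSV file.

    text_column is the zero-based index of the column holding the text
    (assumed here to be the first column). Returns the number of rows written.
    """
    rows_written = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        # Quote every field so commas and line breaks inside long texts
        # stay within a single cell.
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        for row in reader:
            if not row:
                continue  # skip completely empty lines
            writer.writerow([row[text_column]])
            rows_written += 1
    return rows_written


if __name__ == "__main__":
    # Hypothetical file names; replace with your own paths.
    count = extract_text_column("labeled_data.csv", "text_only.csv")
    print(f"wrote {count} rows")
```

The human labels stay in the original file, so they can still be compared against the tool's output afterwards.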
