NLP Problems #8

Open
nbogda opened this issue Apr 24, 2020 · 1 comment

nbogda commented Apr 24, 2020

I would like to label some text strings with a binary label. My data is a CSV file with two columns, one with the string of text, and another with human labels (for ground-truth purposes). The text fields are pretty long, with some reaching 4,000 characters. The file has 6,281 rows. Whenever I try to upload the CSV, I get the following error:

image

I figured it might be an encoding problem, so I converted every string in the file to UTF-8 and uploaded that version instead. Whenever I upload the UTF-8 version, it hangs on "processing" for a long time, and opening the file shows what is in the image below: the first row of the data, truncated at 83 characters and repeated 12 times. However, this particular string appears only twice in the data set.

image

I tried shortening the data set to only 50 rows and got the same behavior as above. Then I tried truncating the text strings themselves to 50 characters, because I figured the issue might be the string length. Uploading the full file with all fields truncated at 50 characters produces the behavior below:

image

Then I tried shortening the text even more, to 10 characters, and found that it managed to upload the file! However, it is still stuck in "processing". I also discovered that the upload with only 10 characters works for both the original data and the UTF-8 encoded data, but longer text strings will throw the error shown in the first image in this issue.

image

My question is, is there a way to do NLP with the long text strings? Is there a limit on how long the text strings can be? Thanks in advance.

slrbl (Owner) commented Apr 28, 2020

You need to upgrade Tornado to the latest version by doing a git pull. NLP data should be a one-column CSV that contains only the text. You can have a look at https://www.youtube.com/watch?v=xcX-95iGKxY for more details.
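
For anyone preparing a two-column file like the one described above, here is a minimal sketch of reducing it to a one-column CSV using Python's standard `csv` module. The function name `extract_text_column`, the column position, and the header handling are assumptions for illustration; the thread does not say whether Tornado expects a header row, so check that against the video.

```python
import csv
import io


def extract_text_column(src, dst, text_column=0, has_header=False):
    """Copy a single column from a CSV source into a one-column CSV.

    src and dst are text-mode file objects. text_column is the index of
    the column holding the text (assumed here to be the first column).
    Returns the number of rows written.
    """
    reader = csv.reader(src)
    # The csv module quotes fields that contain commas, quotes, or newlines.
    writer = csv.writer(dst, lineterminator="\n")
    if has_header:
        next(reader, None)  # skip the header row if the input has one
    count = 0
    for row in reader:
        if len(row) <= text_column:
            continue  # skip blank or short rows
        text = row[text_column].strip()
        if not text:
            continue  # skip rows whose text field is empty
        writer.writerow([text])
        count += 1
    return count


# In-memory demonstration: two labeled rows, one containing a line break.
sample = io.StringIO('"long text, with a comma",1\n"another\nmultiline text",0\n')
out = io.StringIO()
rows_written = extract_text_column(sample, out)
```

With real files, open both with `newline=""` and `encoding="utf-8"` so that embedded line breaks and non-ASCII characters survive the round trip. The human labels are dropped from the upload, so keep the original file if you want to compare them against the tool's output later.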
