Annotator not annotating some files #111

Open
LeoFrom opened this issue Jun 19, 2024 · 6 comments

Comments

@LeoFrom

LeoFrom commented Jun 19, 2024

Selecting text in some .txt files does not annotate the selection, in either the Windows app or the web application.
image

Attaching the text file for testing purposes:
01.01.01.01.199.txt

@LeoFrom LeoFrom closed this as completed Jun 19, 2024
@LeoFrom LeoFrom reopened this Jun 19, 2024
@alvi-khan
Collaborator

Screenshot 2024-06-20 at 00-35-24 NER Annotator for SpaCy

Hello @LeoFrom. Thank you for providing the text file, that was helpful. I just tried it on the web version and it seems to be working for me. Could you kindly provide some more information?

  1. What are you using as the text separator?
  2. What are you using as the annotation precision?
  3. If you're okay with doing so, please provide the tags file.

I'd like to get as close as possible to your setup.

@LeoFrom
Author

LeoFrom commented Jun 19, 2024

Hello, thank you for your reply.

  • I'm using --- as my text separator
  • My annotation precision is word level
  • Here is my tags file: Tags.json

This may help: it was functional at the beginning, but it seems that when I change the text separator or switch between texts, tagging sometimes freezes. I'll include two more texts so you can try to reproduce the bug.

01.01.01.01.55.txt
01.01.01.01.27.txt

@alvi-khan
Collaborator

Thanks! I've managed to replicate it now. I'll take a look at why this is happening and try to get back to you soon.

@alvi-khan
Collaborator

It seems the issue occurs if there are double quotes (") inside the text.
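To check which texts in a file are affected, something like the following could help. It is a minimal sketch that assumes the file is split on the literal separator string (`---` in this report); the function name is made up for illustration and is not part of the annotator.

```python
def segments_with_double_quotes(raw, separator="---"):
    """Return the indices of separator-delimited segments that contain a '"'.

    Assumption: segments are obtained by a plain split on the separator string.
    """
    segments = raw.split(separator)
    return [i for i, segment in enumerate(segments) if '"' in segment]


sample = 'First text.\n---\nHe said "hello" twice.\n---\nThird text.'
print(segments_with_double_quotes(sample))  # [1]
```

Indices are zero-based, so `[1]` means only the second text contains a double quote.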

The Treebank Tokenizer we use is a JavaScript port of the one used by the NLTK Python library. The issue was reported (and fixed) by the NLTK team, so it seems we need to update the port.
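For context, Treebank-style tokenizers rewrite double quotes as `` and '' tokens, so the tokens no longer appear verbatim in the original text. The toy sketch below shows how that can break a naive token-to-offset alignment. Both functions are hypothetical stand-ins written for illustration, not the annotator's code or NLTK's, and whether this is what actually happens in the annotator is not confirmed.

```python
def toy_tokenize(text):
    """Toy Treebank-style step: rewrite double quotes, then split on whitespace."""
    pieces = []
    opening = True
    for ch in text:
        if ch == '"':
            # Opening quotes become ``, closing quotes become ''.
            pieces.append(" `` " if opening else " '' ")
            opening = not opening
        else:
            pieces.append(ch)
    return "".join(pieces).split()


def naive_align(tokens, text):
    """Find each token in the original text, left to right.

    Raises ValueError when a token does not occur in the text.
    """
    spans, pos = [], 0
    for token in tokens:
        start = text.index(token, pos)
        spans.append((start, start + len(token)))
        pos = start + len(token)
    return spans


print(naive_align(toy_tokenize("He said hi today"), "He said hi today"))

quoted = 'He said "hi" today'
try:
    naive_align(toy_tokenize(quoted), quoted)
except ValueError:
    print("alignment failed: a rewritten quote token is not in the original text")
```

An aligner that maps the rewritten quote tokens back to the original `"` characters avoids this failure.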

@tecoholic it would be great if you could help with this one, since I haven't looked into your port yet. I could try to put together a PR if I can figure out what needs to be changed there.

In the meantime, @LeoFrom, if it's an acceptable solution for your use case, you could try replacing all the double quotes with single quotes. I checked locally and it seemed to work.
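A minimal sketch of that workaround as a preprocessing step, run on the text before loading it into the annotator. The function name is made up for illustration, and only the straight double quote (") is handled, since that is the character discussed here.

```python
def replace_double_quotes(text):
    """Swap every straight double quote for a single quote."""
    return text.replace('"', "'")


original = 'She replied "not yet" and left.'
cleaned = replace_double_quotes(original)
print(cleaned)  # She replied 'not yet' and left.
```

Because each replacement is one character for one character, the text length is unchanged, so character offsets computed on the cleaned text still line up with the original.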

@tecoholic
Owner

@alvi-khan I will take a look.

@tecoholic
Owner

tecoholic commented Jun 20, 2024

@alvi-khan Looking at the code, those fixes appear to be in the JS port already. To confirm, I added the unit tests from the Python version and they pass as expected. So I think the issue might be elsewhere and not in the tokenizer.
