Terrible Documentation #28

Open
rjuez00 opened this issue Apr 4, 2022 · 10 comments

Comments


rjuez00 commented Apr 4, 2022

Feature description

Improve the documentation. How is it possible that I cannot find documentation explaining the different classes? The tool can be as good as you like, but I have to read the code directly to understand what features it has or how to save the datasets once they are transformed...


gilokip commented Apr 17, 2022

Have you found a way to save the datasets? I'm also having a lot of difficulty saving.


rjuez00 commented Apr 17, 2022

Hi, yes!
When you load a JSONL dataset with read_jsonl, you can iterate over it; each entry is a document with its annotations.
For each document you then call the transformation function you want to use.
Each item it returns has an "id" and, most importantly, a "data" field with the formatted document, so you just need to write that to a file.

Here is an example:
(Be careful with the encoding, yours might be different, and double-check the "conll_03" method name, I'm not sure I typed it correctly.)

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

# Convert every annotated document to CoNLL 2003 format and write it to one file.
# (Upstream doccano-transformer names this method to_conll2003; see the later comment below.)
with open("datasets/test.dataset", "w", encoding="utf-8") as file:
    for entry in read_jsonl(filepath='datasets/doccanoSplit/test_Anotados.jsonl', dataset=NERDataset, encoding='latin-1').conll_03(tokenizer=str.split):
        file.write(entry["data"] + "\n")
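
For reference, here is a minimal sketch of the kind of JSONL record read_jsonl expects: one JSON object per line with a "text" field and a "labels" list of character spans. The values below are made up, and the exact field names can vary between doccano versions:

import json

# One annotated document; each span is [start_offset, end_offset, label].
record = {
    "id": 1,
    "text": "Barack Obama visited Paris.",
    "labels": [[0, 12, "PER"], [21, 26, "LOC"]],
}
print(json.dumps(record, ensure_ascii=False))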

I don't want to brag, but I really recommend using my fork of doccano_transformer; I have fixed several important bugs (some annotations weren't being transformed correctly and weren't being saved).

I also added some other transformers.
The only caveat is that the spaCy converter doesn't work anymore and is pending a fix, which I don't have time for right now, so keep that in mind.

To install and use it:
pip install git+https://www.github.com/rjuez00/doccano-transformer


gilokip commented Apr 18, 2022

I followed your instructions but I'm getting this error: https://pastebin.com/FaVxBgY6
What could be the problem? All my annotations are okay.


gilokip commented Apr 18, 2022

NVM, I fixed it. Apparently, I have to change the 'label' key in my file to "labels". But your solution works

@littlestar502

I ran into this issue: KeyError: 'The file should includes either "labels" or "annotations"'. Any suggestions on this?


gilokip commented Aug 15, 2022

In your JSONL document, check your keys and change them for the whole document. I guess it was an issue with the annotator where files are saved with the wrong key. So, for example, if the key is "label", change it to "labels" throughout the whole JSONL document.
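
If it helps, here is a quick sketch of a script that rewrites every record to use the "labels" key (the file names are placeholders, not from this thread):

import json

# Rewrite a doccano export so every record uses the "labels" key.
with open("export.jsonl", encoding="utf-8") as src, \
        open("export_fixed.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        record = json.loads(line)
        if "label" in record and "labels" not in record:
            record["labels"] = record.pop("label")
        dst.write(json.dumps(record, ensure_ascii=False) + "\n")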


pdbang commented Oct 25, 2022

If someone is looking for a version adapted for CamemBERT (a French model), you can find mine here: https://github.com/pdbang/doccano-camembert-transformer

@AkimfromParis

@rjuez00 Hello Rodrigo,

Thank you for your fork. I am getting many... ERROR NOT ALL TAGS WERE SAVED TO CONLL03...

My tags are correct in Doccano, but when I run your script I am missing a lot of BIO labels.
It seems the tokenizer is not taking punctuation such as ",", ":", and "." into account.

Any idea? Thx!

@ghassenhed

@AkimfromParis hello,
I am facing the same problem.
Did you manage to find a solution?

@AkimfromParis

@ghassenhed
I made it work but still with a few errors in the output file.
Check the PR -> https://github.com/doccano/doccano-transformer/pull/38/files

And here is my version of rjuez00's snippet:

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

with open("train-final-888.txt", "w", encoding="utf-8") as file:
    for entry in read_jsonl(filepath='admin.jsonl', dataset=NERDataset, encoding='utf-8').to_conll2003(tokenizer=str.split):
        file.write(entry["data"] + "\n")
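
If tags next to punctuation are still being dropped, one thing to try is replacing tokenizer=str.split with a tokenizer that separates punctuation. This is only a sketch (the regex and the tokenize function are mine, not part of doccano-transformer), and whether it fully fixes the missing BIO labels depends on how the converter aligns tokens with character offsets:

import re

def tokenize(text):
    # Split into words and standalone punctuation marks (e.g. "," ":" ".").
    return re.findall(r"\w+|[^\w\s]", text)

# then pass tokenizer=tokenize instead of tokenizer=str.split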

Good luck!
