Terrible Documentation #28

Open
rjuez00 opened this issue Apr 4, 2022 · 10 comments

Comments


rjuez00 commented Apr 4, 2022

Feature description

Improve the documentation. How is it possible that I cannot find documentation explaining the different classes? The tool can be as good as you like, but I have to read the code directly to understand what features it has or how to save the datasets once they are transformed...


gilokip commented Apr 17, 2022

Have you found a way to save the datasets? I'm also having a lot of difficulty saving.


rjuez00 commented Apr 17, 2022

Hi, yes!
When you load a JSONL dataset with read_jsonl, you can iterate over it; each entry is a document with its annotations.
For each document you then call the transformation function you want to use.
Each item it returns has an "id" and, most importantly, a "data" field with the formatted document, so you just need to write that to a file.

Here is an example:
(Be careful with the encoding, yours might be different, and double-check the "conll_03" method name, I'm not sure I typed it correctly.)

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

# Convert every annotated document to CoNLL 2003 format and write it to one file.
# (Upstream doccano-transformer names this method to_conll2003; see the later comment below.)
with open("datasets/test.dataset", "w", encoding="utf-8") as file:
    for entry in read_jsonl(filepath='datasets/doccanoSplit/test_Anotados.jsonl', dataset=NERDataset, encoding='latin-1').conll_03(tokenizer=str.split):
        file.write(entry["data"] + "\n")
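
For reference, here is a minimal sketch of the kind of JSONL record read_jsonl expects: one JSON object per line with a "text" field and a "labels" list of character spans. The values below are made up, and the exact field names can vary between doccano versions:

import json

# One annotated document; each span is [start_offset, end_offset, label].
record = {
    "id": 1,
    "text": "Barack Obama visited Paris.",
    "labels": [[0, 12, "PER"], [21, 26, "LOC"]],
}
print(json.dumps(record, ensure_ascii=False))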

I don't want to brag, but I really recommend using my fork of doccano_transformer; I have fixed several important bugs (some annotations weren't being transformed correctly and weren't being saved).

I also added some other transformers.
The only caveat is that the spaCy converter doesn't work anymore and is pending a fix, which I don't have time for right now, so keep that in mind.

To install and use it:
pip install git+https://www.github.com/rjuez00/doccano-transformer


gilokip commented Apr 18, 2022

I followed your instructions but I'm getting this error: https://pastebin.com/FaVxBgY6
What could be the problem? All my annotations are okay.


gilokip commented Apr 18, 2022

NVM, I fixed it. Apparently, I have to change the 'label' key in my file to "labels". But your solution works

@littlestar502

I ran into this issue: KeyError: 'The file should includes either "labels" or "annotations"'. Any suggestions on this?


gilokip commented Aug 15, 2022

In your JSONL document, check your keys and change them for the whole document. I guess it was an issue with the annotator where files are saved with the wrong key. So, for example, if the key is "label", change it to "labels" throughout the whole JSONL document.
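
If it helps, here is a quick sketch of a script that rewrites every record to use the "labels" key (the file names are placeholders, not from this thread):

import json

# Rewrite a doccano export so every record uses the "labels" key.
with open("export.jsonl", encoding="utf-8") as src, \
        open("export_fixed.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        record = json.loads(line)
        if "label" in record and "labels" not in record:
            record["labels"] = record.pop("label")
        dst.write(json.dumps(record, ensure_ascii=False) + "\n")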


pdbang commented Oct 25, 2022

If someone is looking for a version adapted for CamemBERT (a French model), you can find mine here: https://github.com/pdbang/doccano-camembert-transformer

@AkimfromParis

@rjuez00 Hello Rodrigo,

Thank you for your fork. I am getting many... ERROR NOT ALL TAGS WERE SAVED TO CONLL03...

My tags are correct in Doccano, but when I run your script I am missing a lot of BIO labels.
It seems the tokenizer is not taking punctuation such as ",", ":", and "." into account.

Any idea? Thx!

@ghassenhed

@AkimfromParis hello,
I am facing the same problem.
Did you manage to find a solution?

@AkimfromParis

@ghassenhed
I made it work but still with a few errors in the output file.
Check the PR -> https://github.com/doccano/doccano-transformer/pull/38/files

And here is my version of rjuez00's snippet:

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

with open("train-final-888.txt", "w", encoding="utf-8") as file:
    for entry in read_jsonl(filepath='admin.jsonl', dataset=NERDataset, encoding='utf-8').to_conll2003(tokenizer=str.split):
        file.write(entry["data"] + "\n")
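
If tags next to punctuation are still being dropped, one thing to try is replacing tokenizer=str.split with a tokenizer that separates punctuation. This is only a sketch (the regex and the tokenize function are mine, not part of doccano-transformer), and whether it fully fixes the missing BIO labels depends on how the converter aligns tokens with character offsets:

import re

def tokenize(text):
    # Split into words and standalone punctuation marks (e.g. "," ":" ".").
    return re.findall(r"\w+|[^\w\s]", text)

# then pass tokenizer=tokenize instead of tokenizer=str.split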

Good luck!
