Convert token to text #5

Open
vishalkatiyar007 opened this issue Feb 25, 2019 · 2 comments

Comments

@vishalkatiyar007

Is there a way to convert the output (currently in the form of tokens) of the model to text for easy interpretation and testing?

vishalkatiyar007 changed the title from "Model Testing" to "Convert token to text" on Feb 26, 2019
@vishalkatiyar007 (Author)

For example, the annotator marks the long answer using byte offsets, token offsets, and an index into the list of long answer candidates:
"long_answer": { "start_byte": 32, "end_byte": 106, "start_token": 5, "end_token": 22, "candidate_index": 0 }.
How do I map these byte and token offsets back to the text containing the answer?
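For the byte offsets, a minimal sketch, assuming the full NQ format where start_byte/end_byte index into the UTF-8 encoding of the example's document_html field (the example dict below is toy data for illustration, not real NQ data):

```python
def get_span_from_byte_offsets(example, start_byte, end_byte):
    """Slice the answer span out of the raw document HTML.

    Assumes byte offsets index into the UTF-8 encoding of the
    example's "document_html" field, so we encode before slicing.
    """
    html_bytes = example["document_html"].encode("utf-8")
    return html_bytes[start_byte:end_byte].decode("utf-8")


# Toy example to illustrate the slicing:
example = {"document_html": "<p>The answer is March 18 , 2018 .</p>"}
print(get_span_from_byte_offsets(example, 17, 32))  # March 18 , 2018
```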


filbertphang commented Mar 4, 2019

You might want to try something like this:

import jsonlines

INPUT_FILE = "nq-train-sample.jsonl"
START_TOKEN = 3521
END_TOKEN = 3525
QAS_ID = 4549465242785278785
REMOVE_HTML = True


def get_span_from_token_offsets(f, start_token, end_token, qas_id,
                                remove_html):
    """Return the answer span text for the example with id `qas_id`."""
    for obj in f:
        if obj["example_id"] != qas_id:
            continue

        # Take the tokens in [start_token, end_token), optionally
        # dropping HTML markup tokens such as <P> or <Table>.
        answer_span = [
            item["token"]
            for item in obj["document_tokens"][start_token:end_token]
            if not (remove_html and item["html_token"])
        ]

        return " ".join(answer_span)


with jsonlines.open(INPUT_FILE) as f:
    result = get_span_from_token_offsets(f, START_TOKEN, END_TOKEN, QAS_ID,
                                         REMOVE_HTML)

print(result)
Output: March 18 , 2018

You can read your prediction file to get the various start_token, end_token, and example_id values, then call the function for each one to build a list of prediction spans (and write them to a file, or whatever you need).
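A sketch of that reading step, assuming a JSON-lines prediction file whose lines carry example_id and long_answer token offsets; the field names here are illustrative, so adjust them to your prediction file's actual schema:

```python
import json


def read_prediction_offsets(path):
    """Yield (example_id, start_token, end_token) for each prediction line.

    Field names are hypothetical; change them to match your file.
    """
    with open(path) as f:
        for line in f:
            pred = json.loads(line)
            yield (pred["example_id"],
                   pred["long_answer"]["start_token"],
                   pred["long_answer"]["end_token"])
```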

hope this helps!
