How to feed the document data into DocBert #24

xdwang0726 opened this issue Jul 3, 2019 · 9 comments

@xdwang0726

Hi, I am wondering how you feed documents into BERT. Did you treat a document as one sentence, i.e. [CLS] document1 [SEP]? Or did you split documents into separate sentences? Thank you!

tralfamadude commented Jul 6, 2019

It would be nice to have instructions for adding new data. I'm looking to use DocBERT for classification on documents of 1-2 thousand tokens. Looking at the TSV file hedwig-data/datasets/Reuters/train.tsv, lines seem to end with a ^C character, and column 1 has to be a string representing a bitmask. It is not clear what that mask should be when all the words in column 2 form one sentence/document.

(The IMDB file does not have a special character at the end of each line.)
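For anyone else trying to add new data, here is a tiny sketch of my reading of that layout: a multi-hot label string in column 1 and the raw document text in column 2, tab-separated, one example per line. The class count and ordering are assumptions on my part, so please check the dataset readers in the repository before relying on it.

```python
# Hypothetical sketch of the assumed TSV layout: '<label bitmask>\t<document text>'.
# NUM_CLASSES and the class ordering are assumptions, not taken from the repo.
NUM_CLASSES = 90  # Reuters topic classification (assumed)

def make_tsv_line(active_class_indices, document_text):
    """Build one training line: the label bitmask, a tab, then the document text."""
    bitmask = ["0"] * NUM_CLASSES
    for idx in active_class_indices:
        bitmask[idx] = "1"
    # Collapse internal newlines so the example stays on a single line.
    flat_text = " ".join(document_text.split())
    return "".join(bitmask) + "\t" + flat_text

print(make_tsv_line([3, 17], "Oil prices rose sharply after the announcement."))
```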

@leslyarun

@achyudh Can you please help us on this?

achyudh (Member) commented Aug 3, 2019

@xdwang0726 For BERT, we do treat the entire document as a single sentence. For the hierarchical version of BERT (H-BERT), we split the document into its constituent sentences.
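To make the two input layouts concrete, here is a minimal sketch using the Hugging Face transformers tokenizer and NLTK for sentence splitting. This is only an illustration, not the code from this repository, and the checkpoint name and sequence lengths are arbitrary choices.

```python
from nltk.tokenize import sent_tokenize  # assumes the NLTK 'punkt' data is installed
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
document = "The first sentence of the document. A second sentence follows it."

# BERT/DocBERT-style input: the whole document as one sequence,
# [CLS] document [SEP], truncated to the 512-token limit.
flat = tokenizer(document, truncation=True, max_length=512, return_tensors="pt")

# H-BERT-style input: split the document into sentences first, then encode
# each sentence as its own [CLS] sentence [SEP] sequence.
sentences = sent_tokenize(document)
per_sentence = tokenizer(sentences, padding=True, truncation=True,
                         max_length=128, return_tensors="pt")

print(flat["input_ids"].shape)          # (1, tokens in the document)
print(per_sentence["input_ids"].shape)  # (number of sentences, padded length)
```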

achyudh (Member) commented Aug 3, 2019

@tralfamadude I am not sure how you would be able to use the pre-trained models for more than a thousand tokens. Since the maximum sequence length of the pre-trained models is only 512 tokens, we truncate the input. You can try the hierarchical version of BERT (H-BERT) from this repository and see if it works for your use case.

I don't think special characters like ^C would make a difference as they would be removed during the pre-processing stage.
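As a quick way to see how much of a long document survives the 512-token limit, here is a small check with the transformers tokenizer (an illustration only; the preprocessing in this repository may differ):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
document = "word " * 2000  # stand-in for a document of 1-2 thousand tokens

full_length = len(tokenizer.tokenize(document))
encoded = tokenizer(document, truncation=True, max_length=512)
print(f"{full_length} tokens before truncation, "
      f"{len(encoded['input_ids'])} kept (including [CLS] and [SEP])")
```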

@tralfamadude

@achyudh Thank you for your clarifications and suggestion.

In sec. 5.2 of the DocBERT paper, it says "...we conclude that any amount of truncation is detrimental in document classification..." Perhaps future work using BERT+BiLSTM could integrate the embeddings over the whole document when 512 tokens are not enough to reach the required accuracy.

As it turns out, I'm getting 98% accuracy for a binary classifier using standard BERT with 512 tokens, and that will likely be sufficient for me.
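For what it's worth, here is a rough sketch of that BERT+BiLSTM idea: encode fixed-size chunks of a long document with BERT, then integrate the per-chunk [CLS] embeddings with a BiLSTM before classifying. The chunk size, hidden size, and binary head are arbitrary assumptions, and this is not code from this repository.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bilstm = nn.LSTM(input_size=768, hidden_size=256,
                 bidirectional=True, batch_first=True)
classifier = nn.Linear(2 * 256, 2)  # binary classification head (assumed)

def classify_long_document(text, chunk_size=510):
    # Tokenize once, then split into chunks that fit the 512-token limit
    # (510 content tokens plus [CLS] and [SEP]).
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    cls_vectors = []
    with torch.no_grad():
        for chunk in chunks:
            input_ids = torch.tensor(
                [tokenizer.build_inputs_with_special_tokens(chunk)])
            # Keep the final-layer [CLS] vector of each chunk: shape (1, 768).
            cls_vectors.append(bert(input_ids).last_hidden_state[:, 0, :])
    # Run the BiLSTM over the sequence of chunk embeddings: (1, n_chunks, 768).
    _, (h_n, _) = bilstm(torch.stack(cls_vectors, dim=1))
    doc_vector = torch.cat([h_n[0], h_n[1]], dim=-1)  # both directions, (1, 512)
    return classifier(doc_vector)

logits = classify_long_document("word " * 1500)
print(logits.shape)  # torch.Size([1, 2])
```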

@xdwang0726

@achyudh Thank you for your reply! I have a follow-up question: for the document representation, did you just use the [CLS] token for each document?

achyudh (Member) commented Aug 9, 2019

@xdwang0726 Yes, if I understand your question correctly, that is the case for BERT.
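In transformers terms (not the internals of this repository), taking the [CLS] vector as the single document representation looks roughly like this:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("An example document to classify.",
                   truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The first position of the final hidden layer is the [CLS] token; the
# classification layer sits on top of this single vector.
cls_vector = outputs.last_hidden_state[:, 0, :]
print(cls_vector.shape)  # torch.Size([1, 768])
```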

amirmohammadkz commented Aug 16, 2019

> @xdwang0726 For BERT, we do treat the entire document as a single sentence. For the hierarchical version of BERT (H-BERT), we split the document into its constituent sentences.

Can you please give me some instructions on how to run H-BERT? I don't see any README link for it on the main page. And can you explain what the differences between these two models are?

gimpong commented Dec 21, 2019

Hi, thank you for developing such a useful tool! I have a question: should I preprocess a document (for example, filtering out the stop words) before using DocBERT to get a document embedding?
