How to feed the document data into DocBert #24
Comments
It would be nice to have instructions for adding new data. I'm looking to use DocBERT for classification with 1-2 thousand tokens. Looking at the TSV file hedwig-data/datasets/Reuters/train.csv, it seems to end lines with ^C and to require a string representing a bitmask as column 1. It is not clear what that mask should be when all the words in column 2 form one sentence/document. (IMDB does not have a special character at the end of the line.)
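For reference, a minimal sketch of writing a two-column, tab-separated file in the layout described above (assuming column 1 is a multi-hot label string and column 2 is the full document text; the file name, label width, and example rows here are hypothetical, and the exact format may differ per dataset in hedwig-data):

```python
# Sketch: write a Hedwig-style TSV with a multi-hot label string and the document text.
import csv

# Hypothetical examples with 4 possible classes; each label string marks the classes
# the document belongs to (e.g. "0100" = class 2 only).
docs = [
    ("0100", "First document text, treated as a single sentence."),
    ("0011", "Second document text with two labels set in the bitmask."),
]

with open("train.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    for labels, text in docs:
        writer.writerow([labels, text])
```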
@achyudh Can you please help us with this?
@xdwang0726 For BERT, we do treat the entire document as a single sentence. For the hierarchical version of BERT (H-BERT), we split the document into its constituent sentences.
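To illustrate the two input layouts, here is a small sketch (not Hedwig's own preprocessing code; it assumes NLTK for the sentence splitting):

```python
# Sketch of the two input layouts: one sequence per document (BERT)
# vs. one sequence per sentence (H-BERT).
from nltk.tokenize import sent_tokenize  # requires: nltk.download("punkt")

document = "First sentence of the document. Second sentence. Third sentence."

# Plain BERT: the whole document is fed in as one sequence.
bert_input = document

# H-BERT: the document becomes a list of sentences, each encoded separately.
hbert_input = sent_tokenize(document)
# ['First sentence of the document.', 'Second sentence.', 'Third sentence.']
```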
@tralfamadude I am not sure how you would be able to use the pre-trained models for more than a thousand tokens. Since the maximum sequence length of the pre-trained models is only 512 tokens, we truncate the input. You can try the hierarchical version of BERT (H-BERT) from this repository and see if it works for your use case. I don't think special characters like ^C would make a difference as they would be removed during the pre-processing stage.
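A sketch of what this truncation looks like in practice, using the Hugging Face `transformers` tokenizer as a stand-in for Hedwig's internal preprocessing:

```python
# Sketch: encode a whole document as one BERT input, truncated to 512 tokens.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

document = "The entire document goes here as one long string ..."
encoded = tokenizer(
    document,
    max_length=512,      # maximum sequence length of the pre-trained BERT models
    truncation=True,     # anything beyond 512 tokens is dropped
    padding="max_length",
    return_tensors="pt",
)
# encoded["input_ids"] is laid out as: [CLS] document tokens ... [SEP] [PAD] ...
```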
@achyudh Thank you for your clarifications and suggestion. In sec. 5.2 of the DocBERT paper, it says "...we conclude that any amount of truncation is detrimental in document classification..." Perhaps future work using BERT+BiLSTM could integrate the embeddings over the whole document when 512 tokens are not enough to reach the required accuracy. As it turns out, I'm getting 98% accuracy for a binary classifier using standard BERT with 512 tokens, and that will likely be sufficient for me.
@achyudh Thank you for your reply! I have a follow-up question: did you just use the [CLS] token as the representation for each document?
@xdwang0726 Yes, if I understand your question correctly, that is the case for BERT.
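A sketch of taking the [CLS] vector as the document representation, again with the `transformers` library rather than Hedwig's internal code (model name and example text are placeholders):

```python
# Sketch: extract the [CLS] embedding as the document representation.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

encoded = tokenizer("An example document.", return_tensors="pt",
                    truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**encoded)

# The first token of the last hidden state is [CLS];
# a classification head is applied on top of this vector.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (batch, hidden_size)
```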
Can you please give me some instructions on how to run H-BERT? I don't see a README link for it on the main page. And can you explain what the differences between the two models are?
Hi, thank you for developing such a useful tool! I have a question: should I preprocess a document (for example, filtering the stop words) before using DocBERT to get a document embedding?
Hi, I am wondering how you feed documents into BERT. Did you treat a document as one sentence, i.e. [CLS] document1 [SEP]? Or did you split documents into separate sentences? Thank you!