How to feed the document data into DocBert #24

xdwang0726 opened this issue Jul 3, 2019 · 9 comments

@xdwang0726

Hi, I am wondering how you feed documents into BERT. Did you treat a document as one sentence, i.e. [CLS] document1 [SEP]? Or did you split documents into separate sentences? Thank you!

tralfamadude commented Jul 6, 2019

It would be nice to have instructions for adding new data. I'm looking to use DocBERT for classification on documents of 1-2 thousand tokens. Looking at the TSV file hedwig-data/datasets/Reuters/train.tsv, lines seem to end with a ^C character, and column 1 has to be a string representing a bitmask. It is not clear what that mask should be when all the words in column 2 form one sentence/document.

(The IMDB file does not have a special character at the end of each line.)
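For anyone else trying to add new data, here is a tiny sketch of my reading of that layout: a multi-hot label string in column 1 and the raw document text in column 2, tab-separated, one example per line. The class count and ordering are assumptions on my part, so please check the dataset readers in the repository before relying on it.

```python
# Hypothetical sketch of the assumed TSV layout: '<label bitmask>\t<document text>'.
# NUM_CLASSES and the class ordering are assumptions, not taken from the repo.
NUM_CLASSES = 90  # Reuters topic classification (assumed)

def make_tsv_line(active_class_indices, document_text):
    """Build one training line: the label bitmask, a tab, then the document text."""
    bitmask = ["0"] * NUM_CLASSES
    for idx in active_class_indices:
        bitmask[idx] = "1"
    # Collapse internal newlines so the example stays on a single line.
    flat_text = " ".join(document_text.split())
    return "".join(bitmask) + "\t" + flat_text

print(make_tsv_line([3, 17], "Oil prices rose sharply after the announcement."))
```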

@leslyarun

@achyudh Can you please help us on this?

achyudh (Member) commented Aug 3, 2019

@xdwang0726 For BERT, we do treat the entire document as a single sentence. For the hierarchical version of BERT (H-BERT), we split the document into its constituent sentences.
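To make the two input layouts concrete, here is a minimal sketch using the Hugging Face transformers tokenizer and NLTK for sentence splitting. This is only an illustration, not the code from this repository, and the checkpoint name and sequence lengths are arbitrary choices.

```python
from nltk.tokenize import sent_tokenize  # assumes the NLTK 'punkt' data is installed
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
document = "The first sentence of the document. A second sentence follows it."

# BERT/DocBERT-style input: the whole document as one sequence,
# [CLS] document [SEP], truncated to the 512-token limit.
flat = tokenizer(document, truncation=True, max_length=512, return_tensors="pt")

# H-BERT-style input: split the document into sentences first, then encode
# each sentence as its own [CLS] sentence [SEP] sequence.
sentences = sent_tokenize(document)
per_sentence = tokenizer(sentences, padding=True, truncation=True,
                         max_length=128, return_tensors="pt")

print(flat["input_ids"].shape)          # (1, tokens in the document)
print(per_sentence["input_ids"].shape)  # (number of sentences, padded length)
```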

achyudh (Member) commented Aug 3, 2019

@tralfamadude I am not sure how you would be able to use the pre-trained models for more than a thousand tokens. Since the maximum sequence length of the pre-trained models is only 512 tokens, we truncate the input. You can try the hierarchical version of BERT (H-BERT) from this repository and see if it works for your use case.

I don't think special characters like ^C would make a difference as they would be removed during the pre-processing stage.
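As a quick way to see how much of a long document survives the 512-token limit, here is a small check with the transformers tokenizer (an illustration only; the preprocessing in this repository may differ):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
document = "word " * 2000  # stand-in for a document of 1-2 thousand tokens

full_length = len(tokenizer.tokenize(document))
encoded = tokenizer(document, truncation=True, max_length=512)
print(f"{full_length} tokens before truncation, "
      f"{len(encoded['input_ids'])} kept (including [CLS] and [SEP])")
```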

@tralfamadude

@achyudh Thank you for your clarifications and suggestion.

In sec. 5.2 of the DocBERT paper, it says "...we conclude that any amount of truncation is detrimental in document classification..." Perhaps future work using BERT+BiLSTM could integrate the embeddings over the whole document when 512 tokens are not enough to reach the required accuracy.

As it turns out, I'm getting 98% accuracy for a binary classifier using standard BERT with 512 tokens, and that will likely be sufficient for me.
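For what it's worth, here is a rough sketch of that BERT+BiLSTM idea: encode fixed-size chunks of a long document with BERT, then integrate the per-chunk [CLS] embeddings with a BiLSTM before classifying. The chunk size, hidden size, and binary head are arbitrary assumptions, and this is not code from this repository.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bilstm = nn.LSTM(input_size=768, hidden_size=256,
                 bidirectional=True, batch_first=True)
classifier = nn.Linear(2 * 256, 2)  # binary classification head (assumed)

def classify_long_document(text, chunk_size=510):
    # Tokenize once, then split into chunks that fit the 512-token limit
    # (510 content tokens plus [CLS] and [SEP]).
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    cls_vectors = []
    with torch.no_grad():
        for chunk in chunks:
            input_ids = torch.tensor(
                [tokenizer.build_inputs_with_special_tokens(chunk)])
            # Keep the final-layer [CLS] vector of each chunk: shape (1, 768).
            cls_vectors.append(bert(input_ids).last_hidden_state[:, 0, :])
    # Run the BiLSTM over the sequence of chunk embeddings: (1, n_chunks, 768).
    _, (h_n, _) = bilstm(torch.stack(cls_vectors, dim=1))
    doc_vector = torch.cat([h_n[0], h_n[1]], dim=-1)  # both directions, (1, 512)
    return classifier(doc_vector)

logits = classify_long_document("word " * 1500)
print(logits.shape)  # torch.Size([1, 2])
```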

@xdwang0726

@achyudh Thank you for your reply! I have a follow-up question: for the document representation, did you just use the [CLS] token for each document?

achyudh (Member) commented Aug 9, 2019

@xdwang0726 Yes, if I understand your question correctly, that is the case for BERT.
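In transformers terms (not the internals of this repository), taking the [CLS] vector as the single document representation looks roughly like this:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("An example document to classify.",
                   truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The first position of the final hidden layer is the [CLS] token; the
# classification layer sits on top of this single vector.
cls_vector = outputs.last_hidden_state[:, 0, :]
print(cls_vector.shape)  # torch.Size([1, 768])
```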

amirmohammadkz commented Aug 16, 2019

> @xdwang0726 For BERT, we do treat the entire document as a single sentence. For the hierarchical version of BERT (H-BERT), we split the document into its constituent sentences.

Can you please give me some instructions on how to run H-BERT? I don't see any README link for it on the main page. And can you explain what the differences between these two models are?

gimpong commented Dec 21, 2019

Hi, thank you for developing such a useful tool! I have a question: should I preprocess a document (for example, filtering out the stop words) before using DocBERT to get a document embedding?
