
OutOfMemoryError while computing LDA model for large .mallet file #165

Open
pstroe opened this issue Jul 3, 2019 · 3 comments
pstroe commented Jul 3, 2019

hello there,

while training a model for a rather large data set, we get the following error:

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.TreeMap.put(TreeMap.java:577)
        at java.util.TreeSet.add(TreeSet.java:255)
        at cc.mallet.topics.ParallelTopicModel.getTopicDocuments(ParallelTopicModel.java:1743)
        at cc.mallet.topics.ParallelTopicModel.printTopicDocuments(ParallelTopicModel.java:1762)

we created a .mallet file for our data with the bulk-load function. in total, we have about 1.3 billion words in roughly 17 million articles. we compute on 59 cores and reserve 180 GB for mallet. the 1000 iterations to estimate 100 topics run through without any problem; it seems that writing the doc-topics file is what aborts the process. any thoughts on why this might be the case, or is there another issue?
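for reference, the failing step corresponds roughly to the following use of the Java API (just a sketch; the class name, file names and the alpha/beta values are placeholders, we actually drive everything through the command-line tools):

```java
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

import java.io.File;
import java.io.PrintWriter;

public class TrainLargeLda {
    public static void main(String[] args) throws Exception {
        // placeholder path for the instance list written by bulk-load
        InstanceList instances = InstanceList.load(new File("corpus.mallet"));

        // 100 topics; alphaSum = 5.0 and beta = 0.01 are illustrative values only
        ParallelTopicModel model = new ParallelTopicModel(100, 5.0, 0.01);
        model.addInstances(instances);
        model.setNumThreads(59);
        model.setNumIterations(1000);
        model.estimate();   // the 1000 iterations complete without problems

        // this corresponds to the call in the stack trace: getTopicDocuments()
        // builds a sorted TreeSet over all ~17M documents for every topic in
        // memory before a single line is written to disk
        try (PrintWriter out = new PrintWriter(new File("topic-docs.txt"))) {
            model.printTopicDocuments(out, 100);
        }
    }
}
```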

looking forward to reading your answer,

phillip

JeloH commented Jul 10, 2019

Hi Phillip, I ran into something similar before. I think to get around this issue, you can split the original dataset into small text files (about 1 MB each). I hope it goes well.

@jfelectron

@pstroe yes, this is a problem that prevents an otherwise great implementation from being useful for practical data sets in the wild. I didn't want to load everything into memory when creating the .mallet serialized data, so I hacked it to iterate: #170.

It's not clear to me that splitting into multiple files would help, and it creates another problem: a directory of input files expects one instance per file. In my case, that would mean hundreds of millions of files, which isn't practical.
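For illustration, the sketch below shows the general shape of what I mean by iterating (it is not the actual change in #170; the input file, regex and pipe sequence are placeholders). A CsvIterator hands one document at a time to the pipe, so the raw text never has to be buffered in memory all at once, although the resulting feature sequences still accumulate before being serialized.

```java
import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.TokenSequenceLowercase;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.types.InstanceList;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.regex.Pattern;

public class StreamingImport {
    public static void main(String[] args) throws Exception {
        // a minimal pipe: tokenize, lowercase, map tokens to feature indices
        ArrayList<Pipe> pipes = new ArrayList<>();
        pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")));
        pipes.add(new TokenSequenceLowercase());
        pipes.add(new TokenSequence2FeatureSequence());

        InstanceList instances = new InstanceList(new SerialPipes(pipes));

        // "articles.txt" is a placeholder: one document per line, "<name> <label> <text>"
        try (Reader reader = new BufferedReader(new FileReader(new File("articles.txt")))) {
            // the iterator streams line by line through the pipe instead of
            // reading the whole corpus into memory before processing
            instances.addThruPipe(new CsvIterator(reader,
                    Pattern.compile("^(\\S+)\\s+(\\S+)\\s+(.*)$"), 3, 2, 1));
        }

        instances.save(new File("corpus.mallet"));
    }
}
```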

pstroe commented Jul 28, 2019

@JeloH thanks for the suggestions, but as @jfelectron explains, this would not help in our case.

also: thanks @jfelectron for your response. i would say our data is very much practical, it just comes in new dimensions. our workaround was to train the model (which is not a problem, so it can handle that amount of data) and output an inferencer; we then ran inference on the training data. so instead of writing out all the data at once, wouldn't it be possible to continuously append to the output file?
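to make the appending idea concrete, this is roughly what we have in mind: load the inferencer, walk over the training instances, and write each document's topic distribution as soon as it is computed instead of collecting everything first. just a sketch, with placeholder file names and illustrative sampling parameters:

```java
import cc.mallet.topics.TopicInferencer;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

import java.io.File;
import java.io.FileInputStream;
import java.io.ObjectInputStream;
import java.io.PrintWriter;

public class StreamDocTopics {
    public static void main(String[] args) throws Exception {
        // placeholder paths: the serialized inferencer written after training,
        // and the bulk-loaded training instances
        TopicInferencer inferencer;
        try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream("inferencer.mallet"))) {
            inferencer = (TopicInferencer) ois.readObject();
        }
        InstanceList instances = InstanceList.load(new File("corpus.mallet"));

        try (PrintWriter out = new PrintWriter(new File("doc-topics.txt"))) {
            int doc = 0;
            for (Instance instance : instances) {
                // 100 sampling iterations, thinning 10, burn-in 10 -- illustrative values
                double[] dist = inferencer.getSampledDistribution(instance, 100, 10, 10);
                StringBuilder line = new StringBuilder();
                line.append(doc).append('\t').append(instance.getName());
                for (double p : dist) {
                    line.append('\t').append(p);
                }
                // one line is written per document as soon as it is ready;
                // nothing is accumulated across documents
                out.println(line);
                doc++;
            }
        }
    }
}
```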
