Parallel problem in MALLET LDA (gensim wrapper) #176

Open
thisray opened this issue Dec 18, 2019 · 3 comments

Comments

@thisray

thisray commented Dec 18, 2019

Hi,

I use the gensim wrapper, LdaMallet() [link], to run MALLET.

The gensim library provides a `workers` parameter that sets the `--num-threads` argument in MALLET.
(Ref: Gensim code, line 274)

But I found that `workers` does not seem to work. Here are the settings I tried and their run times:

 `workers=1` -> run time: 7.32 sec   # <--
 `workers=2` -> run time: 2min 25s
 `workers=4` -> run time: 2min 38s
 `workers=16` -> run time: 3min 13s  # <--

Whether I run this on my computer:

openjdk version "1.8.0_162"
OpenJDK Runtime Environment (build 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.162-b12, mixed mode)

or on Colab:

openjdk version "11.0.4" 2019-07-16
OpenJDK Runtime Environment (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3)
OpenJDK 64-Bit Server VM (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3, mixed mode, sharing)

the results are similar: more workers take more time.
(I have also tried both mallet-2.0.8 and mallet-2.0.7.)

Does this mean I am not running MALLET LDA in parallel properly?

Thanks!


Reference code:

# Code in gensim (Python)
# (I tried different values of `workers`)
import gensim

workers = 16
gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word,
                                 optimize_interval=1, iterations=6000, workers=workers)

# The equivalent MALLET command (entered in a shell; I/O options omitted):

$ bin/mallet train-topics --num-threads 16

@patelamalk

I have the same problem: for 12,077 files (~5 GB) it takes 4 hours. It doesn't seem to be utilizing all the cores.

@mimno
Owner

mimno commented Jun 10, 2020

Unless this can be replicated in the java-only version there's not much to do here -- I'd check with gensim.
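
One way to try replicating this without gensim is to time the MALLET command line directly at several thread counts. The following is only a rough sketch, not a tested benchmark: the helper names (`build_train_topics_cmd`, `time_command`, `benchmark_threads`) are made up for illustration, and it assumes the corpus has already been imported into a MALLET input file with `bin/mallet import-file` or `import-dir`.

```python
import subprocess
import time


def build_train_topics_cmd(mallet_bin, input_path, num_topics, num_threads,
                           iterations=1000, optimize_interval=0):
    """Build the argument list for one `mallet train-topics` run."""
    return [
        mallet_bin, "train-topics",
        "--input", input_path,  # file produced by MALLET's import step
        "--num-topics", str(num_topics),
        "--num-threads", str(num_threads),
        "--num-iterations", str(iterations),
        "--optimize-interval", str(optimize_interval),
    ]


def time_command(cmd):
    """Run a command to completion and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def benchmark_threads(mallet_bin, input_path, num_topics, thread_counts, **kwargs):
    """Time one training run per thread count; returns {threads: seconds}."""
    return {
        n: time_command(
            build_train_topics_cmd(mallet_bin, input_path, num_topics, n, **kwargs)
        )
        for n in thread_counts
    }
```

For example, `benchmark_threads("bin/mallet", "corpus.mallet", 20, [1, 2, 4, 16], iterations=6000, optimize_interval=1)` would mirror the settings in the report above, where the paths and the topic count of 20 are placeholders. If the timings still grow with the thread count here, the behavior is reproducible in MALLET itself rather than in the wrapper.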

@d0nghyunkang

@thisray This thread has been dormant for a while, but have you checked how many cores/threads your computer has? If it has fewer than 16, asking for 16 threads could slow you down.
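
A quick way to check is to compare the requested `workers` value against what Python reports for the machine. This is a minimal sketch; `effective_workers` is a hypothetical helper, not part of gensim or MALLET. Note that `os.cpu_count()` reports logical CPUs, which can be more than the process is actually allowed to use (for instance in a container).

```python
import os


def effective_workers(requested, available=None):
    """Clamp a requested thread count to the range [1, available CPUs]."""
    if available is None:
        # os.cpu_count() may return None if the count cannot be determined.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))
```

Passing `workers=effective_workers(16)` to `LdaMallet` would then never request more threads than the machine reports. It would not by itself explain why 2 workers were slower than 1 in the timings above.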
