Create a multi thread version of BatchTask #71

Open
GoogleCodeExporter opened this issue May 14, 2015 · 16 comments

Comments

@GoogleCodeExporter

The BatchTask runs tasks in a single thread only. For experiments with many
independent tasks, a multithreaded version would improve run time.
We propose to create a MultiThreadBatchTask, which runs a number of tasks in
parallel.

Original issue reported on code.google.com by [email protected] on 10 Apr 2015 at 10:43

@mwunderlich
Contributor

Hi,
I am interested in working on this issue. Who is currently involved in it and what is the state of things right now?

Cheers,

Martin

@daxenberger
Member

AFAIK, Ivan (@habernal) was the last person working on this issue. I think work on this issue is quite advanced, but I'm not sure where it got stuck. Ivan, could you help out here?

@habernal
Contributor

We started implementing MultiThreadBatchTask and some tests in MultiThreadTaskPerformanceTest and MultiThreadBatchTaskTest, but we got stuck at correctly propagating errors (some tasks fail because of missing dependencies and should be re-scheduled, while others fail "normally"). There is a test case for a large graph of dependent tasks, which we planned to use to prove that the multithreaded solution actually speeds things up. We haven't touched that since, so feel free to explore.

@mwunderlich
Contributor

Thanks a lot, Johannes and Ivan. I will take a look. Multithreading would be a great addition to DKPro. Apart from that, has anyone ever thought about supporting cluster/grid solutions, such as Sun Grid Engine? This might be another option for boosting performance by allowing atomic jobs to be run in parallel in a cluster, but I am not sure how it would work for tasks such as Lucene n-gram meta-info, which need to write to the same Lucene index.

@daxenberger
Member

In theory, each task uses its own context to write to and to read from. If multiple resources from different contexts need to access the same files, DBs, etc., that is a separate problem and needs to be dealt with by those resources, I would say. Hence, multi-threaded BatchTasks are a big step towards supporting cluster solutions.
You might also be interested in having a look at DKPro BigData, which already offers support for large-scale processing for DKPro Core.

@mwunderlich
Contributor

Thanks a lot, Johannes, for the hint. I will have a look at the BigData sub-project.
At this moment, I am running a 10x CV classification experiment on some 1500 binary CAS files and it is a bit sad to see only 1 core out of 24 under full load on the two machines that I am using. :) So, hopefully, I can contribute a bit towards making DKPro multi-threaded. However, I first need to study and understand the inner workings of the framework a bit better.

@daxenberger
Member

I can feel your pain ;-) I have been using DKPro TC to classify up to a million documents, split into smaller subsets which were processed by multiple Java threads. However, automatic parallelization is definitely preferable to manual parallelization ...

@mwunderlich
Contributor

A message posted on the UIMA list last week announced the following project, which allows multithreaded execution of tasks created by CAS multipliers in uimaFIT (if I understand it correctly):
https://github.com/brmson/yodaqa/tree/master/src/main/java/cz/brmlab/yodaqa/flow/asb

Might be interesting in the present context.

@mwunderlich
Contributor

@habernal Hi Ivan, I've had some time to go through the code of the MultiThreadBatchTask classes. I don't have much experience with multithreading, so it took me a while, but I think in the end I figured out where the problem was. Just a minor bug fix really, especially considering the complexity of this task: exceptionsFromCurrentLoop needs to be reset whenever the outer loop restarts, because otherwise the outer loop will run at most twice, potentially leaving a number of tasks unexecuted.
I will send a pull request for the fix some time today.
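
Editor's sketch of the reset described above. The real MultiThreadBatchTask code is not shown in this thread, so everything here is an assumption except the variable name exceptionsFromCurrentLoop: the class, the hypothetical UnresolvedDependencyException, the runAll and execute helpers, and the stop condition ("a whole pass made no progress") are invented for illustration. The point is only that the per-pass failure list is created fresh at the top of each outer iteration, so stale failures from an earlier pass cannot trip the stop condition.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class RetryLoopSketch {

    /** Hypothetical marker for "a dependency has not run yet". */
    static class UnresolvedDependencyException extends Exception {
        UnresolvedDependencyException(String message) {
            super(message);
        }
    }

    /**
     * Runs tasks (task name -> names it depends on) until all are done or a
     * full pass makes no progress. Returns the order in which tasks ran.
     */
    static List<String> runAll(Map<String, List<String>> dependencies) {
        Queue<String> pending = new ArrayDeque<>(dependencies.keySet());
        Set<String> done = new HashSet<>();
        List<String> order = new ArrayList<>();

        while (!pending.isEmpty()) {
            // Reset on every pass of the outer loop; a list kept across
            // passes would still hold failures that were resolved since.
            List<Exception> exceptionsFromCurrentLoop = new ArrayList<>();
            int passSize = pending.size();

            for (int i = 0; i < passSize; i++) {
                String task = pending.poll();
                try {
                    execute(task, dependencies.get(task), done);
                    done.add(task);
                    order.add(task);
                } catch (UnresolvedDependencyException e) {
                    exceptionsFromCurrentLoop.add(e);
                    pending.add(task); // re-schedule for the next pass
                }
            }

            // Assumed stop condition: every task in this pass failed.
            if (exceptionsFromCurrentLoop.size() == passSize) {
                throw new IllegalStateException("No progress, still pending: " + pending);
            }
        }
        return order;
    }

    /** Stand-in for real work: fails if any dependency has not run yet. */
    static void execute(String task, List<String> deps, Set<String> done)
            throws UnresolvedDependencyException {
        for (String dep : deps) {
            if (!done.contains(dep)) {
                throw new UnresolvedDependencyException(task + " needs " + dep);
            }
        }
    }
}
```

With a chain such as c depends on b, b depends on a, and the tasks queued in the order c, b, a, this loop needs three passes, which is the kind of case a loop capped at two passes would leave unfinished.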

Two questions, though:

  • Some of the commented-out code seems to be from a legacy solution that didn't use Futures. Do you mind if I remove the commented-out bits that aren't needed anymore?
  • Second, I was wondering what the next step would be once the MultiThreadBatchTask is working OK. Would it be necessary to create multi-threaded versions of all the existing BatchTasks, such as "ExperimentTrainTest"? That seems like too much work. After all, AFAICS the only difference between BatchTask and MultiThreadBatchTask is in the way executeConfiguration(...) is implemented. Also, it would be handy if the number of threads to use could be set as a parameter when instantiating the BatchTask. I will have a think about it. It sounds like the factory pattern might be useful here, or perhaps injecting some kind of BatchTaskExecutor object.
    What do you guys think?
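
Editor's sketch of the "thread count as a constructor parameter" idea from the second question. SimpleBatch is a hypothetical stand-in and not the DKPro Lab BatchTask API; it only shows one way a single class could cover both cases, with a no-argument constructor that falls back to one thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConfigurableBatchSketch {

    /** Hypothetical stand-in for a batch task; not the DKPro Lab class. */
    static class SimpleBatch {
        private final int threads;
        private final List<Runnable> tasks = new ArrayList<>();

        /** Default: behaves like a single-threaded batch. */
        SimpleBatch() {
            this(1);
        }

        SimpleBatch(int threads) {
            if (threads < 1) {
                throw new IllegalArgumentException("threads must be >= 1");
            }
            this.threads = threads;
        }

        void addTask(Runnable task) {
            tasks.add(task);
        }

        int getThreads() {
            return threads;
        }

        /** Submits every task to a fixed pool, then waits for all of them. */
        void execute() {
            ExecutorService executor = Executors.newFixedThreadPool(threads);
            try {
                List<Future<?>> futures = new ArrayList<>();
                for (Runnable task : tasks) {
                    futures.add(executor.submit(task));
                }
                for (Future<?> future : futures) {
                    future.get();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            } finally {
                executor.shutdown();
            }
        }
    }
}
```

Passing in an ExecutorService instead of an int would be the injection variant; that choice is left open here, as it is in the thread.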

@mwunderlich
Contributor

PS: I ran a little test on a 24-core machine with some strange results. These are the total execution times for the performance tests with different numbers of threads n set for the executor:
(executor = Executors.newFixedThreadPool(n))

1 thread: Duration [ms]: 66,085
2 threads: Duration [ms]: 66,218
10 threads: Duration [ms]: 67,677
24 threads: Duration [ms]: 67,190

The values are suspiciously close together and don't show much of a performance boost. I have checked that the class was recompiled with the new values for n, so that can't be the problem.

@mwunderlich
Contributor

Hmm, I think the issue might be with the following line:

future.get();

This waits for the result of the Future, which essentially turns it into a synchronous call; as a consequence, all tasks are executed sequentially rather than in parallel. But I could be wrong...
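
Editor's sketch of the suspected effect, assuming future.get() is called right after each submit(...) inside the same loop (the actual MultiThreadBatchTask code is not shown in this thread). The class and method names are hypothetical, and Thread.sleep stands in for real work. The first method blocks on each Future before submitting the next task; the second submits everything first and only then waits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureGetSketch {

    /** Simulated work: sleep for the given number of milliseconds. */
    static void work(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Calls get() after every submit, so only one task runs at a time. */
    static long submitAndWaitEach(int tasks, int threads, long workMillis) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            for (int i = 0; i < tasks; i++) {
                Future<?> future = executor.submit(() -> work(workMillis));
                future.get(); // blocks until this task is finished
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Submits all tasks first, then waits; tasks can overlap in time. */
    static long submitAllThenWait(int tasks, int threads, long workMillis) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(executor.submit(() -> work(workMillis)));
            }
            for (Future<?> future : futures) {
                future.get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

With the first variant the pool size makes no difference to the total time, which would be consistent with the nearly identical durations reported above for 1, 2, 10 and 24 threads.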

@daxenberger
Member

I haven't been able to investigate the performance issue yet.

Regarding next steps in DKPro TC (but this should better be discussed on the respective mailing list): once we have a way to (maybe dynamically) set the number of threads in MultiThreadBatchTask, all ExperimentBatchTask in DKPro TC can probably inherit from this class (with a default thread number of 1).

@mwunderlich
Contributor

I have made some more changes to the implementation of MultiThreadBatchTask and I think I have it nailed now. Here are the results from a first test:

1 thread: Total runtime [ms]: 54,003
10 threads: Total runtime [ms]: 50,720
24 threads: Total runtime [ms]: 52,537

Not so promising. Then I figured that the DummyTask might execute too quickly relative to the overall runtime and the overhead of managing the threads/futures. So, I added a 10-second pause to the task to simulate some actual work, and now the results show a clear improvement when using several threads:

1 thread: Total runtime [ms]: 4,205,190
10 threads: Total runtime [ms]: 426,010
24 threads: Total runtime [ms]: 189,808

I will have some time this evening to clean up the modified code and do some final checks before submitting this.
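
Editor's note on the second set of figures: expressed as speedup (single-thread time divided by multi-thread time), the reported runtimes come to roughly 9.9x for 10 threads and 22.2x for 24 threads. A minimal sketch of that arithmetic, using only the numbers quoted above; the class and method names are hypothetical.

```java
public class SpeedupSketch {

    /** Speedup = baseline runtime / parallel runtime. */
    static double speedup(long baselineMillis, long parallelMillis) {
        return (double) baselineMillis / parallelMillis;
    }

    /** Efficiency = speedup / number of threads (1.0 would be ideal). */
    static double efficiency(long baselineMillis, long parallelMillis, int threads) {
        return speedup(baselineMillis, parallelMillis) / threads;
    }

    public static void main(String[] args) {
        long oneThread = 4_205_190L;        // reported runtime, 1 thread
        long tenThreads = 426_010L;         // reported runtime, 10 threads
        long twentyFourThreads = 189_808L;  // reported runtime, 24 threads

        System.out.printf("10 threads: speedup %.2f, efficiency %.2f%n",
                speedup(oneThread, tenThreads),
                efficiency(oneThread, tenThreads, 10));
        System.out.printf("24 threads: speedup %.2f, efficiency %.2f%n",
                speedup(oneThread, twentyFourThreads),
                efficiency(oneThread, twentyFourThreads, 24));
    }
}
```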

@daxenberger
Member

That sounds indeed promising. Thanks for the investigations, I'm excited to test this in practice!

@mwunderlich
Contributor

I've submitted the pull request with my changes. Please review this closely before merging. The basic structure is the same as before, but I have made some substantial changes to the code.

@daxenberger
Member

That is really great, thanks a lot for the hard work. As you say, this request will need some deeper reviewing - we'll try to do that and merge asap.


5 participants