Subset verb learner tests are slow #852
Comments
@gabbard Per your comment I tried running the tests in PyPy on
It took some digging to find this, but PyPy does guarantee deterministic iteration order for dictionaries:
@spigo900 : Can you make a PR to
@gabbard I ran the subset verb learner tests in both CPython and PyPy on adam-dev. It looks like PyPy finishes the tests in about half the time, sometimes a bit less. Here are the test times:
@spigo900 : Great! Can you write up directions for using PyPy and add them to the repository README? And can you see if we can get CI to use PyPy? We should probably re-profile with PyPy to see if the time spent in different spots is proportionally the same.
@gabbard It looks like it should be possible to use PyPy with Travis. As far as profiling goes, it looks like PyPy can run cProfile, though "turning on cProfile can distort the result," so they recommend using
Got it - try
@gabbard I profiled it using vmprof; the visualization options don't seem to be working, so here's the text-based tree view:
Huh, so it seems like at most 25% of the remaining time is being spent in the actual guts of the matching operation. A substantial amount is in
It also looks like there is a lot of time spent in
One thing which could help our performance globally: every time we construct a
Now, in a proper language with private constructors, we'd just enforce creation through two factory methods like
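For illustration, here is a minimal sketch (the class and method names are hypothetical, not the actual type involved) of how that pattern can be approximated in Python, which has no private constructors: construction is routed through two classmethods, one that does the expensive validation/copying and one that trusts input we already know is well-formed.

```python
class ValidatedGraph:
    def __init__(self, nodes, _checked=False):
        # By convention, callers use the factory methods below instead of __init__.
        if not _checked:
            raise TypeError(
                "Use ValidatedGraph.create_validated() or ValidatedGraph.create_trusted()"
            )
        self._nodes = nodes

    @classmethod
    def create_validated(cls, nodes):
        # The expensive checks (and any defensive copying) happen only here.
        nodes = tuple(nodes)
        if any(node is None for node in nodes):
            raise ValueError("nodes must not contain None")
        return cls(nodes, _checked=True)

    @classmethod
    def create_trusted(cls, nodes):
        # Skips validation for inputs produced by code we already trust.
        return cls(tuple(nodes), _checked=True)
```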
@gabbard I was able to use
Here's the full tree from that:
Unfortunately it doesn't look like
I'll look into
One thing which leaps out in
Can you up the prune depth even further? I suspect the bulk of the time from
@gabbard It's spending the majority of the time in
Here's the breakdown for what the perception generator is itself doing (with extra bits deleted):
It looks like object perception is the single biggest time sink within the perception generator, although the other steps add up to a significant amount of the time.
@gabbard Here are the results for Within In Inside
It looks like to get line profiling results, you have to explicitly enable that when you do your profiling, so I'll try that and see if it gets us any more interesting results.
Addendum: No line profiling for PyPy:
I was able to get flamegraphs of the overall profiling results. For the first, I used
@spigo900 : The lack of line-profiling for
@spigo900 : I see constructing the object recognizer for the test is about 7% of the time. A quick thing to do would be to check whether anything in our test suite is creating an object recognizer for each test instead of just creating and initializing a single shared one across all tests.
@gabbard Yes, it's creating a new object recognizer every time it creates a new learner, i.e. once for every template in every test.
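A hedged sketch of the "single shared recognizer" idea using a session-scoped pytest fixture (the factory function and test below are stand-ins, not the repo's actual API):

```python
import pytest

def build_object_recognizer():
    # Stand-in for the expensive object-recognizer construction.
    return object()

@pytest.fixture(scope="session")
def shared_object_recognizer():
    # Built once per test session; every test that requests it reuses the same instance.
    return build_object_recognizer()

def test_learner_uses_shared_recognizer(shared_object_recognizer):
    # A real test would pass the shared recognizer into each new learner
    # instead of constructing a fresh recognizer per learner/template.
    assert shared_object_recognizer is not None
```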
Got it. Because the
@gabbard Here are the updated flamegraphs: 07-14_06-39_verbtest_flamegraphs.zip
@gabbard I looked into running the Travis tests in PyPy. Specifically, I tried running the tests using
I think what we want to do here is just
I tried running all tests using PyPy on adam-dev (excluding those that can't be run), and it was significantly slower than expected. (After about 1 hour running on adam-dev, testing had not finished.) I'll try benchmarking again to see what's going on.
@joecummings : You can disable the visualization tests altogether. The visualization is basically dead.
@gabbard Does that include
Also, I re-benchmarked the tests since I was concerned about PyPy possibly taking longer in total. (I think the last time I benchmarked the tests, it was purely on the verb learner. Also, when I attempted to run all tests in PyPy it took over an hour... because I was on an old branch that included the old-style learner tests, etc.) The results confirm that PyPy gives an overall speedup. (CPython takes 30 minutes 40 seconds total, while PyPy takes 12 minutes.) Results: 07-14_test_results.zip
@spigo900 : You can also get rid of
@gabbard It looks like somehow it takes Travis longer to run the tests in PyPy compared with CPython. See this PyPy build vs. this CPython build of the commit where the Travis branch diverged from master (so the tests being run should be almost identical). The PyPy build has taken 1 hour 36 minutes and is still going, while the CPython build took 1 hour 1 minute. I examined the log, and Travis is running the tests using PyPy, not CPython. (Although if it were using CPython and somehow getting much slower build times, that would raise perhaps even more questions.) I'm not sure how to square this with the adam-dev benchmarking results. It's not the steps before testing, because the CPython build spent more time overall on those steps than the PyPy build did. And it doesn't seem to be hanging, because (1) if so, Travis should have stopped the build by now, and (2) there was no hang when I ran the tests on adam-dev. Overall I am confused.
Here are the single-test profiling results from last Friday:
Here are the profiling results from running all subset verb learner tests:
I investigated some of the code based on the single-test profiling results:
@spigo900 : Thanks. Can you attend to the unnecessary copies and then re-profile?
@gabbard I haven't run profiling yet, but I did benchmarking over all tests. Eliminating the extra copies saves about 20 seconds at test time. (The total subset verb learner test time was 6 minutes 34 seconds with copies, 6 minutes 11 seconds without.) Full results: 07-21_benchmark_extra_copies_vs_not.zip
Here are the profiling results: 07-21_08-25_profile_no_extra_copies.zip
In response to the comment on #926, I replaced
@gabbard Given the profiling results we've seen so far, do you think it still makes sense to build a matching server? The matching algorithm itself (or
@spigo900 : I agree it doesn't make much sense unless we can drive the other portions of the code down further. An alternate thing to consider: since a lot of the work happens outside learning proper (e.g. in building situations and translating to perception graphs), can we use Python's multiprocessing to do all that translation on multiple cores in parallel, and do just the learning part proper (e.g. pattern matching and hypothesis updating) on the main thread?
@gabbard That's a good thought. I can look into that. I'm wondering how we would want to split things up over different cores, then. The one example I can think of right off is splitting the frame translation work in
I guess thinking about that is part of looking into using multiprocessing, though.
To clarify, are we talking about using the multiprocessing module, the threading module, or something else?
@spigo900 : Figuring out the best option, or at least the trade-offs, is part of the assignment. :-)
@spigo900 : I think we want to do the split at the highest level we can, so that as little code as possible has to be aware of the multiprocessing. One possibility is to have an "instance generator" object which gets passed the curricula to generate instances for. The overall learner then has a loop in which it asks the "instance generator" for its "next instance". The "instance generator", meanwhile, is busy in its own thread turning templates into concrete situations, and then farming out the templates across multiple threads for translation to perception graphs.
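Something like the following minimal sketch, perhaps (names are illustrative, and it assumes the per-template work and its results are picklable; it uses a process pool rather than threads, per the discussion below): worker processes turn templates into instances in the background while the learner consumes them through a single iterator.

```python
from multiprocessing import Pool

def build_instance(template):
    # Stand-in for instantiating a template into a concrete situation and
    # translating it into (language, perception) for the learner.
    return ("situation", "language", "perception", template)

def instance_generator(templates, processes=4):
    """Yield instances one at a time while worker processes build them."""
    with Pool(processes=processes) as pool:
        # imap preserves template order and yields results as they become ready.
        for instance in pool.imap(build_instance, templates):
            yield instance

if __name__ == "__main__":
    for instance in instance_generator(range(8)):
        print(instance)  # the learner's training loop would consume these instead
```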
I think the multiprocessing module is the winner. PyPy has a GIL, and this part of our code isn't doing I/O, so using threads doesn't make sense here. In terms of using multiprocessing,
We can easily present the same interface as
One annoying thing about multiprocessing is that interrupting the program while it's doing multiprocessing doesn't work well in CPython. I do not know if it works better in PyPy. Multiprocessing is also not very debugging-friendly, I think, so we'll want to keep the single-process instance group around for that. Looking into profiling, it seems that
It looks like profiling multiprocess programs is an issue with CPython and cProfile as well, and we might be able to work around it by adding a call to
The
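One workaround I know of (a sketch, not necessarily the suggestion referred to above): have each worker profile itself with cProfile and dump a per-PID stats file, which can then be inspected with pstats or gprof2dot. The `do_work` function here is a stand-in.

```python
import cProfile
import os
from multiprocessing import Pool

def do_work(n):
    # Stand-in for the real per-template work.
    return sum(i * i for i in range(100_000 + n))

def profiled_worker(n):
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        return do_work(n)
    finally:
        profiler.disable()
        # One stats file per worker process; merge or inspect them afterwards.
        profiler.dump_stats(f"worker_{os.getpid()}.pstats")

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        print(pool.map(profiled_worker, range(4)))
```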
I modified the flamegraph generation script to reformat the "all subset verb learner tests" output. It now shows the data aggregated over all calls to run_verb_test(), so it should give a more useful view of the data. Here is the script together with the results: 07-24_rewritten_overall_profiling_data.zip. Note that the data I used to generate this today is from the 17th, so I've also included the single-test profiling results from the 17th in the zip file. The results are almost identical, so I think that for now it makes sense to continue using the single-test profile for performance checking.
One complication with multiprocessing using
Looking into it, however, the import times may be a problem. The single-test results for test_sit show imports as taking about 16.74% of the total running time; for reference, the English plus Chinese parts take about 21 seconds to run, so that translates to about 3.5 seconds. Most of this comes from heavy imports not needed for situation generation -- the test utilities, and the object recognizer and alignment modules. The remainder (which situation generation is more likely to need) takes about 300 ms. I think ideally we want to generate as many situations as possible using (the same set of) other processes in the background, so that we don't have to wait on situation generation while running the tests. I'll think more about how to do this.
I've implemented this on a branch; however, it causes the rest of the code to fail due to issues with pickling. Multiprocessing uses pickles to transfer data between the main process and its subprocesses. It turns out some of our objects, at least, are not pickle-stable. Situations are one such example: if you have a situation x, then pickle.loads(pickle.dumps(x)) != x. That is, unserializing the serialized situation does not give you the same situation back. This happens with several other objects, too -- with Ontologies and AxisInfos, for example. As it turns out, the rest of the code is not robust to this. I tested the code without using multiprocessing at all -- doing the same thing I was doing in the multiprocessing version, but doing it synchronously. Without pickling, there was no error. I was able to reproduce the multiprocessing error by changing the code to pickle the situations before processing and to unpickle the results. This commit demonstrates the issue. For some situations the difference is trivial and the actual result is functionally the same. However, I am not sure this is true for all situations. While implementing the above test, I found that pickling and unpickling the inputs but not the outputs caused
I am not sure yet exactly where the problem enters; however, I noticed that one of the perception graphs generated by the multiprocessing version had two duplicate nodes. The duplicates were both axes: learner-back-to-front and gravitational-up. I suspect this is caused by pickling, though I'm unsure why it happens. ETA: As an update, when I pickle the outputs but not the inputs, I get similar results to when I pickle the inputs but not the outputs. The test fails, but it doesn't fail immediately.
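For checking which of our objects are pickle-stable, something like this quick helper could be run over a curriculum (the helper is mine, not from the repo; it assumes the objects define a meaningful `__eq__`):

```python
import pickle
from typing import Any, Iterable

def is_pickle_stable(obj: Any) -> bool:
    """True if unpickling the pickled object gives back an equal object."""
    return pickle.loads(pickle.dumps(obj)) == obj

def report_pickle_unstable(objects: Iterable[Any]) -> None:
    for index, obj in enumerate(objects):
        if not is_pickle_stable(obj):
            print(f"item {index} ({type(obj).__name__}) does not survive a pickle round trip")
```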
I've been working on a parallelized version that spawns two processes per template, one for the scenario curriculum and one for the training one. This version is quite slow: it takes at least five times as long as the non-parallel version. It also fails some tests. (I suspect this may have to do with the pickling issues detailed above.) On the other hand, the version that generates the perceptions/language using a process pool doesn't work at all. The process pool version mysteriously never puts the second template instance in the queue (after 60 seconds), but also mysteriously doesn't call either the result or the error callback. I tried saving the async result and calling
Pickling time doesn't seem to be the problem. I tried generating and pickling/unpickling the results on the main process, and it only took a second or two for each result. Overall I am confused about why the process pool version doesn't work.
Update: Actually, it turns out the process version isn't as slow as I thought. It's only about half again as slow as the normal way of running the tests (~6 minutes vs. ~4). The problem is instead that the subprocesses aren't all dying, so the test process doesn't exit when the tests are complete. I suspect this is because when the tests fail, the generator never finishes, so it never calls
As an update, I figured out why the pool-based version was failing. It turns out that there is a deadlock. If you run a job that involves a queue using
I was able to modify their example to use a pool and replicated the problem I've been having. The following code prints "oh no, queue was empty!" as-is and "obj is get!" when the code below the comment is moved into the "with" statement (i.e. before

```python
from multiprocessing import Pool, Queue
from time import sleep
import queue

def f():
    print('inside f')
    f.q.put('X' * 1000000)

def f_init(q):
    f.q = q

if __name__ == '__main__':
    q = Queue()
    with Pool(initializer=f_init, initargs=[q]) as pool:
        pool.apply_async(f, [])
    # NOTE: With code below inside the `with` statement, it finishes.
    sleep(5)
    print('DONE with sleep; getting an object')
    try:
        obj = q.get(timeout=10)
        print('obj is get!')
    except queue.Empty:
        print('oh no, queue was empty!')
```
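For reference, my understanding is that `Pool.__exit__` calls `terminate()`, so leaving the `with` block kills the worker before the large `put` completes; keeping the consumer inside the block (my reading of "moving the code into the `with` statement") lets the worker finish first. A sketch of that variant:

```python
from multiprocessing import Pool, Queue
from time import sleep
import queue

def f():
    f.q.put('X' * 1000000)

def f_init(q):
    f.q = q

if __name__ == '__main__':
    q = Queue()
    with Pool(initializer=f_init, initargs=[q]) as pool:
        pool.apply_async(f, [])
        sleep(5)
        try:
            obj = q.get(timeout=10)  # the worker is still alive here, so this succeeds
            print('obj is get!')
        except queue.Empty:
            print('oh no, queue was empty!')
```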
Joe and all,
Re our other pickle problems, check out
https://docs.python.org/2.4/lib/node66.html
which talks about what’s not pickleable.
Mitch
@mitchmarcus Thank you for the link. To clarify, the problem is not that Python can't pickle the objects. Rather, it's that when it does unpickle them, it causes some strange issues. If you pickle and unpickle the train and test curricula in-memory during the verb learner unit tests before using them, it causes those tests to fail in strange ways. Sometimes they crash, and sometimes they fail to produce the right description when ordinarily (without the pickle-unpickle step) they can. I suspect the root cause is something in our representation that specifies
However, writing this up does give me an idea. It's possible we could work around this (in a rather ugly way) by also pickling a copy of every learner together with the curriculum. (So for example a pair
I'll try experimenting with this in a simple script and see if that can work. I also seem to recall seeing a workaround or issue related to this and
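A tiny sketch of why pickling the learner and curriculum together (as one pair) might help: objects serialized in the same pickle payload keep their shared references after unpickling, whereas pickling them separately duplicates whatever they share. The dicts below are stand-ins for the real learner/curriculum objects.

```python
import pickle

shared_axis = {"name": "gravitational-up"}              # stand-in for a shared axis/ontology object
learner = {"kind": "learner", "axis": shared_axis}      # stand-in for a learner that references it
curriculum = {"kind": "curriculum", "axis": shared_axis}

# Pickled together as a pair: the shared object is still shared after the round trip.
learner2, curriculum2 = pickle.loads(pickle.dumps((learner, curriculum)))
print(learner2["axis"] is curriculum2["axis"])  # True

# Pickled separately: each copy ends up with its own duplicate of the shared object.
learner3 = pickle.loads(pickle.dumps(learner))
curriculum3 = pickle.loads(pickle.dumps(curriculum))
print(learner3["axis"] is curriculum3["axis"])  # False
```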
This may become an issue again in the future, but for now I'm closing this issue, as the inputs to the learners will not be ADAM's symbolic perception for Phase 3.
Our subset verb learner tests take a long time, around 7 minutes. We'd like them to take less time.
The test that takes the longest time to run is `test_sit`. It tests a relatively large number of examples (8 templates, with 10 training examples for each of those), but not combinatorially many (only 88 samples).
To profile `test_sit`, I ran the attached script using both cProfile and py-spy. From both profilers' results, it looks like most of the time is getting spent in the graph matching code via the object recognizer's `match_objects()`. The second biggest time sink is `PerceptionGraph.from_dynamic_perceptual_representation()`. cProfile estimates 53.49% of the running time is in `match_objects()`, 39.08% is in `PatternMatching.matches()`, 43.54% in `_internal_matches()`, and 37.91% in `subgraph_isomorphisms_iter()`. The second longest-running part of the code, `PerceptionGraph.from_dynamic_perceptual_representation()`, took 25.53% of the running time. Most of that time is spent in `translate_frames()`. The time sinks within that call seem to be `copy_with_temporal_scopes()` and `translate_frame()`.
(Note that these are all based on the PDF visualization. In that visualization, the test time is 97.35% of the total running time. The visualization includes the module loading and import time in its assessment, so the time spent executing the code is not 100% of the total.)
The script used for profiling and the profiling results (including the raw `pstats` file) are attached: 06-29_test_profiles.zip. The graph for cProfile was generated using gprof2dot.
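For reference, a minimal sketch (the `run_test_sit` body is a stand-in; the real profiling script is in the attached zip) of collecting a cProfile stats file that can be printed with pstats or turned into a call graph with gprof2dot:

```python
import cProfile
import pstats

def run_test_sit():
    # Stand-in for invoking the actual test_sit test code.
    return sum(i * i for i in range(1_000_000))

if __name__ == "__main__":
    cProfile.run("run_test_sit()", "test_sit.pstats")
    # Show the 20 most expensive calls by cumulative time; the same .pstats file
    # is what gprof2dot consumes to produce the PDF call graph.
    pstats.Stats("test_sit.pstats").sort_stats("cumulative").print_stats(20)
```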