Subset verb learner tests are slow #852

Closed
spigo900 opened this issue Jun 30, 2020 · 51 comments

@spigo900
Collaborator

spigo900 commented Jun 30, 2020

Our subset verb learner tests take a long time, around 7 minutes. We'd like them to take less time.

The test that takes the longest to run is test_sit. It tests a relatively large number of examples (8 templates, with 10 training samples for each), though not combinatorially many (only 88 samples in total).

To profile test_sit, I ran the attached script using both cProfile and py-spy. From both profilers' results, it looks like most of the time is getting spent in the graph matching code via the object recognizer's match_objects(). The second biggest time sink is PerceptionGraph.from_dynamic_perceptual_representation().

cProfile estimates 53.49% of the running time is in match_objects(), 39.08% is in PatternMatching.matches(), 43.54% in _internal_matches() and 37.91% in subgraph_isomorphisms_iter(). The second longest-running part of the code, PerceptionGraph.from_dynamic_perceptual_representation(), took 25.53% of the running time. Most of that time is spent in translate_frames(). The time sinks within that call seem to be copy_with_temporal_scopes() and translate_frame().

(Note that these numbers are all based on the PDF visualization. In that visualization, the test accounts for 97.35% of the total running time; the visualization counts module loading and import time in its assessment, so executing the test code itself accounts for less than 100% of the total.)

The script used for profiling and the profiling results (including the raw pstats file) are attached: 06-29_test_profiles.zip. The graph for cProfile was generated using gprof2dot.

@spigo900
Collaborator Author

spigo900 commented Jul 9, 2020

@gabbard Per your comment I tried running the tests in PyPy on adam-dev; however, it crashes while collecting the verb learner tests because immutablecollections doesn't know that PyPy's dictionaries have deterministic iteration order. Here's the error:

E   ValueError: ImmutableDicts can only be initialized from built-in dicts when the iteration of built-in dicts is guaranteed to be deterministic (Python 3.7+; CPython 3.6+)

It took some digging to find, but PyPy does guarantee deterministic dictionary iteration order:

Dictionaries and sets are ordered on PyPy. On CPython < 3.6 they are not; on CPython >= 3.6 dictionaries (but not sets) are ordered.
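
For reference, a version check along these lines (illustrative only, not immutablecollections' actual code) would be enough to also accept PyPy's dicts:

import platform
import sys

# Dict iteration order is guaranteed on any Python >= 3.7, on CPython >= 3.6, and on PyPy.
def dict_iteration_is_deterministic():
    implementation = platform.python_implementation()
    return (
        sys.version_info >= (3, 7)
        or implementation == "PyPy"
        or (implementation == "CPython" and sys.version_info >= (3, 6))
    )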

@gabbard

gabbard commented Jul 9, 2020

@spigo900 : Can you make a PR to immutablecollections to fix this?

@spigo900
Collaborator Author

spigo900 commented Jul 9, 2020

@gabbard Done, PR here.

@spigo900
Collaborator Author

@gabbard I ran the subset verb learner tests in both CPython and PyPy on adam-dev. It looks like PyPy finishes the tests in about half the time, sometimes a bit less. Here are the test times:
07-10_cpython_results.txt
07-10_pypy_results.txt

@gabbard

gabbard commented Jul 10, 2020

@spigo900 : Great! Can you write up directions for using PyPy and add them to the repository README? And can you see if we can get CI to use PyPy?

We should probably re-profile with PyPy to see if the time spent in different spots is proportionally the same.

@spigo900
Collaborator Author

@gabbard It looks like it should be possible to use PyPy with Travis.

As far as profiling, it looks like PyPy can run cProfile, though "turning on cProfile can distort the result," so they recommend using vmprof.

@gabbard

gabbard commented Jul 10, 2020

Got it - try vmprof then

@spigo900
Collaborator Author

@gabbard I profiled it using vmprof; the visualization options don't seem to be working, so here's the text-based tree view:

(adam) [jcecil@adam-dev adam]$ vmprofshow --prune_percent 10 07-10_profiling.log
100.0%  <module>  100.0%  tests/learner/profile_sit.py:1
 93.8% .. run  93.8%  tests/learner/profile_sit.py:9
 93.8% .... test_sit  100.0%  /nas/home/jcecil/projects/adam/repos/adam/tests/learner/subset_verb_learner_test.py:168
 82.8% ...... run_verb_test  88.3%  /nas/home/jcecil/projects/adam/repos/adam/tests/learner/subset_verb_learner_test.py:92
 44.0% ........ observe  53.1%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/template_learner.py:54
 28.6% .......... _preprocess_scene_for_learning  65.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/verbs.py:142
 28.6% ............ match_objects_with_language_old  100.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/object_recognizer.py:360
 27.8% .............. match_objects_with_language  97.2%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/object_recognizer.py:383
 27.7% ................ match_objects  99.6%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/object_recognizer.py:231
 19.2% .................. first  69.3%  /nas/home/jcecil/miniconda3/envs/adam/site-packages/more_itertools/more.py:120
 19.2% .................... matches  100.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:1172
 19.2% ...................... _internal_matches  99.9%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:1303
 15.2% ........................ subgraph_isomorphisms_iter  79.5%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/_matcher.py:542
 15.2% .......................... match  99.8%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/_matcher.py:278
 23.1% ........ observe  27.9%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/integrated_learner.py:82
 14.1% .......... enrich_during_learning  61.2%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/objects.py:488
 14.1% ............ match_objects_with_language  100.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/object_recognizer.py:383
 14.1% .............. match_objects  99.4%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/object_recognizer.py:231
 10.5% ................ first  74.4%  /nas/home/jcecil/miniconda3/envs/adam/site-packages/more_itertools/more.py:120
 10.5% .................. matches  100.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:1172
 10.5% .................... _internal_matches  100.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:1303

@gabbard

gabbard commented Jul 10, 2020

Huh, so it seems like 25% of the remaining time at most is being spent in the actual guts of the matching operation. A substantial amount is in run_verb_test outside of the calls to observe. Is there any way to get more details on what that might be? py-spy might also work here - I think sometimes it can provide line-level profiling information, though I might be wrong about that.

It also looks like a lot of time is spent in _observe but outside _preprocess_scene_for_learning, and in match_objects but outside the call to first(matches). Can you take a look there for low-hanging fruit?

@gabbard

gabbard commented Jul 10, 2020

One thing which could help our performance globally: every time we construct a PerceptionGraph we copy the incoming graph. However, many times we know that the graph we are providing will not be referenced elsewhere and we can save the copy.

Now, in a proper language with private constructors we'd just enforce creation through two factory methods like PerceptionGraph.copy_from_graph and PerceptionGraph.from_adopted_graph (or something like that, with the latter not doing a copy). But in Python we can't shut off access to the constructor, so if we make that change we need to carefully eliminate all direct constructions of PerceptionGraph and migrate them to one of the two factories, without the type system helping us find anything we miss.
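
A rough sketch of the two-factory shape (the method names come from above and are illustrative; the real PerceptionGraph carries more state):

import networkx as nx
from attr import attrib, attrs

@attrs(slots=True)
class PerceptionGraphSketch:
    # Illustrative stand-in for PerceptionGraph.
    _graph: nx.DiGraph = attrib()

    @classmethod
    def copy_from_graph(cls, graph: nx.DiGraph) -> "PerceptionGraphSketch":
        # Defensive copy: the caller may keep using (and mutating) its graph.
        return cls(graph.copy())

    @classmethod
    def from_adopted_graph(cls, graph: nx.DiGraph) -> "PerceptionGraphSketch":
        # No copy: the caller promises the graph is not referenced anywhere else.
        return cls(graph)

Callers that build a throwaway digraph (e.g. during frame translation) would use from_adopted_graph and skip the copy; everything else would go through copy_from_graph.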

@spigo900
Collaborator Author

spigo900 commented Jul 10, 2020

@gabbard vmprofshow has a --lines option, but unfortunately it doesn't seem to show number or percentage of hits by line. (It has headers for them in the tables for each file, but every row is blank.)

I was able to use prune_level to get a better picture of run_verb_test. It looks like the other time sinks inside run_verb_test are:

  • instances
    • 9.5% of function time
  • describe
    • 7% of function time; that is,
      • 4.5 percentage points from the old-style learner,
      • 2.5 percentage points from the new-style one.
  • sampled
    • 2.4% of function time

Here's the full tree from that:

(adam) [jcecil@adam-dev adam]$ pypy3 -m vmprof.show --prune_percent 2 --prune_level 4 ../../07-10_profiling.log
100.0%  <module>  100.0%  tests/learner/profile_sit.py:1
  3.7% .. <module>  3.7%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/__init__.py:1
  3.3% .... <module>  89.2%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/alignments.py:1
  2.7% ...... <module>  80.1%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/__init__.py:1
 93.8% .. run  93.8%  tests/learner/profile_sit.py:9
 93.8% .... test_sit  100.0%  /nas/home/jcecil/projects/adam/repos/adam/tests/learner/subset_verb_learner_test.py:168
  7.2% ...... subset_verb_language_factory  7.7%  /nas/home/jcecil/projects/adam/repos/adam/tests/learner/subset_verb_learner_test.py:62
  7.2% ........ object_recognizer_factory  100.0%  /nas/home/jcecil/projects/adam/repos/adam/tests/learner/__init__.py:25
 82.8% ...... run_verb_test  88.3%  /nas/home/jcecil/projects/adam/repos/adam/tests/learner/subset_verb_learner_test.py:92
  2.0% ........ sampled  2.4%  /nas/home/jcecil/projects/adam/repos/adam/adam/situation/templates/phase1_templates.py:317
  7.8% ........ instances  9.5%  /nas/home/jcecil/projects/adam/repos/adam/adam/curriculum/__init__.py:149
 44.0% ........ observe  53.1%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/template_learner.py:54
  3.7% ........ describe  4.5%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/template_learner.py:89
 23.1% ........ observe  27.9%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/integrated_learner.py:82
  2.1% ........ describe  2.5%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/integrated_learner.py:134
  3.7% ...... integrated_learner_factory  3.9%  /nas/home/jcecil/projects/adam/repos/adam/tests/learner/subset_verb_learner_test.py:70
  3.6% ........ object_recognizer_factory  98.9%  /nas/home/jcecil/projects/adam/repos/adam/tests/learner/__init__.py:25

Unfortunately it doesn't look like py-spy supports PyPy. I tried running it:

(adam) [jcecil@adam-dev adam]$ py-spy record -o profile.svg -- pypy3 tests/learner/profile_sit.py
Error: Failed to find python version from target process

I'll look into observe.

@gabbard

gabbard commented Jul 10, 2020

One thing which leaps out in instances is that we should lift the generation of the perceptions out of the loop over linguistic realizations. However, I think we are currently only ever producing one linguistic realization, so this is more future-proofing than improving our current situation.

Can you up the prune depth even further? I suspect the bulk of the time from instances is being spent in the language generator or the perception generator, but we need to dig deeper to figure out which.

@spigo900
Collaborator Author

@gabbard It's spending the majority of the time in instances inside the perception generator:

  7.8% ................ instances  9.5%  /nas/home/jcecil/projects/adam/repos/adam/adam/curriculum/__init__.py:149
  0.5% .................... generate_language  6.8%  /nas/home/jcecil/projects/adam/repos/adam/adam/language_specific/english/english_language_generator.py:136
  7.3% .................... generate_perception  93.1%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/high_level_semantics_situation_to_developmental_p...:118

Here's the breakdown for what the perception generator is itself doing (with extra bits deleted):

(adam) [jcecil@adam-dev adam]$ pypy3 -m vmprof.show --indent 4 --prune_level 12 --prune_percent 0.05 07-10_profiling.log
  7.3% .................... generate_perception  93.1%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/high_level_semantics_situation_to_developmental_p...:118
  7.3% ........................ do  99.6%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/high_level_semantics_situation_to_developmental_p...:217
  7.3% ............................ _real_do  100.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/high_level_semantics_situation_to_developmental_p...:225
  4.0% ................................ _perceive_objects  55.7%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/high_level_semantics_situation_to_developmental_p...:361
  0.1% .................................... instantiate_ontology_node  3.6%  /nas/home/jcecil/projects/adam/repos/adam/adam/situation/__init__.py:158
  3.9% .................................... _perceive_object  96.4%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/high_level_semantics_situation_to_developmental_p...:377
  3.8% ........................................ _instantiate_object_schema  96.8%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/high_level_semantics_situation_to_developmental_p...:1112
  1.7% ............................................ copy  44.1%  /nas/home/jcecil/projects/adam/repos/adam/adam/geon.py:50
  1.3% ................................................ remap_axes  80.1%  /nas/home/jcecil/projects/adam/repos/adam/adam/axes.py:262
  0.7% ................................ _perceive_property_assertions  10.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/high_level_semantics_situation_to_developmental_p...:459
  0.6% ................................ _perceive_colors  8.4%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/high_level_semantics_situation_to_developmental_p...:594
  0.2% ................................ _perceive_size_relative_to_learner  2.6%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/high_level_semantics_situation_to_developmental_p...:737
  0.1% ................................ _perceive_axis_info  1.7%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/high_level_semantics_situation_to_developmental_p...:771
  0.3% ................................ _perceive_action  3.9%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/high_level_semantics_situation_to_developmental_p...:853
  0.9% ................................ _perceive_ground_relations  12.7%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/high_level_semantics_situation_to_developmental_p...:776
  0.3% ................................ __init__  3.6%  <attrs generated init adam.perception.developmental_primitive_perception.DevelopmentalPrimitivePerceptionFr...:1

It looks like object perception is the single biggest time sink within the perception generator, although the other steps add up to a significant amount of the time.

@spigo900
Collaborator Author

@gabbard Here are the results for observe.

Within observe, both the old- and new-style learners spend most of their time on learn_from (20-25%) and a decent amount on extracting the perception graph (11.7-13.4%). Most but not all of the time in learn_from is spent intersecting patterns.

In match_objects (inside observe), both learners spent most of their time in first (70-75%) and about 10% of their time replacing subgraphs. The other time sink for both learners was extracting candidate objects. The old-style learner spent more time (17.3% of its time in match_objects) on this and the new-style spent somewhat less (10.2%).

Inside observe but outside _preprocess_scene_for_learning / enrich_during_learning

Old learner

 44.0% ................ observe  53.1%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/template_learner.py:54
  5.9% .................... _extract_perception_graph  13.4%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/verbs.py:137
  5.9% ........................ from_dynamic_perceptual_representation  100.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:347
  5.9% ............................ translate_frames  100.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:2676
  2.0% ................................ translate_frame  34.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:2598
  1.5% ................................ copy_with_temporal_scopes  26.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:355
 28.6% .................... _preprocess_scene_for_learning  65.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/verbs.py:142
  9.1% .................... _learning_step  20.7%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/subset.py:44
  1.1% ........................ _hypothesis_from_perception  11.7%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/verbs.py:177
  1.1% ............................ from_graph  99.8%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/perception_graph_template.py:49
  1.0% ................................ from_graph  96.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:687
  0.9% .................................... _translate_graph  85.5%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:809
  8.0% ........................ _update_hypothesis  88.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/verbs.py:187
  8.0% ............................ intersection  99.4%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/perception_graph_template.py:96
  7.8% ................................ intersection  98.1%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:964
  7.8% .................................... relax_pattern_until_it_matches  99.5%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:1235
  6.3% ........................................ first  81.4%  /nas/home/jcecil/miniconda3/envs/adam/site-packages/more_itertools/more.py:120
  6.3% ............................................ _internal_matches  99.9%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:1303
  3.4% ................................................ subgraph_isomorphisms_iter  53.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/_matcher.py:542
  1.0% ................................................ __init__  15.1%  <attrs generated init adam.perception.perception_graph.PatternMatching.MatchFailure>:1
  1.4% ........................................ _relax_pattern  18.4%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:1449

Integrated learner

 23.1% ................ observe  27.9%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/integrated_learner.py:82
  2.7% .................... _extract_perception_graph  11.7%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/integrated_learner.py:325
  2.7% ........................ from_dynamic_perceptual_representation  99.9%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:347
  2.7% ............................ translate_frames  100.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:2676
  1.0% ................................ translate_frame  37.6%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:2598
 14.1% .................... enrich_during_learning  61.2%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/objects.py:488
  5.8% .................... learn_from  24.9%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/template_learner.py:260
  5.2% ........................ _learning_step  90.2%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/subset.py:149
  4.6% ............................ <listcomp>  88.8%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/subset.py:205
  4.6% ................................ _update_hypothesis  100.0%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/verbs.py:253
  4.6% .................................... intersection  99.4%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/perception_graph_template.py:96
  4.4% ........................................ intersection  96.7%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:964
  4.4% ............................................ relax_pattern_until_it_matches  99.3%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:1235
  3.6% ................................................ first  82.0%  /nas/home/jcecil/miniconda3/envs/adam/site-packages/more_itertools/more.py:120

In match_objects, but outside first

Old-style learner

 27.7% ................................ match_objects  99.6%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/object_recognizer.py:231
  2.8% .................................... replace_match_with_object_graph_node  10.3%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/object_recognizer.py:664
  4.8% .................................... extract_candidate_objects  17.3%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/object_recognizer.py:494
  2.6% ........................................ subgraph_by_nodes  54.1%  /nas/home/jcecil/projects/adam/repos/adam/adam/perception/perception_graph.py:381
  2.3% ............................................ __init__  89.8%  <attrs generated init adam.perception.perception_graph.PerceptionGraph>:1
  2.3% ................................................ copy_digraph  97.1%  /nas/home/jcecil/projects/adam/repos/adam/adam/utils/networkx_utils.py:69
 19.2% .................................... first  69.3%  /nas/home/jcecil/miniconda3/envs/adam/site-packages/more_itertools/more.py:120

Integrated learner

 14.1% ............................ match_objects  99.4%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/object_recognizer.py:231
  1.6% ................................ replace_match_with_object_graph_node  11.1%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/object_recognizer.py:664
  1.4% ................................ extract_candidate_objects  10.2%  /nas/home/jcecil/projects/adam/repos/adam/adam/learner/object_recognizer.py:494
 10.5% ................................ first  74.4%  /nas/home/jcecil/miniconda3/envs/adam/site-packages/more_itertools/more.py:120

@spigo900
Collaborator Author

spigo900 commented Jul 13, 2020

It looks like to get line profiling results, you have to explicitly enable that when you do your profiling, so I'll try that and see if it gets us any more interesting results.

@spigo900
Collaborator Author

Addendum: No line profiling for PyPy:

(adam) [jcecil@adam-dev adam]$ sh 07-13_profile.sh
Line profiling is currently unsupported for PyPy. Running without lines statistics.

@spigo900
Collaborator Author

spigo900 commented Jul 13, 2020

I was able to get flamegraphs of the overall profiling results. For the first I used --prune_percent 0.1 and for the second, 07-13_profile_smaller.svg, I used --prune_percent 1. File: 07-13_verbtest_flamegraphs.zip

@gabbard

gabbard commented Jul 13, 2020

@spigo900 : The lack of line profiling for PyPy is not too surprising, since the JIT probably makes tracking line correspondences difficult (though not impossible; the JVM can do it pretty well).

@gabbard

gabbard commented Jul 13, 2020

@spigo900 : I see constructing the object recognizer for the test is about 7% of the time. A quick thing to do would be to check whether anything in our test suite is creating an object recognizer for each test instead of creating and initializing a single shared one across all tests.

@spigo900
Collaborator Author

@gabbard Yes, it's creating a new object recognizer every time it creates a new learner, i.e. once for every template in every test.

@gabbard

gabbard commented Jul 13, 2020

Got it. Because the ObjectRecognizer is immutable, we can just create it once and reuse it in every test.
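
A minimal sketch of the sharing (hypothetical names; the real builder is the object_recognizer_factory in tests/learner/__init__.py, whose exact signature this doesn't reproduce):

import functools

def build_object_recognizer(language_mode):
    # Stand-in for the expensive recognizer construction.
    return {"language_mode": language_mode, "recognizer": "expensive-to-build"}

@functools.lru_cache(maxsize=None)
def shared_object_recognizer(language_mode):
    # ObjectRecognizer is immutable, so one cached instance per language mode can be
    # safely reused by every learner in every test.
    return build_object_recognizer(language_mode)

# Both calls return the same cached instance.
assert shared_object_recognizer("ENGLISH") is shared_object_recognizer("ENGLISH")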

@spigo900
Collaborator Author

@gabbard Here are the updated flamegraphs: 07-14_06-39_verbtest_flamegraphs.zip

@spigo900
Collaborator Author

spigo900 commented Jul 14, 2020

@gabbard I looked into running the Travis tests in PyPy. Specifically, I tried running the tests using pypy3 -m pytest on adam-dev. There are some errors because the visualization tests and experiment_test.py require NumPy, Torch, Panda3D, etc. NumPy is available for PyPy, but I think the others are not, and those tests don't work without them.

I think what we want to do here is just --ignore those tests and run them instead using CPython, although I'm not sure how to tell Pytest to run only those tests for the CPython end of things.

@spigo900
Collaborator Author

spigo900 commented Jul 14, 2020

I tried running all tests using PyPy on adam-dev (excluding those that can't be run), and it was significantly slower than expected. (After about 1 hour running on adam-dev, testing had not finished.) I'll try benchmarking again to see what's going on.

@gabbard

gabbard commented Jul 14, 2020

@joecummings : You can disable all the visualization tests altogether. The visualization is basically dead.

@spigo900
Collaborator Author

spigo900 commented Jul 15, 2020

@gabbard Does that include experiment_test.py ? I assume so but wanted to confirm. (It doesn't seem to directly import any visualization-related code.)

Also, I rebenchmarked the tests since I was concerned about PyPy possibly taking longer in total. (I think the last time I benchmarked the tests, it was purely on the verb learner. Also, when I attempted to run all tests in PyPy it took over an hour... because I was on an old branch that included the old-style learner tests, etc.)

The results confirm that PyPy gives an overall speedup. (CPython takes 30 minutes 40 seconds total, while PyPy takes 12 minutes.) Results: 07-14_test_results.zip

@gabbard

gabbard commented Jul 15, 2020

@spigo900 : You can also get rid of experiment_test - it's outdated

@spigo900
Collaborator Author

@gabbard It looks like somehow it takes Travis longer to run the tests in PyPy compared with CPython. See this PyPy build vs. this CPython build of the commit where the Travis branch diverged from master (so the tests being run should be almost identical). The PyPy build has taken 1 hour 36 minutes and is still going while the CPython build took 1 hour 1 minute. I examined the log and Travis is running the tests using PyPy, not CPython. (Although if it were using CPython and somehow getting much slower build times, that would raise perhaps even more questions.)

I'm not sure how to square this with the adam-dev benchmarking results. It's not the steps before testing because the CPython build spent more time overall on those steps compared with the PyPy build. And it doesn't seem to be hanging, because (1) if so, Travis should have stopped the build by now, and (2) there was no hang when I ran the tests on adam-dev. Overall I am confused.

@spigo900
Collaborator Author

Here are the single-test profiling results from last Friday:
07-17_10-26_profile.zip

Here are the profiling results from running all subset verb learner tests:
07-17_11-52_all_profile.zip

@spigo900
Collaborator Author

I investigated some of the code based on the single-test profiling results:

  • extract_candidate_objects()
    • 8% of running time, 4.03 percentage points of which are outside subgraph_by_nodes()
      • That is, 4.03% of running time is spent in this function but outside subgraph_by_nodes()
    • It copies the scene digraph apparently without need.
      • I suspect this might be because match_objects needs to use the resulting candidate subgraphs and not have them change out from under it when it replaces some of the subgraphs, in which case the copy wouldn't be unneeded, but I'm not sure.
    • There are two different BFSes which seem to visit the same nodes?
    • Given that we now create indirect subobject relations in _instantiate_object_schema(), I suspect we could optimize this some.
      • We don't need the first BFS anymore, I think, so we could drop it and just directly check for a "PART_OF" edge in the second one.
  • translate_frame()
    • 3.70% of running time
    • Unneeded copy at the end due to PerceptionGraph constructor copying the graph (as discussed above)
  • translate_frames() (multiple frames)
    • 10.63% of running time
    • Makes seven unneeded graph copies
      • Twice for the two calls to translate_frame()
      • Twice for the two calls to copy_with_temporal_scopes()
      • Twice for the two calls to copy_as_digraph()
      • One last time when it returns the PerceptionGraph at the end
    • Small optimization: It could probably use the "before" frame graph copy as its base instead of creating a new graph and updating with the nodes and edges from the "before" frame graph.
  • copy_with_temporal_scopes()
    • 2.85% of running time
    • The only issue I noticed was the extra copy (because it returns a new PerceptionGraph)
  • _instantiate_object_schema()
    • 5.54% of running time
    • No obvious optimizations that I saw

@gabbard

gabbard commented Jul 20, 2020

@spigo900 : Thanks. Can you attend to the unnecessary copies and then re-profile?

@spigo900
Collaborator Author

@gabbard I haven't run profiling yet, but I did benchmarking over all tests. Eliminating the extra copies saves about 20 seconds at test time. (The total subset verb learner test time was 6 minutes 34 seconds with copies, 6 minutes 11 seconds without.) Full results: 07-21_benchmark_extra_copies_vs_not.zip

@spigo900
Collaborator Author

Here are the profiling results: 07-21_08-25_profile_no_extra_copies.zip

@spigo900
Collaborator Author

In response to the comment on #926, I replaced copy_digraph_with_temporal_scopes with add_temporal_scopes, then re-benchmarked and re-profiled the code. Here are the benchmark and the profiling results: 07-21_10-34_pypy_results_one_less_copy.txt and 07-21_11-09_profile_one_less_copy.zip

@spigo900
Collaborator Author

spigo900 commented Jul 22, 2020

@gabbard Given the profiling results we've seen so far, do you think it still makes sense to build a matching server? The matching algorithm itself (or subgraph_monomorphisms_iter()) seems to consistently make up only about 17-20% of total time (not very much, as you pointed out), so handling it on a server would save at most about 1.2 minutes, and realistically less than that. Given the extra maintenance such a server would require, I'm wondering if it still makes sense.

@gabbard

gabbard commented Jul 22, 2020

@spigo900 : I agree it doesn't make much sense unless we can drive the other portions of the code down further.

An alternate thing to consider: since a lot of the work happens outside learning proper (e.g. in building situations and translating to perception graphs), can we use Python's multiprocessing to do all that translation across multiple cores in parallel, and do just the learning part proper (e.g. pattern matching and hypothesis updating) on the main thread?

@spigo900
Collaborator Author

@gabbard That's a good thought. I can look into that.

I'm wondering how we would want to split things up over different cores then. The one example I can think of right off is splitting the frame translation work in translate_frames() since those should be independent, but I suspect that wouldn't save us much time. There is probably a better way to split things up, maybe inside translate_frame().

I guess thinking about that is part of looking into using multiprocessing, though.

@spigo900
Collaborator Author

To clarify, are we talking about using the multiprocessing module, the threading module, or something else?

@gabbard

gabbard commented Jul 22, 2020

@spigo900 : Figuring out the best option, or at least the trade-offs, is part of the assignment. :-)

@gabbard

gabbard commented Jul 22, 2020

@spigo900 : I think we want to do the split at the highest level we can, so that as little code as possible has to be aware of the multiprocessing. One possibility is to have an "instance generator" object which gets passed the curricula to generate instances for. The overall learner then has a loop where it asks the "instance generator" for its "next instance". The "instance generator", meanwhile, is busy in its own thread turning templates into concrete situations and then farming the templates out across multiple threads for translation to perception graphs.

@spigo900
Collaborator Author

I think the multiprocessing module is the winner. PyPy has a GIL, and this part of our code isn't doing I/O, so using threads doesn't make sense here.

In terms of using multiprocessing, Pool.imap() (documented here) seems most suited to implementing @gabbard's idea above, and probably useful for other places we might want to split up work. Key points:

  • imap lets you split up work over a fixed number of subprocesses.
  • It won't block until they are done; instead it returns an iterator.
    • However, it will continue working in the background until it has done all the work, rather than stopping at some fixed number until some have been read through the iterator.
    • So it should speed things up by getting the generation work done ahead of time.
  • It preserves the ordering of the results, so it shouldn't make our test results non-deterministic.

We can easily present the same interface as GeneratedFromSituationsInstanceGroup while doing the work in parallel using imap.
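
A rough sketch of that shape (hypothetical names; the real instance group yields situation/linguistic-description/perception triples rather than strings):

from multiprocessing import Pool

def realize_instance(situation):
    # Stand-in for running the language generator and perception generator on one situation.
    return ("language for %s" % situation, "perception for %s" % situation)

def instances_in_parallel(situations, processes=4):
    # Yields instances in input order (so results stay deterministic) while the pool
    # keeps generating further instances in the background.
    with Pool(processes=processes) as pool:
        yield from pool.imap(realize_instance, situations)

if __name__ == "__main__":
    for language, perception in instances_in_parallel(range(8)):
        print(language, perception)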

One annoying thing about multiprocessing is that, at least in CPython, interrupting the program while it's doing multiprocessing work doesn't behave well. I don't know whether it behaves better in PyPy.

Multiprocessing is also not very debugging-friendly, I think, so we'll want to keep around the single-process instance group for that.

Looking into profiling, it seems that vmprof cannot profile the subprocesses, only the main process/'thread,' which may be a problem.

@spigo900
Collaborator Author

spigo900 commented Jul 24, 2020

It looks like profiling multiprocess programs is an issue with CPython and cProfile as well, and we might be able to work around it by adding a call to vmprof in code that the subprocesses will call to profile them separately. Messy, but it may work.

The vmprof function we want to call is vmprof.enable (documented here).
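
A sketch of the messy workaround (assuming vmprof.enable()/vmprof.disable() behave in a subprocess as they do in the main process; the worker body is a stand-in):

import os
from multiprocessing import Process

import vmprof

def profiled_worker():
    # Each subprocess writes its own profile file, since the parent's vmprof session
    # does not see child processes.
    profile_file = open("worker-%d.vmprof" % os.getpid(), "w+b")
    vmprof.enable(profile_file.fileno())
    try:
        sum(i * i for i in range(1000000))  # stand-in for perception generation work
    finally:
        vmprof.disable()
        profile_file.close()

if __name__ == "__main__":
    worker = Process(target=profiled_worker)
    worker.start()
    worker.join()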

@spigo900
Collaborator Author

spigo900 commented Jul 24, 2020

I modified the flamegraph generation script to reformat the "all subset verb learner tests" output. It now shows the data aggregated over all calls to run_verb_test(), such that it should give a more useful view of the data. Here is the script together with the results: 07-24_rewritten_overall_profiling_data.zip

Note that the data I used to generate this today is from the 17th, so I've also included the single-test profiling results for the 17th in the zip file.

The results are almost identical, so I think that for now it makes sense to continue using the single-test profiling for performance checking.

@spigo900
Collaborator Author

One complication with using imap this way is that we would be using multiprocessing only while testing a single template within a larger test. That is, we would spawn separate pools of extra processes for sit and sit on when running test_sit. Within each template/run pair there may be few enough situations that doing this isn't worthwhile, and it may even be counterproductive.

Looking into it, however, the import times may be a problem. The single-test results for test_sit show imports taking about 16.74% of the total running time; for reference, the English plus Chinese parts take about 21 seconds to run, so that translates to about 3.5 seconds. Most of this comes from heavy imports not needed for situation generation -- the test utilities, and the object recognizer and alignment modules. The remainder (which situation generation is more likely to need) takes about 300 ms.

I think ideally we want to generate as many situations as possible using (the same set of) other processes in the background so that we don't have to wait on situation generation while running the tests. I'll think more about how to do this.

@spigo900
Collaborator Author

spigo900 commented Jul 30, 2020

I've implemented this on a branch; however, it causes the rest of the code to fail due to issues with pickling.

Multiprocessing uses pickle to transfer data between the main process and its subprocesses. It turns out some of our objects, at least, are not pickle-stable. Situations are one such example: if you have a situation x, then pickle.loads(pickle.dumps(x)) != x. That is, unserializing the serialized situation does not give you the same situation back. This happens with several other objects, too -- with Ontologies and AxisInfos, for example.

As it turns out, the rest of the code is not robust to this. I tested the code without using multiprocessing at all -- doing the same thing I was in the multiprocessing version, but doing it synchronously. Without pickling, there was no error. I was able to reproduce the multiprocessing error by changing the code to pickle the situations before processing and to unpickle the results. This commit demonstrates the issue.

For some situations the difference is trivial and the actual result is functionally the same. However, I am not sure this is true for all situations. While implementing the above test, I found that pickling and unpickling the inputs but not the outputs caused test_eat_simple to fail, but not immediately (as it does with multiprocessing as well as when the outputs are also pickled). Thus I suspect that for some situations there may be a material difference and not just a difference in object identities.

I am not sure yet exactly where the problem enters; however, I noticed that one of the perception graphs generated by the multiprocessing version had two duplicate nodes. The duplicates were both axes: learner-back-to-front and gravitational-up. I suspect this is caused by pickling; however, I'm unsure why it happens.

ETA: As an update, when I pickle the outputs but not the inputs, I get similar results to when I pickle the inputs but not the outputs. The test fails but it doesn't fail immediately.

@spigo900
Collaborator Author

spigo900 commented Aug 4, 2020

I've been working on a parallelized version that spawns two processes per template, one for the scenario curriculum and one for the training one. This version is quite slow. It takes at least five times as long as the non-parallel version. It also fails some tests. (I suspect this may have to do with the pickling issues detailed above.) On the other hand the version that generates the perceptions/language using a process pool doesn't work at all.

The process pool version mysteriously never puts the second template instance in the queue (after 60 seconds), but also mysteriously doesn't call either the result or the error callback. I tried saving the async result and calling .get(6) (6 seconds) on it. This timed out without giving me a remote execution error.

Pickling time doesn't seem to be the problem. I tried generating and pickling/unpickling the results on the main process and it only took a second or two for each result.

Overall I am confused why the process pool version doesn't work.

@spigo900
Collaborator Author

spigo900 commented Aug 4, 2020

Update: Actually, it turns out the process version isn't as slow as I thought. It's only about half again as slow as the normal way of running the tests (~6 minutes vs. ~4). The real problem is that the subprocesses aren't all dying, so the test process doesn't exit when the tests are complete. I suspect this is because when a test fails, the generator never finishes, so it never calls .join() on its corresponding process. To fix this I'd need to join the processes even when the test fails (see the sketch below). Considering this version is so much slower, though, I think it's not worth pursuing.
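
For reference, the join-on-failure shape would be something like this toy example (not the branch's actual code):

from multiprocessing import Process, Queue

def produce(queue):
    # Stand-in for the per-template generation process.
    queue.put("instance")

if __name__ == "__main__":
    queue = Queue()
    worker = Process(target=produce, args=(queue,))
    worker.start()
    try:
        print(queue.get(timeout=10))  # stand-in for the test body, which may raise
    finally:
        worker.join()  # join even when the test fails, so the test process can exit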

@spigo900
Collaborator Author

spigo900 commented Aug 5, 2020

As an update, I figured out why the pool-based version was failing. It turns out that there is a deadlock.

If you run a job that involves a queue using Pool#apply_async(), and you then call Pool#close() on the pool before you call Queue#get() on the queue, you get a deadlock. As far as I can tell, this is because when you call close(), the Pool calls join() on its subprocesses, which causes a deadlock as detailed here.

I was able to modify their example to use a pool and replicated the problem I've been having. The following code prints "oh no, queue was empty!" as-is and "obj is get!" when the code below the comment is moved into the "with" statement (i.e. before close() is called on the pool):

from multiprocessing import Pool, Queue
from time import sleep
import queue

def f():
    # Runs in a worker process; f.q was attached to the function by f_init below.
    print('inside f')
    f.q.put('X' * 1000000)

def f_init(q):
    # Pool initializer: stash the queue on the function object so workers can reach it.
    f.q = q

if __name__ == '__main__':
    q = Queue()
    with Pool(initializer=f_init, initargs=[q]) as pool:
        pool.apply_async(f, [])
    # NOTE: With code below inside the `with` statement, it finishes.
    sleep(5)
    print('DONE with sleep; getting an object')
    try:
        obj = q.get(timeout=10)
        print('obj is get!')
    except queue.Empty:
        print('oh no, queue was empty!')

@mitchmarcus
Collaborator

mitchmarcus commented Aug 13, 2020 via email

@spigo900
Collaborator Author

@mitchmarcus Thank you for the link.

To clarify, the problem is not that Python can't pickle the objects. Rather, it's that when it does unpickle them, it causes some strange issues. If you pickle and unpickle the train and test curricula in-memory during the verb learner unit tests before using them, it causes those tests to fail in strange ways. Sometimes they crash and sometimes they fail to produce the right description when ordinarily (without the pickle-unpickle step) they can.

I suspect the root cause is something in our representation that specifies eq=False to attrs. (I think Jacob may have referenced this? My Zoom connection dropped after he mentioned attrs.) Most likely the problem is the Ontology and its parts. The issue, I think, is that the curriculum references the ontology, which gets pickled by value. Once a curriculum is unpickled, its ontology no longer matches GAILA_PHASE_1_ONTOLOGY, which is the one the learner uses. (It's definitely true that the ontology fails to match up with the GAILA_... one, but I haven't been able to confirm that this is actually the cause of the problem.)
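
A toy illustration of that suspicion (not ADAM's real classes): with eq=False, attrs falls back to identity comparison, so a pickle round trip can never compare equal to the original object.

import pickle

from attr import attrs

@attrs(eq=False)  # identity-based equality, as suspected of the real Ontology
class ToyOntology:
    pass

ontology = ToyOntology()
roundtripped = pickle.loads(pickle.dumps(ontology))
print(roundtripped == ontology)  # False: the unpickled copy is a distinct, "unequal" object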

However, writing this up does give me an idea. It's possible we could work around this (in a rather ugly way) by also pickling a copy of every learner together with the curriculum. (So for example a pair (generated_curriculum, dictionary_mapping_learner_names_to_learners).) However I also suspect pickle/attrs is not smart enough to pickle it in such a way that the pickled learner will reference the same ontology object as the pickled curriculum does.

I'll try experimenting with this in a simple script and see if that can work. I also seem to recall seeing a workaround or issue related to this and attrs -- I'll see if I can find such a thing.
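
For what it's worth, plain pickle does preserve shared references within a single dumps() call, which is what the pair idea relies on; whether that carries over cleanly to our attrs-based objects is exactly what the script would need to confirm. A toy check with dicts standing in for the real objects:

import pickle

ontology = {"name": "toy-ontology"}
curriculum = {"ontology": ontology}
learner = {"ontology": ontology}

# Pickled together, the shared ontology is restored as one shared object.
restored_curriculum, restored_learner = pickle.loads(pickle.dumps((curriculum, learner)))
print(restored_curriculum["ontology"] is restored_learner["ontology"])  # True

# Pickled separately, the restored ontologies are distinct objects.
separate_curriculum = pickle.loads(pickle.dumps(curriculum))
separate_learner = pickle.loads(pickle.dumps(learner))
print(separate_curriculum["ontology"] is separate_learner["ontology"])  # False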

@lichtefeld
Collaborator

This may become an issue again in the future, but for now I'm closing this issue, as the inputs to the learners for Phase 3 will not be ADAM's symbolic perception.
