tau memory_instrumentation (master) #5

Open: wants to merge 4 commits into base: main

naughtont3
Owner

@naughtont3
Owner Author

No, this is not valid. Things do not even compile. It seems there was a structural change to ompi_communicator_t that removed the opal_object_t c_base field and replaced it with an opal_infosubscriber_t super field. I believe this came from 50aa143.

@naughtont3

Need to look in more detail, but the reference to c_base was to get at the class name for the object, i.e., new_comm->c_base.obj_class->cls_name.

I wonder if we can now get at this through the super field, with something like new_comm->super.s_base.obj_class->cls_name. (See also: opal/util/info_subscriber.h)

@naughtont3 naughtont3 force-pushed the tjn-tau-meminstr-master branch from 09230a4 to bd0a3e0 Compare January 5, 2019 01:39
@naughtont3

The changes build, and after pulling updates for OMPI master the basic startup worked (without tau_exec). I have to rebuild TAU against this build of OMPI due to errors that look like they may be related to library versioning issues.

@naughtont3

naughtont3 commented Jan 5, 2019

OK, things worked fine after rebuild. I updated to TAU-2.28 and it worked too. Note, with OMPI master we now need to patch TAU (v2.27 or v2.28) to avoid using MPI_Handler_function (which was disabled/removed from OMPI). (See also: https://www.open-mpi.org/faq/?category=mpi-removed#mpi-1-mpi-handler-function)

A quick test with the ring example appears to work as expected. We should be a bit more careful to make sure we are tracking all the proper fields for allocations, but this PR appears to work with the latest master.

@naughtont3

naughtont3 commented Jan 5, 2019

Attaching my setup notes for testing on ELK.

With that setup, I use the OMPI ring_c test and run the post-processing script like this...

 orterun -np 2 -host node1,node2 tau_exec -T mpi,pdt ./ring_c
 tau2csv.sh . > ring_np2.csv

@naughtont3

naughtont3 commented Jan 16, 2019

Note: the OMPI memory trace bits are now enabled at configure time with --enable-mem-profile; otherwise the macros are no-ops.

Note: the code also uses the more generic OPAL_MEMPROF_xxx macro names instead of inserting TAU-specific lines in the core OMPI code. The idea is that this is more generic and could be tailored for other tools in the future (but that is not a priority for current efforts).

@naughtont3

TODO:

  • Remove the track (non-hierarchical) calls from files (not used/needed)
  • Replace __attribute__((weak)) GCC-ism with generic macros used elsewhere in OMPI to specify weak linkage

@naughtont3

naughtont3 commented Feb 26, 2019

TAU patch to avoid using MPI_Handler_function (tau-2.28 but will apply to tau-2.27.1 too).

(See also: https://www.open-mpi.org/faq/?category=mpi-removed#mpi-1-mpi-handler-function)

@naughtont3 naughtont3 mentioned this pull request Feb 26, 2019
@naughtont3

Encountering errors during init. The TAU message indicates we have overlapping allocations, and it aborts. Example error message...

  ERROR: Overlapping allocations. Found opal_cleanup_fn_item_t but pmix4x_opcaddy_t expected.

The reproducer varies: in some cases -np 1 is enough, while on other machines the error only showed up at, say, -np 4.

     cd ompi/examples/
     make
     mpirun -np 1 tau_exec -T mpi,pdt ./hello_c
       # NOTE: Sometimes this works; increasing to '-np 4' increased the occurrence

@naughtont3

Example usage (with thread safety):

# Normal run
mpirun -np 4  ./a.out

# Instrumented run
mpirun -np 4 tau_exec -T mpi,pdt,pthread ./a.out

There was a restructuring of the comm/win/group objects
that shifted the fields down for getting at the class name.
This appears to relate to opal_infosubscriber_t changes from 50aa143.

NOTE: The opal_infosubscriber_t changes were not applied to the
oshmem_group_t structure, so no need for changes there.
Guard these using existing OPAL_ENABLE_MEM_PROFILE option
that is enabled via '--enable-mem-profile' configury.
@naughtont3 naughtont3 force-pushed the tjn-tau-meminstr-master branch from bdffe96 to 646c904 Compare May 30, 2019 17:01
@naughtont3

I refreshed to the latest upstream master (8961daa).
